Hardware diagnostics and repair recommendations

Proprietary industry work. Described qualitatively, with no internal metrics or code.

Python
PyTorch
SageMaker
Time-series ML
Calibration

Problem

When a server in a large compute fleet fails, the recovery question is narrow and expensive: which component is at fault, and what is the right repair action? The signals that answer it are scattered across system event logs, memory diagnostics, crash reports, lifecycle events, host snapshots, and prior repair history. Each source is noisy, and the repair outcomes used as labels are themselves imperfect.

Approach

I work on the production science behind AWS EC2 hardware diagnostics and repair recommendations. The core is a set of models that fuse those heterogeneous sources into component-level recommendations, paired with the decision logic that turns a prediction into an action. The parts I care most about:

Telemetry fusion that aligns logs, diagnostics, and lifecycle events into a single view of a failed host.
Leakage-aware temporal evaluation, so a model is measured the way it will be used: predicting forward in time rather than peeking at outcomes it would not have at decision time.
Calibration and confidence-tiered routing, separating high-confidence automated actions from cases that need more diagnosis before anyone touches hardware.

Result

The system runs in production and drives real server repairs across the fleet, including advanced-accelerator hardware. It improved repair success over the prior baseline and shifted decisions away from unnecessary part replacement. The same funnel analysis shaped the team’s roadmap for where coverage and repair effectiveness should improve next.

This is proprietary industry work, so the description stays qualitative, with no internal metrics and no code to link.