Skip to content
John Hodge

← Projects

Hardware diagnostics and repair recommendations

Proprietary industry work. Described qualitatively, with no internal metrics or code.

Problem

When a server in a large compute fleet fails, the recovery question is narrow and expensive: which component is at fault, and what is the right repair action? The signals that answer it are scattered across system event logs, memory diagnostics, crash reports, lifecycle events, host snapshots, and prior repair history. Each source is noisy, and the repair outcomes used as labels are themselves imperfect.

Approach

I work on the production science behind AWS EC2 hardware diagnostics and repair recommendations. The core is a set of models that fuse those heterogeneous sources into component-level recommendations, paired with the decision logic that turns a prediction into an action. The parts I care most about:

Result

The system runs in production and drives real server repairs across the fleet, including advanced-accelerator hardware. It improved repair success over the prior baseline and shifted decisions away from unnecessary part replacement. The same funnel analysis shaped the team’s roadmap for where coverage and repair effectiveness should improve next.

This is proprietary industry work, so the description stays qualitative, with no internal metrics and no code to link.