GPU Utilization Metrics
Key Desiderata for a GPU Utilization Statistic
An ideal metric for GPU utilization should reward high average usage, penalize low/idle time, and account for stability (low variance).
- High mean utilization is good.
- Low variance (stable, not spiky) is good.
- Time normalization matters: long idle periods should lower the score.
- Interpretable, ideally on a [0, 1] or [0%, 100%] scale.
Basic Statistics
Let \(u_1, u_2, \ldots, u_n\) be the sequence of GPU utilization percentages, each in [0, 100], sampled at regular intervals.
A. Mean Utilization
\[
\mu_u = \frac{1}{n} \sum_{i=1}^n u_i
\]
- Pros: Simple, intuitive.
- Cons: Hides how the load is distributed. A run that alternates brief spikes with long idle periods can report the same mean as a steady run.
B. Standard Deviation
\[
\sigma_u = \sqrt{ \frac{1}{n} \sum_{i=1}^n (u_i - \mu_u)^2 }
\]
- High σ: Utilization is unstable/spiky.
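As a concrete illustration of sections A and B, the sketch below computes both statistics with Python's standard library. The function name `mean_and_std` and the two sample traces are hypothetical, chosen only to show that two runs can share a mean while differing sharply in spread. `statistics.pstdev` is used because it divides by n, which matches the formula above.

```python
from statistics import fmean, pstdev


def mean_and_std(samples):
    """Return (mean, population std) for utilization samples in [0, 100]."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    # pstdev divides by n (population form), as in the sigma_u formula.
    return fmean(samples), pstdev(samples)


# Two made-up traces with the same mean but very different stability.
steady = [50.0] * 8          # constant 50%
bursty = [100.0, 0.0] * 4    # alternates between full load and idle

print(mean_and_std(steady))  # (50.0, 0.0)
print(mean_and_std(bursty))  # (50.0, 50.0)
```

The mean alone cannot tell these two traces apart; the standard deviation can.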
C. Proposed Robust Utilization Metrics
1. “Effective Utilization” (Mean - λ × Std)
\[
\text{EffU} = \mu_u - \lambda \sigma_u
\]
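- Where λ ≥ 0 is a tradeoff factor (e.g., λ=1); larger values penalize instability more heavily.
- Interpretation: Rewards high mean, penalizes high variance. The result is in percentage points and is not confined to [0, 100]: it goes negative when \(\lambda \sigma_u > \mu_u\).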
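The EffU formula can be sketched in a few lines. The function name `effective_utilization` and the parameter name `lam` are illustrative choices, not part of any existing tool, and the population standard deviation is assumed to match the definition of \(\sigma_u\) above.

```python
from statistics import fmean, pstdev


def effective_utilization(samples, lam=1.0):
    """EffU = mean - lam * std, in percentage points (may be negative)."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    if lam < 0:
        raise ValueError("lam is assumed to be non-negative")
    return fmean(samples) - lam * pstdev(samples)


# A steady trace keeps its mean; a bursty trace with the same mean is penalized.
print(effective_utilization([50.0] * 4))                 # 50.0
print(effective_utilization([100.0, 0.0] * 2))           # 0.0
print(effective_utilization([100.0, 0.0] * 2, lam=0.5))  # 25.0
```

Lowering `lam` softens the penalty, which may suit workloads where some burstiness is expected.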
2. Fraction of Time Above Threshold
\[
\text{Frac}_{\theta} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{ u_i > \theta \}
\]
- e.g., θ=80%.
- What fraction of the run is “highly utilized”?
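A minimal sketch of the threshold fraction follows. The function name `fraction_above` is hypothetical. Note that the comparison is strict, as in the indicator \(\mathbf{1}\{u_i > \theta\}\), so a sample exactly at the threshold does not count.

```python
def fraction_above(samples, theta=80.0):
    """Fraction of samples strictly above theta, in [0, 1]."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    return sum(1 for u in samples if u > theta) / len(samples)


# Made-up trace: two samples exceed 80, one sits exactly on it, one is idle.
trace = [95.0, 85.0, 80.0, 10.0]
print(fraction_above(trace))        # 0.5
print(fraction_above(trace, 5.0))   # 1.0
```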
3. Area Under the Utilization Curve (AUC)
\[
\text{AUC}_u = \frac{1}{100 n} \sum_{i=1}^n u_i
\]
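- Same as the mean, rescaled to [0, 1].
- This sum matches the true area under the curve only when the sampling interval is uniform; otherwise weight each sample by its duration (see the time-weighted adjustment in section E).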
- AUC only reflects “average work done,” not how the work was distributed in time. For hardware optimization and system diagnosis, you also want to know if the workload is steady or bursty, and how often the GPU is left waiting.
D. Composite “GPU Efficiency Score”
Let’s define a simple composite metric:
\[
\text{GPU Efficiency} = \frac{\mu_u - \sigma_u}{100}
\]
- Range: At most 1 (perfect: mean=100, std=0). Negative whenever \(\sigma_u > \mu_u\), which for samples in [0, 100] can only happen when the mean is below 50.
- Interpretation:
  - 1.0: Always 100%, perfectly steady.
  - 0.8: Average 90%, std 10.
  - Negative: Low mean with large spikes (mostly idle, occasionally busy).
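The interpretation values above can be checked with a short sketch. The function name `gpu_efficiency` and the traces are made up for illustration; the population standard deviation is assumed, as in section B.

```python
from statistics import fmean, pstdev


def gpu_efficiency(samples):
    """(mean - std) / 100; at most 1.0, negative when std exceeds mean."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    return (fmean(samples) - pstdev(samples)) / 100.0


perfect = [100.0] * 10            # always 100%
good = [80.0, 100.0] * 5          # mean 90, std 10
spiky = [100.0, 0.0, 0.0, 0.0]    # mean 25, std about 43.3

print(gpu_efficiency(perfect))    # 1.0
print(gpu_efficiency(good))       # 0.8
print(gpu_efficiency(spiky) < 0)  # True
```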
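Alternatively, blend the normalized mean with the fraction of time above the threshold:
\[
\text{GPU Utilization Score} = \alpha \cdot \frac{\mu_u}{100} + (1 - \alpha) \cdot \text{Frac}_{\theta}
\]
Where α ∈ [0, 1] is a weight (e.g., 0.5) and θ is a high-utilization threshold (e.g., 80%). Because \(\text{Frac}_{\theta}\) is already in [0, 1], only the mean is divided by 100; this keeps the score in [0, 1].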
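A minimal sketch of the blended score, under one stated assumption: the mean is rescaled to [0, 1] before it is combined with \(\text{Frac}_{\theta}\), so that both terms share a scale and the result stays in [0, 1]. The function name `utilization_score` is hypothetical.

```python
from statistics import fmean


def utilization_score(samples, alpha=0.5, theta=80.0):
    """alpha * (mean / 100) + (1 - alpha) * Frac_theta, in [0, 1]."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    mean_norm = fmean(samples) / 100.0  # assumption: mean rescaled to [0, 1]
    frac = sum(1 for u in samples if u > theta) / len(samples)
    return alpha * mean_norm + (1.0 - alpha) * frac


# Made-up trace: mean 80, half the samples above the 80% threshold.
trace = [90.0, 90.0, 70.0, 70.0]
score = utilization_score(trace)
print(round(score, 2))  # 0.65
```

Setting `alpha=1.0` recovers the normalized mean, and `alpha=0.0` recovers the threshold fraction alone.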
E. Time-Weighted Adjustment (if needed)
If the sampling intervals are not uniform, weight each utilization sample by the duration of its interval \(\Delta t_i\) and divide by the total time:
\[
\text{TimeWeightedMean} = \frac{ \sum_{i=1}^n u_i \Delta t_i }{ \sum_{i=1}^n \Delta t_i }
\]
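The time-weighted mean can be sketched as follows. The function name `time_weighted_mean` is hypothetical, and the durations are assumed to be given in any consistent unit (seconds here), one per sample.

```python
def time_weighted_mean(utils, durations):
    """Sum(u_i * dt_i) / Sum(dt_i) for samples with non-uniform intervals."""
    if len(utils) != len(durations):
        raise ValueError("utils and durations must have the same length")
    if any(dt < 0 for dt in durations):
        raise ValueError("durations must be non-negative")
    total = sum(durations)
    if total <= 0:
        raise ValueError("total duration must be positive")
    return sum(u * dt for u, dt in zip(utils, durations)) / total


# Made-up trace: 1 s at 100% followed by 3 s idle.
print(time_weighted_mean([100.0, 0.0], [1.0, 3.0]))  # 25.0
```

An unweighted mean of the same two samples would report 50%, overstating how busy the GPU was. With equal durations the result reduces to the plain mean.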
Critique/Limitations
- A high mean with high variance may indicate bursty batches or pipeline stalls; the std-penalized metrics above score it lower.
- A high mean with low std is the ideal case (score near 1).
- A low mean with low std means the GPU is consistently idle (score near 0, slightly negative if the std exceeds the mean).
- Composite metrics can be tuned (λ or α) to emphasize stability or average, depending on workload.