Profiling & Optimization

Week 13 · CS 203: Software Tools and Techniques for AI

Prof. Nipun Batra
IIT Gandhinagar

The Performance Problem

Training is expensive:

  • GPT-3 cost ~$4.6M to train
  • LLaMA-65B: ~$2-3M in compute
  • Even small models can burn through credits

Inference at scale is costly:

  • ChatGPT serves millions of requests/day
  • 100ms latency improvement = $1M+ savings/year

Developer time is expensive:

  • Slow iteration cycles reduce productivity
  • 10 min/epoch → 100 epochs = 16+ hours waiting

Goal: Make code faster and more efficient without sacrificing accuracy.

The Optimization Mindset

Donald Knuth's wisdom:

"Premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

The correct process:

  1. Make it work (correctness first)
  2. Make it right (clean code, tests)
  3. Profile to find bottlenecks (measure, don't guess!)
  4. Make it fast (optimize the 3% that matters)

Common mistake: Optimizing code that runs once during initialization while ignoring the training loop that runs millions of times.

The Doctor's Approach

Profiling is like a doctor's diagnosis. Don't prescribe medicine based on a hunch - run tests first.

Programmer: "My code is slow!"
Bad: "Let me rewrite in C++" (guessing)
Good: "Let me profile first" (measuring)
      → Finds: data loading is 70% of time
      → Fix: Add num_workers=4
      → Result: 2x faster, zero code changes!

The bottleneck is almost never where you expect it to be.

Performance Metrics Overview

Training metrics:

  • Throughput: Samples/second, batches/second
  • Epoch time: Total time to process entire dataset
  • GPU utilization: % of time GPU is actively computing
  • Memory usage: Peak memory allocated

Inference metrics:

  • Latency: Time per prediction (p50, p95, p99)
  • Throughput: Predictions/second
  • First token latency: Time to first output (for GenAI)

Cost metrics:

  • FLOPs: Floating point operations (theoretical)
  • Energy: kWh consumed
  • Carbon footprint: CO2 emissions

The Optimization Loop

Profile → Identify bottleneck → Apply targeted fix → Measure again → Repeat

Key principle: Always measure before and after optimizations!

Types of Bottlenecks

CPU-bound:

  • Data loading and preprocessing
  • Tokenization, data augmentation
  • Host-to-device memory transfer

GPU compute-bound:

  • Too many parameters
  • Inefficient operations (small kernels, poor fusion)
  • Suboptimal algorithms (e.g., naive attention)

GPU memory-bound:

  • Out of memory (OOM) errors
  • Batch size limited by VRAM
  • Memory bandwidth saturation

I/O-bound:

  • Reading data from disk
  • Checkpointing large models
  • Distributed training communication

Profiling Tool Hierarchy

Level 1: Quick checks (seconds)

  • nvidia-smi: GPU utilization snapshot
  • time command: Total execution time
  • Manual timers: time.time(), time.perf_counter()
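
A minimal manual-timing sketch (model and batch are placeholders; torch.cuda.synchronize() matters because CUDA kernel launches are asynchronous):

import time
import torch

torch.cuda.synchronize()   # Wait for any pending GPU work
start = time.perf_counter()
output = model(batch)
torch.cuda.synchronize()   # Make sure the forward pass actually finished
print(f"Forward: {(time.perf_counter() - start) * 1000:.1f} ms")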

Level 2: Python profiling (minutes)

  • cProfile: Function-level CPU profiling
  • line_profiler: Line-by-line profiling
  • memory_profiler: Memory usage per line

Level 3: Deep profiling (hours)

  • PyTorch Profiler: Op-level GPU/CPU profiling
  • Nsight Systems: System-wide CUDA profiling
  • TensorBoard: Visual timeline analysis

Level 4: Expert tools (days)

  • Nsight Compute: Kernel-level optimization
  • Intel VTune: CPU microarchitecture analysis

Quick Check: nvidia-smi

Basic monitoring:

nvidia-smi

Watch mode (update every 1 second):

nvidia-smi -l 1

Key metrics:

  • GPU-Util: % of time GPU was busy (aim for >85%)
  • Memory-Usage: Current / Total VRAM
  • Power: Current draw vs TDP
  • Temperature: Thermal throttling at ~85°C

Red flags:

  • GPU-Util < 50%: Likely CPU bottleneck
  • Memory near max: Risk of OOM
  • Multiple processes: Resource contention
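
For logging over time, nvidia-smi can also emit selected fields as CSV (one useful variant; see nvidia-smi --help-query-gpu for field names):

nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1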

Python Profiling: cProfile

Built-in function-level profiler:

import cProfile
import pstats

# Profile a function
profiler = cProfile.Profile()
profiler.enable()

train_model()  # Your code here

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

Output columns:

  • ncalls: Number of calls
  • tottime: Total time in function (excluding sub-calls)
  • cumtime: Cumulative time (including sub-calls)
  • percall: Time per call
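
You can also profile an entire script from the command line without modifying it (sorted by cumulative time, as above):

python -m cProfile -s cumulative train.py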

cProfile Example Output

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002   45.231   45.231 train.py:23(train_epoch)
     1563   12.450    0.008   30.125    0.019 dataloader.py:45(__next__)
     1563    8.234    0.005   15.678    0.010 transforms.py:12(augment)
   156300    4.123    0.000    4.123    0.000 {method 'random' of '_random.Random'}
     1563    3.456    0.002   10.234    0.007 model.py:67(forward)

Analysis:

  • Data loading (__next__) takes 30s out of 45s → CPU bottleneck!
  • Random augmentation is expensive → consider caching or GPU augmentation
  • Model forward pass is fast (10s) → GPU is underutilized

Line-Level Profiling: line_profiler

More granular than cProfile:

from line_profiler import LineProfiler

lp = LineProfiler()
lp.add_function(preprocess_data)
lp.add_function(model.forward)

lp.runcall(train_one_epoch)  # Run one epoch under the profiler
lp.print_stats()

Output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15         1      12500.0  12500.0     45.2  img = cv2.imread(path)
    16         1       8500.0   8500.0     30.7  img = cv2.resize(img, (224, 224))
    17         1       6700.0   6700.0     24.1  img = normalize(img)

Insight: cv2.imread is the slowest → use faster libraries or cache.
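
line_profiler also ships a CLI driver, kernprof: decorate functions of interest with @profile and run (typical invocation; train.py is a placeholder):

kernprof -l -v train.py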

Memory Profiling: memory_profiler

Track memory usage line by line:

from memory_profiler import profile

@profile
def train_step(batch):
    images, labels = batch  # Line 1
    images = images.cuda()  # Line 2
    outputs = model(images)  # Line 3
    loss = criterion(outputs, labels)  # Line 4
    loss.backward()  # Line 5
    optimizer.step()  # Line 6

Output:

Line #    Mem usage    Increment   Line Contents
================================================
     1     2145 MB       0 MB       images, labels = batch
     2     4290 MB    2145 MB       images = images.cuda()
     3     8580 MB    4290 MB       outputs = model(images)
     4     8585 MB       5 MB       loss = criterion(outputs, labels)
     5    12875 MB    4290 MB       loss.backward()
     6    12875 MB       0 MB       optimizer.step()

Insight: Gradients double memory (line 5) → use gradient checkpointing.
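
memory_profiler also bundles mprof for whole-run memory-over-time plots (typical invocation; plotting requires matplotlib):

mprof run train.py
mprof plot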

PyTorch Built-in Profiling

Torch profiler with CPU/GPU tracing:

from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("train_epoch"):
        for i, (inputs, targets) in enumerate(dataloader):
            if i >= 10:  # Profile first 10 batches only
                break

            with record_function("forward"):
                outputs = model(inputs)
                loss = criterion(outputs, targets)

            with record_function("backward"):
                loss.backward()

            prof.step()  # Signal step boundary
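
For longer runs, a profiling schedule keeps overhead low by tracing only a few steps at a time; a sketch using torch.profiler.schedule (train_step is a placeholder for your per-batch code):

from torch.profiler import profile, schedule, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=2),
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()  # Advance the profiler schedule each iteration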

PyTorch Profiler Output

Table view:

print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=10
))

Output:

-------------------------------------------------------  ------------
Name                                     Self CPU time  Self CUDA time
-------------------------------------------------------  ------------
aten::conv2d                                    1.2ms       125.4ms
aten::batch_norm                                0.8ms        45.2ms
aten::linear                                    0.5ms        78.3ms
Memcpy HtoD (Host to Device)                   23.4ms         0.0ms
-------------------------------------------------------  ------------

Insights:

  • Convolutions dominate GPU time (expected)
  • HtoD memcpy is 23ms → data transfer bottleneck! Use pin_memory

TensorBoard Profiler Visualization

Export for TensorBoard:

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18')
) as prof:
    for batch in dataloader:
        train_step(batch)  # Your per-batch training code
        prof.step()        # Call once per step so the trace has step boundaries

View in TensorBoard:

tensorboard --logdir=./log

Visualizations:

  • Timeline: See GPU kernels, data loading, CPU ops on timeline
  • Operator view: Breakdown by operation type
  • Kernel view: GPU kernel efficiency
  • Trace view: Detailed event trace

Interpreting GPU Timeline

Ideal timeline:

GPU: ████████████████████████████████████████████  (100% busy)
CPU: ██  ██  ██  ██  ██  ██  ██  ██  ██  ██  ██  (loading data)

CPU bottleneck:

GPU: ██    ██    ██    ██    ██    ██    ██      (idle gaps)
CPU: ████████████████████████████████████████████  (100% busy)
     ▲     ▲     ▲
     Gaps while waiting for data!

Memory transfer bottleneck:

GPU: ██      ██      ██      ██      ██      ██
MEM: ░░██████░░██████░░██████░░██████░░██████░░  (memcpy)
     ▲ Large memory transfers stalling GPU

Data Loading Optimization

Problem: GPU idle while CPU loads data.

Solutions:

1. Multi-process data loading:

DataLoader(dataset,
    batch_size=32,
    num_workers=4,        # Spawn 4 worker processes
    pin_memory=True,      # Faster GPU transfer
    persistent_workers=True  # Reuse workers across epochs
)

2. Prefetching (automatic with num_workers > 0):

Worker 1: Load batch 1 → Load batch 3 → Load batch 5
Worker 2: Load batch 2 → Load batch 4 → Load batch 6
GPU:      Process batch 1 → Process batch 2 → Process batch 3
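
A quick way to pick num_workers is to time iteration over the DataLoader alone, with no model in the loop (a sketch; assumes dataset is already defined):

import time
from torch.utils.data import DataLoader

for workers in (0, 2, 4, 8):
    loader = DataLoader(dataset, batch_size=32,
                        num_workers=workers, pin_memory=True)
    start = time.perf_counter()
    for batch in loader:
        pass  # Iterate only: isolates the data-loading cost
    print(f"num_workers={workers}: {time.perf_counter() - start:.1f}s")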

Data Loading Best Practices

Rule of thumb for num_workers:

  • Start with num_workers = min(4, num_cpus)
  • Profile and tune (diminishing returns after ~8)
  • Too many workers → memory overhead

Optimization checklist:

DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,              # Multi-process loading
    pin_memory=True,            # Faster H2D transfer (if using GPU)
    persistent_workers=True,    # Don't recreate workers each epoch
    prefetch_factor=2,          # Batches to prefetch per worker
    drop_last=True              # Avoid small last batch
)

Advanced: GPU preprocessing:

# Use NVIDIA DALI or Kornia for GPU-accelerated augmentation
import kornia.augmentation as K
augment = K.AugmentationSequential(
    K.RandomHorizontalFlip(p=0.5),
    K.RandomRotation(degrees=15),
).cuda()

Mixed Precision Training Theory

Float32 (FP32):

  • 1 sign bit, 8 exponent bits, 23 fraction bits
  • Range: ~10^-38 to 10^38
  • Standard for training

Float16 (FP16):

  • 1 sign bit, 5 exponent bits, 10 fraction bits
  • Range: ~6×10^-8 (smallest subnormal) to 65504
  • 2x memory savings, 2-3x speedup on Tensor Cores

Problem with pure FP16:

  • Small gradients underflow to zero
  • Large activations overflow to infinity
  • Training diverges or converges poorly

The Precision Goldilocks Zone

Use "just enough" precision for each operation. Match the tool to the task's needs.

Operation        Precision   Why?
Master weights   FP32        Accumulate tiny updates
Forward pass     FP16        Just math, speed matters
Loss scaling     FP32        Small values matter
Softmax          FP32        Numerical stability

Automatic Mixed Precision (AMP)

Solution: Mixed precision training

Strategy:

  1. Master weights in FP32 (stored in optimizer)
  2. Forward pass in FP16 (faster)
  3. Loss in FP32 (precision for small values)
  4. Backward pass in FP16 (faster)
  5. Gradient scaling to prevent underflow
  6. Weight update in FP32 (master weights)

Gradient scaling:

  • Multiply loss by scale factor (e.g., 1024) before backward
  • Prevents small gradients from becoming zero in FP16
  • Unscale gradients before optimizer step
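
A tiny runnable demonstration of why scaling helps: a gradient of 1e-8 underflows to zero in FP16, but survives after multiplying by 2^16:

import torch

g = torch.tensor(1e-8)
print(g.half())                # tensor(0., dtype=torch.float16): underflow!
scaled = (g * 2**16).half()
print(scaled)                  # ~6.554e-04, representable in FP16
print(scaled.float() / 2**16)  # Unscale in FP32: recovers ~1e-08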

AMP Implementation in PyTorch

from torch.cuda.amp import autocast, GradScaler

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()  # Gradient scaler

for epoch in range(num_epochs):
    for batch in dataloader:
        images, labels = batch
        images, labels = images.cuda(), labels.cuda()

        optimizer.zero_grad()

        # Forward in FP16
        with autocast():
            outputs = model(images)
            loss = criterion(outputs, labels)

        # Backward with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()  # Adjust scale for next iteration

Expected speedup: 1.5-3x on V100/A100/H100 GPUs with Tensor Cores.

AMP Best Practices

When to use AMP:

  • ✅ Training CNNs, Transformers on modern GPUs (V100+)
  • ✅ Large batch sizes (better Tensor Core utilization)
  • ✅ Models with lots of matrix multiplications

When NOT to use AMP:

  • ❌ Small models on old GPUs (no Tensor Cores)
  • ❌ Models with numerical instability
  • ❌ When accuracy drops significantly (rare)

Debugging AMP issues:

# Gradients become NaN? Check gradient scaling
print(scaler.get_scale())  # Should be ~1024-65536

# Accuracy drop or NaNs? Inspect and tune the initial scale
scaler = GradScaler(init_scale=2.**16)  # 2.**16 is the default; lower it if gradients overflow

Memory Optimization: Gradient Checkpointing

Problem: Storing all activations for backprop uses O(N) memory.

Example (4-layer network):

Forward:  Input → Act1 → Act2 → Act3 → Act4 → Loss
Backward: ∇Loss ← ∇Act4 ← ∇Act3 ← ∇Act2 ← ∇Act1
          ▲       ▲       ▲       ▲       ▲
          Need to store all activations!

Memory usage: Batch_size × Num_layers × Hidden_dim

Solution: Gradient Checkpointing (Recomputation)

  • Store only subset of activations (checkpoints)
  • Recompute others during backward pass
  • Trade: 20-30% slower for 50%+ memory savings

Gradient Checkpointing in PyTorch

import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1024, 1024)
        self.layer2 = nn.Linear(1024, 1024)
        self.layer3 = nn.Linear(1024, 1024)

    def forward(self, x):
        # Checkpoint layer1 and layer2: their activations are
        # recomputed during backward instead of being stored
        x = checkpoint.checkpoint(self._forward_layers, x, use_reentrant=False)
        x = self.layer3(x)
        return x

    def _forward_layers(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return x

Use case: Train larger models/batches that otherwise OOM.
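
For plain nn.Sequential stacks, torch.utils.checkpoint.checkpoint_sequential splits the stack into evenly checkpointed segments for you (a sketch; use_reentrant=False on recent PyTorch):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

layers = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])
x = torch.randn(32, 1024, requires_grad=True)
out = checkpoint_sequential(layers, 2, x, use_reentrant=False)  # 2 segments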

Gradient Accumulation

Problem: Limited GPU memory → small batch size → poor convergence.

Solution: Accumulate gradients over multiple steps.

accumulation_steps = 4  # Effective batch size = 32 * 4 = 128

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Normalize loss by accumulation steps
    loss = loss / accumulation_steps
    loss.backward()

    # Only step optimizer every N batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Effect: Simulates large batch training with limited memory.
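
Gradient accumulation combines naturally with AMP; the one gotcha is to call scaler.step/update only on accumulation boundaries (a sketch, assuming model, criterion, optimizer, and dataloader exist):

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
accumulation_steps = 4

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(dataloader):
    with autocast():
        loss = criterion(model(inputs), labels) / accumulation_steps
    scaler.scale(loss).backward()   # Gradients accumulate across iterations
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)      # Unscale + step only every N batches
        scaler.update()
        optimizer.zero_grad()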

Compute Optimization: torch.compile

PyTorch 2.0+ feature: JIT compilation for speedups.

import torch

model = MyModel()
model = torch.compile(model)  # Compile the model

# Training loop unchanged
for batch in dataloader:
    output = model(batch)  # First run: compile, subsequent: fast!

What it does:

  • Graph capture: Traces model operations
  • Operator fusion: Merges ops (e.g., Conv+BN+ReLU → 1 kernel)
  • Memory optimization: Reuses buffers
  • CUDA graph: Reduces kernel launch overhead

Expected speedup: 10-50% for free!

torch.compile Modes

# Default mode (balanced)
model = torch.compile(model)

# Maximum performance (slower compile time)
model = torch.compile(model, mode="max-autotune")

# Reduce kernel launch overhead via CUDA graphs (can use extra memory)
model = torch.compile(model, mode="reduce-overhead")

# Handle varying input shapes without triggering recompilation
model = torch.compile(model, dynamic=True)

Caveats:

  • First run is slow (compilation overhead)
  • Not all operations supported (fallback to eager)
  • Dynamic shapes can trigger recompilation

Operator Fusion Example

Without fusion (3 kernel launches):

x = conv(input)    # Kernel 1: Convolution
x = bn(x)          # Kernel 2: Batch norm
x = relu(x)        # Kernel 3: ReLU

With fusion (1 kernel launch):

x = conv_bn_relu(input)  # Single fused kernel

Benefits:

  • Fewer kernel launches (less overhead)
  • Reduced memory bandwidth (no intermediate writes)
  • Better cache locality

torch.compile does this automatically!

Flash Attention

Problem: Standard attention has O(N²) memory complexity.

Standard attention:

# Materialize the full N×N attention matrix
scores = Q @ K.T / sqrt(d)  # (N, N) matrix
attn = softmax(scores)      # (N, N) matrix
output = attn @ V           # (N, d)

Flash Attention (Dao et al., 2022):

  • Tiled computation (never materialize full matrix)
  • Fused kernel (attention + softmax in one pass)
  • Result: 2-4x speedup, O(N) memory instead of O(N²)

Usage:

from torch.nn.functional import scaled_dot_product_attention

# PyTorch 2.0+ uses Flash Attention automatically!
output = scaled_dot_product_attention(Q, K, V)
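
The expected layout is (batch, heads, seq_len, head_dim); a minimal sketch (FP16 tensors on GPU are what make the fused Flash kernel eligible):

import torch
from torch.nn.functional import scaled_dot_product_attention

Q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
K, V = torch.randn_like(Q), torch.randn_like(Q)
out = scaled_dot_product_attention(Q, K, V)  # Uses Flash Attention when eligible
print(out.shape)  # torch.Size([8, 16, 1024, 64])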

System-Level Optimization

CPU affinity (bind processes to cores):

taskset -c 0-7 python train.py  # Use cores 0-7

NUMA awareness (multi-socket systems):

numactl --cpunodebind=0 --membind=0 python train.py

PCIe optimization (multi-GPU):

import os

# Use GPUs on same PCIe switch
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # Same switch

# Avoid: GPUs across different switches (slower P2P)
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,4"

Storage I/O:

  • Use SSD over HDD for datasets
  • Use RAM disk for small datasets (tmpfs)

Benchmarking Best Practices

1. Warmup runs (JIT compilation, cache warming):

import time
import torch

for _ in range(10):
    model(dummy_input)  # Warmup (JIT compilation, cache warming)

# Now measure
torch.cuda.synchronize()  # Wait for warmup kernels to finish
start = time.time()
for _ in range(100):
    model(input_data)
torch.cuda.synchronize()  # Ensure all kernels finished before stopping the timer
elapsed = time.time() - start

2. Multiple runs (reduce variance):

import numpy as np

times = []
for _ in range(100):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model(input_data)
    torch.cuda.synchronize()  # Include the full GPU execution in the measurement
    times.append(time.perf_counter() - start)

print(f"Mean: {np.mean(times)*1000:.2f} ms")
print(f"Std: {np.std(times)*1000:.2f} ms")
print(f"P95: {np.percentile(times, 95)*1000:.2f} ms")

Benchmarking Checklist

Environment control:

  • [ ] Disable CPU frequency scaling (performance mode)
  • [ ] Close background applications
  • [ ] Fix random seeds (torch.manual_seed(42))
  • [ ] Use same device (GPU vs CPU)

Measurement:

  • [ ] Warmup before timing (10+ iterations)
  • [ ] Measure multiple runs (100+)
  • [ ] Report mean, std, percentiles (p50, p95, p99)
  • [ ] Synchronize CUDA ops (torch.cuda.synchronize())

Comparison:

  • [ ] Same hardware, same PyTorch version
  • [ ] Same batch size and precision
  • [ ] Measure end-to-end (not just model forward)

Common Performance Anti-Patterns

1. Implicit CPU-GPU synchronization:

# BAD: Forces sync every iteration
for i, batch in enumerate(dataloader):
    loss = train_step(batch)
    print(f"Loss: {loss.item()}")  # .item() syncs!

# GOOD: Batch logging
losses = []
for i, batch in enumerate(dataloader):
    loss = train_step(batch)
    losses.append(loss.detach())  # No sync
    if i % 100 == 0:
        print(f"Avg loss: {torch.stack(losses).mean()}")

2. Small batch sizes (underutilize GPU):

  • Batch size 1-8: Poor GPU utilization
  • Batch size 32-128: Better (saturate GPU)

Common Performance Anti-Patterns (2)

3. Unnecessary data transfers:

# BAD: Blocking transfer from pageable memory
for batch in dataloader:
    batch = batch.cuda()  # Synchronous copy stalls the pipeline

# GOOD: Use pin_memory + non_blocking
dataloader = DataLoader(..., pin_memory=True)
for batch in dataloader:
    batch = batch.cuda(non_blocking=True)  # Faster!

4. Inefficient tensor operations:

# BAD: Python loop
result = []
for i in range(len(tensor)):
    result.append(tensor[i] * 2)

# GOOD: Vectorized operation
result = tensor * 2  # Much faster!

Case Study: Training Speedup

Baseline ResNet-50 on ImageNet:

  • Batch size: 32
  • Time per epoch: 120 minutes
  • GPU utilization: 45%

Optimization steps:

Optimization              Cumulative Speedup   Epoch Time
Baseline                  1.0x                 120 min
+ num_workers=8           1.4x                 86 min
+ Mixed precision (AMP)   1.9x                 63 min
+ Larger batch (32→128)   2.3x                 52 min
+ torch.compile           2.8x                 43 min

Final result: 2.8x speedup, 64% less time per epoch!

Case Study: Memory Optimization

Problem: Training LLaMA-7B on single A100 (40GB VRAM) OOMs.

Optimization steps:

Technique                  Memory Usage   Batch Size
Baseline FP32              52 GB          OOM
FP16                       26 GB          1
+ Gradient checkpointing   18 GB          2
+ Gradient accumulation    18 GB          8 (effective)
+ Flash Attention          14 GB          4 (16 effective)

Result: Fits on a single GPU with an effective batch size of 16!

Profiling Workflow Summary

Step 1: Establish baseline

  • Measure throughput, latency, memory
  • Profile with PyTorch Profiler
  • Identify bottleneck category (CPU/GPU compute/GPU memory/I/O)

Step 2: Apply targeted optimization

  • CPU bottleneck → num_workers, prefetching
  • GPU compute → AMP, torch.compile, algorithmic improvements
  • GPU memory → gradient checkpointing, smaller batch, model parallelism
  • I/O → faster storage, caching, data format (HDF5, LMDB)

Step 3: Measure impact

  • Re-run profiling
  • Compare metrics
  • Iterate if needed

Optimization Priority

Quick wins (do first):

  1. ✅ Enable AMP (5 min, 1.5-2x speedup)
  2. ✅ Tune num_workers (10 min, 1.2-1.5x speedup)
  3. ✅ Use torch.compile (1 line, 1.1-1.5x speedup)
  4. ✅ Enable pin_memory=True (1 parameter, 1.1x speedup)

Medium effort (if needed):

  5. ⚙️ Gradient accumulation (if memory-limited)
  6. ⚙️ Larger batch size (if hardware allows)
  7. ⚙️ Gradient checkpointing (if OOM)

Advanced (for experts):

  8. 🔬 Custom CUDA kernels
  9. 🔬 Model architecture search
  10. 🔬 Distributed training (multi-GPU/multi-node)

Tools Ecosystem Summary

Profiling:

  • nvidia-smi: GPU monitoring
  • cProfile: Python function profiling
  • line_profiler: Line-level profiling
  • memory_profiler: Memory usage
  • PyTorch Profiler: Deep PyTorch profiling
  • TensorBoard: Visual profiling
  • Nsight Systems/Compute: Expert CUDA profiling

Optimization:

  • torch.cuda.amp: Mixed precision
  • torch.compile: Graph optimization
  • torch.utils.checkpoint: Gradient checkpointing
  • torch.nn.utils.prune: Model pruning
  • Flash Attention: Efficient attention

Lab Preview

Today's mission:

  1. Part 1: Profile ResNet-18 training and identify bottlenecks
  2. Part 2: Optimize data loading (num_workers, pin_memory)
  3. Part 3: Apply mixed precision training (AMP)
  4. Part 4: Use gradient checkpointing to fit larger batch
  5. Part 5: Apply torch.compile and measure speedup
  6. Part 6: Create comprehensive performance comparison

Deliverable: Optimization report showing 2-3x speedup!

Key Takeaways

  1. Always profile before optimizing - measure, don't guess
  2. Focus on the critical path - optimize what matters (training loop)
  3. Quick wins first - AMP, num_workers, torch.compile are easy
  4. Memory vs speed trade-offs - gradient checkpointing, accumulation
  5. Benchmark properly - warmup, multiple runs, synchronization
  6. Iterative process - profile → optimize → measure → repeat

Remember: A 2x speedup means 2x more experiments, faster iteration, and cheaper costs!

Interview Questions

Common interview questions on profiling and optimization:

  1. "How would you speed up a slow training pipeline?"

    • Profile first (PyTorch Profiler, nvidia-smi)
    • Common fixes: increase num_workers, enable AMP, use torch.compile
    • Check GPU utilization - low % means CPU bottleneck
    • Increase batch size if memory allows
  2. "What is mixed precision training and why use it?"

    • Use FP16 for forward/backward, FP32 for gradient updates
    • Benefits: 1.5-2x speedup, half memory usage
    • PyTorch: torch.cuda.amp.autocast() + GradScaler
    • Safe for most models; watch for loss scaling issues

Additional Resources

Papers:

  • Mixed Precision Training (Micikevicius et al., 2018)
  • Flash Attention (Dao et al., 2022)
  • Gradient Checkpointing (Chen et al., 2016)
