Step 05
Minimal CSV extraction
Create one tidy CSV row per game, covering only your own participant, without flattening the universe.
Run
Run after each small change. Tiny loops win.
uv run python -m src.scout
You will touch
- src/scout/ (CSV extraction)
- data/derived/
Time
60–120 minutes
Do this (suggested order)
- Load all data/raw/matches/*.json files.
- For each match, find your participant by matching puuid.
- Build one row dict per match with a small set of columns (add columns one at a time).
- Write data/derived/matches.csv with consistent headers.
- Run sanity checks: row count, win values, time order when sorted, no all-null columns.
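The first two steps above (load every file, then pick out your participant) can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function names `load_matches` and `find_me` are hypothetical, and it assumes each file holds one match with a `metadata.matchId` string and an `info.participants` list whose entries carry a `puuid`.

```python
import json
from pathlib import Path


def load_matches(raw_dir):
    """Yield (match_id, match) for every *.json file in raw_dir, in sorted order."""
    for path in sorted(Path(raw_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            match = json.load(f)
        # Assumption: fall back to the file name if metadata.matchId is absent.
        match_id = match.get("metadata", {}).get("matchId", path.stem)
        yield match_id, match


def find_me(match, my_puuid):
    """Return the participant dict whose puuid matches, or None if not found."""
    # Assumption: participants live under info.participants.
    for participant in match.get("info", {}).get("participants", []):
        if participant.get("puuid") == my_puuid:
            return participant
    return None
```

Returning None instead of raising lets the caller decide whether a match without your puuid is skipped or reported, which matters for the row-count check later.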
You’ll practice
- Extract stable fields from nested JSON
- Design a derived dataset
- Write CSV without losing types
Explainers (for context, not homework)
- Caching sanity — Derived data is meant to be rebuildable
Build
Read
- Load all data/raw/matches/*.json
Select your row
- One row per match for your participant
Write
- data/derived/matches.csv with required columns
Check yourself
- CSV exists with same row count as match files processed
- Values look sane (no all-null columns)
If it breaks
- Selecting the wrong participant
- CSV headers off-by-one
- Mixing strings/ints in same column
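Misaligned headers usually creep in when the header list and the row values are maintained separately. One way to rule that out is to write dicts through `csv.DictWriter` with a single column list, sketched below. The name `write_rows` and the `COLUMNS` list are illustrative assumptions, not part of the project.

```python
import csv
from pathlib import Path

# Assumption: this list mirrors the starter columns; extend it one column at a time.
COLUMNS = ["match_id", "game_start", "champion", "win", "kills", "deaths", "assists"]


def write_rows(rows, out_path):
    """Write row dicts to CSV with a fixed header; missing keys become blank cells."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        # extrasaction defaults to "raise", so an unexpected key fails loudly
        # instead of silently shifting or dropping data.
        writer = csv.DictWriter(f, fieldnames=COLUMNS, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Because every cell is placed by key rather than by position, adding a column means touching `COLUMNS` and the row dict, and nothing else.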
Hints (spoilers)
Hint: build it like mini-quests (one column at a time)
Build the CSV like a checklist: get match_id + game_start + win first, then
add one column at a time. If a new column breaks things, you know exactly which one did it.
Bigger hint: row skeleton (small, repeatable)
A reasonable starting row
    # me is your participant dict for this match; match is the parsed JSON.
    row = {
        'match_id': match_id,
        'game_start': match['info'].get('gameStartTimestamp'),
        'win': bool(me.get('win')),
        'champion': me.get('championName'),
        'kills': me.get('kills'),
        'deaths': me.get('deaths'),
        'assists': me.get('assists'),
    }
Bigger hint: missing fields (use .get and keep moving)
Not every field exists in every match. Use .get() with defaults and allow “optional columns” to be
blank instead of crashing.
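For fields nested more than one level deep, chained `.get()` calls get noisy. A small helper can walk the keys and fall back to a default instead. This is a sketch under assumptions: `dig` is a hypothetical name, and the keys in the usage note are only examples.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts by key; return default as soon as a key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data
```

For example, `dig(match, 'info', 'gameStartTimestamp')` returns None rather than raising when `info` is absent. Python's `csv` writers emit None as an empty cell, so a missing optional field simply shows up blank in the CSV.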
Unblock-me: mixed types (make win consistently 0/1 or True/False)
If some rows store win as true and others as "true", pandas will hate
you later. Pick one representation and stick to it.
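One way to enforce a single representation is to pass every win value through a normalizer before it goes into the row. The sketch below picks 0/1; `to_flag` is a hypothetical helper name. Note that `bool()` alone is not enough when strings may appear, because `bool("false")` is True in Python.

```python
def to_flag(value):
    """Map True/False, 1/0, and "true"/"false" strings to 1/0; None stays None."""
    if value is None:
        return None
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "1"):
            return 1
        if text in ("false", "0"):
            return 0
        # Anything else is unexpected; fail loudly rather than guess.
        raise ValueError(f"unexpected win value: {value!r}")
    return int(bool(value))
```

In the row skeleton this would replace `bool(me.get('win'))` with `to_flag(me.get('win'))`.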
Expected derived file
data/
  derived/
    matches.csv
Suggested starter columns
Keep it small. You can always add more later.
match_id
game_start
champion
win
kills
deaths
assists
game_duration_s
Sanity checks (quick)
- Row count == match files processed
- game_start increases when sorted
- win is only 0/1 or True/False
- champion looks like names, not IDs
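These checks are quick to automate. The sketch below reads the CSV back and returns a list of problems. `sanity_check` is a hypothetical name, it assumes the starter column names above, and it interprets the game_start check as "every value is an integer timestamp, so sorting is meaningful".

```python
import csv


def sanity_check(csv_path, expected_rows):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    problems = []
    if len(rows) != expected_rows:
        problems.append(f"row count {len(rows)} != {expected_rows} match files")

    # win should use one representation only (CSV cells are read back as strings).
    win_values = {row.get("win") for row in rows}
    if not (win_values <= {"0", "1"} or win_values <= {"True", "False"}):
        problems.append(f"mixed or unexpected win values: {sorted(map(str, win_values))}")

    # game_start should be an integer timestamp so that sorting is meaningful.
    if not all((row.get("game_start") or "").isdigit() for row in rows):
        problems.append("game_start has blank or non-integer values")

    # A column that is blank in every row usually means a wrong key name.
    if rows:
        for column in rows[0]:
            if all((row.get(column) or "") == "" for row in rows):
                problems.append(f"column {column!r} is empty in every row")

    # champion should hold names, not numeric IDs.
    if any((row.get("champion") or "").isdigit() for row in rows):
        problems.append("champion contains numeric IDs")

    return problems
```

Calling `sanity_check('data/derived/matches.csv', n_files)` after each run, where `n_files` is the number of match files processed, fits the "tiny loops" approach: an empty list means the new column did not break anything obvious.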