Step 05

Minimal CSV extraction

Create one tidy CSV row per game, for your participant only, without flattening the universe.

Run

Run after each small change. Tiny loops win.

uv run python -m src.scout

You will touch

src/scout/ (CSV extraction)
data/derived/

Time

60–120 minutes

Do this (suggested order)

Load all data/raw/matches/*.json files.
For each match, find your participant by matching puuid.
Build one row dict per match with a small set of columns (add columns one at a time).
Write data/derived/matches.csv with consistent headers.
Run sanity checks: row count, win values, time order when sorted, no all-null columns.

You’ll practice

Extract stable fields from nested JSON
Design a derived dataset
Write CSV without losing types

Explainers (for context, not homework)

Caching sanity — Derived data is meant to be rebuildable

Build

Read

Load all data/raw/matches/*.json
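A minimal sketch of the read step, assuming each file in data/raw/matches/ holds one match as a JSON object. The helper name load_matches is hypothetical, not something the project already defines.

```python
import json
from pathlib import Path


def load_matches(raw_dir):
    """Return (path, parsed_json) pairs for every *.json file, sorted by filename."""
    matches = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            matches.append((path, json.load(f)))
    return matches
```

Calling load_matches("data/raw/matches") and printing the length of the result is a quick way to confirm how many files you are about to process. That number is the row count to expect later.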

Select your row

One row per match for your participant
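One way to pick your row, sketched under the assumption that each match keeps its players in match['info']['participants'] and that every entry carries a puuid key. The function name find_me and the sample values are made up for illustration.

```python
def find_me(match, my_puuid):
    """Return the participant dict whose puuid matches, or None if absent."""
    for participant in match.get("info", {}).get("participants", []):
        if participant.get("puuid") == my_puuid:
            return participant
    return None


# Tiny fake match with placeholder puuids, just to show the shape.
sample_match = {
    "info": {
        "participants": [
            {"puuid": "puuid-aaa", "championName": "Ahri"},
            {"puuid": "puuid-bbb", "championName": "Jinx"},
        ]
    }
}

me = find_me(sample_match, "puuid-bbb")
print(me["championName"])  # prints Jinx
```

Returning None instead of raising lets the caller decide whether to skip the match or stop. If you skip, the row count check needs to account for it.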

Write

data/derived/matches.csv with required columns
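A sketch of the write step using the standard library's csv.DictWriter. The column list mirrors the suggested starter columns; FIELDNAMES and write_rows are hypothetical names. Fixing the header order in one place is what keeps headers from drifting between runs.

```python
import csv
from pathlib import Path

# One fixed column order, shared by the header and every row.
FIELDNAMES = [
    "match_id", "game_start", "champion", "win",
    "kills", "deaths", "assists", "game_duration_s",
]


def write_rows(rows, out_path):
    """Write row dicts to out_path, creating the parent directory if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with out_path.open("w", newline="", encoding="utf-8") as f:
        # restval="" leaves missing keys blank; unknown keys raise ValueError.
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Because DictWriter raises on keys that are not in FIELDNAMES, a typo in a column name fails loudly instead of silently creating a shifted column.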

Check yourself

CSV exists with same row count as match files processed
Values look sane (no all-null columns)

If it breaks

Selecting the wrong participant
CSV headers off-by-one
Mixing strings/ints in same column

Hints (spoilers)

Hint: build it like mini-quests (one column at a time)

Build the CSV like a checklist: get match_id + game_start + win first, then add one column at a time. If a new column breaks things, you know exactly which one did it.

Bigger hint: row skeleton (small, repeatable)

A reasonable starting row

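# Assumes match_id, match (the parsed JSON), and me (your participant dict) already exist.
row = {
    'match_id': match_id,
    'game_start': match['info'].get('gameStartTimestamp'),
    'win': bool(me.get('win')),  # a missing win becomes False here
    'champion': me.get('championName'),
    'kills': me.get('kills'),
    'deaths': me.get('deaths'),
    'assists': me.get('assists'),
}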

Bigger hint: missing fields (use .get and keep moving)

Not every field exists in every match. Use .get() with defaults and allow “optional columns” to be blank instead of crashing.
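To see what "blank instead of crashing" looks like in practice, here is a small sketch with a made-up participant that lacks assists. dict.get returns None for the missing key, and the csv module writes None as an empty cell.

```python
import csv
import io

# Fake participant: 'assists' is deliberately missing.
me = {"championName": "Ahri", "kills": 3}

row = {
    "champion": me.get("championName"),
    "kills": me.get("kills"),
    "assists": me.get("assists"),  # None, because the key is absent
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["champion", "kills", "assists"])
writer.writeheader()
writer.writerow(row)

print(buffer.getvalue())
```

The data line comes out as Ahri,3, with nothing after the last comma. Using me['assists'] instead would raise KeyError and stop the whole run on the first odd match.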

Unblock-me: mixed types (make win consistently 0/1 or True/False)

If some rows store win as true and others as "true", pandas will hate you later. Pick one representation and stick to it.
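One possible normalizer, assuming you settle on 0/1. Note that bool("false") is True in Python, so string values need explicit handling rather than a bare bool() call. The name normalize_win is hypothetical.

```python
def normalize_win(value):
    """Map booleans, 0/1, and 'true'/'false' strings to 1 or 0; None stays None."""
    if value is None:
        return None
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "1"):
            return 1
        if text in ("false", "0"):
            return 0
        raise ValueError(f"unrecognized win value: {value!r}")
    return int(bool(value))


print([normalize_win(v) for v in (True, "true", 0, "False")])  # prints [1, 1, 0, 0]
```

Keeping None as None, rather than turning it into 0, means a missing value shows up as a blank cell instead of looking like a loss.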

Expected derived file

data/
  derived/
    matches.csv

Suggested starter columns

Keep it small. You can always add more later.

match_id
game_start
champion
win
kills
deaths
assists
game_duration_s
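The row skeleton does not show where game_duration_s comes from. A hedged sketch, assuming your raw files follow Riot's match-v5 layout, where info.gameDuration is in seconds when gameEndTimestamp is present and in milliseconds for older matches without it. Check a couple of your own files before relying on this; the function name is hypothetical.

```python
def game_duration_seconds(info):
    """Return the game length in whole seconds, or None if it is not recorded."""
    duration = info.get("gameDuration")
    if duration is None:
        return None
    if "gameEndTimestamp" in info:
        # Assumed newer format: gameDuration is already in seconds.
        return int(duration)
    # Assumed older format: gameDuration is in milliseconds.
    return int(duration) // 1000
```

A quick plausibility test is that most values land between a few hundred and a few thousand seconds. Values in the millions suggest a milliseconds field slipped through.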

Sanity checks (quick)

- Row count == match files processed
- game_start is numeric and sorts into the order you played the games
- win is only 0/1 or True/False
- champion looks like names, not IDs
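These checks can be scripted so they run after every rebuild. A minimal sketch, assuming the CSV has at least the match_id, game_start, champion, and win columns; check_csv and ALLOWED_WIN are hypothetical names.

```python
import csv

# Everything arrives as text when read back from CSV.
ALLOWED_WIN = {"0", "1", "True", "False"}


def check_csv(csv_path, expected_rows):
    """Return a list of problems found; an empty list means every check passed."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if not rows:
        return problems

    for column in rows[0]:
        if all(row[column] == "" for row in rows):
            problems.append(f"column {column!r} is blank in every row")

    bad_win = {row["win"] for row in rows} - ALLOWED_WIN
    if bad_win:
        problems.append(f"unexpected win values: {sorted(bad_win)}")

    if any(not row["game_start"].isdigit() for row in rows):
        problems.append("game_start has non-numeric values")

    if any(row["champion"].isdigit() for row in rows):
        problems.append("champion contains numeric IDs instead of names")

    return problems
```

Pass the number of match files you processed as expected_rows and print whatever comes back. An empty list is the goal.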