{
  "id": "demo-ml-pipeline/training-data-pipeline",
  "slug": "training-data-pipeline",
  "title": "Session: training-data-pipeline — 2026-01-20",
  "type": "source",
  "project": "demo-ml-pipeline",
  "date": "2026-01-20",
  "started": "2026-01-20T09:15:00+00:00",
  "ended": "2026-01-20T12:30:00+00:00",
  "model": "claude-opus-4-6",
  "cwd": "/Users/demo/code/demo-ml-pipeline",
  "git_branch": "main",
  "permission_mode": "default",
  "user_messages": 7,
  "tool_calls": 24,
  "tools_used": [
    "Read",
    "Write",
    "Edit",
    "Bash",
    "Glob",
    "Grep"
  ],
  "is_subagent": false,
  "wikilinks_out": [
    "DataPipeline",
    "HuggingFaceDatasets",
    "demo-ml-pipeline",
    "tiktoken"
  ],
  "body_text": "Project: demo-ml-pipeline · Branch: main · Mode: default · Model: claude-opus-4-6\n\nSummary\n\nWrote a data-prep pipeline for a small text classification model. Loads Parquet from S3 (LocalStack in dev), deduplicates, tokenizes with the tiktoken cl100k_base encoder, and writes training shards in HuggingFace datasets format.\n\nConversation\n\nTurn 1 — User\n\nI need a data pipeline that takes our Parquet files and produces train/val shards suitable for a small BERT fine-tune.\n\nTurn 1 — Assistant\n\nI'll structure it as four stages:\n\n1. Load — pyarrow.parquet.read_table over an S3 prefix\n2. Dedupe — exact-match hash on (text, label)\n3. Tokenize — tiktoken cl100k_base; batch in chunks of 1024 for throughput\n4. Shard — 80/20 stratified split, write via datasets.Dataset.save_to_disk\n\nThe stages will be composable so we can run any subset (e.g. just re-tokenize without re-loading).\n\nTurn 2 — User\n\nWrite it.\n\nTurn 2 — Assistant\n\nDone. Each stage is a function in pipeline/stages/. There's a top-level pipeline.run() that reads a config.yaml and dispatches. Local test on 10k rows runs in 3.2s end to end.\n\nTools used in this turn:\n- Write: pipeline/stages/load.py (5-line preview)\n- Write: pipeline/stages/dedupe.py (5-line preview)\n- Write: pipeline/stages/tokenize.py (5-line preview)\n- Write: pipeline/stages/shard.py (5-line preview)\n- Bash: python -m pipeline --config configs/dev.yaml — exit 0\n\nConnections\n\n- demo-ml-pipeline — project\n- tiktoken — library\n- HuggingFaceDatasets — library\n- DataPipeline — concept",
  "sha256": "09fd8efe6c023c19",
  "source_url": "sessions/demo-ml-pipeline/training-data-pipeline.html"
}