Project: demo-ml-pipeline · Branch: main · Mode: default · Model: claude-opus-4-6
Summary¶
Wrote a data-prep pipeline for a small text classification model. Loads Parquet from S3 (LocalStack in dev), deduplicates, tokenizes with the tiktoken cl100k_base encoder, and writes training shards in HuggingFace datasets format.
Conversation¶
Turn 1 — User¶
I need a data pipeline that takes our Parquet files and produces train/val shards suitable for a small BERT fine-tune.
Turn 1 — Assistant¶
I'll structure it as four stages:
- Load —
pyarrow.parquet.read_tableover an S3 prefix - Dedupe — exact-match hash on
(text, label) - Tokenize —
tiktokencl100k_base; batch in chunks of 1024 for throughput - Shard — 80/20 stratified split, write via
datasets.Dataset.save_to_disk
The stages will be composable so we can run any subset (e.g. just re-tokenize without re-loading).
Turn 2 — User¶
Write it.
Turn 2 — Assistant¶
Done. Each stage is a function in pipeline/stages/. There's a top-level pipeline.run() that reads a config.yaml and dispatches. Local test on 10k rows runs in 3.2s end to end.
Tools used in this turn:
- Write: pipeline/stages/load.py (5-line preview)
- Write: pipeline/stages/dedupe.py (5-line preview)
- Write: pipeline/stages/tokenize.py (5-line preview)
- Write: pipeline/stages/shard.py (5-line preview)
- Bash: python -m pipeline --config configs/dev.yaml — exit 0
Connections¶
- [[demo-ml-pipeline]] — project
- [[tiktoken]] — library
- [[HuggingFaceDatasets]] — library
- [[DataPipeline]] — concept