Music Pipeline Consolidation Summary

Extracted abstract or opening context

This document tracks the consolidation of music pipeline code into `backend/cc-music-pipeline/python/`.

| Old File | New Location | Status |
|----------|--------------|--------|
| `parse_soundcloud_likes.py` | `sources/soundcloud.py` | Merged |
| `parse_soundcloud_v2.py` | `sources/soundcloud.py` | Merged |
| `download_music.py` | `download/downloader.py` | Merged |
| `download_music_to_gcs.py` | `storage/gcs.py` | Merged |
| `process_all_tracks.py` | `pipeline.py` | Merged |
| `process_music_list.py` | `pipeline.py` | Merged |
| `process_soundcloud_likes.py` | `pipeline.py` | Merged |
| `reprocess_soundcloud.py` | `cli.py research` | Merged |
| `retry_failed_tracks.py` | `cli.py retry` | Merged |
| `run_pipeline.py` | `cli.py ingest` | Merged |
| `debug_search.py` | `search/youtube.py` | Merged |
| `test_improved_search.py` | - | Test only, not needed |
| `test_music_pipeline.py` | - | Test only, not needed |
| `test_query_cleaning.py` | - | Test only, not needed |
| `MUSIC_PIPELINE_ROADMAP.md` | Archived | Reference only |
| `RESUME_DOWNLOADS.md` | - | Obsolete |

All output data files have been copied to `backend/cc-music-pipeline/python/data/`:

- `all_liked.txt` - Original SoundCloud likes (478 tracks)
- `soundcloud_likes_youtube_results.json` - YouTube search results (286 found)
- `download_progress.json` - Download progress tracking
- `all_music_search_results.json` / `.csv` - Additional search results
- `soundcloud_youtube_urls.txt` - Found YouTube URLs

| Feature | Old (`tools/`) | New (`cc_music_ingestion`) |
|---------|----------------|----------------------------|
| SoundCloud parsing | Multiple scripts | `SoundCloudParser` class |
| YouTube search | In-line code | `YouTubeSearcher` with query variants |
| Download | Basic | Exponential backoff, retry logic |
| GCS upload | Script-only | `GCSStorage` class with database |
| Rate limiting | Manual | Automatic detection and backoff |
| Progress tracking | JSON files | Unified progress system |
| CLI | Multiple scripts | Single `cli.py` with subcommands |
| Paste mode | N/A | New feature |
| Cloud Run | Dockerfile.cloud | Integrated Dockerfile |
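
The file mapping folds three standalone scripts into `cli.py` subcommands (`ingest`, `retry`, `research`). The following is a minimal sketch of how such a single entry point could be laid out with the standard-library `argparse` module. Only the subcommand names and the legacy script names come from this document. The `build_parser` function, the `LEGACY_TO_SUBCOMMAND` dictionary, and the help strings are hypothetical and are not taken from the actual `cli.py`, which may use a different CLI library and accept additional options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per legacy entry-point script.

    Hypothetical layout: only the subcommand names are taken from the
    file mapping; everything else here is illustrative.
    """
    parser = argparse.ArgumentParser(prog="cli.py")
    # required=True makes argparse reject an invocation with no subcommand.
    subparsers = parser.add_subparsers(dest="command", required=True)

    subparsers.add_parser("ingest", help="replaces run_pipeline.py")
    subparsers.add_parser("retry", help="replaces retry_failed_tracks.py")
    subparsers.add_parser("research", help="replaces reprocess_soundcloud.py")
    return parser


# Hypothetical lookup from each legacy script to its replacement subcommand,
# mirroring the rows of the file mapping table.
LEGACY_TO_SUBCOMMAND = {
    "run_pipeline.py": "ingest",
    "retry_failed_tracks.py": "retry",
    "reprocess_soundcloud.py": "research",
}

# Parse an explicit argument list instead of sys.argv so the sketch
# runs the same way in any environment.
args = build_parser().parse_args(["retry"])
print(f"selected subcommand: {args.command}")
```

With this layout, an invocation such as `python cli.py retry` would take the place of running `retry_failed_tracks.py` directly, and an unknown or missing subcommand is rejected by `argparse` before any pipeline code runs.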

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.