Music Pipeline Consolidation Summary
This document tracks the consolidation of music pipeline code into `backend/cc-music-pipeline/python/`.
Consolidated Files
From `tools/music-pipeline/`
| Old File | New Location | Status |
|---|---|---|
| `parse_soundcloud_likes.py` | `sources/soundcloud.py` | Merged |
| `parse_soundcloud_v2.py` | `sources/soundcloud.py` | Merged |
| `download_music.py` | `download/downloader.py` | Merged |
| `download_music_to_gcs.py` | `storage/gcs.py` | Merged |
| `process_all_tracks.py` | `pipeline.py` | Merged |
| `process_music_list.py` | `pipeline.py` | Merged |
| `process_soundcloud_likes.py` | `pipeline.py` | Merged |
| `reprocess_soundcloud.py` | `cli.py research` | Merged |
| `retry_failed_tracks.py` | `cli.py retry` | Merged |
| `run_pipeline.py` | `cli.py ingest` | Merged |
| `debug_search.py` | `search/youtube.py` | Merged |
| `test_improved_search.py` | - | Test only, not needed |
| `test_music_pipeline.py` | - | Test only, not needed |
| `test_query_cleaning.py` | - | Test only, not needed |
| `MUSIC_PIPELINE_ROADMAP.md` | Archived | Reference only |
| `RESUME_DOWNLOADS.md` | - | Obsolete |
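The table maps several standalone scripts onto subcommands of a single `cli.py`. As a rough illustration of that pattern (not the actual contents of `cli.py`), the sketch below registers the subcommand names that appear in this document (`ingest`, `retry`, `research`, `status`, and `paste --dry-run`) on one `argparse` parser. The function names `build_parser` and `describe`, and all help strings, are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build one parser with a subcommand per entry point (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="cli.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # Subcommand names come from this document; help strings are assumptions.
    sub.add_parser("ingest", help="replaces run_pipeline.py")
    sub.add_parser("retry", help="replaces retry_failed_tracks.py")
    sub.add_parser("research", help="replaces reprocess_soundcloud.py")
    sub.add_parser("status", help="show pipeline status")

    paste = sub.add_parser("paste", help="paste mode")
    paste.add_argument("--dry-run", action="store_true")
    return parser


def describe(argv: list[str]) -> str:
    """Parse argv and return a short label for the selected subcommand."""
    args = build_parser().parse_args(argv)
    # Only the paste subcommand defines dry_run, so default to False elsewhere.
    if getattr(args, "dry_run", False):
        return args.command + " (dry run)"
    return args.command
```

Keeping every entry point behind one parser means shared options and help output live in a single place, which is the main practical gain over separate scripts.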
Data Files Copied
All output data files have been copied to `backend/cc-music-pipeline/python/data/`:
- `all_liked.txt` - Original SoundCloud likes (478 tracks)
- `soundcloud_likes_youtube_results.json` - YouTube search results (286 found)
- `download_progress.json` - Download progress tracking
- `all_music_search_results.json` / `.csv` - Additional search results
- `soundcloud_youtube_urls.txt` - Found YouTube URLs
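Before deleting the legacy folder, it may be worth confirming that each listed file actually exists in the new `data/` directory. The following is a minimal sketch of such a check. The helper name `missing_data_files` is hypothetical, and expanding "`all_music_search_results.json` / `.csv`" to `all_music_search_results.csv` is an assumption about the second file's name.

```python
from pathlib import Path

# File names as listed in this document (the .csv name is an assumed expansion).
EXPECTED_FILES = [
    "all_liked.txt",
    "soundcloud_likes_youtube_results.json",
    "download_progress.json",
    "all_music_search_results.json",
    "all_music_search_results.csv",
    "soundcloud_youtube_urls.txt",
]


def missing_data_files(data_dir: Path) -> list[str]:
    """Return the expected file names that are absent from data_dir."""
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_data_files(Path("backend/cc-music-pipeline/python/data"))
    print("all present" if not missing else f"missing: {missing}")
```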
---
Files Safe to Remove
The legacy directory can be removed once the new package has been verified:
# Remove redundant tools folder
rm -rf tools/music-pipeline/
# Alternative: archive instead of deleting
mkdir -p archive
mv tools/music-pipeline/ archive/music-pipeline-legacy/
Before Removing
Verify the new package works:
cd backend/cc-music-pipeline/python
# Test parsing
python -c "from cc_music_ingestion import SoundCloudParser; print('OK')"
# Test CLI
python cli.py status
# Test paste mode (dry run)
python cli.py paste --dry-run
---
New Package Structure
backend/cc-music-pipeline/python/
├── cc_music_ingestion/ # Main package
│ ├── __init__.py # Package exports
│ ├── pipeline.py # Main orchestrator
│ ├── sources/ # Source parsers
│ │ ├── soundcloud.py # SoundCloud likes parser
│ │ └── playlist.py # Generic playlist parser
│ ├── search/ # Search engines
│ │ └── youtube.py # YouTube search with query cleaning
│ ├── download/ # Downloaders
│ │ └── downloader.py # yt-dlp with exponential backoff
│ └── storage/ # Storage backends
│ └── gcs.py # Google Cloud Storage
├── cli.py # Command-line interface
├── data/ # Historical data files
├── requirements.txt # Python dependencies
├── Dockerfile # Container image
├── cloudbuild.yaml # Cloud Build config
├── API.md # API documentation
└── CONSOLIDATION.md # This file
---
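The tree describes `search/youtube.py` as "YouTube search with query cleaning", but this document does not show the cleaning rules. Purely as an illustration of the idea, the sketch below strips bracketed tags from a track title and builds a few query variants from most to least specific. The function names, the regular expressions, and the variant order are all assumptions, not the behavior of the real module.

```python
import re

# Hypothetical noise patterns; the real rules in search/youtube.py are not shown here.
_BRACKETED = re.compile(r"[\(\[][^\)\]]*[\)\]]")
_WHITESPACE = re.compile(r"\s+")


def clean_query(title: str) -> str:
    """Drop bracketed tags such as "(Free Download)" and collapse whitespace."""
    without_tags = _BRACKETED.sub(" ", title)
    return _WHITESPACE.sub(" ", without_tags).strip()


def query_variants(artist: str, title: str) -> list[str]:
    """Return search queries from most to least specific, without duplicates."""
    cleaned = clean_query(title)
    candidates = [f"{artist} - {title}", f"{artist} {cleaned}", cleaned]
    seen: set[str] = set()
    variants: list[str] = []
    for candidate in candidates:
        normalized = _WHITESPACE.sub(" ", candidate).strip()
        if normalized and normalized not in seen:
            seen.add(normalized)
            variants.append(normalized)
    return variants
```

A caller would typically try each variant in order and stop at the first one that returns a usable match.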
Feature Comparison
| Feature | Old (`tools/`) | New (`cc_music_ingestion`) |
|---|---|---|
| SoundCloud parsing | Multiple scripts | `SoundCloudParser` class |
| YouTube search | In-line code | `YouTubeSearcher` with query variants |
| Download | Basic | Exponential backoff, retry logic |
| GCS upload | Script-only | `GCSStorage` class with database |
| Rate limiting | Manual | Automatic detection and backoff |
| Progress tracking | JSON files | Unified progress system |
| CLI | Multiple scripts | Single `cli.py` with subcommands |
| Paste mode | N/A | New feature |
| Cloud Run | Dockerfile.cloud | Integrated Dockerfile |
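The Download and Rate limiting rows both mention exponential backoff. For readers unfamiliar with the technique, here is a generic sketch under stated assumptions: it is not the code in `download/downloader.py`, it does not call yt-dlp, and the name `retry_with_backoff` and its parameters and defaults are hypothetical. Each failed attempt doubles the wait, capped at `max_delay`, with a small random jitter added.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter (up to 10% extra) keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the function can be exercised in tests without real waiting. A production version would likely catch only rate-limit or network errors rather than every `Exception`.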
---
Migration Complete
Date: 2025-12-27
All functionality from `tools/music-pipeline/` has been consolidated into the new unified package at `backend/cc-music-pipeline/python/cc_music_ingestion/`.
The new package includes:
- Cleaner modular architecture
- Improved error handling and retry logic
- Unified CLI with paste mode
- Comprehensive API documentation
- Cloud Run deployment support
Source Anchor
Comp-Core/backend/cc-music-pipeline/python/CONSOLIDATION.md