Lightweight Taste Training Plan: Teaching the LLM to Mashup Better¶
revision: 2
Executive Summary¶
This plan consolidates two existing documents into one actionable training strategy: - mixTaste.md: The comprehensive mixing intelligence spec (tempo, arrangement, harmonic, EQ/dynamics, genre rules) - Taste Model Synthesis: The ML architecture for a CatBoost pairwise ranker
The goal is a lightweight training run that improves the LLM's ability to generate high-quality RemixPlan objects -- the structured arrangement decisions that determine which stems play when, at what levels, with what transitions. The training does NOT touch the DSP/rendering pipeline; it only improves plan selection.
Core thesis: The biggest quality lever is not better audio processing -- it's better decisions about what to play when. A tasteful arrangement with basic processing beats a bad arrangement with pristine mastering every time.
1. What We're Training¶
1.1 The Decision Surface¶
The LLM currently generates a single RemixPlan via deterministic fallback. The trained system will:
- Generate 8-12 candidate plans by varying arrangement knobs
- Hard-filter invalid candidates (constraint violations)
- Score surviving candidates with a CatBoost pairwise ranker
- Select the best plan and render it
1.2 What a RemixPlan Controls¶
| Decision | Current | After Training |
|---|---|---|
| Section count & boundaries | Fixed 5-section template | 4-7 sections, varied by song analysis |
| Vocal entry timing | Always bar 16 | Adaptive: bar 8, 16, or 24 based on energy/genre |
| Energy arc | Flat-ish | Template-matched (Classic/EDM/Hip-Hop/DJ-Lift arc) |
| Breakdown placement | Fixed at ~60% | 50-70%, optimized per pair |
| Transition style | Always crossfade | Cut, crossfade, filter sweep, silence gap -- genre-appropriate |
| Per-section gains | Conservative uniform | Dynamic, section-appropriate |
| Peak moment placement | Accidental | Intentional, 55-80% of timeline |
1.3 What We're NOT Training¶
- No end-to-end audio generation model
- No large audio encoder fine-tuning
- No changes to the DSP/rendering pipeline
- No per-user personalization (MVP)
- No real-time GPU inference
2. Architecture¶
2.1 Candidate Generation¶
Generate candidates by controlled variation, not random sampling.
Arrangement families (4 for MVP): - Standard arc: intro -> build -> main -> breakdown -> peak/outro - Hook-first: short intro -> vocal hook -> verse -> peak -> outro - DJ Lift: build -> vocal in -> peak -> vocal out -> outro - Quick Hit: intro -> main vocal block -> short release
Variation knobs (crossed to produce candidates): | Knob | Values | Notes | |------|--------|-------| | Section count | 4, 5, 6, 7 | Conditioned on duration (60-120s) | | Vocal entry | 8, 16, 24 beats | Earlier for hook-first/DJ-lift, later for standard | | Breakdown position | 50%, 60%, 70% of timeline | | | Transition style | cut, crossfade, filter sweep, silence gap | Genre-biased default; filter sweep for EDM/Afrobeats, cut for hip-hop/grime | | Transition length | 2, 4, 8 beats | Bounded by section length | | Vocal/instrumental gain delta | -6dB, -3dB, 0dB, +3dB steps | Finer control for vocal prominence |
Deduplication: Hash on section labels + boundary beats + transition types + coarse gain bins. After deduplication, if fewer than 8 candidates survive, backfill with safer variants from underrepresented families. Cap at 16, target 8-12.
2.2 Hard Constraints (Must Pass Before Scoring)¶
These are non-negotiable rules derived from mixTaste.md and DJ rules. Any candidate violating these is rejected before scoring.
| Constraint | Source | Threshold |
|---|---|---|
| Contiguous non-overlapping sections | Structure validity | Monotonic boundaries, start=0 |
| Section minimum length | DJ rules | >= 8 beats for major sections, >= 4 beats minimum |
| Beat grid alignment | Phrasing rules | Boundaries on 4-beat multiples |
| MVP source split | Product constraint | One song = vocals, other = instrumentals |
| Tempo stretch safety | mixTaste 1.2 | Drums <= 12%, vocals <= 35%, other <= 40% |
| Pitch shift safety | mixTaste 3.1 | <= 4 semitones |
| Transition bounds | DJ rules | transition_beats <= half section length |
| True peak ceiling | Audio hygiene | <= -1.0 dBTP |
| LUFS window | Audio hygiene | Genre-conditional: lo-fi -16 to -12, EDM -12 to -9, default within 2 dB of -12 |
| LRA floor | Audio hygiene | Loudness Range >= 4 dB (prevents wall-of-sound) |
| Contrast requirement | DJ rules | At least one contrast event before peak with magnitude: stem count drop >= 2 OR energy drop >= 20% |
| No dual lead vocals | MVP constraint | One lead vocal at a time |
| Outro quality | DJ rules | Final section must be outro-labeled, >= 8 beats, energy below peak |
| Stem quality gate | Audio quality | No solo vocal sections when stem separation quality is below threshold (cross-bleed ratio < 0.7) |
2.3 Feature Engineering¶
Features are organized into 3 tiers by computational cost. Phase 1 uses only Tier 1 features (20-30 total). Tier 2 added in Phase 2 at 1500+ labels. Tier 3 deferred to Phase 3 preview reranker.
Tier 1: Metadata-only (computable from plan + cached song analysis, no rendering)
Group 1: Structure (10-12 features) - Section count, mean/std section duration, min section length - Phrase boundary hit rate (% boundaries on 16-beat multiples) - Section validity ratio (% sections >= 8 beats with monotonic bounds) - Vocal placement fit (intro/outro vocal mute compliance) - Arrangement family match score
Group 2: Energy Arc (8-10 features) - Section energy profile correlation vs template (classic/EDM/hip-hop/DJ-lift) - Peak timing score (max energy section at 55-80% of timeline) - Rise/fall sanity (verse->chorus delta in expected range +2 to +6 dB) - Contrast index (normalized variance of stem activity across sections) - Density contour smoothness (mean absolute change in active-stem count)
Group 4: Harmonic/Tempo Risk (6-8 features) - Camelot distance after pitch decision (0 = best, >3 penalized heavily) - Absolute pitch shift semitones - Total tempo stretch amount (per stem type) - Stretch direction penalty (slow-down penalized more than speed-up)
Group 7: Prompt Fit (5-8 features) - Energy level mismatch (prompt intent vs plan energy profile) - Structural preference match (prompt implies "build" vs "immediate") - Vocal prominence alignment (prompt emphasis vs vocal duty cycle) - Genre compatibility score (from cross-genre combination matrix) - Chaos flag (prompt asks for experimental vs plan conservatism)
Tier 2: Proxy features (require cached analysis data, not full rendering)
Group 3: Vocal Clarity Risk (8-10 features) - Expected vocal-to-masker ratio in 2-5 kHz (from pre-computed frequency profiles) - Masking duty cycle (% vocal-active frames with predicted VMR < -3 dB) - Vocal exposure window ratio (total vocal-active time / remix duration) - Sibilance harshness proxy (6-10 kHz vocal excess over mid band)
Group 6: Groove Coherence (15-20 features) - Vocal onset-to-nearest-beat offset median + IQR - Downbeat alignment score - Swing mismatch risk (genre swing profiles: source vs target delta) - Kick-snare timing relationship compatibility - Groove template correlation (source groove profile vs target beat grid) - Accent pattern alignment (rhythmic density match between vocal flow and instrumental groove) - Pocket depth proxy (onset deviation distribution shape -- tight vs loose)
Group 8: Spectral Balance (8-10 features) - Low-mid energy ratio (200-500 Hz accumulation risk from overlapping stems) - Bass collision index (sub-200 Hz energy overlap between instrumental bus stems) - Sub-bass mono correlation proxy - Spectral tilt (overall balance bright vs dark) - Stem separation quality score (cross-bleed ratio from Demucs, per stem)
Tier 3: Render-required features (need audio rendering, deferred to Phase 3)
Group 5: Transition Quality (10-12 features) - Boundary loudness jump (LUFS delta 500ms pre/post boundary) - Spectral discontinuity (flux z-score at boundaries vs local median) - Click risk (max sample derivative spike near boundaries) - Reverb tail truncation risk - Harmonic center shift at boundary - Crossfade low-frequency bump - Timbral continuity score
Group 9: Moment Quality (5-8 features, Phase 3) - Coincidental melodic alignment score (chroma correlation at key moments) - Energy contour correlation between vocal and instrumental at peaks - Lyric-music alignment proxy
Feature versioning: Every feature extraction run produces a versioned manifest (feature names, computation method, version hash). Training data always tagged with feature version. Never mix versions in one training set.
2.4 Scoring Model¶
Model: CatBoost pairwise ranker (single model for MVP)
Objective: Pairwise ranking loss (RankNet-style logistic on score difference)
MVP scoring approach: Feed all features (including heuristic subscores as input features) directly to CatBoost. Let the model learn the optimal combination rather than manually tuning weights between heuristic and learned components. The hybrid formula score(c) = I[hard_pass(c)] * (w*x(c) + r_theta(c)) - lambda*risk(c) is available as a fallback if CatBoost is unavailable, using hand-tuned heuristic weights.
Calibration: Fit Platt scaling on validation set after training. Define tie-break threshold tau in probability space (e.g., P(winner) < 0.6 triggers lower-risk preference). Report calibration curves per training run.
Selection policy: Pick top-1; if margin(top1, top2) < tau, choose lower-risk candidate.
Fallback: If model unavailable or all scored candidates below quality threshold, use deterministic baseline heuristic scorer.
Model loading: Load CatBoost model at server startup. Run dummy prediction to warm cache. Implement hot-reload via file mtime watching for model updates.
3. Data Collection Strategy¶
3.1 Label Sources (3 tiers)¶
Tier 1: Auto-generated labels (available immediately) - Hard constraint failures as automatic negatives - Heuristic rubric scores from mixTaste rules (used as weak labels, NOT mixed directly with human labels -- see 3.5) - Deterministic "obviously bad" rejection labels
Tier 2: Human pairwise labels (primary training signal) - A/B preference: "Which remix sounds better?" on 15-25s preview clips - Per-rubric item scores (1-5) with failure tags - Target: 500 pairwise labels in first 2-3 weeks (Phase 1 starting point)
Phase 1 feature count discipline: With 500 labels, use only 20-30 Tier 1 features. Scale to full feature set (80+) only at 1500+ labels in Phase 2.
Tier 3: Implicit behavioral signals (post-launch, weak) - Replay rate, listen-through %, regenerate rate - Treat as low-confidence unless corroborated by human labels
3.2 Labeling Protocol¶
Setup requirements: - Closed-back headphones or calibrated monitors (no laptop speakers) - Normalized playback monitor level per session - Loudness-matched pairs: All A/B pairs normalized to -12 LUFS (+/- 0.5 dB) before playback. Verify no LUFS-preference correlation in label data. - Blind to plan details and model version
Label format:
- Pairwise A/B winner + confidence (low/med/high)
- Per rubric item: integer 1-5 with anchors
- Failure tags (multi-select): muddy, vocal_buried, clashy_key, awkward_transition, fatiguing, timing_off, low_end_muddy, groove_off
Bias controls: - Position bias: Randomize A/B presentation order for every pair. Present 5-10% of pairs in both orders to measure and correct for position bias. - Loudness bias: Loudness-matched pairs as above. - Tie handling: Allow "no preference / tie" as explicit option. Ties excluded from pairwise training but included in calibration analysis.
Quality controls: - 10% repeated pairs to measure rater consistency - Minimum inter-rater agreement: Cohen's kappa >= 0.4 before training begins. If below threshold, refine rubric anchors and re-calibrate raters. - 3 calibration examples (bad/mid/good) at session start - Minimum 2-3 raters on seed benchmark set - Weekly disagreement review to improve rubric - Confidence-weighted loss: high-confidence labels weighted 1.0, medium 0.7, low 0.4
3.3 The Heuristic Rubric (Weak Labels)¶
Derived from mixTaste.md, these generate scores before any human labels exist:
| Dimension | Weight | What It Measures |
|---|---|---|
| Arrangement quality | 15% | Section pacing, phrasing, stem density choices |
| Energy arc | 15% | Builds/releases feel intentional |
| Vocal intelligibility | 15% | VMR in 2-5 kHz, masking risk |
| Harmonic fit | 15% | Key distance, pitch shift amount |
| Transition quality | 15% | Boundary smoothness, no jolts, timbral continuity |
| Groove coherence | 15% | Beat alignment, swing compatibility, pocket |
| Loudness/fatigue | 10% | LUFS accuracy, crest factor, LRA |
(Weights rebalanced from R1: transition and groove elevated from 10% each to 15% each; arrangement split reduced.)
3.4 Genre-Specific Taste Rules (from mixTaste.md)¶
The training data and features must respect genre-specific conventions:
Tempo compatibility (per mixTaste Section 1): - Direct match (< 8% stretch) is always preferred - Half/double-time is natural and widely understood - Polyrhythmic only at creativity 3+ (experimental) - Speed-up produces fewer artifacts than slow-down
Arrangement conventions (per mixTaste Sections 2, 5): - Hip-hop: 8/16-bar phrases, ad-libs fill gaps on beats 2/4 - EDM: vocals in breakdowns only, drops are instrumental - R&B: preserve laid-back timing, sustain through barlines - Rock: guitar takes vocal frequency range during solos - DnB: half-time vocals over full-time drums - Afrobeats: percussion builds over 16-32 bars, plateau energy - Reggaeton: near-flat high-energy curve driven by dembow
Mixing conventions (per mixTaste Sections 3, 4): - Key compatibility via Camelot wheel (adjacent keys safe) - EQ carving: vocal pocket at 2-5 kHz, instrumental ducks - Sidechain: pattern-aware (4/4 standard, syncopated simplified, breakbeat uses automation instead)
3.5 Label Source Mixing Strategy¶
Critical: Heuristic and human labels are NOT mixed directly. The training strategy is:
- Phase 0-1 (bootstrap): Train CatBoost on heuristic labels with source tag
heuristic. Weight at 0.2. - Phase 1 (human labels arrive): Add human labels with source tag
human. Weight at 1.0. - Validation: Always evaluate on human-only holdout. Report human-only pairwise accuracy separately from mixed accuracy.
- Residual learning: Optionally train on residual (human preference minus heuristic score) to learn where human taste diverges from rules.
- Data splits: All splits at request level (no request appears in multiple splits). This prevents within-request pair leakage that inflates metrics.
4. Training Loop¶
4.1 Phase 0: Foundation (Days 1-7)¶
| Day | Deliverable | Status |
|---|---|---|
| 1 | candidate_planner.py -- expand fallback plan into 8-12 variants across 4 families with structure hash dedupe |
✅ PR #49 |
| 2 | Hard constraint validation skeleton with failure reason codes + enum, wired into pipeline | ✅ PR #50 |
| 3 | Feature extraction module (first 20-30 Tier 1 features) with versioned manifest + logging schema | ✅ PR #51 + #53 |
| 4 | Baseline weighted heuristic scorer, integrated behind ab_taste_model_v1 flag with timeout wrapper |
✅ PR #52 + #53 |
| 5 | Minimal local A/B labeling script (loudness-matched, position-randomized) + first batch of 25-40 remix requests | ☐ |
| 6 | First CatBoost pairwise model trained on heuristic + auto labels | ☐ |
| 7 | End-to-end smoke tests, latency validation, go/no-go report | ☐ |
Week 1 exit criteria: - [x] Candidate generation (4 families) + hard constraints live behind feature flag - [x] Baseline scoring selects a winner deterministically - [x] Telemetry captures candidates, features, scores, selection rationale (full logging schema) - [ ] Labeling loop functional with initial pairwise data (loudness-matched, position-randomized) - [x] No regression when feature flag off - Taste stage wrapped in timeout with circuit breaker
Training run manifest (produced by every training run): - Config hash (hyperparameters, feature version, label sources) - Data hash (training set fingerprint) - Feature manifest (names, versions, computation methods) - Model artifact (versioned, with metadata) - Eval metrics (all offline metrics on validation set)
4.2 Phase 1: First Real Model (Weeks 2-3)¶
- Collect 500 human pairwise labels (rater kappa >= 0.4 verified)
- Train CatBoost pairwise model on 20-30 Tier 1 features + heuristic subscores as inputs
- Label source weights: human=1.0, heuristic=0.2, auto-negative=0.5
- Validate on frozen holdout (request-level split, 100+ tasks, 3-5 candidates each)
- Report human-only holdout accuracy separately
- Ship behind feature flag
- A/B: model winner vs fallback on held-out tasks, target >= 65% preference
- Log full flag configuration (all ab_* flags) with every training data point
4.3 Phase 2: Iteration (Weeks 4-6)¶
- Active-learning label queue (show uncertain pairs to raters)
- Scale to Tier 1 + Tier 2 features (40-60 total) once 1500+ labels collected
- Add Group 6 (expanded groove), Group 8 (spectral balance) features
- Per-genre calibration (monitor for genre bucket regressions)
- Retrain cadence: trigger-based (200+ new labels or drift detected), not fixed weekly. Use faster cadence (daily) in early Phase 2.
- Increase frozen holdout to 250+ tasks for per-genre statistical power
4.4 Phase 3: Optional Enhancements (Weeks 7-12)¶
- Top-2 preview reranker (render short previews of top 2, pick winner using Tier 3 render-required features)
- Moment quality features (Group 9)
- Personalization exploration
- Text/plan cross-encoder experiment (Family 2 model)
- Full feature set (80+ features) once 3000+ labels
5. Integration¶
5.1 New Files¶
| File | Purpose |
|---|---|
backend/src/musicmixer/services/candidate_planner.py |
Generate 8-12 candidate plans across 4 arrangement families |
backend/src/musicmixer/services/taste_model.py |
CatBoost ranker + heuristic fallback scorer + model loading/hot-reload |
backend/src/musicmixer/services/taste_features.py |
Tiered feature extraction module with versioned manifests |
backend/src/musicmixer/services/taste_constraints.py |
Hard constraint validation with failure code enum |
backend/scripts/label_ab.py |
Local A/B labeling script (loudness-matched, position-randomized) |
backend/scripts/train_taste.py |
Training script with manifest output and reproducibility |
5.2 Pipeline Insertion¶
Current: step 4 (generate_fallback_plan) -> step 5+ (render)
After: step 4a (generate candidates) -> step 4b (hard filter) -> step 4c (score + select) -> step 5+ (render)
Gated behind TASTE_MODEL_ENABLED flag. When off, existing fallback behavior is unchanged.
Timeout/circuit breaker: Entire taste stage wrapped in ThreadPoolExecutor future with 400ms hard timeout. Circuit breaker: 5 consecutive fallbacks disables taste stage for 10 minutes (configurable). On any error/timeout, fall back to deterministic baseline in <= 50ms.
Flag interaction: Log full ab_* flag configuration with every request and training data point. When any ab_* flag changes, re-evaluate model on holdout and retrain if metrics shift by > 2%.
5.3 Logging Schema¶
Per request, log (as Pydantic model):
- request_id, prompt, song A/B metadata (including raw analysis)
- feature_version (manifest hash)
- model_version (artifact hash)
- flag_config (all ab_* flags + TASTE_MODEL_ENABLED state)
- All candidate plans (serialized) with structure hashes
- Hard constraint pass/fail per candidate with failure reason enum
- Feature vectors (versioned) per candidate
- Ranker scores + score margins + calibrated probabilities
- Selected candidate ID and selection rationale
- Runtime metrics (generation latency, scoring latency, total taste-stage time)
- Fallback trigger flag + reason if applicable
6. Performance Targets¶
| Metric | Target |
|---|---|
| Candidate generation | <= 120ms CPU |
| Hard filtering | <= 20ms CPU |
| Feature extraction (Tier 1, 20-30 features) | <= 50ms CPU |
| Scoring (CatBoost) | <= 80ms CPU |
| Total taste-stage P95 | <= 300ms |
| Taste-stage hard timeout | 400ms |
| Fallback recovery | <= 50ms on any error |
| Model preference vs fallback | >= 65% in blind A/B |
| Constraint violation rate | 0% (hard gate) |
| Fallback trigger rate | <= 1% |
| CatBoost cold-start (server startup) | <= 500ms (model load + warm-up prediction) |
7. Cost¶
| Component | Cost Estimate |
|---|---|
| Inference (CPU ranking) | Near-zero marginal |
| Training (CPU, CatBoost, trigger-based) | $0-3/run on 8 vCPU |
| GPU (stem separation, existing) | Unchanged |
| Labeling (friends, manual) | Time only |
| Storage (features, checkpoints, logs) | < $5/month S3 |
| Total additional monthly | < $20 |
8. Evaluation Framework¶
8.1 Offline Metrics¶
- Pairwise accuracy on holdout (human-only and mixed)
- Top-1 preferred rate vs fallback baseline (primary metric -- only top-1 is rendered)
- Regret@1: How often does the model's top-1 lose to any other candidate in pairwise comparison
- Calibration curves (Platt-scaled probabilities vs observed win rates)
- Per-genre performance spread
8.2 Safety/Regression Checks¶
- Constraint violation rate (must be 0%)
- Per-genre regression monitoring
- Correlation checks: score vs loudness, stem density, transition count, tempo stretch amount, section count (detect reward hacking)
- Wall-of-sound detection (gain headroom utilization, mute ratio)
- Feature importance drift tracking across training runs
8.3 Online Metrics (Post-Launch)¶
- Regenerate rate (lower = better)
- Listen-through proxy (% of remix played)
- Replay/keep rate
- A/B preference in production (randomized)
9. What Not To Build Yet¶
- End-to-end generative audio model
- Large audio-text multimodal fine-tuning
- Real-time GPU reranking
- Per-user personalization
- Complex model registry platform
- Full RL/reward-model loop from implicit metrics
- Architecture rewrite of renderer/DSP pipeline
10. Risk Register¶
| Risk | Mitigation |
|---|---|
| Too few human labels early on | Heuristic weak labels bootstrap (weighted 0.2); CatBoost handles mixed-quality labels; start with 20-30 features |
| CatBoost ceiling too low | Feature engineering focus; cross-encoder (Family 2) ready as upgrade path |
| Candidate generation misses good plans | 4 arrangement families for MVP; monitor fallback-beats-model rate; expand families based on error analysis |
| Genre bias in training data | Per-genre holdout monitoring; stratified sampling; genre-conditional LUFS targets |
| Latency budget exceeded | Tier 1 features only in Phase 1; feature caching; 400ms hard timeout |
| Reward hacking (loud = "better") | Loudness-matched labeling; correlation checks; rubric diversity |
| Stem quality variation | Stem quality features + interaction terms; hard constraint against solo vocal with low-quality stems |
| Label noise from rater disagreement | Kappa >= 0.4 threshold; confidence-weighted loss; calibration examples |
| Flag interaction invalidates model | Full flag logging; re-eval on flag change; retrain trigger |
Appendix A: R1 Review Changes¶
Changes made from Round 1 expert review (ML Scientist, ML Engineer, Sound Engineer, Mixing Master):
| Issue | Fix Applied |
|---|---|
| C1: Features mislabeled as metadata-only | 3-tier feature system (metadata/proxy/render-required) with phase gates |
| C2: No low-frequency features | Added Group 8 (Spectral Balance) with mud, bass collision, sub-bass |
| C3: Label mixing strategy undefined | Section 3.5 defines source weighting, residual learning, request-level splits |
| C4: Stem quality not modeled | Group 8 includes stem quality score; hard constraint added for solo vocal gate |
| C5: Only 2 arrangement families | Expanded to 4 (added DJ Lift, Quick Hit) |
| C6: Groove features too thin | Group 6 expanded to 15-20 features covering pocket, density, accent patterns |
| M1-M12 | Train/test splits, rater agreement, loudness normalization, position bias, logging schema, timeout/circuit breaker, model loading, flag interactions, calibration, feature count discipline, transition coverage, genre LUFS |
| M13-M23 | Feature versioning, holdout size, retrain cadence, correct metrics, reward hacking, scoring simplification, prompt fit, rubric weights, reproducibility, contrast strength, outro quality |