Key Convergence Algorithm — Implementation Plan¶
Context¶
Right now, when two songs are in different keys, the remix sounds dissonant. The current pipeline detects keys (essentia/librosa) and the LLM sets key_source in the RemixPlan, but no pitch shifting is actually applied — rubberband_process() always receives semitones=0.
We need an automatic key convergence algorithm that:

1. Always shifts when keys differ (even 1 semitone — minor 2nd is the most dissonant interval)
2. Resolves major/minor mismatches via relative key conversion
3. Favors shifting instrumentals over vocals to protect audio quality
4. Handles rap/spoken vocals — LLM classifies vocal type, disables vocal shifting when vocals are unpitched
5. Warns the user when the shift is large (distance 6), bails at 7+
This replaces LLM control of key matching — the algorithm runs automatically.
Design Decisions (from discussion)¶
| Decision | Choice | Rationale |
|---|---|---|
| Convergence method | Instrumental-favoring split (see table below) | Vocals degrade faster than instrumentals under pitch shift |
| LLM role | None — fully automatic | Algorithm handles both whether and how |
| Mode mismatch | Convert via relative key, try both directions, pick smallest total shift | Major + minor on same root sounds bad (disagreement on 3rd/6th/7th) |
| Skip threshold | None — always shift, even at 1 semitone | Minor 2nd is the most dissonant interval; R3 at ±1 is transparent |
| Safety limit | 4 semitones max per song (instrumentals), 2 max (vocals) | Expert-validated limits for Rubber Band R3 |
| Distance 6 | Warning via SSE → frontend confirmation → proceed or cancel | Pushing limits, user should know |
| Distance 7+ | Incompatible — do not attempt | Both sides would exceed safe limits |
| Same key (distance 0) | Skip — no processing needed | Nothing to fix |
| Rap/spoken vocals | LLM classifies vocal_type from filename + prompt | When rap/spoken, vocals are exempt from pitch shifting — caps max distance at 4 |
Rap/Spoken Vocal Handling¶
Problem: Rap vocals are largely unpitched rhythmic speech. Pitch-shifting them degrades timbre for zero harmonic benefit. Additionally, the detected "key" of a rap song is the key of the beat — if we're discarding the beat and using only the vocals, that key is irrelevant.
Solution: The LLM interpreter classifies Song A's vocal type as part of its existing prompt interpretation step. New field in the LLM tool schema:
"vocal_type": {
    "type": "string",
    "enum": ["sung", "rap"],
    "description": "Whether Song A's vocals are melodic/sung or rap/spoken word. Only flag 'rap' if the vocals are ENTIRELY rapped or spoken with NO melodic singing. If the artist sings at all (hooks, choruses, melodic sections), use 'sung' — the sung portions need key matching and the rapped portions tolerate the shift fine."
}
When vocal_type == "rap":
- Vocal stems are never pitch-shifted (shift_a is always 0)
- Max compatible distance drops from 6 → 4 (instrumentals absorb everything)
- Distance 5+ becomes "incompatible" instead of "warning"
- The algorithm otherwise works identically — just with the vocal column zeroed out
Why LLM classification works here:

- The LLM already runs during the interpreter step and sees the filenames + user prompt
- "The Notorious B.I.G. - Hypnotize.mp3" + "put Biggie's vocals over..." → the LLM knows this is rap
- No additional compute, no external API calls, no new dependencies
- For well-known artists (the common case), classification is near-certain from the filename alone
- No existing DJ tool or mashup software does this — we'd be ahead of the field
Why the classification rule is "ONLY flag rap if entirely spoken":
- Artists who both rap and sing (Drake, Post Malone, Travis Scott) should be classified as "sung"
- If an artist sings hooks/choruses, those pitched sections genuinely benefit from key matching
- The rapped sections tolerate a ±1-2 semitone shift fine — slight timbre change, not catastrophic
- False negative (classifying rap as sung) → unnecessary but harmless vocal shift
- False positive (classifying sung as rap) → sung vocals don't get shifted, potential dissonance on melodic sections — this is the worse mistake, so we bias toward "sung"
Graceful failure modes:
- LLM unsure → defaults to "sung" (the safe choice — shift everything)
- LLM wrong (classifies sung as rap) → vocals don't shift, worst case is some dissonance at distances 1-4
- LLM wrong (classifies rap as sung) → vocals shift ±1-2 semitones, slight timbre change, no real harm
Why only Song A: Song A is the vocal source. Song B provides instrumentals — its vocal type doesn't matter since we're not using its vocals.
Backing vocals edge case: BS-RoFormer puts backing vocals in the "other" stem, not the vocal stem. For a pure rap track with no melodic content, there are rarely pitched backing vocals either. For hybrid artists (classified as "sung"), everything shifts normally. The "other" stem from Song A shifts with shift_a (same as vocals), keeping backing vocals aligned with lead vocals. The "other" stem from Song B shifts with shift_b (same as instrumentals).
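The stem grouping above reduces to a small lookup. A minimal illustrative sketch (the helper name `stem_shift` is hypothetical; stem names follow the BS-RoFormer outputs used in this plan, and `shift_a`/`shift_b` are the signed shifts from the KeyPlan):

```python
# Illustrative sketch: which signed shift each stem receives.
# Drums are always exempt; Song A's "other" stem (backing vocals) moves with
# the vocals, Song B's stems move with the instrumentals.
def stem_shift(song: str, stem: str, shift_a: float, shift_b: float) -> float:
    """Return the semitone shift applied to one stem."""
    if stem == "drums":
        return 0.0  # unpitched — shifting would only smear transients
    if song == "A":
        return shift_a  # vocals + Song A's "other" stay aligned
    return shift_b      # instrumentals + Song B's "other"

# Example: distance 5, sung vocals → vocals shift 1, instrumentals shift -4
print(stem_shift("A", "vocals", 1, -4))  # 1
print(stem_shift("A", "other", 1, -4))   # 1
print(stem_shift("B", "drums", 1, -4))   # 0.0
```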
Future enhancement: CREPE pitch confidence on the isolated vocal stem could auto-detect pitchedness (mean confidence ~0.5-0.6 threshold) and validate the LLM's classification. No custom model training needed.
Shift Allocation Table¶
Two modes based on the rap toggle. Instrumentals absorb shift first (up to 4), sung vocals take what's left (up to 2). Rap vocals never shift.
Default mode (sung vocals):
| Distance | Instrumentals | Vocals | Notes |
|---|---|---|---|
| 0 | 0 | 0 | Same key — skip |
| 1 | 1 | 0 | Minor 2nd — most dissonant, always fix |
| 2 | 2 | 0 | Clean |
| 3 | 3 | 0 | Instrumental preferred max |
| 4 | 4 | 0 | Instrumentals absorb all |
| 5 | 4 | 1 | Push instrumentals to 4 |
| 6 | 4 | 2 | Warning first — vocals at max |
| 7+ | — | — | Incompatible |
Rap mode (rap/spoken vocals toggled on):
| Distance | Instrumentals | Vocals | Notes |
|---|---|---|---|
| 0 | 0 | 0 | Same key — skip |
| 1 | 1 | 0 | Clean |
| 2 | 2 | 0 | Clean |
| 3 | 3 | 0 | Clean |
| 4 | 4 | 0 | Instrumentals at max |
| 5+ | — | — | Incompatible |
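Both tables collapse into one small allocator. A sketch under the names used in this plan (the `allocate` helper itself is hypothetical; the real module is specified in Step 1):

```python
# Sketch of the shift allocation tables above.
# Returns (instrumental_shift, vocal_shift, action) with unsigned magnitudes;
# direction is resolved separately from the chromatic circle.
SUNG_TABLE = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0), 5: (4, 1), 6: (4, 2)}
RAP_TABLE = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0)}

def allocate(distance: int, rap_vocals: bool = False):
    table = RAP_TABLE if rap_vocals else SUNG_TABLE
    if distance not in table:
        return 0, 0, "incompatible"   # 7+ sung, 5+ rap
    inst, vocal = table[distance]
    if distance == 0:
        return 0, 0, "skip"           # same key — nothing to do
    action = "warning" if (not rap_vocals and distance == 6) else "shift"
    return inst, vocal, action

print(allocate(5))                    # (4, 1, 'shift')
print(allocate(6))                    # (4, 2, 'warning')
print(allocate(5, rap_vocals=True))   # (0, 0, 'incompatible')
```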
Why these limits:
- Vocals ±2 max: Formant preservation (R3 + --formant) is good but not perfect. At ±3, sibilants go metallic and sustained notes get watery. An automated system with no human QA should stay at ±2.
- Instrumentals ±4 max: No formant sensitivity, but transients and timbre degrade beyond this. Bass gets muddy at ±4 but is still acceptable.
- Drums exempt from pitch shifting: Largely unpitched — shifting smears transients and creates metallic artifacts on cymbals/hi-hats. Harmonic benefit is negligible.
How Key Conversion Works¶
The Problem¶
When Song A is in D major and Song B is in A minor, they use different note sets. Playing them together produces dissonance on the 3rd, 6th, and 7th scale degrees.
The Solution: Relative Key Conversion¶
Every major key has a relative minor 3 semitones below (same notes):

- C major ↔ A minor (both use C D E F G A B)
- D major ↔ B minor (both use D E F# G A B C#)
- Eb major ↔ C minor (both use Eb F G Ab Bb C D)
Converting via relative key means reframing one song's key in terms of the other mode. This is a conceptual step — no audio processing. It lets us compute the convergence distance with both songs in the same mode.
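The relationship is a fixed ±3 semitone offset on the chromatic circle, so the conversion is one modular addition. A minimal sketch (the `relative_key` helper is hypothetical):

```python
# Relative key conversion — conceptual only, no audio is shifted.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def relative_key(root: str, scale: str) -> tuple:
    """Minor → relative major is +3 semitones; major → relative minor is -3."""
    semi = NOTES.index(root)
    if scale == "minor":
        return NOTES[(semi + 3) % 12], "major"
    return NOTES[(semi - 3) % 12], "minor"

print(relative_key("A", "minor"))   # ('C', 'major')
print(relative_key("D", "major"))   # ('B', 'minor')
print(relative_key("C", "minor"))   # ('D#', 'major')  i.e. Eb major
```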
Two-Path Evaluation (when modes differ)¶
Given Song A in D major and Song B in A minor:
Path 1: Convert minor → relative major

- A minor's relative major = C major
- Now comparing: D major vs C major = 2 semitones apart
- Instrumental shifts 2, vocal shifts 0 → total audio shift: 2

Path 2: Convert major → relative minor

- D major's relative minor = B minor
- Now comparing: B minor vs A minor = 2 semitones apart
- Instrumental shifts 2, vocal shifts 0 → total audio shift: 2
Pick whichever path gives the smallest total audio shift. If tied, either works.
The Math¶
Relative key conversion:

- Minor → relative major: root +3 semitones (conceptually, not audio)
- Major → relative minor: root -3 semitones (conceptually, not audio)
Then convergence using the shift allocation table — instrumentals absorb first, vocals take the remainder.
Key insight: The "conceptual conversion" and the "convergence shift" combine into a single actual audio shift per song. We compute both paths end-to-end and compare the actual audio shifts, not the intermediate steps.
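The D major / A minor example can be checked end to end. A sketch, inlining the chromatic-distance helper specified in Step 1:

```python
# Two-path evaluation for Song A in D major vs Song B in A minor.
def chromatic_distance(a: int, b: int) -> int:
    d = abs(a - b) % 12
    return min(d, 12 - d)

D, A = 2, 9  # semitone indices (C = 0)

# Path 1: convert B (A minor) → relative major (+3): (9+3) % 12 = 0 = C major
path1 = chromatic_distance(D, (A + 3) % 12)   # D major vs C major
# Path 2: convert A (D major) → relative minor (-3): (2-3) % 12 = 11 = B minor
path2 = chromatic_distance((D - 3) % 12, A)   # B minor vs A minor

print(path1, path2)  # 2 2 — tied, either path works
```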
What DJs and Mashup Artists Actually Do¶
From research on professional practice:
- Shift stems individually, never a pre-mixed signal. Isolated stems produce far fewer artifacts. We already have stems — this is our advantage.
- Vocals max ±3 semitones (±2 preferred). Formant distortion makes vocals sound unnatural beyond this.
- Instrumentals tolerate ±5 semitones (±3 preferred). Drums are unpitched, bass shifts well, guitars/synths lack formant sensitivity.
- Formant preservation is non-negotiable for vocals above ±1 semitone. We already use the --formant flag.
- "Meet in the middle" is real but rare in software. No mainstream DJ tool does it automatically. The 2021 AAAI paper on mashup generation treats it as constrained optimization. We'd be ahead of existing tools.
- DJs shift the instrumental, not the vocals. Our table follows this same principle.
Algorithm (Pseudocode)¶
function compute_key_plan(key_a, scale_a, conf_a, mod_a, key_b, scale_b, conf_b, mod_b, rap_vocals=False):
    # Gate: skip if key data missing or confidence too low
    if key_a is None or scale_a is None or conf_a is None:
        return KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="",
                       reason="missing key data for song A", distance=0)
    if key_b is None or scale_b is None or conf_b is None:
        return KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="",
                       reason="missing key data for song B", distance=0)
    if conf_a < 0.40 or conf_b < 0.40:
        return KeyPlan(action="skip", reason="low confidence")
    if mod_a or mod_b:
        return KeyPlan(action="skip", reason="modulation detected")

    semi_a = note_to_semitone(key_a)
    semi_b = note_to_semitone(key_b)

    # Same mode — straightforward convergence
    if scale_a == scale_b:
        distance = chromatic_distance(semi_a, semi_b)
        if distance == 0:
            return KeyPlan(action="skip", reason="same key")
        return build_shift_plan(semi_a, semi_b, distance, rap_vocals=rap_vocals)

    # Different mode — build both conversion paths, then pick the cheapest.
    # The conversion is conceptual (+3/-3 on the chromatic circle, no audio
    # processing); the converted semitones feed the same-mode allocator.
    if scale_a == "minor":
        # Path 1: convert A (minor) → relative major (+3)
        path1_a, path1_b = (semi_a + 3) % 12, semi_b
        # Path 2: convert B (major) → relative minor (-3); A stays minor
        path2_a, path2_b = semi_a, (semi_b - 3) % 12
    else:
        # Path 1: convert B (minor) → relative major (+3)
        path1_a, path1_b = semi_a, (semi_b + 3) % 12
        # Path 2: convert A (major) → relative minor (-3); B stays minor
        path2_a, path2_b = (semi_a - 3) % 12, semi_b

    return build_shift_plan_from_best_path(path1_a, path1_b, path2_a, path2_b, rap_vocals=rap_vocals)

function build_shift_plan(semi_a, semi_b, distance, rap_vocals=False):
    """Apply the shift allocation table."""
    # Rap mode: vocals never shift, max distance is 4
    if rap_vocals:
        if distance > 4:
            return KeyPlan(action="incompatible", reason="distance exceeds instrumental-only limit (rap vocals)")
        SHIFT_TABLE = {1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0)}
        inst_shift, vocal_shift = SHIFT_TABLE[distance]  # vocal_shift is always 0
        ...compute signed shifts based on shortest path direction...
        return KeyPlan(action="shift", shift_a=0, shift_b=inst_signed)

    # Sung mode: full table with vocal overflow
    if distance > 6:
        return KeyPlan(action="incompatible")
    SHIFT_TABLE = {
        1: (1, 0), 2: (2, 0), 3: (3, 0),
        4: (4, 0), 5: (4, 1), 6: (4, 2),
    }
    inst_shift, vocal_shift = SHIFT_TABLE[distance]
    if distance == 6:
        # Populate shift values even for warning — they're applied after user confirms
        ...compute signed shifts based on shortest path direction...
        return KeyPlan(action="warning", shift_a=vocal_signed, shift_b=inst_signed, distance=distance)

    # Determine direction (which way to shift each song)
    # Instrumentals = Song B, Vocals = Song A (fixed convention)
    # Shift toward each other on the chromatic circle
    ...compute signed shifts based on shortest path direction...
    return KeyPlan(action="shift", shift_a=vocal_signed, shift_b=inst_signed)

function build_shift_plan_from_best_path(path1_a, path1_b, path2_a, path2_b, rap_vocals=False):
    """Build a shift plan when modes differ, using the winning conversion path.

    Both candidate paths arrive with the conceptual mode conversion (+3/-3)
    already applied, so this just picks the path with the smallest
    post-conversion distance and delegates to build_shift_plan(), which
    already handles same-mode distance computation and shift allocation.
    """
    path1_dist = chromatic_distance(path1_a, path1_b)
    path2_dist = chromatic_distance(path2_a, path2_b)

    # 1. Pick the winning path (ties: either works)
    if path1_dist <= path2_dist:
        effective_a, effective_b, best_distance = path1_a, path1_b, path1_dist
    else:
        effective_a, effective_b, best_distance = path2_a, path2_b, path2_dist

    # 2. Relative keys (e.g. A minor + C major) converge to distance 0 — skip
    if best_distance == 0:
        return KeyPlan(action="skip", reason="relative keys — already compatible")

    # 3. Delegate — both songs are now in the same mode space, so the standard
    #    same-mode allocation logic applies directly
    return build_shift_plan(effective_a, effective_b, best_distance, rap_vocals=rap_vocals)
Implementation Steps¶
Step 1: New module services/key_matching.py¶
Pure functions, no side effects. Contains:
- note_to_semitone(key: str) -> int — Reuse the _NOTE_TO_SEMITONE mapping from taste_constraints.py, adding missing enharmonics: "E#": 5 and "B#": 0
- chromatic_distance(semi_a: int, semi_b: int) -> int — Min of clockwise/counterclockwise distance
- signed_shift(from_semi: int, to_semi: int) -> int — Shortest signed shift (-6 to +6)
- compute_key_plan(key_a, scale_a, conf_a, mod_a, key_b, scale_b, conf_b, mod_b, rap_vocals=False) -> KeyPlan — The main algorithm. rap_vocals comes from the LLM's vocal_type classification
- KeyPlan dataclass:

    @dataclass
    class KeyPlan:
        action: str        # "skip" | "shift" | "warning" | "incompatible"
        shift_a: float     # Semitones to shift Song A (vocals) audio (signed)
        shift_b: float     # Semitones to shift Song B (instrumentals) audio (signed)
        target_key: str    # The target key both songs converge toward
        target_scale: str  # "major" or "minor"
        reason: str        # Human-readable explanation
        distance: int      # Original chromatic distance (after mode conversion)
action values:
- "skip" — no shift needed (same key, low confidence, modulation)
- "shift" — shift both songs per the allocation table (distance 1-5)
- "warning" — distance 6, needs user confirmation before shifting
- "incompatible" — distance 7+, cannot match
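The three semitone utilities are small enough to sketch in full. One assumption: the note table is inlined here for self-containment, where the real module would reuse _NOTE_TO_SEMITONE from taste_constraints.py:

```python
# Sketch of the pure helpers planned for services/key_matching.py.
_NOTE_TO_SEMITONE = {
    "C": 0, "B#": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3,
    "E": 4, "Fb": 4, "E#": 5, "F": 5, "F#": 6, "Gb": 6, "G": 7,
    "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11, "Cb": 11,
}

def note_to_semitone(key: str) -> int:
    return _NOTE_TO_SEMITONE[key]

def chromatic_distance(semi_a: int, semi_b: int) -> int:
    """Min of clockwise/counterclockwise distance on the chromatic circle."""
    d = abs(semi_a - semi_b) % 12
    return min(d, 12 - d)

def signed_shift(from_semi: int, to_semi: int) -> int:
    """Shortest signed shift between two pitch classes, in [-6, +6]."""
    d = (to_semi - from_semi) % 12
    return d if d <= 6 else d - 12

print(chromatic_distance(note_to_semitone("C#"), note_to_semitone("Db")))  # 0 — enharmonics
print(signed_shift(note_to_semitone("B"), note_to_semitone("C")))          # 1 — wraps up
print(signed_shift(note_to_semitone("C"), note_to_semitone("B")))          # -1 — wraps down
```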
Step 2: Integrate into pipeline (services/pipeline.py)¶
After analysis and before tempo planning. The vocal_type comes from the LLM interpreter's output (Step 5):
# vocal_type is set by the LLM interpreter (default "sung" if not provided)
rap_vocals = intent_plan.vocal_type == "rap"

key_plan = compute_key_plan(
    meta_a.key, meta_a.scale, meta_a.key_confidence, meta_a.has_modulation,
    meta_b.key, meta_b.scale, meta_b.key_confidence, meta_b.has_modulation,
    rap_vocals=rap_vocals,
)

if key_plan.action == "warning":
    # Distance 6 — warn user, get confirmation
    emit_progress(event_queue, {
        "step": "key_warning",
        "detail": f"Large key difference ({meta_a.key} {meta_a.scale} vs {meta_b.key} {meta_b.scale}). "
                  "The remix might not sound great. Continue anyway?",
        "progress": current_progress,
        "requires_confirmation": True,
    }, session=session)
    # Block until user responds (see Step 4)
elif key_plan.action == "incompatible":
    emit_progress(event_queue, {
        "step": "key_warning",
        "detail": f"These songs are too far apart in key to match "
                  f"({meta_a.key} {meta_a.scale} vs {meta_b.key} {meta_b.scale}). "
                  "Continue without key matching?",
        "progress": current_progress,
        "requires_confirmation": True,
    }, session=session)
Step 3: Pass semitones to rubberband in Step 9¶
Currently rubberband_process() is called with semitones=0. Change to:
Important: The existing pipeline gates the rubberband executor with conditions like if stretch_vocals: (only true when tempo adjustment is needed). This must be expanded to also trigger on key shifts, otherwise key-only shifts are silently skipped when tempos already match. Change the gating conditions to:
- Vocals: if stretch_vocals or key_plan.shift_a != 0:
- Instrumentals: if stretch_instrumentals or key_plan.shift_b != 0:
# Song A = vocals, Song B = instrumentals (fixed convention)
vocal_semitones = key_plan.shift_a if key_plan.action in ("shift", "warning") else 0
inst_semitones = key_plan.shift_b if key_plan.action in ("shift", "warning") else 0

# In the rubberband ThreadPoolExecutor:
for stem_name in vocal_stems:
    is_vocal = (stem_name == "vocals")
    is_drums = (stem_name == "drums")
    # Drums exempt from pitch shifting
    semitones = 0 if is_drums else vocal_semitones
    futures[("vocal", stem_name)] = rb_executor.submit(
        rubberband_process, vocal_audio[stem_name], sr,
        vocal_meta.bpm, target_bpm,
        semitones=semitones,
        is_vocal=is_vocal,
    )

for stem_name in inst_stems:
    is_drums = (stem_name == "drums")
    semitones = 0 if is_drums else inst_semitones
    futures[("inst", stem_name)] = rb_executor.submit(
        rubberband_process, inst_audio[stem_name], sr,
        inst_meta.bpm, target_bpm,
        semitones=semitones,
        is_vocal=False,
    )
Step 4: SSE confirmation flow for key warnings¶
The pipeline currently has no way to pause and wait for user input mid-execution.
Approach: Add a new SSE event type key_warning with requires_confirmation: True. Frontend shows a dialog. User responds via new API endpoint:
- POST /api/remix/{session_id}/confirm-key with body {"proceed": true/false}
- Pipeline blocks on a threading.Event until confirmation arrives
- If proceed: true and action was "warning" → apply the shifts anyway
- If proceed: true and action was "incompatible" → continue without key matching
- If proceed: false → cancel the remix
- Timeout after 120 seconds → cancel
New API endpoint in api/remix.py:
@router.post("/api/remix/{session_id}/confirm-key")
async def confirm_key_match(session_id: str, body: KeyConfirmation, request: Request):
    session = request.app.state.sessions.get(session_id)
    if session is None:
        raise HTTPException(404, "Session not found")
    session.key_confirmed = body.proceed
    session.key_confirmation_event.set()
Pipeline waits:
if key_plan.action in ("warning", "incompatible"):
    emit_progress(...)
    # key_confirmation_event is initialized in SessionState field defaults — no lazy creation needed
    confirmed = session.key_confirmation_event.wait(timeout=120)
    if not confirmed or not session.key_confirmed:
        emit_progress(event_queue, {"step": "cancelled", ...})
        return
    if key_plan.action == "incompatible":
        key_plan = KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="",
                           reason="user accepted incompatible keys", distance=key_plan.distance)
    elif key_plan.action == "warning":
        key_plan.action = "shift"  # User accepted — promote to normal shift so downstream code processes it
Step 5: Replace key_source with vocal_type in interpreter.py¶
Remove key_source:
- Remove key_source from REMIX_PLAN_TOOL schema (line 151-155) — and from "required" list (line 68)
- Remove _compute_key_guidance() (lines 663-695)
- Remove key_matching_available and key_matching_detail params from _build_system_prompt_blocks() (line 179-180)
- Remove key matching section from Section 8 (lines 342-346) — keep tempo matching section
- Remove key_source from _parse_intent_plan() (line 927)
- Update few-shot examples — remove "key_source": "none" from all three examples (lines 785, 840, ~884)
- Update interpret_prompt() — remove _key_available, key_matching_detail = _compute_key_guidance(...) call (lines 1191-1192) and corresponding args to _build_system_prompt_blocks() (lines 1205-1212)
- Update generate_fallback_plan() — remove key_source="none" (line 1435)
Add vocal_type:
- Add vocal_type to REMIX_PLAN_TOOL schema:
"vocal_type": {
    "type": "string",
    "enum": ["sung", "rap"],
    "description": "Whether Song A's vocals are melodic/sung or rap/spoken word. Only flag 'rap' if the vocals are ENTIRELY rapped or spoken with NO melodic singing. If the artist sings at all (hooks, choruses, melodic sections), use 'sung' — the sung portions need key matching and the rapped portions tolerate the shift fine."
}
- Add vocal_type to _parse_intent_plan() — extract from tool call, default to "sung"
- Add to few-shot examples — include "vocal_type": "sung" or "vocal_type": "rap" as appropriate
- Add brief guidance to system prompt — one sentence in Section 8: "Classify Song A's vocal_type as 'rap' only if the vocals are entirely rapped/spoken with no melodic singing."
- Update generate_fallback_plan() — add vocal_type="sung" (safe default)
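The default-to-"sung" extraction reduces to one `dict.get` plus enum validation. A sketch — `parse_vocal_type` and the `tool_args` dict are hypothetical stand-ins for the parsing inside `_parse_intent_plan()`:

```python
# Sketch: extracting vocal_type from the LLM tool call with a safe default.
def parse_vocal_type(tool_args: dict) -> str:
    value = tool_args.get("vocal_type", "sung")
    # Anything missing or outside the enum falls back to the safe default
    return value if value in ("sung", "rap") else "sung"

print(parse_vocal_type({"vocal_type": "rap"}))      # rap
print(parse_vocal_type({}))                         # sung
print(parse_vocal_type({"vocal_type": "spoken"}))   # sung — unexpected value
```

Biasing the fallback toward "sung" matches the failure-mode analysis above: an unnecessary vocal shift is harmless, while skipping a needed one risks dissonance.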
Step 6: Update related code¶
- models.py: Remove key_source from the RemixPlan and IntentPlan dataclasses. Add vocal_type: str = "sung" to IntentPlan (values: "sung" or "rap", default "sung"). Delete the stale first IntentSection/IntentPlan definitions (around lines 188-209) and remove key_source from the surviving definitions. Add key_confirmation_event and key_confirmed as SessionState field defaults (initialized at construction, not lazily in the pipeline):

      key_confirmation_event: threading.Event = field(default_factory=threading.Event)
      key_confirmed: Optional[bool] = None

  Also add the KeyConfirmation Pydantic model for the confirmation endpoint:

      class KeyConfirmation(BaseModel):
          proceed: bool

- taste_constraints.py: Remove check_pitch_shift_semitones() — replaced by key_matching.py logic. Remove _key_semitone_distance() (moved to the new module).
- taste_features.py: KEEP _camelot_distance() and _estimate_pitch_shift_semitones() — they serve taste scoring (candidate ranking), not audio shifting, and use _CAMELOT_WHEEL, not _NOTE_TO_SEMITONE. However, remove all plan.key_source reads and key_source= constructor args — these reference the deleted field.
Step 7: Update progress messages¶
Add key-matching status to SSE stream:
- "Analyzing keys..." (during analysis)
- "Keys matched: shifting instrumentals by +3, vocals by -1" (when shifting)
- "Keys already compatible — no shift needed" (when skipping, distance 0)
- "Large key difference — the remix might not sound great. Continue anyway?" (warning, distance 6)
- "These songs are too far apart in key to match. Continue without key matching?" (incompatible, 7+)
Step 8: Frontend Changes¶
- Types: Add 'key_warning' to the ProgressStep union type
- ProgressEvent type: Add requires_confirmation?: boolean and key_info?: { detected_keys: string, distance: number, recommendation: string } fields
- useRemixProgress.ts: Handle the key_warning step — pause progress display and show the confirmation dialog
- API client: Add a POST /api/remix/{session_id}/confirm-key call ({ proceed: boolean } body)
- New UI component: Key confirmation dialog showing detected keys, distance, and recommendation text. Two buttons: "Continue anyway" (proceed: true) and "Cancel" (proceed: false)
Files to Modify¶
| File | Change |
|---|---|
| NEW: services/key_matching.py | Core algorithm — compute_key_plan(), KeyPlan, shift allocation table, semitone utilities |
| services/pipeline.py | Integrate key plan after analysis, pass semitones to rubberband, drums exemption, SSE events, confirmation blocking |
| services/interpreter.py | Remove key_source, add vocal_type to tool schema. Remove _compute_key_guidance(), remove key matching from system prompt Section 8 (add one-sentence vocal_type guidance). Update _parse_intent_plan() to extract vocal_type (default "sung"). Update few-shot examples. Update interpret_prompt() and generate_fallback_plan() |
| models.py | Remove key_source from RemixPlan/IntentPlan, add key_confirmation_event/key_confirmed to SessionState field defaults, add KeyConfirmation(BaseModel) |
| api/remix.py | Add POST /api/remix/{session_id}/confirm-key endpoint |
| services/taste_constraints.py | Remove check_pitch_shift_semitones() and _key_semitone_distance(). Delete check_pitch_shift_safety() and its callsite in run_all_constraints() — these validated the now-removed key_source field |
| services/taste_features.py | Keep _camelot_distance() and _estimate_pitch_shift_semitones() (taste scoring's own concern — uses _CAMELOT_WHEEL, not _NOTE_TO_SEMITONE). Remove all plan.key_source reads and key_source= constructor args |
| services/candidate_planner.py | Remove key_source='none' from the RemixPlan(...) constructor call |
| services/gain_mapper.py | Remove key_source=intent.key_source (or similar) from RemixPlan(...) constructor call(s). The field no longer exists |
| services/taste_model.py | Remove all plan.key_source != "none" branches and key_source-dependent scoring logic. Replace with a key_plan.action == "shift" signal where taste scoring needs to know if key convergence is active |
| tests/ | Update all test files referencing key_source: remove from plan/intent constructors, delete check_pitch_shift_safety tests, update taste scoring test assertions. Files affected: test_interpreter.py, test_taste_constraints.py, test_taste_features.py, test_gain_mapper.py, test_taste_model.py, test_taste_stage.py, test_candidate_planner.py |
Verification¶
- Unit tests for key_matching.py:
  - Same key (distance 0) → skip
  - Distance 1 → instrumental shifts 1, vocal shifts 0
  - Distance 3 → instrumental shifts 3, vocal shifts 0
  - Distance 4 → instrumental shifts 4, vocal shifts 0
  - Distance 5 → instrumental shifts 4, vocal shifts 1
  - Distance 6 → warning action, instrumental 4, vocal 2
  - Distance 7+ → incompatible
  - Different mode: try both conversion paths, pick the smaller
  - Low confidence → skip
  - Modulation detected → skip
  - Enharmonic equivalents (C# vs Db) → same semitone value
  - Rap mode: distance 1-4 → instrumental shifts only, vocal always 0
  - Rap mode: distance 5+ → incompatible (not warning)
  - Rap mode: distance 0 → skip (same as sung)
- Integration test:
  - Upload two songs with known different keys
  - Verify the SSE stream includes key matching status
  - Verify rubberband receives the correct semitone values
  - Verify drums are NOT pitch-shifted
  - Verify output audio is pitch-shifted
- Warning/incompatible flow:
  - Distance 6: verify key_warning SSE event with requires_confirmation
  - User accepts → remix proceeds with shifts applied
  - User declines → remix cancelled
  - Distance 7+: verify the incompatible message
  - User accepts → remix proceeds WITHOUT key matching
  - User declines → remix cancelled
  - Timeout (120s) → remix cancelled
Edge Cases¶
- One or both keys undetected (confidence < 0.40): Skip key matching silently
- One or both songs modulate: Skip key matching silently
- Enharmonic equivalents (C# vs Db): Handled by _NOTE_TO_SEMITONE mapping both to the same value
- Both songs in same key: Skip, no processing (distance 0)
- Relative keys (A minor + C major): After mode conversion, distance is 0 → skip
- Drums stem: Exempt from pitch shifting in all cases
- Rap vocals (LLM classified): Vocal stems never shifted, max distance 4, distance 5+ incompatible
- Hybrid artists (rap + singing): LLM classifies as "sung" (safe default), all shifts apply normally
- LLM fails to classify: Defaults to "sung" — the safe choice (shifts everything, no dissonance risk)
- Backing vocals in "other" stem: Shift with the same group they belong to (Song A's "other" shifts with vocals, Song B's "other" shifts with instrumentals)
Follow-up Items (not in this implementation)¶
- Parallel keys special case: C major + C minor share same root and 5/7 notes. Might be okay without full conversion. Needs listening tests.
- Formant preservation on "other" stem: BS-RoFormer puts backing vocals in "other" — shifting without --formant may chipmunk them. Consider adding --formant to "other" stem processing.
- Bass sensitivity: Pitch expert flagged bass getting muddy at ±3-4. May need a lower cap for bass specifically.
- Post-shift EQ: Pitch shifting can boost/cut certain frequencies. A gentle EQ pass after shifting could restore balance.
- CREPE validation of LLM classification: Run CREPE pitch confidence on the isolated vocal stem (mean confidence ~0.5-0.6 threshold) to validate the LLM's vocal_type classification. Could log warnings when LLM and CREPE disagree.
- Key detection on stems instead of full mix: Run essentia/librosa on the separated instrumental stems rather than the full mix. More accurate for the harmonic content actually present in the remix, especially when discarding a rap song's beat.