Key Convergence Algorithm — Implementation Plan

Context

Right now, when two songs are in different keys, the remix sounds dissonant. The current pipeline detects keys (essentia/librosa) and the LLM sets key_source in the RemixPlan, but no pitch shifting is actually applied: rubberband_process() always receives semitones=0.

We need an automatic key convergence algorithm that:

1. Always shifts when keys differ (even 1 semitone — minor 2nd is the most dissonant interval)
2. Resolves major/minor mismatches via relative key conversion
3. Favors shifting instrumentals over vocals to protect audio quality
4. Handles rap/spoken vocals — the LLM classifies vocal type, and vocal shifting is disabled when vocals are unpitched
5. Warns the user when the shift is large (distance 6), bails at 7+

This replaces LLM control of key matching — the algorithm runs automatically.


Design Decisions (from discussion)

| Decision | Choice | Rationale |
| --- | --- | --- |
| Convergence method | Instrumental-favoring split (see table below) | Vocals degrade faster than instrumentals under pitch shift |
| LLM role | None — fully automatic | Algorithm handles both whether and how |
| Mode mismatch | Convert via relative key, try both directions, pick smallest total shift | Major + minor on same root sounds bad (disagreement on 3rd/6th/7th) |
| Skip threshold | None — always shift, even at 1 semitone | Minor 2nd is the most dissonant interval; R3 at ±1 is transparent |
| Safety limit | 4 semitones max per song (instrumentals), 2 max (vocals) | Expert-validated limits for Rubber Band R3 |
| Distance 6 | Warning via SSE → frontend confirmation → proceed or cancel | Pushing limits, user should know |
| Distance 7+ | Incompatible — do not attempt | Both sides would exceed safe limits |
| Same key (distance 0) | Skip — no processing needed | Nothing to fix |
| Rap/spoken vocals | LLM classifies vocal_type from filename + prompt | Rap/spoken vocals are exempt from pitch shifting, which caps max distance at 4 |

Rap/Spoken Vocal Handling

Problem: Rap vocals are largely unpitched rhythmic speech. Pitch-shifting them degrades timbre for zero harmonic benefit. Additionally, the detected "key" of a rap song is the key of the beat — if we're discarding the beat and using only the vocals, that key is irrelevant.

Solution: The LLM interpreter classifies Song A's vocal type as part of its existing prompt interpretation step. New field in the LLM tool schema:

"vocal_type": {
    "type": "string",
    "enum": ["sung", "rap"],
    "description": "Whether Song A's vocals are melodic/sung or rap/spoken word. Only flag 'rap' if the vocals are ENTIRELY rapped or spoken with NO melodic singing. If the artist sings at all (hooks, choruses, melodic sections), use 'sung' — the sung portions need key matching and the rapped portions tolerate the shift fine."
}

When vocal_type == "rap":

- Vocal stems are never pitch-shifted (shift_a is always 0)
- Max compatible distance drops from 6 → 4 (instrumentals absorb everything)
- Distance 5+ becomes "incompatible" instead of "warning"
- The algorithm otherwise works identically — just with the vocal column zeroed out

Why LLM classification works here:

- The LLM already runs during the interpreter step and sees the filenames + user prompt
- "The Notorious B.I.G. - Hypnotize.mp3" + "put Biggie's vocals over..." → the LLM knows this is rap
- No additional compute, no external API calls, no new dependencies
- For well-known artists (the common case), classification is near-certain from the filename alone
- No existing DJ tool or mashup software does this — we'd be ahead of the field

Why the classification rule is "ONLY flag rap if entirely spoken":

- Artists who both rap and sing (Drake, Post Malone, Travis Scott) should be classified as "sung"
- If an artist sings hooks/choruses, those pitched sections genuinely benefit from key matching
- The rapped sections tolerate a ±1-2 semitone shift fine — slight timbre change, not catastrophic
- False negative (classifying rap as sung) → unnecessary but harmless vocal shift
- False positive (classifying sung as rap) → sung vocals don't get shifted, potential dissonance on melodic sections — this is the worse mistake, so we bias toward "sung"

Graceful failure modes:

- LLM unsure → defaults to "sung" (the safe choice — shift everything)
- LLM wrong (classifies sung as rap) → vocals don't shift, worst case is some dissonance at distances 1-4
- LLM wrong (classifies rap as sung) → vocals shift ±1-2 semitones, slight timbre change, no real harm

Why only Song A: Song A is the vocal source. Song B provides instrumentals — its vocal type doesn't matter since we're not using its vocals.

Backing vocals edge case: BS-RoFormer puts backing vocals in the "other" stem, not the vocal stem. For a pure rap track with no melodic content, there are rarely pitched backing vocals either. For hybrid artists (classified as "sung"), everything shifts normally. The "other" stem from Song A shifts with shift_a (same as vocals), keeping backing vocals aligned with lead vocals. The "other" stem from Song B shifts with shift_b (same as instrumentals).
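The stem-routing rules above reduce to a small lookup. A sketch — the source/stem labels and function name are illustrative assumptions, not final pipeline API:

```python
def stem_semitones(source: str, stem: str, shift_a: int, shift_b: int) -> int:
    """Return the pitch shift to apply to one stem.

    Song A stems (vocals plus its "other" bed) follow shift_a so backing
    vocals stay aligned with the lead; Song B stems follow shift_b; drums
    are always exempt because shifting them smears transients.
    """
    if stem == "drums":
        return 0
    return shift_a if source == "a" else shift_b
```

With shift_a = 1 and shift_b = 4, Song A's "other" stem moves with the vocals (+1) while Song B's "other" stem moves with the instrumentals (+4), matching the rule stated above.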

Future enhancement: CREPE pitch confidence on the isolated vocal stem could auto-detect pitchedness (mean confidence ~0.5-0.6 threshold) and validate the LLM's classification. No custom model training needed.
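If the CREPE check is added later, the decision layer on top of it stays trivial. A sketch of just that layer — the per-frame confidences would come from CREPE's output, and the 0.55 threshold is a placeholder inside the ~0.5-0.6 range mentioned above:

```python
def vocals_look_pitched(confidences: list[float], threshold: float = 0.55) -> bool:
    """Heuristic pitchedness check on per-frame pitch confidences.

    Mean confidence at or above the threshold suggests sung (pitched) vocals;
    an empty analysis is treated as unpitched rather than guessing.
    """
    if not confidences:
        return False
    return sum(confidences) / len(confidences) >= threshold
```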

Shift Allocation Table

Two modes based on the rap toggle. Instrumentals absorb shift first (up to 4), sung vocals take what's left (up to 2). Rap vocals never shift.

Default mode (sung vocals):

| Distance | Instrumentals | Vocals | Notes |
| --- | --- | --- | --- |
| 0 | 0 | 0 | Same key — skip |
| 1 | 1 | 0 | Minor 2nd — most dissonant, always fix |
| 2 | 2 | 0 | Clean |
| 3 | 3 | 0 | Instrumental preferred max |
| 4 | 4 | 0 | Instrumentals absorb all |
| 5 | 4 | 1 | Push instrumentals to 4 |
| 6 | 4 | 2 | Warning first — vocals at max |
| 7+ | — | — | Incompatible |

Rap mode (rap/spoken vocals toggled on):

| Distance | Instrumentals | Vocals | Notes |
| --- | --- | --- | --- |
| 0 | 0 | 0 | Same key — skip |
| 1 | 1 | 0 | Clean |
| 2 | 2 | 0 | Clean |
| 3 | 3 | 0 | Clean |
| 4 | 4 | 0 | Instrumentals at max |
| 5+ | — | — | Incompatible |
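Both tables collapse into one small lookup. A sketch — the function name and the (instrumental, vocal) tuple convention are illustrative, not part of the final module:

```python
# Shift magnitudes per chromatic distance: (instrumental, vocal).
# Instrumentals absorb first (up to 4), sung vocals take the remainder (up to 2).
SHIFT_TABLE = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0), 5: (4, 1), 6: (4, 2)}

def allocate(distance: int, rap_vocals: bool = False):
    """Return (inst_shift, vocal_shift) magnitudes, or None if incompatible."""
    if rap_vocals and distance > 4:
        return None  # rap mode: instrumentals alone cap out at 4
    if distance > 6:
        return None  # distance 7+ is incompatible in either mode
    return SHIFT_TABLE[distance]
```

In rap mode no separate table is needed: every entry at distance ≤ 4 already has a zero vocal column, so only the cutoff changes.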

Why these limits:

- Vocals ±2 max: Formant preservation (R3 + --formant) is good but not perfect. At ±3, sibilants go metallic and sustained notes get watery. An automated system with no human QA should stay at ±2.
- Instrumentals ±4 max: No formant sensitivity, but transients and timbre degrade beyond this. Bass gets muddy at ±4 but is still acceptable.
- Drums exempt from pitch shifting: Largely unpitched — shifting smears transients and creates metallic artifacts on cymbals/hi-hats. Harmonic benefit is negligible.


How Key Conversion Works

The Problem

When Song A is in D major and Song B is in A minor, they use different note sets. Playing them together produces dissonance on the 3rd, 6th, and 7th scale degrees.

The Solution: Relative Key Conversion

Every major key has a relative minor 3 semitones below (same notes):

- C major ↔ A minor (both use C D E F G A B)
- D major ↔ B minor (both use D E F# G A B C#)
- Eb major ↔ C minor (both use Eb F G Ab Bb C D)

Converting via relative key means reframing one song's key in terms of the other mode. This is a conceptual step — no audio processing. It lets us compute the convergence distance with both songs in the same mode.

Two-Path Evaluation (when modes differ)

Given Song A in D major and Song B in A minor:

Path 1: Convert minor → relative major

- A minor's relative major = C major
- Now comparing: D major vs C major = 2 semitones apart
- Instrumental shifts 2, vocal shifts 0 → total audio shift: 2

Path 2: Convert major → relative minor

- D major's relative minor = B minor
- Now comparing: B minor vs A minor = 2 semitones apart
- Instrumental shifts 2, vocal shifts 0 → total audio shift: 2

Pick whichever path gives the smallest total audio shift. If tied, either works.
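The D major / A minor example above, computed end-to-end. The C=0 semitone numbering and the distance helper match the conventions the pseudocode below assumes:

```python
def chromatic_distance(a: int, b: int) -> int:
    """Shortest distance between two pitch classes on the chromatic circle."""
    d = abs(a - b) % 12
    return min(d, 12 - d)

D, A = 2, 9  # D major (Song A), A minor (Song B), with C = 0

# Path 1: A minor -> relative major (+3) = C major (0); compare with D major
path1 = chromatic_distance(D, (A + 3) % 12)   # 2

# Path 2: D major -> relative minor (-3) = B minor (11); compare with A minor
path2 = chromatic_distance((D - 3) % 12, A)   # 2
```

Here both paths cost 2 semitones of actual audio shift — the tie case where either works.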

The Math

Relative key conversion:

- Minor → relative major: root +3 semitones (conceptually, not audio)
- Major → relative minor: root -3 semitones (conceptually, not audio)

Then convergence using the shift allocation table — instrumentals absorb first, vocals take the remainder.

Key insight: The "conceptual conversion" and the "convergence shift" combine into a single actual audio shift per song. We compute both paths end-to-end and compare the actual audio shifts, not the intermediate steps.

What DJs and Mashup Artists Actually Do

From research on professional practice:


Algorithm (Pseudocode)

function compute_key_plan(key_a, scale_a, conf_a, mod_a, key_b, scale_b, conf_b, mod_b, rap_vocals=False):

    # Gate: skip if key data missing or confidence too low
    if key_a is None or scale_a is None or conf_a is None:
        return KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="", reason="missing key data for song A", distance=0)
    if key_b is None or scale_b is None or conf_b is None:
        return KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="", reason="missing key data for song B", distance=0)
    if conf_a < 0.40 or conf_b < 0.40:
        return KeyPlan(action="skip", reason="low confidence")
    if mod_a or mod_b:
        return KeyPlan(action="skip", reason="modulation detected")

    semi_a = note_to_semitone(key_a)
    semi_b = note_to_semitone(key_b)

    # Same mode — straightforward convergence
    if scale_a == scale_b:
        distance = chromatic_distance(semi_a, semi_b)
        if distance == 0:
            return KeyPlan(action="skip", reason="same key")
        return build_shift_plan(semi_a, semi_b, distance, rap_vocals=rap_vocals)

    # Different mode — try both conversion directions, pick cheapest

    # Path 1: Convert minor to relative major (+3)
    # Path 2: Convert major to relative minor (-3)
    # For each path, compute the end-to-end audio shift per song
    # Pick the path with smallest total shift

    if scale_a == "minor":
        # Path 1: convert A (minor) → relative major
        path1_a = (semi_a + 3) % 12
        path1_dist = chromatic_distance(path1_a, semi_b)
        # Path 2: convert B (major) → relative minor
        path2_b = (semi_b - 3) % 12
        path2_dist = chromatic_distance(semi_a, path2_b)  # A stays minor
    else:
        # Path 1: convert B (minor) → relative major
        path1_b = (semi_b + 3) % 12
        path1_dist = chromatic_distance(semi_a, path1_b)
        # Path 2: convert A (major) → relative minor
        path2_a = (semi_a - 3) % 12
        path2_dist = chromatic_distance(path2_a, semi_b)  # B stays minor

    if path1_dist <= path2_dist:
        best_path, best_distance = "path1", path1_dist
    else:
        best_path, best_distance = "path2", path2_dist

    # Compute actual audio shifts using the shift allocation table
    # The conceptual conversion (+3/-3) affects WHERE the target key lands,
    # but the actual audio shift is from original key to final target
    return build_shift_plan_from_best_path(semi_a, semi_b, best_path, best_distance, rap_vocals=rap_vocals)


function build_shift_plan(semi_a, semi_b, distance, rap_vocals=False):
    """Apply the shift allocation table."""

    # Rap mode: vocals never shift, max distance is 4
    if rap_vocals:
        if distance > 4:
            return KeyPlan(action="incompatible", reason="distance exceeds instrumental-only limit (rap vocals)")
        SHIFT_TABLE = {1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0)}
        inst_shift, vocal_shift = SHIFT_TABLE[distance]
        ...compute signed shifts based on shortest path direction...
        return KeyPlan(action="shift", shift_a=0, shift_b=inst_signed)

    # Sung mode: full table with vocal overflow
    if distance > 6:
        return KeyPlan(action="incompatible")
    SHIFT_TABLE = {
        1: (1, 0), 2: (2, 0), 3: (3, 0),
        4: (4, 0), 5: (4, 1), 6: (4, 2),
    }
    inst_shift, vocal_shift = SHIFT_TABLE[distance]

    if distance == 6:
        # Populate shift values even for warning — they're applied after user confirms
        ...compute signed shifts based on shortest path direction...
        return KeyPlan(action="warning", shift_a=vocal_signed, shift_b=inst_signed, distance=distance)

    # Determine direction (which way to shift each song)
    # Instrumentals = Song B, Vocals = Song A (fixed convention)
    # Shift toward each other on the chromatic circle
    ...compute signed shifts based on shortest path direction...

    return KeyPlan(action="shift", shift_a=vocal_signed, shift_b=inst_signed)


function build_shift_plan_from_best_path(semi_a, semi_b, best_path, best_distance, rap_vocals=False):
    """Build a shift plan when modes differ, using the winning conversion path.

    Applies the conceptual mode conversion (+3/-3) to get both songs into the
    same mode space, then delegates to build_shift_plan() which already handles
    same-mode distance computation and shift allocation.
    """

    # 1. Apply the conceptual mode conversion to get effective semitones
    #    major→minor = -3 semitones on chromatic circle (conceptual, not audio)
    #    minor→major = +3 semitones on chromatic circle (conceptual, not audio)
    #    Which song gets converted depends on which one was minor, so in real
    #    code the per-path converted roots computed in compute_key_plan()
    #    should be passed in alongside best_path rather than recomputed here.

    if best_path == "path1":
        effective_a, effective_b = path1_effective_a, path1_effective_b
    else:
        effective_a, effective_b = path2_effective_a, path2_effective_b

    # 2. Delegate to build_shift_plan() — both songs are now in the same mode space
    #    so the standard same-mode allocation logic applies directly
    return build_shift_plan(effective_a, effective_b, best_distance, rap_vocals=rap_vocals)
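The pseudocode leans on two utilities it never defines. Possible implementations — the accepted note spellings are an assumption about what the key detector emits:

```python
# Enharmonic equivalents (C#/Db etc.) deliberately collapse to one pitch class.
_NOTE_TO_SEMITONE = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}

def note_to_semitone(note: str) -> int:
    """Map a note name to its pitch class (C = 0), case-insensitively."""
    return _NOTE_TO_SEMITONE[note.strip().capitalize()]

def chromatic_distance(a: int, b: int) -> int:
    """Shortest distance between two pitch classes on the chromatic circle."""
    d = abs(a - b) % 12
    return min(d, 12 - d)
```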

Implementation Steps

Step 1: New module services/key_matching.py

Pure functions, no side effects. Contains:

action values:

- "skip" — no shift needed (same key, low confidence, or modulation)
- "shift" — shift both songs per the allocation table (distance 1-5)
- "warning" — distance 6, needs user confirmation before shifting
- "incompatible" — distance 7+, cannot match
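For reference, a minimal KeyPlan shape covering every field the pseudocode constructs; the dataclass framing and defaults are assumptions about the eventual model:

```python
from dataclasses import dataclass

@dataclass
class KeyPlan:
    action: str            # "skip" | "shift" | "warning" | "incompatible"
    shift_a: int = 0       # signed semitones for Song A (vocal side)
    shift_b: int = 0       # signed semitones for Song B (instrumental side)
    target_key: str = ""
    target_scale: str = ""
    reason: str = ""
    distance: int = 0
```

Defaults let the short constructor calls in the pseudocode (e.g. KeyPlan(action="skip", reason="same key")) stay valid.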

Step 2: Integrate into pipeline (services/pipeline.py)

After analysis and before tempo planning. The vocal_type comes from the LLM interpreter's output (Step 5):

# vocal_type is set by the LLM interpreter (default "sung" if not provided)
rap_vocals = intent_plan.vocal_type == "rap"

key_plan = compute_key_plan(
    meta_a.key, meta_a.scale, meta_a.key_confidence, meta_a.has_modulation,
    meta_b.key, meta_b.scale, meta_b.key_confidence, meta_b.has_modulation,
    rap_vocals=rap_vocals,
)

if key_plan.action == "warning":
    # Distance 6 — warn user, get confirmation
    emit_progress(event_queue, {
        "step": "key_warning",
        "detail": f"Large key difference ({meta_a.key} {meta_a.scale} vs {meta_b.key} {meta_b.scale}). "
                  "The remix might not sound great. Continue anyway?",
        "progress": current_progress,
        "requires_confirmation": True,
    }, session=session)
    # Block until user responds (see Step 4)

elif key_plan.action == "incompatible":
    emit_progress(event_queue, {
        "step": "key_warning",
        "detail": f"These songs are too far apart in key to match "
                  f"({meta_a.key} {meta_a.scale} vs {meta_b.key} {meta_b.scale}). "
                  "Continue without key matching?",
        "progress": current_progress,
        "requires_confirmation": True,
    }, session=session)

Step 3: Pass semitones to rubberband in Step 9

Currently rubberband_process() is called with semitones=0. Change to:

Important: The existing pipeline gates the rubberband executor with conditions like if stretch_vocals: (only true when tempo adjustment is needed). This must be expanded to also trigger on key shifts, otherwise key-only shifts are silently skipped when tempos already match. Change the gating conditions to:

- Vocals: if stretch_vocals or key_plan.shift_a != 0:
- Instrumentals: if stretch_instrumentals or key_plan.shift_b != 0:
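The widened gate can be factored into a tiny helper (name hypothetical) so both call sites stay symmetrical:

```python
def needs_rubberband(stretch: bool, semitones: int) -> bool:
    """Run the rubberband stage when either tempo or key work is needed."""
    return stretch or semitones != 0

# Vocals:        needs_rubberband(stretch_vocals, key_plan.shift_a)
# Instrumentals: needs_rubberband(stretch_instrumentals, key_plan.shift_b)
```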

# Song A = vocals, Song B = instrumentals (fixed convention)
vocal_semitones = key_plan.shift_a if key_plan.action in ("shift", "warning") else 0
inst_semitones = key_plan.shift_b if key_plan.action in ("shift", "warning") else 0

# In the rubberband ThreadPoolExecutor:
for stem_name in vocal_stems:
    is_vocal = (stem_name == "vocals")
    is_drums = (stem_name == "drums")
    # Drums exempt from pitch shifting
    semitones = 0 if is_drums else vocal_semitones
    futures[("vocal", stem_name)] = rb_executor.submit(
        rubberband_process, vocal_audio[stem_name], sr,
        vocal_meta.bpm, target_bpm,
        semitones=semitones,
        is_vocal=is_vocal,
    )

for stem_name in inst_stems:
    is_drums = (stem_name == "drums")
    semitones = 0 if is_drums else inst_semitones
    futures[("inst", stem_name)] = rb_executor.submit(
        rubberband_process, inst_audio[stem_name], sr,
        inst_meta.bpm, target_bpm,
        semitones=semitones,
        is_vocal=False,
    )

Step 4: SSE confirmation flow for key warnings

The pipeline currently has no way to pause and wait for user input mid-execution.

Approach: Add a new SSE event type key_warning with requires_confirmation: True. Frontend shows a dialog. User responds via new API endpoint:

New API endpoint in api/remix.py:

@router.post("/api/remix/{session_id}/confirm-key")
async def confirm_key_match(session_id: str, body: KeyConfirmation, request: Request):
    session = request.app.state.sessions.get(session_id)
    if session is None:
        raise HTTPException(404, "Session not found")
    session.key_confirmed = body.proceed
    session.key_confirmation_event.set()
    return {"ok": True}

Pipeline waits:

if key_plan.action in ("warning", "incompatible"):
    emit_progress(...)
    # key_confirmation_event is initialized in SessionState field defaults — no lazy creation needed
    confirmed = session.key_confirmation_event.wait(timeout=120)
    if not confirmed or not session.key_confirmed:
        emit_progress(event_queue, {"step": "cancelled", ...})
        return
    if key_plan.action == "incompatible":
        key_plan = KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="", reason="user accepted incompatible keys", distance=key_plan.distance)
    elif key_plan.action == "warning":
        key_plan.action = "shift"  # User accepted — promote to normal shift so downstream code processes it
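A sketch of how the confirmation state could live on SessionState, using stdlib threading primitives with field defaults (field names match this plan; the dataclass framing is illustrative — the real SessionState may be a Pydantic model):

```python
import threading
from dataclasses import dataclass, field

@dataclass
class SessionState:
    key_confirmed: bool = False
    key_confirmation_event: threading.Event = field(default_factory=threading.Event)
```

The endpoint sets key_confirmed and then fires the event; the pipeline blocks on key_confirmation_event.wait(timeout=120), which returns False on timeout so the remix can be cancelled.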

Step 5: Replace key_source with vocal_type in interpreter.py

Remove key_source:

- Remove key_source from the REMIX_PLAN_TOOL schema (lines 151-155) and from the "required" list (line 68)
- Remove _compute_key_guidance() (lines 663-695)
- Remove the key_matching_available and key_matching_detail params from _build_system_prompt_blocks() (lines 179-180)
- Remove the key matching section from Section 8 (lines 342-346) — keep the tempo matching section
- Remove key_source from _parse_intent_plan() (line 927)
- Update the few-shot examples — remove "key_source": "none" from all three (lines 785, 840, ~884)
- Update interpret_prompt() — remove the _key_available, key_matching_detail = _compute_key_guidance(...) call (lines 1191-1192) and the corresponding args to _build_system_prompt_blocks() (lines 1205-1212)
- Update generate_fallback_plan() — remove key_source="none" (line 1435)

Add vocal_type:

- Add vocal_type to the REMIX_PLAN_TOOL schema:

"vocal_type": {
    "type": "string",
    "enum": ["sung", "rap"],
    "description": "Whether Song A's vocals are melodic/sung or rap/spoken word. Only flag 'rap' if the vocals are ENTIRELY rapped or spoken with NO melodic singing. If the artist sings at all (hooks, choruses, melodic sections), use 'sung' — the sung portions need key matching and the rapped portions tolerate the shift fine."
}
- Add vocal_type to _parse_intent_plan() — extract from the tool call, default to "sung"
- Add to the few-shot examples — include "vocal_type": "sung" or "vocal_type": "rap" as appropriate
- Add brief guidance to the system prompt — one sentence in Section 8: "Classify Song A's vocal_type as 'rap' only if the vocals are entirely rapped/spoken with no melodic singing."
- Update generate_fallback_plan() — add vocal_type="sung" (safe default)
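One defensive way to do the extraction-with-default inside _parse_intent_plan() (helper name and payload shape are assumptions about the tool-call arguments): anything other than an explicit "rap" falls back to "sung", the safe default.

```python
def parse_vocal_type(tool_args: dict) -> str:
    """Extract vocal_type from the LLM tool call, defaulting to "sung".

    Missing, malformed, or out-of-enum values all collapse to "sung" so the
    pipeline never disables vocal shifting by accident.
    """
    return "rap" if tool_args.get("vocal_type") == "rap" else "sung"
```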

Step 6: Update progress messages

Add key-matching status to the SSE stream:

- "Analyzing keys..." (during analysis)
- "Keys matched: shifting instrumentals by +3, vocals by -1" (when shifting)
- "Keys already compatible — no shift needed" (when skipping, distance 0)
- "Large key difference — the remix might not sound great. Continue anyway?" (warning, distance 6)
- "These songs are too far apart in key to match. Continue without key matching?" (incompatible, 7+)

Step 7: Frontend Changes

  1. Types: Add 'key_warning' to the ProgressStep union type
  2. ProgressEvent type: Add requires_confirmation?: boolean and key_info?: { detected_keys: string, distance: number, recommendation: string } fields
  3. useRemixProgress.ts: Handle key_warning step — pause progress display and show the confirmation dialog
  4. API client: Add POST /api/remix/{session_id}/confirm-key call ({ proceed: boolean } body)
  5. New UI component: Key confirmation dialog showing detected keys, distance, and recommendation text. Two buttons: "Continue anyway" (proceed: true) and "Cancel" (proceed: false)

Files to Modify

| File | Change |
| --- | --- |
| NEW: services/key_matching.py | Core algorithm — compute_key_plan(), KeyPlan, shift allocation table, semitone utilities |
| services/pipeline.py | Integrate key plan after analysis, pass semitones to rubberband, drums exemption, SSE events, confirmation blocking |
| services/interpreter.py | Remove key_source, add vocal_type to tool schema. Remove _compute_key_guidance(), remove key matching from system prompt Section 8 (add one-sentence vocal_type guidance). Update _parse_intent_plan() to extract vocal_type (default "sung"). Update few-shot examples. Update interpret_prompt() and generate_fallback_plan() |
| models.py | Remove key_source from RemixPlan/IntentPlan, add key_confirmation_event/key_confirmed to SessionState field defaults, add KeyConfirmation(BaseModel) |
| api/remix.py | Add POST /api/remix/{session_id}/confirm-key endpoint |
| services/taste_constraints.py | Delete check_pitch_shift_safety() and check_pitch_shift_semitones() plus their callsites in run_all_constraints(), and remove _key_semitone_distance(). These validated the now-removed key_source field |
| services/taste_features.py | Remove all plan.key_source reads and key_source= constructor args. Keep _camelot_distance() and _estimate_pitch_shift_semitones() — taste scoring's own concern, built on _CAMELOT_WHEEL rather than _NOTE_TO_SEMITONE |
| services/taste_model.py | Remove all plan.key_source != "none" branches and key_source-dependent scoring logic. Replace with a key_plan.action == "shift" signal where taste scoring needs to know if key convergence is active |
| services/candidate_planner.py | Remove key_source='none' from the RemixPlan(...) constructor call |
| services/gain_mapper.py | Remove key_source=intent.key_source (or similar) from RemixPlan(...) constructor call(s). The field no longer exists |
| tests/ | Update all test files referencing key_source: remove from plan/intent constructors, delete check_pitch_shift_safety tests, update taste scoring assertions. Files affected: test_interpreter.py, test_taste_constraints.py, test_taste_features.py, test_gain_mapper.py, test_taste_model.py, test_taste_stage.py, test_candidate_planner.py |

Verification

  1. Unit tests for key_matching.py:
     - Same key (distance 0) → skip
     - Distance 1 → instrumental shifts 1, vocal shifts 0
     - Distance 3 → instrumental shifts 3, vocal shifts 0
     - Distance 4 → instrumental shifts 4, vocal shifts 0
     - Distance 5 → instrumental shifts 4, vocal shifts 1
     - Distance 6 → warning action, instrumental 4, vocal 2
     - Distance 7+ → incompatible
     - Different mode: try both conversion paths, pick the smaller
     - Low confidence → skip
     - Modulation detected → skip
     - Enharmonic equivalents (C# vs Db) → same semitone value
     - Rap mode: distance 1-4 → instrumental shifts only, vocal always 0
     - Rap mode: distance 5+ → incompatible (not warning)
     - Rap mode: distance 0 → skip (same as sung)

  2. Integration test:
     - Upload two songs with known different keys
     - Verify the SSE stream includes key matching status
     - Verify rubberband receives the correct semitone values
     - Verify drums are NOT pitch-shifted
     - Verify the output audio is pitch-shifted

  3. Warning/incompatible flow:
     - Distance 6: verify the key_warning SSE event with requires_confirmation
     - User accepts → remix proceeds with shifts applied
     - User declines → remix cancelled
     - Distance 7+: verify the incompatible message
     - User accepts → remix proceeds WITHOUT key matching
     - User declines → remix cancelled
     - Timeout (120s) → remix cancelled
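The timeout case in item 3 can be unit-tested without a real SSE round trip. A sketch using a bare threading.Event (function name hypothetical — the real pipeline code inlines this logic):

```python
import threading

def wait_for_confirmation(event: threading.Event, confirmed: bool, timeout: float) -> bool:
    """Mirror of the pipeline's blocking step: timeout counts as a decline."""
    if not event.wait(timeout=timeout):
        return False  # no /confirm-key call arrived in time -> cancel
    return confirmed
```

With no responder the event never fires, wait() returns False, and the remix is cancelled; a set event passes through whatever the user chose.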


Edge Cases


Follow-up Items (not in this implementation)