Key Convergence Algorithm — Implementation Plan

Context

Right now, when two songs are in different keys, the remix sounds dissonant. The current pipeline detects keys (essentia/librosa) and the LLM sets key_source in the RemixPlan, but no pitch shifting is actually applied: rubberband_process() always receives semitones=0.

We need an automatic key convergence algorithm that:

1. Always shifts when keys differ (even 1 semitone — minor 2nd is the most dissonant interval)
2. Resolves major/minor mismatches via relative key conversion
3. Favors shifting instrumentals over vocals to protect audio quality
4. Handles rap/spoken vocals — the LLM classifies vocal type, and vocal shifting is disabled when vocals are unpitched
5. Warns the user when the shift is large (distance 6), bails at 7+

This replaces LLM control of key matching — the algorithm runs automatically.


Design Decisions (from discussion)

| Decision | Choice | Rationale |
| --- | --- | --- |
| Convergence method | Instrumental-favoring split (see table below) | Vocals degrade faster than instrumentals under pitch shift |
| LLM role | None — fully automatic | Algorithm handles both whether and how |
| Mode mismatch | Convert via relative key, try both directions, pick smallest total shift | Major + minor on same root sounds bad (disagreement on 3rd/6th/7th) |
| Skip threshold | None — always shift, even at 1 semitone | Minor 2nd is the most dissonant interval; R3 at ±1 is transparent |
| Safety limit | 4 semitones max per song (instrumentals), 2 max (vocals) | Expert-validated limits for Rubber Band R3 |
| Distance 6 | Warning via SSE → frontend confirmation → proceed or cancel | Pushing limits, user should know |
| Distance 7+ | Incompatible — do not attempt | Both sides would exceed safe limits |
| Same key (distance 0) | Skip — no processing needed | Nothing to fix |
| Rap/spoken vocals | LLM classifies vocal_type from filename + prompt | Rap/spoken vocals are exempt from pitch shifting, which caps max distance at 4 |

Rap/Spoken Vocal Handling

Problem: Rap vocals are largely unpitched rhythmic speech. Pitch-shifting them degrades timbre for zero harmonic benefit. Additionally, the detected "key" of a rap song is the key of the beat — if we're discarding the beat and using only the vocals, that key is irrelevant.

Solution: The LLM interpreter classifies Song A's vocal type as part of its existing prompt interpretation step. New field in the LLM tool schema:

"vocal_type": {
    "type": "string",
    "enum": ["sung", "rap"],
    "description": "Whether Song A's vocals are melodic/sung or rap/spoken word. Only flag 'rap' if the vocals are ENTIRELY rapped or spoken with NO melodic singing. If the artist sings at all (hooks, choruses, melodic sections), use 'sung' — the sung portions need key matching and the rapped portions tolerate the shift fine."
}

When vocal_type == "rap":

- Vocal stems are never pitch-shifted (shift_a is always 0)
- Max compatible distance drops from 6 → 4 (instrumentals absorb everything)
- Distance 5+ becomes "incompatible" instead of "warning"
- The algorithm otherwise works identically — just with the vocal column zeroed out

Why LLM classification works here:

- The LLM already runs during the interpreter step and sees the filenames + user prompt
- "The Notorious B.I.G. - Hypnotize.mp3" + "put Biggie's vocals over..." → the LLM knows this is rap
- No additional compute, no external API calls, no new dependencies
- For well-known artists (the common case), classification is near-certain from the filename alone
- No existing DJ tool or mashup software does this — we'd be ahead of the field

Why the classification rule is "ONLY flag rap if entirely spoken":

- Artists who both rap and sing (Drake, Post Malone, Travis Scott) should be classified as "sung"
- If an artist sings hooks/choruses, those pitched sections genuinely benefit from key matching
- The rapped sections tolerate a ±1-2 semitone shift fine — slight timbre change, not catastrophic
- False negative (classifying rap as sung) → unnecessary but harmless vocal shift
- False positive (classifying sung as rap) → sung vocals don't get shifted, potential dissonance on melodic sections — this is the worse mistake, so we bias toward "sung"

Graceful failure modes:

- LLM unsure → defaults to "sung" (the safe choice — shift everything)
- LLM wrong (classifies sung as rap) → vocals don't shift, worst case is some dissonance at distances 1-4
- LLM wrong (classifies rap as sung) → vocals shift ±1-2 semitones, slight timbre change, no real harm

Why only Song A: Song A is the vocal source. Song B provides instrumentals — its vocal type doesn't matter since we're not using its vocals.

Backing vocals edge case: BS-RoFormer puts backing vocals in the "other" stem, not the vocal stem. For a pure rap track with no melodic content, there are rarely pitched backing vocals either. For hybrid artists (classified as "sung"), everything shifts normally. The "other" stem from Song A shifts with shift_a (same as vocals), keeping backing vocals aligned with lead vocals. The "other" stem from Song B shifts with shift_b (same as instrumentals).
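The stem-routing rules above reduce to a small lookup. A sketch — the source/stem labels and function name are illustrative assumptions, not final pipeline API:

```python
def stem_semitones(source: str, stem: str, shift_a: int, shift_b: int) -> int:
    """Return the pitch shift to apply to one stem.

    Song A stems (vocals plus its "other" bed) follow shift_a so backing
    vocals stay aligned with the lead; Song B stems follow shift_b; drums
    are always exempt because shifting them smears transients.
    """
    if stem == "drums":
        return 0
    return shift_a if source == "a" else shift_b
```

With shift_a = 1 and shift_b = 4, Song A's "other" stem moves with the vocals (+1) while Song B's "other" stem moves with the instrumentals (+4), matching the rule stated above.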

Future enhancement: CREPE pitch confidence on the isolated vocal stem could auto-detect pitchedness (mean confidence ~0.5-0.6 threshold) and validate the LLM's classification. No custom model training needed.
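If the CREPE check is added later, the decision layer on top of it stays trivial. A sketch of just that layer — the per-frame confidences would come from CREPE's output, and the 0.55 threshold is a placeholder inside the ~0.5-0.6 range mentioned above:

```python
def vocals_look_pitched(confidences: list[float], threshold: float = 0.55) -> bool:
    """Heuristic pitchedness check on per-frame pitch confidences.

    Mean confidence at or above the threshold suggests sung (pitched) vocals;
    an empty analysis is treated as unpitched rather than guessing.
    """
    if not confidences:
        return False
    return sum(confidences) / len(confidences) >= threshold
```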

Shift Allocation Table

Two modes based on the rap toggle. Instrumentals absorb shift first (up to 4), sung vocals take what's left (up to 2). Rap vocals never shift.

Default mode (sung vocals):

| Distance | Instrumentals | Vocals | Notes |
| --- | --- | --- | --- |
| 0 | 0 | 0 | Same key — skip |
| 1 | 1 | 0 | Minor 2nd — most dissonant, always fix |
| 2 | 2 | 0 | Clean |
| 3 | 3 | 0 | Instrumental preferred max |
| 4 | 4 | 0 | Instrumentals absorb all |
| 5 | 4 | 1 | Push instrumentals to 4 |
| 6 | 4 | 2 | Warning first — vocals at max |
| 7+ | — | — | Incompatible |

Rap mode (rap/spoken vocals toggled on):

| Distance | Instrumentals | Vocals | Notes |
| --- | --- | --- | --- |
| 0 | 0 | 0 | Same key — skip |
| 1 | 1 | 0 | Clean |
| 2 | 2 | 0 | Clean |
| 3 | 3 | 0 | Clean |
| 4 | 4 | 0 | Instrumentals at max |
| 5+ | — | — | Incompatible |
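Both tables collapse into one small lookup. A sketch — the function name and the (instrumental, vocal) tuple convention are illustrative, not part of the final module:

```python
# Shift magnitudes per chromatic distance: (instrumental, vocal).
# Instrumentals absorb first (up to 4), sung vocals take the remainder (up to 2).
SHIFT_TABLE = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0), 5: (4, 1), 6: (4, 2)}

def allocate(distance: int, rap_vocals: bool = False):
    """Return (inst_shift, vocal_shift) magnitudes, or None if incompatible."""
    if rap_vocals and distance > 4:
        return None  # rap mode: instrumentals alone cap out at 4
    if distance > 6:
        return None  # distance 7+ is incompatible in either mode
    return SHIFT_TABLE[distance]
```

In rap mode no separate table is needed: every entry at distance ≤ 4 already has a zero vocal column, so only the cutoff changes.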

Why these limits:

- Vocals ±2 max: Formant preservation (R3 + --formant) is good but not perfect. At ±3, sibilants go metallic and sustained notes get watery. An automated system with no human QA should stay at ±2.
- Instrumentals ±4 max: No formant sensitivity, but transients and timbre degrade beyond this. Bass gets muddy at ±4 but is still acceptable.
- Drums exempt from pitch shifting: Largely unpitched — shifting smears transients and creates metallic artifacts on cymbals/hi-hats. Harmonic benefit is negligible.


How Key Conversion Works

The Problem

When Song A is in D major and Song B is in A minor, they use different note sets. Playing them together produces dissonance on the 3rd, 6th, and 7th scale degrees.

The Solution: Relative Key Conversion

Every major key has a relative minor 3 semitones below (same notes):

- C major ↔ A minor (both use C D E F G A B)
- D major ↔ B minor (both use D E F# G A B C#)
- Eb major ↔ C minor (both use Eb F G Ab Bb C D)

Converting via relative key means reframing one song's key in terms of the other mode. This is a conceptual step — no audio processing. It lets us compute the convergence distance with both songs in the same mode.

Two-Path Evaluation (when modes differ)

Given Song A in D major and Song B in A minor:

Path 1: Convert minor → relative major

- A minor's relative major = C major
- Now comparing: D major vs C major = 2 semitones apart
- Instrumental shifts 2, vocal shifts 0 → total audio shift: 2

Path 2: Convert major → relative minor

- D major's relative minor = B minor
- Now comparing: B minor vs A minor = 2 semitones apart
- Instrumental shifts 2, vocal shifts 0 → total audio shift: 2

Pick whichever path gives the smallest total audio shift. If tied, either works.
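The D major / A minor example above, computed end-to-end. The C=0 semitone numbering and the distance helper match the conventions the pseudocode below assumes:

```python
def chromatic_distance(a: int, b: int) -> int:
    """Shortest distance between two pitch classes on the chromatic circle."""
    d = abs(a - b) % 12
    return min(d, 12 - d)

D, A = 2, 9  # D major (Song A), A minor (Song B), with C = 0

# Path 1: A minor -> relative major (+3) = C major (0); compare with D major
path1 = chromatic_distance(D, (A + 3) % 12)   # 2

# Path 2: D major -> relative minor (-3) = B minor (11); compare with A minor
path2 = chromatic_distance((D - 3) % 12, A)   # 2
```

Here both paths cost 2 semitones of actual audio shift — the tie case where either works.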

The Math

Relative key conversion:

- Minor → relative major: root +3 semitones (conceptually, not audio)
- Major → relative minor: root -3 semitones (conceptually, not audio)

Then convergence using the shift allocation table — instrumentals absorb first, vocals take the remainder.

Key insight: The "conceptual conversion" and the "convergence shift" combine into a single actual audio shift per song. We compute both paths end-to-end and compare the actual audio shifts, not the intermediate steps.

What DJs and Mashup Artists Actually Do

From research on professional practice:


Algorithm (Pseudocode)

function compute_key_plan(key_a, scale_a, conf_a, mod_a, key_b, scale_b, conf_b, mod_b, rap_vocals=False):

    # Gate: skip if key data missing or confidence too low
    if key_a is None or scale_a is None or conf_a is None:
        return KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="", reason="missing key data for song A", distance=0)
    if key_b is None or scale_b is None or conf_b is None:
        return KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="", reason="missing key data for song B", distance=0)
    if conf_a < 0.40 or conf_b < 0.40:
        return KeyPlan(action="skip", reason="low confidence")
    if mod_a or mod_b:
        return KeyPlan(action="skip", reason="modulation detected")

    semi_a = note_to_semitone(key_a)
    semi_b = note_to_semitone(key_b)

    # Same mode — straightforward convergence
    if scale_a == scale_b:
        distance = chromatic_distance(semi_a, semi_b)
        if distance == 0:
            return KeyPlan(action="skip", reason="same key")
        return build_shift_plan(semi_a, semi_b, distance, rap_vocals=rap_vocals)

    # Different mode — try both conversion directions, pick cheapest

    # Path 1: Convert minor to relative major (+3)
    # Path 2: Convert major to relative minor (-3)
    # For each path, compute the end-to-end audio shift per song
    # Pick the path with smallest total shift

    if scale_a == "minor":
        # Path 1: convert A (minor) → relative major
        path1_a = (semi_a + 3) % 12
        path1_dist = chromatic_distance(path1_a, semi_b)
        # Path 2: convert B (major) → relative minor
        path2_b = (semi_b - 3) % 12
        path2_dist = chromatic_distance(semi_a, path2_b)  # A stays minor
    else:
        # Path 1: convert B (minor) → relative major
        path1_b = (semi_b + 3) % 12
        path1_dist = chromatic_distance(semi_a, path1_b)
        # Path 2: convert A (major) → relative minor
        path2_a = (semi_a - 3) % 12
        path2_dist = chromatic_distance(path2_a, semi_b)  # B stays minor

    if path1_dist <= path2_dist:
        best_path, best_distance = "path1", path1_dist
    else:
        best_path, best_distance = "path2", path2_dist

    # Compute actual audio shifts using the shift allocation table
    # The conceptual conversion (+3/-3) affects WHERE the target key lands,
    # but the actual audio shift is from original key to final target
    return build_shift_plan_from_best_path(semi_a, semi_b, best_path, best_distance, rap_vocals=rap_vocals)


function build_shift_plan(semi_a, semi_b, distance, rap_vocals=False):
    """Apply the shift allocation table."""

    # Rap mode: vocals never shift, max distance is 4
    if rap_vocals:
        if distance > 4:
            return KeyPlan(action="incompatible", reason="distance exceeds instrumental-only limit (rap vocals)")
        SHIFT_TABLE = {1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0)}
        inst_shift, vocal_shift = SHIFT_TABLE[distance]
        ...compute signed shifts based on shortest path direction...
        return KeyPlan(action="shift", shift_a=0, shift_b=inst_signed)

    # Sung mode: full table with vocal overflow
    if distance > 6:
        return KeyPlan(action="incompatible")
    SHIFT_TABLE = {
        1: (1, 0), 2: (2, 0), 3: (3, 0),
        4: (4, 0), 5: (4, 1), 6: (4, 2),
    }
    inst_shift, vocal_shift = SHIFT_TABLE[distance]

    if distance == 6:
        # Populate shift values even for warning — they're applied after user confirms
        ...compute signed shifts based on shortest path direction...
        return KeyPlan(action="warning", shift_a=vocal_signed, shift_b=inst_signed, distance=distance)

    # Determine direction (which way to shift each song)
    # Instrumentals = Song B, Vocals = Song A (fixed convention)
    # Shift toward each other on the chromatic circle
    ...compute signed shifts based on shortest path direction...

    return KeyPlan(action="shift", shift_a=vocal_signed, shift_b=inst_signed)


function build_shift_plan_from_best_path(semi_a, semi_b, best_path, best_distance, rap_vocals=False):
    """Build a shift plan when modes differ, using the winning conversion path.

    Applies the conceptual mode conversion (+3/-3) to get both songs into the
    same mode space, then delegates to build_shift_plan() which already handles
    same-mode distance computation and shift allocation.
    """

    # 1. Apply the conceptual mode conversion to get effective semitones
    #    major→minor = -3 semitones on chromatic circle (conceptual, not audio)
    #    minor→major = +3 semitones on chromatic circle (conceptual, not audio)
    #    Which song gets converted depends on which one was minor, so in real
    #    code the per-path converted roots computed in compute_key_plan()
    #    should be passed in alongside best_path rather than recomputed here.

    if best_path == "path1":
        effective_a, effective_b = path1_effective_a, path1_effective_b
    else:
        effective_a, effective_b = path2_effective_a, path2_effective_b

    # 2. Delegate to build_shift_plan() — both songs are now in the same mode space
    #    so the standard same-mode allocation logic applies directly
    return build_shift_plan(effective_a, effective_b, best_distance, rap_vocals=rap_vocals)
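The pseudocode leans on two utilities it never defines. Possible implementations — the accepted note spellings are an assumption about what the key detector emits:

```python
# Enharmonic equivalents (C#/Db etc.) deliberately collapse to one pitch class.
_NOTE_TO_SEMITONE = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}

def note_to_semitone(note: str) -> int:
    """Map a note name to its pitch class (C = 0), case-insensitively."""
    return _NOTE_TO_SEMITONE[note.strip().capitalize()]

def chromatic_distance(a: int, b: int) -> int:
    """Shortest distance between two pitch classes on the chromatic circle."""
    d = abs(a - b) % 12
    return min(d, 12 - d)
```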

Implementation Steps

Step 1: New module services/key_matching.py

Pure functions, no side effects. Contains:

action values:

- "skip" — no shift needed (same key, low confidence, or modulation)
- "shift" — shift both songs per the allocation table (distance 1-5)
- "warning" — distance 6, needs user confirmation before shifting
- "incompatible" — distance 7+, cannot match
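For reference, a minimal KeyPlan shape covering every field the pseudocode constructs; the dataclass framing and defaults are assumptions about the eventual model:

```python
from dataclasses import dataclass

@dataclass
class KeyPlan:
    action: str            # "skip" | "shift" | "warning" | "incompatible"
    shift_a: int = 0       # signed semitones for Song A (vocal side)
    shift_b: int = 0       # signed semitones for Song B (instrumental side)
    target_key: str = ""
    target_scale: str = ""
    reason: str = ""
    distance: int = 0
```

Defaults let the short constructor calls in the pseudocode (e.g. KeyPlan(action="skip", reason="same key")) stay valid.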

Step 2: Integrate into pipeline (services/pipeline.py)

After analysis and before tempo planning. The vocal_type comes from the LLM interpreter's output (Step 5):

# vocal_type is set by the LLM interpreter (default "sung" if not provided)
rap_vocals = intent_plan.vocal_type == "rap"

key_plan = compute_key_plan(
    meta_a.key, meta_a.scale, meta_a.key_confidence, meta_a.has_modulation,
    meta_b.key, meta_b.scale, meta_b.key_confidence, meta_b.has_modulation,
    rap_vocals=rap_vocals,
)

if key_plan.action == "warning":
    # Distance 6 — warn user, get confirmation
    emit_progress(event_queue, {
        "step": "key_warning",
        "detail": f"Large key difference ({meta_a.key} {meta_a.scale} vs {meta_b.key} {meta_b.scale}). "
                  "The remix might not sound great. Continue anyway?",
        "progress": current_progress,
        "requires_confirmation": True,
    }, session=session)
    # Block until user responds (see Step 4)

elif key_plan.action == "incompatible":
    emit_progress(event_queue, {
        "step": "key_warning",
        "detail": f"These songs are too far apart in key to match "
                  f"({meta_a.key} {meta_a.scale} vs {meta_b.key} {meta_b.scale}). "
                  "Continue without key matching?",
        "progress": current_progress,
        "requires_confirmation": True,
    }, session=session)

Step 3: Pass semitones to rubberband in Step 9

Currently rubberband_process() is called with semitones=0. Change to:

Important: The existing pipeline gates the rubberband executor with conditions like if stretch_vocals: (only true when tempo adjustment is needed). This must be expanded to also trigger on key shifts, otherwise key-only shifts are silently skipped when tempos already match. Change the gating conditions to:

- Vocals: if stretch_vocals or key_plan.shift_a != 0:
- Instrumentals: if stretch_instrumentals or key_plan.shift_b != 0:
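The widened gate can be factored into a tiny helper (name hypothetical) so both call sites stay symmetrical:

```python
def needs_rubberband(stretch: bool, semitones: int) -> bool:
    """Run the rubberband stage when either tempo or key work is needed."""
    return stretch or semitones != 0

# Vocals:        needs_rubberband(stretch_vocals, key_plan.shift_a)
# Instrumentals: needs_rubberband(stretch_instrumentals, key_plan.shift_b)
```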

# Song A = vocals, Song B = instrumentals (fixed convention)
vocal_semitones = key_plan.shift_a if key_plan.action in ("shift", "warning") else 0
inst_semitones = key_plan.shift_b if key_plan.action in ("shift", "warning") else 0

# In the rubberband ThreadPoolExecutor:
for stem_name in vocal_stems:
    is_vocal = (stem_name == "vocals")
    is_drums = (stem_name == "drums")
    # Drums exempt from pitch shifting
    semitones = 0 if is_drums else vocal_semitones
    futures[("vocal", stem_name)] = rb_executor.submit(
        rubberband_process, vocal_audio[stem_name], sr,
        vocal_meta.bpm, target_bpm,
        semitones=semitones,
        is_vocal=is_vocal,
    )

for stem_name in inst_stems:
    is_drums = (stem_name == "drums")
    semitones = 0 if is_drums else inst_semitones
    futures[("inst", stem_name)] = rb_executor.submit(
        rubberband_process, inst_audio[stem_name], sr,
        inst_meta.bpm, target_bpm,
        semitones=semitones,
        is_vocal=False,
    )

Step 4: SSE confirmation flow for key warnings

The pipeline currently has no way to pause and wait for user input mid-execution.

Approach: Add a new SSE event type key_warning with requires_confirmation: True. Frontend shows a dialog. User responds via new API endpoint:

New API endpoint in api/remix.py:

@router.post("/api/remix/{session_id}/confirm-key")
async def confirm_key_match(session_id: str, body: KeyConfirmation, request: Request):
    session = request.app.state.sessions.get(session_id)
    if session is None:
        raise HTTPException(404, "Session not found")
    session.key_confirmed = body.proceed
    session.key_confirmation_event.set()
    return {"ok": True}

Pipeline waits:

if key_plan.action in ("warning", "incompatible"):
    emit_progress(...)
    # key_confirmation_event is initialized in SessionState field defaults — no lazy creation needed
    confirmed = session.key_confirmation_event.wait(timeout=120)
    if not confirmed or not session.key_confirmed:
        emit_progress(event_queue, {"step": "cancelled", ...})
        return
    if key_plan.action == "incompatible":
        key_plan = KeyPlan(action="skip", shift_a=0, shift_b=0, target_key="", target_scale="", reason="user accepted incompatible keys", distance=key_plan.distance)
    elif key_plan.action == "warning":
        key_plan.action = "shift"  # User accepted — promote to normal shift so downstream code processes it
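A sketch of how the confirmation state could live on SessionState, using stdlib threading primitives with field defaults (field names match this plan; the dataclass framing is illustrative — the real SessionState may be a Pydantic model):

```python
import threading
from dataclasses import dataclass, field

@dataclass
class SessionState:
    key_confirmed: bool = False
    key_confirmation_event: threading.Event = field(default_factory=threading.Event)
```

The endpoint sets key_confirmed and then fires the event; the pipeline blocks on key_confirmation_event.wait(timeout=120), which returns False on timeout so the remix can be cancelled.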

Step 5: Replace key_source with vocal_type in interpreter.py

Remove key_source:

- Remove key_source from the REMIX_PLAN_TOOL schema (lines 151-155) and from the "required" list (line 68)
- Remove _compute_key_guidance() (lines 663-695)
- Remove the key_matching_available and key_matching_detail params from _build_system_prompt_blocks() (lines 179-180)
- Remove the key matching section from Section 8 (lines 342-346) — keep the tempo matching section
- Remove key_source from _parse_intent_plan() (line 927)
- Update the few-shot examples — remove "key_source": "none" from all three (lines 785, 840, ~884)
- Update interpret_prompt() — remove the _key_available, key_matching_detail = _compute_key_guidance(...) call (lines 1191-1192) and the corresponding args to _build_system_prompt_blocks() (lines 1205-1212)
- Update generate_fallback_plan() — remove key_source="none" (line 1435)

Add vocal_type:

- Add vocal_type to the REMIX_PLAN_TOOL schema:

"vocal_type": {
    "type": "string",
    "enum": ["sung", "rap"],
    "description": "Whether Song A's vocals are melodic/sung or rap/spoken word. Only flag 'rap' if the vocals are ENTIRELY rapped or spoken with NO melodic singing. If the artist sings at all (hooks, choruses, melodic sections), use 'sung' — the sung portions need key matching and the rapped portions tolerate the shift fine."
}
- Add vocal_type to _parse_intent_plan() — extract from the tool call, default to "sung"
- Add to the few-shot examples — include "vocal_type": "sung" or "vocal_type": "rap" as appropriate
- Add brief guidance to the system prompt — one sentence in Section 8: "Classify Song A's vocal_type as 'rap' only if the vocals are entirely rapped/spoken with no melodic singing."
- Update generate_fallback_plan() — add vocal_type="sung" (safe default)
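One defensive way to do the extraction-with-default inside _parse_intent_plan() (helper name and payload shape are assumptions about the tool-call arguments): anything other than an explicit "rap" falls back to "sung", the safe default.

```python
def parse_vocal_type(tool_args: dict) -> str:
    """Extract vocal_type from the LLM tool call, defaulting to "sung".

    Missing, malformed, or out-of-enum values all collapse to "sung" so the
    pipeline never disables vocal shifting by accident.
    """
    return "rap" if tool_args.get("vocal_type") == "rap" else "sung"
```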

Step 6: Update progress messages

Add key-matching status to the SSE stream:

- "Analyzing keys..." (during analysis)
- "Keys matched: shifting instrumentals by +3, vocals by -1" (when shifting)
- "Keys already compatible — no shift needed" (when skipping, distance 0)
- "Large key difference — the remix might not sound great. Continue anyway?" (warning, distance 6)
- "These songs are too far apart in key to match. Continue without key matching?" (incompatible, 7+)

Step 7: Frontend Changes

  1. Types: Add 'key_warning' to the ProgressStep union type
  2. ProgressEvent type: Add requires_confirmation?: boolean and key_info?: { detected_keys: string, distance: number, recommendation: string } fields
  3. useRemixProgress.ts: Handle key_warning step — pause progress display and show the confirmation dialog
  4. API client: Add POST /api/remix/{session_id}/confirm-key call ({ proceed: boolean } body)
  5. New UI component: Key confirmation dialog showing detected keys, distance, and recommendation text. Two buttons: "Continue anyway" (proceed: true) and "Cancel" (proceed: false)

Files to Modify

| File | Change |
| --- | --- |
| NEW: services/key_matching.py | Core algorithm — compute_key_plan(), KeyPlan, shift allocation table, semitone utilities |
| services/pipeline.py | Integrate key plan after analysis, pass semitones to rubberband, drums exemption, SSE events, confirmation blocking |
| services/interpreter.py | Remove key_source, add vocal_type to tool schema. Remove _compute_key_guidance(), remove key matching from system prompt Section 8 (add one-sentence vocal_type guidance). Update _parse_intent_plan() to extract vocal_type (default "sung"). Update few-shot examples. Update interpret_prompt() and generate_fallback_plan() |
| models.py | Remove key_source from RemixPlan/IntentPlan, add key_confirmation_event/key_confirmed to SessionState field defaults, add KeyConfirmation(BaseModel) |
| api/remix.py | Add POST /api/remix/{session_id}/confirm-key endpoint |
| services/taste_constraints.py | Delete check_pitch_shift_safety() and check_pitch_shift_semitones() plus their callsites in run_all_constraints(), and remove _key_semitone_distance(). These validated the now-removed key_source field |
| services/taste_features.py | Remove all plan.key_source reads and key_source= constructor args. Keep _camelot_distance() and _estimate_pitch_shift_semitones() — taste scoring's own concern, built on _CAMELOT_WHEEL rather than _NOTE_TO_SEMITONE |
| services/taste_model.py | Remove all plan.key_source != "none" branches and key_source-dependent scoring logic. Replace with a key_plan.action == "shift" signal where taste scoring needs to know if key convergence is active |
| services/candidate_planner.py | Remove key_source='none' from the RemixPlan(...) constructor call |
| services/gain_mapper.py | Remove key_source=intent.key_source (or similar) from RemixPlan(...) constructor call(s). The field no longer exists |
| tests/ | Update all test files referencing key_source: remove from plan/intent constructors, delete check_pitch_shift_safety tests, update taste scoring assertions. Files affected: test_interpreter.py, test_taste_constraints.py, test_taste_features.py, test_gain_mapper.py, test_taste_model.py, test_taste_stage.py, test_candidate_planner.py |

Verification

  1. Unit tests for key_matching.py:
     - Same key (distance 0) → skip
     - Distance 1 → instrumental shifts 1, vocal shifts 0
     - Distance 3 → instrumental shifts 3, vocal shifts 0
     - Distance 4 → instrumental shifts 4, vocal shifts 0
     - Distance 5 → instrumental shifts 4, vocal shifts 1
     - Distance 6 → warning action, instrumental 4, vocal 2
     - Distance 7+ → incompatible
     - Different mode: try both conversion paths, pick the smaller
     - Low confidence → skip
     - Modulation detected → skip
     - Enharmonic equivalents (C# vs Db) → same semitone value
     - Rap mode: distance 1-4 → instrumental shifts only, vocal always 0
     - Rap mode: distance 5+ → incompatible (not warning)
     - Rap mode: distance 0 → skip (same as sung)

  2. Integration test:
     - Upload two songs with known different keys
     - Verify the SSE stream includes key matching status
     - Verify rubberband receives the correct semitone values
     - Verify drums are NOT pitch-shifted
     - Verify the output audio is pitch-shifted

  3. Warning/incompatible flow:
     - Distance 6: verify the key_warning SSE event with requires_confirmation
     - User accepts → remix proceeds with shifts applied
     - User declines → remix cancelled
     - Distance 7+: verify the incompatible message
     - User accepts → remix proceeds WITHOUT key matching
     - User declines → remix cancelled
     - Timeout (120s) → remix cancelled
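The timeout case in item 3 can be unit-tested without a real SSE round trip. A sketch using a bare threading.Event (function name hypothetical — the real pipeline code inlines this logic):

```python
import threading

def wait_for_confirmation(event: threading.Event, confirmed: bool, timeout: float) -> bool:
    """Mirror of the pipeline's blocking step: timeout counts as a decline."""
    if not event.wait(timeout=timeout):
        return False  # no /confirm-key call arrived in time -> cancel
    return confirmed
```

With no responder the event never fires, wait() returns False, and the remix is cancelled; a set event passes through whatever the user chose.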


Edge Cases


Follow-up Items (not in this implementation)