i love music. always have, always will.
people talk a lot about malleable software these days but one thing that fascinates me is malleable music. what if we could take our favorite songs and make them our own?
at the moment my favorite song is all along the watchtower by jimi hendrix. what if i could sprinkle some reggae on top? or blend it with some rap?
now don't get too excited, i didn't build this (yet!). regardless of copyright, something like this would require tons of training to get right. and even then, who knows.
instead, i decided to build something simple. a mashup creator. take lyrics from one song, instrumentals from another. mix them together, match the key and BPM. voilà.
now, it's not perfect. but there's a glimmer of the future inside of it.
// maybe move later??
two fundamental questions arose in my mind while building this: how much of DJing/mashing is taste? and can we give LLMs 'taste'?
manufacturing taste in LLMs: upsides and limitations (these two can be better)¶
how i built a music mashup web app - all from a few python libraries and prompting sonnet-4 to act as a DJ
the goal¶
build a mashup mixer that takes vocals from one song and instrumentals from another and mixes them together. AND make it sound good - handle EQ balancing, key convergence, BPM stretching. all to produce a product that sounds like an average DJ made it (again fix later).
architecture¶
(excalidraw high-level diagram later) ![[Screenshot 2026-03-07 at 3.11.32 PM.png]]
first, it's important to ground our knowledge: what is a song made up of? what are its parts and pieces?
- vocals
    - lyrics
    - key
- instruments (stems)
    - key
    - relative dB
- tempo
    - 120 BPM vs 80 BPM
- composition
    - intro
    - chorus
    - outro
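the breakdown above can be pinned down as a tiny data model. the names here are my own, just to make the pieces concrete - they're not actual types from the project:

```python
from dataclasses import dataclass

@dataclass
class StemInfo:
    name: str           # e.g. "vocals", "bass", "guitar"
    key: str            # e.g. "Am"
    relative_db: float  # level relative to the full mix

@dataclass
class SongAnalysis:
    bpm: float               # e.g. 120.0 vs 80.0
    stems: list[StemInfo]
    sections: list[str]      # composition: ["intro", "chorus", "outro"]
```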
part 1 - making two songs play together¶
stem splitting¶
initially, i just had to get it working. how do we strip vocals from one song and overlay them on instrumentals from another? what are the pros/cons of different methods?
first, i tried the htdemucs_ft stem splitter from meta, but it only gave us 4 stems: vocals, drums, bass, other.
this limited our ability to analyze and adjust individual instruments. so we switched to BS-Roformer-SW, which has 6 stems: vocals, drums, bass, guitar, piano, other.
BS-Roformer allowed for both better song manipulation AND better separation quality. - stat about BS-Roformer vs htdemucs
audio mixing¶
after we split the stems up, we have to combine them back together.
i could have used pydub, but it works in 16-bit integers. this means we would lose precision at every step - quiet details like the trickle of a guitar might get lost in the mix.
instead i used float32 - this gives you WAY more precision, so the guitar trickle actually stays in the mix.
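you can see the difference with a toy experiment: round-trip a very quiet signal through int16 (as a 16-bit pipeline would at every processing step) versus a float32 cast. this is my own illustration with numpy, not code from the app:

```python
import numpy as np

# a quiet "guitar trickle": a sine at amplitude 0.001 (roughly -60 dBFS)
t = np.linspace(0, 1, 1000, dtype=np.float64)
quiet = 0.001 * np.sin(2 * np.pi * 440 * t)

# 16-bit path: quantize to int16 and back
roundtrip_16 = np.round(quiet * 32767).astype(np.int16).astype(np.float64) / 32767

# float32 path: just a precision cast
roundtrip_32 = quiet.astype(np.float32).astype(np.float64)

err_16 = np.max(np.abs(roundtrip_16 - quiet))  # ~1.5e-5: about 1.5% of the signal's peak
err_32 = np.max(np.abs(roundtrip_32 - quiet))  # orders of magnitude smaller
```

and that's the error from a single round-trip - stack a dozen processing steps and the 16-bit noise floor starts eating the quiet details.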
LUFS normalization¶
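measuring real LUFS needs a proper loudness meter (the pyloudnorm library does this), but once you have a measurement, the normalization itself is just dB-to-linear-gain math. a minimal sketch, with the helper name being my own:

```python
def gain_to_target(measured_lufs: float, target_lufs: float) -> float:
    """linear gain that moves a stem from its measured loudness to a target.
    loudness is logarithmic, so a dB difference becomes a 10^(dB/20) factor."""
    return 10 ** ((target_lufs - measured_lufs) / 20)

# a vocal stem measured at -18 LUFS, target -14 LUFS -> boost by ~1.58x
gain = gain_to_target(-18.0, -14.0)
```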
BPM detection + tempo matching¶
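detecting the BPM itself is a library job (librosa's beat tracker, for example). the interesting part is picking the stretch factor: stretching vocals much more than ~25% starts to sound obviously warped, so it often sounds better to treat one track as half- or double-time. the half/double heuristic below is my own illustration, not necessarily what the app does:

```python
def stretch_ratio(source_bpm: float, target_bpm: float) -> float:
    """time-stretch factor to match tempos, allowing half/double-time
    matches so the factor stays as close to 1.0 as possible."""
    candidates = (
        target_bpm / source_bpm,        # direct match
        target_bpm / (2 * source_bpm),  # treat target as half-time
        2 * target_bpm / source_bpm,    # treat target as double-time
    )
    return min(candidates, key=lambda r: abs(r - 1.0))

# 80 BPM vocals over a 150 BPM instrumental: instead of a brutal 1.875x
# stretch, treat the instrumental as half-time and stretch by only 0.9375x
ratio = stretch_ratio(80.0, 150.0)
```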
how it sounded after part 1:
part 2 - teaching it to be a DJ¶
lyrics
how it sounded after part 2:
part 3¶
- EQ balancing
- handling backing vocals
part 4¶
- dynamic EQ
- key convergence
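for key convergence, the arithmetic is picking the smallest pitch shift between the two keys - shift too far and vocals go chipmunk (up) or demonic (down). a minimal sketch, my own and ignoring major/minor relationships:

```python
PITCH_CLASS = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3,
               "E": 4, "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8,
               "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11}

def semitone_shift(from_key: str, to_key: str) -> int:
    """smallest shift in semitones (negative = down) that moves one
    key onto another, wrapping around the 12-tone circle."""
    diff = (PITCH_CLASS[to_key] - PITCH_CLASS[from_key]) % 12
    return diff - 12 if diff > 6 else diff

# vocals in A over an instrumental in C: shift the vocals up 3 semitones
shift = semitone_shift("A", "C")
```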
part 5 - iterations¶
lessons learned / reflections¶
- claude does not have musical taste, and its musical judgement isn't good either
- initially, i had it spin up audio-engineer experts to teach me things and consult on decisions. this did not work well.
- consulting my brother and friends was more valuable.