learned high level stuff¶
- quantization
- how inference works
- tradeoffs between different providers
finetuning with LoRA for simple image gen¶
my vision was simple
you type a word, and you get a beautiful pixel image generated - 'love' would show two people hugging, etc.
first i tried to get it working on stable diffusion with just a prompt. i tried a few different prompts and images, but it desperately tried to fill in the figure, and the images were not consistent. to get the model to consistently produce images in the style i wanted, i needed to finetune it.
![[Screenshot 2026-03-11 at 3.49.36 PM.png]]
so i worked with claude to come up with a plan.
rough architecture defined on day 1 with claude:
1. LLM interpretation layer - qwen 2.5 3B for taking a word and turning it into a prompt
2. fine tuned stable diffusion model to generate these kind of images
3. particle animation engine - p5.js
4. build into web app
the first day, i just wanted to get the interpretation layer and the image model working. no particle animation yet.
prompted with GPT to get an ideal image working
![[examplePerfect.png]]
ran a python script to generate 2 variations of 88 prompts (176 images total).
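conceptually the generation run was just prompt × variation pairs. a minimal sketch with placeholder prompts (the real script called a diffusers pipeline per pair, which `generate_image` stands in for here):

```python
# placeholder prompts - the real list was 88 curated prompts
prompts = [f"prtkl style image, concept {i}" for i in range(88)]
VARIATIONS = 2

def generate_image(prompt: str, seed: int) -> str:
    # stand-in for the actual stable diffusion call
    return f"{prompt} (variation {seed})"

jobs = [(p, v) for p in prompts for v in range(VARIATIONS)]
print(len(jobs))  # 88 prompts x 2 variations = 176 images
```

the curation step was then manual: look at all 176 outputs and keep the tasteful ones.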
then i selected the images i found to be most tasteful.
some were bad...
![[Screenshot 2026-03-11 at 6.58.29 PM.png]] ![[Screenshot 2026-03-11 at 6.57.09 PM.png]] ![[Screenshot 2026-03-11 at 6.57.26 PM.png]]
and some were good...![[Screenshot 2026-03-11 at 6.58.11 PM.png]] ![[Screenshot 2026-03-11 at 6.57.43 PM.png]]
we needed at least 30 images for the fine tuning to be effective. i had to run multiple batches because the prompt was lossy.
I tried updating the prompt, but it honestly turned out worse.
so i had it generate 176 images once more, and selected the ones i liked most
THEN - i fine tuned with LoRA
steps:
- we took the prompt and the image, and stored the curated ones in a folder.
- Install — diffusers, xformers, bitsandbytes for 8-bit adam
- Upload data — mount Google Drive, copy your curated/ folder to local for faster I/O
- Add trigger word — prepends "prtkl, " to every caption
- Download script — the advanced diffusers DreamBooth script (supports per-image captions)
- Train — ~1000 steps, rank 32, fp16, all T4 memory optimizations enabled. Checkpoints every 250 steps so you don't lose progress. Validation image generated periodically so you can see progress.
- Check output — verify files were saved
- Test — loads the LoRA and generates 5 test prompts that weren't in training data
- Save to Drive — backup the LoRA before Colab disconnects
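the trigger-word step above is a one-liner per caption. a sketch assuming a `curated/` folder with one `.txt` caption file per image (the folder layout and function names are my stand-ins, not the actual script):

```python
import os

def add_trigger(caption: str, trigger: str = "prtkl") -> str:
    # prepend the rare trigger token so the LoRA can bind the style to it;
    # skip captions that already start with it (idempotent on re-runs)
    caption = caption.strip()
    return caption if caption.startswith(trigger) else f"{trigger}, {caption}"

def process_folder(folder: str) -> None:
    # rewrite every caption file in place with the trigger prepended
    for name in os.listdir(folder):
        if name.endswith(".txt"):
            path = os.path.join(folder, name)
            with open(path) as f:
                caption = f.read()
            with open(path, "w") as f:
                f.write(add_trigger(caption))
```

e.g. `add_trigger("two figures hugging")` gives `"prtkl, two figures hugging"`.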
reasoning on epochs, learning rate, batch size:
- batch size is 1, but effectively 4. a T4 on colab can only fit 1 image per step, but we can simulate a batch of 4 by accumulating gradients across 4 steps.
- learning rate is 1e-4. this is standard for LoRA fine tuning
- steps at 500. initially claude had suggested 1000 steps, but this was too high, so i had it do more research. typical LoRA runs with ~30 images are 4-17 epochs (with 29 images, this equates to ~500 steps)
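the gradient accumulation trick can be sketched with a single toy weight and a quadratic loss (real training does this with torch and `gradient_accumulation_steps=4`; this just shows the mechanics of summing gradients over 4 micro-steps before one update):

```python
ACCUM_STEPS = 4       # effective batch size on a 1-image-per-step GPU
LEARNING_RATE = 1e-4

w = 0.5                          # single toy weight
targets = [1.0, 2.0, 3.0, 4.0]   # one "image" per micro-step

grad_sum = 0.0
for t in targets:
    # toy loss = (w - t)^2, so dloss/dw = 2 * (w - t)
    grad_sum += 2 * (w - t)

# average the accumulated gradients, then take ONE optimizer step -
# equivalent to having fit all 4 examples in memory at once
w -= LEARNING_RATE * (grad_sum / ACCUM_STEPS)
print(w)
```

memory cost stays at batch size 1; only the update frequency changes.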
CLAUDE is not good with training models.....
![[Screenshot 2026-03-12 at 5.04.54 PM.png]]
TI (textual inversion): creates brand new token embeddings for prtkl. the model learns to recognize prtkl and knows to generate images in that style.
how it works at a high level:
- take a real training image
- add a known amount of random noise to it
- feed the model the noisy image, caption and noise level
- model predicts WHAT noise was added
- has to identify the random splatters and be like "this is off"
- compare predicted noise to actual noise, calculate loss. mean((predicted noise - actual noise) ^2 )
- gradient is how much weight would need to change to increase the MSE.
- THEN we step opposite way with learning rate:
- new weight = oldWeight - (learningRate * gradient)
- we do this for each weight in the smaller LoRA matrices at every layer - encoder, middle, decoder.
- this happens all at once per training step
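the loop above, shrunk to a toy one-weight "model" (real training uses a U-Net over image tensors in torch; this just shows the noise-prediction MSE and the opposite-direction weight step):

```python
import random

random.seed(0)

w = 0.2      # the single trainable weight in this toy "model"
lr = 1e-2    # learning rate

pixels = [0.1, 0.5, 0.9]                        # a tiny "image"
noise = [random.gauss(0, 1) for _ in pixels]    # known amount of noise
noisy = [p + n for p, n in zip(pixels, noise)]  # what the model sees

# model predicts what noise was added: pred = w * noisy_pixel
pred = [w * x for x in noisy]

# loss = mean((predicted noise - actual noise)^2)
loss = sum((p - n) ** 2 for p, n in zip(pred, noise)) / len(noise)

# gradient of the loss w.r.t. w, hand-derived for this linear model:
# d/dw mean((w*x - n)^2) = mean(2 * x * (w*x - n))
grad = sum(2 * x * (p - n) for x, p, n in zip(noisy, pred, noise)) / len(noise)

# step the OPPOSITE way from the gradient, scaled by the learning rate
w = w - lr * grad
```

one step like this should drop the loss slightly; training is just millions of these across all the LoRA weights.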
during training:
- model overfit.
- at 500 steps it was too much, images looked like this:
- pogoed around way too much
![[Screenshot 2026-03-12 at 9.18.57 PM.png]]
![[Screenshot 2026-03-12 at 9.20.37 PM.png]]
cause:
1. the TI model tried to learn a new embedding for prtkl from scratch - it was adding new vectors to the text encoder's vocab... which fucked up the model bc it had to use special tokens.
2. this conflicted with the LoRA work - prtkl is just a rare token it already handles.

solution - pure LoRA: "prtkl" is just a rare token that SDXL's tokenizer already handles. the LoRA learns "when I see this token in the text, produce the particle style"
then it got even worse: ![[Screenshot 2026-03-13 at 1.22.18 AM.png]] too few steps, learning rate was too low.
modified model:
1. take 500 steps
2. increased rank to 32
3. added --train_text_encoder
- Low rank (e.g., 16) — fewer parameters, learns broader/simpler patterns. Cheaper, faster, less risk of overfitting. But may not capture enough detail for a distinctive style.
- Higher rank (e.g., 32, 64) — more parameters, can encode finer distinctions. Better for styles that are visually specific or unlike anything in the base model.
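the parameter-count tradeoff behind rank is simple arithmetic: a LoRA replaces a full d × d weight update with two thin matrices A (r × d) and B (d × r), so trainable parameters per layer go from d·d to 2·d·r. a quick sketch (d=1024 is just an illustrative layer width, not SDXL's actual dimension):

```python
d = 1024  # illustrative layer width

def lora_params(rank: int) -> int:
    # A is (rank x d), B is (d x rank): 2 * d * rank trainable params
    return 2 * d * rank

full = d * d  # parameters in a full fine-tune of this layer
print(full, lora_params(16), lora_params(32))
```

rank 32 doubles the trainable parameters of rank 16, but both remain a small fraction of the full update - which is why rank mostly trades capacity against overfitting risk rather than against memory.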
results were better, but not great. the model still struggled to photorealistically convert the images. ![[Screenshot 2026-03-13 at 11.36.32 AM.png]]
i worked with claude to discuss options:
1. higher rank, longer training run
2. use TI again, but this time train for longer, with higher rank - this is where we got the most consistent results so far, maybe we just had to ramp it up
3. ramp up learning rate
- validation prompt throughout with image not in training data - 'figure dancing'
consensus:
1. switch back to TI - creates embeddings from scratch, helps models generate net new images in a certain style
2. but this time use pivotal tuning, which is TI + LoRA combined in a single run:
    1. TI handles the trigger word → style mapping (fresh embedding, no prior to fight)
    2. LoRA handles the U-Net visual generation (learning the actual particle rendering)
    3. first 50% of training: both adapt together. last 50%: TI freezes, LoRA refines.
changes made:
1. change optimizer from adamW to prodigy
2. training: --train_text_encoder → --train_text_encoder_ti (pivotal tuning — TI + LoRA)
3. steps: 500 → 1000
4. LR scheduler: cosine → constant (prodigy handles its own adaptation)
5. inference: guidance_scale 7.5 → 5.0, loads TI embeddings, uses learned tokens
adamW vs prodigy:
- adamW has you pick a learning rate, and then applies it
    - momentum: running average of recent gradients. smooths out noise so you don't pogo around
    - variance: running average of squared gradients
- prodigy: "ill figure out the learning rate myself" - does everything adam does but estimates the optimal learning rate from the gradients as training progresses
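the momentum/variance bookkeeping above can be written out as one hand-rolled AdamW step for a single weight (real training uses torch's `AdamW` or prodigy; this is just the update rule, with bias correction included):

```python
def adamw_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # momentum: running average of recent gradients (smooths noise)
    m = b1 * m + (1 - b1) * grad
    # variance: running average of squared gradients (scales the step)
    v = b2 * v + (1 - b2) * grad ** 2
    # bias correction so early steps aren't biased toward zero
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # adamW's decoupled weight decay is added outside the adaptive term
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, grad=0.5, m=m, v=v, t=1)
```

the `lr` argument is the knob prodigy removes: it keeps the same momentum/variance machinery but estimates the step size from the gradients themselves.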
we went with prodigy here to figure out what learning rate was best.
no risks on doing 1000 steps since we have checkpoints at each one.
why are we using TI?
- without it, something like 'prtkl' is encoded into existing subtokens
- then the LoRA we trained has to fight against the existing meanings to generate the image
- i.e. 'pr' might be associated with price, print, etc.
much better! ![[Screenshot 2026-03-13 at 2.47.16 PM.png]]
(1000 steps)
adamW on right, prodigy on left
here the images are actually getting there.
but they still had the wrong background: ![[Screenshot 2026-03-13 at 2.49.55 PM.png]]
no problem though, we just have to specify in prompt white background to ensure it works.
looked through all the checkpoints, for both adamW and prodigy - 1000 had the best quality.
adamw was slightly better at 1000 than prodigy, so went with that.
![[Screenshot 2026-03-13 at 3.16.28 PM.png]]
then decided to train again on 2000 steps adamW to see if it would get better. it was close to what we were looking for, but not fully there YET.
![[Pasted image 20260314103555.png]]
past 1300 the model stopped improving, and starting at 1700 it began introducing blue color and got worse:
1700: ![[Pasted image 20260314103627.png]]
our next options:
1. more training data
2. learning rate - lower it
3. rank - go lower, OR go higher if it's not capturing enough style detail
overall bottlenecked on training data. had to get more samples instead of just 29.
so got 21 more images, THEN adjusted some parameters:
- rank=16
- learning_rate=7e-5
- text_encoder_lr=2e-4
- train_text_encoder_ti_frac=0.3
- max_train_steps=2200
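a quick sanity check on what `train_text_encoder_ti_frac=0.3` means for this run: the TI embedding only trains for the first 30% of steps, then freezes while the LoRA keeps refining (the variable names here are mine, the numbers are from the run above):

```python
max_train_steps = 2200
ti_frac = 0.3  # fraction of the run during which the TI embedding trains

# step after which the TI embedding is frozen; round to avoid float edge cases
ti_freeze_step = round(max_train_steps * ti_frac)
print(ti_freeze_step)  # TI frozen after step 660, LoRA trains for all 2200
```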
![[Screenshot 2026-03-14 at 2.15.55 AM.png]]
a few additional learnings:
1. backgrounds were not all the same color; on the next training run we no longer had the background color issue bc
another place later¶
No, it works for text models too — but the issue is more fundamental: training loss for diffusion models doesn't reliably correlate with output quality.
Here's why:
- What the loss measures: how well the model predicts the noise that was added to a training image. It's a per-pixel MSE on noise predictions.
- What you actually care about: style fidelity, compositional coherence, prompt adherence, not-overfitting. These are perceptual and semantic qualities that the loss function doesn't capture.
A model can have a lower loss because it's memorizing training images pixel-for-pixel — that looks like "improvement" to the loss function but is actually overfitting. Conversely, a slightly higher loss might mean the model is generalizing well across varied prompts.
This is why checkpoint sweeps with visual comparison are the standard approach for diffusion fine-tuning. The loss curve is still useful as a sanity check (we used it to confirm no instability or collapse), but it can't tell you "step 1400 looks better than step 1700."
For text models (LLMs) it's a bit better because perplexity/loss more directly maps to generation quality — but even there, eval benchmarks matter more than raw loss for judging real capability.