blogpost for learning AI this week

learned high level stuff

quantization

how inference works

tradeoffs between different providers

fine-tuning with LoRA for simple image gen

my vision was simple

you type a word, and you get a beautiful pixel image generated - 'love' would show two people hugging, etc.

first i tried to get it working on stable diffusion with just a prompt. i tried a few different prompts and images, but it desperately tried to fill in the figure, and the images were not consistent. to get the model to consistently produce images in the style i wanted, i needed to fine-tune it.

![[Screenshot 2026-03-11 at 3.49.36 PM.png]]

so i worked with claude to come up with a plan.

rough architecture defined on day 1 with claude:

  1. LLM interpretation layer - qwen 2.5 3B for taking a word and turning it into a prompt
  2. fine-tuned stable diffusion model to generate these kinds of images
  3. particle animation engine - p5.js
  4. build it into a web app
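the interpretation layer is conceptually just an instruction template wrapped around the LLM. a minimal sketch - the template text and the fallback are my own illustration, not the exact prompt used:

```python
# hypothetical sketch of the interpretation layer: wrap the user's word
# in an instruction, send it to the LLM, get an image prompt back.
INSTRUCTION = (
    "Turn the word '{word}' into a short, concrete image prompt "
    "for a minimalist pixel-art figure. Respond with the prompt only."
)

def word_to_prompt(word: str, llm=None) -> str:
    """Build the instruction for the LLM; fall back to a template if no model."""
    instruction = INSTRUCTION.format(word=word)
    if llm is not None:
        return llm(instruction)  # e.g. a transformers text-generation pipeline
    # offline fallback so the sketch runs without qwen loaded
    return f"a pixel figure representing {word}, white background"

print(word_to_prompt("love"))
# a pixel figure representing love, white background
```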

the first day, i just wanted to get the interpretation layer and the image model working. no particle animation yet.

iterated on prompts with GPT to get an ideal reference image

![[examplePerfect.png]]

ran a python script to generate 2 variations of each of 88 prompts (176 images total).
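the script was essentially a nested loop over prompts and seeds. a rough sketch with the actual diffusers pipeline call left out (the prompt strings here are placeholders):

```python
# sketch of the batch-generation loop; the real version called a diffusers
# pipeline per (prompt, seed) pair instead of just collecting jobs
def make_jobs(prompts, variations=2):
    # one (prompt, seed) pair per image; 88 prompts x 2 seeds = 176 images
    return [(p, seed) for p in prompts for seed in range(variations)]

prompts = [f"prompt {i}" for i in range(88)]  # placeholder for the 88 real prompts
jobs = make_jobs(prompts)
print(len(jobs))  # 176
```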

then i selected the images i found to be most tasteful.

some were bad...

![[Screenshot 2026-03-11 at 6.58.29 PM.png]] ![[Screenshot 2026-03-11 at 6.57.09 PM.png]] ![[Screenshot 2026-03-11 at 6.57.26 PM.png]]

and some were good...![[Screenshot 2026-03-11 at 6.58.11 PM.png]] ![[Screenshot 2026-03-11 at 6.57.43 PM.png]]

we needed at least 30 images for the fine tuning to be effective. i had to run multiple batches because the prompt was lossy.

I tried updating the prompt, but it honestly turned out worse.

so i had it generate 176 images once more, and selected the ones i liked most

THEN - i fine-tuned with LoRA

steps:

  1. Curate — pair each prompt with its image, and store the keepers in a folder
  2. Install — diffusers, xformers, bitsandbytes for 8-bit adam
  3. Upload data — mount Google Drive, copy your curated/ folder to local for faster I/O
  4. Add trigger word — prepend "prtkl, " to every caption
  5. Download script — the advanced diffusers DreamBooth script (supports per-image captions)
  6. Train — ~1000 steps, rank 32, fp16, all T4 memory optimizations enabled. checkpoints every 250 steps so you don't lose progress; a validation image is generated periodically so you can see progress
  7. Check output — verify files were saved
  8. Test — load the LoRA and generate 5 test prompts that weren't in the training data
  9. Save to Drive — back up the LoRA before Colab disconnects
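steps 1 and 4 are simple enough to sketch. this assumes each image in curated/ has a matching .txt caption file - my actual layout may have differed, so treat the paths as illustrative:

```python
# hypothetical caption prep: each image in curated/ has a matching .txt
# caption, and we prepend the trigger word "prtkl, " to every caption
from pathlib import Path

TRIGGER = "prtkl, "

def add_trigger(caption: str) -> str:
    # avoid double-prepending if the script gets re-run
    return caption if caption.startswith(TRIGGER) else TRIGGER + caption

def prepare_captions(folder: str) -> None:
    for txt in Path(folder).glob("*.txt"):
        txt.write_text(add_trigger(txt.read_text().strip()))

print(add_trigger("figure made of particles, hugging"))
# prtkl, figure made of particles, hugging
```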

reasoning on epochs, learning rate, batch size:

  1. batch size is 1, but effectively 4. the T4 on colab can only fit 1 image per step, but we can simulate a batch of 4 by accumulating gradients across 4 steps.
  2. learning rate is 1e-4. this is standard for LoRA fine-tuning.
  3. steps at 500. initially claude had suggested 1000 steps, but that was too high, so i had it do more research. typical LoRA runs with ~30 images use 4-17 epochs (with 29 images, that equates to ~500 steps).
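the arithmetic behind that number, counting one image pass per step (whether a trainer counts per-image steps or optimizer updates varies by script, so treat this as the rough sanity check it is):

```python
num_images = 29
epochs = 17                  # upper end of the commonly cited 4-17 epoch range
steps = num_images * epochs  # one image pass per step
print(steps)                 # 493, i.e. ~500

# with gradient accumulation of 4, that's ~123 actual optimizer updates
print(steps // 4)            # 123
```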

CLAUDE is not good at training models.....

![[Screenshot 2026-03-12 at 5.04.54 PM.png]]

TI (textual inversion): creates brand new token embeddings for prtkl. the model learns to recognize prtkl and to generate that style of image.

how it works at a high level:

  1. take a real training image
  2. add a known amount of random noise to it
  3. feed the model the noisy image, the caption, and the noise level
  4. the model predicts WHAT noise was added
    1. it has to identify the random splatters and recognize that they're off
  5. compare predicted noise to actual noise and calculate the loss: mean((predicted noise - actual noise)^2)
  6. the gradient is how much each weight would need to change to increase the MSE
  7. THEN we step the opposite way, scaled by the learning rate:
    1. new weight = oldWeight - (learningRate * gradient)
  8. we do this for each weight in the smaller LoRA matrices at every layer - encoder, middle, decoder
  9. this all happens at once per training step
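the steps above can be sketched as a toy numeric example - one scalar weight per pixel instead of a real U-Net, purely to show the noise-prediction MSE and the gradient step (all names here are illustrative):

```python
import random

def training_step(weights, image, noise, lr=0.1):
    # 1-2: add a known amount of noise to the real training image
    noisy = [x + n for x, n in zip(image, noise)]
    # 3-4: toy "model" predicts the noise as weight * noisy_pixel
    pred = [w * x for w, x in zip(weights, noisy)]
    # 5: loss = mean((predicted noise - actual noise)^2)
    n_px = len(image)
    loss = sum((p - n) ** 2 for p, n in zip(pred, noise)) / n_px
    # 6: gradient of the loss w.r.t. each weight:
    #    d/dw [ (w*x - n)^2 / N ] = 2 * x * (w*x - n) / N
    grads = [2 * x * (p - n) / n_px for x, p, n in zip(noisy, pred, noise)]
    # 7: step the opposite way, scaled by the learning rate
    weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, loss

random.seed(0)
image = [random.random() for _ in range(8)]
noise = [random.gauss(0, 1) for _ in range(8)]
w = [0.0] * 8
for step in range(50):
    w, loss = training_step(w, image, noise)
# loss shrinks as the toy model learns to predict the fixed noise sample
```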

during training:

  1. the model overfit.
  2. 500 steps was too much; images looked like this:
  3. it pogoed around way too much

![[Screenshot 2026-03-12 at 9.18.57 PM.png]]

![[Screenshot 2026-03-12 at 9.20.37 PM.png]]

cause:

  1. the TI model tried to learn a new embedding for prtkl from scratch
    1. it was adding new vectors to the text encoder's vocab, which fucked up the model because it had to use special tokens
    2. this conflicted with the LoRA work; prtkl is just a rare token the tokenizer already handles

solution:

- pure LoRA: "prtkl" is just a rare token that SDXL's tokenizer already handles. the LoRA learns "when i see this token in the text, produce the particle style"

then it got even worse: ![[Screenshot 2026-03-13 at 1.22.18 AM.png]] too few steps, and the learning rate was too low.

modified model:

  1. run 500 steps
  2. increased rank to 32
  3. added --train_text_encoder

results were better, but not great. the model still struggled to photorealistically convert the images. ![[Screenshot 2026-03-13 at 11.36.32 AM.png]]

i worked with claude to discuss options:

  1. higher rank, longer training run
  2. use TI again, but this time train for longer, with higher rank
    1. this is where we got the most consistent results so far; maybe we just had to ramp it up
  3. ramp up the learning rate

throughout, we kept a validation prompt with an image not in the training data - 'figure dancing'

consensus:

  1. switch back to TI - creates embeddings from scratch, helps the model generate net-new images in a certain style
  2. but this time use pivotal tuning, which is TI + LoRA combined in a single run:
    1. TI handles the trigger word → style mapping (fresh embedding, no prior to fight)
    2. LoRA handles the U-Net visual generation (learning the actual particle rendering)
    3. first 50% of training: both adapt together. last 50%: TI freezes, LoRA refines.

changes made:

  1. optimizer: adamw → prodigy
  2. training: --train_text_encoder → --train_text_encoder_ti (pivotal tuning - TI + LoRA)
  3. steps: 500 → 1000
  4. LR scheduler: cosine → constant (prodigy handles its own adaptation)
  5. inference: guidance_scale 7.5 → 5.0; loads TI embeddings, uses learned tokens
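stitched together, the run looked roughly like this. a config sketch against the advanced diffusers DreamBooth script - flag names vary between script versions, so double-check against the version you download:

```shell
# illustrative invocation of the advanced diffusers DreamBooth LoRA script;
# flags reflect the changes above, not an exact copy of my command
accelerate launch train_dreambooth_lora_sdxl_advanced.py \
  --instance_data_dir="curated" \
  --optimizer="prodigy" \
  --learning_rate=1.0 \
  --lr_scheduler="constant" \
  --train_text_encoder_ti \
  --rank=32 \
  --max_train_steps=1000 \
  --checkpointing_steps=250 \
  --mixed_precision="fp16"
```

note: with prodigy the convention is to set learning_rate=1.0 and let the optimizer scale it down itself.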

adamW vs prodigy:

- adamW has you pick a learning rate, and then applies it
    - momentum: running average of recent gradients. smooths out noise so you don't pogo around
    - variance: running average of squared gradients
- prodigy: "i'll figure out the learning rate myself"
    - does everything adam does, but estimates the optimal learning rate from the gradients as training progresses
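the momentum/variance bookkeeping is only a few lines. a textbook adamW update on a single weight (not lifted from the training script):

```python
def adamw_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    # momentum: running average of recent gradients (smooths out the pogo)
    m = b1 * m + (1 - b1) * grad
    # variance: running average of squared gradients (per-weight scaling)
    v = b2 * v + (1 - b2) * grad ** 2
    # bias correction for the zero-initialized running averages
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # decoupled weight decay is what makes this adamW rather than plain adam
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, grad=0.5, m=m, v=v, t=1)
# w moves slightly below 1.0; prodigy keeps the same machinery but
# adapts the learning rate on the fly instead of using a fixed 1e-4
```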

we went with prodigy here to figure out what learning rate was best.

no risk in doing 1000 steps, since we save checkpoints along the way.

why are we using TI?

- without it, something like 'prtkl' is encoded into existing subtokens
- the LoRA we trained then has to fight against existing meanings to generate the image
    - i.e. 'pr' might be associated with price, print, etc.

much better! ![[Screenshot 2026-03-13 at 2.47.16 PM.png]]

(1000 steps)

adamW on the right, prodigy on the left

here the images are actually getting there.

but they still had the wrong background: ![[Screenshot 2026-03-13 at 2.49.55 PM.png]]

no problem though, we just have to specify 'white background' in the prompt.

looked through all the steps, for both adamw and prodigy; 1000 had the best quality.

adamw was slightly better at 1000 than prodigy, so went with that.

![[Screenshot 2026-03-13 at 3.16.28 PM.png]]

then decided to train again at 2000 steps with adamW to see if it would get better. it was close to what we were looking for, but not fully there YET.

![[Pasted image 20260314103555.png]]

past 1300 the model stopped improving, and starting at 1700 it began introducing blue color and getting worse:

1700: ![[Pasted image 20260314103627.png]]

our next options:

  1. more training data
  2. learning rate
    1. lower the learning rate
  3. rank
    1. go lower, OR go higher if it's not capturing enough style detail

overall we were bottlenecked on training data. we had to get more samples instead of just 29.

so i got 21 more images, THEN adjusted some parameters:

- rank=16
- learning_rate=7e-5
- text_encoder_lr=2e-4
- train_text_encoder_ti_frac=0.3
- max_train_steps=2200

![[Screenshot 2026-03-14 at 2.15.55 AM.png]]

a few additional learnings:

  1. backgrounds were not all the same color; on the next training run we no longer had the background color issue bc

from another conversation, later:

No, it works for text models too — but the issue is more fundamental: training loss for diffusion models doesn't reliably correlate with output quality.

Here's why:

A model can have a lower loss because it's memorizing training images pixel-for-pixel — that looks like "improvement" to the loss function but is actually overfitting. Conversely, a slightly higher loss might mean the model is generalizing well across varied prompts.

This is why checkpoint sweeps with visual comparison are the standard approach for diffusion fine-tuning. The loss curve is still useful as a sanity check (we used it to confirm no instability or collapse), but it can't tell you "step 1400 looks better than step 1700."

For text models (LLMs) it's a bit better because perplexity/loss more directly maps to generation quality — but even there, eval benchmarks matter more than raw loss for judging real capability.