my vision was simple: get stable diffusion to consistently reproduce abstract anthropomorphic figures made up of particles.
you type a word, and you get a beautiful particle image back. e.g. 'love' would produce the image below.
![[examplePerfect.png]]
the below follows what i did and lessons learned along the way. if you are also planning to fine-tune SDXL with LoRA to do abstract art, this should save you 10+ hours of iteration.
high-level architecture: ![[Screenshot 2026-03-19 at 8.02.04 PM.png]]
see the excalidraw here for more details
## setup
training framework: diffusers' train_dreambooth_lora_sdxl_advanced.py — hugging face's advanced dreambooth + LoRA + TI training script for SDXL.
- TI: creates new token embeddings with no prior meaning
- LoRA: small trainable low-rank matrices we tack onto the SDXL UNet (and text encoders)
- dreambooth: a fine-tuning approach that binds a trigger token to the style you want. here we combine it with LoRA + TI
compute: Modal — runs on an A10G GPU (24GB VRAM). $30 of free credits (as of mar '26), which was more than enough for training + inference.
setup steps:
1. install uv (python package manager)
2. uv run modal setup — authenticates your modal account
3. clone the repo: git clone https://github.com/aaronw122/particleArt
    - or create your own, up to you :)
training data format:
```
images/curated/
├── daily_life_001_v0.png   ← 1024x1024 training image
├── daily_life_001_v0.txt   ← caption: "a figure lying on their side, curled in a sleeping position"
├── gesture_and_action_005_v0.png
└── gesture_and_action_005_v0.txt   ← caption: "a figure bending down to pick something up"
```
each .png gets a matching .txt file with a scene description.
use an image generator (GPT, Midjourney, Stable Diffusion) to create training data. you may find only ~10% of the images acceptable, so generate a LOT to choose from.
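before launching a run, it's worth sanity-checking that every image has a caption. a small helper i'd suggest (not part of the training script):

```python
from pathlib import Path

def find_unpaired(curated_dir):
    """return (images missing captions, captions missing images) for a
    folder of name.png / name.txt training pairs."""
    d = Path(curated_dir)
    pngs = {p.stem for p in d.glob("*.png")}
    txts = {p.stem for p in d.glob("*.txt")}
    return sorted(pngs - txts), sorted(txts - pngs)
```

run it on `images/curated/` before training; an unpaired image silently weakens the dataset.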
run training:
uv run modal run train_modal_adamw.py
download weights:
modal volume get lora-output-adamw-v4 /results/ ./lora_output/
run inference (generates images from multiple checkpoints):
uv run modal run generate_modal_sweep.py
## principles
### 1. generate as much high-quality training data as possible
for something abstract, 30 images isn't enough. you want at least 50, if not more.
see below for how to generate training data
### 2. use gpt OR stable diffusion to create an image prompt for generating training data
by doing this, you also prove to yourself whether you even need to fine-tune the model. ask yourself: can the base model consistently produce what you need from a prompt alone?
if the answer is yes, then you don't need to fine tune.
here's the prompt i used in chatGPT to generate the images:
- "Sparse black particle flecks on a pure white background forming {scene}. Minimal, lots of negative space. No gray tones, no shading — just scattered black dots/flecks suggesting the form. Abstract, not realistic."
but, only 10% of the images met my standards.
since the hit rate was so low, i tried to get GPT to home in on the exact style i was looking for:
- "Sparse black particle flecks on a pure white background. Subject: an abstract human figure {scene}. Only small scattered dots — no lines, no strokes, no outlines. Approximately 200-300 small dots total. Keep 70% of the canvas empty white space. All elements — figures and objects — equally sparse and light. No element bolder or denser than any other."
this (and a couple of iterations after) actually produced fewer acceptable images.
some prompts need to stay vague for now, as gpt doesn't yet have the lexicon to reliably turn more precise instructions into the right images.
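if you go this route, a tiny script makes it easy to churn out prompt variants for a batch of scenes. the template is the first one above; the scene list here is just an example:

```python
# the vague template that worked better than the over-specified one
TEMPLATE = (
    "Sparse black particle flecks on a pure white background forming {scene}. "
    "Minimal, lots of negative space. No gray tones, no shading — just "
    "scattered black dots/flecks suggesting the form. Abstract, not realistic."
)

# example scenes, matching the caption style of the training data
scenes = [
    "a figure lying on their side, curled in a sleeping position",
    "a figure bending down to pick something up",
]

prompts = [TEMPLATE.format(scene=s) for s in scenes]
```

paste each prompt into your generator of choice and curate hard.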
### 3. use TI (textual inversion)!
textual inversion adds new embeddings to the lookup table that start with zero prior meaning.
similar to LLMs, SDXL has a lookup table that maps tokens to vectors:
| token ID | string | embedding |
|---|---|---|
| 0 | ! | [0.023, -0.156, -0.008, ...] |
| ... | | |
| 1542 | dog | [0.192, -0.372, -0.291, ...] |
at the bottom, we add two (or more) fresh rows with random vectors that can be trained during fine-tuning:

| token ID | string | embedding |
|---|---|---|
| ... | | |
| 49410 | \<s0\> | [0.841, -0.725, 0.519, ...] |
| 49411 | \<s1\> | [-0.104, 0.667, -0.291, ...] |
these two token weights get adjusted during fine tuning for a fraction of total steps (controlled by TI frac — in my case, 0.5, so the first half of training). then they freeze and the LoRA matrices continue to update.
- in training, these tokens are mapped to the trigger word TOK. whenever the fine-tuned model sees TOK in a prompt, it activates the freshly trained style embeddings.
in sum:
- without TI, the text encoder has no way to represent 'particle figure art' as a concept, so the LoRA we trained has to fight against the existing meanings of words while generating the image.
- with TI, brand new embeddings are created and the style's meaning is assigned to the trigger word.
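to make the table concrete, here's a toy sketch (pure python, not the real diffusers code) of what adding fresh rows to an embedding table looks like. the token ids and the tiny dimension are illustrative only:

```python
import random

def add_ti_tokens(embeddings, new_tokens, dim=4, seed=0):
    """append fresh rows with random vectors (zero prior meaning) to a
    token-id -> embedding table, as textual inversion does at setup.
    returns the ids assigned to the new tokens."""
    rng = random.Random(seed)
    ids = {}
    for tok in new_tokens:
        token_id = len(embeddings)  # next free row in the table
        embeddings.append([rng.uniform(-1, 1) for _ in range(dim)])
        ids[tok] = token_id
    return ids

# tiny stand-in vocabulary: rows 0..49409 play the role of SDXL's existing tokens
table = [[0.0] * 4 for _ in range(49410)]
new_ids = add_ti_tokens(table, ["<s0>", "<s1>"])
# new_ids == {"<s0>": 49410, "<s1>": 49411}
```

during the TI phase, only these appended rows receive gradient updates; everything else in the table stays frozen.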
### 4. you will likely need to iterate on learning rate, rank, TI frac, and other params before you get the right outcome
here is a table with the params i finalized through many iterations with codex and claude:
| Param | Value | Why |
|---|---|---|
| LR (UNet) | 9e-5 | Balances style binding vs overfitting. 7e-5 was too weak, 1e-4(standard) overfit. |
| text_encoder_lr | 2.5e-4 | Strong token-style attachment, slightly conservative for 50-image dataset. |
| scheduler | constant | Cosine let SDXL's priors creep back in late training. Constant maintains pressure. |
| warmup | 100 | Less warmup = learns style sooner. |
| TI frac | 0.5 | TI needed more steps to learn the full style concept; 0.3 was too few. |
| max_steps | 1900 | Higher LR converges faster. Peak expected at 1450-1700. |
| noise_offset | 0.0357 | Fixes diffusion's luminance bias — needed for pure white/black. |
| mixed_precision | bf16 | fp16 caused color rounding errors (blue dot artifacts). bf16 more stable. |
| rank | 32 | 16 did not produce a large enough matrix for the model to effectively learn the new style |
rank determines the size of the LoRA matrices we staple onto SDXL: the greater the rank, the more capacity to capture style nuance, but also more risk of overfitting.
this was the outcome of 5+ rounds of iteration on the training runs. each of them i ran into different issues and had to optimize.
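to make the table actionable, here's roughly how those values map onto the script's CLI flags. the flag names are assumed from diffusers' train_dreambooth_lora_sdxl_advanced.py and may differ in the version you cloned, so double-check before running:

```shell
# assumed flag names -- verify against your copy of the script
accelerate launch train_dreambooth_lora_sdxl_advanced.py \
  --learning_rate 9e-5 \
  --text_encoder_lr 2.5e-4 \
  --lr_scheduler constant \
  --lr_warmup_steps 100 \
  --train_text_encoder_ti \
  --train_text_encoder_ti_frac 0.5 \
  --max_train_steps 1900 \
  --noise_offset 0.0357 \
  --mixed_precision bf16 \
  --rank 32 \
  --checkpointing_steps 100
```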
### 5. (imo) modal is best rather than running locally OR using google colab
i frequently ran into rate limiting issues with google colab, and there was not enough VRAM on Colab's T4 GPU (15GB) for SDXL LoRA training. Modal's A10G was much better (24GB VRAM). see setup section above for how to get started.
### 6. develop consistent evaluation criteria before any model training
- create a testing suite for tracking progress.
- i would recommend 1-2 novel prompts that were not part of the training data to ensure the style generalizes well. then 4-5 more to stress test range once you're confident in the style.
- add checkpoints in training and look at images at each one.
- every 100 steps is usually a good baseline.
- checkpoints also ensure you can resume training if something fails OR go back to older weights.
- run for more steps than you expect — it is not as simple as text models where you can optimize for loss. you care about perceptual qualities that loss can't capture.
- ![[Screenshot 2026-03-12 at 9.20.37 PM.png]]
- a loss curve like this is normal
- evaluating image models is more subjective. i evaluated myself with criteria i built up over time:
- does the overall style match what i want?
- is the background correct? (mine kept drifting off-white)
- are colors accurate? (got blue dots instead of black)
- is the rendering style right? (3D when i wanted 2D)
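some of these checks can even be scripted. here's a rough sketch (my own helper idea; the thresholds are guesses to tune against your data) that flags off-white backgrounds and blue-tinted dots given a list of (r, g, b) pixels:

```python
def check_palette(pixels, white_thresh=245, min_white_frac=0.7):
    """flag two failure modes from the criteria above:
    - background drifting off-white (too few near-white pixels)
    - dots rendering blue instead of black"""
    n = len(pixels)
    white = sum(1 for r, g, b in pixels if min(r, g, b) >= white_thresh)
    blue = sum(1 for r, g, b in pixels
               if b > r + 30 and b > g + 30 and max(r, g, b) < white_thresh)
    return {
        "white_frac": white / n,
        "background_ok": white / n >= min_white_frac,
        "blue_dot_pixels": blue,
    }
```

with Pillow you can feed it `list(img.convert("RGB").getdata())` for each checkpoint's output and spot drift without eyeballing every grid.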
### 7. use negative prompts at inference
negative prompts steer the model away from unwanted qualities. here's what i used at inference:
prompt: "TOK, a figure [doing X], white background"
negative prompt: "photorealistic, detailed, shading, gradient, gray, color, dense, beige, tan, sepia, parchment, warm tones, blue, colored dots, 3D, lighting"
note: negative prompts are only used at inference, not during training. training captions are just the scene descriptions shown above.
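to keep eval prompts consistent across checkpoints, i'd wrap the pair in one helper. this is a sketch; `pipe` in the usage comment is assumed to be your already-loaded SDXL pipeline with the LoRA and TI weights applied:

```python
NEGATIVE = (
    "photorealistic, detailed, shading, gradient, gray, color, dense, beige, "
    "tan, sepia, parchment, warm tones, blue, colored dots, 3D, lighting"
)

def particle_prompt(scene, trigger="TOK"):
    """assemble the positive/negative prompt pair used at inference."""
    return {
        "prompt": f"{trigger}, a figure {scene}, white background",
        "negative_prompt": NEGATIVE,
    }

# usage with a loaded SDXL pipeline (assumed):
# image = pipe(**particle_prompt("reading a book"), num_inference_steps=30).images[0]
```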
### 8. a CLAUDE.md for the project you can use
https://github.com/aaronw122/particleArt/blob/main/CLAUDE.md
## brief summary of my journey
### 1. generated + curated sample image data
- this was a <10% hit rate. had to generate >300 images to produce 30 samples
### 2. tried fine-tuning using LoRA + textual inversion (TI), produced mid images
![[Screenshot 2026-03-12 at 9.18.57 PM.png]] selection criteria was "does it match the aesthetic i am going for in the image above"
### 3. claude/codex said the model overfit at 500 steps, and we should stop using TI
- i regretfully followed their advice, and did pure LoRA. this produced disastrous outcomes: ![[Screenshot 2026-03-13 at 1.22.18 AM.png]]
### 4. switched back to using TI, ran for 1000 steps with a rank of 32
![[Screenshot 2026-03-13 at 2.47.16 PM.png]] some of the images generated had the wrong background. this was because 32% of my training data had a slightly off-white background.
### 5. trained again for 2000 steps with the images fixed so they all had white backgrounds
model quality peaked at step 1300, but i wanted more out of it ![[Pasted image 20260314103555.png]] see the white splotches in some areas.
### 6. curated 20 more images for a 4 hr training run, ran it at 2am on March 14
unfortunately i had a timeout set for 2 hours in my code, so it did not complete
### 7. re-ran with a checkpoint at halfway
![[Screenshot 2026-03-14 at 12.25.42 PM.png]] this felt good.
### 8. better results, but the model was still pulling from its base styling:
![[Screenshot 2026-03-14 at 2.39.13 PM.png]]
see how the figure here is 3d instead of 2d
to resolve the issue, i worked with codex and increased the rank so it would capture more nuanced style differences.
### 9. increased rank was better, more 2d, BUT was using blue dots instead of black/grey
![[Screenshot 2026-03-15 at 3.51.08 PM.png]]
worked with codex/claude and identified two fixes:
1. added noise_offset (0.0357) to fix diffusion's luminance bias — without it, the model couldn't produce true white/black.
2. switched to BF16 instead of FP16, since BF16 is more stable for extremes like pure white + black dots.
### 10. images finally looking good at step 1800!
![[Pasted image 20260319202226.png]]
![[Pasted image 20260319202233.png]]
## appendix
- github: https://github.com/aaronw122/particleArt