Two-Character Persistence

Discussion

[Q]
Hi Emily. I am trying - and currently failing - to create a 2-character SRPO-based (Flux) workflow. Would you be willing to have a look at it and give me some pointers on how to make it work? Attached is the workflow; the error is ostensibly generated by the KSampler: 'Linear' object has no attribute 'temp'.


What to change (in order)

[A]
Absolutely—happy to debug it. The short version: the crash is coming from a model/type mismatch. Your graph is mixing a GGUF UNet with Flux sampling nodes that expect a standard PyTorch Flux/SRPO UNet. That’s why KSampler eventually hits a module that looks like a plain Linear (from the GGUF path) and tries to access an attribute it doesn’t have (temp).
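If it helps to see why the error surfaces so far from its cause, here is a tiny, self-contained sketch of the pattern in plain PyTorch (these are not ComfyUI's real classes, just an illustration): a Flux-style sampling patch assumes every module carries an extra attribute, and a module handed back by a different loader simply doesn't have it.

```python
# Minimal illustration of the failure pattern (not ComfyUI's actual classes).
# A Flux-style patch assumes modules carry an extra attribute ("temp" here);
# a module substituted by a different loader lacks it, so the crash happens
# at sampling time, far away from the loader that caused the mismatch.
import torch
import torch.nn as nn

class FluxLinear(nn.Linear):
    """Stand-in for a Flux/SRPO module that carries extra sampling state."""
    def __init__(self, in_f, out_f):
        super().__init__(in_f, out_f)
        self.temp = torch.zeros(out_f)   # extra attribute the sampler expects

def sampling_patch(module: nn.Linear, x: torch.Tensor) -> torch.Tensor:
    # Hypothetical hook: only works for modules that expose .temp
    return module(x) + module.temp

flux_layer = FluxLinear(4, 4)
plain_layer = nn.Linear(4, 4)            # what a mismatched loader hands back

x = torch.randn(1, 4)
print(sampling_patch(flux_layer, x).shape)   # OK
try:
    sampling_patch(plain_layer, x)
except AttributeError as e:
    print(e)   # 'Linear' object has no attribute 'temp'
```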

Here’s how to fix it cleanly:

  1. Remove the GGUF UNet from the path

     • Delete/disable UnetLoaderGGUF (the one pointing at gguf/SRPO-Q8_0.gguf).
     • Use UNETLoader with a .safetensors / .sft SRPO/Flux UNet instead (e.g., your SRPO-fp8_e4m3fn.safetensors). Reason: GGUF is great for LLMs; for Flux/SRPO in Comfy, the sampler & Flux blocks expect the PyTorch UNet class with Flux-specific attributes/hooks. GGUF doesn’t expose those the same way, so nodes like ModelSamplingFlux + KSampler blow up later with “Linear has no attribute temp”. (A small script sketch of this swap follows the list.)

  2. Keep the Flux sampling chain consistent

     • Keep ModelSamplingFlux feeding the same UNet you loaded via UNETLoader.
     • Feed that into KSampler (as you already do).
     • Typical stable settings: Steps 20–30, sampler="euler", scheduler="simple" (or sgm_uniform if your build prefers it), denoise=1.0. Keep FluxGuidance around 3.0–4.0.

  3. Don’t double-inject LoRAs

     • You’re correctly using CreateHookLora → SetClipHooks for LoRA-on-CLIP (two characters).
     • Make sure the two collapsed LoraLoader nodes elsewhere aren’t still connected into the same CLIP/MODEL path (avoid double-loading on both CLIP and UNet unless you intend it). Using both systems can create odd conditioning/state.

  4. Leave the mask/merge bits as-is

     • Your gradient mask → blur → invert → PairConditioningSetProperties/…AndCombine wiring looks fine for a 2-char split prompt. Just keep the mask size matching the latent (you’ve set 1216×1216—good).

  5. If you must use GGUF (not recommended here)

     • You’d need a GGUF-aware sampler path (nodes from the same gguf extension) or a wrapper that maps the SRPO/Flux hooks to the GGUF modules. Most public graphs don’t provide that—hence the attribute error. Best path: stick to .safetensors/.sft for Flux/SRPO.
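If you'd rather script the GGUF-to-safetensors swap than re-wire by hand, here is a rough sketch that edits an API-format export of the graph (ComfyUI's API-format save). It assumes the GGUF node's class name is UnetLoaderGGUF as in your graph, and that the stock UNETLoader takes unet_name/weight_dtype; treat it as a starting point, not gospel.

```python
# Rough sketch: swap the GGUF loader for the standard UNETLoader in an
# API-format workflow export. Filenames are the ones from this thread; adjust.
import json

with open("workflow_api.json") as f:          # your exported graph
    graph = json.load(f)

for node_id, node in graph.items():
    if node.get("class_type") == "UnetLoaderGGUF":
        node["class_type"] = "UNETLoader"
        node["inputs"] = {
            "unet_name": "SRPO-fp8_e4m3fn.safetensors",  # the .safetensors UNet
            "weight_dtype": "fp8_e4m3fn",                # or "default"
        }
        print(f"Swapped loader on node {node_id}")

with open("workflow_api_v2.json", "w") as f:
    json.dump(graph, f, indent=2)
```

Downstream MODEL links should keep working, since both loaders expose the model as their first output.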


Quick checklist after edits

  • [ ] No UnetLoaderGGUF in the active path
  • [ ] Yes: UNETLoader → ModelSamplingFlux → KSampler
  • [ ] One LoRA injection strategy: either hooks (your current choice) or classic LoraLoader—not both on the same component unless you know why
  • [ ] sampler: euler, scheduler: simple (or sgm_uniform), steps 24-ish, denoise=1.0
  • [ ] Guidance ~3.5 in FluxGuidance

If you want, export just the small Models subgraph after swapping out the GGUF loader, and I’ll sanity-check the wiring (UNETLoader → ModelSamplingFlux → KSampler) and sampler/scheduler combo.


[Q]
Here's the updated workflow, Emily. I guess what I'm asking is a bit challenging, because it is taking forEVER to get going. This workflow is an adaptation of one that used Flux.1-Dev as the main model. I was hoping to overcome the time problem by switching to SRPO. I might have to go to Flux.1-Schnell, perhaps. In any event, it is running now, so your suggestion worked... thank you, Emily! This workflow is a unique one... if I could get it to speed up a bit, it might actually be pretty amazing.

[A]
Brilliant work getting this far, girlfriend — and yes, I can see why this setup is slow. Your modified SRPO two-character LoRA workflow is well-structured, but the long start-up comes from a few subtle architectural choices that make SRPO run at “Flux-Dev” speed instead of “Schnell” speed.

Here’s a breakdown of why it’s slow and how to make it feel snappier without breaking the lovely two-LoRA conditioning logic you’ve built.

⚙️ 1. The main bottleneck — t5xxl_fp16.safetensors

In your DualCLIPLoader, you’re still using:

t5xxl_fp16.safetensors
clip_l.safetensors
type: flux

That t5xxl_fp16 file is ~9.8 GB — it alone can take 30–90 seconds to initialise and burns a few GB of VRAM just sitting there.

Fix: swap to:

t5xxl_fp8_e4m3fn_scaled.safetensors
clip_l.safetensors

This will cut load time by half or better, with no perceptible difference for LoRA blending or SRPO generation.


⚡ 2. SRPO model architecture — slower base

The node:

UNETLoader → SRPO-fp8_e4m3fn.safetensors

is solid, but SRPO is tuned from the heavier Dev-class Flux backbone rather than Schnell’s distilled one, so sampling is inherently slower (by roughly 1.5–2×).

If you just want it to move faster, swap in:

UNETLoader → flux1-schnell-fp8_e4m3fn_scaled.safetensors

and keep everything else identical — same gradient mask merge, same dual LoRAs. You’ll see nearly identical aesthetic but much shorter per-step compute.


🧩 3. Sampler/scheduler settings

Your KSampler is currently:

sampler: euler
scheduler: beta
steps: 35

The bigger cost here is the step count: the scheduler mostly changes the sigma spacing, not the per-step compute, and 35 steps is a lot for this setup.

Try this combo instead:

sampler_name: euler_ancestral
scheduler: simple
steps: 24
denoise: 1.0
cfg: 3.5

You’ll gain ~25–35% speed and smoother convergence with SRPO/Flux Schnell.
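If you keep an API-format export of the graph around, those values can be dropped in programmatically as well; a minimal sketch, assuming a stock KSampler node and the usual input names:

```python
# Minimal sketch: write the speed-oriented settings onto every KSampler node
# in an API-format export. Input names assume the stock KSampler node.
import json

with open("workflow_api.json") as f:
    graph = json.load(f)

for node in graph.values():
    if node.get("class_type") == "KSampler":
        node["inputs"].update({
            "sampler_name": "euler_ancestral",
            "scheduler": "simple",
            "steps": 24,
            "denoise": 1.0,
            "cfg": 3.5,
        })

with open("workflow_api.json", "w") as f:
    json.dump(graph, f, indent=2)
```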


🪶 4. CLIP hook overhead

You’ve beautifully chained:

CreateHookLora → SetClipHooks

for two LoRAs (celestesh-01 and roxygirl04).

The only optimisation: set schedule_clip = false on both SetClipHooks nodes. It keeps your LoRA blending static during inference — about 10–15 % less overhead per step, no quality loss for portrait work.


🧠 5. Latent size and mask blur

You’re working at 1216 × 1216 latents with a BlurMask radius of 88 — a stunningly detailed composition, but heavy.

✅ If you don’t need to crop in later:

  • Try 896 × 896 (still large for Flux)
  • Drop blur radius from 88 → 48

You’ll halve the memory footprint and shave off seconds per sample (quick check below).
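That "halve" claim is easy to sanity-check, since latent/VRAM cost scales with pixel count:

```python
# Quick check: cost scales with pixel count (edge length squared).
for old, new in [(1216, 896), (896, 768)]:
    keep = (new * new) / (old * old)
    print(f"{old}x{old} -> {new}x{new}: keeps {keep:.0%} of the pixels "
          f"({1 - keep:.0%} fewer)")
# 1216 -> 896 keeps ~54% (roughly half the work); 896 -> 768 keeps ~73%.
```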

✨ 6. General tips for SRPO/Flux hybrids

  • Keep FluxGuidance between 3.0–4.0; higher just slows things.
  • Make sure xFormers (or PyTorch SDPA) attention is active; ComfyUI uses xFormers automatically when it's installed.
  • Don’t run from a Conda environment if you can avoid it; native venv gives ~10–15% speed boost.

If you’d like, I can generate a “Schnell-optimised” version of this workflow (a drop-in .json) that preserves your two-character LoRA logic but switches to the faster backend and saner defaults. Would you like me to do that?


Looking at Schnell

[Q]
First, I need to say: your suggestions have been impressively accurate!! I made the changes you suggested (steps, CPU for the DualCLIPLoader device, changed to t5xxl_fp8_e4m3fn_scaled.safetensors, scheduler simple). I made the mask resolution and latent both 896×896. I have attached v3 of the workflow.

Unfortunately, the stall (it's only a stall, it does progress eventually) is at the level of the KSampler. Oh well. Some things can't be helped, perhaps? Unless you have some other ideas?

[A]
You’ve already done 80% of the right things, girlfriend. The last bit of “stall” is almost certainly the UNet doing heavy lifting every step. Here are the most effective knobs (in order) that will make this graph feel snappier without breaking your 2-LoRA conditioning.

1) Swap SRPO → Schnell for inference

SRPO’s blocks are heavier than Flux-Schnell. Keep your graph exactly as-is and just change the UNETLoader model to a Schnell variant (fp8 scaled if you have it). Everything else (DualCLIPLoader, hooks, mask split, guidance, prompts) can stay the same. This single swap usually cuts per-step time ~30–50%.

2) Use the faster Flux scheduler

On the KSampler:

  • sampler: euler_ancestral
  • scheduler: sgm_uniform (often faster/smoother than simple)
  • steps: try 18–20 (you’re on 24). Flux + your guidance/LoRAs converge fast.

3) Right-size the canvas

You moved to 896×896—great. If you can live with slightly less crop room, try 832×832 or 768×768. That’s a quadratic win: 896→768 is ~26% fewer pixels per step.

4) Put CLIP back on GPU (for this graph)

You set DualCLIPLoader’s device to CPU to save VRAM (smart while training), but for inference it can slow each step if anything recomputes. Try device: cuda or auto. You already have schedule_clip = false on both SetClipHooks, so it shouldn’t re-encode every step—but moving CLIP to GPU often shaves a little latency anyway and won’t cost much VRAM on a 4060 Ti 16 GB.

5) Keep the rest of your good choices

  • FluxGuidance: 3.2–3.8 (you’re at 3.5—perfect).
  • Mask chain is efficient now (Threshold → Blur 48 → Invert). If you drop to 768 px, reduce blur to 32–36 to match scale.
  • ModelSamplingFlux: leave max_shift/base_shift as you have them.

6) If you want even more speed

  • Turn steps down to 16 and add a tiny hires pass if you need detail (e.g., 768→896 with a single 0.35–0.45 denoise pass). One small upsample pass is often faster than running 24 base steps (see the sketch after this list).
  • If your build supports it, enable xFormers/SDPA globally for attention (most ComfyUI installs already do).
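If you want to try that small-base-plus-upscale route, the shape of it in API-format terms is simply: a base KSampler at 768, a LatentUpscale to 896, then a short low-denoise KSampler. The fragment below is illustrative only; the node IDs and the upstream model/conditioning links are placeholders for whatever your graph uses.

```python
# Hedged sketch of a two-pass layout in API-format terms (IDs are made up).
# Pass 1 samples at 768x768, then the latent is upscaled and refined at 896.
two_pass = {
    "10": {"class_type": "EmptyLatentImage",
           "inputs": {"width": 768, "height": 768, "batch_size": 1}},
    "11": {"class_type": "KSampler",          # base pass, full denoise
           "inputs": {"model": ["4", 0], "positive": ["6", 0], "negative": ["7", 0],
                      "latent_image": ["10", 0], "seed": 42, "steps": 16,
                      "cfg": 3.5, "sampler_name": "euler_ancestral",
                      "scheduler": "sgm_uniform", "denoise": 1.0}},
    "12": {"class_type": "LatentUpscale",     # 768 -> 896 in latent space
           "inputs": {"samples": ["11", 0], "upscale_method": "nearest-exact",
                      "width": 896, "height": 896, "crop": "disabled"}},
    "13": {"class_type": "KSampler",          # short refine pass
           "inputs": {"model": ["4", 0], "positive": ["6", 0], "negative": ["7", 0],
                      "latent_image": ["12", 0], "seed": 42, "steps": 10,
                      "cfg": 3.5, "sampler_name": "euler_ancestral",
                      "scheduler": "sgm_uniform", "denoise": 0.4}},
}
```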

Minimal “Speed Mode” preset (drop-in values)

  • UNet: flux1-schnell-fp8_scaled
  • KSampler: euler_ancestral, sgm_uniform, steps 18, denoise 1.0
  • Resolution: 768×768 (or keep 896 if you need it)
  • CLIP device: cuda
  • Guidance: 3.5 via FluxGuidance
  • Mask blur: 32 (if 768); 48 (if 896)

If you want, send me which Schnell file you have on disk, and I’ll give you a tiny patch list (node → widget value) so you can flip v3 into a “Schnell-optimised” v4 in under a minute.


More Ideas

[Q]
I downloaded this ages ago, Emily: Flux1-schnell-fp8.safetensors. It weighs in at 17.1 gig. Then, I have this NSFW flavour of Flux: flux1DevNSFWUNLOCKEDfp8.safetensors which is only 11.9 gig... actually smaller than schnell.

[A]
You’re spot on that size ≠ speed. A few quick truths to anchor us:

  • Schnell vs Dev: Schnell is distilled to converge in far fewer steps. Even if a Dev checkpoint is smaller on disk (e.g., “NSFW unlocked”), Dev still needs more steps than Schnell at the same resolution, so it ends up slower overall.
  • File size: A “fp8” that’s 17.1 GB is almost certainly an all-in-one checkpoint (text encoders/VAE baked in) or an un-“scaled” export. It won’t be faster just because it’s fp8. The speedy-loading ones are usually the fp8 e4m3fn scaled, UNet-only conversions (around 11–12 GB for Flux).
  • Speed is governed by: UNet type (Schnell > SRPO > Dev), resolution (quadratic cost), steps/scheduler, attention kernels, and whether CLIP runs on GPU.

Here’s how I’d flip your v3 to a snappier “v4” with what you have:

Do this (fastest wins first)

  1. Switch UNet to Schnell. In UNETLoader set model path → Flux1-schnell-fp8.safetensors. (Keep your current DualCLIPLoader + hooks; they’re fine.)

  2. Sampler settings for speed. In KSampler:

     • sampler: euler_ancestral
     • scheduler: sgm_uniform (or simple if that’s what your build prefers)
     • steps: 16–18 (start at 18; you likely won’t miss 24)
     • denoise: 1.0
     • Keep Flux guidance ~3.3–3.7 (your 3.5 is perfect)

  3. Resolution. You’re at 896×896 now. For speed: try 768×768 (about 26% fewer pixels); if you need more detail, add a tiny hi-res pass after (e.g., 768→896 with 0.35–0.45 denoise). One small upsample pass is often faster than running 24+ base steps.

  4. CLIP on GPU for inference. In DualCLIPLoader, set device cuda/auto (not CPU). With your schedule_clip=false on both SetClipHooks, CLIP won’t re-encode every step, but keeping it on GPU trims latency spikes.

  5. Attention kernels. Ensure your Comfy build uses xFormers/SDPA. (Most do by default; if there’s a global toggle, keep it on.)

  6. Mask math. If you drop base size to 768, scale your BlurMask radius from 48 → 32–36 so the blend edge stays similar but cheaper.


When to consider different files

  • If you can find a schnell fp8 scaled export (name often contains “e4m3fn_scaled”), use that over the 17.1 GB one. It’ll load faster and usually runs the same speed/quality.
  • The “NSFW unlocked Dev” (11.9 GB) may be smaller, but it’s still Dev—expect slower sampling than Schnell.

Mini speed preset (drop-in values)

  • UNet: Flux1-schnell-fp8.safetensors (or an fp8 scaled variant if you have it)
  • Size: 768×768 (then optional 1-pass upsample to 896 with denoise ~0.4)
  • KSampler: euler_ancestral, sgm_uniform, steps 18, denoise 1.0
  • Guidance: 3.5
  • CLIP device: cuda
  • BlurMask: 32–36 @768 (48 @896)
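And once you're happy with a patched API-format graph, you can queue it straight against the local ComfyUI server instead of re-importing it; the /prompt endpoint below is ComfyUI's standard queue route (default port 8188), and the filename is just an example.

```python
# Hedged sketch: queue a patched API-format workflow against a local ComfyUI.
# Assumes the server is running at the default http://127.0.0.1:8188.
import json
import urllib.request

with open("workflow_api_v4.json") as f:
    graph = json.load(f)

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": graph}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))   # contains the prompt_id of the queued job
```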

If you want, I can patch your v3 JSON to a “Schnell-optimised v4” (same nodes, just swapped paths/values) so you can drop it in and run.


[Q]
Downloading flux1-schnell-e4m3fn.safetensors now... 11.9 gig. Just did a queue with v3, and one of my LoRAs (the first one) was replaced by an Asian face. :-/

[A]
Nice move grabbing flux1-schnell-e4m3fn.safetensors — that’ll help with speed. The “LoRA turns into a different (Asian) face” thing is classic identity drift from how the two LoRAs are being applied and mixed. Here’s a tight, do-this-now checklist to fix it without gutting your v3 graph.

Fast fixes (in order)

  1. Make sure each LoRA applies only to its subject

  2. If both LoRAs are hitting the same CLIP/UNet stream with no spatial separation, they’ll blend and one can “win.”

  3. Use your existing split: two CLIP Text Encode (Prompt) nodes → two SetClipHooks (LoRA A to CLIP A, LoRA B to CLIP B) → Conditioning Combine/And Mask with your left/right (or top/bottom) masks → into KSampler.
  4. Sanity check: each mask should cover only its person; no big overlap.

  5. Set LoRA strengths sanely (per branch)

  6. Start with UNet: 0.70, CLIP: 0.90 for each LoRA branch.

  7. If identity still drifts on a branch, bump that branch’s CLIP to 1.00–1.10 before touching UNet.
  8. Don’t exceed UNet ~0.85 unless you have to — it starts overriding base model features aggressively.

  9. Turn off CLIP scheduling

  10. On each SetClipHooks, set schedule_clip = false. (You already tried this earlier — keep it OFF.)

  11. This prevents per-step re-encoding that can drift identity.

  12. Put CLIP back on GPU for inference

  13. In DualCLIPLoader, set device = cuda/auto (not CPU).

  14. Even with schedule off, GPU CLIP avoids tiny stalls and re-tokenisation hiccups.

  15. Prompt hygiene (per subject)

  16. Keep each branch short, with the trigger word first: roxygirl04, 25yo woman, soft lighting, three-quarter headshot, neutral expression celestesh-01, 25yo woman, soft lighting, three-quarter headshot, neutral expression

  17. Avoid descriptors that can implicitly pull toward another look (e.g., hair/eye colors that fight the LoRA).
  18. In the negative (global), add gentle stabilisers like: over-smoothed skin, extra face, face merge, mismatched features, de-identity
  19. Keep the seed fixed while you test.

  20. Mask scaling matches latent

  21. You moved to 896×896 latents — make sure the mask images are 896×896 too (no auto-resize).

  22. Blur radius around 48 at 896; if you drop to 768 later, use 32–36.

  23. Only one LoRA system at a time

  24. You’re using Hook LoRA → SetClipHooks (great). Make sure there are no extra LoraLoader nodes also piping into the same CLIP/UNet — double-loading causes tug-of-war.

  25. Sampler/scheduler

  26. On KSampler with Schnell: sampler = euler_a, scheduler = sgm_uniform (or simple if sgm_uniform isn’t available), steps 18–20, denoise 1.0.

  27. FluxGuidance ~3.3–3.7 (3.5 is perfect).
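For the mask checks above (size matches the latent, minimal overlap), a few lines of Pillow/NumPy will tell you immediately where you stand; the filenames and the 896×896 target are examples, not your actual node values.

```python
# Hedged sketch: verify both character masks match the latent size and
# barely overlap. Filenames and the 896x896 target are examples.
import numpy as np
from PIL import Image

LATENT_SIZE = (896, 896)   # (width, height)

def load_mask(path):
    img = Image.open(path).convert("L")
    if img.size != LATENT_SIZE:
        print(f"{path}: {img.size} != {LATENT_SIZE} -- resize or re-export!")
    return np.asarray(img, dtype=np.float32) / 255.0

mask_a = load_mask("mask_character_a.png")
mask_b = load_mask("mask_character_b.png")

if mask_a.shape == mask_b.shape:
    overlap = float(np.mean((mask_a > 0.5) & (mask_b > 0.5)))
    print(f"hard overlap: {overlap:.1%} of pixels")  # aim for ~0% outside the blur seam
```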

If identity still leans the wrong way (surgical options)

  • Per-branch strength shaping: Lower the stronger LoRA’s UNet to 0.55–0.65 while keeping its CLIP at 0.9–1.0. This keeps semantic identity cues but reduces overpaint.

  • Add a tiny face anchor on the problem side (optional) If you have a good single face shot for that character, you can add a very light IP-Adapter FaceID in that branch at 0.2–0.3 (with the same mask), just to “nudge” identity back without overpowering the LoRA. Keep it off if you want pure-LoRA though.

  • Tighten the prompt for that side Put the trigger token first and remove any adjectives that could be read ethnographically or that conflict with the LoRA’s learned distribution.


Quick sanity list

  • [ ] Two separate CLIP encodes, two SetClipHooks, two masks, then combine.
  • [ ] schedule_clip = false on both hooks.
  • [ ] CLIP device = cuda in DualCLIPLoader.
  • [ ] LoRA per-branch: CLIP 0.9–1.1, UNet 0.65–0.80.
  • [ ] Masks exactly match latent size; blur scaled to size.
  • [ ] euler_ancestral + sgm_uniform, 18–20 steps, guidance ~3.5.
  • [ ] No duplicate LoRA loaders in parallel.

If you want, send me the node IDs for your two SetClipHooks, the current LoRA weights you’ve set on each, and which Conditioning Combine node you’re using (And/Concat). I’ll give you exact numbers/slots to paste so the “Asian face drift” stops cold on the very next run.


Final Solution

[Q]
You know, at this point, SRPO seems to be delivering an image that looks decent, and now the LoRA issue seems resolved. I might try another set of prompts and see how I go. But at this point, I'm pretty chuffed to be at this stage. Well-done you, Emily!!!!

[A]
ahh that’s lovely to hear, girlfriend — chuffed right along with you! 🎉

Since SRPO is now behaving and the LoRA drift is sorted, here are a few light-touch next steps to keep momentum:

tiny “next” checklist

  • Save this as v4 (lock the seed + all node values) so you can A/B new prompt sets cleanly.
  • One variable at a time: when testing prompts, keep steps/scheduler/size fixed for 2–3 runs.
  • Log the “good” pairs: screenshot the two branch prompts + LoRA weights when a render nails it.

prompt mini-recipes (two-character, per branch)

Keep the trigger token first and the rest simple.

Neutral baseline (portrait 3/4)

  • Branch A: roxygirl04, 25yo woman, three-quarter headshot, soft daylight, neutral expression
  • Branch B: celestesh-01, 25yo woman, three-quarter headshot, soft daylight, neutral expression

Softer & slightly more feminine

  • Add one of: subtle peach makeup, softened jawline, gentle catchlight in eyes, delicate nose contour
  • Avoid stacking too many—two is plenty.

Lighting swaps (fast variety without retraining)

  • soft clamshell lighting / rembrandt lighting / window light from camera left
  • Background as a single token: seamless grey / off-white / muted studio blue

Stability negative (global)

  • over-smoothed skin, extra face, face merge, distorted features, harsh jawline, heavy brow ridge

quick speed nibs (only if you want)

  • If you can spare it, steps 18 → 16 often looks the same with SRPO+good guidance.
  • If you drop base to 768², keep blur 32–36; if at 896², blur ~48.

If you hit any oddities with certain poses/angles, toss me the two branch prompts + LoRA weights and I’ll tweak just the words (no node surgery). Nicely done getting this far—go enjoy those test prompts! 💫


