SeedVR2 vs Flux2 Klein 9B: Which AI Upscaler Wins Across 264 Production Image Pairs?
SeedVR2 vs FLUX.2 [klein] 9B: blind evaluation of 264 image pairs by 13 professional evaluators across 4 quality dimensions. SeedVR2 wins naturalness at 91%, sharpness at 83%. Exception: AI-generated textures at close range.
When professional evaluators pick one model in 91% of naturalness comparisons and the other in 1%, you're not reading a benchmark. You're reading a verdict.
That's what happened when we ran SeedVR2 against FLUX.2 [klein] 9B — two very different bets on what AI upscaling should do. SeedVR2 is a video restoration model by ByteDance — applied here to still image upscaling — built around a one-step diffusion adversarial approach optimized for perceptual coherence. FLUX.2 [klein] 9B is a 9B-parameter rectified flow transformer by Black Forest Labs, step-distilled to 4 inference steps, applied in this benchmark via a tiling workflow for high-resolution image processing.
On paper, the tradeoff sounds interesting. In practice, the gap is wider than expected, with one exception that matters enough to document carefully before you pick a pipeline.
Methodology
The benchmark ran across 264 image pairs: 132 real photographs and 132 AI-generated images, split equally across six content categories — crowds and complex scenes, faces, materials and textures, nature, text and signage, and urban environments.
Each pair received three independent evaluations in A/B/tie format across four quality dimensions: detail recovery, edge sharpness, naturalness, and color accuracy. Thirteen evaluators participated: 4 QA testers, 4 art directors, and 5 distribution reviewers. Total: 264 pairs × 3 evaluations = 792 scored comparisons, each covering all four dimensions.
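The study-design counts can be sanity-checked with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, and the per-category figure assumes the "split equally" wording applies to the full set of pairs.

```python
# Study-design arithmetic taken from the Methodology section.
REAL_PHOTOS = 132
AI_GENERATED = 132
CATEGORIES = 6
EVALUATIONS_PER_PAIR = 3
DIMENSIONS = 4

pairs = REAL_PHOTOS + AI_GENERATED                      # total image pairs
pairs_per_category = pairs // CATEGORIES                # assumes an even split
scored_comparisons = pairs * EVALUATIONS_PER_PAIR       # three evaluations per pair
dimension_judgments = scored_comparisons * DIMENSIONS   # four dimensions per evaluation

print(pairs, pairs_per_category, scored_comparisons, dimension_judgments)
# prints: 264 44 792 3168
```

The 792 figure counts evaluations of a pair; counting each dimension separately gives 3,168 individual A/B/tie judgments.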
Agreement is measured using Gwet's AC1 rather than Fleiss κ. When one outcome dominates — as it does here on naturalness — Fleiss κ collapses toward zero not because evaluators disagree, but due to a mathematical artifact known as the paradox of kappa (Feinstein & Cicchetti, 1990). AC1 is stable under skewed distributions (Gwet, 2002), making it the correct metric for a benchmark where one model wins this decisively.
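The kappa paradox is easy to reproduce on a toy example. The sketch below implements the standard multi-rater formulas for Fleiss κ and Gwet's AC1 and applies them to synthetic ratings; the data is invented for illustration and is not the benchmark data, and the function names are our own.

```python
def _observed_agreement(counts):
    """Mean pairwise agreement per item; every item needs the same rater count."""
    n = sum(counts[0])
    per_item = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    return sum(per_item) / len(counts)

def _category_shares(counts):
    """Share of all ratings that fell into each category."""
    total = len(counts) * sum(counts[0])
    return [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]

def fleiss_kappa(counts):
    pa = _observed_agreement(counts)
    pe = sum(p * p for p in _category_shares(counts))   # chance agreement (Fleiss)
    return (pa - pe) / (1 - pe)

def gwet_ac1(counts):
    pa = _observed_agreement(counts)
    shares = _category_shares(counts)
    pe = sum(p * (1 - p) for p in shares) / (len(shares) - 1)  # chance agreement (Gwet)
    return (pa - pe) / (1 - pe)

# SYNTHETIC data, not from the benchmark: 100 items, 3 raters each,
# columns are vote counts for [model A, model B, tie].
synthetic = [[3, 0, 0]] * 90 + [[2, 0, 1]] * 10

print(round(fleiss_kappa(synthetic), 3), round(gwet_ac1(synthetic), 3))
# prints: -0.034 0.931
```

Raters agree unanimously on 90 of 100 items, yet Fleiss κ lands just below zero because its chance-agreement term is nearly as large as the observed agreement. AC1's chance term shrinks as one category dominates, so it stays high.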
Final ranking uses the Bradley-Terry model (Bradley & Terry, 1952), which estimates the probability that one model beats another in any given comparison. Session ID: 16d236bc. All images sourced from Everypixel production workflows.
Results
| Dimension | SeedVR2 | FLUX.2 [klein] 9B | Tie | AC1 Agreement |
|---|---|---|---|---|
| Detail Recovery | 76% | 5% | 19% | 0.42 |
| Edge Sharpness | 83% | 5% | 12% | 0.38 |
| Naturalness | 91% | 1% | 9% | 0.71 |
| Color Accuracy | 53% | 0% | 47% | 0.33 |
Bradley-Terry: SeedVR2 β = +0.70 · P(SeedVR2 > FLUX.2 [klein] 9B) = 80%
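The reported β and win probability can be related through the Bradley-Terry formula. The sketch below is one reading that reproduces the 80% figure; it assumes a sum-to-zero parameterization (SeedVR2 at +0.70, FLUX.2 [klein] 9B at −0.70), which the article does not state explicitly. How ties enter the fit is also not specified here.

```python
import math

def bt_win_probability(beta_a, beta_b):
    """Bradley-Terry: P(a beats b) = exp(beta_a) / (exp(beta_a) + exp(beta_b))."""
    return 1.0 / (1.0 + math.exp(beta_b - beta_a))

beta_seedvr2 = 0.70
beta_flux = -beta_seedvr2  # ASSUMPTION: sum-to-zero constraint, not stated in the article

p = bt_win_probability(beta_seedvr2, beta_flux)
print(f"P(SeedVR2 > FLUX.2 [klein] 9B) = {p:.2f}")
# prints: P(SeedVR2 > FLUX.2 [klein] 9B) = 0.80
```

If β = 0.70 were instead the difference between the two strengths, the same formula would give roughly 0.67, so the sum-to-zero reading is the one consistent with the published 80%.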
Across all four measured dimensions, SeedVR2 outperforms FLUX.2 [klein] 9B in a blind evaluation of 264 image pairs by 13 professional evaluators. (Everypixel production team, June 2026)
Detailed Findings
Naturalness: The Clearest Signal
SeedVR2 wins naturalness in 91% of comparisons; FLUX.2 [klein] 9B wins 1%. Evaluator agreement on this dimension (AC1 = 0.71) is the highest of any measured quality — reflecting genuine consensus, not a split panel. (Everypixel production team, June 2026)
This is where the models reveal their underlying design philosophy. SeedVR2 adds information that could plausibly have existed in the original scene. FLUX.2 [klein] 9B adds information that looks sharp but wasn't there. The difference is subtle in spec sheets and obvious in practice: evaluators consistently described SeedVR2 output as "clean without over-processing" and FLUX.2 [klein] 9B output as producing skin that looks "overly glossy" or textures that develop halos at edges.
Naturalness also has the most interpretive weight for commercial images. A stock photo that reads as processed is harder to license than one that reads as real.
Edge Sharpness: A Large Gap with a Ceiling
SeedVR2 wins edge sharpness in 83% of comparisons. Evaluators flagged over-sharpening as a recurring SeedVR2 failure mode on complex high-contrast source material. Winning at 83% doesn't mean winning cleanly. (Everypixel production team, June 2026)
Sharpness is where both models show their failure modes most clearly. SeedVR2 can over-sharpen images that are already high-contrast at source — particularly patterned fabric and complex textures. FLUX.2 [klein] 9B over-sharpens too, but also introduces what evaluators called "soapy textures" on skin and fabric surfaces. SeedVR2 gets the win, but the evaluators documented the ceiling.
Detail Recovery: Consistent, With Ties
SeedVR2 wins detail recovery in 76% of comparisons. The 19% tie rate is the second-highest of any dimension, after color accuracy at 47%, suggesting that at the extremes of recoverable detail, both models occasionally reach comparable results. (Everypixel production team, June 2026)
Small text is where both models struggle. SeedVR2 preserves general text but modifies fine letterforms. FLUX.2 [klein] 9B hallucinates plausible-looking but incorrect characters. Neither model is production-safe on small type without downstream verification. That's an honest limitation of the current generation of diffusion-based upscalers, not a deficiency specific to either model tested here.
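One possible shape for that downstream verification is a string comparison between the text expected in the image and the text read back from the upscaled result. The sketch below is hypothetical and not part of the benchmark: it assumes both strings are already available (for example from an OCR step, which is out of scope here), and the 0.98 threshold is an arbitrary placeholder.

```python
import difflib

def normalize(text):
    """Case-fold and collapse whitespace so only character changes count."""
    return " ".join(text.casefold().split())

def text_similarity(reference_text, upscaled_text):
    """Similarity in [0, 1] between the expected text and the text read after upscaling."""
    matcher = difflib.SequenceMatcher(None, normalize(reference_text), normalize(upscaled_text))
    return matcher.ratio()

def needs_manual_review(reference_text, upscaled_text, threshold=0.98):
    """Flag the image when the upscaled text drifts from the reference (threshold is hypothetical)."""
    return text_similarity(reference_text, upscaled_text) < threshold

print(needs_manual_review("OPEN 24 HOURS", "OPEN 24 HOURS"))   # unchanged text
print(needs_manual_review("OPEN 24 HOURS", "OPFN 24 H0URS"))   # altered letterforms
```

A check like this only catches changes an OCR engine can read; subtle letterform edits that still decode to the same characters would need visual review.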
Color Accuracy: The Exception to the Pattern
Color accuracy is the closest dimension: SeedVR2 wins 53% of comparisons, FLUX.2 [klein] 9B wins 0%, and 47% are ties. The high tie rate reflects SeedVR2's tendency to preserve the source color palette rather than modify it — making comparisons harder to call when the source is already well-exposed. (Everypixel production team, June 2026)
FLUX.2 [klein] 9B's color problem is consistent across the full dataset: it shifts green tones darker, increases overall contrast, and adds specular highlights where none existed in the source. Evaluators described this as "changes the palette and contrast of the source." SeedVR2's preservation-first approach is why the tie rate is high — when the source is correct, keeping it unchanged is the right call, and evaluators recorded that as a tie rather than a clear win.
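One way to read a tie-heavy result is to separate how often evaluators picked a side from who won when they did. The sketch below uses the published percentages from the Results table; the "decisive share" metric is our illustration, not a figure reported by the benchmark, and the published shares are rounded.

```python
# (SeedVR2 %, FLUX.2 [klein] 9B %, tie %) from the Results table.
results = {
    "detail_recovery": (76, 5, 19),
    "edge_sharpness": (83, 5, 12),
    "naturalness": (91, 1, 9),
    "color_accuracy": (53, 0, 47),
}

def decisive_share(seed, flux, tie):
    """SeedVR2's share of comparisons where evaluators picked a side (ties excluded)."""
    decided = seed + flux
    return seed / decided if decided else None

for dimension, (seed, flux, tie) in results.items():
    share = decisive_share(seed, flux, tie)
    print(f"{dimension}: tie rate {tie}%, decisive share {share:.3f}")
```

On color accuracy the decisive share is 1.0: the dimension is close only in the sense that nearly half the comparisons were ties, not because FLUX.2 [klein] 9B won any of them.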
Results by Category
<div data-component="benchmark-table">
<script type="application/json">
{
"caption": "SeedVR2 win rate by category and dimension. (Everypixel production team, June 2026)",
"columns": ["Category", "Detail", "Sharpness", "Naturalness", "Color"],
"rows": [
["crowds_complex", "84%", "93%", "100%", "66%"],
["faces", "89%", "86%", "93%", "36%"],
["materials_textures", "70%", "75%", "82%", "39%"],
["nature", "64%", "84%", "91%", "57%"],
["text_signage", "80%", "75%", "86%", "52%"],
["urban_scenes", "70%", "84%", "91%", "66%"]
],
"unit": "e",
"source_link": ""
}
</script>
</div>
All figures: SeedVR2 win rate by category and dimension. (Everypixel production team, June 2026)
SeedVR2 wins across all six content categories and all four quality dimensions. The weakest result is faces × color accuracy at 36%. Together with materials and textures × color accuracy (39%), it is one of only two dimension/category combinations where SeedVR2 fails to hold a majority preference. (Everypixel production team, June 2026)
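The category table can be scanned programmatically for cells where SeedVR2's win rate falls below a majority. This is a minimal sketch: the rows are copied from the table above, and the helper name and the 50% cutoff are our own choices.

```python
COLUMNS = ["Detail", "Sharpness", "Naturalness", "Color"]
ROWS = [
    ["crowds_complex", "84%", "93%", "100%", "66%"],
    ["faces", "89%", "86%", "93%", "36%"],
    ["materials_textures", "70%", "75%", "82%", "39%"],
    ["nature", "64%", "84%", "91%", "57%"],
    ["text_signage", "80%", "75%", "86%", "52%"],
    ["urban_scenes", "70%", "84%", "91%", "66%"],
]

def below_majority(rows, columns, cutoff=50):
    """Return (category, dimension, win %) for every cell under the cutoff, lowest first."""
    flagged = []
    for category, *cells in rows:
        for dimension, cell in zip(columns, cells):
            value = int(cell.rstrip("%"))
            if value < cutoff:
                flagged.append((category, dimension, value))
    return sorted(flagged, key=lambda item: item[2])

print(below_majority(ROWS, COLUMNS))
# prints: [('faces', 'Color', 36), ('materials_textures', 'Color', 39)]
```

Both flagged cells are color accuracy, the dimension with the highest tie rate, so a sub-50% win rate here does not by itself mean FLUX.2 [klein] 9B was preferred.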
Three category patterns are worth highlighting.
Crowds and complex scenes are the strongest result: SeedVR2 wins naturalness in 100% of comparisons in this category. FLUX.2 [klein] 9B's tiling approach handles the center subject reasonably well, but peripheral detail such as background figures and distant faces degrades with visible distortion. In a crowd photograph, the periphery is most of the frame.
Faces is the category where the color accuracy gap narrows significantly. SeedVR2 wins naturalness at 93% and detail at 89%, but color preference is 36% with many ties. Evaluators noted that FLUX.2 [klein] 9B adds skin texture that wasn't present in the source — a result that looks detailed but deviates from the original. (Everypixel production team, June 2026)
Materials and textures show SeedVR2's lowest naturalness across all categories (82%) and its second-weakest color preference (39%, after faces at 36%). This is also where FLUX.2 [klein] 9B's generative approach performs best in absolute terms, particularly on AI-generated source material, where there's no ground-truth detail to preserve. (Everypixel production team, June 2026)
Hard Cases: Where Evaluators Disagreed
The most informative pairs in a benchmark are the ones evaluators couldn't agree on. Three of them reveal exactly where each model's assumptions break down.
Case #2 — materials_textures / real photo (AC1 = 0.08)
Source: dark surface with complex specular geometry. FLUX.2 [klein] 9B hallucinated highlights on the black area and distorted the underlying shape. An evaluator noted directly: the model "drew highlights on the black and distorted the geometry." SeedVR2 left the dark area intact. The low AC1 reflects that some evaluators read the added highlights as enhanced detail — which is precisely the problem with evaluating models that confidently invent.
Case #8 — materials_textures / AI-generated (AC1 = 0.17)
This is one of the few cases where FLUX.2 [klein] 9B wins a majority evaluation — on detail and naturalness for an AI-generated texture close-up. When the source has no ground-truth detail to preserve, FLUX.2 [klein] 9B's generative approach becomes an asset rather than a liability. (Everypixel production team, June 2026)
This is the genuine exception. For AI-generated images at close range — where the source is synthetic and there's no correct texture to recover — FLUX.2 [klein] 9B occasionally produces results that evaluators rated higher. It's a narrow use case. But it's real, and it's useful to know before you automate an upscaling pipeline across mixed content types.
Case #11 — faces / real photo (AC1 = 0.17)
A distribution reviewer put it precisely: "SeedVR2 better preserved the source colors; skin pores are natural, not just smoothed and blurred." SeedVR2 rendered fine skin texture as textured and alive. FLUX.2 [klein] 9B rendered it as smoothed: cleaner at first glance, but less faithful to the source. The evaluator split here came from evaluators who interpreted FLUX.2 [klein] 9B's smoother output as higher quality rather than lower fidelity. That's a real perceptual ambiguity, and it's worth knowing it exists in your reviewer panel.
Evaluator Observations
SeedVR2 — what the panel noted:
Strengths: clean output without over-processing; realistic skin detail on portrait photography without blown highlights; preserves source lighting and soft contours; adds detail that reads as plausible and authentic.
Failure modes: over-sharpens on already high-contrast sources; can alter fine facial features on distant subjects; modifies small text rather than preserving it; may change clothing texture on complex patterned fabric.
FLUX.2 [klein] 9B — what the panel noted:
Strengths: pulls blurry areas into focus; can reconstruct detail absent from the source; handles isolated high-contrast portrait subjects reasonably; on AI-generated close-ups, generative detail occasionally reads as credible.
Failure modes: skin renders as over-glossy; shifts color palette and contrast consistently; peripheral areas in tiled output develop distortion; greenery darkens and loses natural tone; invented highlights contradict source lighting; adds non-existent skin details that contradict the original.
When to Use Each Model
SeedVR2 is the default choice for production stock photography — real photographs, mixed content, any workflow where source fidelity is a hard requirement. It's particularly strong on crowds, faces, urban scenes, and nature. Color accuracy is preserved rather than modified, which matters for images that will be distributed without color correction.
FLUX.2 [klein] 9B has a narrow competitive window: AI-generated images at close range, where the source has no recoverable ground-truth detail and generative hallucination reads as enhancement. For single high-contrast portrait subjects, it's occasionally competitive on detail — but it loses on naturalness and color accuracy as scene complexity increases.
Two cases where you should not use FLUX.2 [klein] 9B: images with peripheral detail that matters (crowds, wide shots, complex backgrounds), and any image where source color accuracy is a requirement. In both cases, the model's behavior is consistent and the failure is predictable.
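For a mixed-content pipeline, the guidance above could be encoded as a simple routing rule. The sketch below is hypothetical: the `ImageJob` fields and the `choose_upscaler` function are illustrative names, not part of either model's tooling, and the rule treats FLUX.2 [klein] 9B as a candidate worth A/B testing rather than a guaranteed win.

```python
from dataclasses import dataclass

SEEDVR2 = "SeedVR2"
FLUX_KLEIN = "FLUX.2 [klein] 9B"

@dataclass
class ImageJob:
    """Hypothetical description of one image in an upscaling queue."""
    is_ai_generated: bool
    category: str                  # e.g. "materials_textures", "faces", "crowds_complex"
    is_close_up: bool
    needs_color_fidelity: bool
    has_important_periphery: bool  # crowds, wide shots, complex backgrounds

def choose_upscaler(job: ImageJob) -> str:
    # The two documented "do not use FLUX.2 [klein] 9B" cases come first.
    if job.needs_color_fidelity or job.has_important_periphery:
        return SEEDVR2
    # Narrow competitive window: AI-generated texture close-ups.
    if job.is_ai_generated and job.is_close_up and job.category == "materials_textures":
        return FLUX_KLEIN
    # Everything else falls back to the default.
    return SEEDVR2

stock_photo = ImageJob(False, "crowds_complex", False, True, True)
ai_texture = ImageJob(True, "materials_textures", True, False, False)
print(choose_upscaler(stock_photo), "|", choose_upscaler(ai_texture))
```

Checking the exclusions before the exception keeps the failure cases predictable: an AI-generated texture that still needs color fidelity goes to SeedVR2.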
Frequently asked questions
Which AI upscaler is better for stock photography — SeedVR2 or FLUX.2 [klein] 9B?
For production stock photography, SeedVR2. In our evaluation of 264 pairs across six content categories, SeedVR2 won naturalness at 91% and sharpness at 83%. The decisive differentiator: SeedVR2 preserves the source color palette; FLUX.2 [klein] 9B modifies it consistently. (Everypixel production team, June 2026)
Does SeedVR2 work well on AI-generated images, not just real photos?
Yes — SeedVR2 wins across both real photo and AI-generated subcategories. The exception is materials and textures from AI-generated sources at close range, where FLUX.2 [klein] 9B occasionally produces results that evaluators rated as more detailed. For mixed content pipelines, SeedVR2 is still the safer default.
What is Gwet's AC1, and why use it instead of Fleiss kappa for image evaluation?
Gwet's AC1 is an inter-rater agreement metric that remains stable when one outcome dominates. Fleiss κ suffers from the paradox of kappa: when most evaluators agree, the expected-agreement baseline inflates, and κ collapses toward zero, making high consensus look like low agreement. In a benchmark where one model wins 91% of comparisons on a dimension, Fleiss κ would produce a misleadingly low agreement score. AC1 doesn't have this problem.
When does FLUX.2 [klein] 9B outperform SeedVR2?
In our evaluation, FLUX.2 [klein] 9B performs best on AI-generated texture close-ups where the source has no recoverable ground-truth detail. It's also competitive on isolated high-contrast portrait subjects in center-frame. Both are narrow conditions; as scene complexity increases, the gap favors SeedVR2.
How large was this evaluation?
264 image pairs × 3 evaluators per pair = 792 total scored comparisons, each rated on 4 quality dimensions (3,168 dimension-level judgments). 13 evaluators across three roles: 4 QA testers, 4 art directors, 5 distribution reviewers. Benchmark session ID: 16d236bc.
About Everypixel
Everypixel runs systematic benchmarks of AI image and video models used in production content workflows. Evaluations use domain-specific reviewer panels — not crowdsourced annotation — to reflect commercial visual content standards. Research is published at research.everypixel.com.
References
- ByteDance Seed. (2025). SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training. arXiv:2506.05301. arxiv.org/abs/2506.05301
- Black Forest Labs. (2025). FLUX.2 [klein]. bfl.ai/models/flux-2
- Bradley, R.A., & Terry, M.E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345. doi.org/10.2307/2334029
- Feinstein, A.R., & Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543–549. doi.org/10.1016/0895-4356(90)90158-L
- Gwet, K.L. (2002). Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Statistical Methods for Inter-Rater Reliability Assessment, 1, 1–6. agreestat.com
Cite this article
<blockquote cite="https://research.everypixel.com/seedvr2-vs-flux2-klein-9b/"> <p>SeedVR2 vs FLUX.2 [klein] 9B: blind evaluation of 264 image pairs by 13 professional evaluators across 4 quality dimensions. SeedVR2 wins naturalness at 91%, sharpness at 83%. Exception: AI-generated textures at close range.</p> <footer>— <a href="https://research.everypixel.com/seedvr2-vs-flux2-klein-9b/">SeedVR2 vs Flux2 Klein 9B: Which AI Upscaler Wins Across 264 Production Image Pairs?</a>, Everypixel Research, June 2026</footer> </blockquote>
Everypixel Research. (2026). SeedVR2 vs Flux2 Klein 9B: Which AI Upscaler Wins Across 264 Production Image Pairs?. research.everypixel.com. https://research.everypixel.com/seedvr2-vs-flux2-klein-9b/