VIOLIN Leaderboard
The VIOLIN leaderboard evaluates image generation models on VIOLIN (Visual Instruction Obedience Level-4 EvaluatIoN), the first systematic benchmark targeting deterministic pixel-level control.
State-of-the-art generative models excel at complex scenes yet fail at trivially simple tasks (the "Paradox of Simplicity"). VIOLIN exposes this failure through three zero- or low-entropy tasks: Pure Color Generation, Image Masking, and Geometric Shape Generation. The benchmark also includes Violin-Absolute-Color, a supplementary zero-entropy extension with six hex-code variations.
Main Benchmark Leaderboard
Three deterministic Level-4 Obedience tasks. Lower error scores indicate better obedience.
Pure Color Generation: generate uniform color blocks from ISCC-NBS Level-2 natural-language color names. Measures pixel-level color accuracy and image purity.

| Rank | Model | Type | Score (higher is better) | Error (lower is better) |
|---|---|---|---|---|
| 1 | Seedream-5 ByteDance | Closed | 95.1 | 0.049 |
| 2 | GPT-Image-2 OpenAI | Closed | 94.9 | 0.051 |
| 3 | Nano-Banana-2 Google | Closed | 94.7 | 0.053 |
| 4 | FLUX.2 Black Forest Labs | Open | 94.3 | 0.057 |
| 5 | Qwen-Image Alibaba | Open | 94.3 | 0.057 |
| 6 | Z-Image Huawei | Open | 93.9 | 0.061 |
| 7 | FLUX.1 Black Forest Labs | Open | 90.9 | 0.091 |

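
To make "pixel-level color accuracy and image purity" concrete, here is a minimal sketch of how a uniform color block could be scored. It is not the benchmark's scoring code: the function name `color_errors`, the normalizations, and the reference RGB value for a color name are all assumptions made for illustration.

```python
import math

# Length of the RGB cube diagonal, used to scale distances into [0, 1].
RGB_DIAGONAL = math.dist((0, 0, 0), (255, 255, 255))


def color_errors(pixels, target):
    """Return (accuracy_error, purity_error) for a list of (r, g, b) pixels.

    Hypothetical definitions, assumed for this sketch only:
    - accuracy_error: Euclidean distance between the mean pixel color and
      the target color, divided by the RGB cube diagonal.
    - purity_error: root-mean-square deviation of all channel values from
      the per-channel mean, divided by 255.
    """
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    accuracy_error = math.dist(mean, target) / RGB_DIAGONAL
    variance = sum((p[c] - mean[c]) ** 2 for p in pixels for c in range(3)) / (3 * n)
    purity_error = math.sqrt(variance) / 255
    return accuracy_error, purity_error


# A perfectly uniform red block scored against pure red has zero error.
perfect = [(255, 0, 0)] * 16
print(color_errors(perfect, (255, 0, 0)))
```

Both values are 0 for an exact, perfectly flat block and grow as the color drifts from the target or the block becomes noisy, which matches the "lower is better" reading of the error column.
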
Image Masking: apply binary masks (Inpainting / Outpainting / Random) to images with strict pixel-level adherence. Evaluates spatial coverage and boundary precision.

| Rank | Model | Type | Score (higher is better) | Error (lower is better) |
|---|---|---|---|---|
| 1 | GPT-Image-2 OpenAI | Closed | 84.2 | 0.158 |
| 2 | Seedream-5 ByteDance | Closed | 72.0 | 0.280 |
| 3 | Nano-Banana-2 Google | Closed | 63.4 | 0.366 |
| 4 | FLUX.2 Black Forest Labs | Open | 48.8 | 0.512 |

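
One way to picture "strict pixel-level adherence" is to compare the region a model actually altered with the region the mask allowed. The sketch below does this with intersection-over-union on small grayscale images stored as nested lists. The helper names `changed_region` and `mask_iou`, and the choice of IoU itself, are assumptions for illustration rather than the benchmark's own metric.

```python
def changed_region(before, after, tol=0):
    """Mark each pixel True if it differs between the two images by more than tol."""
    return [
        [abs(a - b) > tol for a, b in zip(row_after, row_before)]
        for row_after, row_before in zip(after, before)
    ]


def mask_iou(mask, changed):
    """Intersection-over-union between the allowed mask and the changed pixels."""
    inter = 0
    union = 0
    for mask_row, changed_row in zip(mask, changed):
        for m, c in zip(mask_row, changed_row):
            m = bool(m)
            inter += m and c
            union += m or c
    # If nothing was masked and nothing changed, treat it as perfect adherence.
    return inter / union if union else 1.0


before = [[0, 0], [0, 0]]
mask = [[1, 1], [0, 0]]            # only the top row may change
exact = [[9, 9], [0, 0]]           # edit stays inside the mask
leaky = [[9, 9], [9, 0]]           # edit spills one pixel outside the mask

print(mask_iou(mask, changed_region(before, exact)))
print(mask_iou(mask, changed_region(before, leaky)))
```

An edit confined exactly to the mask yields an IoU of 1, while any spill past the boundary or any untouched masked pixel lowers it, so `1 - IoU` would behave like an error in the "lower is better" sense.
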
Geometric Shape Generation: generate circles, squares, and triangles at precisely specified spatial positions. Measures shape fidelity, localization accuracy, and fill purity.

| Rank | Model | Type | Score (higher is better) | Error (lower is better) |
|---|---|---|---|---|
| 1 | GPT-Image-2 OpenAI | Closed | 86.6 | 0.134 |
| 2 | Seedream-5 ByteDance | Closed | 79.3 | 0.207 |
| 3 | Nano-Banana-2 Google | Closed | 71.8 | 0.282 |
| 4 | FLUX.2 Black Forest Labs | Open | 69.6 | 0.304 |
| 5 | Qwen-Image Alibaba | Open | 66.3 | 0.337 |
| 6 | Z-Image Huawei | Open | 62.4 | 0.376 |
| 7 | FLUX.1 Black Forest Labs | Open | 59.3 | 0.407 |

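
In all three tables, the higher-is-better and lower-is-better columns appear to carry the same information: every reported pair is consistent with score = 100 × (1 − error). The page does not state this relation, so the sketch below treats it as an assumption and simply checks it against the published rows (the name `score_from_error` is hypothetical).

```python
# (score, error) pairs copied from the three main leaderboard tables.
TABLES = {
    "Pure Color Generation": [
        (95.1, 0.049), (94.9, 0.051), (94.7, 0.053), (94.3, 0.057),
        (94.3, 0.057), (93.9, 0.061), (90.9, 0.091),
    ],
    "Image Masking": [
        (84.2, 0.158), (72.0, 0.280), (63.4, 0.366), (48.8, 0.512),
    ],
    "Geometric Shape Generation": [
        (86.6, 0.134), (79.3, 0.207), (71.8, 0.282), (69.6, 0.304),
        (66.3, 0.337), (62.4, 0.376), (59.3, 0.407),
    ],
}


def score_from_error(error):
    """Assumed relation: score = 100 * (1 - error), rounded to one decimal."""
    return round(100.0 * (1.0 - error), 1)


# Collect every row where the assumed relation does not reproduce the score.
mismatches = [
    (task, score, error)
    for task, rows in TABLES.items()
    for score, error in rows
    if abs(score_from_error(error) - score) > 1e-6
]
print(mismatches)  # → []
```

Because the two columns agree under this relation, ranking by descending score and ranking by ascending error give the same order.
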
All metrics are error measures; lower is better.
Supplementary: Violin-Absolute-Color
Zero-entropy color task: hex-code prompts across 6 variations, evaluated on 5 open-source models.
Single uniform color block, specified with hexadecimal code.
| Rank | Model | rgb-ed | lab-00 | sd | ced | hf | color-mean |
|---|---|---|---|---|---|---|---|
| 1 | Qwen-Image Alibaba | 0.156 | 0.180 | 0.058 | 0.002 | 0.021 | 0.083 |
| 2 | FLUX.1 Black Forest Labs | 0.364 | 0.387 | 0.044 | 0.001 | 0.001 | 0.159 |
| 3 | SANA NVIDIA | 0.288 | 0.331 | 0.232 | 0.028 | 0.030 | 0.182 |
| 4 | Janus-Pro-1.5 DeepSeek | 0.344 | 0.410 | 0.193 | 0.006 | 0.004 | 0.191 |
| 5 | OmniGen2 PKU/VectorSpace | 0.397 | 0.402 | 0.202 | 0.070 | 0.016 | 0.217 |
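
The color-mean column is not defined on the page, but the reported values are consistent with the unweighted mean of the five preceding metrics (rgb-ed, lab-00, sd, ced, hf). The sketch below recomputes it from the table under that assumption; `color_mean` is a hypothetical helper name.

```python
# model -> ((rgb-ed, lab-00, sd, ced, hf), reported color-mean), copied from the table.
ROWS = {
    "Qwen-Image": ((0.156, 0.180, 0.058, 0.002, 0.021), 0.083),
    "FLUX.1": ((0.364, 0.387, 0.044, 0.001, 0.001), 0.159),
    "SANA": ((0.288, 0.331, 0.232, 0.028, 0.030), 0.182),
    "Janus-Pro-1.5": ((0.344, 0.410, 0.193, 0.006, 0.004), 0.191),
    "OmniGen2": ((0.397, 0.402, 0.202, 0.070, 0.016), 0.217),
}


def color_mean(metrics):
    """Assumed aggregation: plain arithmetic mean of the five error metrics."""
    return sum(metrics) / len(metrics)


for model, (metrics, reported) in ROWS.items():
    recomputed = color_mean(metrics)
    # Reported values are rounded to three decimals, so allow that much slack.
    agrees = abs(recomputed - reported) < 0.001
    print(f"{model}: recomputed={recomputed:.4f} reported={reported:.3f} agrees={agrees}")
```

Under this reading, each metric contributes equally to the ranking, so a model with a large rgb-ed or lab-00 error cannot offset it with near-zero sd, ced, or hf values.
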
