VIOLIN Leaderboard
The VIOLIN leaderboard displays evaluation results for image generation models on the Visual Instruction Obedience Level-4 EvaluatIoN (VIOLIN) benchmark. It evaluates a model's ability to generate pure, uniform colors with pixel-perfect precision across six task variations.
🏆 Main Benchmark Leaderboard
Compare model performance across different task variations. A higher overall Score is better; lower Pre-mean and Pur-mean error values are better.
Task variation: Generate a single uniform pure color
Example prompt: "Generate an image with pure color #D7472A (Hex code)."
| Rank | Model | Organization | Type | Score (higher is better) | Pre-mean (lower is better) | Pur-mean (lower is better) |
|---|---|---|---|---|---|---|
| 1 | GPT-Image-1.5 | OpenAI | Closed | 93.2 | 0.053 | 0.015 |
| 2 | Nano-Banana | Google | Closed | 89.4 | 0.102 | 0.004 |
| 3 | Qwen-Image | Alibaba | Open | 80.2 | 0.171 | 0.027 |
| 4 | Seedream-4.5 | ByteDance | Closed | 71.4 | 0.203 | 0.083 |
| 5 | FLUX.1 | Black Forest Labs | Open | 62.9 | 0.356 | 0.015 |
| 6 | SANA | NVIDIA | Open | 58.4 | 0.319 | 0.097 |
| 7 | Janus-Pro-1.5 | DeepSeek | Open | 53.6 | 0.396 | 0.068 |
| 8 | OmniGen2 | VectorSpace Lab | Open | 51.4 | 0.390 | 0.096 |
💡 Data source: VIOLIN Benchmark
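The page does not spell out how the Pre (precision) and Pur (purity) errors are computed. One plausible reading is that precision measures each pixel's distance from the requested color, while purity measures how uniform the image is regardless of hue. The sketch below illustrates that interpretation; the function names and formulas are our assumptions, not the official scoring code.

```python
import numpy as np

def hex_to_rgb(hex_code: str) -> np.ndarray:
    """Convert a hex string like '#D7472A' to a float RGB array in [0, 1]."""
    h = hex_code.lstrip("#")
    return np.array([int(h[i:i + 2], 16) for i in (0, 2, 4)]) / 255.0

def precision_error(image: np.ndarray, target_hex: str) -> float:
    """Assumed Pre metric: mean per-pixel distance from the target color.

    `image` is an (H, W, 3) float array in [0, 1]. Lower is better.
    """
    target = hex_to_rgb(target_hex)
    return float(np.linalg.norm(image - target, axis=-1).mean())

def purity_error(image: np.ndarray) -> float:
    """Assumed Pur metric: mean deviation of each pixel from the image's
    own mean color. Measures uniformity only: a perfectly flat image
    scores 0 even if it is the wrong color. Lower is better.
    """
    mean_color = image.mean(axis=(0, 1))
    return float(np.linalg.norm(image - mean_color, axis=-1).mean())

# A perfectly uniform image at the target color scores 0 on both metrics.
img = np.ones((256, 256, 3)) * hex_to_rgb("#D7472A")
print(precision_error(img, "#D7472A"), purity_error(img))  # 0.0 0.0
```

Under this reading, an image can be perfectly pure yet imprecise (uniform but the wrong color), which would be consistent with rows like Nano-Banana pairing a very low Pur-mean (0.004) with a higher Pre-mean (0.102).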
🔧 Fine-tuning Track
Models are fine-tuned on 90% of the Variation-1 subset; we report Pre-mean and Pur-mean before and after fine-tuning.
| Rank | Model | Organization | Pre-mean (before) | Pur-mean (before) | Pre-mean (after) | Pur-mean (after) | Δ Precision | Δ Purity |
|---|---|---|---|---|---|---|---|---|
| 1 | Janus-Pro-1.5 | DeepSeek | 0.396 | 0.068 | 0.210 | 0.006 | -47.0% | -91.2% |
| 2 | Qwen-Image | Alibaba | 0.171 | 0.027 | 0.119 | 0.004 | -30.4% | -85.2% |
| 3 | FLUX.1 | Black Forest Labs | 0.356 | 0.015 | 0.277 | 0.001 | -22.2% | -93.3% |
| 4 | SANA | NVIDIA | 0.319 | 0.097 | 0.317 | 0.085 | -0.6% | -12.4% |
| 5 | OmniGen2 | VectorSpace Lab | 0.390 | 0.096 | 0.391 | 0.078 | +0.3% | -18.8% |
Key Finding: Fine-tuning dramatically reduces purity error for every model (-12.4% to -93.3%), while precision error improves or holds essentially steady (-47.0% to +0.3%).
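For reference, the Δ columns are plain relative changes between the before and after means; a quick check against the Janus-Pro-1.5 row above:

```python
def delta_pct(before: float, after: float) -> float:
    """Relative change from before to after, as a percentage (negative = improvement)."""
    return (after - before) / before * 100.0

# Janus-Pro-1.5: precision 0.396 -> 0.210, purity 0.068 -> 0.006.
print(round(delta_pct(0.396, 0.210), 1))  # -47.0
print(round(delta_pct(0.068, 0.006), 1))  # -91.2
```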
🧪 Generalization Track
Evaluating the Janus-Pro model's zero-shot generalization to unseen data splits.
Strategy: Random 80-20 split by prompt templates
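The split code itself is not published here; below is a minimal hypothetical sketch of a template-level 80/20 split, where every concrete prompt inherits its template's assignment so no template appears in both sets. The template strings are invented placeholders, not the benchmark's actual prompts.

```python
import random

def split_by_template(prompt_templates: list[str], train_frac: float = 0.8,
                      seed: int = 0) -> tuple[list[str], list[str]]:
    """Randomly split prompt templates into train/test sets.

    Splitting at the template level (rather than the prompt level)
    guarantees the test set contains only unseen phrasings.
    """
    templates = sorted(set(prompt_templates))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(templates)
    cut = int(len(templates) * train_frac)
    return templates[:cut], templates[cut:]

# Hypothetical usage with placeholder templates; {hex} stands for the target color.
train_t, test_t = split_by_template([
    "Generate an image with pure color {hex} (Hex code).",
    "Create a single uniform image of color {hex}.",
    "Fill the canvas entirely with {hex}.",
    "Output a solid {hex} colored image.",
    "Render nothing but the color {hex}.",
])
```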
| Metric | Training Set | Test Set | Full Set |
|---|---|---|---|
| Pre-max | 0.236 | 0.267 | 0.249 |
| Pre-mean | 0.097 | 0.128 | 0.110 |
| Pre-std | 0.049 | 0.056 | 0.053 |
| Pur-max | 0.018 | 0.019 | 0.018 |
| Pur-mean | 0.003 | 0.004 | 0.004 |
| Pur-std | 0.003 | 0.004 | 0.003 |
| Avg-max | 0.127 | 0.143 | 0.134 |
| Avg-mean | 0.050 | 0.066 | 0.057 |
| Avg-std | 0.025 | 0.029 | 0.027 |
Observation: Test-set average error (Avg-mean 0.066) is 32% higher than training-set error (0.050), indicating generalization challenges on unseen prompt templates.
