
VIOLIN Leaderboard

The VIOLIN leaderboard displays evaluation results for image generation models on the Visual Instruction Obedience Level-4 EvaluatIoN (VIOLIN) benchmark. We evaluate models' ability to generate pure colors with pixel-perfect precision across six task variations.

🏆 Main Benchmark Leaderboard

Compare model performance across the task variations. The overall score is higher-is-better; the two error metrics, Pre-mean (precision error) and Pur-mean (purity error), are lower-is-better.

Variation 1: Single Color

Generate a single uniform pure color

Example: "Generate an image with pure color #D7472A (Hex code)."
(Example image: Variation 1, Single Color)
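To make the two error columns below concrete, here is a minimal scoring sketch. It assumes the precision error is the mean per-pixel distance from the target color and the purity error is the mean per-pixel deviation from the image's own average color; the benchmark's exact formulas may differ, and `score_image` / `hex_to_rgb` are hypothetical helpers.

```python
import numpy as np
from PIL import Image

def hex_to_rgb(hex_code: str) -> np.ndarray:
    """Parse a hex code like '#D7472A' into an RGB triple in [0, 1]."""
    h = hex_code.lstrip("#")
    return np.array([int(h[i:i + 2], 16) for i in (0, 2, 4)]) / 255.0

def score_image(path: str, target_hex: str) -> tuple[float, float]:
    """Return (precision_error, purity_error) for one generated image.

    Assumed definitions, not the official VIOLIN formulas:
    - precision error: mean per-pixel L2 distance from the target color
    - purity error: mean per-pixel L2 distance from the image's own mean
      color (i.e., how non-uniform the image is)
    """
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0
    precision = np.linalg.norm(img - hex_to_rgb(target_hex), axis=-1).mean()
    purity = np.linalg.norm(img - img.mean(axis=(0, 1)), axis=-1).mean()
    return float(precision), float(purity)

# Usage: pre, pur = score_image("output.png", "#D7472A")
```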
| Rank | Model | Organization | Type | Score (↑) | Pre-mean (↓) | Pur-mean (↓) |
|------|-------|--------------|------|-----------|--------------|--------------|
| 1 | GPT-Image-1.5 | OpenAI | Closed | 93.2 | 0.053 | 0.015 |
| 2 | Nano-Banana | Google | Closed | 89.4 | 0.102 | 0.004 |
| 3 | Qwen-Image | Alibaba | Open | 80.2 | 0.171 | 0.027 |
| 4 | Seedream-4.5 | ByteDance | Closed | 71.4 | 0.203 | 0.083 |
| 5 | FLUX.1 | Black Forest Labs | Open | 62.9 | 0.356 | 0.015 |
| 6 | SANA | NVIDIA | Open | 58.4 | 0.319 | 0.097 |
| 7 | Janus-Pro-1.5 | DeepSeek | Open | 53.6 | 0.396 | 0.068 |
| 8 | OmniGen2 | VectorSpace Lab | Open | 51.4 | 0.390 | 0.096 |

💡 Data source: VIOLIN Benchmark

🔧 Fine-tuning Track

Models are fine-tuned on 90% of the Variation-1 subset; the table below reports each model's error before and after fine-tuning.

| Rank | Model | Organization | Pre-mean (before) | Pur-mean (before) | Pre-mean (after) | Pur-mean (after) | Δ Precision | Δ Purity |
|------|-------|--------------|-------------------|-------------------|------------------|------------------|-------------|----------|
| 1 | Janus-Pro-1.5 | DeepSeek | 0.396 | 0.068 | 0.210 | 0.006 | -47.0% | -91.2% |
| 2 | Qwen-Image | Alibaba | 0.171 | 0.027 | 0.119 | 0.004 | -30.4% | -85.2% |
| 3 | FLUX.1 | Black Forest Labs | 0.356 | 0.015 | 0.277 | 0.001 | -22.2% | -93.3% |
| 4 | SANA | NVIDIA | 0.319 | 0.097 | 0.317 | 0.085 | -0.6% | -12.4% |
| 5 | OmniGen2 | VectorSpace Lab | 0.390 | 0.096 | 0.391 | 0.078 | +0.3% | -18.8% |
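The Δ columns are consistent with the standard relative change in error. A quick check, assuming that definition (`delta_pct` is a hypothetical helper):

```python
def delta_pct(before: float, after: float) -> float:
    """Relative change in error after fine-tuning, in percent.

    Negative values mean the error went down. Consistent with the table:
    e.g. Janus-Pro-1.5 purity, (0.006 - 0.068) / 0.068 * 100 ≈ -91.2%.
    """
    return (after - before) / before * 100.0
```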

Key Finding: Fine-tuning sharply reduces purity errors for all five models (by 12.4% to 93.3%) and also reduces precision error for four of the five.

🧪 Generalization Track

Evaluating the Janus-Pro model's zero-shot generalization on unseen data splits.

Prompt-Split

Strategy: Random 80-20 split by prompt templates
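A minimal sketch of such a split, assuming a uniform random shuffle of prompt templates (the benchmark's actual procedure and seed are not specified here; `split_prompts` is a hypothetical helper):

```python
import random

def split_prompts(templates: list[str], train_frac: float = 0.8, seed: int = 0):
    """Randomly split prompt templates into train/test subsets (80/20)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(templates)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# train_templates, test_templates = split_prompts(all_templates)
```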

| Metric | Training Set | Test Set | Full Set |
|--------|--------------|----------|----------|
| Pre-max | 0.236 | 0.267 | 0.249 |
| Pre-mean | 0.097 | 0.128 | 0.110 |
| Pre-std | 0.049 | 0.056 | 0.053 |
| Pur-max | 0.018 | 0.019 | 0.018 |
| Pur-mean | 0.003 | 0.004 | 0.004 |
| Pur-std | 0.003 | 0.004 | 0.003 |
| Avg-max | 0.127 | 0.143 | 0.134 |
| Avg-mean | 0.050 | 0.066 | 0.057 |
| Avg-std | 0.025 | 0.029 | 0.027 |
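The Avg rows are consistent with averaging each image's precision and purity errors first and then taking max/mean/std (Avg-mean equals the mean of Pre-mean and Pur-mean on every split, while Avg-std does not equal the mean of the two stds, as per-image averaging would predict). A sketch under that assumption (`aggregate` is a hypothetical helper):

```python
import numpy as np

def aggregate(pre: np.ndarray, pur: np.ndarray) -> dict[str, float]:
    """Collapse per-image precision/purity errors into the table's rows.

    Assumes the Avg-* rows come from each image's mean of its own
    precision and purity errors, computed before aggregation.
    """
    stats = {}
    for name, vals in (("Pre", pre), ("Pur", pur), ("Avg", (pre + pur) / 2)):
        stats[f"{name}-max"] = float(vals.max())
        stats[f"{name}-mean"] = float(vals.mean())
        stats[f"{name}-std"] = float(vals.std())
    return stats
```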

Observation: Mean error on the test set is 32% higher than on the training set (Pre-mean 0.128 vs. 0.097; Avg-mean 0.066 vs. 0.050), indicating generalization challenges on unseen prompts.