
VIOLIN Leaderboard

The VIOLIN leaderboard displays evaluation results for image generation models on the Visual Instruction Obedience Level-4 EvaluatIoN (VIOLIN) benchmark. We evaluate models' ability to generate pure colors with pixel-perfect precision across six task variations.

🏆 Main Benchmark Leaderboard

Compare model performance across the task variations. The overall score is higher-is-better; the two error metrics, Pre-mean (precision error) and Pur-mean (purity error), are lower-is-better.

Variation 1: Single Color

Generate a single uniform pure color

Example: "Generate an image with pure color #D7472A (Hex code)."
(Example image: Variation 1, Single Color)
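To make the two error columns below concrete, here is a minimal scoring sketch. It assumes the precision error is the mean per-pixel distance from the target color and the purity error is the mean per-pixel deviation from the image's own average color; the benchmark's exact formulas may differ, and `score_image` / `hex_to_rgb` are hypothetical helpers.

```python
import numpy as np
from PIL import Image

def hex_to_rgb(hex_code: str) -> np.ndarray:
    """Parse a hex code like '#D7472A' into an RGB triple in [0, 1]."""
    h = hex_code.lstrip("#")
    return np.array([int(h[i:i + 2], 16) for i in (0, 2, 4)]) / 255.0

def score_image(path: str, target_hex: str) -> tuple[float, float]:
    """Return (precision_error, purity_error) for one generated image.

    Assumed definitions, not the official VIOLIN formulas:
    - precision error: mean per-pixel L2 distance from the target color
    - purity error: mean per-pixel L2 distance from the image's own mean
      color (i.e., how non-uniform the image is)
    """
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0
    precision = np.linalg.norm(img - hex_to_rgb(target_hex), axis=-1).mean()
    purity = np.linalg.norm(img - img.mean(axis=(0, 1)), axis=-1).mean()
    return float(precision), float(purity)

# Usage: pre, pur = score_image("output.png", "#D7472A")
```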
| Rank | Model | Organization | Type | Score (↑) | Pre-mean (↓) | Pur-mean (↓) |
|------|-------|--------------|------|-----------|--------------|--------------|
| 1 | GPT-Image-1.5 | OpenAI | Closed | 93.2 | 0.053 | 0.015 |
| 2 | Nano-Banana | Google | Closed | 89.4 | 0.102 | 0.004 |
| 3 | Qwen-Image | Alibaba | Open | 80.2 | 0.171 | 0.027 |
| 4 | Seedream-4.5 | ByteDance | Closed | 71.4 | 0.203 | 0.083 |
| 5 | FLUX.1 | Black Forest Labs | Open | 62.9 | 0.356 | 0.015 |
| 6 | SANA | NVIDIA | Open | 58.4 | 0.319 | 0.097 |
| 7 | Janus-Pro-1.5 | DeepSeek | Open | 53.6 | 0.396 | 0.068 |
| 8 | OmniGen2 | VectorSpace Lab | Open | 51.4 | 0.390 | 0.096 |

💡 Data source: VIOLIN Benchmark

🔧 Fine-tuning Track

Models are fine-tuned on 90% of the Variation-1 subset; the table below reports each model's error before and after fine-tuning.

| Rank | Model | Organization | Pre-mean (before) | Pur-mean (before) | Pre-mean (after) | Pur-mean (after) | Δ Precision | Δ Purity |
|------|-------|--------------|-------------------|-------------------|------------------|------------------|-------------|----------|
| 1 | Janus-Pro-1.5 | DeepSeek | 0.396 | 0.068 | 0.210 | 0.006 | -47.0% | -91.2% |
| 2 | Qwen-Image | Alibaba | 0.171 | 0.027 | 0.119 | 0.004 | -30.4% | -85.2% |
| 3 | FLUX.1 | Black Forest Labs | 0.356 | 0.015 | 0.277 | 0.001 | -22.2% | -93.3% |
| 4 | SANA | NVIDIA | 0.319 | 0.097 | 0.317 | 0.085 | -0.6% | -12.4% |
| 5 | OmniGen2 | VectorSpace Lab | 0.390 | 0.096 | 0.391 | 0.078 | +0.3% | -18.8% |
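The Δ columns are consistent with the standard relative change in error. A quick check, assuming that definition (`delta_pct` is a hypothetical helper):

```python
def delta_pct(before: float, after: float) -> float:
    """Relative change in error after fine-tuning, in percent.

    Negative values mean the error went down. Consistent with the table:
    e.g. Janus-Pro-1.5 purity, (0.006 - 0.068) / 0.068 * 100 ≈ -91.2%.
    """
    return (after - before) / before * 100.0
```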

Key Finding: Fine-tuning sharply reduces purity errors for all five models (by 12.4% to 93.3%) and also reduces precision error for four of the five.

🧪 Generalization Track

Evaluating the Janus-Pro model's zero-shot generalization on unseen data splits.

Prompt-Split

Strategy: Random 80-20 split by prompt templates
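A minimal sketch of such a split, assuming a uniform random shuffle of prompt templates (the benchmark's actual procedure and seed are not specified here; `split_prompts` is a hypothetical helper):

```python
import random

def split_prompts(templates: list[str], train_frac: float = 0.8, seed: int = 0):
    """Randomly split prompt templates into train/test subsets (80/20)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(templates)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# train_templates, test_templates = split_prompts(all_templates)
```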

| Metric | Training Set | Test Set | Full Set |
|--------|--------------|----------|----------|
| Pre-max | 0.236 | 0.267 | 0.249 |
| Pre-mean | 0.097 | 0.128 | 0.110 |
| Pre-std | 0.049 | 0.056 | 0.053 |
| Pur-max | 0.018 | 0.019 | 0.018 |
| Pur-mean | 0.003 | 0.004 | 0.004 |
| Pur-std | 0.003 | 0.004 | 0.003 |
| Avg-max | 0.127 | 0.143 | 0.134 |
| Avg-mean | 0.050 | 0.066 | 0.057 |
| Avg-std | 0.025 | 0.029 | 0.027 |
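The Avg rows are consistent with averaging each image's precision and purity errors first and then taking max/mean/std (Avg-mean equals the mean of Pre-mean and Pur-mean on every split, while Avg-std does not equal the mean of the two stds, as per-image averaging would predict). A sketch under that assumption (`aggregate` is a hypothetical helper):

```python
import numpy as np

def aggregate(pre: np.ndarray, pur: np.ndarray) -> dict[str, float]:
    """Collapse per-image precision/purity errors into the table's rows.

    Assumes the Avg-* rows come from each image's mean of its own
    precision and purity errors, computed before aggregation.
    """
    stats = {}
    for name, vals in (("Pre", pre), ("Pur", pur), ("Avg", (pre + pur) / 2)):
        stats[f"{name}-max"] = float(vals.max())
        stats[f"{name}-mean"] = float(vals.mean())
        stats[f"{name}-std"] = float(vals.std())
    return stats
```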

Observation: Mean error on the test set is 32% higher than on the training set (Pre-mean 0.128 vs. 0.097; Avg-mean 0.066 vs. 0.050), indicating generalization challenges on unseen prompts.