gather benchmark stats across three models (claude, gpt, gemini) & human raters

- [ ] fix reference-required eval: PDF source still missing
- [ ] add gemini to the eval
- [ ] run all three models and gather stats (see the sketch below)
- [ ] collect human rater results
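
A minimal sketch of the "run all three models and gather stats" step, assuming a hypothetical `call_model` helper for the actual API clients and an assumed `benchmark.jsonl` eval file with one `{"prompt": ..., "reference": ...}` record per line; the exact-match scorer is a stand-in for the eval's real grader.

```python
"""Sketch: run a benchmark across three models and aggregate accuracy stats."""
import json
from statistics import mean

MODELS = ["claude", "gpt", "gemini"]


def call_model(model: str, prompt: str) -> str:
    """Placeholder for the real API call (hypothetical helper)."""
    raise NotImplementedError(f"wire up the {model} client here")


def is_correct(answer: str, reference: str) -> bool:
    """Naive exact-match scoring; swap in the eval's real grader."""
    return answer.strip().lower() == reference.strip().lower()


def run_benchmark(path: str = "benchmark.jsonl") -> dict[str, float]:
    """Return per-model accuracy over the benchmark file (assumed format)."""
    with open(path) as f:
        items = [json.loads(line) for line in f]
    stats = {}
    for model in MODELS:
        scores = [
            is_correct(call_model(model, item["prompt"]), item["reference"])
            for item in items
        ]
        stats[model] = mean(scores)  # fraction of items answered correctly
    return stats


if __name__ == "__main__":
    print(run_benchmark())
```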