bongard

Evaluating Claude models on Bongard problems. The problems are drawn from the On-Line Encyclopedia of Bongard Problems (OEBP).

You can visualize results in the Streamlit app here.

Next steps

  • Investigate how well models describe abstract images, independent of the Bongard task. Image description, rather than abstract reasoning, currently seems to be the bottleneck.
  • Add an LLM evaluation of model responses: give an LLM the solution and a model response, and ask it to determine whether the response is correct.
  • Evaluate models beyond Haiku and Opus (including future releases targeting diagrams).
  • Do more prompt engineering.
  • Evaluate via classification rather than description. That is, given 5 left images and 5 right images, ask the model to place a new image in the correct group.
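
The classification idea in the last item could be set up roughly as follows. This is a minimal sketch, not code from this repository: `Trial`, `make_trials`, and `accuracy` are hypothetical names, and it assumes each problem provides 6 images per side, so holding out one image and dropping its counterpart on the other side leaves the 5 left and 5 right context images described above. The `classify` argument stands in for whatever function would actually query a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trial:
    left: list[str]   # context images shown as the left group
    right: list[str]  # context images shown as the right group
    query: str        # held-out image the model must place
    answer: str       # ground truth: "left" or "right"


def make_trials(left: list[str], right: list[str]) -> list[Trial]:
    """Build held-out classification trials from one problem.

    Assumption: both sides have the same number of images. For each
    index i, left[i] and right[i] are both removed from the context,
    and each of them becomes the query of one trial.
    """
    if len(left) != len(right):
        raise ValueError("expected the same number of images on each side")
    trials = []
    for i in range(len(left)):
        ctx_left = left[:i] + left[i + 1:]
        ctx_right = right[:i] + right[i + 1:]
        trials.append(Trial(ctx_left, ctx_right, left[i], "left"))
        trials.append(Trial(ctx_left, ctx_right, right[i], "right"))
    return trials


def accuracy(trials: list[Trial], classify: Callable[[Trial], str]) -> float:
    """Fraction of trials where classify(trial) matches the true side."""
    correct = sum(classify(t) == t.answer for t in trials)
    return correct / len(trials)


# Example with placeholder file names and a stub in place of a model call.
left_images = [f"left_{i}.png" for i in range(6)]
right_images = [f"right_{i}.png" for i in range(6)]
example_trials = make_trials(left_images, right_images)
baseline = accuracy(example_trials, lambda t: "left")  # always guesses "left"
```

Because the trials are balanced between the two sides, a constant guess scores 0.5, which gives a chance baseline to compare model accuracy against.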

Resources

As far as I can tell, no one has evaluated modern multimodal models on this exact task, but there is some related work:

Resources on Bongard problems: