In evaluating Chameleon, we focus on tasks that require text generation conditioned on images, particularly image captioning and visual question answering, and we group the results by task specificity.
This evaluation was conducted in a relatively short time, and we tested the model only with simple agent scaffolds. We expect that higher benchmark performance is possible with more elicitation effort.