Seemingly trivial variations in the wording of prompts have a significant effect on model performance.
The research set out to determine whether running this experiment, at 12,000 requests per model, on GPT-3.5/4, Gemini, or Claude would have cost several thousand dollars.
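The arithmetic behind a cost estimate of this kind can be sketched as follows. This is only an illustration: the function name `estimate_cost`, the token count per request, and the price per 1,000 tokens are hypothetical placeholders, not figures from the research or from any provider's price list. Only the 12,000 requests per model comes from the text.

```python
def estimate_cost(requests_per_model, tokens_per_request, usd_per_1k_tokens, n_models=1):
    """Rough API cost in USD for a batch experiment.

    All inputs are supplied by the caller; none are real provider prices.
    """
    # Total tokens processed across every request and every model.
    total_tokens = requests_per_model * tokens_per_request * n_models
    # Providers typically bill per 1,000 tokens, so scale accordingly.
    return total_tokens / 1000 * usd_per_1k_tokens


# 12,000 requests per model is taken from the text; the remaining
# values are placeholder assumptions chosen only to show the calculation.
cost = estimate_cost(
    requests_per_model=12_000,
    tokens_per_request=1_000,   # hypothetical average prompt + completion size
    usd_per_1k_tokens=0.01,     # hypothetical price, not a real quote
)
print(f"Estimated cost: ${cost:,.2f}")
```

Substituting a provider's actual pricing and the measured average request size would show how quickly the total grows with the number of models and prompt variations tested.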