After looking more into it, Claude DOES give the correct answer, just not in the format that it's asked, it always adds more info at the end, even when asked to just give the answer...
What do you mean? You can force JSON with structured output.
It was just an example though, in real-world scenarios, sometimes I have to tell the AI to respond in a specific strict format, which is not JSON (e.g. asking it to end with "Good bye!"). Claude is the one who is the worst at following those type of instructions, and because of this it fails to return to correct answer in the correct format, even though the answer itself is good.
i agree that is annoying but seems like anthropic's stance is that the task/agent should be provided an environment to write the file in the output you provide or provided a skill.md description on how to do that specific task.
personally it's a blurry line. most times i'm interacting with an agent where outputting to a file makes sense but it makes it less reliable when treating the model call as a deterministic function call.
There's definitely many ways to improve the output of the AI, and provide it extra hints. Also, some AIs are made for a specific use-case. Maybe I should rephrase it and say that those benchmarks are more about the single-reply intelligence of a model, and more like an AGI test then for specific use-cases.