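In a Turing test evaluation, OpenAI's GPT-4.5 was judged to be human 73% of the time, more often than the human participants it was compared against, while Meta's LLaMa-3.1 was judged human 56% of the time. The results raise questions about the Turing Test's relevance and about the societal implications of models that can pass it.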