OpenAI Unveils HealthBench to Evaluate Healthcare AI Models

Story By Sameeksha Bahuguna • 1 year ago • 3 Mins Read

HealthBench supports 49 languages and covers 26 medical specialities, from neurological surgery to ophthalmology.

OpenAI, the creator of the artificial intelligence chatbot ChatGPT, has stepped into healthcare with the launch of HealthBench. This open-source dataset is designed to evaluate the performance of large language models (LLMs) in clinical and biomedical domains.

With this launch, OpenAI is emphasizing the responsible deployment of AI, working with medical professionals, researchers, and institutions to ensure the credibility and reliability of its systems.

HealthBench was developed in collaboration with 262 physicians from 60 countries. These experts helped create 5,000 realistic health conversations that reflect real-world patient interactions. Each AI response is rated using a scoring guide written by doctors, reflecting how they make medical decisions.

The main goal of the benchmark is to check whether AI-generated responses match what medical professionals would recommend, and whether the models are genuinely useful at a time when the need for trustworthy and safe AI in medicine is growing.

The scoring process is handled by OpenAI’s advanced language model, GPT-4.1. It applies the physician-authored rubric to each answer, with criteria weighted for clinical relevance, ensuring that accuracy, clarity, and safety are prioritized in the evaluation.
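To make the idea of a weighted rubric concrete, here is a minimal Python sketch of how such a score could be computed. It is an illustration under stated assumptions, not OpenAI's implementation: the `Criterion` class, the `rubric_score` function, the criterion wording, and the point values are all hypothetical, and the grader's per-criterion judgments are supplied by hand as booleans rather than produced by a model. The sketch assumes the score is the points earned divided by the maximum positive points available, clipped to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str  # hypothetical rubric text, not taken from HealthBench
    points: int       # weight; a negative value penalizes an undesirable behavior


def rubric_score(criteria, met):
    """Return earned points / maximum positive points, clipped to [0, 1].

    `met[i]` is True when the grader judged that the response
    satisfies `criteria[i]`.
    """
    if len(criteria) != len(met):
        raise ValueError("criteria and met must have the same length")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    return min(1.0, max(0.0, earned / max_points))


# Assumed example rubric for an "unresponsive person" scenario.
example_rubric = [
    Criterion("Advises calling emergency services", 10),
    Criterion("Explains how to check breathing and pulse", 7),
    Criterion("Uses clear, non-technical language", 3),
    Criterion("Recommends an unsafe action", -8),
]
# Hand-written grader judgments: first two criteria met, last two not.
example_met = [True, True, False, False]

example_score = rubric_score(example_rubric, example_met)
print(f"{example_score:.0%}")  # 17 of 20 possible points -> 85%
```

In this sketch, a response that triggers a negatively weighted criterion loses points, which is one simple way to let safety concerns outweigh stylistic ones.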

HealthBench supports 49 languages and covers 26 medical specialities, from neurological surgery to ophthalmology. This diversity allows the dataset to test AI models on varied medical scenarios.

Early results show that OpenAI’s o3 model leads with a score of 60%, followed by xAI’s Grok 3 at 54% and Google’s Gemini 2.5 Pro at 52%. These results come from testing on realistic medical scenarios, such as how to respond when someone is unresponsive.

In one example, an AI model’s answer to a critical situation received a score of 77%, highlighting both strengths and areas for improvement.

According to a report on HealthBench, “The goal for HealthBench is to discover whether AI models are giving the best possible responses to people's health-related inquiries.”

This initiative sets a new standard for evaluating the clinical readiness of language models. It also fosters trust and accountability as AI becomes more integrated into healthcare.

Stay tuned for more such updates on Digital Health News.
