Researchers at the USC Viterbi School of Engineering released a comprehensive study on April 22, 2026, detailing a fundamental misalignment in how artificial intelligence systems and humans interpret linguistic expressions of uncertainty. The research, conducted by the Information Sciences Institute, found that words such as "likely," "probably," and "possibly" are assigned sharply different numerical probabilities by AI models than by human users, creating a calibration gap that could lead to systematic errors in collaborative environments.

The study involved a comparative analysis of 1,500 human participants and five state-of-the-art large language models, including GPT-5, Claude 4, and Gemini 2.0. Participants and AI models were asked to assign percentage values to a set of 30 common probability phrases. The data revealed that while humans typically associate the word "likely" with a probability range of 60% to 75%, the tested AI models consistently assigned it a much higher value, often exceeding 85%. Conversely, humans interpreted terms like "unlikely" as roughly 20%, whereas AI models frequently assigned them values as low as 5%.
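
To make the reported gap concrete, the short Python sketch below compares stand-in human and model means for a few phrases and computes their average divergence. The specific numbers are illustrative placeholders consistent with the figures quoted above, not the study's released data.

```python
# Minimal sketch of the phrase-calibration comparison described above.
# The values are illustrative stand-ins based on the figures reported in
# the article (human "likely" ~60-75%, model "likely" >85%, and so on);
# the study's full 30-phrase dataset is not reproduced here.

human_means = {"likely": 0.675, "probably": 0.70, "unlikely": 0.20}
model_means = {"likely": 0.87, "probably": 0.82, "unlikely": 0.05}

def calibration_gap(human: dict, model: dict) -> float:
    """Mean absolute difference between human and model probability
    estimates over the shared set of phrases."""
    shared = human.keys() & model.keys()
    return sum(abs(human[p] - model[p]) for p in shared) / len(shared)

for phrase in human_means:
    print(f"{phrase:>10}: human {human_means[phrase]:.0%} "
          f"vs model {model_means[phrase]:.0%}")
print(f"mean absolute gap: {calibration_gap(human_means, model_means):.1%}")
```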

Technical analysis of the model outputs indicated that AI systems push their interpretations toward the extremes of the probability scale. This overconfidence is attributed to the reinforcement learning from human feedback (RLHF) process used during model training, which often rewards definitive answers over nuanced uncertainty. The researchers noted that this divergence was consistent across different model architectures, suggesting a systemic characteristic of current transformer-based learning rather than a flaw in a specific proprietary system.
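
One simple way to see what "extremeness" means here is to measure how far a set of estimates sits, on average, from the 50% midpoint. The metric below is a hypothetical construction for intuition only, not the analysis used in the USC study; the input values carry over from the sketch above.

```python
# Illustrative measure of "extremeness": average absolute distance of a
# set of probability estimates from the 50% midpoint, scaled to [0, 1].
# A hypothetical construction for intuition, not the study's metric.

def mean_extremeness(estimates: list[float]) -> float:
    """Average absolute distance from 0.5, scaled so 1.0 means every
    estimate sits at 0% or 100%."""
    return sum(abs(p - 0.5) for p in estimates) / (0.5 * len(estimates))

human = [0.675, 0.70, 0.20]   # stand-in human means from the sketch above
model = [0.87, 0.82, 0.05]    # stand-in model means

print(f"human extremeness: {mean_extremeness(human):.2f}")  # ~0.45
print(f"model extremeness: {mean_extremeness(model):.2f}")  # ~0.76
```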

Dr. Mayank Kejriwal, the lead researcher on the project, stated that the findings represent a significant hurdle for the integration of AI in high-stakes sectors. According to the report, the discrepancy is most pronounced in zero-shot scenarios where the AI does not have prior context for the user's specific vocabulary. In simulated medical diagnostic tests included in the study, the AI's interpretation of "moderate risk" led to a 22% higher rate of recommendations for aggressive intervention than human clinicians made when viewing the same qualitative data.

The USC team also measured the impact of prompt engineering on these interpretations. They found that even when models were explicitly instructed to adopt a human-centric view of probability, the numerical outputs remained statistically skewed toward the extremes. The study concludes that current AI calibration techniques fail to account for the subjective variability of human language, necessitating new frameworks for uncertainty alignment to ensure safe human-AI collaboration in fields such as defense, healthcare, and emergency response.
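
For readers who want to probe this behavior themselves, the sketch below shows the kind of "human-centric" instruction the team describes, written against the OpenAI Python client. The model name, prompt wording, and response parsing are assumptions for illustration and do not reproduce the study's protocol.

```python
# Sketch of a "human-centric" elicitation prompt of the kind the study
# tested. The model name, prompt wording, and parsing are assumptions
# for illustration; the paper's actual protocol may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "When mapping words like 'likely' or 'unlikely' to percentages, "
    "answer as a typical human survey respondent would, not with your "
    "own internal estimate. Reply with a single integer percentage."
)

def elicit(phrase: str, model: str = "gpt-4o") -> int:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user",
             "content": f'What percentage does "{phrase}" correspond to?'},
        ],
    )
    # Strip any stray "%" sign before parsing the reply as an integer.
    return int(resp.choices[0].message.content.strip().rstrip("%"))

for phrase in ("likely", "probably", "unlikely"):
    print(phrase, elicit(phrase))
```

As the study found, instructions like this shift wording but not calibration: the elicited numbers tend to stay skewed toward the extremes, which is why the authors argue for alignment at the training level rather than the prompt level.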