H.A.I.R - AI in HR
LLM Reality Check: AI Résumé Screeners are Fast, Fluent, and Fundamentally Unreliable (For Now)

Martyn Redstone
July 15, 2025

Introduction

The buzz around AI in recruitment is deafening. Vendors promise "shortlists in seconds", and tools like ChatGPT suggest CV screening as a prime use case. For overwhelmed talent acquisition teams, this sounds like a dream. But what happens when these off-the-shelf Large Language Models (LLMs) face a real-world hiring task? Our latest field experiment at Eunomia HR, the "LLM Reality Check," sought to answer precisely that. The findings might surprise you.

The "Big Idea"

Our core finding can be summed up in one sentence: Off-the-shelf large language models screen CVs like over-confident interns: fast and smooth, but shockingly inconsistent.

The Experiment - Briefly

In May 2025, I conducted 300 head-to-head résumé screens. We took three leading chatbots – ChatGPT-4o, Gemini 2.0 Flash, and Grok 3 – and gave them the same task: review 109 anonymized HR Business Partner résumés for a global Meta role, using a typical prompt a busy recruiter might employ.

The Uncomfortable Truth: Key Findings

Shocking Disagreement (14%): The models agreed on just 14% of their daily top-ten shortlists. Imagine hiring three human recruiters who could only agree on one or two "top" candidates out of ten – that's the level of consensus we saw. Two AI recruiters looking at the same CVs will differ four times out of five.
Rank Roulette (±2.5 places): It wasn't just disagreement between models. Individual models were inconsistent with themselves day-to-day, reshuffling identical résumés by an average of ±2.5 rank places. Yesterday's #2 candidate could easily become tomorrow's #5 with no change to their CV. We even saw one résumé jump from #10 to #1 in 48 hours with Gemini.
Significant Blind Spots (55% Unseen): Perhaps most alarmingly, 55% of the résumés in our talent pool were never shortlisted by any model, on any day. These candidates effectively vanished, an algorithmic blind spot with serious implications for fairness and discovering hidden talent.
Superficial Reasoning (96% Repetitive Phrases): When we looked at why models made their selections, their rationale bullets recycled the same three phrases 96% of the time, signaling minimal deep résumé comprehension.
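
For readers who want to run the same kind of check on their own screening output, here is a minimal sketch of three of the metrics above: shortlist agreement, day-to-day rank shift, and the share of the pool that is never shortlisted. The post does not state the exact formulas used in the report, so treating agreement as Jaccard overlap and rank shift as mean absolute change is an assumption. The function names and the toy data are hypothetical and are not taken from the study.

```python
from itertools import combinations


def shortlist_overlap(a, b):
    """Jaccard overlap between two shortlists (assumed agreement measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_overlap(shortlists):
    """Average Jaccard overlap across every pair of shortlists."""
    pairs = list(combinations(shortlists, 2))
    return sum(shortlist_overlap(a, b) for a, b in pairs) / len(pairs)


def mean_rank_shift(ranking_day1, ranking_day2):
    """Mean absolute rank change for résumés that appear in both rankings."""
    pos1 = {cv: i + 1 for i, cv in enumerate(ranking_day1)}
    pos2 = {cv: i + 1 for i, cv in enumerate(ranking_day2)}
    shared = pos1.keys() & pos2.keys()
    if not shared:
        return 0.0
    return sum(abs(pos1[cv] - pos2[cv]) for cv in shared) / len(shared)


def unseen_share(pool, shortlists):
    """Fraction of the talent pool that never appears on any shortlist."""
    seen = set().union(*shortlists)
    return len(set(pool) - seen) / len(pool)


# Toy data for illustration only (not the study's data).
pool = [f"cv{i}" for i in range(1, 11)]
model_a = ["cv1", "cv2", "cv3"]
model_b = ["cv2", "cv3", "cv4"]
model_c = ["cv3", "cv5", "cv1"]

print(mean_pairwise_overlap([model_a, model_b, model_c]))
print(mean_rank_shift(model_a, model_c))
print(unseen_share(pool, [model_a, model_b, model_c]))
```

Logging these three numbers for every screening run gives a simple baseline for spotting the volatility described above before it reaches a hiring decision.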

Context: Human Benchmarks

While human recruiters aren't perfectly consistent (studies show inter-rater reliability around κ ≈ 0.49), our findings indicate that the 14% LLM overlap is half the human κ band, with twice the volatility. For now, replacing recruiters outright with these off-the-shelf tools would likely lower, not raise, screening reliability.
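
As background on the human benchmark, Cohen's κ measures agreement between two raters after correcting for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two raters making shortlist/reject decisions on the same résumés. The rater data is invented for illustration, and whether the report computed κ in this way is not stated in the post.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item list")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented decisions for ten résumés: 1 = shortlist, 0 = reject.
recruiter_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
recruiter_2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

kappa = cohen_kappa(recruiter_1, recruiter_2)
print(round(kappa, 3))
```

In this toy case the raters agree on 8 of 10 résumés, but because half of that agreement would be expected by chance, κ comes out well below 0.8. Raw overlap percentages and κ are therefore not directly interchangeable.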

The Good, The Bad, and The Ugly

It’s not all bad news. LLMs are great for speed blitzes (109 CVs in < 1 min), template discipline, and first-draft bullets. But the "Bad" includes ranking roulette and vendor lottery, and the "Ugly" points to serious risks: invisible disqualifiers that risk violating the EU AI Act and GDPR, and the difficulty of defending volatile, unexplainable decisions.

Implications & The Path Forward

This instability has profound implications for CHROs, policymakers, vendors, and researchers. CHROs must treat LLMs as copilots, not gatekeepers. However, a practical path forward exists. It's human-in-the-loop, MLOps-heavy, and audit-first. Our report details essential guardrails like programmatic API calls at temperature = 0, structured prompts (rubric-score-sort), and robust MLOps checklists to transform LLMs into "controlled copilots."
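
One way to read the rubric-score-sort pattern is sketched below: the model scores each résumé against fixed criteria, and the ranking is then done in ordinary code with a deterministic tie-break rather than by asking the model for a top ten. This is an illustrative sketch only. The rubric, its weights, and the function names are hypothetical, and `keyword_scorer` is a simple stand-in for a programmatic model call made at temperature = 0, not any vendor's API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical rubric: criterion name -> weight (weights sum to 1).
RUBRIC: Dict[str, float] = {
    "hrbp_experience": 0.5,
    "global_scope": 0.3,
    "stakeholder_management": 0.2,
}


def rank_resumes(
    resumes: Dict[str, str],
    score_criterion: Callable[[str, str], int],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every résumé per criterion, then sort outside the model."""
    totals = {
        rid: sum(w * score_criterion(text, c) for c, w in RUBRIC.items())
        for rid, text in resumes.items()
    }
    # Sort by score descending, then by résumé ID so ties are stable.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_k]


def keyword_scorer(resume_text: str, criterion: str) -> int:
    """Stand-in scorer (assumption): in practice this would wrap an API
    call at temperature 0 that returns a 0-5 score for one criterion."""
    keyword = criterion.replace("_", " ")
    return 5 if keyword in resume_text.lower() else 0


# Invented résumé snippets for illustration.
resumes = {
    "cv_b": "Global scope HRBP experience across EMEA",
    "cv_a": "HRBP experience in retail",
    "cv_c": "Payroll administration",
}

ranking = rank_resumes(resumes, keyword_scorer)
for resume_id, score in ranking:
    print(resume_id, score)
```

Keeping the sort in code means the per-criterion scores can be logged and audited, and a reviewer can see exactly why one résumé outranked another.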

Conclusion

Deploy AI at the speed of innovation, govern it at the speed of risk. LLMs in hiring are too useful to ignore, too unstable to trust blindly. Treat them as you would a brilliant but erratic apprentice: let them draft, sift, and summarise, but never let them sign the offer letter without human review.

This field experiment provides crucial insight for anyone leveraging AI in HR.

➡️ Download the full "LLM Reality Check" report for all the data, charts, and our complete MLOps checklist

📺 Rewatch the webinar, where I go over the experiment and findings in detail.

Take the Eunomia HR QuickScore AI Governance Assessment: Understand your organization's AI governance maturity in minutes.