New research shows that although 55 percent of organizations have released AI-powered applications and features, 52 percent of AI initiatives fail to reach full production. This tension is reflected in user sentiment too, with 40 percent saying that while AI tools boost productivity by more than 75 percent, quality issues are on the rise.
The report from Applause, based on a survey of over 1,000 developers, QA professionals and consumers, finds 40 percent of users experienced AI hallucinations, up from 32 percent in 2025. Additionally, 46 percent say AI misunderstood their prompts — now the most reported issue — while 41 percent say responses lack sufficient detail.
Among other findings 84 percent of generative AI users say multimodal functionality — the ability to process and generate text, images, audio and video — is critical, putting additional pressure on QA teams. The report also finds that scaling AI initiatives, including the two most common — chatbots and customer service tools — remains a challenge.
Although organizations are speeding up the adoption of AI testing techniques to validate new AI products, evaluation by humans remains the most widely used approach, with 61 percent of organizations relying on human input to validate AI performance. Meanwhile, 33 percent use LLM-as-a-judge methods, where multiple models assess AI outputs in parallel to uncover blind spots.
Despite this mix of approaches, testing strategies are still struggling to keep pace with the speed and complexity of AI development, leaving critical gaps in how these systems are validated at scale.
To address this teams are adopting a mix of AI-driven and human-led testing approaches. These include fine-tuning with synthetic (29 percent) and human-generated data (54 percent), human-led (39 percent) and automated (23 percent) red teaming, as well as AI-first testing agents (30 percent) and human-in-the-loop monitoring (31 percent). Human insight remains central to the AI QA process.
46 percent report that human sentiment and usability are the primary factors in determining whether an AI feature is ready for production — far outweighing purely technical benchmarks.
“Testing AI isn’t just about accuracy — it’s about evaluating complex, multimodal outputs at scale,” says Chris Munroe, VP of AI programs at Applause. “LLM-as-judge systems are becoming an important part of that process, but they can’t operate in isolation.
Without human oversight, you risk reinforcing the same blind spots you’re trying to detect. In addition to human-led evals and fine-tuning, structured red teaming by both domain experts and generalists is essential.”































