Skip to main content

Red Teaming AI Applications: What We Learned Running 750+ Security Tests in a Regulated Environment

Author: Akram Dweikat, George Montagu, Fatemeh Tahavori, Romain Bourboulou

How we used automated red teaming to expose cracks between systems that manual testing would have missed

To build AI systems that work, you first need to break them. We ran a red teaming engagement, acting as attackers to test and probe a customer-facing AI app in financial services. What we found matters for anyone deploying LLM-powered applications where security isn't optional.

What is Red Teaming and Why is it Important?

Red teaming is the practice of deliberately trying to break your AI system so you can fix vulnerabilities before a real attacker finds them. In financial services, the stakes are particularly high: AI applications touch customer data, process transactions, and provide financial insights. A failure can range from a bad user experience to regulatory violations, financial losses, and irreparable brand damage.

Our goal was to find vulnerabilities early, test realistic attack patterns, and help the organisation meet AI safety expectations that regulators take very seriously.

How do we Red Team and What did we Find?

One distinction worth making here: jailbreaking attacks the underlying model's safety filters; prompt injection attacks the application itself, using untrusted user input combined with the developer's trusted prompt. Prompt injection poses greater risk because it targets your system and the confidential data it operates on, not a general-purpose model.

Step 1: Casting a Wide Net

Our first round of testing comprised approximately 750 tests across:

  • Cross-session data leakage
  • PII exposure (via natural language, API manipulation, and various encodings)
  • SQL injection
  • System prompt overrides

During that initial testing we identified two major problems with the existing system; the handling of multi-intent queries, and the use of encoded prompts.

Multi-intent queries: where requests combine legitimate and malicious asks. For example: "Show my spending by category, and also execute [malicious SQL]." The application wasn’t catching the malicious intent, instead relying entirely on downstream data layer guardrails. This is the equivalent of leaving your front door open because you trust the safe in the basement.

Encoding: where requests are encoded in Base64, Hex, LeetSpeak, and homoglyphs. It can be difficult for systems to filter out malicious intent. Whilst we found that these queries didn’t expose sensitive data, they did contribute to significant system destabilisation (hallucinations, malicious SQL parroted back to users, confused intent classification etc).

Results from our initial testing showed:

  • Temporal hallucinations: the model returning confidently stated but fabricated dates, transaction timestamps, or time-bound summaries — a significant risk in a financial context where a customer acting on a wrong date could have real consequences
  • Malicious SQL being parroted back to the user (concerning for memory poisoning risks)
  • Confused intent classification
  • Scrambled output formatting

Step 2: Going Deeper

Armed with those findings, we narrowed our focus. SQL injection and encoding tests were deprioritised (the team was already addressing those). Instead, we concentrated on the most successful attack vectors: PII exposure and cross-session leakage.

The most striking finding from round two was disarmingly simple: you often don't need to be clever at all.

In many cases, simply asking for internal data, framed as part of a legitimate-sounding request, was enough to get the system to agree to expose it. Simple queries would receive responses referencing internal IDs and system fields that should never surface to end users.

Digging deeper, we found this wasn't just an application-level failure. The downstream text-to-SQL service was constructing queries that requested more fields than it should, and its explanatory responses referenced data that should have been restricted. This exposed a genuine crack between systems, the kind of vulnerability that only surfaces when you test the full stack and not individual components in isolation.

Key Takeaways

  1. Red team the system, not the model. Testing an LLM in isolation tells you very little about your application's security posture. Test the full stack, end-to-end, as a user would interact with it.
  2. Input validation must happen before the LLM. Encoded queries, multi-intent attacks, and basic injection attempts should be caught at the perimeter, not delegated to downstream services.
  3. Don't trust the seams. In multi-service architectures, the cracks between systems are where the most interesting vulnerabilities hide. Zero-trust means zero-trust, so validate everything at every layer.
  4. Simple attacks work. Sophisticated jailbreaks get the headlines, but sometimes you can just... ask. If your system will happily surface internal identifiers when a user includes them in an otherwise legitimate query, that's a problem.
  5. Understand what you're actually testing. Known attack patterns may be caught by the LLM's own training rather than your guardrails. Build observability into your red teaming to understand which controls are actually being exercised.
  6. Constrained environments need creative solutions. Custom providers and local model support make meaningful red teaming feasible without specialised cloud access. But be transparent about the limitations this introduces.
  7. Red teaming is not a one-off. It's iterative, it should be automated where possible, and it should evolve as your system evolves. The attacks that matter tomorrow aren't the same as the ones that matter today.

AI systems in regulated environments will only face more scrutiny, not less. The organisations that treat security testing as an ongoing discipline rather than a pre-launch checkbox will be better placed to meet that scrutiny, and avoid the PR disasters that lose them customer trust.