
Beyond prompts: Why enterprise AI demands context engineering

September 24, 2025 · 10 min read

David Pan

Director - Industry Practice Lead for Asia Pacific

Enterprises have begun to discover what the GenAI hype can obscure: large language models are convincing but inconsistent unless fed the right data. Verbal polish alone cannot support decisions or manage risk, and in enterprise settings especially, sounding right is not the same as being right. Markets move on data and analysis; a misplaced figure, a stale disclosure, or a hallucinated data point can make the difference between sound judgment and costly error.

That’s why the true differentiator in enterprise-grade GenAI isn’t style, but substance — specifically, context engineering: the structuring, selection, and delivery of the right data into an AI system’s context window at the right moment. Without it, models are more likely to hallucinate, miss critical signals, or provide generic answers unfit for high-stakes decision-making.

In this post, we’ll explore the nature of the context window, why it matters, the risks of neglecting it, and how firms with comprehensive, rigorously maintained data — like Moody’s — are uniquely positioned to turn context into a competitive moat.


What is the needle in the haystack problem?

One way to test an AI platform’s reliability is by using the “needle in a haystack” benchmark, which measures how well a model can retrieve a precise fact (the needle) buried inside a mass of irrelevant text (the haystack).

Initially, the challenge was that models had very limited context windows. They could only “see” a few thousand tokens at once. That constraint is easing, with some systems now stretching to hundreds of thousands or even a million tokens. But researchers have discovered that size alone doesn’t solve the problem. When you flood a model with too much data, it starts to lose focus. It’s akin to humans reading a thousand-page book: they won’t recall every word, only the passages that stand out.

Enterprises handle millions of documents – thousands of pages of disclosures, constant regulatory updates, and real-time market feeds. Hidden within that flood are the small but decisive pieces of information that are relevant for each particular query.

Models have fixed context windows, meaning they can only “see” a limited amount of text at once. Even as window sizes grow, without careful curation irrelevant or outdated material can fill that window, and crucial insights might remain buried.

This is what we mean by the needle in the haystack problem: relevant data “needles” can be camouflaged within masses of data “haystacks,” and there is only so much the model can see. So if the problem is finding needles in mountains of hay, the smarter solution is not more frantic, brute-force searching, but a system that sorts, filters, and organizes the hay before the search even starts.

Enter context engineering

Context engineering is the discipline of structuring, prioritizing, and supplying the model with the right information in the right way at the right moment.

Language models are generalists. Left on their own, they draw on their training data, which is vast, but frozen at the point of training. Without context engineering, even the most sophisticated model might reference misaligned information (“last year’s interest rates” in a live risk scenario), hallucinate plausible but false facts, or provide generic answers that are technically not wrong but are devoid of any actual substance.

Context engineering is what turns a general-purpose model into a domain-specific one that can potentially solve actual enterprise problems — not because the model has changed, but because the information it sees has been carefully selected, filtered, and sequenced.

Context engineering is a layered practice:

Data foundation: Raw data can come in many messy forms: scanned PDFs, unstructured analyst notes, tables buried in filings. Context engineering starts with ingestion pipelines: OCR (to read scanned text), table parsing (to interpret financials), and metadata tagging (to capture who, when, and what). Clean, structured, and relational data is non-negotiable.

Retrieval-augmented generation (RAG): A model can’t hold an entire corporate data estate in its working memory. Even with context windows expanding to 1 million tokens, that’s still only about 1.3x the length of War and Peace, far short of the data an enterprise needs to hold. RAG bridges the gap by “chunking” processed documents, embedding those chunks in vector databases (numerical, spatial representations of the data), and retrieving only the most relevant chunks for a given question. Retrieval isn’t naïve keyword search: it can combine semantic similarity, hierarchical trees, or re-ranking methods to surface the passages that matter most.
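
To make the mechanics concrete, here is a minimal retrieval sketch in Python. The bag-of-words “embedding” and the toy chunks are stand-ins of our own invention; a real pipeline would use a learned embedding model and a vector database:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: a bag-of-words count vector. A real RAG pipeline
    # would call an embedding model and store vectors in a vector database.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the query and keep only the
    # top-k, so the context window holds the most relevant evidence.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Issuer X refinanced its senior debt in Q3.",
    "The cafeteria menu changes weekly.",
    "Issuer X's credit rating was affirmed after the refinancing.",
]
print(retrieve("What happened to Issuer X's debt?", chunks, k=2))
```

Only the two relevant chunks reach the model; the cafeteria note never consumes context-window tokens.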

Chunking means splitting large documents into collections of tokens/words, which are then indexed. Splits can follow semantic breaks, paragraph structures, or topic shifts so that retrieved chunks are coherent. The trick is to balance granularity (fine enough to isolate detail) with context (broad enough to make sense).
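
A minimal chunker along these lines might pack paragraphs into token-budgeted chunks. The sketch below is illustrative: it counts tokens with a whitespace split, where a production system would use the model’s own tokenizer:

```python
def chunk(text: str, max_tokens: int = 50) -> list[str]:
    # Split on blank lines (semantic breaks), then pack whole paragraphs
    # into chunks that stay under the token budget, so each retrieved
    # chunk remains a coherent unit.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())  # crude token count: whitespace split
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Raising `max_tokens` trades finer granularity for broader context, which is exactly the balance described above.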

Knowledge graphs and metadata: Adding structure improves retrieval. A credit report linked to an issuer, sector, and geography can be pulled more accurately than one floating in isolation. Increasingly, tools like GraphRAG can layer on a graph-based understanding of entities and relationships, connecting the dots across documents so critical context isn’t lost.

Safeguards: For enterprises, there may be little room for “creative” answers. Safeguards filter or constrain the model’s outputs, helping agents stay on task and minimizing drift into irrelevant conversations that could violate policies or introduce misaligned behaviors. These checks enforce consistency with the ground truth.
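
A safeguard layer can start with deterministic checks around the model. The sketch below is illustrative only: the blocked-topic list and the figure-verification rule are hypothetical policies, not a production guardrail:

```python
import re

BLOCKED_TOPICS = {"investment advice", "legal advice"}  # hypothetical policy list

def guard(user_query: str, draft_answer: str, evidence: list[str]) -> str:
    # Two illustrative checks: refuse out-of-scope topics, and refuse
    # answers containing figures that never appear in the retrieved evidence.
    if any(topic in user_query.lower() for topic in BLOCKED_TOPICS):
        return "Refused: this assistant does not provide that type of guidance."
    corpus = " ".join(evidence)
    numbers = re.findall(r"\d[\d,.]*\d|\d", draft_answer)
    if any(n not in corpus for n in numbers):
        return "Refused: figures in the draft could not be verified against sources."
    return draft_answer
```

Production systems layer many more checks (PII filters, policy classifiers, citation requirements), but the pattern is the same: the model’s draft passes through gates before it reaches the user.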

Evaluation: Context engineering is not a one-and-done exercise. Outputs are continuously scored for “groundedness” (are they supported by retrieved evidence?), relevance (do they answer the user’s question?), and factual accuracy. These evaluation loops improve the system over time.
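
A toy groundedness scorer illustrates the shape of that evaluation loop. Real evaluators typically use NLI models or LLM judges; this purely lexical check is just a sketch:

```python
import re

def groundedness(answer: str, evidence: list[str]) -> float:
    # Fraction of answer sentences whose content words all appear in the
    # retrieved evidence. A sentence with any unsupported word counts as
    # ungrounded under this (deliberately strict) lexical criterion.
    corpus = set(re.findall(r"[a-z0-9]+", " ".join(evidence).lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    grounded = sum(
        1 for s in sentences
        if (words := set(re.findall(r"[a-z0-9]+", s.lower()))) and words <= corpus
    )
    return grounded / len(sentences)
```

Scores like this, tracked over time, are what let teams detect regressions when data pipelines or prompts change.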

Prompt vs context

When people talk about “getting good at AI,” they often mean prompt engineering: phrasing a request so the model interprets it correctly. That’s useful, but in enterprise settings it’s like asking a brilliant analyst to answer a question without giving them the files they need.

Let’s draw the distinction more clearly:

Prompt engineering adjusts how you ask: “Summarize this earnings call in plain English,” “Answer as a risk analyst,” “Give me a bullet-point briefing.”

Context engineering helps provide the model with what it needs to know: the correct transcript, the latest financials, and expert commentary — not an outdated or irrelevant document.

Put simply: prompt engineering controls style; context engineering controls substance.

The context window: size isn’t everything

Large language models (LLMs) operate with a finite context window — their working memory. Think of it as the maximum number of words, numbers, and symbols the model can consider at once.

To give an idea of current numbers:

  • GPT-4o: ~128,000 tokens (≈500 pages) (OpenAI)
  • GPT-4.1: 1 million+ tokens
  • Claude 3.5 and 3.7: up to 200,000 tokens (Anthropic)
  • Claude 4: up to 500,000 tokens
  • Gemini 2.5: 1 million+ tokens (Google)

At first glance, it seems simple: the bigger the window, the better the performance should be. But the architecture under the hood — the transformer — makes things more complicated.

Transformers rely on a mechanism called self-attention. At every step, the model compares each token (a word or a part of a word) with every other token in the context window to decide which ones matter. This produces an N×N grid of “attention weights,” where N is the number of tokens.

  • If the model sees 100 tokens, it must compute 100×100 = 10,000 comparisons.
  • With 10,000 tokens, that explodes to 100 million comparisons.
  • Double the context, and you roughly quadruple the compute.

This is why context windows don’t scale easily. Expanding from 100,000 to 1 million tokens isn’t just “ten times harder” — it’s closer to a hundred times harder.
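
The arithmetic behind that quadratic scaling is easy to verify:

```python
def attention_comparisons(n_tokens: int) -> int:
    # Self-attention compares every token with every other token: N x N.
    return n_tokens * n_tokens

for n in (100, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_comparisons(n):,} comparisons")

# Growing the window 10x (100k -> 1M tokens) multiplies the attention
# cost by 10**2 = 100x, which is the "hundred times harder" above.
```
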

Two further problems emerge:

  1. Attention dilution: As the grid grows, weights tend toward being spread thinly across all tokens. The model can lose focus on the truly relevant parts — like searching for a quote in your 1,000-page book and finding your highlighter ink is running dry.
  2. Training limits: Most human-written text comes in chapters, papers, or articles — not million-token stretches. There simply isn’t much natural long-form data to train on, so models struggle to generalize when it comes to ultra-long contexts.
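
Attention dilution can be illustrated with a toy softmax: give one “needle” token a fixed logit advantage over every distractor, and watch its attention weight shrink as the haystack grows. The margin value here is arbitrary, chosen only for illustration:

```python
import math

def softmax_weight_on_needle(n_tokens: int, margin: float = 5.0) -> float:
    # One 'needle' token gets a logit `margin` above all other tokens
    # (whose logits are 0). Softmax weight on the needle:
    #   e^margin / (e^margin + (N - 1) * e^0)
    return math.exp(margin) / (math.exp(margin) + (n_tokens - 1))

for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,} tokens: needle weight = {softmax_weight_on_needle(n):.6f}")
```

Even with the same relative advantage, the needle’s share of attention collapses as distractors multiply — the quantitative face of “losing focus.”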

The takeaway: bigger context windows help, but they are not a panacea. They are costly at inference time, tricky to train for, and quality can degrade quickly over long conversations. For enterprises, intelligent system design around context engineering can let even smaller language models surpass the largest frontier ones. The context window thus becomes prime real estate: every token must be selected carefully, and that’s exactly why context engineering matters.

The consequences of ignoring context – and why it’s so hard to get right

On paper, context engineering sounds simple: give the model the right information, in the right form, at the right time. In reality, the size, diversity, and speed of enterprise data make it one of the toughest challenges in enterprise adoption of GenAI today. And when that challenge is ignored, the effects can be swift: trust can erode, outputs may weaken, and even the most advanced models risk becoming less reliable. As a result, user adoption and user trust can steadily decline.

Risks of neglecting context:

Context rot: Data pipelines will almost inevitably accumulate stale or irrelevant information. An outdated credit report or last year’s earnings can linger alongside live inputs, and the model treats them all as equally valid. The danger here is insidious: the outputs look polished and authoritative, but they are quietly misleading, which is sometimes worse than silence.

Context bloat: In the absence of careful curation, teams often take the “just in case” approach and flood the context window with everything they can find. But more is not always better. Overloading the window with every possible source reduces signal-to-noise and hits token limits without surfacing the key facts.

Incoherence: Models frequently draw on multiple sources, and if those sources are misaligned — a six-month-old risk report mixed with today’s market feed, for instance — contradictions may emerge. The AI may hedge or hallucinate in an attempt to reconcile the conflict, leaving users with confused or contradictory outputs.

What makes these pitfalls particularly challenging is that they are structural. They cannot be fixed with cleverer prompts or minor tweaks. The root causes are baked into the nature of enterprise data:

Size: Moody’s alone ingests thousands of documents a month. The real problem is not storage but retrieval: finding the two or three critical lines hidden among thousands of pages and surfacing them quickly enough for real-time decision-making. This is less like finding a needle in a haystack than finding one in a warehouse of haystacks, constantly replenished by the truckload.

Diversity: Enterprise data rarely arrives neatly packaged. It comes as structured tables, free-form analyst notes, PDF filings with embedded charts, scanned documents that require OCR, and noisy transcripts of earnings calls. Models are optimized for clean text; they need help navigating this chaos. Context engineering begins by transforming messy, multimodal inputs into structured, machine-readable form.

Timeliness: Markets move by the minute, and yesterday’s data can be worse than useless if it contradicts current reality. A refinancing reported today invalidates last week’s debt ratios. For clients, the expectation is clear. Every AI-generated response must reflect the latest available evidence.

In many cases, meeting these key demands requires infrastructure, not prompt cleverness. It means robust ingestion pipelines, vector databases for semantic search, multi-agent systems that can divide complex requests into steps, and evaluation frameworks that score every output for accuracy and groundedness. And because this is finance, everything should be auditable and traceable.

This is why context engineering is both difficult and unavoidable. Without it, enterprise AI is not possible. 

The new AI stack

The first wave of enterprise AI revolved around prompts. Teams built guides on how to phrase requests — “act as a financial analyst,” “summarize in bullet points,” “explain as if to a regulator.” Helpful, yes, but limited at scale. This focus on surface-level interaction missed the deeper machinery underneath.

A good metaphor is a car. Prompts are the driver: they set the intent and direction, telling the system where to go. But a driver without fuel won’t get anywhere. Context is the fuel: the refined input that powers the engine, giving the system the energy to act.

And crucially, fuel quality matters. High-grade, well-refined fuel keeps the engine running smoothly; contaminated or low-grade fuel causes misfires, inefficiency, even breakdowns. The same is true of AI. Context that is curated, timely, and structured produces sharper, more reliable outputs. Context that is stale, noisy, or irrelevant can lead to hallucinations and errors.

Delivering that kind of “high-grade fuel” for AI means building the right systems behind the scenes. In practice, it often requires:

Data pipelines that ingest and structure messy, multimodal inputs.

RAG frameworks that decide which fragments to surface and in what sequence.

Evaluation loops to score answers for accuracy, groundedness, and relevance.

Governance layers that log, audit, and explain context choices for regulators.

Put differently: the new stack integrates prompting as just one layer in a much larger system. Context engineering is the core that makes GenAI more reliable, repeatable, and safer to deploy at enterprise level.

For firms, this reframing matters. It shifts AI strategy from “getting better at asking” to “getting better at feeding.”

Context as competitive moat

As large language models improve, they also converge. GPT, Claude, Gemini, and their peers now offer broadly similar capabilities, available via low-cost APIs. That means access to the latest frontier AI capabilities is available to everyone, from Fortune 500s to individuals vibe coding at home. Launching a product on a state-of-the-art model is no longer a competitive advantage.

What remains valuable — and defensible — is context. Superior context engineering can lead to high quality and consistent outputs that actually deliver enterprise value. Inferior context engineering means that AI projects might never make it to production, or worse, fail to attract any internal users due to poor performance after millions spent.

As newer ways of utilizing language models are introduced, such as the rise of agentic AI, context becomes even more critical. These systems will not simply answer questions but carry out workflows, “reasoning” across multiple steps. Their success may largely hinge on the reliability of the context they retrieve at each stage. A system that pulls stale or contradictory evidence will likely fail, no matter how sophisticated its reasoning engine.

At the same time, governance pressures are rising: people demand transparency — not just what the AI said, but why it said it. That means provenance, auditability, and validation of the inputs as much as the outputs. Without robust context pipelines, enterprises may struggle to meet these demands.

The Moody’s advantage

If context is the scarce resource, Moody’s begins from a position of unusual strength. Its advantage rests on three pillars: a reputable data estate, global domain knowledge, and enterprise-grade architecture.

Moody’s data estate is comprehensive. Decades of Moody’s Ratings credit rating histories, credit events, Moody’s risk metrics, and Moody’s Ratings research provide auditable and verifiable sources of information. Unlike the sprawling, unverified data often scraped into generic models, Moody’s corpus is curated and standardized.

Layered on top is domain knowledge. Moody’s Ratings analysts continuously monitor credit ratings. Their insights don’t sit separately from the data; they are embedded within it. This helps to ensure that when an AI system draws context from Moody’s pipelines, it’s not just pulling raw numbers in isolation; it’s also calling up information curated by analyst experience.

The third pillar is architecture. Moody’s has invested in retrieval-augmented generation pipelines that fetch and rank relevant fragments across formats, vector search systems tuned for precision, and auditable workflows that can orchestrate multiple agents, agentic reasoning, and tool use.

Taken together, these three strengths create what might be called Moody’s-grade context: high-precision, low-latency, and auditable. It is not just data, but data engineered to inform and support decisions.

Conclusion

Enabling generative AI to succeed in the workplace requires more than prompts. Context engineering is what can set proofs-of-concept apart from AI production systems that enterprise users will actually adopt.

In finance and other regulated domains, AI must strive to meet higher standards: outputs must aim to be accurate, current, and defensible. That is why context engineering is foundational to credible, high-stakes AI systems.

At Moody’s, our comprehensive data estate, deep domain knowledge, and enterprise-grade infrastructure can help provide the building blocks for context engineering that supports decision-making. Together, these strengths can produce outputs that are precise, timely, and transparent — what we call Moody’s-grade context.

That level of context is the fuel that powers more reliable AI solutions. Prompts set the direction. Context propels the journey — determining both the distance AI can cover and the confidence with which it arrives. Enterprises that master both may very well define the next era of decision intelligence.

About the author:

David Pan
is a Director - Industry Practice Lead for Asia Pacific at Moody’s and is responsible for exploring innovative applications of Moody's data exposed through GenAI. He advises organizations across the region on distinguishing hype from reality, identifying practical use cases, and guiding effective adoption strategies.

Before joining Moody’s, David led generative-AI solution design, development handbooks, and supervised production deployments that drove measurable business impact. He has held leadership roles in Professional Services, Solution Architecture, and Business Development across financial crime & compliance, fraud & identity, and data science consulting.

David holds an Executive MBA from INSEAD.


Learn more about Moody's Agentic solutions

Moody’s Agentic Solutions leverage advanced AI to automate and optimize high-value processes like credit assessment, portfolio monitoring, KYC screening, and sales intelligence, powered by Moody’s comprehensive foundation of financial data and content.

 
