Data · Apr 14, 2026

Your Data Is Not Ready for AI. Here Is What to Do About It.

Most AI projects do not fail on the model. They fail on the data.

A frontier model is a commodity. Any team with an API key and a credit card can use one. The part that is not a commodity is the state of your data. That is where almost every stuck project is actually stuck, even when the conversation is about prompts, agents, or fine-tuning.

The honest assessment of most enterprise data, including in the federal environments we work in, is that it was never designed for machine consumption. It was designed for a person to read. That difference is not trivial. It is the whole problem.

What "not ready" actually looks like

The same five issues show up in nearly every audit we run. None of them require a consultant to spot. They just require someone to say them out loud.

Fields that mean different things in different systems. A customer ID in one platform is a contract number in another. Nobody wrote that down. Everyone knew.

Free-text fields doing the work of structured ones. Status, category, owner, priority. All typed by humans. All inconsistent. All filtered on in reports.

Silent schema drift. A source system added a column two years ago. Downstream extracts never picked it up. The data has been quietly incomplete ever since.

Access by exception. The data exists, but pulling it requires a ticket, a VPN, a named human, and a calendar. Pipelines cannot run on that.

History that does not go back far enough. The model needs two years of labeled examples. You have six months, and the labels changed definition halfway through.
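Most of these issues are cheap to detect once someone decides to look. Schema drift, for example, is a one-query check: compare the columns the source actually has against the columns the extract pulls. A minimal sketch using SQLite; the table, the column set, and the extract are all hypothetical stand-ins for your own systems:

```python
import sqlite3

# Hypothetical source table: a column ("region") was added after the
# extract query was written, so the extract has silently missed it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, region TEXT)")

EXTRACT_COLUMNS = {"id", "status"}  # what the nightly extract pulls today

def drifted_columns(conn, table, extracted):
    """Return source columns the downstream extract never picked up."""
    live = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return sorted(live - extracted)

print(drifted_columns(conn, "orders", EXTRACT_COLUMNS))  # -> ['region']
```

Run on a schedule, a check like this turns "quietly incomplete for two years" into an alert on day one.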

The fix is not a data lake

The reflex when someone says the data is not ready is to buy a platform. A warehouse. A lake. A lakehouse. A governance suite. Another vendor on another license.

That is the wrong move at the wrong time. Platforms solve a scale problem. You do not have a scale problem yet. You have a clarity problem. Buying infrastructure before you have clarity just gives you expensive unclear data in a new location.

The fix is narrower and cheaper. Pick the one workflow the AI system is going to serve. Map the fields it actually needs. Get those fields into a single clean source, even if that source is a flat table updated nightly. Nothing more.
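That flat table does not need a platform behind it. A sketch of the whole idea, again in SQLite; the two source tables, the join on a customer ID that doubles as a contract number elsewhere, and the output table name are all hypothetical:

```python
import sqlite3

# Hypothetical: two source systems feeding one workflow. The deliverable
# is a single flat table holding only the fields that workflow needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_accounts (account_id TEXT, status TEXT);
    CREATE TABLE billing (contract_no TEXT, amount REAL);
    INSERT INTO crm_accounts VALUES ('A-100', 'active');
    INSERT INTO billing VALUES ('A-100', 1250.0);
""")

def refresh_flat_table(conn):
    """Rebuild the one clean source the AI system reads. Run nightly."""
    conn.executescript("""
        DROP TABLE IF EXISTS workflow_flat;
        CREATE TABLE workflow_flat AS
        SELECT a.account_id, a.status, b.amount
        FROM crm_accounts a
        JOIN billing b ON b.contract_no = a.account_id;
    """)

refresh_flat_table(conn)
print(conn.execute("SELECT * FROM workflow_flat").fetchall())
# -> [('A-100', 'active', 1250.0)]
```

Note that the join condition is also where the "customer ID here, contract number there" knowledge finally gets written down, in code, where a pipeline can use it.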

Minimum viable data

There is a version of your data that is good enough to ship a production AI system. It is almost always smaller than you think. We look for four things.

A stable key. Something that identifies the entity and does not change across systems. If you do not have one, the first piece of work is manufacturing one.

A source of truth per field. One system owns status. One owns owner. One owns value. Not three systems racing to be wrong at the same time.

Access by pipeline, not by person. A service account, a credential, a schedule. No human in the path for routine reads.

Labeled examples that match the question. If the model is going to classify, you need clean labels. If it is going to retrieve, you need the retrievable units separated and tagged. The mismatch between available labels and the task at hand kills more projects than model choice ever has.
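When no stable key exists, one common way to manufacture one is a deterministic surrogate key: normalize the identifying fields, then hash them, so every system derives the same key for the same entity independently. A sketch; the normalization rules here (lowercase, strip whitespace) are assumptions you would tune per dataset:

```python
import hashlib

def stable_key(*fields):
    """Deterministic surrogate key from normalized identifying fields.

    Hypothetical scheme: lowercase, strip whitespace, join with a
    separator that cannot appear in the data, then hash. The same
    entity yields the same key regardless of which system produced it.
    """
    normalized = "\x1f".join(str(f).strip().lower() for f in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

# The same entity, typed differently in two systems, maps to one key.
assert stable_key("ACME Corp ", "VA") == stable_key("acme corp", "va")
```

The key is only as stable as the fields you feed it, so pick fields that do not get retyped, renamed, or merged.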

What we do in week one

On every engagement, the first seven days are a data audit. Not a strategy session. Not a capability review. We sit with the people who actually pull the data and ask them to show us.

What we find in that week determines what gets built. If the data is further along than expected, the scope of the system grows. If it is further behind, the scope narrows to what the data can actually support, and closing that gap becomes the first phase of the project.

The teams that ship are the ones willing to have that conversation up front. The ones that do not ship are the ones that spent six months pretending the data was ready and then discovered it was not, in production, on a Friday afternoon.

The short version

You do not need a better model. You need a smaller, cleaner slice of your own data, pointed at a single well-defined question, with a pipeline that can run without a human in the loop. Everything else is a distraction until that is true.

Want an honest read on your data?

30-minute call. We will tell you what is ready, what is not, and what the shortest path to a production system actually is.