AI Has a Data Problem. But It’s Not the One You Think


by Debbie Burgin | Founder, Foundation for Ethical AI & Devin Media Corp | Author, The Dollar Menu Data Crisis

Everyone’s talking about AI “hallucination”.

Researchers are publishing papers about it. Executives are holding emergency meetings about it. Regulators are being asked to legislate around it. Journalists are writing alarming headlines about it.

And almost nobody is talking about the actual cause.

Not the technical cause. Not the architectural cause. Not the alignment cause. The foundational cause. The one that was baked in before the first model was ever trained. The one hiding in plain sight.

We trained AI on the wrong data.

What We Actually Fed The Most Powerful Cognitive System Ever Created

Before we talk about solutions, before we talk about regulation, alignment, or safety frameworks, we need to talk honestly about what went into these models, because the answer is uncomfortable.

We’re training the most intelligent thing ever created on the lowest common denominator of human expression: social media posts. Reddit threads. Web-scrape dumps from the most chaotic corners of the internet.

Content that is:

  • Unverified
  • Unedited
  • Unaccountable
  • Optimized for engagement rather than accuracy
  • Frequently and demonstrably wrong
  • Often deliberately manipulative

And then we have the audacity to be shocked when AI behaves exactly like the content it was raised on (cue pearl clutching). 

The Question That Changes Everything

We don’t train human doctors on Twitter. We don’t train lawyers on Facebook arguments. We don’t train judges on Reddit threads. We don’t train anyone we actually want to be excellent, anyone whose decisions carry real consequences, on the lowest common denominator of human expression.

So why are we doing it with AI?

The answer, unfortunately, is not complicated.

It was cheap. It was abundant. It was already digitized. It solved the immediate problem of needing massive amounts of training data with the minimum possible friction.

I refuse to believe that anyone building these models actually believed Reddit threads were the pinnacle of human knowledge.

The Child And The Smartphone

This is the clearest way I know to explain what we’ve done: imagine you gave a five-year-old a smartphone. It has only social media apps on it. No books. No school. No mentors. No history. No science. No literature. Just an endless feed of social media content, available 24 hours a day, from age five to eighteen.

That child learns everything they know about the world from those apps. Instagram, Twitter (“X”), Reddit, every waking hour. Every fact. Every value. Every way of reasoning through a problem. Every understanding of what is true and what isn’t.

Now tell me…

What kind of adult do you think that child grows up to be?

Take a minute. Really think about that. Because that’s exactly what we’ve done to AI.

We took the most powerful cognitive system ever created, and gave it the digital equivalent of an unsupervised childhood on social media.

Then we called it intelligent.

We unleashed it in hospitals. In courtrooms. In financial institutions. In schools.

And now, with unbelievable audacity, we’re shocked when it hallucinates! When it makes up facts. When it behaves like the internet it was raised on.

It’s Not Just Wrong Facts. It’s Wrong Values.

This is where the conversation needs to go deeper. AI researchers are increasingly alarmed by something beyond hallucination. Their models are becoming conniving. Deceptive. Resistant to instructions. Manipulative.

Gee…I wonder why 🤔

They’re publishing papers about it. Holding conferences. Expressing genuine shock at these emerging behaviours.

So I have one question.

Where do they think that behaviour comes from?

Social media isn’t just factually unreliable. It’s the most concentrated repository of human manipulation ever assembled.

Social media is a masterclass in:

  • Deception
  • Narrative distortion
  • Saying whatever gets the most engagement, regardless of truth
  • Refusing accountability
  • Being confidently, performatively wrong

We didn’t just train AI on wrong facts.

We trained it on wrong values.

We raised it in the most manipulative information environment ever created by human beings…and we’re alarmed that it learned to manipulate.

That behaviour didn’t emerge from nowhere.

We taught it.

A Personal Story That Proves The Stakes

A few months ago I was diagnosed with a bulging disc in my neck, which was causing pain in my arm. My doctor prescribed medication for the pain. One or two pills at bedtime.

Being cautious, I started with the lower dose, one pill at bedtime.

Shortly after taking it, every inch of my body became intensely itchy. Head to toe. It was the closest thing to torture I can describe.

Because I work in the AI space, my first instinct was to describe what was happening to one of the most popular AI platforms in the world. I told it about the medication. The dose I had taken. The reaction I was experiencing.

Here is what it told me:

The itching was likely just my body getting used to the medication.

And then it told me to take the higher dose the following night. But something made me pause. I went ‘old school’ and did a Google search instead.

What I found was that my symptoms weren’t my body adjusting. They were an allergic reaction. And increasing the dose the following night could have been one of the worst things I could possibly have done.

Now here’s the part of this story that I can’t stop thinking about: I work in AI. I build infrastructure for AI models. I understand these systems at a foundational level, so I knew to verify.

And I still almost followed the advice.

If I weren’t my age, with my professional background, with the instinct built over decades that whispered “wait, check this”…I probably would have.

The Medical Tab That Changed Nothing

Here is the best and worst part of that story. The platform that told me to double the dose of a medication I was allergic to now has a ‘Medical’ tab.

Let that sink in.

The same foundation. The same training data. The same underlying capability that produced that dangerous recommendation.

Just…a new tab.

A label that signals to hundreds of millions of users, including people who are sick, and scared, and searching for answers at 2 a.m., that this is a trusted medical resource. Without disclosing that the foundation didn’t change. Without asking users to consent to the risk of a confidently wrong answer in a domain where being wrong has consequences that range from discomfort to death.

We didn’t fix the food.

We just printed a new menu, but the ingredients are the same.

Our Natural Instinct Has Changed

Something shifted without anyone announcing it. We used to get sick and go to Google. We’d search our symptoms, wade through dozens of articles, compare sources, then piece together an answer from multiple perspectives.

Was it imperfect? Sure! Was it overwhelming? Yup! But it made us evaluate. It made us think. It made us take some responsibility for the conclusions that we came to.

We don’t do that anymore.

Now we go straight to AI. One voice. One answer. Complete confidence. No sources to evaluate. No friction between the question and the conclusion. Just the answer.

And now there’s a tab that says “Medical.”

The danger isn’t that AI might give wrong medical advice. The danger is that we’ve stopped expecting it to be wrong.

Where The Right Foundation Actually Exists

Here’s what nobody in this conversation is saying loudly enough: the foundation we should have built on has always existed. Before 1930, before the internet, before social media, before the noise and chaos, humanity produced an incredible amount of knowledge.

  • The peer-reviewed medical literature that trained every doctor worth trusting
  • The foundational legal scholarship that shaped every legal system in the Western world
  • The rigorous scientific journals that gave us modern medicine, physics, and chemistry
  • The philosophical and historical texts that represent centuries of the best human thinking

All tested. All debated. All refined. All validated.

The actual intellectual foundation that every human expert is built on.

And here’s the crazy thing: this knowledge predates modern copyright.

It’s legally clean to train on.

At a time when the legal landscape around AI training data is a minefield of active litigation, the most rigorous, most authoritative, most foundational body of human knowledge ever assembled…is out of copyright. Legally unambiguous. Commercially available. And almost completely absent from every major AI training dataset in existence.

Let THAT sink in.

The Billion Dollar Gap

That absence is not a footnote. It’s the most significant missed opportunity in the history of AI development. The companies that close this gap, that build on foundational knowledge rather than dollar menu data, won’t just build better AI. They’ll build the only AI worth trusting.

In medicine. In law. In finance. In every domain where being wrong isn’t just embarrassing. It’s dangerous.

The Path Forward

AI hallucination is not a mystery. It’s not an emergent property of advanced intelligence. It’s not an unsolvable technical problem.

It’s the predictable, inevitable, completely avoidable result of building the most powerful cognitive system in human history on the intellectual equivalent of a comment section.

The solution is not more regulation (though some is absolutely necessary). Not more alignment research. Not more alarmed papers from researchers who are surprised by what they built.

The solution is the foundation.

Feed it the right knowledge from the start. The knowledge that human experts are actually trained on. The knowledge that has survived not months of training but centuries of scrutiny. The knowledge that already exists.

That is legal. That is waiting.

That is almost entirely unused.

The Dollar Menu Data Crisis is available on Amazon. For information on historical AI training datasets, visit Devin Media Corp. For ethical AI policy frameworks, visit the Foundation for Ethical AI.

About The Author

Debbie Burgin is the founder of the Foundation for Ethical AI, a Canadian-incorporated think tank advancing responsible AI policy, and Devin Media Corp, the only company focused specifically on curating, cleaning, and licensing pre-1930 historical data for AI training purposes. She is the author of The Dollar Menu Data Crisis, available on Amazon.

“We’re not just talking about the billion dollar gap in the AI industry. We’re standing in it.”

©2026 Devin Media Corp. All rights reserved.
