Imagine a colleague drops a new textbook on your desk and says: "This is going to transform student learning." You'd probably ask: has it been tested? With which students? Over how long? Did the results hold up when you took the book away?
We ask those questions about textbooks. We're not asking them nearly enough about AI.
That's the quiet, clarifying message in a major new report from Stanford's SCALE Initiative. The Evidence Base on AI in K-12: A 2026 Review isn't an anti-AI manifesto. It's a careful audit of what we actually know — and as it turns out, knowing what we don't know is exactly where teachers should start.
The SCALE team analyzed more than 800 academic papers on AI in K-12 education. Of those, they identified just 20 high-quality causal studies, the kind that can actually tell you whether a tool changed outcomes rather than just correlating with them. Twenty. Out of more than 800. That's under 2.5 percent.
That gap matters. A lot of what's being said about AI in classrooms — that it boosts engagement, improves writing, accelerates math practice — is based on studies that track what happens while students are using the tool. Far fewer studies check whether those gains persist once the tool is gone.
And when researchers do look at AI-free assessments, the picture gets complicated. Some students show real learning gains. Others stay flat. In some cases, performance actually declines. The report frames this as the central question teachers should be sitting with: are AI tools helping students complete tasks, or helping them develop durable skills?
Those are different things. And right now, the research can't always tell them apart.
Here's where the Stanford report gets genuinely useful. Not all AI tools work the same way, and the evidence makes a meaningful distinction between them.
Tools designed with what researchers call pedagogical guardrails (systems that offer hints, prompt reasoning, and guide students toward answers rather than handing them over) show more promising results than general-purpose chatbots that answer questions directly. A tutoring tool that responds to a stuck student with "What do you already know about this?" is doing something different from one that writes the paragraph for them.
That distinction is something teachers can actually act on. When you're looking at an AI tool, it's worth asking: does this tool make students think harder, or does it make thinking optional?
The Stanford review is honest about what we don't yet know. There are no high-quality causal studies of student AI use conducted in U.S. K-12 classrooms. Most research looks at short-term outcomes. Almost nothing examines equity, student wellness, or social development. That's a lot of open territory.
But "the research is incomplete" doesn't mean "wait and do nothing." It means be deliberate. Here's what that looks like in practice:
Test transfer, not just performance. If your students are using an AI writing tool, don't only grade the AI-assisted draft. Give them a short in-class writing task without AI, or a quick oral explanation of their argument. If understanding is real, it should show up there too. If it doesn't, you've learned something important.
Favor tools that guide over tools that answer. When evaluating any AI tool, whether for yourself or your students, notice whether it's designed to replace thinking or scaffold it. Does it ask follow-up questions? Does it offer hints before solutions? Does it require students to do something with its output, or just accept it? The research suggests that tools built to scaffold thinking show more promising results than tools that simply hand over answers.
Pilot small and watch closely. The Stanford team recommends trying AI tools for specific, bounded purposes — lesson preparation, revision feedback, structured practice — and then watching what happens to student performance when the tool isn't present. That doesn't require a research design. It just requires paying attention.
One of the more honest things the Stanford report acknowledges is that education leaders are currently being asked to make high-stakes decisions about AI without a strong evidence base to stand on. That's a real and uncomfortable position.
But teachers are actually well-equipped for this moment. We've always had to make judgment calls in the gap between research and practice. We know how to watch students closely, ask probing questions, and notice when performance is surface-level versus when it reflects genuine understanding. Those are exactly the skills this moment calls for.
The evidence isn't there yet to tell us definitively which AI tools are worth your time. But the evidence is clear enough to tell us what questions to ask. And asking better questions — about transfer, about scaffolding, about what happens when the tool is off — is something every teacher can start doing tomorrow.
This is part of Teaching in the Age of AI, a weekly digest of research and ideas for educators navigating AI in the classroom. Share this post with a colleague who's wrestling with the same questions — and subscribe to get next week's in your inbox.
Source: Stanford SCALE Initiative, "The Evidence Base on AI in K-12: A 2026 Review"
