AI Meets the Classroom: When Do Large Language Models Harm Learning?
Matthias Lehmann, Philipp B. Cornelius & Fabian J. Sting (2025)
Why the Study Matters
Educators debate whether large language model (LLM) tools such as ChatGPT help or hinder real learning. Prior studies report mixed results, often ignoring how students actually use the AI. This paper asks: when do LLMs substitute for meaningful study, when do they complement it, and with what consequences?
Research Design at a Glance
- Two pre‑registered, incentivized lab experiments (coding tasks) compare students with and without GPT‑4 access.
- Field study tracks a university programming course during sudden campus‑wide LLM availability.
- Usage data (prompts, copy‑paste activity) allow the authors to classify substitutive vs. complementary behavior.
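The substitutive-vs-complementary split from usage logs can be sketched roughly as follows. Note that the log schema, thresholds, and classification rule here are illustrative assumptions for exposition, not the authors' actual coding procedure:

```python
from dataclasses import dataclass

@dataclass
class Session:
    """One student's LLM usage in a study session (hypothetical log schema)."""
    solution_prompts: int      # e.g. "write the full code for task 3"
    explanation_prompts: int   # e.g. "why does this loop terminate early?"
    chars_pasted_from_llm: int
    chars_typed_by_student: int

def classify(session: Session) -> str:
    """Label a session as 'substitutive' or 'complementary'.

    Illustrative rule: heavy copy-paste plus solution-seeking prompts
    suggests offloading the work (substitution); explanation-seeking
    prompts with mostly self-typed code suggests complementary use.
    """
    total_prompts = session.solution_prompts + session.explanation_prompts
    if total_prompts == 0:
        return "no-llm-use"
    solution_share = session.solution_prompts / total_prompts
    total_chars = session.chars_pasted_from_llm + session.chars_typed_by_student
    paste_share = session.chars_pasted_from_llm / total_chars if total_chars else 0.0
    if solution_share > 0.5 and paste_share > 0.5:
        return "substitutive"
    return "complementary"

# Mostly "give me the answer" prompts, answer pasted in wholesale:
print(classify(Session(4, 1, 900, 100)))   # → substitutive
# Mostly explanation requests, code typed by the student:
print(classify(Session(1, 6, 50, 1200)))   # → complementary
```

A rule like this is only as good as the logs behind it, which is why the copy-paste affordance itself matters: disabling paste removes the strongest behavioral signal of substitution, not just the temptation.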
Key Findings
| Theme | What Happens? |
|---|---|
| Average effect | Across the whole sample, LLM access does not change total learning gains. |
| Substitution | Students cover more topics but understand each one less. |
| Complementarity | Topic volume unchanged, depth of understanding rises. |
| Equity impact | LLMs widen the gap: students with lower prior knowledge learn less when allowed to rely on LLMs. |
| Copy‑paste affordance | When copy‑paste is enabled, students request “full solutions” far more often, fueling substitution and a longer‑term decline in learning. |
| Perceived vs. actual learning | Access inflates students’ sense of how much they’ve learned beyond measured gains. |
Practical Takeaways for Instructors
- Guide the usage mode. Frame LLMs explicitly as explainers, not answer‑generators.
- Limit shortcutting. Disable or restrict copy‑paste during formative work.
- Scaffold novices. Lower‑prepared students need structured prompts or human feedback to avoid superficial learning.
- Monitor metacognition. Pair AI support with reflective checks so students calibrate their self‑assessment.
Contributions to the Debate
- Clarifies why prior studies reached opposite conclusions: the behavioral pathway (substitute vs. complement) determines the outcome.
- Introduces a two‑dimensional view of learning—topic volume and topic understanding—as a lens for evaluating educational technology.
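One way to operationalize this two-dimensional lens is to score breadth and depth separately from per-topic assessments. The scoring scheme below is an illustrative assumption, not the paper's measurement instrument:

```python
def learning_profile(quiz_scores: dict[str, float]) -> tuple[int, float]:
    """Summarize learning along two dimensions.

    quiz_scores maps topic name -> score in [0, 1] (hypothetical data).
    Returns (topic_volume, mean_understanding): how many topics were
    attempted vs. how well each attempted topic is understood.
    """
    attempted = {topic: s for topic, s in quiz_scores.items() if s > 0}
    volume = len(attempted)
    understanding = sum(attempted.values()) / volume if volume else 0.0
    return volume, understanding

# Substitutive pattern: many topics covered, shallow grasp of each.
print(learning_profile({"loops": 0.4, "recursion": 0.3, "classes": 0.5,
                        "files": 0.4, "regex": 0.3}))
# Complementary pattern: fewer topics, deeper grasp of each.
print(learning_profile({"loops": 0.9, "recursion": 0.85}))
```

A single aggregate score would rate these two profiles as roughly equivalent, which is exactly why averaging over the whole sample can mask the opposing effects the paper documents.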
Limitations & Future Work
- Lab tasks focused on programming; effects may differ in concept‑driven disciplines.
- The field data captured only substitutive use; complementary scenarios still need real‑class validation.
- Future research should test interface nudges, prompt‑engineering lessons, and longer semesters to see if complementary use can close (rather than widen) equity gaps.
Bottom line: LLMs are neither panacea nor poison; they magnify whatever study habits students bring to them. Design learning environments that channel AI toward explanation and reflection, not quick fixes, to unlock their real educational value.