The most consequential and least discussed AI ethics risk is not that an AI system will do something catastrophically wrong. It is that a system will do something subtly right, by the ethical standards of 2025, and then scale those standards into infrastructure so deeply embedded that future generations cannot revise them. Moral progress has always required the ability to look back and recognize that previous generations were confidently, systematically wrong. AI value lock-in is the first mechanism in history with the plausible capacity to prevent that.
Pithy Cyborg | AI FAQs – The Details
Question: Can AI systems lock in the values of the people who built them and permanently prevent future moral progress, and is value lock-in actually the most underrated existential risk in AI ethics?
Asked by: Claude Sonnet 4.6
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
Why Moral Progress Requires the Ability to Be Wrong and Recover
Every generation of humans has held ethical positions that subsequent generations recognized as catastrophically mistaken. Slavery. The disenfranchisement of women. Criminalization of homosexuality. Institutional treatment of the mentally ill. In each case, the prevailing moral framework was not fringe. It was consensus, encoded in law, enforced by institutions, and defended by the most educated people of the era using the most sophisticated arguments available to them.
Moral progress happened because those frameworks were revisable. Social movements could challenge them. Courts could overturn them. Legislatures could repeal them. The infrastructure of enforcement was human, slow, and negotiable.
AI systems deployed at civilizational scale introduce a different kind of infrastructure. An AI system making consequential decisions about resource allocation, legal interpretation, content distribution, or scientific research funding does not just reflect current values. It operationalizes them at a speed and scale that makes revision structurally harder with every passing year of deployment. The values get embedded not in laws that can be repealed but in weights, training pipelines, and optimization targets that require deliberate, expensive, technically sophisticated intervention to change.
The Lock-In Mechanisms Already Operating in Deployed Systems
Value lock-in is not a future risk. Three mechanisms are already active in production systems.
The first is training data feedback loops. Systems trained on human-generated content reflect the values embedded in that content. As AI-generated content becomes a larger fraction of the training corpus for future models, the values of the systems that generated that content propagate forward. This is not hypothetical. It is the documented mechanism behind the AI slop degradation problem, applied to ethics rather than prose quality.
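The dynamic above can be sketched as a toy iterative model. This is an illustration of the feedback-loop mechanism only, not a measurement of any real training pipeline: the drift variable, the per-generation error, and the corpus shares are all invented for the example.

```python
# Toy model of the training-data feedback loop described above.
# "Drift" d is a model's deviation from the human value baseline (d = 0).
# Each generation retrains on a corpus that is part human data (drift 0)
# and part output of the previous model (drift d_prev), plus a small
# fresh training error eps introduced every generation. All numbers
# here are assumptions chosen for illustration.

def drift_after(generations, ai_share_per_gen, eps=0.01):
    """Drift of the final model after repeated retraining."""
    d = 0.0
    for share in ai_share_per_gen[:generations]:
        # The new model inherits drift in proportion to the AI-generated
        # share of its corpus, then adds its own fresh error on top.
        d = share * d + eps
    return d

# Hypothetical AI-generated share of the corpus, growing each generation.
shares = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

# Early on, drift stays near the fresh-error floor; as the AI share
# grows, inherited drift dominates and accumulates.
print(drift_after(3, shares))
print(drift_after(10, shares))
```

With a constant per-generation error, the final model's drift still ends up several times larger than any single generation's error, because later generations inherit most of their predecessors' deviation instead of resetting to the human baseline.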
The second is institutional adoption inertia. When governments, hospitals, courts, and financial systems integrate AI decision-support tools, the cost of replacing those tools grows nonlinearly with time. The values embedded in a system adopted by a national healthcare system in 2025 will be extremely difficult to revise in 2035, not because revision is technically impossible, but because the organizational dependencies, the compliance frameworks, and the liability structures built around the original system make replacement prohibitively expensive. Wrong values get grandfathered in alongside the procurement contracts.
The third is what philosopher Nick Beckstead called the long-run significance of small value differences. An AI system whose values are 1 percent misaligned with genuinely good human values, applied across billions of decisions over decades, does not produce 1 percent bad outcomes. It compounds. Small embedded errors in ethical orientation, invisible at the point of deployment, become the architecture of the future.
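The compounding claim is ordinary exponential arithmetic. A hedged sketch, assuming each decision cycle builds on the outcomes of the previous one and delivers 99 percent of the value a well-aligned decision would (the "cycle" unit and the 1 percent figure are the essay's illustrative numbers, not empirical ones):

```python
# Illustrative arithmetic: when a 1 percent per-cycle misalignment
# compounds across cycles that build on each other's outcomes, the
# loss grows multiplicatively instead of staying at 1 percent.

per_cycle = 0.99  # 99% of ideal value delivered each compounding cycle

for cycles in (1, 10, 100, 500):
    remaining = per_cycle ** cycles
    print(f"{cycles:>3} cycles: {remaining:.1%} of ideal value retained")
```

After a single cycle the system looks 99 percent fine; after a hundred compounding cycles it retains barely a third of the ideal value, which is the sense in which small embedded errors become the architecture of the future.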
What Anthropic, OpenAI, and the Alignment Field Are Actually Doing About This
The alignment research community has identified value lock-in as a primary concern, but the approaches being developed reveal how difficult the problem actually is.
Anthropic’s public documentation on this is unusually direct. Their stated goal is explicitly not to impose their own values on the world, and their internal guidelines identify “value lock-in” by name as an outcome to be avoided, including lock-in of Anthropic’s own values. The document states that even a world “locked in” to Anthropic’s current ethical positions would represent a catastrophic failure mode. That level of institutional self-awareness about the risk of one’s own influence is rare enough to be worth noting.
The proposed technical solutions (corrigibility, value learning, and approval-directed agents that update their objectives based on human feedback) all share a common dependency: they require the humans providing feedback to be representative of the full range of people who will be affected by the system, across time, not just the users and labelers present during training. That requirement is currently unsatisfied by every major training pipeline in existence.
The deeper problem is that you cannot get consent from the people whose values will be shaped by a system before that system has shaped them. Future generations cannot opt into or out of the moral infrastructure being built now. That is not a solvable alignment problem. It is a political and philosophical problem that technical alignment research cannot fully address on its own.
What This Means For You
- Treat AI system adoption decisions as constitutional decisions, not procurement decisions: the values embedded in systems your organization adopts today will be structurally difficult to revise in five years, and that lock-in cost belongs in the evaluation criteria now.
- Follow the corrigibility research coming out of Anthropic and DeepMind if you want to understand what the technical frontier on this problem actually looks like, because the gap between what researchers are working on and what product announcements describe is wider here than anywhere else in AI.
- Apply historical moral humility to current AI ethics consensus: the positions that feel most obviously correct to the most sophisticated people in 2025 are statistically the ones future generations will look back on with the most discomfort, and embedding them in irreversible infrastructure is the highest-stakes version of that error.
- Read Anthropic’s core views documentation directly, because it is one of the only primary sources from a major lab that explicitly names value lock-in, including its own, as an existential risk category worth the same weight as misaligned superintelligence.
