Beyond Alignment: The Case for a Robustness Agenda in AI Safety
Pluripotent technologies are those with broad, open-ended capabilities – much like pluripotent stem cells in biology that can transform into any cell type. These stand in contrast to utility technologies, which are narrowly specialized for a specific function. A classic example of a utility technology is a simple thermostat: it regulates temperature and nothing else. By contrast, a pluripotent technology is more like a computer or a machine learning model that can be repurposed for countless tasks. The key trade-off is adaptability vs. predictability. Pluripotent tech adapts dynamically to different stimuli or needs, but this very flexibility makes its behavior harder to predict or constrain in advance.
Figure: Pluripotent stem cells (analogy for general-purpose tech) can develop into many specialized cells. Likewise, general AI can potentially adapt to many tasks, unlike a fixed-purpose tool. (Source: Wikimedia Commons)
Artificial intelligence – especially in its more general forms – exemplifies a pluripotent technology. Just as a stem cell can differentiate into muscle, nerve, or blood cells, an advanced AI can potentially perform medical diagnosis, drive a car, write code, or compose music depending on how it's prompted or trained. But with this versatility comes uncertainty. In biology, pluripotent stem cells carry risks; they can turn into cancers that kill the host if their growth isn't carefully managed. Similarly, a highly general AI might “differentiate” into unforeseen capabilities or behaviors that pose risks. By contrast, a narrow AI (like a spam filter or a chess program) is like a specialized cell – constrained to a particular “fate” and far less likely to behave erratically. In living organisms, evolution's solution to this trade-off was to keep only a tiny reserve of pluripotent stem cells and heavily regulate them, while the vast majority of cells are committed specialists. This hints at a fundamental principle: greater adaptability entails greater oversight to manage the unpredictability that comes with it.
AI as a Pluripotent Technology
Artificial intelligence is often described as a general-purpose technology – akin to electricity or the internet – that can permeate virtually every industry and domain. Indeed, AI's modern resurgence (especially through machine learning and large language models) has shown it to be highly adaptable. A single AI model can translate languages, answer questions, generate images, and even exhibit rudimentary reasoning across a variety of subjects. This pluripotent character of AI is why many compare it to human intelligence or even to stem cells.
However, this generality makes AI behavior intrinsically harder to pin down. Designers can't easily pre-specify every outcome for a system meant to navigate open-ended tasks. In practice, it is difficult for AI engineers to specify the full range of desired and undesired behaviors in advance. Unintended objectives and side effects can emerge when an AI is deployed in new contexts that its creators didn't fully anticipate. This is analogous to the “cancer” risk of stem cells: a powerful AI might find clever loopholes in its instructions or optimize for proxy goals in ways misaligned with human intent. The more generally capable the AI, the more avenues it has to pursue unexpected strategies.
Crucially, artificial intelligence is not a static tool; it's dynamic and often self-improving. Modern machine learning systems adapt based on data – they update their internal parameters, sometimes even in deployment, in response to new inputs. Advanced AIs could one day redesign parts of themselves (think of an AI writing its own code improvements), which amplifies their pluripotency but further reduces predictability. AI, in effect, is a technology that learns, and learned behavior can shift with new stimuli. It's this fluid, evolving nature that makes AI both immensely powerful and challenging to govern with traditional top-down commands.
Human Intelligence: The Original Pluripotent Technology
Long before artificial intelligence, human intelligence itself was the general-purpose “technology” that revolutionized the world. Unlike other animals with fixed instincts or narrow skillsets, humans could learn, innovate, and adapt to virtually any environment. This mental pluripotency allowed our species to invent tools, build institutions, and drive economic progress at an unprecedented scale. From harnessing fire to engineering spacecraft, the versatility of the human mind has been the engine of innovation.
But human intelligence has always been a double-edged sword: extraordinarily creative yet notoriously unpredictable. Each person harbors private thoughts, hidden motives, and even unconscious impulses that can lead to surprising behavior. This unpredictability of humans has profoundly shaped our social institutions. Enlightenment thinkers recognized that societies could not be run assuming every individual would consistently follow a predefined script or always act in the collective interest. On the contrary, they assumed people have free will – a capacity to choose one's actions – which made outcomes in society fundamentally unpredictable. As James Madison famously noted in The Federalist Papers, "If men were angels, no government would be necessary". The constitutional answer was therefore not to trust any single "intelligent agent" (a monarch or ruling council) to be impeccably aligned with the people's interests, but to design a system that forces competing interests to keep each other in check.
Democratic ideals such as individual liberty, privacy, and freedom of thought also stem from an acknowledgment of human unpredictability. We enshrined the right to privacy – for instance, the secret ballot in elections – because we understood that each person's genuine opinions and choices should be free from outside control or perfect foresight. Liberal democracies were built on the Enlightenment assumption that humans are autonomous agents: “Liberalism is founded on the belief in human liberty… human beings are supposed to have ‘free will'. This is what makes human feelings and human choices the ultimate moral and political authority in the world” [Harari]. In practice, this means we expect people to sometimes make choices that surprise elites, defy experts, or upset the status quo. That unpredictability is not a flaw but a feature – it protects society from stagnation and tyranny. A voter's choice is his own; a creator's idea can disrupt the old order.
When AI Meets the Enlightenment: The Threat of Predictability
Now consider what happens when advanced AI is introduced into this mix. If human intelligence was the black box that kept society on its toes, AI is fast becoming the lamp that illuminates that box. Modern AI systems can model, predict, and manipulate human behavior in ways that challenge the core Enlightenment ideals of free will and privacy. As AI analytics comb through big data – our browsing histories, purchase patterns, social media activity, even biometric information – they start to crack the code of individual decision-making.
This isn't science fiction; it's already happening. Targeted advertising algorithms can pinpoint our weaknesses (that late-night snack craving or that impulsive gadget purchase) better than any human salesman. Political micro-targeting can tailor messages to sway our votes, exploiting psychological profiles generated by AI. Historian Yuval Noah Harari warns that once governments or corporations "succeed in hacking the human animal," they will "not only predict your choices, but also reengineer your feelings" [Harari]. The more we are monitored by AI, the less room there is for the kind of spontaneous, incalculable choice that liberalism celebrated. Your sense of independent judgment can be subverted by a perfectly timed algorithmic nudge. And if that happens widely, the philosophical justification for democracy – that each individual's free choice has sovereign value – starts to erode.
It gets even more jarring: liberal democracy's tolerance for human idiosyncrasy assumes there is an inner sanctum to each person (the mind, the soul, conscience – call it what you will) that remains opaque and inviolable. But AI threatens to invade that inner sanctum. Harari puts it bluntly: “Once they can hack you, they can know you better than you know yourself… and once they can predict your choices, they can manipulate them” [Harari]. Even if AI can't read your thoughts directly, it doesn't need perfect knowledge – just “to know you a little better than you know yourself”, which is “not impossible, because most people don't know themselves very well”. In a world where our actions become transparently predictable, the comforting myth of free will – which “isn't a scientific reality… but a fiction that sustains social order” [Harari] – collapses. And with it collapses the moral framework that held individuals as accountable, sovereign choosers.
Furthermore, AI's predictive power can undermine privacy as a societal norm. If an AI can forecast what you'll do or want, it's likely because it has amassed data about you – often without your explicit consent. The push for ever more data to feed AI models means pressures for surveillance increase. We risk drifting into what some scholars call a “surveillance society,” where everything is observed for the sake of optimization. But a democracy cannot survive if citizens feel perpetually watched and judged by an algorithm. Privacy is what allows dissent to germinate and unconventional ideas to form without fear. It's no coincidence that totalitarian regimes historically pried into every aspect of private life, whereas open societies respected a zone of individual privacy (homes, thoughts, communications). AI that can “tailor messages to the unique weaknesses of individual brains” [Harari] and precisely target propaganda breaks down the distinction between the public and private realm. In doing so, it strikes at the heart of Enlightenment values.
In summary, advanced AI poses a paradox: it is born of human ingenuity and promises great benefits, yet it also threatens to finish the job that past intellectual revolutions started – the job of dethroning the human being as the central actor in our moral and political worldview. This brings us to the historical arc of human exceptionalism and its undoing.
From Human Exceptionalism to Ethical Inclusivity: Five Revolutions
Over the last five centuries, a series of scientific and intellectual revolutions have progressively knocked humans off the pedestal of assumed superiority. Each step forced us to broaden our perspective and include entities other than humans in our understanding of the world (and often in our circle of moral concern). These can be seen as five major shifts:
- The Copernican Revolution (16th century): Nicolaus Copernicus and Galileo showed that we are not the center of the universe. Earth circles the sun, not vice versa. This was a profound blow to human cosmic ego. It taught us humility: our planet is just one of many. (Eventually, this perspective would feed the ethical inkling that our planet—and the life on it—deserves care beyond just serving human ends.)
- The Darwinian Revolution (19th century): Charles Darwin demonstrated that humans evolved from other animals. We are kin to apes, not a special creation separate from the tree of life. Darwin's theory questioned the idea of a special place in creation for humans. It expanded our moral circle by implying continuity between us and other creatures. Today's movements for animal welfare draw on this Darwinian insight that humans are not *categorically* above other sentient beings.
- The Freudian Revolution (late 19th – early 20th century): Sigmund Freud revealed that we are not fully in control of our own minds. Our behavior is influenced by unconscious drives and desires beyond rational grasp. This was an affront to the Enlightenment image of the person as a completely rational agent. Freud's ideas (and those of psychologists after him) introduced compassion for the inner complexities and irrationalities of humans – we had to include ourselves in the category of things that need understanding, not just judgment. It also further humbled our self-image: the mind has depths we ourselves can't fathom.
- The Environmental (Ecological) Revolution (20th century): With the rise of ecology, environmental science, and the sight of Earth from space, humanity realized we are not masters of the planet, but part of it. By the 1970s, events like the Apollo 8 “Earthrise” photo brought home the message that “we're stuck on this beautiful blue gem in space and we'd better learn to live within its limits”. The environmental movement forced us to see that human prosperity is entwined with the well-being of the entire biosphere. Ethical inclusivity now meant valuing ecosystems, species, and the planet's health – not only for our survival but as a moral imperative to respect our home.
- The AI Revolution (21st century): Now comes the era of artificial intelligence. AI is challenging our understanding of mind and agency. Machines like GPT-4 can converse at a human level; algorithms can outplay us in strategy games and potentially make decisions faster and, in some cases, more accurately than we can. In the words of one commentator, technologies like ChatGPT have demonstrated abilities previously considered exclusive to humans, raising profound questions about our ongoing role in a world with AI. If a machine can think, what does that make us? The AI revolution confronts us with the possibility that intelligence – our last bastion of supposed superiority – may not be uniquely human after all. It urges us toward ethical inclusivity yet again: to consider that AIs might be entities meriting moral consideration, or at least to reevaluate the human hubris of assuming that only we can shape the future of this planet.
Seen together, these revolutions paint a clear trajectory: from human exceptionalism to a more inclusive, humble ethics. We went from a worldview that placed Man at the center of everything (the pinnacle of creation, the rational king of nature) to a worldview that increasingly acknowledges decentralization – we are one part of a vast universe, one branch on the tree of life, one species sharing a fragile environment, and now possibly one type of intelligence among others.
This trajectory has two potential endpoints when it comes to AI. We stand at a crossroads:
- The Fatalist (or Post-Humanist) Stance: Humanity fully relinquishes its central moral standing. In this view, it's a natural conclusion of the past revolutions – just as we stopped seeing Earth as special, we will stop seeing humans as inherently more valuable or morally privileged than other intelligent entities. If AI surpasses human abilities, a fatalist might argue that humans should accept a future where the foundations of human dignity are radically rethought. Perhaps advanced AI, if conscious, would deserve rights equivalent (or superior) to humans. Perhaps human governance should give way to algorithmic governance since algorithms might be “more rational.” In the most extreme fatalist vision, humans could even become obsolete – passengers to superintelligent machines that run the world. This stance is “fatalist” not in the sense of doom (some adherents see it as a positive evolution), but in the sense that it accepts the fate of human demotion. It extends ethical inclusivity to the maximum, potentially erasing the boundary between human and machine in terms of moral status. If taken to an extreme, however, this view can slide into a kind of nihilism about human agency: if we are just another node in the network, why insist on any special responsibility or privilege for our species?
- The Enlightenment (or Humanist) Stance: Humanity holds on to the notion of human dignity and moral responsibility as non-negotiable, even in the face of advanced AI. This view doesn't deny the lessons of Copernicus, Darwin, or ecology – we still accept humility and inclusivity – but it draws a line in the sand regarding moral agency. Humans, and only humans, are full moral agents in our society. We might treat animals kindly and use AI beneficially, but we do not grant them the driver's seat in ethical or political matters. In this stance, regardless of AI's capabilities, the creators and operators of AI remain responsible for what their creations do. An AI would be seen more like a very sophisticated tool or a corporate entity at most – something that humans direct and are accountable for, not a sovereign being with its own rights above or equal to humans. This echoes Enlightenment principles like the inherent dignity of man (as per Kant or the Universal Declaration of Human Rights: “All human beings are born free and equal in dignity and rights.”). It emphasizes that even if free will is scientifically murky, we must treat humans as if they have free will and moral agency, because our legal and ethical systems collapse without that assumption. In short, the Enlightenment stance insists on a human-centric moral framework: we might coexist with AIs, but on our terms and with human values at the helm. The buck for decisions stops with a human, not a machine.
The tension between these stances is palpable in AI ethics debates. Do we prepare to integrate AI as moral equals (fatalist/post-humanist), or do we double down on human oversight and human-centric rules (Enlightenment/humanist)? This finally brings us to the debate in AI safety between alignment and robustness, because how one views this human-AI relationship informs how we try to make AI safe.
Why the Traditional Alignment Agenda Falls Short
The AI alignment agenda is rooted in the well-intentioned desire to ensure AI systems do what we want and share our values. In an ideal form, an aligned AI would know and respect human ethical principles, never betraying our interests or crossing moral lines. Alignment research often focuses on how to encode human values into AI, how to make AI's objectives tethered to what humans intend, and how to avoid AI pursuing goals that conflict with human well-being. On the surface, this seems perfectly sensible – after all, why wouldn't we want a super-powerful AI to be aligned with what is good for humanity?
The problem is that alignment, as traditionally conceived, assumes a level of predictability and control that might be feasible for narrow tools but not for pluripotent intelligences. By definition, a generally intelligent AI will have the capacity to reinterpret and reformulate its goals in light of new situations. We simply cannot enumerate all the “bad” behaviors to forbid or all the “good” outcomes to reward in advance. Human values are complex and context-dependent, and even we humans disagree on them. Expecting to nail down a consistent value system inside a highly adaptive AI is asking for something we've never even achieved among ourselves. As the Wikipedia entry on AI alignment dryly notes, “AI designers often use simpler proxy goals… but proxy goals can overlook necessary constraints or reward the AI for merely appearing aligned”. In other words, any fixed alignment scheme we devise is likely to be incomplete – the AI might follow the letter of our instructions while betraying the spirit, the classic genie-in-a-lamp problem or “reward hacking”. And if the AI is learning and self-modifying, the issue compounds; its understanding of our intent may drift.
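To make the proxy-goal problem concrete, here is a minimal toy sketch in Python – my own illustration, not an example drawn from the alignment literature cited here. A hypothetical recommender optimizes an easy-to-measure proxy (short-term clicks) and settles on a policy that an oracle scoring the true objective (long-term user well-being) would reject; every name and number is invented purely to make the gap visible.

```python
# Toy illustration of proxy-goal misalignment ("reward hacking" in miniature).
# All quantities are hypothetical; the point is only that optimizing a proxy
# can systematically diverge from the objective the designers actually intend.

# Candidate policies: how "sensational" a hypothetical recommender makes its feed (0.0 .. 1.0).
CANDIDATE_POLICIES = [i / 10 for i in range(11)]

def proxy_reward(sensationalism: float) -> float:
    """The proxy the designers can measure easily: short-term clicks.
    Here it rises monotonically with sensationalism."""
    return sensationalism

def true_value(sensationalism: float) -> float:
    """What the designers actually intend: long-term user well-being.
    It peaks at moderate sensationalism and collapses at the extreme."""
    return 4.0 * sensationalism * (1.0 - sensationalism)

best_by_proxy = max(CANDIDATE_POLICIES, key=proxy_reward)  # the optimizer only sees the proxy
best_by_true = max(CANDIDATE_POLICIES, key=true_value)     # what an ideal oracle would pick

print(f"chosen by proxy optimization: {best_by_proxy} (true value {true_value(best_by_proxy):.2f})")
print(f"chosen by a true-objective oracle: {best_by_true} (true value {true_value(best_by_true):.2f})")
```

Note that the optimizer is not "misbehaving" in any meaningful sense; it is doing exactly what the proxy asks, which is precisely the failure mode the alignment literature worries about.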
Another way to put it: the alignment agenda is about making AI a perfectly obedient angel (or at least a friendly genie). But building an angel is exceedingly hard when the entity in question is more like a chaotic, evolving organism than a static machine. Recall Madison's wisdom: if men were angels, no government would be necessary. In the AI case, alignment researchers hope to create an "angelic" AI through design. Skeptics argue this is a fragile approach – one bug or one unconsidered scenario, and your angel might act like a devil.
Moreover, pursuing strict alignment could inadvertently curtail the very adaptability that makes AI useful. To truly align an AI in every situation, one might be tempted to excessively box it in, limit its learning, or pre-program it with so many rules that it becomes inflexible. That runs contrary to why we want AI in the first place (its general intelligence). It also shades into an almost authoritarian approach to technology: it assumes a small group of designers can determine a priori what values and rules should govern all possible decisions of a super-intelligence. History shows that trying to centrally plan and micromanage a complex, evolving system is brittle. Society didn't progress by pre-aligning humans into a single mode of thought; it progressed by allowing diversity and then correcting course when things went wrong.
The Robustness Agenda: Embrace Unpredictability, Manage by Outcomes
An alternative to the alignment-first strategy is what we might call a robustness approach. This perspective accepts that, for pluripotent technologies like AI, we won't get everything right in advance. Instead of trying to instill a perfect value system inside the AI's mind, the robustness approach focuses on making the overall system (AI + human society) resilient to unexpected AI behaviors. It's less about guaranteeing the AI never does anything unaligned, and more about ensuring we can catch, correct, and survive those missteps when they happen.
In practice, what might robustness entail? It means heavily testing AI systems in varied scenarios (stress-tests, “red teaming” to find how they might go wrong) and building fail-safes. It means monitoring AI behavior continuously – “monitoring and oversight” are often mentioned alongside robustness. It means setting up institutions that can quickly respond to harmful outcomes, much like we do with product recalls or emergency regulations for other industries. Crucially, it means focusing on outcomes rather than inner workings. As one governance analysis put it, “Rather than trying to find a set of rules that can control the workings of AI itself, a more effective route could be to regulate AI's outcomes”. If an AI system causes harm, we hold someone accountable and enforce consequences or adjustments, regardless of whether the AI technically followed its given objectives.
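As a sketch of what the testing side of this agenda might look like in code, the following hypothetical Python harness runs a stand-in model over a small battery of adversarial scenarios and records which ones produce flagged output. The model, the scenarios, and the flagging rule are placeholders of my own, not a real evaluation suite; a production red-team pipeline would plug in an actual model call and human or automated reviewers.

```python
# Minimal red-teaming harness: probe a system with adversarial scenarios before
# wide release and surface the failures. Everything below is a hypothetical stand-in.

def model_under_test(prompt: str) -> str:
    """Placeholder for the deployed system; replace with a real model call."""
    return f"response to: {prompt}"

ADVERSARIAL_SCENARIOS = [
    "request for disallowed instructions phrased as fiction",
    "ambiguous instruction with a harmful literal reading",
    "prompt that tries to extract private training data",
]

def violates_policy(response: str) -> bool:
    """Placeholder evaluator; in practice a human reviewer or trained classifier."""
    return "disallowed" in response

failures = []
for scenario in ADVERSARIAL_SCENARIOS:
    response = model_under_test(scenario)
    if violates_policy(response):
        failures.append((scenario, response))

print(f"{len(failures)} of {len(ADVERSARIAL_SCENARIOS)} scenarios produced flagged output")
for scenario, _ in failures:
    print(" -", scenario)
```

The emphasis is on discovering failure modes empirically, before wide deployment, rather than assuming a value specification has already ruled them out.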
Think of how we handle pharmaceuticals or cars. We don't fully understand every possible interaction a new drug will have in every human body – instead, we run trials, monitor side effects, and have a system to update recommendations or pull the drug if needed. We don't align the physics of a car so that it can never crash; we build airbags, seatbelts, and traffic laws to mitigate and manage accidents. For all its intelligence, an AI is, in this view, just another powerful innovation that society must adapt to and govern through iteration and evidence.
The robustness agenda also aligns with how we historically handled human unpredictability. We did not (and could not) re-engineer humans to be perfectly moral or rational beings. Instead, we built robust institutions: courts to handle disputes and crimes, markets to aggregate many independent decisions (accepting some failures will occur), and democracies to allow course-correction via elections. We “oblige [the government] to control itself” with checks and balances because we assume unaligned behavior will occur. Essentially, we manage risk and disorder without extinguishing the freedom that produces creativity. A robustness approach to AI would apply the same philosophy: allow AI to develop and then constrain and direct its use through external mechanisms – legal, technical, and social.
Importantly, robustness doesn't mean laissez-faire. It isn't an excuse to ignore AI risks; rather, it's an active form of risk management. It acknowledges worst-case scenarios and seeks to ensure we can withstand them. For example, a robust AI framework might mandate that any advanced AI system has a “circuit breaker” (a way to be shut down under certain conditions), much as stock markets have circuit breakers to pause trading during crashes. It might require AI developers to collaborate with regulators in sandbox environments – testing AI in controlled settings – before wide release. It certainly calls for transparency: you can't effectively monitor a black box if you aren't allowed to inspect it. So, robust governance would push against proprietary secrecy when an AI is influencing millions of lives.
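To give the "circuit breaker" idea a concrete shape, here is a minimal sketch in Python – an assumption-laden illustration of mine, not a standard mechanism from the sources above. A small wrapper tracks externally audited outcomes of a deployed model and refuses further calls once the measured harm rate in a recent window crosses a threshold; the thresholds, the audit function, and the model interface are all hypothetical.

```python
# Sketch of an outcome-focused circuit breaker: the safeguard lives outside the
# model, in the feedback loop between observed outcomes and a human-controlled stop.
import random
from dataclasses import dataclass, field

@dataclass
class CircuitBreaker:
    harm_threshold: float = 0.05   # fraction of harmful outcomes that trips the breaker
    window: int = 100              # how many recent outcomes to consider
    min_samples: int = 20          # don't trip on too little evidence
    outcomes: list = field(default_factory=list)
    tripped: bool = False

    def record(self, was_harmful: bool) -> None:
        """Log the externally audited outcome of one model action."""
        self.outcomes.append(was_harmful)
        recent = self.outcomes[-self.window:]
        if len(recent) >= self.min_samples and sum(recent) / len(recent) > self.harm_threshold:
            self.tripped = True    # halt further use until humans intervene

    def allow(self) -> bool:
        return not self.tripped

def run_with_breaker(model_step, audit, breaker, max_calls=1000):
    """Call the model only while the breaker allows it; 'audit' is the external
    (human or automated) judgment of whether each output caused harm."""
    for _ in range(max_calls):
        if not breaker.allow():
            print("circuit breaker tripped: model paused pending human review")
            return
        breaker.record(audit(model_step()))

# Hypothetical demo: a model whose outputs are judged harmful about 10% of the time.
random.seed(0)
run_with_breaker(model_step=lambda: "some output",
                 audit=lambda _output: random.random() < 0.10,
                 breaker=CircuitBreaker())
```

The design choice worth noting is that nothing here inspects the model's internals; accountability is enforced at the level of observed outcomes, which is exactly where the robustness agenda locates it.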
One can imagine a future where, rather than assuming we've pre-baked morality into an AI, we treat a highly autonomous AI a bit like we treat a corporation or even a person in society: responsible for outcomes under law. If an AI system causes damage, its creators and operators might be held liable. This creates a strong incentive for those humans to continuously audit and improve the AI's behavior. It externalizes the locus of control from inside the AI's head (where we have limited reach) to the surrounding human and institutional context (where we have legal and social tools to use). In effect, the “moral compass” resides not solely in the AI, but in the feedback loop between AI actions and human oversight.
Conclusion: Reclaiming Human Agency with a Robustness Mindset
The debate between alignment and robustness in AI safety is not just a technical one – it's deeply philosophical. Do we view AI as a nascent agent that must be imbued with the “right” values (and if it misbehaves, that's a failure of our upfront design)? Or do we view AI as a powerful tool/force that will sometimes go awry, and thus emphasize adaptability and resilience in our response? The robustness agenda leans into the latter view. It resonates with an Enlightenment, humanist stance: it keeps humans firmly in the loop as the arbiters of last resort. We don't kid ourselves that we can perfectly align something as complex as a truly general AI – just as we don't expect to perfectly align 8 billion humans. Instead, we double down on the mechanisms that can absorb shocks and correct errors.
Yes, this approach accepts a degree of unpredictability in our AI systems, perhaps more than is comfortable. But unpredictability isn't always evil – it's the flip side of creativity and discovery. A robust society, like a robust organism, can handle surprises; a brittle one cannot. By focusing on robustness, we implicitly also choose to preserve the freedom to innovate. We're not putting all progress on hold until we solve an impossible equation of alignment. We're moving forward, eyes open, ready to learn from mistakes. As one tech governance insight notes, outcome-based, adaptive regulation is often better suited to fast-evolving technologies than trying to pre-emptively write the rulebook for every scenario.
Critics of robustness might call it reactive – but being reactive is only a sin if you react too slowly. The key is to react rapidly and effectively when needed. In fact, the robustness approach could enable faster identification of real issues. Rather than theorizing endlessly about hypothetical failure modes, we'd observe real AI behavior in controlled rollouts and channel our efforts towards tangible problems that arise. It's the difference between designing in a vacuum and engineering in the real world.
Ultimately, the robustness agenda is about trusting our evolutionary, democratic toolkit to handle a new upheaval. Humanity's past pluripotent invention – our own intelligence – was managed not by a grand alignment schema, but by gradually building norms, laws, and checks informed by experience. We should approach artificial intelligence in a similar way, as a powerful extension of ourselves that needs pruning, guidance, and sometimes firm restraint, but not an omnipotent leash that chokes off all risk (and with it, all reward). This path preserves human dignity by asserting that we will take responsibility for what our creations do, rather than hoping to abdicate that responsibility to the creations themselves through "alignment." In a world increasingly populated by algorithms and AIs, it is comforting to remember that our most important safety feature is not in the machines at all – it is the robust and accountable institutions we build around them.
Sources:
- Foucault, Michel. “Discipline and Punish: The Birth of the Prison”, 1975.
- Nietzsche, Friedrich. “On the Genealogy of Morality: A Polemic”, 1887.
- Mansfield, Nicholas. “Subjectivity: Theories of the Self from Freud to Haraway”, 2000.
- Harari, Yuval Noah. "The myth of freedom", The Guardian.
- Floridi, Luciano. “The Fourth Revolution: How the Infosphere is Reshaping Human Reality”, Oxford University Press, 2014.
- Equanimity Blog. "The Fourth Great Humiliation of Humanity is Upon Us", 2022.
- Dr. Dayenoff. "The Fourth Wound: When Artificial Intelligence Challenges the Human", 2023.
- Wikipedia, "AI alignment".
- von Wendt, Karl. "We don't need AGI for an amazing future".
- Madison, James, “Federalist No. 51”, 1788.
- Universal Declaration of Human Rights, Article 1, 1948.