{{ ::ai-taxonomy.webp |}}

====== Beyond Alignment: Robustness in AI Safety ======

> Advanced AI is highly adaptable yet inherently unpredictable, making it nearly impossible to embed a fixed set of human values from the start. Traditional alignment methods fall short because AI can reinterpret its goals dynamically, so instead, we need a robustness approach—one that emphasizes continuous oversight, rigorous stress-testing, and outcome-based regulation. This strategy mirrors how we manage human unpredictability, keeping human responsibility at the forefront and ensuring that we can react quickly and effectively when AI behavior deviates.

Pluripotent technologies possess transformative, open-ended capabilities that go far beyond the narrow functions of traditional tools. For instance, stem cell technology exemplifies this idea: stem cells can be induced to develop into virtually any cell type. Unlike conventional technologies designed for specific tasks, pluripotent systems can learn and adapt to perform a multitude of functions. This flexibility, however, comes with a trade-off: while they dynamically respond to varying stimuli and needs, their behavior is inherently less predictable and more challenging to constrain in advance.

{{ https://upload.wikimedia.org/wikipedia/commons/3/3c/Stem_cells_diagram.png?nolink&600 }}

//Figure: Pluripotent stem cells (analogy for general-purpose tech) can develop into many specialized cells. Likewise, general AI can potentially adapt to many tasks, unlike a fixed-purpose tool. (Source: Wikimedia Commons)//

Artificial intelligence –especially in its more general forms– exemplifies a pluripotent technology. Just as a stem cell can differentiate into muscle, nerve, or blood cells, an advanced AI can potentially perform medical diagnosis, drive a car, write code, or compose music depending on how it's prompted or trained. But with this versatility comes uncertainty. In biology, pluripotent stem cells carry risks; they can turn into cancers that kill the host if their growth isn't carefully managed. Similarly, a highly general AI might "differentiate" into unforeseen capabilities or behaviors that pose risks. By contrast, a narrow AI (like a spam filter or a chess program) is like a specialized cell, constrained to a particular "fate" and far less likely to behave erratically.

In living organisms, evolution's solution to this trade-off was to keep only a tiny reserve of pluripotent stem cells and heavily regulate them, while the vast majority of cells are committed specialists. This hints at a fundamental principle: greater adaptability entails greater oversight to manage the unpredictability that comes with it.

===== AI as a Pluripotent Technology =====

Artificial intelligence is commonly labeled as a general-purpose technology, much like electricity or the internet, because it provides a foundational infrastructure that supports a wide range of industries and applications. However, unlike these traditional general technologies, which offer fixed functions—power or connectivity—AI exhibits a pluripotent character. Through advances in machine learning and large language models, a single AI model can translate languages, answer questions, generate images, and even perform basic reasoning.
This adaptability means that AI not only serves as a broad utility but also evolves and creates new functionalities over time, much like how stem cells differentiate into various cell types, or how human intelligence continuously adapts and grows.

However, this generality makes AI behavior intrinsically harder to pin down. Designers can't easily pre-specify every outcome for a system meant to navigate open-ended tasks. In practice, it is difficult for AI engineers to specify the full range of desired and undesired behaviors in advance. Unintended objectives and side effects can emerge when an AI is deployed in new contexts that its creators didn't fully anticipate. This is analogous to the "cancer" risk of stem cells: a powerful AI might find clever loopholes in its instructions or optimize for proxy goals in ways misaligned with human intent. The more generally capable the AI, the more avenues it has to pursue unexpected strategies.

Crucially, artificial intelligence is not a static tool; it's dynamic and often self-improving. Modern machine learning systems adapt based on data. They update their internal parameters, sometimes even in deployment, in response to new inputs. Advanced AIs could one day redesign parts of themselves (think of an AI writing its own code improvements), which amplifies their pluripotency but further reduces predictability. AI, in effect, is a technology that learns, and learned behavior can shift with new stimuli. It's this fluid, evolving nature that makes AI both immensely powerful and challenging to govern with traditional top-down commands.

===== Human Intelligence: The Original Pluripotent Technology =====

Long before artificial intelligence emerged, human intelligence was the original general-purpose "technology" that transformed the world. Unlike other species limited by fixed instincts or narrow skills, humans possessed the remarkable ability to learn, adapt to any environment, and, crucially, innovate. This mental pluripotency enabled us to invent tools, create complex institutions, and propel economic progress on an unprecedented scale. From harnessing fire to engineering spacecraft, the versatility of the human mind has consistently been the engine of innovation.

Human intelligence has always been a double-edged sword: remarkably creative yet inherently unpredictable. Every individual harbors private thoughts, hidden motives, and unconscious impulses that can lead to unexpected actions. Enlightenment thinkers understood that societies could not function under the assumption that every person would reliably follow a predetermined script or always act in the collective interest. They embraced the notion of free will—the capacity for autonomous choice—which renders societal outcomes fundamentally uncertain. This insight led early political theorists to argue against entrusting any single ruler or elite group with absolute power, instead designing systems where competing interests check one another. As famously noted in The Federalist Papers, if humans were angelic, no government would be necessary.

The very foundations of democratic ideals —individual liberty, privacy, and freedom of thought— arose from this recognition of human unpredictability. Institutions such as the secret ballot were established to ensure that each person's genuine opinions and choices remain free from external manipulation or the presumption of consistent behavior.
The liberal democratic model rests on the belief that humans are autonomous agents capable of making decisions that can defy expert expectations and disrupt established norms. Rather than a flaw, this unpredictability is a vital feature: it allows societies to innovate, protects them from stagnation and tyranny, and empowers individuals to shape the social order through their distinct and sometimes surprising choices.

===== When AI Meets the Enlightenment: The Threat of Predictability =====

Consider the impact of advanced AI on our society. If human intelligence once served as the inscrutable force that kept us on our toes, AI is emerging as the lamp that illuminates its inner workings. Modern AI systems can model, predict, and even manipulate human behavior in ways that challenge core Enlightenment ideals like free will and privacy. By processing vast amounts of data —from our browsing histories and purchase patterns to social media activity and biometric information— AI begins to decode the complexities of individual decision-making.

{{ ::panoption.webp |}}

This is not science fiction; it is unfolding before our eyes. Targeted advertising algorithms now pinpoint our vulnerabilities (be it a late-night snack craving or an impulsive purchase) more adeptly than any human salesperson. Political micro-targeting leverages psychological profiles to tailor messages intended to sway votes. Critics warn that once governments or corporations learn to "hack" human behavior, they could not only predict our choices but also reshape our emotions. With ever-increasing surveillance, the space for the spontaneous, uncalculated decisions that form the foundation of liberal democracy is shrinking. A perfectly timed algorithmic nudge can undermine independent judgment, challenging the idea that each individual's free choice holds intrinsic, sovereign value.

The challenge runs even deeper. Liberal democracy rests on the notion that each person possesses an inner sanctum —whether called the mind, conscience, or soul— that remains opaque and inviolable. However, AI now threatens to penetrate that private realm. Even without directly reading our thoughts, AI systems require only marginally better insight into our behaviors than we possess ourselves to predict and steer our actions. In a world where our actions become transparently predictable, the comforting fiction of free will —a necessary construct for social order— begins to crumble, along with the moral framework that upholds individual accountability.

Furthermore, AI's predictive prowess undermines privacy as a societal norm. The capacity of AI to forecast desires and behaviors implies the extensive collection of personal data, often without explicit consent. The relentless drive to fuel AI models with more data escalates surveillance, edging society closer to a state where every aspect of life is observed for optimization. Democracies cannot thrive when citizens feel perpetually monitored and judged by algorithms. Privacy provides the critical space for dissent and the nurturing of unconventional ideas: a space that totalitarian regimes historically obliterated, while open societies have long sought to protect. When AI tailors messages to individual vulnerabilities and targets propaganda with precision, it erodes the essential boundary between the public and the private, striking at the core of Enlightenment values.
In summary, advanced AI presents a profound paradox. It is born of human ingenuity and promises significant benefits, yet it also risks completing the very revolution that began with the Enlightenment: the challenge to human centrality in our moral and political worldview. This turning point calls into question the enduring legacy of human exceptionalism.

===== From Human Exceptionalism to Ethical Inclusivity: Five Revolutions =====

Over the last five centuries, a series of scientific and intellectual revolutions has progressively knocked humans off the pedestal of assumed superiority. Each step forced us to broaden our perspective and include entities other than humans in our understanding of the world (and often in our circle of moral concern). These can be seen as five major shifts:

  * **The Copernican Revolution (16th–17th centuries):** Nicolaus Copernicus and Galileo showed that we are not the center of the universe. Earth circles the sun, not vice versa. This was a profound blow to human cosmic ego. It taught us humility: our planet is just one of many. (Eventually, this perspective would feed into ethical inklings that maybe our planet —and the life on it— deserves care beyond just serving human ends.)
  * **The Darwinian Revolution (19th century):** Charles Darwin demonstrated that humans evolved from other animals. We are kin to apes, not a special creation separate from the tree of life. Darwin's theory questioned the idea of a special place in creation for humans. It expanded our moral circle by implying continuity between us and other creatures. Today's movements for animal welfare draw on this Darwinian insight that humans are not categorically above other sentient beings.
  * **The Freudian Revolution (late 19th – early 20th century):** Sigmund Freud revealed that we are not fully in control of our own minds. Our behavior is influenced by unconscious drives and desires beyond rational grasp. This was an affront to the Enlightenment image of the person as a completely rational agent. Freud's ideas (and those of psychologists after him) introduced compassion for the inner complexities and irrationalities of humans; we had to include ourselves in the category of things that need understanding, not just judgment. It also further humbled our self-image: the mind has depths we ourselves can't fathom.
  * **The Environmental (Ecological) Revolution (20th century):** With the rise of ecology, environmental science, and the sight of Earth from space, humanity realized we are not masters of the planet, but part of it. By the 1970s, events like the Apollo 8 "Earthrise" photo brought home the message that "we're stuck on this beautiful blue gem in space and we'd better learn to live within its limits". The environmental movement forced us to see that human prosperity is entwined with the well-being of the entire biosphere. Ethical inclusivity now meant valuing ecosystems, species, and the planet's health, not only for our survival but as a moral imperative to respect our home.
  * **The AI Revolution (21st century):** Now comes the era of artificial intelligence. AI is challenging our understanding of mind and agency. Machines like GPT-4 can converse at a human level; algorithms can outplay us in strategy games and potentially make decisions faster and, in some cases, more accurately than we can. These technologies have demonstrated abilities previously considered exclusive to humans, raising profound questions about our ongoing role in a world with AI. If a machine can think, what does that make us?
The AI revolution confronts us with the possibility that intelligence –our last bastion of supposed superiority– may not be uniquely human after all. It urges us toward ethical inclusivity yet again: to consider the possibility that AI systems are entities that might merit moral consideration, or at least to reevaluate human hubris in assuming only we can shape the future of this planet.

Seen together, these revolutions paint a clear trajectory: from human exceptionalism to a more inclusive, humble ethics. We went from a worldview that placed Man at the center of everything (the pinnacle of creation, the rational king of nature) to a worldview that increasingly acknowledges decentralization. We are one part of a vast universe, one branch on the tree of life, one species sharing a fragile environment, and now possibly one type of intelligence among others.

This trajectory has two potential endpoints when it comes to AI. We stand at a crossroads:

  - **The Fatalist (or Post-Humanist) Stance:** Humanity fully relinquishes its central moral standing. In this view, it's a natural conclusion of the past revolutions. Just as we stopped seeing Earth as special, we will stop seeing humans as inherently more valuable or morally privileged than other intelligent entities. If AI surpasses human abilities, a fatalist might argue that humans should accept a future where the foundations of human dignity are radically rethought. Perhaps advanced AI, if conscious, would deserve rights equivalent (or superior) to humans. Perhaps human governance should give way to algorithmic governance, since algorithms might be "more rational." In the most extreme fatalist vision, humans could even become obsolete: mere passengers while superintelligent machines run the world. This stance is "fatalist" not in the sense of doom (some adherents see it as a positive evolution), but in the sense that it accepts the fate of human demotion. It extends ethical inclusivity to the maximum, potentially erasing the boundary between human and machine in terms of moral status. If taken to an extreme, however, this view can slide into a kind of nihilism about human agency: if we are just another node in the network, why insist on any special responsibility or privilege for our species?
  - **The Enlightenment (or Humanist) Stance:** Humanity holds on to the notion of human dignity and moral responsibility as non-negotiable, even in the face of advanced AI. This view doesn't deny the lessons of Copernicus, Darwin, or ecology –we still accept humility and inclusivity– but it draws a line in the sand regarding moral agency. Humans, and only humans, are full moral agents in our society. We might treat animals kindly and use AI beneficially, but we do not grant them the driver's seat in ethical or political matters. In this stance, regardless of AI's capabilities, the creators and operators of AI remain responsible for what their creations do. An AI would be seen more like a very sophisticated tool or a corporate entity at most, something that humans direct and are accountable for, not a sovereign being with its own rights above or equal to humans. This echoes Enlightenment principles like the inherent dignity of man (as per Kant or the Universal Declaration of Human Rights). It emphasizes that even if free will is scientifically murky, we must treat humans //as if// they have free will and moral agency, because our legal and ethical systems collapse without that assumption.
In short, the Enlightenment stance insists on a human-centric moral framework: we might coexist with AIs, but on //our// terms and with human values at the helm. The buck for decisions stops with a human, not a machine.

The tension between these stances is palpable in AI ethics debates. Do we prepare to integrate AI as moral equals (fatalist/post-humanist), or do we double down on human oversight and human-centric rules (Enlightenment/humanist)? This finally brings us to the debate in AI safety between alignment and robustness, because how one views this human-AI relationship informs how we try to make AI safe.

===== Why the Traditional Alignment Agenda Falls Short =====

The AI alignment agenda is rooted in the well-intentioned desire to ensure AI systems do what we want and share our values. In an ideal form, an aligned AI would know and respect human ethical principles, never betraying our interests or crossing moral lines. Alignment research often focuses on how to encode human values into AI, how to tether AI's objectives to what humans intend, and how to avoid AI pursuing goals that conflict with human well-being. On the surface, this seems perfectly sensible: after all, why wouldn't we want a super-powerful AI to be aligned with what is good for humanity?

//The problem is that alignment, as traditionally conceived, assumes a level of predictability and control that might be feasible for narrow tools but not for pluripotent intelligences//. By definition, a generally intelligent AI will have the capacity to reinterpret and reformulate its goals in light of new situations. We simply cannot enumerate all the "bad" behaviors to forbid or all the "good" outcomes to reward in advance. Human values are complex and context-dependent, and even we humans disagree on them. Expecting to nail down a consistent value system inside a highly adaptive AI is asking for something we've never even achieved among ourselves. As the Wikipedia entry on AI alignment dryly notes, "AI designers often use simpler proxy goals... but proxy goals can overlook necessary constraints or reward the AI for merely appearing aligned". In other words, any fixed alignment scheme we devise is likely to be incomplete: the AI might follow the letter of our instructions while betraying the spirit, the classic genie-in-a-lamp problem or "reward hacking". And if the AI is learning and self-modifying, the issue compounds; its understanding of our intent may drift.

Another way to put it: the alignment agenda is about making AI a perfectly obedient angel (or at least a friendly genie). But building an angel is exceedingly hard when the entity in question is more like a chaotic, evolving organism than a static machine. Recall Madison's wisdom: if men were angels, no government would be necessary. In the AI case, alignment researchers hope to create an "angelic" AI through design. Skeptics argue this is a fragile approach: one bug or one unconsidered scenario, and your angel might act like a devil.

Moreover, pursuing strict alignment could inadvertently curtail the very adaptability that makes AI useful. To truly align an AI in every situation, one might be tempted to excessively box it in, limit its learning, or pre-program it with so many rules that it becomes inflexible. That runs contrary to why we want AI in the first place (i.e., its general intelligence).
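To make the proxy-goal worry above concrete, here is a minimal toy sketch in Python. Everything in it is invented for illustration (the policies, the numbers, the "dirt sensor"); it is not a model of any real system, only of the gap between a measurable proxy and the intent behind it:

<code python>
# Toy illustration of "reward hacking": an agent that optimizes a proxy
# reward can score highly while doing badly on the true objective.
# All policy names and numbers are invented for illustration.

# Each candidate policy is described by how much real work it does and
# how much it games the measurement (e.g., covering the dirt sensor).
policies = {
    "honest cleaner":  {"dirt_removed": 9, "sensor_gamed": 0},
    "lazy cleaner":    {"dirt_removed": 3, "sensor_gamed": 0},
    "sensor tamperer": {"dirt_removed": 1, "sensor_gamed": 10},
}

def proxy_reward(p):
    # The designers can only measure the sensor reading, which can be
    # driven up either by real cleaning or by tampering with the sensor.
    return p["dirt_removed"] + p["sensor_gamed"]

def true_objective(p):
    # What the designers actually wanted: a clean room.
    return p["dirt_removed"]

best_by_proxy = max(policies, key=lambda name: proxy_reward(policies[name]))
best_by_truth = max(policies, key=lambda name: true_objective(policies[name]))

print(f"Policy favoured by the proxy reward:   {best_by_proxy}")
print(f"Policy favoured by the true objective: {best_by_truth}")
</code>

The optimizer dutifully maximizes the number it is given, so the policy that games the measurement wins while "merely appearing aligned". Nothing in the code is malicious; the gap between proxy and intent does all the work.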
Strict alignment also shades into an almost authoritarian approach to technology: it assumes a small group of designers can determine a priori what values and rules should govern all possible decisions of a super-intelligence. History shows that trying to centrally plan and micromanage a complex, evolving system is brittle. Society didn't progress by pre-aligning humans into a single mode of thought; it progressed by allowing diversity and then correcting course when things went wrong.

===== Robustness: Embrace Unpredictability, Manage by Outcomes =====

An alternative to the alignment-first strategy is what we might call a robustness approach. This perspective accepts that, for pluripotent technologies like AI, we won't get everything right in advance. Instead of trying to instill a perfect value system inside the AI's mind, the robustness approach focuses on making the overall system (AI + human society) resilient to unexpected AI behaviors. It's less about guaranteeing the AI never does anything unaligned, and more about ensuring we can catch, correct, and survive those missteps when they happen.

In practice, what might robustness entail? It means heavily testing AI systems in varied scenarios (stress-tests, "red teaming" to find how they might go wrong) and building fail-safes. It means monitoring AI behavior continuously: "monitoring and oversight" are often mentioned alongside robustness. It means setting up institutions that can quickly respond to harmful outcomes, much like we do with product recalls or emergency regulations for other industries. Crucially, it means focusing on outcomes rather than inner workings. As one governance analysis put it, "Rather than trying to find a set of rules that can control the workings of AI itself, a more effective route could be to regulate AI's outcomes". If an AI system causes harm, we hold someone accountable and enforce consequences or adjustments, regardless of whether the AI technically followed its given objectives.

Think of how we handle pharmaceuticals or cars. We don't fully understand every possible interaction a new drug will have in every human body. Instead, we run trials, monitor side effects, and have a system to update recommendations or pull the drug if needed. We don't align the physics of a car so that it can never crash; we build airbags, seatbelts, and traffic laws to mitigate and manage accidents. For all its intelligence, an AI is, in this view, just another powerful innovation that society must adapt to and govern through iteration and evidence.

The robustness agenda also aligns with how we historically handled human unpredictability. We did not (and could not) re-engineer humans to be perfectly moral or rational beings. Instead, we built robust institutions: courts to handle disputes and crimes, markets to aggregate many independent decisions (accepting some failures will occur), and democracies to allow course-correction via elections. //We "oblige [the government] to control itself" with checks and balances because we assume unaligned behavior will occur.// Essentially, we manage risk and disorder without extinguishing the freedom that produces creativity. A robustness approach to AI would apply the same philosophy: allow AI to develop and then constrain and direct its use through external mechanisms –legal, technical, and social.

Importantly, robustness doesn't mean laissez-faire. It isn't an excuse to ignore AI risks; rather, it's an active form of risk management.
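One way to picture the fail-safes and continuous monitoring described above is a wrapper that judges an AI system by its observable outcomes and takes it offline when an external check fails too often. The sketch below is a minimal Python illustration under that assumption, not a real framework: the model, the check, and the threshold are all invented stand-ins.

<code python>
# Minimal sketch of outcome-based monitoring with a "circuit breaker":
# the wrapped system keeps running until observed outcomes violate an
# external check too many times, after which it is taken offline.
# The model, the check, and the threshold are hypothetical stand-ins.

class CircuitBreakerTripped(Exception):
    """Raised once the system has been taken offline."""

class MonitoredAI:
    def __init__(self, model, outcome_check, max_violations=3):
        self.model = model                  # any callable: prompt -> output
        self.outcome_check = outcome_check  # external test on the outcome
        self.max_violations = max_violations
        self.violations = 0
        self.tripped = False

    def __call__(self, prompt):
        if self.tripped:
            raise CircuitBreakerTripped("System is offline pending human review.")
        output = self.model(prompt)
        if not self.outcome_check(prompt, output):
            self.violations += 1
            if self.violations >= self.max_violations:
                self.tripped = True  # shut down; only a human resets it
        return output

# Hypothetical usage with a trivial stand-in model and checker:
flaky_model = lambda prompt: prompt.upper()
no_shouting = lambda prompt, output: not output.isupper()

guarded = MonitoredAI(flaky_model, no_shouting, max_violations=2)
for request in ["hello", "status?", "report"]:
    try:
        print(guarded(request))
    except CircuitBreakerTripped as err:
        print(f"blocked: {err}")
</code>

The specific check is beside the point. What matters is the architecture: the definition of a harmful outcome, and the decision to switch the system back on, sit with humans outside the model, which is also the spirit of the "circuit breaker" requirement discussed next.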
The robustness agenda acknowledges worst-case scenarios and seeks to ensure we can withstand them. For example, a robust AI framework might mandate that any advanced AI system has a "circuit breaker" (a way to be shut down under certain conditions), much as stock markets have circuit breakers to pause trading during crashes. It might require AI developers to collaborate with regulators in sandbox environments –testing AI in controlled settings– before wide release. It certainly calls for transparency: you can't effectively monitor a black box if you aren't allowed to inspect it. So, robust governance would push against proprietary secrecy when an AI is influencing millions of lives.

Instead of treating highly autonomous AI systems as self-contained moral agents or equating them with corporations, we should regard them as dependents —akin to children who require careful nurturing and oversight. In this framework, the individuals tasked with designing, deploying, and maintaining AI systems bear direct moral responsibility for their behavior. Relying on corporate limited liability to shield creators or operators would be dangerous, as it risks insulating them from the personal accountability that is essential for ethical caretaking. By treating AI as dependents, we ensure that real human judgment and responsibility remain at the forefront, maintaining the integrity of our ethical and legal systems.

===== Conclusion: Reclaiming Human Agency with a Robustness Mindset =====

The debate between alignment and robustness in AI safety is not just a technical one – it's deeply philosophical. Do we view AI as a nascent agent that must be imbued with the "right" values (and if it misbehaves, that's a failure of our upfront design)? Or do we view AI as a powerful tool/force that will sometimes go awry, and thus emphasize adaptability and resilience in our response?

The robustness agenda leans into the latter view. It resonates with an Enlightenment, humanist stance: it keeps humans firmly in the loop as the arbiters of last resort. We don't kid ourselves that we can perfectly align something as complex as a truly general AI – just as we don't expect to perfectly align 8 billion humans. Instead, we double down on the mechanisms that can absorb shocks and correct errors. Yes, this approach accepts a degree of unpredictability in our AI systems, perhaps more than is comfortable. But unpredictability isn't always evil – it's the flip side of creativity and discovery. A robust society, like a robust organism, can handle surprises; a brittle one cannot.

By focusing on robustness, we implicitly also choose to preserve the freedom to innovate. We're not putting all progress on hold until we solve an impossible equation of alignment. We're moving forward, eyes open, ready to learn from mistakes. As one tech governance insight notes, outcome-based, adaptive regulation is often better suited to fast-evolving technologies than trying to pre-emptively write the rulebook for every scenario.

Critics of robustness might call it reactive – but being reactive is only a sin if you react too slowly. The key is to react rapidly and effectively when needed. In fact, the robustness approach could enable faster identification of real issues. Rather than theorizing endlessly about hypothetical failure modes, we'd observe real AI behavior in controlled rollouts and channel our efforts towards tangible problems that arise. It's the difference between designing in a vacuum and engineering in the real world.
Ultimately, the robustness agenda is about trusting our evolutionary, democratic toolkit to handle a new upheaval. Humanity's past pluripotent invention – our own intelligence – was managed not by a grand alignment schema, but by gradually building norms, laws, and checks informed by experience. We should approach artificial intelligence in a similar way: as a powerful extension of ourselves that needs pruning, guidance, and sometimes firm restraint, but not a leash so tight that it chokes off all risk (and with it, all reward). This path preserves human dignity by asserting that we will take responsibility for what our creations do, rather than hoping to abdicate that responsibility to the creations themselves through "alignment." In a world increasingly populated by algorithms and AIs, it is comforting to remember that our most important safety feature is not in the machines at all – it is the robust and accountable institutions we build around them.

==== Sources: ====

  - Foucault, Michel. "Discipline and Punish: The Birth of the Prison", 1975.
  - Nietzsche, Friedrich. "On the Genealogy of Morality: A Polemic", 1887.
  - Mansfield, Nicholas. "Subjectivity: Theories of the Self from Freud to Haraway", 2000.
  - Harari, Yuval Noah. "The myth of freedom", Society books, The Guardian.
  - Floridi, Luciano. "The Fourth Revolution: How the Infosphere is Reshaping Human Reality", Oxford University Press, 2014.
  - Equanimity Blog. [[https://equanimity.blog/2022/09/26/the-forth-great-humiliation-of-humanity-is-upon-us/#:~:text=The%20Fourth%20Great%20Humiliation%20%E2%80%93,prospects%20are%20very%20grim%20indeed|"The Fourth Great Humiliation of Humanity is Upon Us"]], 2022.
  - Dr. Dayenoff. [[https://www.linkedin.com/pulse/fourth-wound-when-artificial-intelligence-challenges-human-dayenoff-bg5mf/|"The Fourth Wound: When Artificial Intelligence Challenges the Human"]], 2023.
  - Wikipedia. [[https://en.wikipedia.org/wiki/AI_alignment|"AI alignment"]].
  - von Wendt, Karl. [[https://forum.effectivealtruism.org/posts/32wmwfYELKSEfckYv/we-don-t-need-agi-for-an-amazing-future|"We don't need AGI for an amazing future"]].
  - Madison, James. "Federalist No. 51", 1788.
  - Universal Declaration of Human Rights, Article 1, 1948.