AI Safety

AI Safety

We are not on track to solve the hard problems of safety

Some time before we build AI that surpasses humanity’s intelligence, we need to figure out how to make AI systems safe. Once AI exceeds humanity’s intelligence, it will be in control, and our safety will depend on aligning AI’s goals with humanity’s best interests.

The alignment problem is not a mere technical challenge — it demands that we collectively solve one of the most difficult problems that humanity has ever tackled, requiring progress in fields that resist formalization, Nobel-prize-level breakthroughs, and billions or trillions of dollars of investment.

In Defining alignment, we explain what alignment really means and why it’s not just a technical problem but an all-encompassing civilizational one.

In Estimating the cost of solving alignment, we explore what makes alignment challenging, and describe what it would cost in terms of effort to solve it and avert extinction. 

In Current technical efforts are not on track to solve alignment, we take a critical look at the current level of funding, organizations, and research dedicated to alignment. We argue that these efforts are insufficient, and that many of them do not even acknowledge the cost or complexity of the challenge. 

In AI will not solve alignment for us, we turn to the question of whether AI can help us solve alignment. We show that any potential benefits are mostly illusions, and argue that trying to use more advanced AI to solve alignment is a dangerous strategy.

Because both current and future safety efforts are not on track to solve alignment, we conclude that we are not on track to avert catastrophe from godlike AI. 

Defining alignment
Defining alignment

In the field of AI, alignment refers to the ability to “steer AI systems toward a person's or group's intended goals, preferences, and ethical principles.”

With simpler systems that are less intelligent than humans, the alignment challenge addresses simpler safety issues, such as making current chatbots refuse to create propaganda or provide instructions for building weapons. 

For systems that exceed human intelligence, the alignment problem is more complex and depends on guaranteeing that AI systems as powerful as godlike do what is best for humanity. This has a vastly larger scope than just censoring chatbots. 

We already need to solve alignment today, which demands getting individuals, companies, and governments to act reliably according to some set of values. Alignment challenges vary depending on the scope and entity:

  • To align individuals, we educate them to behave according to a certain set of cultural values. We also enforce compliance with the law through threat of state punishment. Most people are aligned and share values, with rare aberrations such as sociopathic geniuses or domestic terrorists. 

  • To align companies with societal values, we rely on regulations, corporate governance, and market incentives. However, companies often find loopholes or engage in unethical practices, such as how Boeing’s profit motive undermined safety, leading to crashes and hundreds of fatalities.

  • To align governments with the will of the people, we rely on constitutions, checks and balances, and democratic elections. Some countries operate under dictatorships or authoritarian regimes. But both of these models can go wrong, leading governments to commit atrocities against their own people or experience democratic backsliding.

  • To align humanitytoward common goals like peace and environmental sustainability, we establish international organizations and agreements, like the United Nations and the Paris Climate Accords. On a global scale, enforcement is challenging — there are ongoing wars on multiple continents, and we have met only 17% of the Sustainable Development Goals (SDGs) that all United Nations member states have agreed to.

The examples show that alignment relies on processes that reliably incentivize entities to pursue good outcomes, based on some set of values. In each of the instances above, we need to design processes to determine values (e.g. constitutional conventions), reconcile them (e.g. voting), enshrine them (e.g. constitutions, amendments, laws), oversee and enforce them (e.g. institutions and police), and coordinate the constituent parts (e.g. administrations). 

A system is aligned if there is a mechanistic connection between the original values and reliable outcomes. For example, while UN member states all share the value of protecting the environment and strive toward the Sustainable Development Goals, they lack reliable processes to ensure traction. Regardless of intention, without concrete processes we cannot consider the UN successfully aligned with protecting the environment. 

While they often fail us, we currently entrust the fate of the world to governments, corporations, and international institutions. 

AI alignment demands solving all of the same problems our current institutions try to solve, but instead use software to do it. 

As AI becomes more intelligent, its causal impact will increase, and misalignment will be more consequential. We must find a way to install our deepest values in AI, addressing questions ranging from how to raise children, to what kinds of governance to apply to which problems.

Solving the alignment problem is philosophy on a deadline, and requires defining and reconciling our values, enshrining them in robust processes, and entrusting those processes to AIs that may soon be more powerful than we are.

Estimating the cost of solving alignment
Estimating the cost
of solving alignment

Although alignment is not an impossible problem, it is extremely difficult and requires answering novel social and technical questions humanity has not yet solved. By considering some of these questions, we can understand how much it would cost to solve this problem. 

What do we value and how do we reconcile contradictions in values? We must align godlike AI with “what humanity wants,” but what does this even mean?

It is clear that even as individuals, we often don’t know what we want. For example, if we say and think that we want to spend more time with our family, but then end up playing games on our phones, which one do we really want? Individuals often have multiple conflicting desires or unconscious preferences that make it difficult to know what someone really wants. 

When we zoom out from the individual to groups, up to the whole of humanity, the complexity of “finding what we want” explodes: when different cultures, different religions, different countries disagree about what they want on key questions like state interventionism, immigration, or what is moral, how can we resolve these into a fixed set of values? If there is a scientific answer to this problem, we have made little progress on it.

If we cannot find, build, and reconcile values that fit with what we want, we will lose control of the future to AI systems that ardently defend a shadow of what we actually care about.

Making progress on understanding and reconciling values requires ground-breaking advances in the fields of psychology, neuroscience, anthropology, political science, and moral philosophy. The former fields are necessary for diving into the human psyche resolving uncertainties related to human rationality, emotion, and biases, and the latter two are necessary for finding ways to resolve conflicts between these.

How can we predict the consequences of our actions? A positive understanding of “what we want” is insufficient to keep AI safe: we also need to understand the consequences of getting what we want, to avoid unwanted side effects. 

Yet history demonstrates how often we fail to see consequences of our actions until after they are implemented. The Indian vulture crisis was a massive environmental disaster in which a new medicine given to cows turned out to be toxic for vultures, which died by millions upon eating the carcasses. The collapse in vulture population meant that carcasses were not cleaned, contaminating water sources, providing breeding grounds for feral dogs with rabies, and ultimately leading to a humanitarian disaster costing billions due to a single unknown externality.

The same can happen for designing institutions. The Articles of Confederation was the first attempt to create a US government, but they left Congress completely impotent to govern the individual states, so this had to be corrected in the US constitution

Progress on our ability to predict the consequences of our actions requires better science in every technical field, and learning what to do with these predictions requires progress in fields like non-idealized decision making. The last 100 years have seen some progress in scientific thinking and decision theory, and some efforts in rationalism have even attempted to inspire better decision-making in light of the AI problem. But while better decision-making has had clear consequences in fields like investment–quantitative strategies are increasingly outperforming discretionary ones–most people make decisions the same way we did 100 years ago. 

To confidently move forward on these questions, we need faster science, simulation, and modeling; breakthroughs in fields related to decision-making; and better institutions that demonstrate these approaches work.

Process design for alignment: If we can answer the philosophical questions of values alignment, and get better at predicting and avoiding consequential errors, we still need to build processes to ensure that our values are represented in systems and actually enacted in the real world. 

Often, even the most powerful entities fail to build processes that connect the dots between values and end outcomes. Nearly every country struggles with taxation, particularly of large entities and high net worth individuals. And there are process failures abound in history, such as the largest famine ever caused by inefficient distribution of food within China’s planned economy during the Great Leap Forward. 

The mechanism design of these entities and their implementation are two separate things. When we zoom in on the theory, our best approaches aren’t great. The field of political philosophy attempts to make progress on statecraft, but ideas like the separation of powers in many modern constitutions are based on 250-year-old theories from Montesquieu. New ideas in voting theory have been proposed, and efforts like blockchain governance try to implement some of these, but these have done little so far to displace our current systems. Slow theoretical progress and little implementation of new systems suggest massive room to improve our current statecraft and decision-making processes. 

On the corporate level, we have the theory of the firm and management theory to tell us about how to run companies, but the start of the art in designing a winning company today looks a lot closer to Y-combinator’s oral theory of knowledge and knowing the right people than science, and even then the failure rate is very high. 

Neither statecraft nor making a company is a scientific process in which there are formal guarantees, and things often go very wrong. In the context of the alignment problem, this demonstrates large gaps in humanity’s knowledge that expose huge risks if we were to imagine trusting advanced AI systems to run the future. Without better theories and implementation, these systems could make the same mistakes, with larger consequences given their greater intelligence and power. 

Guaranteeing alignment: Last but not least, even if we can design processes to align AIs and we know what to align them to, we still need to be able to guarantee and check that they will actually do what we want. That is, as in any critical technology, we want guarantees that it won’t create a catastrophe before turning it on.

This is already difficult with normal software: making (almost) bug-free systems require the use of formal methods which are both expensive (in time, skill, effort) and in their infancy, especially with regard to the kind of complex properties that we would care about for AIs acting in the world.

And this is even harder with AIs built with the current paradigm, due to the fact that they are not built by hand (like normal software), but instead grown through mathematical optimization. This means that even the makers of AI systems have next to no understanding of what they can and cannot do, and no predictive capabilities whatsoever to anticipate what they will do before training them, or even just before using them.

But the situation is actually worse than that: whereas most current AI systems are still less smart than humans, alignment actually requires getting guarantees on systems that are significantly smarter than humans. That is, in addition to managing the complexity of software, and the obscurity of neural networks, we need to figure out how to check an entity which can outsmart us at every turn.

-

Even if all of these questions need not be answered at once, we nonetheless need to invent a process by which they are answered. Currently, our human science and morality is inadequate to address the risks posed by advanced AI, and it must improve for us to have a chance.

How much would this cost, in terms of funding and human effort?

When we look at major research efforts that humans have pursued in the past which led to breakthroughs and Nobel prizes, we can begin to envision what such a “significant research project” constitutes. The Manhattan Project cost $27B to produce the first nuclear weapons, and at its height employed 130,000 people. Over four years, researchers and engineers cracked problem after problem to develop the bomb, with over 31 Nobel-prize winners tied to the project. Another massive research effort, the Human Genome Project (HGP), cost $5B over 13 years and required contributions from thousands of researchers from various countries.

If alignment was of the same difficulty level as these problems, we would assume at least a tens-of-billions of dollars effort, featuring thousands of people, with dedicated coordination, and multiple breakthroughs of Nobel-prize magnitude. 

But the cost of alignment is almost guaranteed to be significantly higher. While the HGP and Manhattan Project were profoundly difficult projects, they were concentrated in narrower domains of study in fields that were already hard sciences. In contrast, alignment is a pre-paradigmatic field, in which many of the questions we need to answer resist study. Many of the domains above that require advances (psychology, anthropology, economics, simulation) are not hard sciences, but would need to become them–similar to the transitions from alchemy to chemistry. And at the end of the day, alignment needs to be just right, as “close but dangerously wrong”10 answers could lead to death. 

But the cost of alignment is almost guaranteed to be significantly higher. While the HGP and Manhattan Project were profoundly difficult projects, they were concentrated in narrower domains of study in fields that were already hard sciences. In contrast, alignment is a pre-paradigmatic field, in which many of the questions we need to answer resist study. Many of the domains above that require advances (psychology, anthropology, economics, simulation) are not hard sciences, but would need to become them–similar to the transitions from alchemy to chemistry. And at the end of the day, alignment needs to be just right, as “close but dangerously wrong10 answers could lead to death. 

But the cost of alignment is almost guaranteed to be significantly higher. While the HGP and Manhattan Project were profoundly difficult projects, they were concentrated in narrower domains of study in fields that were already hard sciences. In contrast, alignment is a pre-paradigmatic field, in which many of the questions we need to answer resist study. Many of the domains above that require advances (psychology, anthropology, economics, simulation) are not hard sciences, but would need to become them–similar to the transitions from alchemy to chemistry. And at the end of the day, alignment needs to be just right, as “close but dangerously wrong10 answers could lead to death. 

10  Specification gaming is the idea that an AI can satisfy the goal that a developer sets out for it during training, but end up following a different goal later. This occurs for a few reasons, but the result is dangerous in every case involving superintelligence: with systems this powerful, slight divergences from the original goal could lead to catastrophic divergences in downstream actions.

Given the magnitude of the danger ahead, the complexity and uncertainty of these estimates should make us even more careful, and cause us to assume that the costs may be higher still than what is presented here.

We conclude that solving alignment is extremely hard, and the cost is clearly very high: at least billions, maybe trillions, and with a time frame of decades of a constant string of nobel-prize winning research.

Current technical efforts are not on track to solve alignment
Current technical efforts are not
Current technical efforts are not
on track to solve alignment
on track to solve alignment

The field of AI Safety is not making meaningful progress on or investment in alignment; current funding and focus are insufficient, and the research approaches being pursued do not attend to the hard problems of aligning AI. 

On the capabilities side, there are reports thatOpenAI and Microsoft plan to build a $100B data center, and Demis Hassabis’ has commented that GoogleDeepmind is “investing more than that over time.” 

In comparison, government AI Safety Institutes are funded on the order of tens to hundreds of millions, with the UK’s AISI allocated £100m, and alignment funding outside AGI companies and government estimated at $100m/year. At AGI companies such as DeepMind, alignment research efforts are run by small teams, and some (like OpenAI) are presently suffering from mass exodus. The most concerted alignment-related investment focuses on compute-intensive interpretability experiments and on the teams that run them, but this is unlikely to reach beyond the tens of millions.

And these are optimistic estimates; in reality, only a tiny fraction of this total goes to genuine AI alignment efforts. With few exceptions, the majority of funding is directed at problems associated with AI safety, rather than paying the exorbitant cost of alignment. 

Nearly all current technical safety approaches are limited in their efficacy and trail their own stated goals:

  • Black-box evaluations and red-teaming aim to test a model's capabilities to evaluate how powerful or dangerous it is. With this strategy, the theory of change is that identifying dangerous behavior could forceAI companies to pause development or governments to coordinate on regulation. Teams working on evaluations include AI Safety Institutes, METR and Evaluations at Anthropic.

    Black-Box Evaluations can only catch all relevant safety issues insofar as we have either an exhaustive list of all possible failure modes, or a mechanistic model of how concrete capabilities lead to safety risks. We currently have neither, so evaluations boil down to ad-hoc lists of tests that capture some possible risks. These are insufficient even for today’s models, as demonstrated by the fact that current LLMs can notice they are being tested, which the evaluators and researchers did not even anticipate.

  • Interpretability aims to reverse-engineer the concepts and thought processes a model uses to understand what it can do and how it works. This approach presumes that a more complete understanding of the systems could prevent misbehavior, or unlock new ways to control models, such as training them to be fully honest. Teams working on interpretability include Interpretability at Anthropic and Apollo Research.

    Interpretability’s value depends on its ability to fully understand and reverse engineer AI systems to check if they have capabilities and thoughts that might lead to unsafe actions. Yet current interpretability research is unable to do that even for LLMs a few generations back (GPT2), let alone for the massive and complex models used in practice today (Claude and GPT4/o1).

    And even with full understanding and reverse engineering of state of the art LLMs, interpretability is blind to any form of extended cognition, such as what the system can do when connected to the environment (notably the internet), given tools, interacting with other systems or instances of itself. A huge part of recent progress in AI comes from moving to agents and scaffolding that leverage exactly this form of extended cognition. Just as solving neuroscience would be insufficient to explain how a company works, even full interpretability of an LLM would be insufficient to explain most research efforts on the AI frontier.

  • Whack-A-Mole Fixes, which use techniques like RLHF, fine-tuning, and prompting to remove undesirable model behavior or a specific failure mode. The theory of change is that current safety problems can be solved in a patchwork manner, addressed as they arise, and that we can perhaps learn from this process to correct the behavior of more advanced systems. Teams working on this include Alignment Capabilities at Anthropic and OpenAI’s Safety Team.

    Whack-A-Mole fixes, from RLHF to finetuning, are about teaching the system to not demonstrate problematic behavior, not about fundamentally fixing that behavior. For example, a model that produces violent text output may be finetuned to be more innocuous, but the underlying base model is just as capable of producing violent content as ever. The problem as to how this behavior arose in the first place is left unaddressed by even the best finetuning. By pushing for models to hide unsafe actions rather than resolving underlying issues, whack-a-mole fixes lead to models that are more and more competent at hiding their issues and failures, rather than models that are genuinely safer.

At best, these strategies can identify and incrementally correct problems, address model misbehavior, and use misbehavior as a red flag to motivate policy solutions and regulations. However, even according to their proponents, these strategies do not attempt to align superhuman AI, but merely align the next generation of systems, trusting that aligning the Nth systems will help align the N+1 system.

This approach can be thought of as Iterative Alignment, a strategy that rests on the hope that we can build slightly smarter systems, align those, and use them to help align successor systems, repeating the process until we reach superintelligent AI. OpenAI’s Superalignment plan explicitly states this:

"Currently, we don't have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue. Our current techniques for aligning AI, such as reinforcement learning from human feedback, rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems much smarter than us, and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs.

Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence. To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline."

The plans of nearly every other AI company are similarly limited and failure-prone attempts at iterative alignment. Deepmind’s 2024 update on AGI safety approaches discusses the evaluative techniques listed above and names “amplified oversight” as its focus. Anthropic’s Core Views on AI Safety describes how the evaluative techniques can be combined in a “portfolio approach” to keep advanced AI safe, and offers a similar justification for iterative alignment:

"Turning language models into aligned AI systems will require significant amounts of high-quality feedback to steer their behaviors. A major concern is that humans won't be able to provide the necessary feedback. It may be that humans won't be able to provide accurate/informed enough feedback to adequately train models to avoid harmful behavior across a wide range of circumstances. It may be that humans can be fooled by the AI system, and won't be able to provide feedback that reflects what they actually want (e.g. accidentally providing positive feedback for misleading advice). It may be that the issue is a combination, and humans could provide correct feedback with enough effort, but can't do so at scale. This is the problem of scalable oversight, and it seems likely to be a central issue in training safe, aligned AI systems.

Ultimately, we believe the only way to provide the necessary supervision will be to have AI systems partially supervise themselves or assist humans in their own supervision. Somehow, we need to magnify a small amount of high-quality human supervision into a large amount of high-quality AI supervision. This idea is already showing promise through techniques such as RLHF and Constitutional AI, though we see room for much more to make these techniques reliable with human-level systems. We think approaches like these are promising because language models already learn a lot about human values during pretraining. Learning about human values is not unlike learning about other subjects, and we should expect larger models to have a more accurate picture of human values and to find them easier to learn relative to smaller models. The main goal of scalable oversight is to get models to better understand and behave in accordance with human values."

Regardless of whether or not one believes that this strategy will work, it is clear that this approach does not adequately address the true complexity of alignment. A meaningful attempt at alignment must integrate moral philosophy to understand values reconciliation, implement formal verification to make guarantees about system properties, consider humanitarian questions of what we value and why, and propose institution design, at minimum. All current efforts fail to do so.

Today’s AI safety research is vastly underfunded compared to investments in capabilities work, and the majority of technical approaches intentionally do not address the conceptual complexity of alignment, instead operating in a reactive  empiricist framework that simply identifies misbehavior once it already exists. Humanity’s current AI safety plan is to race toward building superintelligent AI, and delegate the most difficult questions of alignment to AI itself. This is a naive and dangerous approach.

AI will not solve alignment for us
AI will not solve alignment for us

For a safe future, we must solve the hard problems of alignment, allocating adequate research hours, investment, and coordination effort. OpenAI, Deepmind, Anthropic, X.AI (“accelerating human scientific discovery”), and others have all proposed deferring and outsourcing these questions to more advanced future AI systems.

But on reflection, this is an incredibly risky approach. Situational Awareness, a document written by ex-OpenAI superalignment researcher Leopold Aschenbrenner which has gotten significant traction even from popular news outlets, puts the argument bluntly. Aschenbrennerargues for a vision of the future in which AI becomes powerful extremely quickly due to scaling up the orders of magnitude (“OOMs”) of AI models. When discussing future safety approaches, he makes a vivid argument for iterative alignment:

"Ultimately, we’re going to need to automate alignment research. There’s no way we’ll manage to solve alignment for true superintelligence directly; covering that vast of an intelligence gap seems extremely challenging. Moreover, by the end of the intelligence explosion—after 100 million automated AI researchers have furiously powered through a decade of ML progress—I expect much more alien systems in terms of architecture and algorithms compared to current system (with potentially less benign properties, e.g. on legibility of CoT, generalization properties, or the severity of misalignment induced by training).

But we also don’t have to solve this problem just on our own. If we manage to align somewhat-superhuman systems enough to trust them, we’ll be in an incredible position: we’ll have millions of automated AI researchers, smarter than the best AI researchers, at our disposal. Leveraging these army of automated researchers properly to solve alignment for even-more superhuman systems will be decisive.

Getting automated alignment right during the intelligence explosion will be extraordinarily high-stakes: we’ll be going through many years of AI advances in mere months, with little human-time to make the right decisions, and we’ll start entering territory where alignment failures could be catastrophic."

The dangers here are explicit: alien systems, huge advances in mere months, and a tightrope walk through an “intelligence explosion” in which wrong choices could lead to catastrophe. 

But even before we get to a dramatic vision of the AI future, the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs. 

Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.

Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.

Proponents of iterated alignment argue that this is not a real issue, because “evaluation is easier than generation.” For example, Aschenbrenner further argues in Situational Awareness that:

"We get some of the way [to superalignment] “for free,” because it’s easier for us to evaluate outputs (especially for egregious misbehaviors) than it is to generate them ourselves. For example, it takes me months or years of hard work to write a paper, but only a couple hours to tell if a paper someone has written is any good (though perhaps longer to catch fraud). We’ll have teams of expert humans spend a lot of time evaluating every RLHF example, and they’ll be able to “thumbs down” a lot of misbehavior even if the AI system is somewhat smarter than them. That said, this will only take us so far (GPT-2 or even GPT-3 couldn’t detect nefarious GPT-4 reliably, even though evaluation is easier than generation!)"

The argument holds for standard peer-review, where the authors and reviewers are generally on the same intellectual level, with sensibly similar cognitive architecture, education, and knowledge. But this doesn’t not apply to automated alignment research, where to be useful the research needs to be done by AIs that are both smarter and faster than humans.

The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.

Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved – yet no one ever proposes an actual plan for doing so beyond “the (unaligned) AIs will help us”.

There are other risks with this approach as well. 

The more powerful AI we have, the faster things will go. As AI systems improve and automate their own learning, AGI will be able to improve faster than our current research, and ASI will be able to improve faster than humanity can do science. The dynamics of intelligence growth means that it is possible for an ASI “about as smart as humanity” to move to “beyond all human scientific frontiers” on the order of weeks or months. While the change is most dramatic with more advanced systems, as soon as we have AGI we enter a world where things begin to move much quicker, forcing us to solve alignment much faster than in a pre-AGI world.

'Tensions between world powers will also heat up as AI becomes more powerful, something we are already witnessing in AI weapons used in warfare, global disinformation campaigns, the US-China chip war, and how Europe is struggling with regulation around Big Tech. As we move towards AGI, ASI, and eventually godlike AI, pressure on existing international treaties and diplomacy methods will be pushed beyond their limits. Unlike with nuclear war, there is not necessarily the same promise of mutually assured destruction with AI that could create a (semi)stable equilibrium. Ensuring geopolitical stability is necessary to create supportive conditions to solve the hard problems of alignment, something that gets more challenging if AI is becoming rapidly more powerful. 

AGI and its successor AIs will also cause massive political, economical, and societal destabilization through automating disinformation and online manipulation, job automation, and other shifts that look like “issues seen today but magnified as systems grow stronger”. This in turn makes coordination around massive research projects like the ones necessary to solve alignment extremely difficult. 

Thus, iterative alignment fails on multiple accounts. In addition to not addressing the hard parts of alignment, it also encourages entering a time-pressured and precarious world.

-

We have seen that alignment is an incredibly complex technical and social problem, one of the most complex any civilization needs to handle. And while the costs are enormous, no one is even starting to pay them, instead hoping that they will disappear by themselves as AIs become more powerful.

In the light of this failure to address the risks of godlike AI from a research angle, it’s necessary to strongly slow down and regulate AI progress, in order to avoid the catastrophe ahead. This comes from strong AI regulations, policies, and institutions.

Unfortunately, as we explore next, the landscape is as barren here as it is in the research side.

Footnotes

10  Specification gaming is the idea that an AI can satisfy the goal that a developer sets out for it during training, but end up following a different goal later. This occurs for a few reasons, but the result is dangerous in every case involving superintelligence: with systems this powerful, slight divergences from the original goal could lead to catastrophic divergences in downstream actions.

Footnotes

10  Specification gaming is the idea that an AI can satisfy the goal that a developer sets out for it during training, but end up following a different goal later. This occurs for a few reasons, but the result is dangerous in every case involving superintelligence: with systems this powerful, slight divergences from the original goal could lead to catastrophic divergences in downstream actions.