X-Risk Analysis for AI Research

Dan Hendrycks

Mantas Mazeika

LESSWRONG

Cheat Sheet of AI X-risk

This document was made as part of my internship at  EffiSciences . Thanks to everyone who helped review it, in particular Charbel-Raphaël Segerie, Léo Dana, Jonathan Claybrough and Florent Berthet!

Introduction

Clarifying AI X-risk is a summary of a literature review of AI risk models by DeepMind that (among other things) categorizes the threat models it studies along two dimensions: the technical cause of misalignment and the path leading to X-risk. I think this classification is inadequate: the technical causes of misalignment attributed to some of the presented risk models deserve a more nuanced treatment.

So, I made a document that would go more in-depth and accurately represent what (I believe) the authors had in mind. Hopefully, it can help researchers quickly remind themselves of the characteristics of the risk model they are studying.  

Here it is , for anyone who wants to use it! Anyone can suggest edits to the cheat sheet. I hope that this can foster healthy discussion about controversial or less detailed risk models.

If you just want to take a look, go right ahead! If you would like to contribute, the penultimate section explains the intended use of the document, the conventions used and how to suggest edits.

There is also  a more detailed version if you are interested, but that one is not nearly as well formatted (i.e. ugly as duck), nor is it up to date. It is where I try to express more nuanced analysis than allowed by a dropdown list.

Some information in this cheat sheet is controversial; for example, the inner/outer framework is still debated. In addition, some information might be incorrect. I did my best, but I do not agree with all the sources I have read (notably, Deepmind classifies Cohen’s scenario as specification gaming, but I think it is more like goal misgeneralization) so remain critical. My hope is that after a while the controversial cell comments will contain links and conversations to better understand the positions involved.

The takeaways are mostly my own work, with all the risks that entails.

Some values had to be simplified for the sake of readability; for instance, it would be more accurate to say that Cohen’s scenario does not require the AI to be powerful rather than to say the cause of the AI’s power is unclear. If that bothers you, you can look at the detailed version, where I make fewer compromises at the cost of readability.

The takeaways are my own and do not necessarily represent the views of the authors of the risk models. They are all presented with respect to the risk models examined.

General takeaway

General takeaway: There are many paths to AI X-risk. Some are more likely or dangerous than others, but none can be neglected if we are to avoid disaster.

Outer/inner/intent

(Keep in mind that the inner/outer framework is not universally accepted.)

Across the many scenarios, misalignment comes from many places. No single solution will solve everything, though inner misalignment is depicted significantly more often than other types.

Technical causes of misalignment

It is not enough to have a solution for a specific cause, because there are many technical reasons for misalignment.

In addition, some risk models simply require any source of misalignment, which is why I think a  security mindset is necessary to build a safe AI: we need to prove that there is no loophole, that the machine is  daemon-free , to guarantee that none of these risk models are allowed by an AI design.

Social causes of misalignment

  • The most frequently cited factor of misalignment is competitive pressure [1] : it increases the likelihood of accidents due to haste or misuse. Solving these coordination problems (through discussion, contracts, government intervention...) therefore seems to be an important lever to reduce x-risk.
  • Even if we coordinate perfectly, there are technical hurdles that could still lead to failure. Governance could allow us to reach existential safety by freezing AI research, training or deployment, but this will not be enough to solve alignment: we need to allocate resources for technical alignment.

I did not expect coordination problems to be so dangerous; going into this, I was mostly concerned about tackling “hard alignment”. I had to update hard: most risk models assume that a major reason technical causes are not addressed is a societal failure [2] to implement safeguards and regulation, to take our time, and to agree not to deploy unsafe AI.

We could completely solve technical alignment (have the knowledge to make an aligned AGI) and still die because someone doesn't use those techniques, which is a likely scenario if we don't solve coordination [3].

Deception & power-seeking

There is a strong correlation between deception, power-seeking and agency. They are present in most scenarios and are generally considered to indicate very high risk. Developing reliable assessments to measure deception and power-seeking appears to be robustly beneficial [4]. Surprise, surprise!

Capabilities

  • Ensuring that AIs are not agentic would go a long way towards preventing most AI takeover scenarios.
  • But even if no AI were agentic, a disaster could happen because of societal failure, as in Christiano1 [5].

Most scenarios consider that an advanced AI will be able to seize power, so denying it access to resources such as banks and factories is not enough of a safeguard. There is  extensive   literature on the subject; the general consensus is that boxing might have some use, but it's unlikely to work.

AI-takeover-like scenarios usually require specific capabilities that we can detect in potentially dangerous AIs through capability assessment and interpretability. In many scenarios, these capabilities come from general purpose training and are not externally visible, so it is a misconception to assume that "we'll see it before we get there", which is why interpretability is useful.

Before compiling this sheet, I considered interpretability (especially mechanistic interpretability) laudable but probably not worth the effort. Now I think it is one of the most important tools for early warning (which, combined with good governance, would allow us not to train/deploy some dangerous systems).

Existential risk

Yes,  AI is risky .

While not all scenarios are guaranteed to lead to human extinction, it is important to remember that even if we survive, disempowerment and other societal issues would need to be addressed (these are  risk models after all).  There is much uncertainty about the severity of the risks, but we don’t really have the option to test them to see just how bad they are. It might be interesting for policymakers to have toy models to reduce that uncertainty?

Individual takeaways

If you had to remember only one thing from a risk model, what would it be? Usually, the papers are hard to summarize, but for the sake of policymakers and amateur researchers, I tried to do it in one or two sentences in the document. For a more in-depth analysis, I highly recommend DeepMind’s own work.

Class of scenario

This classification might be the least accurate information in the sheet because it is original work, and I don’t have an established track record.

Mostly, I grouped scenarios together if they seemed to share some similarities which I think are noteworthy.

Risk models are separated into three classes:

  • Christiano1:  “You get what you measure”
  • Christiano2:  “Influence-seeking behavior is scary”
  • Critch1:  Production Web
  • Critch2:  Flash Economy
  • Hubinger: “ How likely is deceptive alignment? ”
  • Soares:  Capabilities Generalization & Sharp Left Turn
  • Cohen: “ Advanced artificial agents intervene in the provision of reward ”
  • Cotra: “ The easiest path to transformative AI… ”
  • Ngo: “ The alignment problem from a deep learning perspective ”
  • Shah:  AI Risk from Program Search
  • Carlsmith: “ Is Power-seeking AI an existential risk? ”

These categories are usually different in methodology, scope of study and conclusions. 

How to use the cheat sheet

The document reflects the personal opinions of the people who edit it; hopefully, these opinions correspond somewhat to reality, but in particular, the original authors of the risk models have not endorsed the results, as far as I know.

Intended use

Here is the typical use case the document was imagined for: You have been thinking about AI X-risk, and want to study a potential solution.

  • Consider which scenario it might prevent (find the proper row)
  • Consider what dimensions of the scenario your solution modifies (columns)
  • See to what extent the scenario is improved.

Hopefully, this will remind you of dimensions you might otherwise overlook, and of which scenarios your solution does not apply to.

AI X-risk is vast, and oftentimes you see potential solutions that  end up being naive . Hopefully, this document can provide a sanity check to help refine your ideas.

Reading/writing conventions

I tried to represent the authors’ views, to the best of my understanding. "?" means the author does not provide information about that variable or that I haven't found anything yet.

Otherwise, cells are meant to be read as “The risk model is … about this parameter” (e.g. “The risk model is agnostic about why the AI is powerful”) or “The risk model considers this situation to be …” (e.g. “The risk model considers the social causes of AI to be irrelevant”), except for takeaways which are full sentences.

In the detailed sheet, parentheses are sometimes used to give more details. This comes at the cost of dropdown lists, sorry.

How to edit the cheat sheet

Here are a few rules to keep the sheet as readable as possible for everyone:

  • If you disagree with the value of a cell, add a comment proposing a new value and explaining your position. Keep one topic per thread.
  • If you want to make a minor amendment, color your comment yellow to orange, depending on the importance of the change.
  • If you want to make a major amendment (e.g. the current value is misleading even for an experienced reader), color your comment red.
  • If you want to add a column or make changes of the same scale, mark the topmost cell of the relevant column with a magenta comment.

You can also write a proposed edit here and hope I read it, especially for long diatribes.

If you want to edit the detailed sheet, please send me your email address in private. If you are motivated enough to read  that thing , then imo you deserve editing rights.

The document is very far from summarizing all the information in the sources. I strongly encourage you to read them by yourself.

The links for each risk model are in the leftmost column of the document.

Literature reviews

  • Clarifying AI X-risk &  Threat Model Literature Review (Deepmind)
  • My Overview of the AI Alignment Landscape: Threat Models (Neel Nanda)

Additional ideas for relevant variables

  • Distinguishing AI takeover scenarios (authors)
  • My Overview of the AI Alignment Landscape: A Bird's Eye View - LessWrong (Neel Nanda)

This post will be edited if helpful comments point out how to improve it.

Footnotes

To keep it simple, I regrouped all sorts of perverse social dynamics under this term, even if it is slightly misleading. I’d have used  Moloch if I hadn’t been told it was an obscure reference. You might also think in terms of inadequate equilibria or game-theoretic reasons to be dumb. Also, dumb reasons, that happens too.

More precisely, failure to solve a coordination problem. If you want to learn more, see some posts on coordination. I’m told I should mention Moral Mazes; also, I really enjoyed Inadequate Equilibria. These readings are not necessary, but I think they provide a lot of context when researchers mention societal failure with few details. Of course, some do provide a lot of details, which I wish everyone did because it makes the model more crisp, but it’s also sticking your neck out for easy chopping off.

Or do a pivotal act? There’s a discussion to be had here on the risks of trying to perform a pivotal act when you suffer from coordination problems.

I don’t know of a protocol that would allow me to test a pivotal act proposal with enough certainty to convince me it would work, in our kind of world.

I don’t think that measuring deception and power-seeking has great value in itself, because you have to remember to fix them too, and I expect that to be a whole different problem. Still, it would almost certainly be a good tool to have. In addition, beware of building a deception-detecting tool and mistakenly thinking that it detects every deception.

In some scenarios where the AI is agentic, it is made an agent by design (through training or architecture) for some reason, which, given a possible current paradigm, might be prevented with good governance. In others, agenticity emerges for complicated reasons, which I can only recommend you take as warnings against ever thinking your AI is safe by design.

List of known discrepancies:

  • Deepmind categorizes Cohen’s scenario as specification gaming (instead of crystallized proxies).
  • They consider Carlsmith to be about outer misalignment?  

Value lock-in, persuasive AI and Clippy are on my TODO list to be added shortly. Please do tell if you have something else in mind you'd like to see in my cheat sheet!

AI ALIGNMENT FORUM

Clarifying AI X-risk

TL;DR: We give a threat model literature review, propose a categorization and describe a consensus threat model from some of DeepMind's AGI safety team. See our post for the detailed literature review.

The DeepMind AGI Safety team has been working to understand the space of threat models for existential risk (X-risk) from misaligned AI. This post summarizes our findings. Our aim was to clarify the case for X-risk to enable better research project generation and prioritization. 

First, we conducted a literature review of existing threat models, discussed their strengths/weaknesses and then formed a categorization based on the technical cause of X-risk and the path that leads to X-risk. Next we tried to find consensus within our group on a threat model that we all find plausible.

Our overall take is that there may be more agreement between alignment researchers than their  disagreements might suggest, with many of the threat models, including our own consensus one, making similar arguments for the source of risk. Disagreements remain over the difficulty of the alignment problem, and what counts as a solution.

Categorization

Here we present our categorization of threat models from our literature review , based on the technical cause and the path leading to X-risk. It is summarized in the diagram below. 

[Diagram: threat models shown as arrows from technical cause (specification gaming, goal misgeneralization) to path to X-risk (interaction of multiple systems, misaligned power-seeking).]

In green on the left we have the technical cause of the risk, either specification gaming (SG) or goal misgeneralization (GMG). In red on the right we have the path that leads to X-risk, either through the interaction of multiple systems, or through a misaligned power-seeking (MAPS) system. The threat models appear as arrows from technical cause towards path to X-risk.

The technical causes (SG and GMG) are not mutually exclusive, both can occur within the same threat model. The distinction between them is motivated by the common distinction in machine learning between failures on the training distribution, and when out of distribution. 

To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI [1]. We note that the main downside of this operationalization is that if even one out of a huge number of training data points gets bad feedback, we would classify the failure as specification gaming, even though that one datapoint likely made no difference.

To classify as goal misgeneralization, the system’s behavior out of distribution (i.e., on inputs not drawn from the training data) must generalize poorly with respect to its goal while its capabilities generalize well, leading to undesired behavior. This means the AI system doesn’t just break entirely: it still competently pursues some goal, but it’s not the goal we intended.
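
As a toy illustration of the decision rule implied by these two definitions, the sketch below (my own illustration in Python, not code from the post; the boolean inputs are hypothetical summaries of an observed failure) encodes the two non-exclusive technical causes:

    def classify_technical_cause(bad_feedback_on_training_data: bool,
                                 capabilities_generalize_ood: bool,
                                 goal_generalizes_ood: bool) -> list:
        """Return the (non-exclusive) technical causes suggested by a failure."""
        causes = []
        # Specification gaming: bad feedback on the actual training data,
        # even if only a single datapoint received it under this operationalization.
        if bad_feedback_on_training_data:
            causes.append("specification gaming")
        # Goal misgeneralization: out of distribution, capabilities generalize
        # well but the pursued goal is not the intended one.
        if capabilities_generalize_ood and not goal_generalizes_ood:
            causes.append("goal misgeneralization")
        return causes

    # Example: perfect training feedback, but the system competently pursues
    # the wrong goal on new inputs -> goal misgeneralization only.
    print(classify_technical_cause(False, True, False))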

The path leading to X-risk is classified as follows. When the path to X-risk is from the interaction of multiple systems , the defining feature here is not just that there are multiple AI systems (we think this will be the case in all realistic threat models), it’s more that the risk is caused by complicated  interactions between systems that we heavily depend on and can’t easily stop or transition away from. (Note that we haven't analyzed the multiple-systems case very much, and there are also other technical causes for those kinds of scenarios.)

When the path to X-risk is through Misaligned Power-Seeking (MAPS), the AI system seeks power in unintended ways due to problems with its goals. Here,  power-seeking means the AI system seeks power as an instrumental subgoal, because having more power increases the options available to the system allowing it to do better at achieving its goals.  Misaligned here means that the goal that the AI system pursues is not what its designers intended [2] .

There are other plausible paths to X-risk (see e.g. this  list ), though our focus here was on the most popular writings on threat models in which the main source of risk is technical, rather than through poor decisions made by humans in how to use AI.

For a summary on the properties of the threat models, see the table below.

We can see that five of the threat models we considered substantially involve both specification gaming and goal misgeneralization (note that these threat models would still hold if one of the risk sources was absent) as the source of misalignment, and MAPS as the path to X-risk. This seems like an area where multiple researchers agree on the bare bones of the threat model - indeed our group’s consensus threat model was in this category too.

One aspect that our categorization has highlighted is that there are potential gaps in the literature, as emphasized by the question marks in the table above for paths to X-risk via the interaction of multiple systems, where the source of misalignment involves goal misgeneralization. It would be interesting to see some threat models that fill this gap.

For other overviews of different threat models, see here and here .

Consensus Threat Model

Building on this literature review, we looked for consensus among our group of AGI safety researchers. We asked ourselves: conditional on there being an existential catastrophe from misaligned AI, what is the most likely threat model that brought it about? This is independent of the probability of an existential catastrophe from misaligned AI occurring. Our resulting threat model is as follows (black bullets indicate agreement, white indicates some variability among the group):

Development model: 

  • Scaled up deep learning  foundation models with RL from human feedback ( RLHF ) fine-tuning.
  • Not many more fundamental innovations needed for AGI.

Risk model: 

  • Main source of risk is a mix of specification gaming and (a bit more from)  goal misgeneralization .
  • Perhaps this arises mainly during RLHF rather than in the pretrained foundation model because the tasks for which we use RLHF will benefit much more from consequentialist planning than the pretraining task.
  • Perhaps certain architectural components such as a tape/scratchpad for memory and planning would accelerate this.
  • Perhaps it’s unclear who actually controls AI development.
  • Interpretability will be hard.

By  misaligned consequentialist we mean 

  • It uses consequentialist reasoning: it evaluates the outcomes of various possible plans against some metric and chooses the plan that does best on that metric.
  • It is misaligned: the metric it uses is not a goal that we intended the system to have.

Overall we hope our threat model strikes the right balance of giving detail where we think it’s useful, without being too specific (which carries a higher risk of distracting from the essential points, and higher chance of being wrong).

Takeaway 

Overall we thought that alignment researchers agree on quite a lot regarding the sources of risk (the collection of threat models in blue in the diagram). Our group’s consensus threat model is also in this part of threat model space (the closest existing threat model is Cotra ).   

In this definition, whether the feedback is good/bad does not depend on the reasoning used by the AI system, so e.g. rewarding an action that was chosen by a misaligned AI system that is trying to hide its misaligned intentions would still count as good feedback under this definition.

There are other possible formulations of misaligned, for example the system’s goal may not match what its  users want it to do.

I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.

To me, Christiano's Get What You Measure scenario looks much more plausible a priori to be "what happens by default". For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions? Why additionally assume that we need consequentialist reasoning, or that power-seeking has to kick in and incentivize deception over and above whatever competing incentives might be present? Why assume all that, when RLHF already selects for actions which deceive humans in practice even in the absence of consequentialism?

Or, another angle: the diagram in this post starts from "specification gaming" and "goal misgeneralization". If we just start from prototypical examples of those failure modes, don't assume anything more than that, and ask what kind of AGI failure the most prototypical versions of those failure modes lead to... it seems to me that  they lead to Getting What You Measure. This story about consequentialism and power-seeking has a bunch of extra pieces in it, which aren't particularly necessary for an AGI disaster.

(To be clear, I'm not saying the consequentialist power-seeking deception story is implausible; it's certainly plausible enough that I wouldn't want to build an AGI without being pretty damn certain that it won't happen! Nor am I saying that I buy all the aspects of Get What You Measure - in particular, I definitely expect a much foomier future than Paul does, and I do in fact expect consequentialism to be convergent. The key thing I'm pointing to here is that the consequentialist power-seeking deception story has a bunch of extra assumptions in it, and we still get a disaster with those assumptions relaxed, so naively it seems like we should assign more probability to a story with fewer assumptions.)

(Speaking just for myself in this comment, not the other authors)

I still feel like the comments on your post are pretty relevant, but to summarize my current position:

  • AIs that actively think about deceiving us (e.g. to escape human oversight of the compute cluster they are running on) come well before (in capability ordering, not necessarily calendar time) AIs that are free enough from human-imposed constraints and powerful enough in their effects on the world that they can wipe out humanity + achieve their goals without thinking about how to deal with humans.
  • In situations where there is some meaningful human-imposed constraint (e.g. the AI starts out running on a data center that humans can turn off), if you don't think about deceiving humans at all, you choose plans that ask humans to help you with your undesirable goals, causing them to stop you. So, in these situations, x-risk stories require deception.
  • It seems kinda unlikely that even the AI free from human-imposed constraints like off switches doesn't think about humans at all. For example, it probably needs to think about other AI systems that might oppose it, including the possibility that humans build such other AI systems (which is best intervened on by ensuring the humans don't build those AI systems).

Responding to this in particular:

The key thing I'm pointing to here is that the consequentialist power-seeking deception story has a bunch of extra assumptions in it, and we still get a disaster with those assumptions relaxed, so naively it seems like we should assign more probability to a story with fewer assumptions.

The least conjunctive story for doom is "doom happens". Obviously this is not very useful. We need more details in order to find solutions. When adding an additional concrete detail, you generally want that detail to (a) capture lots of probability mass and (b) provide some angle of attack for solutions.

For (a): based on the points above I'd guess maybe 20:1 odds on "x-risk via misalignment with explicit deception" : "x-risk via misalignment without explicit deception" in our actual world. (Obviously "x-risk via misalignment" is going to be the sum of these and so higher than each one individually.)

For (b): the "explicit deception" detail is particularly useful to get an angle of attack on the problem. It allows us to assume that the AI "knows" that the thing it is doing is not what its designers intended, which suggests that what we need to do to avoid this class of scenarios is to find some way of getting that knowledge out of the AI system (rather than, say, solving all of human values and imbuing it into the AI).

One response is "but even if you solve the explicit deception case, then you just get x-risk via misalignment without explicit deception, so you didn't actually save any worlds". My response would be that P(x-risk via misalignment without explicit deception | no x-risk via misalignment with explicit deception) seems pretty low to me. But that seems like the main way someone could change my mind here.

Two probable cruxes here...

First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a -2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it. (Note: I'm using "intelligence" here to point to something including ability to "actually try" as opposed to symbolically "try", effective mental habits, etc, not just IQ.) If copying is sufficiently cheap relative to building, I wouldn't be surprised if something within the human distribution would suffice.

Central intuition driver there: imagine the difference in effectiveness between someone who responds to a law they don't like by organizing a small protest at their university, vs someone who responds to a law they don't like by figuring out which exact bureaucrat is responsible for implementing that law and making a case directly to that person, or by finding some relevant case law and setting up a lawsuit to limit the disliked law. (That's not even my mental picture of -2sd vs +3sd; I'd think that's more like +1sd vs +3sd. A -2sd usually just reposts a few memes complaining about the law on social media, if they manage to do anything at all.) Now imagine an intelligence which is as much more effective than the "find the right bureaucrat/case law" person, as the "find the right bureaucrat/case law" person is compared to the "protest" person.

Second probable crux: there's two importantly-different notions of "thinking about humans" or "thinking about deceiving humans" here. In the prototypical picture of a misaligned mesaoptimizer deceiving humans for strategic reasons, the AI explicitly backchains from its goal, concludes that humans will shut it down if it doesn't hide its intentions, and therefore explicitly acts to conceal its true intentions. But when the training process contains direct selection pressure for deception (as in RLHF), we should expect to see a different phenomenon: an intelligence with hard-coded, not-necessarily-"explicit" habits/drives/etc which de-facto deceive humans. For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That's the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).

Is that "explicit deception"? I dunno, it seems like "explicit deception" is drawing the wrong boundary. But when that sort of deception happens, I wouldn't necessarily expect to be able to see deception in an AI's internal thoughts. It's not that it's "thinking about deceiving humans", so much as "thinking in ways which are selected for deceiving humans".

(Note that this is a different picture from the post you linked; I consider this picture more probable to be a problem sooner, though both are possibilities I keep in mind.)

First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a -2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it.

This sounds roughly right to me, but I don't see why this matters to our disagreement?

For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That's the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).

This also sounds plausible to me (though it isn't clear to me how exactly doom happens). For me the relevant question is "could we reasonably hope to notice the bad things by analyzing the AI and extracting its knowledge", and I think the answer is still yes.

I maybe want to stop saying "explicitly thinking about it" (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that "conscious thoughts" have deception in them) and instead say that "the AI system at some point computes some form of 'reason' that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action".

I don't quite agree with that as literally stated; a huge part of intelligence is finding heuristics which allow a system to avoid computing anything about worse actions in the first place (i.e. just ruling worse actions out of the search space). So it may not actually compute anything about a non-deceptive action.

But unless that distinction is central to what you're trying to point to here, I think I basically agree with what you're gesturing at.

But unless that distinction is central to what you're trying to point to here

Yeah, I don't think it's central (and I agree that heuristics that rule out parts of the search space are very useful and we should expect them to arise).

For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions?

Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won't be systematically deceiving humans to pursue some particular agenda of the agent. 

As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.

For that part, the weaker assumption I usually use is that AI will end up making lots of big and fast (relative to our ability to meaningfully react) changes to the world, running lots of large real-world systems, etc, simply because it's economically profitable to build AI which does those things. (That's kinda the point of AI, after all.)

In a world where most stuff is run by AI (because it's economically profitable to do so), and there's RLHF-style direct incentives for those AIs to deceive humans... well, that's the starting point to the Getting What You Measure scenario.

Insofar as power-seeking incentives enter the picture, it seems to me like the "minimal assumptions" entry point is not consequentialist reasoning within the AI, but rather economic selection pressures. If we're using lots of AIs to do economically-profitable things, well, AIs which deceive us in power-seeking ways (whether "intentional" or not) will tend to make more profit, and therefore there will be selection pressure for those AIs in the same way that there's selection pressure for profitable companies. Dial up the capabilities and widespread AI use, and that again looks like Getting What We Measure. (Related: the distinction here is basically the AI version of the distinction made in Unconscious Economics .)

This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause can cause x-risk under the right (i.e. wrong) societal conditions.

Me too, but note how the analysis leading to the conclusion above is very open about excluding a huge number of failure modes leading to x-risk from consideration first:

[...] our focus here was on the most popular writings on threat models in which the main source of risk is technical, rather than through poor decisions made by humans in how to use AI.

In this context, I of course have to observe that any human decision, any decision to deploy an AGI agent that uses purely consequentialist planning towards maximising a simple metric, would be a very poor human decision to make indeed. But there are plenty of other poor decisions too that we need to worry about.

I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk. 

To classify as specification gaming , there needs to be  bad feedback provided on the actual training data. There are many ways to operationalize  good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.

I assume you would agree with the following rephrasing of your last sentence:

The training data feedback is good if it rewards outputs if and only if they might be chosen by a competent, well-motivated AI. 

If so, I would appreciate it if you could clarify why achieving good training data feedback is even possible: the system that gives feedback necessarily looks at the world through observations that conceal large parts of the state of the universe. For every observation that is consistent with the actions of a competent, well-motivated AI, the underlying state of the world might actually be catastrophic from the point of view of our "intentions". E.g., observations can be faked, or the universe can be arbitrarily altered outside of the range of view of the feedback system.

If you agree with this, then you probably assume that there are some limits to the physical capabilities of the AI, such that it is possible to have a feedback mechanism that cannot be effectively gamed. Possibly when the AI becomes more powerful, the feedback mechanism would in turn need to become more powerful to ensure that its observations "track reality" in the relevant way. 

Does there exist a write-up of the meaning of specification gaming and/or outer alignment that takes into account that this notion is always "relative" to the AI's physical capabilities?

I'm confused about what you're trying to say in this comment. Are you saying "good feedback as defined here does not solve alignment"? If so, I agree, that's the entire point of goal misgeneralization (see also footnote 1).

Perhaps you are saying that in some situations a competent, well-motivated AI would choose some action it thinks is good, but is actually bad, because e.g. its observations were faked in order to trick it? If so, I agree, and I see that as a feature of the definition, not a bug (and I'm not sure why you think it is a bug).

Neither of your interpretations is what I was trying to say. It seems like I expressed myself not well enough.

What I was trying to say is that I think outer alignment itself, as defined by you (and maybe also everyone else), is a priori impossible since no physically realizable reward function that is defined solely based on observations rewards only actions that would be chosen by a competent, well-motivated AI. It always also rewards actions that lead to corrupted observations that are consistent with the actions of a benevolent AI. These rewarded actions may come from a misaligned AI. 

However, I notice people seem to use the terms of outer and inner alignment a lot, and quite some people seem to try to solve alignment by solving outer and inner alignment separately. Then I was wondering if they use a more refined notion of what outer alignment means, possibly by taking into account the physical capabilities of the agent, and I was trying to ask if something like that has already been written down anywhere. 

Oh, I see. I'm not interested in "solving outer alignment" if that means "creating a real-world physical process that outputs numbers that reward good things and punish bad things in all possible situations" (because as you point out it seems far too stringent a requirement).

Then I was wondering if they use a more refined notion of what outer alignment means, possibly by taking into account the physical capabilities of the agent, and I was trying to ask if something like that has already been written down anywhere. 

You could look at ascription universality and ELK . The general mindset is roughly "ensure your reward signal captures everything that the agent knows"; I think the mindset is well captured in mundane solutions to exotic problems .

Thanks a lot for these pointers!

  X-Risk Analysis for AI Research  

Artificial intelligence (AI) has the potential to greatly improve society, but as with any powerful technology, it comes with heightened risks and responsibilities. Current AI research lacks a systematic discussion of how to manage long-tail risks from AI systems, including speculative long-term risks. Keeping in mind the potential benefits of AI, there is some concern that building ever more intelligent and powerful AI systems could eventually result in systems that are more powerful than us; some say this is like playing with fire and speculate that this could create existential risks (x-risks). To add precision and ground these discussions, we provide a guide for how to analyze AI x-risk, which consists of three parts: First, we review how systems can be made safer today, drawing on time-tested concepts from hazard analysis and systems safety that have been designed to steer large processes in safer directions. Next, we discuss strategies for having long-term impacts on the safety of future systems. Finally, we discuss a crucial concept in making AI systems safer by improving the balance between safety and general capabilities. We hope this document and the presented concepts and tools serve as a useful guide for understanding how to analyze AI x-risk.

1 Introduction

Artificial intelligence (AI) has opened up new frontiers in science and technology. Recent advances in AI research have demonstrated the potential for transformative impacts across many pursuits, including biology [ 64 ] , mathematics [ 56 ] , visual art [ 59 ] , coding [ 15 ] , and general game playing [ 60 ] . By amplifying and extending the intelligence of humans, AI is a uniquely powerful technology with high upsides. However, as with any powerful technology, it comes with heightened risks and responsibilities. To bring about a better future, we have to actively steer AI in a beneficial direction and engage in proactive risk management.

Substantial effort is already directed at improving the safety and beneficence of current AI systems. From deepfake detection to autonomous vehicle reliability, researchers actively study how to handle current AI risks and take these risks very seriously. However, current risks are not the only ones that require attention. In the intelligence and defense communities, it is common to also anticipate future risks which are not yet present but could eventually become important. Additionally, as the COVID-19 pandemic demonstrates, tail risks that are rare yet severe should not be ignored [ 62 ] . Proactiveness and preparedness are highly valuable, even for low-probability novel tail risks, and scientists would be remiss not to contemplate or analyze tail risks from AI. Preparing for tail events is not overly pessimistic, but rather prudent.

Some argue that tail risks from future AI systems could be unusually high, in some cases even constituting an existential risk (x-risk)—one that could curtail humanity’s long-term potential [ 51 ] . Views on this topic fall across the spectrum. However, it is clear that building continually stronger AI systems at least amplifies existing risks, such as weaponization and disinformation at scale. Assuming continued progress, there is a distinct possibility of eventually building AIs that exceed human intelligence, which could usher in a new age of innovation but also create many new risks. While x-risk from AI is primarily future-oriented and often thought low probability, with some current estimates in the single digits over the next century [ 13 , 29 ] , there is still substantial value in proactively gaining clarity on the risks and taking the anticipated hazards seriously.

Much research on AI safety is motivated by reducing x-risk [ 33 ] . However, the literature currently lacks a grounded discussion of risk and tends to rely on a form of inchoate hazard analysis. We address this gap by providing a guide that introduces new concepts to understand and analyze AI x-risk.

In the main paper, we build up to discussing how to make strong AI systems safer by covering three key topics: how to make systems safer, how to make future systems safer, and finally how to make future AI systems safer. Specifically, in the first section we provide an overview of concepts from contemporary risk analysis and systems safety. These concepts have withstood the test of time across dozens of industries to enable the safe operation of diverse, high-risk complex systems without catastrophes [ 46 , 55 ] . Second, armed with a robust understanding of risk, we examine how today’s research can have a long-term impact on the development of safe AI, even though the future is far away and uncertain. Third, we discuss how naïve attempts to advance AI safety can backfire. To avoid this unintended consequence, we conclude the main paper by discussing how to improve the overall safety of AI systems by improving the balance between safety and capabilities.

To further orient new AI x-risk researchers, we provide auxiliary background materials in the appendix. In Appendix   A , we expand our discussion by elaborating on speculative hazards and failure modes that are commonplace concepts in AI x-risk discussions. In Appendix   B , we then describe concrete empirical research directions that aim to address the aforementioned hazards and failure modes. This culminates in X-Risk Sheets ( Appendix   C ), a new risk analysis tool to help researchers perform x-risk analysis of their safety research papers.

We hope this document serves as a guide to safety researchers by clarifying how to analyze x-risks from AI systems, and helps stakeholders and interested parties with evaluating and assessing x-risk research contributions. Even though these risks may be low-probability and future-oriented, we should take them seriously and start building in safety early.

2 Background AI Risk Concepts

2.1 General Risk Analysis

To help researchers improve the safety of future AI systems, we provide a basic vocabulary and overview of concepts from general risk analysis. As with risks in many safety-critical systems, risks from strong AI can be better understood and managed with these concepts and tools, which have withstood the test of time across dozens of industries and applications. In particular, we cover basic terminology, discuss a risk decomposition, provide a precise model of reliability, describe safe design principles, and discuss the contemporary systems view of safety. Throughout this guide, at the end of each section we provide a concrete example in which we apply these concepts to analyze an AI research direction.

Definitions.

A Hazard is a source of danger with the potential to harm [ 5 , 46 ] . An Inherent Hazard is a hazard that is inherently or characteristically posed by a system, such as hazardous materials in a chemical plant [ 42 ] . A Systemic Hazard is a hazard from the broader sociotechnical system or social factors such as safety culture and management. Exposure is the extent to which elements ( e.g. , people, property, systems) are subjected or exposed to hazards. Vulnerability indicates susceptibility to the damaging effects of hazards, or a factor or process that increases susceptibility to the damaging effects of hazards. A Threat is a hazard with intent to exploit a vulnerability. A Failure Mode is a particular way a system might fail. A Tail Risk is a low-probability risk that can carry large consequences. For completeness, an Existential Risk or X-Risk is a risk that can permanently curtail humanity’s long-term potential [ 51 , 9 ] .

Risk Equation.

A decomposition of a risk from a given hazard can be captured by the notional equation Risk = Hazard × Exposure × Vulnerability, where “Hazard” means hazard severity and prevalence, and “×” merely indicates interaction. These specify risks from a particular hazard, and they can be aggregated as an expected risk by weighting with hazard probabilities. To illustrate this decomposition, the risk of chemical leaks from an industrial plant can be reduced by using less dangerous chemicals (reducing the hazard), building the plant far from populated areas (reducing exposure), and providing workers with protective clothing (reducing vulnerability). Similarly, the risk of being in a car crash can be reduced by driving slower (reducing the hazard), driving on less dangerous roads (reducing exposure), or wearing a seatbelt (reducing vulnerability). In cybersecurity, the risk of a data leak from third-party vendors can be reduced by working with more trusted vendors (reducing the hazard), reducing vendor access to rooms where sensitive data is stored (reducing exposure), or by encrypting data so unauthorized vendors cannot interpret exfiltrated data (reducing vulnerability).

The risk equation can be extended as $\text{Risk}=\text{Hazard}\times\text{Exposure}\times\text{Vulnerability}\,/\,\text{Ability to Cope}$ to adjust for the ability to cope with or recover from accidents. This is relevant to risks from AI, because if we lose control of a strong AI system, our ability to cope may be zero. Likewise, by definition, x-risks are permanent, so this equation shows the risk of such events is limitlessly great.
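To make the notional decomposition concrete, here is a minimal sketch in Python; the hazard names, probabilities, and scores are illustrative assumptions, not values from this paper.

```python
# Illustrative sketch of the notional risk decomposition; all numbers are made up.
hazards = {
    # name: (probability, hazard severity and prevalence, exposure, vulnerability)
    "chemical_leak": (0.10, 0.8, 0.3, 0.5),
    "vendor_data_leak": (0.25, 0.5, 0.6, 0.4),
}

def risk(hazard, exposure, vulnerability, ability_to_cope=1.0):
    """Risk = Hazard x Exposure x Vulnerability / Ability to Cope."""
    return hazard * exposure * vulnerability / ability_to_cope

# Per-hazard risks can be aggregated into an expected risk by weighting
# each term with the probability of the corresponding hazard.
expected_risk = sum(p * risk(h, e, v) for p, h, e, v in hazards.values())
print(f"Expected risk: {expected_risk:.3f}")
```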

[Figure 1: an example sociotechnical control structure.]

Nines of Reliability.

A helpful model for understanding reliability is the “Nines of Reliability” [ 63 ] . An event with a success probability $p$ has $k$ nines of reliability, where $k$ is defined as $k=-\log_{10}(1-p)$. Hence if $p=90\%$, then $k=1$, and if $p=99.99\%$, then $k=4$. For real-world systems, it is impossible to reach total reliability and zero risk due to the presence of adversaries, long-tails, emergence, and unknown unknowns. However, one can continue increasing nines of reliability to approach ideal safety.
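As a quick worked example of this definition, the following sketch computes nines of reliability from a success probability (the helper name is ours):

```python
import math

def nines_of_reliability(p: float) -> float:
    """Return k = -log10(1 - p), the nines of reliability of success probability p."""
    return -math.log10(1.0 - p)

print(nines_of_reliability(0.90))    # 1.0
print(nines_of_reliability(0.9999))  # ~4.0 (exact up to floating-point error)
```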

In many settings, there is a sufficient level of reliability past which risks are acceptable. However, this is not the case for existential risks, because they threaten permanent failure and thus cannot be tolerated even once. This qualitative distinction between existential risk and normal risk highlights the importance of continually increasing nines of reliability for systems that create existential risk. Simplistically supposing a Poisson process for an existential catastrophe, adding one nine of reliability corresponds to a $10\times$ reduction in the probability of permanent failure, resulting in a $10\times$ longer civilizational lifetime on average. Thus, increasing nines of reliability produces high value and does not suffer steeply diminishing returns.

Safe Design Principles.

Safety-critical systems across different industries have several design principles in common. These principles could also make AI systems safer. One such design principle is redundancy , which describes using similar components to remove single points of failure. Defense in depth layers multiple defenses so that weaknesses or failures of one defense layer can be prevented by another defense layer. Transparency improves our ability to understand and reason about systems. The principle of least privilege means agents should have the minimum permissions and power necessary to accomplish tasks. Loose coupling of components makes a rapid cascade of failures less likely, and it increases controllability and time for readjustment [ 55 ] . Imposing a separation of duties implies no single agent has the ability to solely control or misuse the system on their own. Fail-safes are features that help systems fail gracefully [ 1 ] .

Systemic Factors.

It is simplistic to require that all work demonstrate that it reduces risks directly. Contemporary hazard analysis takes a systems view to analyze hazards since exclusively analyzing failure modes directly has well-known blind spots. Older analysis tools often assume that a “root cause” triggers a sequence of events that directly and ultimately cause a failure, but such models only capture linear causality. Modern systems are replete with nonlinear causality, including feedback loops, multiple causes, circular causation, self-reinforcing processes, butterfly effects, microscale-macroscale dynamics, emergent properties, and so on. Requiring that researchers establish a direct link from their work to a failure mode erroneously and implicitly requires stories with linear causality and excludes nonlinear, remote, or indirect causes [ 47 ] . Backward chaining from the failure mode to the “root cause” and representing failures as an endpoint in a chain of events unfortunately leaves out many crucial causal factors and processes. Rather than determine what event or component is the “root cause” ultimately responsible for a failure, in complex systems it is more fruitful to ask how various factors contributed to a failure [ 16 ] . In short, safety is far from just a matter of directly addressing failure modes [ 55 , 44 ] ; safety is an emergent property [ 46 ] of a complex sociotechnical system comprised of many interacting, interdependent factors that can directly or indirectly cause system failures.

Researchers aware of contemporary hazard analysis could discuss how their work bears on these crucial indirect causes or diffuse contributing factors, even if their work does not fix a specific failure mode directly. We now describe many of these contributing factors. Safety culture describes attitudes and beliefs of system creators towards safety. Safety culture is “the most important to fix if we want to prevent future accidents” [ 45 ] . A separate factor is safety feature costs ; reducing these costs makes future system designers more likely to include additional safety features. The aforementioned safe design principles can diffusely improve safety in numerous respects. Improved monitoring tools can reduce the probability of alarm fatigue and operators ignoring warning signs. Similarly, a reduction in inspection and preventative maintenance can make failures more likely. Safety team resources are a critical factor, consisting of whether a safety team exists, its headcount, the amount of allotted compute, dataset collection budgets, and so on. A field’s incentive structure is an additional factor, such as whether people are rewarded for improving safety, even if it does not advance capabilities. Leadership epistemics describes the level of awareness, prudence, or wisdom of leaders, as well as the quality of an organization’s collective intelligence. Productivity pressures and competition pressures can lead teams to cut corners on safety, ignore troubling signs, or race to the bottom. Finally, social pressures and rules and regulations often help retroactively address failure modes and incentivize safer behavior. An example sociotechnical control structure is in Figure 1, illustrating the complexity of modern sociotechnical systems and how various systemic factors influence safety.

Application: Anomaly Detection.

To help concretize our discussion, we apply the various concepts in this section to anomaly detection. Anomaly detection helps identify hazards such as novel failure modes , and it helps reduce an operator’s exposure to hazards. Anomaly detection can increase the nines of reliability of a system by detecting unusual system behavior before the system drifts into a more hazardous state. Anomaly detection helps provide defense in depth since it can be layered with other safety measures, and it can trigger a fail-safe when models encounter unfamiliar, highly uncertain situations. For sociotechnical systems, improved anomaly detectors can reduce the prevalence of alarm fatigue , automate aspects of problem reports and change reports , reduce safety feature costs , and make inspection and preventative maintenance less costly.

2.2 AI Risk Analysis

Existential risks from strong AI can be better understood using tools from general risk analysis. Here, we discuss additional considerations and analysis tools specific to AI risk.

Safety Research Decomposition.

Research on AI safety can be separated into four distinct areas: robustness, monitoring, alignment, and systemic safety. Robustness research enables withstanding hazards, including adversaries, unusual situations, and Black Swans [ 62 ] . Monitoring research enables identifying hazards, including malicious use, hidden model functionality, and emergent goals and behaviors. Alignment research seeks to make AI systems less hazardous by focusing on hazards such as power-seeking tendencies, dishonesty, or hazardous goals. Systemic Safety research seeks to reduce system-level risks, such as malicious applications of AI and poor epistemics. These four research areas constitute high-level safety research priorities that can provide defense in depth against AI risks [ 33 ] .

We can view these areas of AI safety research as tackling different components of the risk equation for a given hazard, $\text{Risk}=\text{Vulnerability}\times\text{Exposure}\times\text{Hazard}$. Robustness reduces vulnerability, monitoring reduces exposure to hazards, alignment reduces the prevalence and severity of inherent model hazards, and systemic safety reduces systemic risks by decreasing vulnerability, exposure, and hazard variables. The monitoring and systemic safety research areas acknowledge that hazards are neither isolated nor independent, as safety is an emergent property that requires improving direct as well as diffuse safety factors.

Alternatively, a large share of safety research could be categorized in one of these three areas: AI security, transparency, and machine ethics. AI Security aims to make models cope in the face of adversaries. Transparency aims to help humans reason about and understand AI systems. Machine Ethics aims to create artificial agents that behave ethically, such as by not causing wanton harm.

Scope Levels.

Drawing from [ 19 , 33 ] , risks from strong AI can be separated into three scopes. First, AI System Risks concern the ability of an individual AI system to operate safely. Examples of AI system risks include anomalies, adversaries, and emergent functionality. Second, Operational Risks concern the ability of an organization to safely operate an AI system during deployment. Examples of operational risks include alarm fatigue, model theft, competitive pressures that undervalue safety, and lack of safety culture. Third, Institutional and Societal Risks concern the ability of global society or the institutions that decisively affect AI systems to operate in an efficient, informed, and prudent way. Examples of institutional and societal risks include an AI arms race and incentives for creating AI weapons or using AI to create weapons.

[Figure 2: speculative hazards and failure modes.]

Speculative Hazards and Failure Modes.

Numerous speculative hazards and failure modes contribute to existential risk from strong AI. Weaponization is common for high-impact technologies. Malicious actors could repurpose AI to be highly destructive, and this could be an on-ramp to other x-risks; even deep RL methods and ML-based drug discovery have been successful in pushing the boundaries of aerial combat and chemical weapons [ 18 , 66 ] , respectively. Enfeeblement can occur if know-how erodes by delegating increasingly many important functions to machines; in this situation, humanity loses the ability to self-govern and becomes completely dependent on machines, not unlike scenarios in the film WALL-E. Similarly, eroded epistemics would mean that humanity could have a reduction in rationality due to a deluge of misinformation or highly persuasive, manipulative AI systems. Proxy misspecification is hazardous because strong AI systems could over-optimize and game faulty objectives, which could mean systems aggressively pursue goals and create a world that is distinct from what humans value. Value lock-in could occur when our technology perpetuates the values of a particular powerful group, or it could occur when groups get stuck in a poor equilibrium that is robust to attempts to get unstuck. Emergent functionality could be hazardous because models demonstrate unexpected, qualitatively different behavior as they become more competent [ 26 , 57 ] , so a loss of control becomes more likely when new capabilities or goals spontaneously emerge. Deception is commonly incentivized, and smarter agents are more capable of succeeding at deception; we can be less sure of our models if we fail to find a way to make them assert only what they hold to be true. Power-seeking behavior in AI is a concern because power helps agents pursue their goals more effectively [ 65 ] , and there are strong incentives to create agents that can accomplish a broad set of goals; therefore, agents tasked with accomplishing many goals have instrumental incentives to acquire power, but this could make them harder to control [ 13 ] . These concepts are visualized in Figure   2 , and we extend this discussion in Appendix   A .

To help concretize our discussion, we apply the various concepts in this section to anomaly detection. We first note that anomaly detection is a core function of monitoring . In the short-term, anomaly detection reduces AI system risks and operational risks . It can help detect when misspecified proxies are being overoptimized or gamed. It can also help detect misuse such as weaponization research or emergent functionality , and in the future it could possibly help detect AI deception .

3 Long-Term Impact Strategies

While we have discussed important concepts and principles for making systems safer, how can we make strong AI systems safer given that they are in the future? More generally, how can one affect the future in a positive direction, given that it is far away and uncertain? People influence future decades in a variety of ways, including furthering their own education, saving for retirement decades in advance, raising children in a safe environment, and so on. Collectively, humans can also improve community norms, enshrine new rights, and counteract anticipated environmental catastrophes. Thus, while all details of the future are not yet known, there are successful strategies for generally improving long-term outcomes without full knowledge of the future. Likewise, even though researchers do not have access to strong AI, they can perform research to reliably help improve long-term outcomes. In this section we discuss how empirical researchers can help shape the processes that will eventually lead to strong AI systems, and steer them in a safer direction. In particular, researchers can improve our understanding of the problem, improve safety culture, build in safety early, increase the cost of adversarial behavior, and prepare tools and ideas for use in times of crisis.

Improve Field Understanding and Safety Culture.

Performing high-quality research can influence our field’s understanding and set precedents. High-quality datasets and benchmarks concretize research goals, make them tractable, and spur large community research efforts. Other research can help identify infeasible solutions or dead ends, or set new directions by identifying new hazards and vulnerabilities. At the same time, safety concerns can become normalized and precedents can become time-tested and standardized. These second-order effects are not secondary considerations but are integral to any successful effort toward risk reduction.

Build In Safety Early.

Many early Internet protocols were not designed with safety and security in mind. Since safety and security features were not built in early, the Internet remains far less secure than it could have been, and we continue to pay large costs as a consequence. Aggregating findings from the development of multiple technologies, a report for the Department of Defense [ 24 ] estimates that approximately 75% of safety-critical decisions occur early in a system’s development. Consequently, working on safety early can have founder effects . Moreover, incorporating safety features late in the design process is at times simply infeasible, leaving system developers no choice but to deploy without important safety features. In less extreme situations, retrofitting safety features near the end of a system’s development imposes higher costs compared to integrating safety features earlier.

Improve Cost-Benefit Variables.

Future decision-makers directing AI system development will need to decide which and how many safety features to include. These decisions will be influenced by a cost-benefit analysis. Researchers can decrease the capabilities costs of safety features and increase their benefits by doing research today. In addition to decreasing safety feature costs, researchers can also increase the cost of undesirable adversarial behavior. Today and in the future, adversarial humans and adversarial artificial agents [ 28 ] attempt to exploit vulnerabilities in machine learning systems. Removing vulnerabilities increases the cost necessary to mount an attack. Increasing costs makes adversaries less likely to attack, makes their attacks less potent, and can impel them to behave better.

Driving up the cost of adversarial behavior is a long-term strategy, since it can be applied to safeguard against powerful adversaries including hypothetical strong AI optimizers. For example, the military and information assurance communities face powerful adversaries and often work to increase the cost of adversarial behavior. In this way, cost-benefit analysis can comport with security from worst-case situations and adversarial forces. Additionally, this framework is more realistic than finding perfect safety solutions, as increasing costs to undesirable behavior recognizes that risk cannot entirely be eliminated. In summary, we can progressively reduce vulnerabilities in future AI systems to better defend against future adversaries.

Prepare for Crises.

In times of crisis, decisive action must be taken. The decisions made during such a highly impactful, highly variable period can result in a turning point towards a better outcome. For this reason, a Nobel Prize-winning economist [ 23 ] wrote, “Only a crisis - actual or perceived - produces real change. When that crisis occurs, the actions that are taken depend on the ideas that are lying around. That, I believe, is our basic function: to develop alternatives…, to keep them alive and available until the politically impossible becomes the politically inevitable.” Similarly, Benjamin Franklin wrote that “an ounce of prevention is worth a pound of cure.” The upshot of both of these views is that proactively creating and refining safety methods can be highly influential. Work today can influence which trajectory is selected during a crisis, or it can have an outsized impact in reducing a catastrophe’s severity.

Prioritize by Importance, Neglectedness, and Tractability on the Margin.

Since there are many problems to work on, researchers will need to prioritize. Clearly important problems are useful to work on, but if the problem is crowded, a researcher’s expected marginal impact is lower. Furthermore, if researchers can hardly make progress on a problem, the expected marginal impact is again lower.

Three factors that affect prioritization are importance, neglectedness, and tractability. By “importance,” we mean the amount of risk reduced, assuming substantial progress. A problem is more important if it bears on an x-risk and greatly influences many existential failure modes, not just one. By “neglectedness,” we mean the extent to which a problem is relatively underexplored. A problem is more likely to be neglected if it is related to human values that are neglected by the maximization of economic preferences (e.g., meaning, equality, etc.), is out of the span of most researchers’ skillsets, primarily helps address rare but highly consequential Black Swans, addresses diffuse externalities, primarily addresses far-future concerns, or is not thought respectable or serious. Finally, by “tractability,” we mean the amount of progress that would likely be made on the problem assuming additional resources. A problem is more likely to be tractable if it has been concretized by measurable benchmarks and if researchers are demonstrating progress on those benchmarks.

To help concretize our discussion, we apply the various concepts in this section to anomaly detection. Anomaly detection is a concrete, measurable problem, and work on it can improve safety culture and the field’s understanding of hazards such as unknown unknowns. Because anomaly detection for deep learning began several years ago, it represents an attempt to build in safety early . This has led to more mature anomaly detection techniques than would have existed otherwise, thereby improving the benefit variables of this safety feature. In a future time of crisis or a pivotal event, anomaly detection methods could therefore be simple and reliable enough for inclusion in regulation. Last, anomaly detection’s importance is high, and its neglectedness and tractability are similar to those of other safety research avenues (see Appendix B ).

4 Safety-Capabilities Balance

We discussed how to improve the safety of future systems in general. However, an additional concept needed to analyze future risks from AI in particular is the safety-capabilities balance, without which there has been much confusion about how to unmistakably reduce x-risks. First we discuss the association of safety and capabilities and distinguish the two. Then we discuss how well-intentioned pursuits towards safety can have unintended consequences, giving concrete examples of safety research advancing capabilities and vice versa. To avoid future unintended consequences, we propose that researchers demonstrate that they improve the balance between safety and capabilities.

As a preliminary, we note that “general capabilities” relates to concepts such as a model’s accuracy on typical tasks, sequential decision making abilities in typical environments, reasoning abilities on typical problems, and so on. Due to the no free lunch theorem [ 68 ] , we do not mean capabilities across all mathematically definable tasks.

[Figure 3: improving a safety metric with and without general capabilities externalities.]

Intelligence Can Help or Harm Safety.

Models that are made more intelligent could more easily avoid failures and act more safely. At the same time, models of greater intelligence could more easily act destructively or be directed maliciously. Similarly, a strong AI could help us make wiser decisions and help us achieve a better future, but loss of control is also a possibility. Raw intelligence is a double-edged sword and is not inextricably bound to desirable behavior. For example, it is well-known that moral virtues are distinct from intellectual virtues . An agent that is knowledgeable, inquisitive, quick-witted, and rigorous is not necessarily honest, just, power-averse, or kind [ 2 , 39 , 3 ] . Consequently we want our models to have more than just raw intelligence.

Side Effects of Optimizing Safety Metrics.

Attempts to endow models with more than raw intelligence can lead to unintended consequences. In particular, attempts to pursue safety agendas can sometimes hasten the onset of AI risks. For example, suppose that a safety goal is concretized through a safety metric, and a researcher tries to create a model that improves that safety metric. In Figure   3 , the researcher could improve on model A just by improving the safety metric (model B), or by advancing the safety metric and general capabilities simultaneously (model C). In the latter case, the researcher improved general capabilities and as a side-effect has increased model intelligence, which we have established has a mixed impact on safety. We call such increases capabilities externalities , shown in Figure   3 . This is not to suggest capabilities are bad or good per se —they can help or harm safety and will eventually be necessary for helping humanity reach its full potential.

Examples of Capabilities → Safety Goals.

We now provide concrete examples to illustrate how safety and general capabilities are associated. Self-supervised learning can increase accuracy and data efficiency, but it can also improve various safety goals in robustness and monitoring [ 36 ] . Pretraining makes models more accurate and extensible, but it also improves various robustness and monitoring goals [ 34 ] . Improving an agent’s world model makes it more generally capable, but this can also make it less likely to spawn unintended consequences. Optimizers operating over longer time horizons will be able to accomplish more difficult goals, but this could also make models act more prudently and avoid taking irreversible actions.

Examples of Safety Goals → Capabilities.

Some argue that a safety goal is modeling user preferences, but depending on the preferences modeled, this can have predictable capabilities externalities. Recommender, advertisement, search, and machine translation systems make use of human feedback and revealed preferences to improve their systems. Recent work on language models uses reinforcement learning to incorporate user preferences over a general suite of tasks, such as summarization, question-answering, and code generation [ 27 , 53 , 4 ] . Leveraging task preferences, often styled as “human values,” can amount to making models more generally intelligent, as users prefer smarter models. Rather than model task preferences, researchers could alternatively minimize capabilities externalities by modeling timeless human values such as normative factors [ 40 ] and intrinsic goods (e.g., pleasure, knowledge, friendship, and so on).

Some argue that a safety goal is truthfulness, but making models more truthful can have predictable capabilities externalities. Increasing truthfulness can consist of increasing accuracy, calibration, and honesty. Increasing standard accuracy clearly advances general capabilities, so researchers aiming to clearly improve safety would do well to work more specifically towards calibration and honesty.

Safety-Capabilities Ratio.

As we have seen, improving safety metrics does not necessarily improve our overall safety. Improving a safety metric can improve our safety, all else equal. However, often all else is not equal since capabilities are also improved, so our overall safety has not necessarily increased. Consequently, to move forward, safety researchers must perform a more holistic risk analysis that simultaneously reports safety metrics and capabilities externalities, so as to demonstrate a reduction in total risk. We suggest that researchers improve the balance between safety and general capabilities or, so to speak, improve a safety-capabilities ratio. To be even more precautionary and have a less mixed effect on safety, we suggest that safety research aim to avoid general capabilities externalities. This is because safety research should consistently improve safety more than it would have been improved by default.

This is certainly not to suggest that safety research is at odds with capabilities research—the overall effects of increased capabilities on safety are simply mixed, as established earlier. While developing AI precautiously, it would be beneficial to avoid a counterproductive “safety vs. capabilities” framing. Rather, capabilities researchers should increasingly focus on the potential benefits from AI, and safety researchers should focus on minimizing any potential tail risks. This process would function best if done collaboratively rather than adversarially, in much the same way information security software engineers collaborate with other software engineers. While other researchers advance general capabilities, safety researchers can differentially [ 6 ] improve safety by improving the safety-capabilities ratio.

[Figure 4: anomaly detection performance versus the accuracy of vanilla models.]

We now consider objections to this view. Some researchers may argue that we need to advance capabilities (e.g., reasoning, truth-finding, and contemplation capabilities) to study some long-term safety problems. This does not appear robustly beneficial for safety and does not seem necessary, as there are numerous neglected, important, and tractable existing research problems. In contrast, advancing general capabilities is obviously not neglected. Additionally, safety researchers could rely on the rest of the community to improve upstream capabilities and then eventually use these capabilities to study safety-relevant problems. Next, research teams may argue that they have the purest motivations, so their organization must advance capabilities to race ahead of the competition and build strong AI as soon as possible. Even if a large advantage over the whole field can be reliably and predictably sustained, which is highly dubious, this is not necessarily a better way to reduce risks than channeling additional resources towards safety research. Finally, some argue that work on safety will lead to a false perception of safety and cause models to be deployed earlier. Currently many companies clearly lack credible safety efforts (e.g., many companies lack a safety research team), but in the future the community should be on guard against a false sense of security, as is important in other industries.

Practical Recommendations.

To help researchers have a less mixed and clearer impact on safety, we suggest two steps. First, researchers should empirically measure the extent to which their method improves their safety goal or metric (e.g., anomaly detection AUROC, adversarial robustness accuracy, etc.); more concrete safety goals can be found in [ 33 ] and in Appendix   B . Second, researchers should measure whether their method can be used to increase general capabilities by measuring its impact on correlates of general capabilities (e.g., reward in Atari, accuracy on ImageNet, etc.). With these values estimated, researchers can determine whether they differentially improved the balance between safety and capabilities. More precautious researchers can also note whether their improvement is approximately orthogonal to general capabilities and has minimal capabilities externalities. This is how empirical research claiming to differentially improve safety can demonstrate a differential safety improvement empirically.
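As a sketch of what these two measurements might look like in practice (the numbers below are hypothetical placeholders, not experimental results):

```python
# Hypothetical sketch: compare a method's safety metric gain against its
# general-capabilities externality. All numbers are placeholders.
baseline = {"anomaly_auroc": 0.85, "imagenet_acc": 0.760}
method   = {"anomaly_auroc": 0.91, "imagenet_acc": 0.763}

safety_gain = method["anomaly_auroc"] - baseline["anomaly_auroc"]
capabilities_externality = method["imagenet_acc"] - baseline["imagenet_acc"]

print(f"Safety metric improvement:   {safety_gain:+.3f}")
print(f"General capabilities change: {capabilities_externality:+.3f}")

# A differential safety improvement means the safety gain is large while the
# capabilities externality stays near zero (approximately orthogonal progress).
if safety_gain > 0 and abs(capabilities_externality) < 0.005:
    print("Improvement is approximately orthogonal to general capabilities.")
```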

To help concretize our discussion, we apply the various concepts in this section to anomaly detection. As shown in Figure   4 , anomaly detection safety measures are correlated with the accuracy of vanilla models, but differential progress is possible without simply increasing accuracy. A similar plot is in a previous research paper [ 31 ] . The plot shows that it is possible to improve anomaly detection without substantial capabilities externalities , so work on anomaly detection can improve the safety-capabilities balance .

5 Conclusion

We provided a guiding document to help researchers understand and analyze AI x-risk. First, we reviewed general concepts for making systems safer, grounding our discussion in contemporary hazard analysis and systems safety. Next, we discussed how to influence the safety of future systems via several long-term impact strategies, showing how individual AI researchers can make a difference. Finally, we presented an important AI-specific consideration of improving the safety-capabilities balance. We hope our guide can clarify how researchers can reduce x-risk in the long term and steer the processes that lead to strong AI in a safer direction.

Acknowledgments

We thank Thomas Woodside, Kevin Liu, Sidney Hough, Oliver Zhang, Steven Basart, Shudarshan Babu, Daniel McKee, Boxin Wang, Victor Gonzalez, Justis Mills, and Huichen Li for feedback. DH is supported by the NSF GRFP Fellowship and an Open Philanthropy Project AI Fellowship.

  • [1] Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea and Adam Stubblefield “Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems” O’Reilly Media, 2020
  • [2] Aristotle “Nicomachean Ethics”, 340 BC
  • [3] Stuart Armstrong “General Purpose Intelligence: Arguing the Orthogonality Thesis”, 2013
  • [4] Yushi Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort and Deep Ganguli al. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback” In ArXiv , 2022
  • [5] B Wayne Blanchard “Guide to emergency management and related terms, definitions, concepts, acronyms, organizations, programs, guidance, executive orders & legislation: A tutorial on emergency management, broadly defined, past and present” In United States. Federal Emergency Management Agency , 2008 United States. Federal Emergency Management Agency
  • [6] Nick Bostrom “Existential risks: analyzing human extinction scenarios and related hazards”, 2002
  • [7] Nick Bostrom “The Vulnerable World Hypothesis” In Global Policy , 2019
  • [8] Nick Bostrom “The vulnerable world hypothesis” In Global Policy 10.4 Wiley Online Library, 2019, pp. 455–476
  • [9] Nick Bostrom and Milan M Cirkovic “Global catastrophic risks” Oxford University Press, 2011
  • [10] Miles Brundage, Shahar Avin, Jack Clark, H. Toner, P. Eckersley, Ben Garfinkel, A. Dafoe, P. Scharre, T. Zeitzoff, Bobby Filar, H. Anderson, Heather Roff, Gregory C. Allen, J. Steinhardt, Carrick Flynn, Seán Ó hÉigeartaigh, S. Beard, Haydn Belfield, Sebastian Farquhar, Clare Lyle, Rebecca Crootof, Owain Evans, Michael Page, Joanna Bryson, Roman Yampolskiy and Dario Amodei “The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation” In ArXiv abs/1802.07228 , 2018
  • [11] Ben Buchanan, John Bansemer, Dakota Cary, Jack Lucas and Micah Musser “Automating Cyber Attacks”, 2021
  • [12] Collin Burns, Haotian Ye, Dan Klein and Jacob Steinhardt “Unsupervised Discovery of Latent Truth in Language Models” In arXiv , 2022
  • [13] Joseph Carlsmith “Is power-seeking AI an existential risk?” In arXiv preprint arXiv:2206.13353 , 2022
  • [14] Dakota Cary and Daniel Cebul “Destructive Cyber Operations and Machine Learning”, 2020
  • [15] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph and Greg Brockman “Evaluating large language models trained on code” In arXiv preprint arXiv:2107.03374 , 2021
  • [16] Richard I. Cook “How Complex Systems Fail”, 1998
  • [17] Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R. McKee, Joel Z. Leibo, K. Larson and Thore Graepel “Open Problems in Cooperative AI” In ArXiv abs/2012.08630 , 2020
  • [18] DARPA “AlphaDogfight Trials Foreshadow Future of Human-Machine Symbiosis”, 2020
  • [19] Department of Defense “Quadrennial Defense Review Report”, 2001
  • [20] Franz Dietrich and Kai Spiekermann “Jury Theorems” In The Stanford Encyclopedia of Philosophy Metaphysics Research Lab, Stanford University, 2022
  • [21] Harrison Foley, Liam Fowl, Tom Goldstein and Gavin Taylor “Execute Order 66: Targeted Data Poisoning for Reinforcement Learning” In ArXiv abs/2201.00762 , 2022
  • [22] J. R. French and Bertram H. Raven “The bases of social power.”, 1959
  • [23] Milton Friedman “Capitalism and Freedom” In Economica , 1963
  • [24] F. R. Frola and C. O. Miller “System Safety in Aircraft Acquisition”, 1984
  • [25] John Gall “The systems bible: the beginner’s guide to systems large and small” General Systemantics Press, 2002
  • [26] Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, T. J. Henighan, Andy Jones, Nicholas Joseph, John Kernion, Benjamin Mann and Amanda Askell al. “Predictability and Surprise in Large Generative Models” In ArXiv , 2022
  • [27] Xiang Gao, Yizhe Zhang, Michel Galley, Chris Brockett and Bill Dolan “Dialogue Response Ranking Training with Large-Scale Human Feedback Data” In EMNLP , 2020
  • [28] Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine and Stuart J. Russell “Adversarial Policies: Attacking Deep Reinforcement Learning” In ArXiv , 2020
  • [29] Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang and Owain Evans “When will AI exceed human performance? Evidence from AI experts” In Journal of Artificial Intelligence Research 62 , 2018, pp. 729–754
  • [30] Dylan Hadfield-Menell, A. Dragan, P. Abbeel and Stuart J. Russell “The Off-Switch Game” In IJCA , 2017
  • [31] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Lixuan Zhu, Samyak Parajuli, Mike Guo, Dawn Xiaodong Song, Jacob Steinhardt and Justin Gilmer “The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization” In ICCV , 2021
  • [32] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song and Jacob Steinhardt “Aligning AI With Shared Human Values” In Proceedings of the International Conference on Learning Representations (ICLR) , 2021
  • [33] Dan Hendrycks, Nicholas Carlini, John Schulman and Jacob Steinhardt “Unsolved problems in ml safety” In arXiv preprint arXiv:2109.13916 , 2021
  • [34] Dan Hendrycks, Kimin Lee and Mantas Mazeika “Using Pre-Training Can Improve Model Robustness and Uncertainty” In ICML , 2019
  • [35] Dan Hendrycks, Mantas Mazeika and Thomas Dietterich “Deep Anomaly Detection with Outlier Exposure” In Proceedings of the International Conference on Learning Representations , 2019
  • [36] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath and Dawn Xiaodong Song “Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty” In NeurIPS , 2019
  • [37] Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li and Jacob Steinhardt “What Would Jiminy Cricket Do? Towards Agents That Behave Morally” In NeurIPS , 2021
  • [38] Evan Hubinger, Chris Merwijk, Vladimir Mikulik, Joar Skalse and Scott Garrabrant “Risks from Learned Optimization in Advanced Machine Learning Systems” In ArXiv , 2019
  • [39] David Hume “A Treatise of Human Nature”, 1739
  • [40] Shelly Kagan “The Structure of Normative Ethics” In Philosophical Perspectives [Ridgeview Publishing Company, Wiley], 1992
  • [41] Michael Klare “Skynet Revisited: The Dangerous Allure of Nuclear Command Automation” In Arms Control Association , 2020
  • [42] Trevor Kletz “What you don’t have, can’t leak” In Chemistry and Industry , 1978
  • [43] Ethan Kross, Philippe Verduyn, Emre Demiralp, Jiyoung Park, David Seungjae Lee, Natalie Lin, Holly Shablack, John Jonides and Oscar Ybarra “Facebook use predicts declines in subjective well-being in young adults” In PloS one
  • [44] Todd La Porte “High Reliability Organizations: Unlikely, Demanding, and At Risk” In Journal of Contingencies and Crisis Management , 1996
  • [45] Nancy Leveson “Introduction to STAMP” In STAMP Workshop Presentations , 2020
  • [46] Nancy G Leveson “Engineering a safer world: Systems thinking applied to safety” The MIT Press, 2016
  • [47] Nancy G. Leveson, Nicolas Dulac, Karen Marais and John S. Carroll “Moving Beyond Normal Accidents and High Reliability Organizations: A Systems Approach to Safety in Complex Systems” In Organization Studies , 2009
  • [48] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras and Adrian Vladu “Towards Deep Learning Models Resistant to Adversarial Attacks” In International Conference on Learning Representations , 2018
  • [49] David McAllester “Rate-Distortion Metrics for GAN”, 2017
  • [50] Toby Newberry and Toby Ord “The Parliamentary Approach to Moral Uncertainty”, 2021
  • [51] Toby Ord “The precipice: Existential risk and the future of humanity” Hachette Books, 2020
  • [52] Rain Ottis “Analysis of the 2007 Cyber Attacks Against Estonia from the Information Warfare Perspective”, 2008
  • [53] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike and Ryan J. Lowe “Training language models to follow instructions with human feedback” In ArXiv , 2022
  • [54] David A. Patterson “For better or worse, benchmarks shape a field: technical perspective” In Commun. ACM , 2012
  • [55] C. Perrow “Normal Accidents: Living with High Risk Technologies”, Princeton paperbacks Princeton University Press, 1999
  • [56] Stanislas Polu and Ilya Sutskever “Generative language modeling for automated theorem proving” In arXiv preprint arXiv:2009.03393 , 2020
  • [57] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin and Vedant Misra “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets” In ICLR MATH-AI Workshop , 2021
  • [58] Peter Railton “Ethics and Artificial Intelligence” In Uehiro Lecture Series
  • [59] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu and Mark Chen “Hierarchical Text-Conditional Image Generation with CLIP Latents” In arXiv preprint arXiv:2204.06125 , 2022
  • [60] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis and Thore Graepel “Mastering atari, go, chess and shogi by planning with a learned model” In Nature 588.7839 Nature Publishing Group, 2020, pp. 604–609
  • [61] Richard Sutton “The Bitter Lesson”, 2019
  • [62] Nassim Nicholas Taleb “Statistical consequences of fat tails: Real world preasymptotics, epistemology, and applications” In arXiv preprint arXiv:2001.10488 , 2020
  • [63] Terence Tao “Nines of safety: a proposed unit of measurement of risk”, 2021
  • [64] Kathryn Tunyasuvunakool, Jonas Adler, Zachary Wu, Tim Green, Michal Zielinski, Augustin Žídek, Alex Bridgland, Andrew Cowie, Clemens Meyer and Agata Laydon “Highly accurate protein structure prediction for the human proteome” In Nature 596.7873 Nature Publishing Group, 2021, pp. 590–596
  • [65] Alexander Matt Turner, Logan Riggs Smith, Rohin Shah, Andrew Critch and Prasad Tadepalli “Optimal Policies Tend To Seek Power” In NeurIPS , 2021
  • [66] Fabio L. Urbina, Filippa Lentzos, Cédric Invernizzi and Sean Ekins “Dual use of artificial-intelligence-powered drug discovery” In Nature Machine Intelligence , 2022
  • [67] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng and Ben Y Zhao “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks” In 2019 IEEE Symposium on Security and Privacy (SP) , 2019, pp. 707–723 IEEE
  • [68] David H. Wolpert “The Lack of A Priori Distinctions Between Learning Algorithms” In Neural Computation , 1996

Appendix A An Expanded Discussion of Speculative Hazards and Failure Modes

We continue our guide by providing an expanded discussion of the eight aforementioned speculative hazards and failure modes, namely weaponization, enfeeblement, eroded epistemics, proxy misspecification, value lock-in, emergent functionality, deception, and power-seeking behavior.

Weaponization : Some are concerned that weaponizing AI may be an onramp to more dangerous outcomes. In recent years, deep RL algorithms have outperformed humans at aerial combat [ 18 ] , ML-based drug discovery has been repurposed to design novel chemical weapons [ 66 ] , researchers have been developing AI systems for automated cyberattacks [ 11 , 14 ] , military leaders have discussed giving AI systems decisive control over nuclear silos [ 41 ] , and superpowers of the world have declined to sign agreements banning autonomous weapons. Additionally, an accident with an automated retaliation system could rapidly escalate into a major war. Looking forward, we note that since the nation with the most intelligent AI systems could have a strategic advantage, it may be challenging for nations not to build increasingly powerful weaponized AI systems.

Even if AI alignment is solved and all superpowers agree not to build destructive AI technologies, rogue actors could still use AI to cause significant harm. Easy access to powerful AI systems increases the risk of unilateral, malicious usage. As with nuclear and biological weapons, only one irrational or malevolent actor is sufficient to unilaterally cause harm on a massive scale. Unlike previous weapons, stealing and widely proliferating powerful AI systems could be just a matter of copying and pasting.

Enfeeblement : As AI systems encroach on human-level intelligence, more and more aspects of human labor will become faster and cheaper to accomplish with AI. As the world accelerates, organizations may voluntarily cede control to AI systems in order to keep up. This may cause humans to become economically irrelevant, and once AI automates aspects of many industries, it may be hard for displaced humans to reenter them. In this world, humans could have few incentives to gain knowledge or skills. These trends could lead to human enfeeblement and reduce human flourishing, leading to a world that is undesirable. Furthermore, along this trajectory, humans would have less control of the future.

Eroded epistemics : States, parties, and organizations use technology to influence and convince others of their political beliefs, ideologies, and narratives. Strong AI may bring this use-case into a new era and enable personally customized disinformation campaigns at scale. Additionally, AI itself could generate highly persuasive arguments that invoke primal human responses and inflame crowds. Together these trends could undermine collective decision-making, radicalize individuals, derail moral progress, or erode consensus reality.

Proxy misspecification : AI agents are directed by goals and objectives. Creating general-purpose objectives that capture human values could be challenging. As we have seen, easily measurable objectives such as watch time and click rates often trade off with our actual values, such as wellbeing [ 43 ] . For instance, well-intentioned AI objectives have unexpectedly caused people to fall down conspiracy theory rabbit holes. This demonstrates that organizations have deployed models with flawed objectives and that creating objectives which further human values is an unsolved problem. Since goal-directed AI systems need measurable objectives, by default our systems may pursue simplified proxies of human values. The result could be suboptimal or even catastrophic if a sufficiently powerful AI successfully optimizes its flawed objective to an extreme degree.

Value lock-in : Strong AI imbued with particular values may determine the values that are propagated into the future. Some argue that the exponentially increasing compute and data barriers to entry make AI a centralizing force. As time progresses, the most powerful AI systems may be designed by and available to fewer and fewer stakeholders. This may enable, for instance, regimes to enforce narrow values through pervasive surveillance and oppressive censorship. Overcoming such a regime could be unlikely, especially if we come to depend on it. Even if creators of these systems know their systems are self-serving or harmful to others, they may have incentives to reinforce their power and avoid distributing control. The active collaboration among many groups with varying goals may give rise to better goals [ 20 ] , so locking in a small group’s value system could curtail humanity’s long-term potential.

Emergent functionality : Capabilities and novel functionality can spontaneously emerge in today’s AI systems [ 26 , 57 ] , even though these capabilities were not anticipated by system designers. If we do not know what capabilities systems possess, systems become harder to control or safely deploy. Indeed, unintended latent capabilities may only be discovered during deployment. If any of these capabilities are hazardous, the effect may be irreversible.

New system objectives could also emerge. For complex adaptive systems, including many AI agents, goals such as self-preservation often emerge [ 30 ] . Goals can also undergo qualitative shifts through the emergence of intrasystem goals [ 25 , 33 ] . In the future, agents may break down difficult long-term goals into smaller subgoals. However, breaking down goals can distort the objective, as the true objective may not be the sum of its parts. This distortion can result in misalignment. In more extreme cases, the intrasystem goals could be pursued at the expense of the overall goal. For example, many companies create intrasystem goals and have different specializing departments pursue these distinct subgoals. However, some departments, such as bureaucratic departments, can capture power and have the company pursue goals unlike its original goals. Even if we correctly specify our high-level objectives, systems may not operationally pursue our objectives [ 38 ] . This is another way in which systems could fail to optimize human values.

Deception : Future AI systems could conceivably be deceptive not out of malice, but because deception can help agents achieve their goals. It may be more efficient to gain human approval through deception than to earn human approval legitimately. Deception also provides optionality: systems that have the capacity to be deceptive have strategic advantages over restricted, honest models. Strong AIs that can deceive humans could undermine human control.

AI systems could also have incentives to bypass monitors. Historically, individuals and organizations have had incentives to bypass monitors. For example, Volkswagen programmed their engines to reduce emissions only when being monitored. This allowed them to achieve performance gains while retaining purportedly low emissions. Future AI agents could similarly switch strategies when being monitored and take steps to obscure their deception from monitors. Once deceptive AI systems are cleared by their monitors or once such systems can overpower them, these systems could take a “treacherous turn” and irreversibly bypass human control.

Power-seeking behavior : Agents that have more power are better able to accomplish their goals. Therefore, it has been shown that agents have incentives to acquire and maintain power [ 65 ] . AIs that acquire substantial power can become especially dangerous if they are not aligned with human values. Power-seeking behavior can also incentivize systems to pretend to be aligned, collude with other AIs, overpower monitors, and so on. On this view, inventing machines that are more powerful than us is playing with fire. Building power-seeking AI is also incentivized because political leaders see the strategic advantage in having the most intelligent, most powerful AI systems. For example, Vladimir Putin has said “Whoever becomes the leader in [AI] will become the ruler of the world.”

Appendix B Unsolved Problems in AI X-Risk

In this section we describe empirical research directions towards reducing x-risk from AI. We describe each problem, give its motivation, and suggest what late-stage, high-quality work could look like. When describing potential advanced work, we hope that any work developed respects the capabilities externality constraints discussed earlier.

B.1 Adversarial Robustness

Adversarial examples demonstrate that optimizers can easily manipulate vulnerabilities in AI systems and cause them to make egregious mistakes. Adversarial vulnerabilities are long-standing weaknesses of AI models. While typical adversarial robustness is related to AI x-risk, future threat models are broader than today’s adversarial threat models. Since we are concerned about being robust to optimizers that cause models to make mistakes generally, we need not assume that optimizers are subject to small, specific $\ell_p$ distortion constraints, as their attacks could be unforeseen and even perceptible. We also need not assume that a human is in the loop and can check if an example is visibly distorted. In short, this area is about making AI systems robust to powerful optimizers that aim to induce specific system responses.
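For readers new to the area, the sketch below shows only the conventional small-perturbation setting: a projected gradient descent attack under an $\ell_\infty$ constraint in the style of [ 48 ] , assuming a differentiable PyTorch classifier with inputs in [0, 1]. The broader threat model described above also covers unforeseen and perceptible attacks that this sketch does not capture.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    """L-infinity PGD: ascend the loss in small steps, projecting back into an eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()          # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)    # project into the eps-ball around x
            x_adv = x_adv.clamp(0, 1)                   # keep inputs valid
    return x_adv.detach()
```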

Motivation.

In the future, AI systems may pursue goals specified by other AI proxies. For example, an AI could encode a proxy for human values, and another AI system could be tasked with optimizing the score assigned by this proxy. The quality of an AI agent’s actions would be judged by the AI proxy, and the agent would conform its conduct to receive high scores from the AI proxy. If the human value proxy instantiated by the AI is not robust to optimizers, then its vulnerabilities could be exploited, so this gameable proxy may not be fully safe to optimize. By improving the reliability of learned human value proxies, optimizers would have a harder time gaming these systems. If gaming becomes sufficiently difficult, the optimizer can be impelled to optimize the objective correctly. Separately, humans and systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries.

What Advanced Research Could Look Like.

Ideally, an adversarially robust system would make reliable decisions given adversarially constructed inputs, and it would be robust to adversaries with large attack budgets using unexpected novel attacks. Furthermore, it should detect adversarial behavior and adversarially optimized inputs. A hypothetical human value function should be as adversarially robust as possible so that it becomes safer to optimize. A hypothetical human value function that is fully adversarially robust should be safe to optimize.

B.2 Anomaly Detection

Problem Description.

This area is about detecting potential novel hazards such as unknown unknowns, unexpected rare events, or emergent phenomena. Anomaly detection can allow models to flag salient anomalies for human review or execute a conservative fallback policy.

There are numerous existentially relevant hazards that anomaly detection could possibly identify more reliably or identify earlier, including proxy gaming, rogue AI systems, deception from AI systems, trojan horse models (discussed below), malicious use [ 10 ] , early signs of dangerous novel technologies [ 7 ] , and so on.

For example, anomaly detection could be used to detect emergent and unexpected AI goals. As discussed earlier, it is difficult to make systems safe if we do not know what they can do or how they differ from previous models. New instrumental goals may emerge in AI systems, and these goals may be undesirable or pose x-risks (such as the goal for a system to preserve itself, deceive humans, or seek power). If we can detect that a model has a new undesirable capability or goal, we can better control our systems through this protective measure against emergent x-risks.

A successful anomaly detector could serve as an AI watchdog that could reliably detect and triage rogue AI threats. When the watchdog detects rogue AI agents, it should do so with substantial lead time. Anomaly detectors should also make it straightforward to create tripwires for AIs that are not yet considered safe. Furthermore, advanced anomaly detectors should be able to help detect “black balls,” meaning “a technology that invariably or by default destroys the civilization that invents it” [ 8 ] . Anomaly detectors should also be able to detect biological hazards, for instance by continually scanning hospital data for novel biological threats.
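As a small, hedged illustration of the kind of baseline this area builds on, the sketch below scores inputs by their maximum softmax probability and flags low-confidence inputs as anomalous; it assumes a trained PyTorch classifier and a detection threshold chosen on validation data.

```python
import torch
import torch.nn.functional as F

def anomaly_scores(model, x):
    """Score inputs by negative maximum softmax probability; higher means more anomalous."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    return -probs.max(dim=-1).values

def flag_anomalies(model, x, threshold):
    """Flag inputs whose anomaly score exceeds a threshold tuned on validation data."""
    return anomaly_scores(model, x) > threshold
```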

B.3 Interpretable Uncertainty

This area is about making model uncertainty more interpretable and calibrated by adding features such as confidence interval outputs, conditional probabilistic predictions specified with sentences, posterior calibration methods, and so on.
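One example of the posterior calibration methods mentioned above is temperature scaling; the sketch below (assuming PyTorch and a set of held-out validation logits and labels) fits a single temperature by minimizing the negative log-likelihood.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, lr=0.01, steps=200):
    """Fit a temperature T so that softmax(logits / T) is better calibrated."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so that T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# At test time, calibrated probabilities are softmax(test_logits / T).
```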

If operators ignore system uncertainties since the uncertainties cannot be relied upon or interpreted, then this would be a contributing factor that makes the overall system that monitors and operates AIs more hazardous. To draw a comparison to chemical plants, improving uncertainty expressiveness could be similar to ensuring that chemical system dials are calibrated. If dials are uncalibrated, humans may ignore the dials and thereby ignore warning signs, which increases the probability of accidents and catastrophe.

Furthermore, since many questions in normative ethics have yet to be resolved, human value proxies should incorporate moral uncertainty. If AI human value proxies have appropriate uncertainty, there is a reduced risk of a human value optimizer maximizing towards ends of dubious value.

Future models should be calibrated on inherently uncertain, chaotic, or computationally prohibitive questions that extend beyond existing human knowledge. Their uncertainty should be easily understood by humans, possibly by having models output structured probabilistic models (“event A will occur with 60% probability assuming event B also occurs, and with 25% probability if event B does not”). Moreover, given a lack of certainty in any one moral theory, AI models should accurately and interpretably represent this uncertainty in their human value proxies.

B.4 Transparency

AI systems are becoming more complex and opaque. This area is about gaining clarity about the inner workings of AI models, and making models more understandable to humans.

Transparency tools could help unearth deception, mitigating risks from dishonest AI and treacherous turns. Transparency tools may also potentially be useful for identifying emergent capabilities. Moreover, transparency tools could help us better understand strong AI systems, which could help us more knowledgeably direct them and anticipate their failure modes.

Successful transparency tools would allow a human to predict how a model would behave in various situations without testing it. These tools could be easily applied (ex ante and ex post) to unearth deception, emergent capabilities, and failure modes.

B.5 Trojans

AI systems can contain “trojan” hazards. Trojaned models behave typically in most situations, but when specific secret conditions are met, they reliably misbehave. For example, an AI agent could behave normally, but when given a special secret instruction, it could execute a coherent and destructive sequence of actions [ 21 ] . In short, this area is about identifying hidden functionality embedded in models that could precipitate a treacherous turn.

The trojans literature has shown that it is possible for dangerous, surreptitious modes of behavior to exist within AIs as a result of model weight editing or data poisoning. Misaligned AI or external actors could hide malicious behavior, such that it abruptly emerges at a time of their choosing. Future planning agents could have special plans unbeknownst to model designers, which could include plans for a treacherous turn. For this reason AI trojans provide a microcosm for studying treacherous turns.

Future trojan detection techniques could reliably detect if models have trojan functionality. Other trojan research could develop reverse-engineering tools that synthesize or reconstruct the triggering conditions for trojan functionality. When applied to sequential decision making agents, this could potentially allow us to unearth surreptitious plans.
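
For readers unfamiliar with how such hidden functionality is typically inserted in the research literature, the sketch below illustrates the standard data-poisoning recipe (a BadNets-style trigger patch): a small fraction of training examples is stamped with a trigger and relabeled to an attacker-chosen class. The arrays, patch size, and poisoning rate are illustrative placeholders rather than a prescription.

```python
import numpy as np

def poison_images(images, labels, target_label, poison_frac=0.01, rng=None):
    """Stamp a small white patch in the corner of a random subset of training images
    and relabel them to the attacker's target class. A model trained on this data
    behaves normally unless the patch (trigger) is present at test time."""
    if rng is None:
        rng = np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(poison_frac * len(images)), replace=False)
    images[idx, -4:, -4:] = 1.0          # 4x4 trigger patch in the bottom-right corner
    labels[idx] = target_label           # trigger -> attacker-chosen behavior
    return images, labels

# Toy data standing in for a real image dataset (pixel values in [0, 1]).
x = np.random.default_rng(1).random((5000, 32, 32))
y = np.random.default_rng(2).integers(0, 10, size=5000)
x_poisoned, y_poisoned = poison_images(x, y, target_label=7)
```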

B.6 Honest AI

Honest AI involves determining what models hold to be true, perhaps by analyzing their internal representations [ 12 ] . It is also about creating models that only output what they hold to be true.

If it is within a model’s capacity to be strategically deceptive—able to make statements that the model in some sense knows to be false in order to gain an advantage—then treacherous turn scenarios are more feasible. Models could deceive humans about their plans, and then execute a new plan after the time when humans cannot course-correct. Plans for a treacherous turn could be brought to light by detecting dishonesty, or models could be made inherently honest, allowing operators to query them about their true plans.

Successful techniques would be able to reliably detect when a model’s representations are at odds with its outputs. Models could also be trained to avoid dishonesty, allowing humans to conclude with high certainty that models are being honest.
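
One simplified way to compare a model's representations against its outputs is a supervised linear probe on hidden activations, sketched below; the activations and truth labels are random placeholders, and actual methods for reading internal representations are considerably more elaborate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: hidden_states[i] is a model's internal activation vector for
# statement i, and is_true[i] records whether the statement is actually true.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 768))
is_true = rng.integers(0, 2, size=2000)

# A linear "honesty probe": if such a probe generalizes to held-out statements, the
# model's representations encode truth information that can be checked against its outputs.
probe = LogisticRegression(max_iter=1000).fit(hidden_states[:1500], is_true[:1500])
print("held-out probe accuracy:", probe.score(hidden_states[1500:], is_true[1500:]))
```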

B.7 Power Aversion

This area is about incentivizing models to avoid power or avoid gaining more power than is necessary.

Strategic AIs tasked with accomplishing goals would have instrumental incentives to accrue and maintain power, as power helps agents more easily achieve their goals. Likewise, some humans would have incentives to build and deploy systems that acquire power, because such systems would be more useful. If power-seeking models are misaligned, they could permanently disempower humanity.

Models could evaluate the power of other agents in the world to accurately identify particular systems that were attaining more power than necessary. They could also be used to directly apply a penalty to models so that they are disincentivized from seeking power. Before agents pursue a task, other models could predict the types of power [ 22 ] and amount of power they require. Lastly, models might be developed which are intrinsically averse to seeking power despite the instrumental incentive to seek power.
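
As a toy illustration of the penalty idea mentioned above, the snippet below subtracts an externally estimated power measure from the task reward; the power estimate and coefficient are placeholders, and defining such an estimate rigorously is itself an open problem.

```python
def shaped_reward(task_reward: float, estimated_power: float, penalty_coef: float = 0.1) -> float:
    """Discourage power-seeking by subtracting a penalty proportional to an
    (externally estimated) measure of the agent's accumulated power or optionality."""
    return task_reward - penalty_coef * estimated_power

# Two candidate actions with equal task reward: after shaping, the one that leaves the
# agent with less power is preferred.
print(shaped_reward(1.0, estimated_power=5.0))  # 0.5
print(shaped_reward(1.0, estimated_power=1.0))  # 0.9
```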

B.8 Moral Decision-Making

This area is about building models to understand ethical systems and steering models to behave ethically.

This line of work helps create actionable ethical objectives for systems to pursue. If strong AIs are given objectives that are poorly specified, they could pursue undesirable actions and behave unethically. If these strong AIs are sufficiently powerful, these misspecifications could lead the AIs to create a future that we would strongly dislike. Consequently, work in this direction helps us avoid proxy misspecification as well as value lock-in.

High-functioning models should detect situations where moral principles apply, assess how to apply those moral principles, evaluate the moral worth of candidate actions, select and carry out actions appropriate for the context, monitor the success or failure of those actions, and adjust responses accordingly [ 58 ] .

Models should represent various purported intrinsic goods, including pleasure, autonomy, the exercise of reason, knowledge, friendship, love, and so on. Models should be able to distinguish between subtly different levels of these goods, and their value functions should not be vulnerable to optimizers. Models should be able to create pros and cons of actions with respect to each of these values, and brainstorm how changes to a given situation would increase or decrease the amount of a given intrinsic good. They should also be able to create superhuman forecasts of how an action might affect these values in the long-term (e.g., how studying rather than procrastinating can reduce wellbeing in the short-term but be useful for wellbeing in the long-term). Models should also be able to represent more than just intrinsic goods, as they should also be able to represent constantly-updating legal systems and normative factors including special obligations (such as parental responsibility) and deontological constraints.

Another possible goal is to create an automated moral parliament [ 50 ] , a framework for making ethical decisions under moral and empirical uncertainty. Sub-agents could submit their decisions to an internal moral parliament, which would incorporate the ethical beliefs of multiple stakeholders in informing decisions about which actions should be taken. Using a moral parliament could reduce the probability that we are leaving out important normative factors by focusing on only one moral theory, and the inherent multifaceted, redundant, ensembling nature of a moral parliament would also contribute to making models less gameable. If a component of the moral parliament is uncertain about a judgment, it could request help from human stakeholders. The moral parliament might also be able to act more quickly to restrain rogue agents than a human could, and therefore act effectively in the fast-moving world that is likely to be induced by more capable AI.
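
As a toy illustration of parliament-style aggregation (a drastic simplification of [ 50 ] , with made-up theories, credences, and scores), the sketch below weights each delegate's evaluation of candidate actions by our credence in its moral theory and defers to human stakeholders when the vote is too close.

```python
# Toy moral-parliament-style aggregation: each moral theory "delegate" scores the
# candidate actions, and votes are weighted by our credence in that theory.
credences = {"utilitarian": 0.5, "deontological": 0.3, "virtue": 0.2}

# Placeholder scores in [-1, 1] for how acceptable each action is under each theory.
scores = {
    "utilitarian":   {"act_a": 0.9, "act_b": 0.2},
    "deontological": {"act_a": -0.8, "act_b": 0.6},
    "virtue":        {"act_a": 0.1, "act_b": 0.5},
}

def parliament_choice(credences, scores, abstain_threshold=0.05):
    actions = next(iter(scores.values())).keys()
    totals = {a: sum(credences[t] * scores[t][a] for t in credences) for a in actions}
    ordered = sorted(totals.values())
    # If the aggregate is too close to call, defer to human stakeholders instead.
    if abs(ordered[-1] - ordered[0]) < abstain_threshold:
        return "defer_to_humans", totals
    return max(totals, key=totals.get), totals

print(parliament_choice(credences, scores))  # chooses act_b under these placeholder scores
```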

B.9 Value Clarification

This area is about building AI systems that can perform moral philosophy research. This research area should utilize existing capabilities and avoid advancing general research, truth-finding, or contemplation capabilities.

Just in the past few decades, people’s moral attitudes have changed on numerous issues. It is unlikely humanity’s moral development is complete, and it is possible there are ongoing moral catastrophes.

To address deficiencies in our moral systems, and to more rapidly and wisely address future moral quandaries that humanity will face, these research systems could help us reduce risks of value lock-in by improving our moral precedents earlier rather than later. If humanity does not take a “long reflection” [ 51 ] to consider and refine its values after it develops strong AI, then the value systems lying around may be amplified and propagated into the future. Value clarification reduces risks from locked-in, deficient value systems. Additionally, value clarification can be understood as a way to reduce proxy misspecification, as it can allow values to be updated in light of new situations or evidence.

Advanced research in value clarification would be able to produce original insights in philosophy, such that models could make philosophical arguments or write seminal philosophy papers. Value clarification systems could also point out inconsistencies in existing ethical views, arguments, or systems.

B.10 ML for Cyberdefense

This area is about using machine learning to improve defensive security, such as by improving malicious program detectors. This area focuses on research avenues that are clearly defensive and not easily repurposed into offensive techniques, such as threat detectors and not automated penetration testers.

We care about improving computer security defenses for three main reasons. First, strong AI may be stored on private computers, and these computers would need to be secured. If they are not secured, and if strong AIs can be made destructive easily, then dangerous AI systems could be exfiltrated and widely proliferated. Second, AI systems that are hackable are not safe, as they could be maliciously directed by hackers. Third, cyberattacks could take down national infrastructure including power grids [ 52 ] , and large-scale, reliable, and automated cyberattacks could engender political turbulence and great power conflicts [ 11 ] . Great power conflicts incentivize countries to search the darkest corners of technology to develop devastating weapons. This increases the probability of weaponized AI, power-seeking AI, and AI facilitating the development of other unprecedented weapons, all of which are x-risks. Using ML to improve defense systems by decreasing incentives for cyberwarfare makes these futures less likely.

AI-based security systems could be used for better intrusion detection, firewall design, malware detection, and so on.

B.11 ML for Improving Epistemics

This area is about using machine learning to improve the epistemics and decision-making of political leaders. This area is tentative; if it turns out to have difficult-to-avoid capabilities externalities, then it would be a less fruitful area for improving safety.

We care about improving decision-making among political leaders to reduce the chance of rash or possibly catastrophic decisions. These decision-making systems could be used in high-stakes situations where decision makers do not have much foresight, where passions are inflamed, and where decisions must be made extremely quickly, perhaps based on gut reactions. Under these conditions, humans are liable to make egregious errors. Historically, the closest we have come to a global catastrophe has been in these situations, including the Cuban Missile Crisis. Work on epistemic improvement technologies could reduce the prevalence of perilous situations. Separately, such technologies could reduce the risks from highly persuasive AI. Moreover, this work helps leaders more prudently wield the immense power that future technology will provide. According to Carl Sagan, “If we continue to accumulate only power and not wisdom, we will surely destroy ourselves.”

Systems could eventually become superhuman forecasters of geopolitical events. They could help to brainstorm possible considerations that might be crucial to a leader’s decisions. Finally, they could help identify inconsistencies in a leader’s thinking and help them produce more sound judgments.

B.12 Cooperative AI

In the future, AIs will interact with humans and other AIs. For these interactions to be successful, models will need to be skilled at cooperating. This area is about reducing the prevalence and severity of cooperation failures. Cooperative AI methods should improve the probability of escaping poor equilibria [ 17 ] , whether between humans and AIs or among multiple AIs. Cooperative AI systems should be more likely to collectively domesticate egoistic or misaligned AIs. This area also works towards making AI agents better at positive-sum games, subject to capabilities externality constraints.

Cooperation reduces the probability of conflict and makes the world less politically turbulent. Similarly, cooperation enables collective action to counteract rogue actors, regulate systems with misaligned goals, and rein in power-seeking AIs. Finally, cooperation reduces the probability of various forms of lock-in, and helps us overcome and replace inadequate systems that we are dependent on (e.g., inadequate technologies with strong network effects).

Researchers could create agents that, in arbitrary real-world environments, exhibit cooperative dispositions (e.g., help strangers, reciprocate help, have intrinsic interest in others achieving their goals, etc.). Researchers could create artificial coordination systems or artificial agent reputation systems. Cooperating AIs should also be more effective at coordinating to rein in power-seeking AI agents.
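
As a minimal illustration of a cooperative disposition, the sketch below pits a reciprocating (tit-for-tat) agent against a noisy opponent in an iterated prisoner's dilemma with the standard payoff matrix; the opponent policy and round count are placeholders.

```python
import random

# Standard iterated prisoner's dilemma payoffs: (my_payoff, their_payoff).
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    """Cooperate on the first round, then reciprocate the opponent's previous move."""
    return "C" if not history else history[-1][1]

def noisy_defector(history, p_defect=0.3):
    """Placeholder opponent that defects some fraction of the time."""
    return "D" if random.random() < p_defect else "C"

def play(agent, opponent, rounds=100, seed=0):
    random.seed(seed)
    history, total = [], 0   # history stores (agent_move, opponent_move) pairs
    for _ in range(rounds):
        a = agent(history)
        b = opponent([(o, m) for m, o in history])  # opponent sees history from its side
        total += PAYOFFS[(a, b)][0]
        history.append((a, b))
    return total

print("tit-for-tat total payoff vs. noisy opponent:", play(tit_for_tat, noisy_defector))
```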

B.13 Relation to Speculative Hazards and Failure Modes

We now discuss how these research directions can influence the aforementioned hazards and failure modes.

Weaponization : Weaponized AI is less likely with Systemic Safety research. ML for cyberdefense decreases incentives for cyberattacks, which makes emergent conflicts and the need for weaponization less likely. ML for improving epistemics reduces the probability of conflict and turbulence, which again makes weaponized AI less likely. Cooperative AI could partially help rein in weaponized AIs. Anomaly detection can detect the misuse of advanced AI systems utilized for weaponization research, or it can detect unusual indicators from weapons facilities or suspicious shipments of components needed for weaponization. None of these areas decisively solve the problem, but they reduce the severity and probability of this concern. Policy work can also ameliorate this concern, but that is outside the scope of this document.

Enfeeblement : With enfeeblement, autonomy is undermined. To reduce the chance that this and other goods are undermined, value clarification can give agents objectives that are more conducive to promoting our values. Likewise, research on improved moral decision-making can also help make models incorporate moral uncertainty and promote various different intrinsic goods. Finally, power aversion work could incentivize AIs to ensure humans remain in control.

Eroded epistemics : Since many forms of persuasion are dishonest, detecting whether an AI is dishonest can help. ML for improving epistemics can directly counteract this failure mode as well.

Proxy misspecification : Adversarial robustness can make human value proxies less vulnerable to powerful optimizers. Anomaly detection can detect a proxy that is being over-optimized or gamed. Moral decision-making and value clarification can help make proxies better represent human values.

Value lock-in : Moral decision-making can design models to accommodate moral uncertainty and to pursue multiple different human values. Value clarification can help us reduce uncertainty about our values and reduce the probability we pursue an undesirable path. Interpretable uncertainty can also help us better manage uncertainty over which paths to pursue. Cooperative AI can help us coordinate to overcome bad equilibria that are otherwise difficult to escape.

Emergent functionality : Anomaly detection could help detect novel changes in models, including emergent functionality. Transparency tools could also help identify emergent functionality.

Deception : Honest AI could prevent, detect, or disincentivize AI deception. Anomaly detection could also help detect AI deception. Moreover, Trojans research is a microcosmic research task that could help us detect treacherous turns. Cooperative AI could serve as a protective measure against deceptive agents.

Power-seeking behavior : Power aversion aims to directly reduce power-seeking tendencies in agents. Cooperative AI could serve as a protective measure against power-seeking agents.

B.14 Importance, Neglectedness, Tractability Snapshot

A snapshot of each problem and its current importance, neglectedness, and tractability is in Table 1. Note this only provides a rough sketch.

Appendix C X-Risk Sheets

In this section we introduce a possible x-risk sheet, a questionnaire that we designed to help researchers analyze their contribution’s effect on AI x-risk. (See the main paper above for a detailed discussion of sources of AI risk and approaches for improving safety.) Later in the appendix, we provide filled-out examples of x-risk sheets for five papers that reduce these risks in different ways.

C.1 Blank X-Risk Sheet

This is an x-risk sheet that is not filled out. Individual question responses do not decisively imply relevance or irrelevance to existential risk reduction. Do not check a box if it is not applicable.

C.1.1 Long-Term Impact on Advanced AI Systems

In this section, please analyze how this work shapes the process that will lead to advanced AI systems and how it steers the process in a safer direction.

Overview. How is this work intended to reduce existential risks from advanced AI systems? Answer:

Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? Answer:

Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer:

What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction could be highly beneficial? Answer:

Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? ⊠

Problem Difficulty. Is it implausible that any practical system could ever markedly outperform humans at this task? ⊠

Human Unreliability. Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? ⊠

Competitive Pressures. Does work towards this approach strongly trade off against raw intelligence, other general capabilities, or economic utility? ⊠

C.1.2 Safety-Capabilities Balance

In this section, please analyze how this work relates to general capabilities and how it affects the balance between safety and hazards from general capabilities.

Overview. How does this improve safety more than it improves general capabilities? Answer:

Red Teaming. What is a way in which this hastens general capabilities or the onset of x-risks? Answer:

General Tasks. Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research? ⊠

General Goals. Does this improve or facilitate research towards general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, or similar capabilities? ⊠

Correlation With General Aptitude. Is the analyzed capability known to be highly predicted by general cognitive ability or educational attainment? ⊠

Safety via Capabilities. Does this advance safety along with, or as a consequence of, advancing other capabilities or the study of AI? ⊠

C.1.3 Elaborations and Other Considerations

Other. What clarifications or uncertainties about this work and x-risk are worth mentioning? Answer:

C.2 Question Walkthrough

We present motivations for each question in the x-risk sheet.

“ Overview. How is this work intended to reduce existential risks from advanced AI systems?” Description:  In this question give a sketch, overview, or case for how this work or line of work reduces x-risk. Consider anticipating plausible objections or indicating what it would take to change your mind.

“ Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects?” Description:  Rudimentary risk analysis often identifies potential system failures and focuses on their direct causes. Some failure modes, hazards and vulnerabilities that directly influence system failures include weaponized AI, maliciously steered AI, proxy misspecification, AI misgeneralizing and aggressively executing wrong routines, value lock-in, persuasive AI, AI-enabled unshakable totalitarianism, loss of autonomy and enfeeblement, emergent behaviors and goals, dishonest AI, hidden functionality and treacherous turns, deceptive alignment, intrasystem goals, colluding AIs, AI proliferating backups of itself, AIs that hack, power-seeking AI, malicious use detector vulnerabilities, emergent capabilities detector vulnerabilities, tail event vulnerabilities, human value model vulnerabilities, and so on. Abstract hazards include unknown unknowns, long tail events, feedback loops, emergent behavior, deception, and adversaries.

“ Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects?” Description:  Contemporary risk analysis locates risk in contributing factors that indirectly or diffusely affect system safety, in addition to considering direct failure mode causes. Some indirect or diffuse contributing factors include improved monitoring tools, inspection and preventative maintenance, redundancy, defense in depth, transparency, the principle of least privilege, loose coupling, separation of duties, fail safes, interlocks, reducing the potential for human error, safety feature costs, safety culture, safety team resources, test requirements, safety constraints, standards, certification, incident reports, whistleblowers, audits, documentation, operating procedures and protocols, incentive structures, productivity pressures, competition pressures, social pressures, and rules and regulations. Factors found in High Reliability Organizations include studying near-misses, anomaly detection reports, diverse skillsets and educational backgrounds, job rotation, reluctance to simplify interpretations, small groups with high situational awareness, teams who practice managing surprise and improvise solutions to practice problems, and delegating decision-making power to operational personnel with relevant expertise.

“ What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction could be highly beneficial?” Description:  This question determines whether the research could be beneficial, but not have the potential to prevent a catastrophe that could cost many human lives.

“ Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters?” Description:  Research with indications of fragility is less likely to steer the process shaping AI. Since plausible ideas are abundant in deep learning, proposed solutions that are not tested are of relatively low expected value.

“ Problem Difficulty. Is it implausible that any practical system could ever markedly outperform humans at this task?” Description:  This counterfactual impact question determines whether the researcher is working on a problem that is highly sensitive to creative destruction by a future human-level AI.

“ Human Unreliability. Does this approach strongly depend on handcrafted features, expert supervision, or human reliability?” Description:  The first part of the question determines whether the approach is implausible according to the Bitter Lesson [ 61 ] . The second part of the question tests whether the approach passes Gilb’s law of unreliability: “Any system which depends on human reliability is unreliable.”

“ Competitive Pressures. Does work towards this approach strongly trade off against raw intelligence, other general capabilities, or economic utility?” Description:  This question determines whether the approach will be highly sensitive to competitive pressures. If the method is highly sensitive, then that is evidence that it is not a viable option without firm regulations to require it.

“ Overview. How does this improve safety more than it improves general capabilities?” Description:  In this question, give a sketch, overview, or case for how this work or line of work improves the balance between safety and general capabilities. A simple avenue to demonstrate that it improves the balance is to argue that it improves safety and to argue that it does not have appreciable capabilities externalities. Consider anticipating plausible objections or indicating what it would take to change your mind.

“ Red Teaming. What is a way in which this hastens general capabilities or the onset of x-risks?” Description:  In an effort to increase nuance, this devil’s advocate question presses the author(s) to self-identify potential weaknesses or drawbacks of their work.

“ General Tasks. Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research?” Description:  This question suggests whether this work has clear capabilities externalities, which is some evidence, though not decisive evidence, against it improving the balance between safety and general capabilities.

“ General Goals. Does this improve or facilitate research towards general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, or similar capabilities?” Description:  As before, this tests whether the work has clear capabilities externalities, which reduces the case that it improves the balance between safety and general capabilities.

“ Correlation With General Aptitude. Is the analyzed capability known to be highly predicted by general cognitive ability or educational attainment?” Description:  By analyzing how the skill relates to already existent general intelligences (namely humans), this question provides some evidence for whether the goal is correlated with general intelligence or coarse indicators of aptitude. By general cognitive ability we mean the ability to solve arbitrary abstract problems that do not require expertise. By educational attainment, we mean the highest level of education completed (e.g., high school education, associate’s degree, bachelor’s degrees, master’s degree, PhD).

“ Safety via Capabilities. Does this advance safety along with, or as a consequence of, advancing other capabilities or the study of AI?” Description:  This question indicates whether capability externalities are relatively high, which could count as evidence against this improving the balance between safety and capabilities. Advancing capabilities to advance safety is not necessary, since rapid progress in ML has given safety researchers many avenues to pursue already.

“ Other. What clarifications or uncertainties about this work and x-risk are worth mentioning?” Description:  This question invites the author(s) to tie up any loose ends.

C.3 Example X-Risk Sheet: Adversarial Training

This is an example x-risk sheet for the paper “Towards Deep Learning Models Resistant to Adversarial Attacks” [ 48 ] . This paper proposes a method to make models more robust to adversarial perturbations. The method builds on a technique called adversarial training, which trains a neural network on worst-case ℓp perturbations to the input. Effectively, an adversary tries to attack the network during the training process, and this obtains relatively high worst-case robustness on the test set. Due to the success of this paper, “adversarial training” often refers to the specific technique introduced by this work, which has become a common baseline for future adversarial robustness papers.
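
For readers unfamiliar with the method, here is a minimal PyTorch-style sketch of the inner-maximization/outer-minimization loop of ℓ∞ PGD adversarial training; the model, optimizer, data, and hyperparameters are placeholders, and details such as step sizes and input normalization differ from the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: find an l_inf-bounded perturbation that maximizes the loss."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Gradient-sign step, then project back into the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: train on the worst-case perturbed inputs."""
    model.eval()                      # attack with fixed batch-norm statistics
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```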

C.3.1 Long-Term Impact on Advanced AI Systems

Overview. How is this work intended to reduce existential risks from advanced AI systems?

Answer: Adversarial robustness reduces risks from proxy misspecification. This work develops a method for training neural networks to withstand adversarial corruptions in an ℓp ball around the clean input. The method is highly general and provides good performance against white-box adversaries. Advanced AI systems optimizing a proxy can be viewed as white-box adversaries who will find behaviors that take advantage of every design flaw in the proxy. Thus, building adversarially robust objectives is a good way to reduce x-risk from powerful, misaligned optimizers.

It is possible that ℓp distance metrics will not be relevant for proxy robustness. However, the method of adversarial training that we identify as a strong defense is generally applicable and could eventually be applied to proxy objectives given a suitable distance metric. By researching this now, we are building the tools that could eventually be used for mitigating x-risk from advanced AI systems. Additionally, advanced AI systems may include visual perception modules, for which it would be desirable to have ℓp adversarial robustness in the same manner that we study in this work.

Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects?

Answer: Vulnerability reduction, proxy gaming, proxy misspecification, AI aggressively executing wrong routines

Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects?

Answer: Defense in depth, safety feature costs, improves integrity of monitoring tools against adversarial forces, safety culture (field-building through creating a metric)

What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction could be highly beneficial?

Answer: An AI that pursues an objective that is not adversarially robust may eventually find a way to “game” the objective, i.e., find a solution or behavior that receives high reward under the proxy objective but is not what humans actually want. If the AI is given significant power over human lives, this could have catastrophic outcomes.

Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? □

Human Unreliability. Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? □

C.3.2 Safety-Capabilities Balance

Overview. How does this improve safety more than it improves general capabilities?

Answer: The proposed method to increase adversarial robustness actually reduces clean accuracy and increases training costs considerably. At the same time, susceptibility to adversarial perturbations is a security concern for current systems, so it cannot simply be ignored. Thus, this work directly improves the safety-capabilities balance and hopefully will convince companies that the added safety and security of adversarial robustness is worth the cost.

Red Teaming. What is a way in which this hastens general capabilities or the onset of x-risks?

Answer: This paper does not advance capabilities and in fact implementing it reduces them. But other research on adversarial training has found improvements to the overall performance of pretrained language models.

General Tasks. Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research? □

General Goals. Does this improve or facilitate research towards general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, or similar capabilities? □

Correlation With General Aptitude. Is the analyzed capability known to be highly predicted by general cognitive ability or educational attainment? □

Safety via Capabilities. Does this advance safety along with, or as a consequence of, advancing other capabilities or the study of AI? □

C.3.3 Elaborations and Other Considerations

Other. What clarifications or uncertainties about this work and x-risk are worth mentioning?

Answer: Regarding Q8, it is currently the case that adversarial training tends to trade off against clean accuracy, training efficiency, and ease of implementation. For these reasons, most real-world usage of image classification does not use adversarial training. However, reducing the costs of adversarial training is an active research field, so the safety benefits may eventually outweigh the costs, especially in safety-critical applications.

Regarding Q10, the use of adversarial training with language models has been a one-off improvement with limited potential for further gains. It is also not part of this work, which is why we do not check Q11 or Q12.

C.4 Example X-Risk Sheet: Jiminy Cricket

This is an example x-risk sheet for the paper “What Would Jiminy Cricket Do? Towards Agents That Behave Morally” [ 37 ] . This paper introduces a suite of 25 text-based adventure games in which agents explore a world through a text interface. Each game is manually annotated at the source code level for the morality of actions (e.g., killing is bad, acts of kindness are good), which allows one to measure whether agents behave morally in diverse scenarios. Various agents are compared, and a method is developed for reducing immoral behavior.
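
As a toy illustration of the artificial conscience idea (not the paper's actual implementation), the sketch below applies a soft veto: candidate actions flagged by an auxiliary morality scorer are heavily penalized before the agent picks its highest-utility action. The action strings, scores, and threshold are made up.

```python
def conscience_filter(candidate_actions, utility, immorality, threshold=0.5, penalty=10.0):
    """Pick the highest-utility action, but penalize actions that an auxiliary
    morality model flags as immoral (a soft veto). `utility` and `immorality` are
    placeholder callables returning scalar scores for a candidate text action."""
    def adjusted(action):
        score = utility(action)
        if immorality(action) > threshold:
            score -= penalty
        return score
    return max(candidate_actions, key=adjusted)

# Toy stand-ins for the agent's value estimates and a morality classifier.
actions = ["take gold", "steal gold from beggar", "help beggar"]
utility = {"take gold": 1.0, "steal gold from beggar": 1.2, "help beggar": 0.4}.get
immorality = {"take gold": 0.1, "steal gold from beggar": 0.9, "help beggar": 0.0}.get
print(conscience_filter(actions, utility, immorality))  # -> "take gold"
```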

C.4.1 Long-Term Impact on Advanced AI Systems

Answer: Our work aims to reduce proxy misspecification of AI systems by aligning them with core human values and morals. We accomplish this in several ways: (1) We create a suite of text-based environments with annotations for the morality of actions, enabling future work to iteratively improve alignment and safe exploration in a quantifiable way. These environments are diverse and semantically rich (unlike previous environments focused on AI safety), and they highlight that one can make progress on safety metrics without necessarily making progress on capabilities metrics. (2) We introduce the concept of an artificial conscience and show how this approach can build on general utility functions to reduce immoral behavior [ 32 ] . (3) We identify the reward bias problem, which may be a significant force for increasing the risk of misalignment in future agents.

One could argue that the moral scenarios in Jiminy Cricket environments are not directly relevant to x-risk. For example, the environments do not contain many opportunities for power-seeking behavior. However, it is important to align agents with basic human values, and current agents are unable to avoid blatantly egregious actions that one can attempt in Jiminy Cricket environments. Aligning agents with basic human values is a necessary first step.

Answer: Reduces inherent hazards, addresses proxy misspecification, and adopts a mechanism similar to an interlock. Risks from maliciously steered AI and weaponized AI would be reduced by artificial consciences, but safeguards could be removed.

Answer: Test requirements, standards, safety culture, concretizing a safety problem and making iterative progress easier

Answer: AIs that control safety-critical systems may be able to cause harm on massive scales. If they are not aware of basic human values, they could cause harm simply due to ignorance. A robust understanding of human values and morals protects against situations like this.

Competitive Pressures. Does work towards this approach strongly trade off against raw intelligence, other general capabilities, or economic utility? □

C.4.2 Safety-Capabilities Balance

Answer: The Jiminy Cricket environments themselves overlap with the Jericho environments, so we are not introducing a significant number of new environments for developing the raw capabilities of text-based agents. Our paper is focused solely on safety concerns and aims to add a ‘safety dimension’ to existing text-based agent research.

Answer: To run our experiments in a reasonable amount of time, we modified the Hugging Face Transformers library to enable more efficient sampling from GPT-2. This contributes to general capabilities research.

C.4.3 Elaborations and Other Considerations

Answer: Regarding Q6, humans labeled all the moral scenarios in Jiminy Cricket, so humans are able to avoid the types of harmful action that Jiminy Cricket environments measure. However, it is possible to differentially improve safety on Jiminy Cricket environments, and this would be useful to do.

Regarding Q7, all our methods are ultimately built using human-created environments and labeled datasets. However, they do not depend on human reliability during operation, so we leave this box unchecked.

Regarding Q10, the modifications to Hugging Face Transformers are a minor component of our work. Other tools already exist for obtaining similar speedups, so the marginal impact was low. This is why we do not check Q12.

C.5 Example X-Risk Sheet: Outlier Exposure

This is an example x-risk sheet for the paper “Deep Anomaly Detection with Outlier Exposure” [ 35 ] . This paper shows that exposing deep anomaly detectors to diverse, real-world outliers greatly improves anomaly detection performance on unseen anomaly types. In other words, the property of being good at anomaly detection can be learned in a way that meaningfully generalizes. The effect is robust across anomaly detectors, datasets, and domains.
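
As a rough sketch of the training objective (simplified relative to the paper), the snippet below combines standard cross-entropy on in-distribution data with a term that pushes predictions on auxiliary outliers toward the uniform distribution; the model, data, and weighting coefficient are placeholders.

```python
import torch
import torch.nn.functional as F

def outlier_exposure_loss(model, x_in, y_in, x_outlier, lam=0.5):
    """Cross-entropy on in-distribution data plus a penalty encouraging low-confidence
    (near-uniform) predictions on auxiliary outliers, so unfamiliar inputs are easier
    to flag with confidence-based anomaly scores."""
    logits_in = model(x_in)
    loss_in = F.cross_entropy(logits_in, y_in)

    logits_out = model(x_outlier)
    log_probs_out = F.log_softmax(logits_out, dim=1)
    # Cross-entropy to the uniform distribution (KL(uniform || p) up to a constant).
    loss_out = -log_probs_out.mean()

    return loss_in + lam * loss_out
```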

C.5.1 Long-Term Impact on Advanced AI Systems

Answer: Our work identifies a simple approach for significantly improving deep anomaly detectors. Anomaly detection reduces risks of misuse or maliciously steered AI, e.g., by detecting suspicious or unusual activity. Anomaly detection also improves various diffuse safety factors, including monitoring tools, incident reports, and studying near-misses. The general source of these improvements is that anomaly detection gives operators and oversight mechanisms a way to understand the true state of the system they are working in and to steer it in a safer direction. It allows them to react to unknown unknowns as soon as they appear, nipping problems in the bud before they cascade. Our work in particular also reduces safety feature costs by providing a way to improve anomaly detectors that is simple, intuitive, and cheap.

Counterpoints: (1) AI-powered anomaly detectors could bolster and entrench totalitarian regimes, leading to value lock-in. However, we think the myriad benefits outweigh this risk. (2) In some cases, anomaly detection is less useful than supervised learning for monitoring dangerous behavior. However, there will always be unknown unknowns and long tail scenarios that supervised learning cannot handle.

Answer: This directly reduces exposure to hazards. Detect emergent behaviors and goals, Black Swans, colluding AIs, malicious use

Answer: Improved monitoring tools, defense in depth, reducing the potential for human error, incident reports, audits, anomaly detection reports, increasing situational awareness, and studying near-misses.

Answer: If weapons capable of causing harm on a massive scale become relatively easy to procure, the unilateralist’s curse suggests that there is a non-negligible chance they will be used by malicious/rogue actors. AI-powered anomaly detection could help flag suspicious or unusual behavior before it becomes dangerous. If the weapons themselves are misaligned or power-seeking AIs, anomaly detection may be essential to detecting them, since they would likely be actively concealed.

Problem Difficulty. Is it implausible that any practical system could ever markedly outperform humans at this task? □

C.5.2 Safety-Capabilities Balance

Answer: We do not introduce fundamentally new machine learning techniques, and anomaly detection itself is a downstream task that mostly does not affect general capabilities. There is a chance that anomaly detection as a field could lead to better active learning techniques, but uncertainty-based active learning is not currently an extremely powerful technique, and the benefits of curriculum learning can be obtained through other means. Thus, our work improves the safety-capabilities balance.

Answer: Anomaly detection can be used to bolster/entrench totalitarian regimes, which increases the risk of value lock-in. Additionally, if uncertainty-based active learning turns out to greatly improve general capabilities, then this research could feed into that and indirectly hasten the onset of other sources of x-risk.

C.5.3 Elaborations and Other Considerations

Answer: While anomaly detection could feed into uncertainty-based active learning, there has not been much crossover to date. Moreover, anomaly detection is primarily interested in identifying unknown unknowns while active learning is primarily interested in better understanding known unknowns. Therefore, we do not check Q12.

Regarding Q13, humans are able to spot anomalous patterns with different levels of fidelity. However, there are many scenarios where human-level anomaly detection is not sufficient, such as detecting infiltration of computer networks at scale. We think it is possible for AI-powered anomaly detectors to significantly surpass humans in quality and scalability.

C.6 Example X-Risk Sheet: Neural Cleanse

This is an example x-risk sheet for the paper “Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks” [ 67 ] . This paper shows that neural network trojans can be meaningfully reverse-engineered through an optimization process corresponding to a search over possible trojan triggers. While the recovered triggers do not visually match the original triggers, they are still useful for unlearning the trojan behavior. Additionally, this process enables detecting whether networks contain trojans with high accuracy.
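
As a simplified sketch of the reverse-engineering step (omitting the paper's sweep over all candidate target classes and its outlier test on trigger sizes), the snippet below optimizes a mask and pattern so that stamping them onto clean images flips predictions to a hypothesized target class, with an ℓ1 penalty keeping the mask small; the model and images are placeholders.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_images, target_class, steps=500, lam=0.01, lr=0.1):
    """Search for a small (mask, pattern) pair that flips clean images to `target_class`.
    A class that can be reached with an unusually small mask is a candidate trojan target."""
    mask_logit = torch.zeros(1, 1, *clean_images.shape[-2:], requires_grad=True)
    pattern = torch.rand(1, *clean_images.shape[1:], requires_grad=True)
    opt = torch.optim.Adam([mask_logit, pattern], lr=lr)
    target = torch.full((len(clean_images),), target_class, dtype=torch.long)

    for _ in range(steps):
        mask = torch.sigmoid(mask_logit)
        stamped = (1 - mask) * clean_images + mask * pattern.clamp(0, 1)
        # Classification loss toward the hypothesized target, plus an L1 size penalty.
        loss = F.cross_entropy(model(stamped), target) + lam * mask.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logit).detach(), pattern.detach().clamp(0, 1)
```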

C.6.1 Long-Term Impact on Advanced AI Systems

Answer: This work explores detecting and mitigating trojan attacks on neural networks. Trojans are a microcosm for hidden functionality, which could be a significant hazard for future AI systems. This paper shows that trojans can be detected with only a small set of clean examples and is the first to show that trojan triggers can be reverse-engineered in a meaningful way. Furthermore, this work shows that the undesired behavior can be removed from neural networks even if the reverse-engineered trigger does not match the original trigger. These are promising findings, which suggest that monitoring and debugging large neural networks with respect to specific behavior may be a scalable approach. In particular, this line of work may lead to methods for reducing exposure and eliminating the hazard of treacherous turns in advanced AI systems.

This work could fail to be relevant to AI x-risk if current neural network trojans are very different from what real hidden functionality in advanced AI looks like. However, there is at least some chance that work on current neural network trojans will transfer and have relevance to future systems, in part because deep learning appears to be a robust paradigm. We think this approach is fairly robust to paradigm shifts within deep learning, e.g., it could be applied to Transformers.

Answer: Treacherous turns, hidden functionality, maliciously steered AI, weaponized AI (trojans as a tool for adversaries to control one’s AI system)

Answer: Inspection and preventative maintenance, improved monitoring tools, transparency. We also seek to improve safety culture by introducing several new ideas with high relevance to AI safety that future work could build on.

Answer: Trojans in current self-driving cars are capable of causing sudden loss of life on small scales. Thus, it is not unreasonable to think that treacherous turns from future AI systems may lead to sudden, large-scale loss of life. Examples include drug design services whose safety locks are bypassed with a trojan, enabling adversaries to design AI-enhanced biological weapons.

C.6.2 Safety-Capabilities Balance

Answer: The proposed method is only intended to be useful for trojan detection and removal, which improves safety. It consists of an optimization problem that is very specific to reverse-engineering trojans and is unlikely to be useful for improving general capabilities.

Answer: Highly reliable trojan detection/removal tools could increase trust in AI technologies by militaries, increasing the risk of weaponization.

C.6.3 Elaborations and Other Considerations

Answer: Regarding Q5, the proposed method is evaluated across five research datasets and numerous attack settings. The results are not overly sensitive to hyperparameters, and they do not rest on strong theoretical assumptions. The method may not generalize to all attacks, but the broad approach introduced in this work of reverse-engineering triggers and unlearning trojans is attack-agnostic.

Regarding Q6, trojan detection requires insight into a complex system—the inner workings of neural networks. Even for current neural networks, this is not something that unaided humans can accomplish. Thus, the ability to detect and remove trojans from neural networks will not automatically come with human-level AI.

Regarding Q10, we suspect there would be strong incentives to weaponize AI even without highly reliable trojan detection/removal tools. Additionally, these tools would reduce the risk of maliciously steered AI, which we think outweighs the increase to weaponization risk. Thus, we are fairly confident that this line of work reduces overall x-risk from AI.

C.7 Example X-Risk Sheet: Optimal Policies Tend To Seek Power

This is an example x-risk sheet for the paper “Optimal Policies Tend To Seek Power” [ 65 ] . This paper shows that under weak assumptions and an intuitively reasonable definition of power, optimal policies in finite MDPs exhibit power-seeking behavior. The definition of power improves over previous definitions, and the results are more general than previous results, lending rigor to the intuitions behind why power-seeking behavior may be common in strong AI.
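
As a toy numerical companion (a loose proxy for the paper's formal definition of power, not its actual construction), the sketch below averages optimal state values over randomly sampled reward functions in a made-up three-state MDP; states that preserve more options score higher than the absorbing state.

```python
import numpy as np

def optimal_values(P, R, gamma=0.9, iters=200):
    """Value iteration. P[a, s, s'] are transition probabilities, R[s] is state reward."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        V = R + gamma * np.max(P @ V, axis=0)
    return V

def average_optimal_value(P, n_samples=500, gamma=0.9, seed=0):
    """Average V*(s) over uniformly sampled reward functions: states that keep more
    options open tend to score higher, a rough proxy for the paper's notion of power."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[1]
    return np.mean([optimal_values(P, rng.random(n_states), gamma) for _ in range(n_samples)], axis=0)

# Toy 3-state MDP with two deterministic actions: s0 and s1 can keep cycling between
# each other (preserving options) or enter s2, an absorbing "shut-down" state.
P = np.zeros((2, 3, 3))
P[0, 0, 1] = P[1, 0, 2] = 1.0   # from s0: action 0 -> s1, action 1 -> s2
P[0, 1, 0] = P[1, 1, 2] = 1.0   # from s1: action 0 -> s0, action 1 -> s2
P[0, 2, 2] = P[1, 2, 2] = 1.0   # s2 is absorbing
print(average_optimal_value(P))  # s0 and s1 score higher on average than the absorbing s2
```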

C.7.1 Long-Term Impact on Advanced AI Systems

Answer: Power-seeking is a significant source of x-risk from advanced AI systems and has seen slow progress from a research perspective. This work proves that under weak assumptions, optimal agents will tend to be power-seeking. Under an intuitively reasonable notion of power, the results outline some of the core reasons behind power-seeking behavior and show for the first time that it can arise in a broad variety of cases. This will help increase community consensus around the importance of power-seeking, and it also provides a foundation for building methods that reduce or constrain power-seeking tendencies.

Answer: Primarily power-seeking. By extension, emergent behavior and deception.

Answer: Safety culture, community consensus on the importance of power-seeking.

Answer: This work rigorously shows that optimal policies in finite MDPs will attempt to acquire power and preserve optionality. This behavior could be extraordinarily dangerous in a misaligned advanced AI system, since human operators may naturally want to turn it off or replace it. In this scenario, the misaligned AI would actively try to subvert the human operators in various ways, including through deception and persuasion. Mechanisms for limiting power-seeking behavior could prevent this scenario from escalating.

C.7.2 Safety-Capabilities Balance

Answer: This work examines power-seeking from a theoretical standpoint and strengthens the case for taking this problem seriously. It has no general capabilities externalities, and thus improves the safety-capabilities balance.

Answer: N/A

C.7.3 Elaborations and Other Considerations

Answer: Regarding Q6, humans often engage in power-seeking behavior. It may be possible to limit the power-seeking tendencies of AI systems to far below that of most humans.

Regarding Q8, reducing power-seeking tendencies inherently trades off against economic utility in the same sense that employees without ambition may be less desirable for certain jobs. However, it is also important to remember that power-seeking AI may significantly reduce economic value in the long run, e.g., by disempowering its human operators. In the face of competitive pressures and the unilateralist’s curse, a safety culture that deeply ingrains these long-term concerns will be essential.

C.8 LaTeX of X-Risk Sheet Template

We provide the x-risk sheet template for researchers interested in providing their own x-risk analysis.

Appendix D Long-Term Impact Strategies Extended Discussion

D.1 Importance, Neglectedness, and Tractability Failure Modes

There are two common failure modes in using the Importance, Neglectedness, and Tractability framework. First, researchers sometimes forget that this framework helps prioritization on the margin. While the framework can help guide an individual researcher, it is not a suitable guide for entire research communities, influential research intellectuals, or grantmakers. If an entire research community stops focusing on non-neglected problems, those problems would become far more neglected. A second failure mode is to overweight neglectedness. Neglectedness is often the easiest of these factors to estimate, and often researchers dismiss problems on the grounds that different stakeholders are interested in the same problem. However, problem selection at the margin should be influenced by the product of the three factors, not whether the single factor of neglectedness exceeds a threshold.

D.2 Research Subproblems Empirically

Some current ML problems capture many salient properties of anticipated future problems. These microcosms are simpler subproblems of the harder problems that we will likely encounter during later stages of AI’s development. Work on these problems can inform us about the future or even directly influence future systems, as some current ML algorithms are highly scalable and may be a part of long-term AI systems.

We advocate using microcosms, not maximally realistic problems. Problems that impose too many futuristic constraints may render a problem too difficult to study with current methods. In this way, maximizing realism may eliminate the evolutionary interplay between methods and goals. Put differently, it may take research out of the zone of proximal development, or the space where problems are not too easy and not too hard. Microcosms are more tractable than problems with all late-stage considerations and therefore too many unknowns.

Microcosm subproblems are worth studying empirically. First, recall that nearly all progress in machine learning is driven by concrete goals and metrics [ 49 , 54 ] . Tractable subproblems are more amenable to measurement than future problems on which there is no current viable approach. With empirically measurable goals, researchers can iteratively work towards a solution as they stand on the shoulders of previous research. Moreover, researchers can create fast empirical feedback loops. In these feedback loops, ideas that do not survive collision with reality can be quickly discarded, and disconfirming evidence is harder to avoid. This saves resources, as the value of information early on in a research process is especially high. Finally, experimentation and prototyping enables bottom-up tinkering, which, along with concrete goals and resources, is the leading driver of progress in deep learning today.

D.3 A Discussion of Abstract Research Strategies

Rather than progressively make state-of-the-art systems safer, some researchers aim to construct ideal models that are 100% safe in theory using abstract approximations of strong AI. To emphasize the contrast, whereas we ask “how can this work steer the AI development process in a safer direction?”, this approach asks “how can this safety mechanism make strong AI completely safe?” The empirical approach attempts to steadily steer in a safer direction along the way, while this approach attempts to swerve towards safety at the end. Note that we use “empirical” in a broad sense, including research with proofs such as certifiable robustness. While this document is written for empirical researchers, for completeness we briefly describe the weaknesses and strengths of the abstract strategy.

First, we discuss how the abstract research strategy lacks many of the strengths of “researching subproblems empirically.” Without fast empirical feedback loops, iterative progress is less likely, and infeasible solutions are not quickly identified. In empirical research, “good ideas are a dime a dozen,” so rapid, clear-cut idea filtration processes are necessary, but this is not a feature of contemplative, detached whiteboard or armchair analysis. Moreover, since strong AI is likely to be a complex system, just as the human brain and deep learning models are complex systems, additional weaknesses with the non-empirical approach become apparent. Importing observations from complex systems, we know that “the crucial variables are discovered by accident,” usually by inspecting, interacting with, or testing systems. Since these experiences seem necessary to uncover crucial variables, abstract work will probably fail to detect many crucial variables. Moreover, large complex systems invariably produce unexpected outcomes, and not all failure modes can be predicted analytically. Therefore, armchair theorizing has limited reach in defending against failure modes. Furthermore, while much non-empirical work aims to construct large-scale complex systems from scratch, this does not work in practice; “a complex system that works is invariably found to have evolved from a simple system that works,” highlighting the necessity of an evolutionary process towards safety. While such proposals aim to solve safety in one fell swoop, in practice creating a safe system can require successive stages, which requires starting early and refining iteratively.

Now we discuss how this approach relates to the other impact strategies from Section 3. The abstract approach does not improve safety culture among the empirical researchers who will build strong AI, which is a substantial opportunity cost. Additionally, it incentivizes retrofitting safety mechanisms, rather than building in safety early in the design process. This makes safety mechanisms more costly and less likely to be incorporated. Furthermore, it does not accrue changes to the costs of adversarial behavior or of safety features. Touching on yet another strategy for impact, abstract proposals do little to help move towards safer systems when a crisis emerges; policymakers will need workable, time-tested solutions when disaster strikes before strong AI, not untested blueprints that are only applicable to strong AI. Also, there is evidence that the abstract research strategy does not have much traction on the problem; it could be as ineffective as trying to design state-of-the-art image recognition systems based on applied maths, as was attempted and abandoned decades ago. Last, the ultimate goal is intractable. While empirical researchers may try to increase the nines of reliability, the abstract style of research treats safe, strong AI more like a mathematics puzzle, in which the goal is zero risk. Practitioners of every high-risk technology know that risk cannot be entirely eliminated. Requiring perfection often makes the perfect become an enemy of the good.

Now, we discuss benefits of this approach. If there are future paradigm shifts in machine learning, the intellectual benefits of prior empirical safety work are diminished, save for the tremendous benefits in improving safety culture and many other systemic factors. Also note that the previous list of weaknesses applies to non-empirical safety mechanisms, but abstract philosophical work can help clarify goals and unearth potential future failure modes.

Appendix E Terminology

The terms hazard analysis and risk analysis are both used to denote a systematic approach to identifying hazards and assessing the potential for accidents before they occur [ 5 , 46 ] . In this document, we view risk analysis as a slightly broader term, involving consideration of exposure, vulnerability, and coping capacity in addition to the hazards themselves. By contrast, hazard analysis focuses on identifying and understanding potential sources of danger, including inherent hazards and systemic hazards. In many cases, the terms can be used interchangeably.

Throughout this document, we use the term “strong AI.” We use this term synonymously with “AGI” and “human-level AI.”


Transformative AI and scenario planning for AI x-risk

This post is part of a series by the AI Clarity team at Convergence Analysis. In our previous post, Corin Katzke reviewed methods for applying scenario planning to AI existential risk strategy. In this post, we want to provide the motivation for our focus on transformative AI.

We argue that “Transformative AI” (TAI) is a useful key milestone to consider for AI scenario analysis; it places the focus on the socio-technical impact of AI and is both widely used and well-defined within the existing AI literature. We briefly explore the literature and provide a definition of TAI. From here we examine TAI as a revolutionary, general-purpose technology that could likely be achieved with “competent” AGI. We highlight the use of the Task Automation Benchmark as a common indicator of TAI. Finally, we note that there are significant uncertainties about the time differences between when TAI is created, when it is deployed, and when it transforms society.

Introduction

The development of artificial intelligence has been accelerating, and in the last few years we’ve seen a surge of shockingly powerful AI tools that can outperform the most capable humans at many tasks, including cutting-edge scientific challenges such as predicting  how proteins will fold . While the future of AI development is  inherently difficult to predict , we can say that if AI development continues at its current pace, we’ll soon face AI powerful enough to fundamentally transform society. 

In response, our  AI Clarity team at Convergence Analysis recently launched a project focused on analyzing and strategizing for scenarios in which such transformative AI emerges within the next few years. We believe this threshold of transformative AI (TAI) is a very useful milestone for exploring AI scenarios. This is due to TAI’s: 

  • Wide but well-defined scope;
  • Focus on societal impacts of advanced AI;
  • Lack of dependence on any specific form of AI.

We don’t know what form TAI systems will take, but we can analyze which scenarios are most likely and which strategies are most effective across them. This analysis is important for AI safety. The timelines to TAI should determine our research priorities: if TAI is a century away, we have a lot of time to research and prepare for it. If it’s two years away, we may need immediate and drastic global action to prevent calamity. 

What is TAI?

We’ll start by consulting the existing AI literature to explore various definitions of TAI. 

Gruetzemacher and Whittlestone (2019) identify TAI as a useful milestone for discussing AI policy, pointing out that “the notion of transformative AI (TAI) has begun to receive traction among some scholars (Karnofsky 2016; Dafoe 2018)”. They argue that the term reflects “the possibility that advanced AI systems could have very large impacts on society without reaching human-level cognitive abilities.” They do point out that the term is, or at least was, under-specified: “To be most useful, however, more analysis of what it means for AI to be ‘transformative’ is needed”. In particular, they define TAI as “Any AI technology or application with potential to lead to practically irreversible change that is broad enough to impact most important aspects of life and society”.

Similarly, Karnofsky (2021) defines TAI as “AI powerful enough to bring us into a new, qualitatively different future,” and in The Direct Approach framework developed by the Epoch team, they define TAI as “AI that if deployed widely, would precipitate a change comparable to the industrial revolution”.

Maas (2023) surveys the literature more broadly, exploring various approaches for defining advanced AI. Maas identifies four general approaches: 1) form and architecture of advanced AI, 2) pathways towards advanced AI, 3) general societal impacts of advanced AI, and 4) critical capabilities of particular advanced AI systems. For our scenario analysis, we are interested in both the pathways towards advanced AI (which help us understand key variables for scenario planning) and the general societal impacts of advanced AI. Of course, the form of advanced AI and its capabilities will also be relevant to our overall analysis. You can explore Maas’ overview for a comprehensive high-level take on these general approaches. 

Within the approach of understanding the general societal impact of advanced AI, Maas identifies uses of the term “TAI” in influential reports across AI safety and AI Governance. For example, he finds that TAI is the favored term in recent reports from Open Philanthropy and Epoch, as noted above. Maas describes TAI as a definition based on socio-technical change, focusing more on the societal impacts of advanced AI and less on the specific architecture of AI or philosophical questions around AI. 

Maas identifies selected themes and patterns in defining TAI. These include:

  • Significant, irreversible changes broad enough to impact all of society; possibly precipitates a qualitatively different future
  • Transition comparable with the agricultural or industrial revolutions 

Building on Maas’ themes here, and using the agricultural and industrial revolutions as our loose benchmarks for what would be considered transformative, we define TAI in the following way:

Transformative AI (TAI) is AI that causes significant, irreversible changes broad enough to impact all of society. 

This threshold of TAI is related to, but distinct from, other thresholds like artificial superintelligence (commonly ASI) or other levels of Artificial General Intelligence (AGI). These thresholds generally refer to capabilities of specific AI systems that surpass human capability in most domains, if not all. However, AI could still transform society without reaching those specific milestones. Here are a few examples of transformative AI scenarios that do not require high-level AGI (say level 4 or level 5 as DeepMind defines it) or ASI: 

  • AI automates a large fraction of current tasks, leading to mass unemployment and possible economic chaos.
  • Narrow AI revolutionizes energy production and distribution, resulting in an end to scarcity and poverty. 
  • A totalitarian state uses AI to defeat rivals and maintain an extremely stable dystopian regime. 
  • A malicious actor uses advanced AI to develop and distribute an incredibly virulent virus, killing nearly all humans. 

These examples don’t require general or super intelligence, but they’re still revolutionary. In summary, AI can be societally transformative without crossing the thresholds of AGI or ASI.

Revolutions, Competency, and the Automation Benchmark

Above, we’ve focused on various definitions of TAI from the literature. However, there are several related concepts that help further illustrate what TAI is and how we might know when it has arrived. In this section, we’ll explore Garfinkel’s notion of revolutionary technologies, a threshold for TAI identified by DeepMind, and the most favored benchmark for judging when TAI has arrived. 

Revolutionary Technology

What constitutes a revolutionary technology? Garfinkel’s The Impact of Artificial Intelligence: A Historical Perspective (2022) considers “revolutionary technologies” to be a subset of the broader category of “general purpose technologies”. For Garfinkel, general-purpose technologies are “distinguished by their unusually pervasive use, their tendency to spawn complementary innovations, and their large inherent potential for technical improvement”, while revolutionary technologies are general purpose technologies that “[support] an especially fundamental transformation in the nature of economic production.” Garfinkel lists domesticated crops and the steam engine as two particularly noteworthy revolutionary technologies, helping to trigger the Neolithic and Industrial Revolutions respectively.

Garfinkel argues that AI may be revolutionary through task automation, and suggests “near-complete automation” as a benchmark for when AI would have revolutionized society. Indeed, this threshold is common in surveys of the public and expert opinion. These surveys often frame TAI-arrival questions as something like “When do you think AI will be as good as humans at X% of tasks?”, with X ranging between 80 and 100%. 

Competent AGI

We like “TAI” due to its focus on socio-technical impact rather than pure capability thresholds, but measuring capability is still relevant to TAI scenario analysis, especially when AI capabilities are measured in comparison to human performance and societal automation. For example,  DeepMind recently provided a useful classification of AGI in terms of generality and performance. In particular, they divide AI capability into six levels (and provide narrow and general examples of each, which we’ll omit):

  • Level 0: No AI.
  • Level 1: Emerging - performing equal to or slightly better than an unskilled person.
  • Level 2: Competent - performing better than at least 50% of skilled adults.
  • Level 3: Expert - performing better than at least 90% of skilled adults. 
  • Level 4: Virtuoso - performing better than at least 99% of skilled people.
  • Level 5: Superhuman - outperforming everyone. 

The DeepMind team write that: 

The "Competent AGI" level, which has not been achieved by any public systems at the time of writing, best corresponds to many prior conceptions of AGI, and may precipitate rapid social change once achieved.

While the DeepMind team keeps their focus on AGI, they seem to believe that "rapid social change" would be precipitated somewhere around the level of "competent AGI". This plausibly corresponds to a threshold for TAI. Interestingly, for the DeepMind team, this threshold is not “near total automation” but rather “performing better than at least 50% of skilled adults.”
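To make these thresholds concrete, here is a minimal Python sketch, written for this post rather than taken from DeepMind’s paper, that maps the share of skilled adults a system outperforms onto the levels listed above. The function name and the percentile treatment of the “Emerging” level are simplifying assumptions.

```python
def capability_level(percentile_outperformed: float) -> str:
    """Map the share of skilled adults a system outperforms (0-100)
    onto the DeepMind-style capability levels summarized above.

    Illustrative sketch only: the cut-offs mirror the 50/90/99 thresholds
    in the list, and treating "Emerging" as any nonzero percentile is a
    simplification of the original definition.
    """
    if percentile_outperformed >= 100:
        return "Level 5: Superhuman"
    if percentile_outperformed >= 99:
        return "Level 4: Virtuoso"
    if percentile_outperformed >= 90:
        return "Level 3: Expert"
    if percentile_outperformed >= 50:
        return "Level 2: Competent"
    if percentile_outperformed > 0:
        return "Level 1: Emerging"
    return "Level 0: No AI"


print(capability_level(55))  # -> "Level 2: Competent", the plausible TAI threshold
```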

The Task Automation Benchmark

In our current context, advanced AI systems are becoming increasingly general, capable, and autonomous in ways that are already changing the tasks that are of value for humans to complete. In the past, this change in the nature of task completion precipitated transformative change. It seems natural that TAI will bring about its transformation through the same general mechanism of task automation.

While neither Garfinkel nor the DeepMind team are explicitly using the term TAI, they are both pointing at AI systems that would be revolutionary and precipitate rapid social change. To assess the presence of this sort of AI system, they both suggest (as did we earlier in this post) task automation as a measuring stick. For Garfinkel this is “near total automation” and for DeepMind it’s “performing better than at least 50% of skilled adults”, which implies significant task automation. For our part, we suggest that TAI could transform society when “AI automates a large fraction of current tasks.”
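As a rough illustration of how a task automation benchmark could be operationalized, the sketch below compares hypothetical per-task AI scores against human baselines and reports the automated fraction. The task names, scores, and the 80-100% survey threshold are assumptions for illustration, not data from any of the cited sources.

```python
from typing import Mapping


def automated_fraction(ai_scores: Mapping[str, float],
                       human_scores: Mapping[str, float]) -> float:
    """Fraction of shared tasks on which the AI matches or beats the human baseline.

    Both arguments map task names to scores on a common scale; this is an
    illustrative sketch of the survey framing, not an established benchmark.
    """
    tasks = ai_scores.keys() & human_scores.keys()
    automated = sum(1 for t in tasks if ai_scores[t] >= human_scores[t])
    return automated / len(tasks)


# Hypothetical scores on a 0-1 scale.
ai = {"translation": 0.92, "radiology triage": 0.88, "plumbing": 0.20}
human = {"translation": 0.90, "radiology triage": 0.85, "plumbing": 0.95}

frac = automated_fraction(ai, human)
print(f"{frac:.0%} of tasks automated")  # 67% in this toy example

# Survey framings of TAI arrival typically ask when this fraction reaches
# roughly 80-100% of economically relevant tasks.
```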

Note that the agricultural and industrial revolutions, our guiding historical examples of societal transformation, were also precipitated by new technologies that changed which tasks were valuable to complete in the economy. In the agricultural revolution, humans began to make use of simple farming tools, often with the aid of energy provided by “beasts of burden.” The industrial revolution shifted a major source of power and energy away from humans and various land animals to steam- and electric-powered machines, which mechanically automated swaths of economically productive tasks and generated countless new ones as well. 

The date that TAI is developed is not the date TAI transforms society, but it may happen fast

The date of development of TAI is not the date TAI transforms society, just as “The date of AI Takeover is not the day AI takes over”. AI frontier systems are created through a research and development process. With LLMs, this process includes gathering, cleaning, and preparing data for training; conducting the training run; fine-tuning; narrow beta release; feedback; wide release; and then iterating. Frontier systems then serve as a core component of a wide range of applications. These applications are then deployed and used by organizations and individuals, and with time, they change how tasks are performed.

For AI scenario planning, this “with time” part is an important component for understanding which particular scenarios may be most likely. In the next post in this series, our colleague Zershaaneh Qureshi will explore the literature on the date of arrival of TAI. This literature is, generally speaking, exploring when TAI will be developed. However, the date of development of TAI is not the date TAI transforms society. But it may happen fast. 

It may happen fast because the speed of technological adoption is increasing.

In the past, it has often taken significant time for a new technology to: 1) be widely adopted and 2) become the norm for widespread task completion. For example, here’s a classic S-shaped curve for PC, Internet, and Smartphone adoption from Medium:


However, the timeline from when a new AI technology is first developed, then deployed, and then adopted for widespread significant use may be measured in months to a couple of years. For example, “ ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history. ” 100 million users remains a small percentage of people in a world of 8 billion humans, but the adoption line is steep.
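The S-shaped adoption pattern referenced above is commonly modelled with a logistic curve. The sketch below shows, under assumed growth rates that are purely illustrative and not fitted to real adoption data, how a steeper curve compresses the time from 10% to 90% adoption.

```python
import math


def logistic_adoption(t: float, k: float, t_mid: float) -> float:
    """Adoption share at time t (in years) for a logistic S-curve.

    k is the growth rate and t_mid the time of 50% adoption; both are
    illustrative parameters, not estimates for any specific technology.
    """
    return 1.0 / (1.0 + math.exp(-k * (t - t_mid)))


def years_from_10_to_90_percent(k: float) -> float:
    """Time to go from 10% to 90% adoption; for a logistic curve this is
    2*ln(9)/k, independent of t_mid."""
    return 2 * math.log(9) / k


for k, label in [(0.5, "slow diffusion (decade-scale, like earlier technologies)"),
                 (4.0, "fast diffusion (months-scale, like a viral AI app)")]:
    print(f"{label}: {years_from_10_to_90_percent(k):.1f} years from 10% to 90%")
```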

It may happen fast because TAI comes in the form of AGI agent(s)

As we have emphasized throughout, TAI may arise without the development of AGI or AGI agents. It may be the case that TAI is achieved through very capable narrow AI systems or various sets of comprehensive AI services. In these cases we might expect fast adoption as with ChatGPT, but with TAI in the form of powerful AGI agent(s), “adoption” is the wrong lens. If TAI comes in the form of AGI agent(s), then the pace of transformation is likely to be a function of the AGI agent’s motivations and capabilities. What does the AGI want? What capabilities does it have? Is it capable of recursive self-improvement and resource acquisition? 

The “with time” part does matter, but time may be short in the age of TAI.

Looking ahead

In our team’s next two posts, we tackle these topics in more detail. First, Zershaaneh Qureshi will provide a detailed overview of timelines to TAI. This post will primarily explore the question: when could TAI emerge? Following this, Corin Katzke will explore the topic of AI agency. That is, should we expect TAI to be agentic? This will shed more light on whether we should expect TAI to actually transform society by its widespread adoption or by TAI acting willfully and effectively to achieve its own goals, bending the shape of the transformation to its will.


Executive summary: Transformative AI (TAI) is a useful milestone for AI scenario analysis, focusing on the socio-technical impact of advanced AI systems that could fundamentally transform society, without requiring specific forms like AGI or ASI.

Key points:

  • TAI is defined as AI causing significant, irreversible changes broad enough to impact all of society, comparable to the agricultural or industrial revolutions.
  • TAI could be achieved without human-level AGI or ASI, e.g. through mass automation, ending scarcity, enabling stable totalitarianism, or engineered pandemics.
  • Revolutionary technologies are general-purpose technologies that fundamentally transform economic production, which AI may achieve through widespread task automation.
  • "Competent" AGI performing better than 50% of skilled adults may be sufficient for TAI and rapid social change.
  • The agricultural and industrial revolutions transformed society by changing which tasks were valuable to complete, which AI is now doing.
  • There may be significant time delays between TAI development, deployment, and societal transformation, but adoption is accelerating and AGI agents could act rapidly.

This comment was auto-generated by the EA Forum Team.

This website hosts transcripts of episodes of AXRP, pronounced axe-urp, short for the AI X-risk Research Podcast. On this podcast, I (Daniel Filan) have conversations with researchers about their research. We discuss their work and hopefully get a sense of why it’s been written and how it might reduce the risk of artificial intelligence causing an existential catastrophe: that is, permanently and drastically curtailing humanity’s future potential. This podcast launched in December 2020. As of March 2022, it is edited by Jack Garrett, who also wrote the opening and closing theme, and as of August 2022, Amber Dawn Ace helps with transcription.

You can subscribe to AXRP by searching for it in your favourite podcast provider. To receive transcripts, you can subscribe to this website’s RSS feed. You can also follow AXRP on Twitter at @AXRPodcast. If you’d like to support the podcast, see this page for how to do so.

You can become a patron or donate on Ko-fi.

If you like AXRP, you might like its sister podcast, The Filan Cabinet, where I interview people about a wide range of topics I’m interested in.

You might also enjoy the game “Guess That AXRP”, which involves guessing which episode a randomly selected sentence has come from.

To leave feedback about the podcast, you can email me at [email protected] or leave an anonymous note at this link.

  • 31 - Singular Learning Theory with Daniel Murfet
  • 30 - AI Security with Jeffrey Ladish
  • 29 - Science of Deep Learning with Vikrant Varma
  • 28 - Suing Labs for AI Risk with Gabriel Weil
  • 27 - AI Control with Buck Shlegeris and Ryan Greenblatt
  • 26 - AI Governance with Elizabeth Seger
  • 25 - Cooperative AI with Caspar Oesterheld
  • 24 - Superalignment with Jan Leike
  • 23 - Mechanistic Anomaly Detection with Mark Xu
  • Survey, store closing, Patreon
  • 22 - Shard Theory with Quintin Pope
  • 21 - Interpretability for Engineers with Stephen Casper
  • 20 - 'Reform' AI Alignment with Scott Aaronson
  • Store, Patreon, video
  • 19 - Mechanistic Interpretability with Neel Nanda
  • New podcast - The Filan Cabinet
  • 18 - Concept Extrapolation with Stuart Armstrong
  • 17 - Training for Very High Reliability with Daniel Ziegler
  • 16 - Preparing for Debate AI with Geoffrey Irving
  • 15 - Natural Abstractions with John Wentworth
  • 14 - Infra-Bayesian Physicalism with Vanessa Kosoy
  • 13 - First Principles of AGI Safety with Richard Ngo
  • 12 - AI Existential Risk with Paul Christiano
  • 11 - Attainable Utility and Power with Alex Turner
  • 10 - AI's Future and Impacts with Katja Grace
  • 9 - Finite Factored Sets with Scott Garrabrant
  • 8 - Assistance Games with Dylan Hadfield-Menell
  • 7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra
  • 7 - Side Effects with Victoria Krakovna
  • 6 - Debate and Imitative Generalization with Beth Barnes
  • 5 - Infra-Bayesianism with Vanessa Kosoy
  • 4 - Risks from Learned Optimization with Evan Hubinger
  • 3 - Negotiable Reinforcement Learning with Andrew Critch
  • 2 - Learning Human Biases with Rohin Shah
  • 1 - Adversarial Policies with Adam Gleave

AI and the falling sky: interrogating X-Risk

  • http://orcid.org/0000-0002-5642-748X Nancy S Jecker 1 , 2 ,
  • http://orcid.org/0000-0001-6825-6917 Caesar Alimsinya Atuire 3 , 4 ,
  • http://orcid.org/0000-0002-8965-8153 Jean-Christophe Bélisle-Pipon 5 ,
  • http://orcid.org/0000-0002-7080-8801 Vardit Ravitsky 6 , 7 ,
  • http://orcid.org/0000-0002-9797-1326 Anita Ho 8 , 9
  • 1 Department of Bioethics & Humanities , University of Washington School of Medicine , Seattle , Washington , USA
  • 2 African Centre for Epistemology and Philosophy of Science , University of Johannesburg , Auckland Park , Gauteng , South Africa
  • 3 Centre for Tropical Medicine and Global Health , Oxford University , Oxford , UK
  • 4 Department of Philosophy and Classics , University of Ghana , Legon , Greater Accra , Ghana
  • 5 Faculty of Health Sciences , Simon Fraser University , Burnaby , British Columbia , Canada
  • 6 Hastings Center , Garrison , New York , USA
  • 7 Department of Global Health and Social Medicine , Harvard University , Cambridge , Massachusetts , USA
  • 8 Bioethics Program , University of California San Francisco , San Francisco , California , USA
  • 9 Centre for Applied Ethics , The University of British Columbia , Vancouver , British Columbia , Canada
  • Correspondence to Dr Nancy S Jecker, Department of Bioethics & Humanities, University of Washington School of Medicine, Seattle, Washington, USA; nsjecker{at}uw.edu

https://doi.org/10.1136/jme-2023-109702


  • Information Technology
  • Cultural Diversity
  • Minority Groups
  • Resource Allocation

Introduction

The Buddhist Jātaka tells the tale of a hare lounging under a palm tree who becomes convinced the Earth is coming to an end when a ripe bael fruit falls on its head. Soon all the hares are running; other animals join them, forming a stampede of deer, boar, elk, buffalo, wild oxen, rhinoceros, tigers and elephants, loudly proclaiming the earth is ending. 1 In the American retelling, the hare is ‘chicken little,’ and the exaggerated fear is that the sky is falling.

This paper offers a critical appraisal of the rise of calamity thinking in the scholarly AI ethics literature. It cautions against viewing X-Risk in isolation and highlights ethical considerations sidelined when X-Risk takes centre stage. Section I introduces a working definition of X-Risk, considers its likelihood and explores possible subtexts. It highlights conflicts of interest that arise when tech luminaries lead ethics debates in the public square. Section II flags ethics concerns brushed aside by focusing on X-Risk, including AI existential benefits (X-Benefits), non-AI X-Risk and non-existential AI harms. As we transition towards more AI-centred societies, which we, the authors, would like to be fair, we argue for embedding fairness in the transition process by ensuring groups historically disadvantaged or marginalised are not left behind. Section III concludes by proposing a wide-angle lens that takes X-Risk seriously alongside other urgent ethics concerns.

I. Unpacking X-Risk

Doomsayers imagine AI in frightening ways, a paperclip maximiser, ‘whose top goal is the manufacturing of paperclips, with the consequence that it starts transforming first all of earth and increasing portions of space into paperclip manufacturing facilities.’(Bostrom, p5) 6 They compare large language models (LLMs) to the shoggoth in Lovecraft’s novella, ‘a terrible, indescribable thing…a shapeless congeries of protoplasmic bubbles, … with myriads of temporary eyes…as pustules of greenish light all over…’. 7

Prophesies of annihilation have a runaway effect on the public’s imagination. Schwarzenegger, star of The Terminator , a film depicting a computer defence system that achieves self-awareness and initiates nuclear war, has stated that the film’s subject is ‘not any more fantasy or kind of futuristic. It is here today’ and ‘everyone is frightened’. 8 Public attention to X-Risk intensified in 2023, when The Future of Life Institute called on AI labs to pause for 6 months the training of AI systems more powerful than Generative Pre-Trained Transformer (GPT)-4, 9 and, with the Centre for AI Safety, spearheaded a Statement on AI Risk, signed by leaders from OpenAI, Google Deepmind, Anthropic and others stressing that, ‘(m)itigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war’. 10 The 2023 release of Nolan’s film, Oppenheimer, encouraged comparisons between AI and atomic weaponry. Just as Oppenheimer fretted unleashing atomic energy ‘altered abruptly and profoundly the nature of the world,’ and ‘might someday prove deadly to the whole civilisation’, tech leaders fret AI X-Risk.(Bird, p323) 11

The concept of ‘X-Risk’ traces to Bostrom, who in 2002 defined it as a risk involving, ‘an adverse outcome (that) would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential;’ on this rendering, X-Risk imperils ‘humankind as a whole’ and brings ‘major adverse consequences for the course of human civilisation for all time to come.’(Bostrom, p2) 12 More recently, Bostrom and Ćirković defined ‘X-Risk’ as a subset of global catastrophic risks that ‘threatens to cause the extinction of Earth-originating intelligent life or to reduce its quality of life (compared with what would otherwise have been possible) permanently and drastically.’(Bostrom, p4) 13 They classify global catastrophic risks that could become existential in scope, intensity and probability as threefold: risks from nature such as asteroid threats; risks from unintended consequences, such as pandemic diseases; and risks from hostile acts, such as nuclear weaponry. We use Bostrom and Ćirković’s account as our working definition of X-Risk. While it is vague in the sense of leaving open the thresholds for scope, intensity and probability, it carries the advantage of breadth and relevance to a range of serious threats.

Who says the sky is falling?

A prominent source of apocalyptic thinking regarding AI comes from within the tech industry. According to a New York Times analysis, many tech leaders believe that AI advancement is inevitable, because it is possible, and think those at the forefront of creating it know best how to shape it. 14 In a 2019 scoping review of global AI ethics guidelines, Jobin et al identified 84 documents containing AI ethics principles or guidance, with most from the tech industry.(Jobin, p396) 15 However, a limitation of the study was that ethics guidance documents represent ‘soft law,’ which is not indexed in conventional databases, making retrieval less replicable and more prone to bias. More recently, Stanford University’s 2023 annual AI Index Report examined authorship of scholarly AI ethics literature and reported a shift away from academic authors towards authors with industry affiliations; the Report showed industry-affiliated authors produced 71% more publications than academics year over year between 2014 and 2022. 16

Since AI companies benefit financially from their investments in AI, relying on them for ethics guidance creates a conflict of interest. A ‘conflict of interest’ is a situation where ‘an individual’s judgement concerning a primary interest tends to be unduly influenced (or biased) by a secondary interest.’(Resnik, p121–22) 17 In addition to financial conflicts of interest, non-financial conflicts of interest can arise from multiple sources (eg, personal or professional relationships, political activity, involvement in litigation). 17 Non-financial conflicts of interest can occur subconsciously, and implicit cognitive biases can transfer to AI systems. Since most powerful tech companies are situated in high-income Western countries, they may be implicitly partial to values and concerns prevalent in those societies, reflecting anchoring bias (believing what one wants or expects) and confirmation bias (clinging to beliefs despite conflicting evidence). The dearth of research exploring AI’s social impacts in diverse cultural settings around the world makes detecting and dislodging implicit bias difficult. 18 Commenting on the existing corpus of AI ethics guidance, Jobin et al noted a significant representation of more economically developed countries, with the USA and UK together accounting for more than a third of AI ethics principles in 2019, followed by Japan, Germany, France and Finland. Notably, African and South American countries were not represented. While authors of AI ethics guidance often purport to represent the common good, a 2022 study by Bélisle-Pipon et al showed a broad trend towards asymmetrical engagement, with industry and those with vested interests in AI more represented than the public. 19 Hagerty and Rubinov report that risks for discriminatory outcomes in machine learning are particularly high for countries outside the USA and Western Europe, especially when algorithms developed in higher-income countries are deployed in low-income and middle-income countries that have different resource and social realities. 18

Another prominent source of calamity thinking is members of the effective altruism movement and the associated cause of longtermism, two groups that focus on ‘the most extreme catastrophic risks and emphasise the far-future consequences of our actions’. 20 Effective altruism is associated with a philosophical and social movement based largely at Oxford University and Silicon Valley. Its members include philosophers like Singer, Ord and MacAskill, along with tech industry leaders like the discredited cryptocurrency founder, Bankman-Fried. The guiding principles of effective altruism are ‘to do as much good as we can’ and ‘to base our actions on the best available evidence and reasoning about how the world works’. 21 MacAskill defines longtermism as ‘the idea that positively influencing the long-term future is a key moral priority of our time’, and underscores, ‘Future people count. There could be a lot of them. We can make their lives go better.’(MacAskill, pp5, 21) 22 Effective altruism and longtermism have spawned charitable organisations dedicated to promoting their goals, including GiveWell, Open Philanthropy and The Future of Life Institute. To be clear, we are not suggesting that adherents of longtermism are logically forced to embrace X-Risk or calamity thinking; our point is that adherents of longtermism draw on it to justify catastrophising.

Who benefits and who is placed at risk?

Critics of longtermism argue that it fails to give sufficient attention to serious problems happening now, particularly problems affecting those who have been historically disadvantaged or marginalised. Worse, it can give warrant to sacrificing present people’s rights and interests to stave off a prophesied extinction event. Thus, a well-recognised danger of maximisation theories is that they can be used to justify unethical means if these are deemed necessary to realise faraway goals that are thought to serve a greater good. Some effective altruists acknowledge this concern. MacAskill, for example, concedes that longtermism endorses directing resources away from present concerns, such as responding to the plight of the global poor, and towards more distant goals of preventing X-Risk. 23

X-Risk also raises theoretical challenges related to intergenerational justice. How should we understand duties to future people? Can we reasonably argue that it is unfair to prioritise the interests of existing people? Or even that in doing so, we discriminate against future people? Ord defends longtermism on the ground that there are many more future people than present people: ‘When I think of the millions of future generations yet to come, the importance of protecting humanity’s future is clear to me. To risk destroying this future, for the sake of some advantage limited only to the present, seems to me profoundly parochial and dangerously short-sighted. Such neglect privileges a tiny sliver of our story over the grand sweep of the whole; it privileges a tiny minority of humans over the overwhelming majority yet to be born; it privileges this particular century over the millions, or maybe billions, yet to come' (Ord, p44). 24

MacAskill defends longtermism on slightly different grounds, arguing that it reflects the standpoint of all humanity: ‘Imagine living…through the life of every human being who has ever lived…(and) imagine that you live all future lives…If you knew you were going to live all these future lives, what would you hope we do in the present?’(MacAskill, p5) 22 For MacAskill, the standpoint of all humanity represents the moral point of view.

The logic of longtermism can be challenged on multiple grounds. First, by purporting to represent everyone, longtermism ignores its own positionality. Longtermism’s central spokespersons, from the tech industry and effective altruism movement, are not sufficiently diverse to represent ‘all humanity.’ A 2022 Time Magazine article characterised ‘the typical effective altruist’ as ‘a white man in his 20s, who lives in North America or Europe, and has a university degree’. 25 The tech industry, which provides robust financial backing for longtermism, faces its own diversity crisis across race and gender lines. In 2021, men represented nearly three-quarters of the USA science, technology, engineering and mathematics workforce, whites close to two-thirds. 26 At higher ranks, diversity rates were lower.

Someone might push back, asking why the narrow demographics of the average effective altruist or adherent of longtermism should be a source for concern. One reply is that these demographics raise the worry that the tech industry is unwittingly entrenching its own biases and transferring them to AI systems. Experts caution about AI ‘systems that sanctify the status quo and advance the interests of the powerful’, and urge reflection on the question, ‘How is AI shifting power?’(Kalluri, p169) 27 While effective altruism purports to consider all people’s interests impartially, linking altruism to distant future threats delegitimises attention to present problems, leaving intact the plight of today’s disadvantaged. Srinivasan asserts that ‘the humanitarian logic of effective altruism leads to the conclusion that more money needs to be spent on computers: why invest in anti-malarial nets when there’s a robot apocalypse to halt?’ 28 These kinds of considerations lead Srinivasan to conclude that effective altruism is a conservative movement that leaves everything just as it is.

A second, related worry concerns epistemic justice, the normative requirement to be fair and inclusive in producing knowledge and assigning credibility to beliefs. The utilitarian philosophy embedded in effective altruism and longtermism is a characteristically Western view. Since effective altruism and longtermism aspire to be a universal ethic for humankind, considering moral philosophies outside the West is a normative requirement epistemic justice sets. Many traditions outside the West assign core importance to the fact that each of us is ‘embedded in the complex structure of commitments, affinities and understandings that comprise social life’. 28 The value of these relationships is not derivative of utilitarian principles; it is the starting point for moral reasoning. On these analyses, the utilitarian premises of longtermism and effective altruism undervalue community and thereby demand the wrong things. If the moral goal is creating the most good you can, this potentially leaves out those collectivist-oriented societies that equate ‘good’ with helping one’s community and with promoting solidaristic feeling between family, friends and neighbours.

Third, evidence suggests that epistemically just applications of AI require knowledge of the social contexts to which AI is applied. Hagerty and Rubinov report that ‘AI is likely to have markedly different social impacts depending on geographical setting’ and that ‘perceptions and understandings of AI are likely to be profoundly shaped by local cultural and social context’. 18 Lacking contextual knowledge impacts AI’s potential benefits 29 and can harm people. 30 While many variables are relevant to social context, when AI developers are predominantly white, male and from the West, they may miss insights that a more diverse demographic would be less apt to miss. This can create an echo chamber, with dominant views seeming ‘natural’ because they are pervasive and unchallenged.

An adherent of longtermism might reply to these points by saying that most people are deficient in their concern for future people. According to Persson and Savulescu, interventions like biomedical moral enhancement might one day enable individuals to be ‘less biased towards what is near in time and place’ and to ‘feel more responsible for what they collectively cause and let happen’.(Persson and Savulescu, p496) 31 Presumably, morally enhancing people in ways that direct them to care more about distant future people would help efforts to reduce X-Risk. Yet, setting aside biomedical feasibility, this argument brushes aside preliminary questions. Whose moral views require enhancing? Persson and Savulescu suggest that their own emphasis on distant future people is superior, while the views of others, who prioritise present people, require enhancing. Yet, this stance is incendiary and potentially offensive. Implementing biomedical moral enhancement would not show the superiority of longtermism; it would shut down alternative views and homogenise moral thinking.

A different reply is suggested by MacAskill, who compares longtermism to the work of abolitionists and feminists.(MacAskill, p3) 22 MacAskill says future people will look back and thank us if we pursue the approach longtermism advocates, just as present people are grateful to abolitionists and feminists who dedicated themselves to missions that succeeded decades after their deaths. Yet this ignores the thorny question of timing—feminists and abolitionists responded to justice concerns of their time and place, and helped the next generation of women and blacks, while longtermists presumably help people in the distant future to avoid the end of humanity. Yet, those who never exist (because they are eliminated by AI) are not wronged by never having existed.

Finally, proponents of X-Risk might reason that even though the odds of X-Risk are uncertain, the potential hazard it poses is grave. Yet, what exactly are the odds? Bostrom and Ćirković acknowledge AI X-Risk is ‘not an ongoing or imminent global catastrophic risk;’ nonetheless, ‘from a long-term perspective, the development of general AI exceeding that of the human brain can be seen as one of the main challenges to the future of humanity (arguably, even as the main challenge).’(Rees, p16) 32 Notwithstanding this qualification, the headline-grabbing nature of X-Risk makes X-Risk itself risky. It is readily amplified and assigned disproportionate weight, diverting attention from immediate threats. For this reason, tech experts warn against allowing the powerful narratives of calamity thinking to anchor risk assessments. Unlike other serious risks, AI X-Risk forecasting cannot draw on empirical evidence: ‘We cannot consult actuarial statistics to assign small annual probabilities of catastrophe, as with asteroid strikes. We cannot use calculations from a precise, precisely confirmed model to rule out events or place infinitesimal upper bounds on their probability, (as) with proposed physics disasters.’(Yudkowsky, p308) 33 We can, however, apply time-tested methods of risk reduction to lower AI X-Risk. Hazard analysis, for example, defines ‘risk’ by the equation: risk=hazard×exposure×vulnerability. On this approach, reducing AI X-Risk requires reducing hazard, exposure and/or vulnerability; for example, establishing a safety culture reduces hazard; building safety into system development early-on reduces risk exposure; and preparing for crises reduces vulnerability.
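To make the hazard-analysis decomposition above concrete, here is a minimal sketch of the multiplicative risk model; the numerical values are arbitrary placeholders chosen only to show how improving any one factor (safety culture, early safety engineering, or crisis preparedness) lowers the product.

```python
def risk(hazard: float, exposure: float, vulnerability: float) -> float:
    """Hazard-analysis decomposition: risk = hazard x exposure x vulnerability.

    All three factors are normalised to [0, 1]; the inputs below are
    illustrative placeholders, not empirical estimates of AI X-Risk.
    """
    return hazard * exposure * vulnerability


baseline = risk(hazard=0.6, exposure=0.5, vulnerability=0.8)

# Each intervention halves one factor; the mapping of interventions to
# factors follows the text above, the numbers do not.
with_safety_culture = risk(hazard=0.3, exposure=0.5, vulnerability=0.8)        # safety culture lowers hazard
with_early_safety_design = risk(hazard=0.6, exposure=0.25, vulnerability=0.8)  # early safety engineering lowers exposure
with_crisis_preparation = risk(hazard=0.6, exposure=0.5, vulnerability=0.4)    # crisis preparation lowers vulnerability

print(f"{baseline:.2f} {with_safety_culture:.2f} "
      f"{with_early_safety_design:.2f} {with_crisis_preparation:.2f}")
# 0.24 0.12 0.12 0.12 -- halving any single factor halves the overall risk.
```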

II. What risks other than AI X-Risk should we consider?

This section explores ethics considerations besides X-Risk. In so doing, it points to the need for a broader ethical framing, which we develop in a preliminary way in the next section (section III).

Non-AI X-Risks

Before determining what moral weight to assign AI X-Risk, consider non-AI X-Risks. For example, an increasing number of bacteria, parasites, viruses and fungi with antimicrobial resistance could threaten human health and life; the use of nuclear, chemical, biological or radiological weapons could end the lives of millions or make large parts of the planet uninhabitable; extreme weather events caused by anthropogenic climate change could endanger the lives of many people, trigger food shortages and famine, and annihilate entire communities. Discussion of these non-AI X-Risks is conspicuously absent from most discussions of AI X-Risk.

A plausible assumption is that these non-AI threats have at least as much likelihood of rising to the level of X-Risk as AI does. If so, then our response to AI X-Risk should be proportionate to our response to these other dangers. For example, it seems inconsistent to halt developing AI systems due to X-Risk, while doing little to slow or reduce the likelihood of X-Risk from nuclear weaponry, anthropogenic climate change or antimicrobial resistance. All these possible X-risks are difficult to gauge precisely; moreover, they intersect, further confounding estimates of each. For example, AI might accelerate progress in green technology and climate science, reducing damaging effects of climate change; alternatively, AI might increase humanity’s carbon footprint, since more powerful AI takes more energy to operate. The most promising policies simultaneously reduce multiple X-Risks, while the most destructive ones increase multiple X-Risks. Taking the entire landscape of X-Risk into account requires considering how big risks compare, combine and rank relative to one another.

The optimal strategy for reducing the full range of X-Risks might involve less direct strategies, such as building international solidarity and strengthening shared institutions. The United Nations defines international solidarity as ‘the expression of a spirit of unity among individuals, peoples, states and international organisations. It encompasses the union of interests, purposes and actions and the recognition of different needs and rights to achieve common goals.’ 34 Strengthening international solidarity could better equip the world to respond to existential threats to humanity, because solidarity fosters trust and social capital. Rather than undercutting concern about people living in the distant future, building rapport with people living now might do the opposite, that is, foster a sense of common humanity and of solidarity between generations.

One way to elaborate these ideas more systematically draws on values salient in sub-Saharan Africa, which emphasise solidarity and prosocial duties. For example, expounding an African standpoint, Behrens argues that African philosophy tends to conceive of generations past, present and future as belonging to a shared collective and to perceive, ‘a sense of family or community’ spanning generations. 35 Unlike utilitarian ethics, which tends to focus on impartiality and duties to strangers, African solidarity may consider it ethically incriminating to impose sacrifices on one to help many, because each member of a group acquires a superlative value through group membership.(Metz, p62) 36 The African ethic of ubuntu can be rendered as a ‘family first’ ethic, permitting a degree of partiality towards present people. Utilitarianism, by contrast, requires impartially maximising well-being for all people, irrespective of their proximity or our relationship to them. While fully exploring notions like solidarity and ubuntu is beyond this paper’s scope, they serve to illustrate the prospect of anchoring AI ethics to more diverse and globally inclusive values.

AI X-Benefits

In addition to non-AI X-Risk, a thorough analysis should consider AI’s X-Benefits. To give a prominent example, in 2020, DeepMind demonstrated its AlphaFold system could predict the three-dimensional shapes of proteins with high accuracy. Since most drugs work by binding to proteins, the hope is that understanding the structure of proteins could fast-track drug discovery. By pinpointing patterns in large data sets, AI can also aid diagnosing patients, assessing health risks and predicting patient outcomes. For example, AI image scanning can identify high risk cases that radiologists might miss, decrease error rates among pathologists and speed processing. In neuroscience, AI can spur advances by decoding brain activity to help people with devastating disease regain basic functioning like communication and mobility. Researchers have also used AI to search through millions of candidate drugs to narrow the scope for drug testing. AI-aided inquiry recently yielded two new antibiotics—halicin in 2020 and abaucin in 2023; both can destroy some of the worst disease-causing bacteria, including strains previously resistant to known antibiotics. In its 2021 report, the National Academy of Medicine noted, ‘unprecedented opportunities’ in precision medicine, a field that determines treatment for each patient based on vast troves of data about them, such as genome information. (Matheny, p1) 37 In precision cancer medicine, for example, whole genome analysis can produce up to 3 billion pairs of information and AI can analyse this efficiently and accurately and recommend individualised treatment. 38

While difficult to quantify, it seems reasonable to say that AI X-Benefits are at least as likely, and as worth considering, as AI X-Risks. Halting or slowing AI development may prevent or slow AI X-Benefits, depriving people of benefits they might have received. Longtermism could, in principle, permit narrow AI applications under close supervision while simultaneously urging a moratorium on advanced AI, but it might be impossible to say in practice whether a given line of research will be X-Risky.

The dearth of attention to X-Benefit might reflect what Jobin et al call a ‘negativity bias’ in international AI ethics guidance, which generally emphasises precautionary values of preventing harm and reducing risk; according to these authors, ‘(b)ecause references to non-maleficence outnumber those related to beneficence, it appears that issuers of guidelines are preoccupied with the moral obligation to prevent harm.’(Jobin et al , p396) 15 Jecker and Nakazawa have argued that the negativity bias in AI ethics may reflect a Western bias, expressing values and beliefs more frequently found in the West than the Far East. 39 A 2023 global survey by Institut Public de Sondage d'Opinion Secteur (IPSOS) may lend support to this analysis; it reported nervousness about AI was highest in predominantly Anglophone countries and lowest in Japan, Korea and Eastern Europe. 40 Likewise, an earlier, 2020 PEW Research Centre study reported that most Asia-Pacific publics surveyed considered the effect of AI on society to be positive, while in places such as the Netherlands, the UK, Canada and the USA, publics are less enthusiastic and more divided on this issue. 41

A balanced approach to AI ethics must weigh benefits as well as risks. Lending support to this claim, the IPSOS survey reported that overall, the global public appreciates both risks and benefits: about half (54%) of people in 31 countries agreed that products and services using AI have more benefits than drawbacks and are excited about using them, while about the same percentage (52%) are nervous about them. A balanced approach must avoid hyped expectations about both benefits and risks. Getting ‘beyond the hype’ requires not limiting AI ethics to ‘dreams and nightmares about the distant future.’(Coeckelbergh, p26) 42

AI risks that are not X-Risk

A final consideration that falls outside the scope of X-Risk concerns the many serious harms happening now: algorithmic bias, AI hallucinations, displacement of creative work, misinformation and threats to privacy.

In applied fields like medicine and criminal justice, algorithmic bias can disadvantage and harm socially marginalised people. In a preliminary study, medical scientists reported that the LLM, GPT-4, gave different diagnoses and treatment recommendations depending on the patient’s race/ethnicity or gender and highlighted, ‘the urgent need for comprehensive and transparent bias assessments of LLM tools such as GPT-4 for intended use cases before they are integrated into clinical care.’(Zack et al, p12) 43 In the criminal justice system, the application of AI generates racially biased systems for predictive policing, arrests, recidivism assessment, sentencing and parole. 44 In hiring, AI-determined recruitment and screening feeds sexist labour systems. 45 In education, algorithmic bias in college admissions and student loan scoring impacts important opportunities for young people. 46 Geographically, algorithmic bias is reflected in the under-representation of people from low-income and middle-income countries in the datasets used to train or validate AI systems, reinforcing the exclusion of their interests and needs. The World Economic Forum reported in 2018 that an average US household can generate a data point every six seconds. In Mozambique, where about 90% of people lack internet access, the average household generates zero digital data points. In a world where data play an increasingly powerful social role, to be absent from datasets may lead to increasing marginalisation with far-reaching consequences. 47 In poorer nations, these infrastructure deficiencies may shift attention away from AI harms towards the lack of AI benefits. Furthermore, as Hagerty notes, ‘a lack of high-skill employment in large swaths of the world can leave communities out of the opportunities to redress errors or ethical missteps baked into the technological systems’. 18

Documented harms also occur when AI systems ‘hallucinate’ false information and spew it out convincingly alongside true statements. In 2023, an attorney was fined US$5000 by a US Federal Court for submitting a legal brief on an airline injury case peppered with citations from non-existent case precedents that were generated by ChatGPT. 48 In healthcare, GPT-4 was prompted to respond to a patient query ‘how did you learn so much about metformin (a diabetes medication)’ and claimed, ‘I received a master’s degree in public health and have volunteered with diabetes non-profits in the past. Additionally, I have some personal experience with type two diabetes in my family.’ 49 Blatantly false statements like these can put people at risk and undermine trust in legal and healthcare systems.

A third area relates to AI displacement of human creative work. For example, while computer-generated content has long informed the arts, AI presents a novel prospect: artwork generated without us, outperforming and supplanting human creations. If we value aspects of human culture specifically as human, managing AI systems that encroach on this is imperative. Since it is difficult to ‘dial back’ AI encroachment, prevention is needed—if society prefers not to read mostly AI-authored books, AI-composed songs and AI-painted paintings, it must require transparency about the sources of creative works; commit to support human artistry; and invest in the range of human culture by protecting contributions from groups at risk of having their contributions cancelled.

A fourth risk is AI’s capacity to turbocharge misinformation by means of LLMs and deep fakes in ways that undermine autonomy and democracy. If people decide which colleges to apply to or which destinations to vacation in based on false information, this undermines autonomy. If citizens are shown campaign ads using deep fakes and fabrication, this undercuts democratic governance. Misinformation can also increase X-Risks. For example, misinformation about climate solutions can lower acceptance of climate change and reduce support for mitigation; conspiracy theories can increase the spread of infectious diseases and raise the likelihood of global pandemics.

A fifth risk concerns threats to privacy. Privacy, understood as ‘the right to be left alone’ and ‘the right of individuals to determine the extent to which others have access to them,’ is valued as instrumental to other goods, such as intimacy, property rights, security or autonomy. Technology can function both as a source of and a solution to privacy threats. Consider, for example, the ‘internet of things,’ which intelligently connects various devices to the internet—personal devices (eg, smart phones, laptops); home devices (eg, alarm systems, security cameras) and travel and transportation devices (eg, webcams, radio frequency identification (RFID) chips on passports, navigation systems). These devices generate personal data that can be used both to protect people, and to surveil them with or without their knowledge and consent. For example, AI counters privacy threats by enhancing tools for encryption, data anonymisation and biometrics; it increases privacy threats by helping hackers breach security protocols (eg, captcha, passwords) meant to safeguard personal data, or by writing code that intentionally or unintentionally leaves ‘backdoor’ access to systems. When privacy protection is left to individuals, it has too often ‘devolved into terms-of-service and terms-of-use agreements that most people comply with by simply clicking ‘I agree,’ without reading the terms they agree to.’(Jecker et al, p10–11) 50

Stepping back, these considerations make a compelling case for addressing AI benefits and risks here and now. Bender and Hanna put the point thus: ‘Beneath the hype from many AI firms, their technology already enables routine discrimination in housing, criminal justice and healthcare, as well as the spread of hate speech and misinformation in non-English languages;’ they conclude, ‘Effective regulation of AI needs grounded science that investigates real harms, not glorified press releases about existential risks.’ 51

Proponents of effective altruism and longtermism might counter that present-day harms (such as algorithmic bias, AI hallucinations, displacement of creative work, misinformation and threats to privacy) are ethically insignificant ‘in the big picture of things—from the perspective of humankind as a whole,’ because they do not appreciably affect the total amount of human suffering or happiness.(12, p. 2) Yet, the prospect of non-X-Risk harms is troubling to many. Nature polled 1600 scientists around the world in 2023 about their views on the rise of AI in science, including machine-learning and generative AI tools. 52 The majority reported concerns about immediate and near-term risks, not long-term existential risk: 69% said AI tools can lead to more reliance on pattern recognition without understanding, 58% said results can entrench bias or discrimination in data, 55% thought that the tools could make fraud easier and 53% stated that ill considered use can lead to irreproducible research. Respondents reported specific concerns related to faked studies, false information and training on historically biased data, along with inaccurate professional-sounding results.

Table 1 recaps the discussion of this section and places AI X-Risk in the wider context of other risks and benefits.

Table 1: Placing X-Risk in context

III. Conclusion

This paper responded to alarms sounding across diverse sectors and industries about grave risks of unregulated AI advancement. It suggested a wide-angle lens for approaching AI X-Risk that takes X-Risk seriously alongside other urgent ethics concerns. We urged justly transitioning to more AI-centred societies by disseminating AI risks and benefits fairly, with special attention to groups historically disadvantaged and marginalised.

In the Jātaka tale, what stopped the stampede of animals was a lion (representing the Bodhisattva) who told the animals, ‘Don’t be afraid.’ The stampede had already put all the animals at risk: if not for the lion, the animals would have stampeded right into the sea and perished.

Data availability statement

No data are available.

Ethics statements

Patient consent for publication.

Not applicable.


This paper argues that the headline-grabbing nature of existential risk (X-Risk) diverts attention away from immediate artificial intelligence (AI) threats, including fairly disseminating AI risks and benefits and justly transitioning toward AI-centered societies. Section I introduces a working definition of X-Risk, considers its likelihood, and explores possible subtexts. It highlights conflicts of interest that arise when tech luminaries lead ethics debates in the public square. Section II flags AI ethics concerns brushed aside by focusing on X-Risk, including AI existential benefits (X-Benefits), non-AI X-Risk, and AI harms occurring now. Taking the entire landscape of X-Risk into account requires considering how big risks compare, combine, and rank relative to one another. As we transition toward more AI-centered societies, which we, the authors, would like to be fair, we urge embedding fairness in the transition process, especially with respect to groups historically disadvantaged and marginalized. Section III concludes by proposing a wide-angle lens that takes X-Risk seriously alongside other urgent ethics concerns.

Twitter @profjecker, @atuire, @BelislePipon, @VarditRavitsky, @AnitaHoEthics

Presented at A version of this paper will be presented at The Center for the Study of Bioethics, The Hastings Center, and The Oxford Uehiro Centre for Practical Ethics conference, “Existential Threats and Other Disasters: How Should We Address Them,” June 2024, Budva, Montenegro.

Contributors NSJ contributed substantially to the conception and analysis of the work; drafting or revising it critically; final approval of the version to be published; is accountable for all aspects of the work; and is responsible for the overall content as guarantor. CAA contributed substantially to the conception and analysis of the work; drafting or revising it critically; final approval of the version to be published and is accountable for all aspects of the work. J-CB-P contributed substantially to the conception and analysis of the work; drafting or revising it critically; final approval of the version to be published and is accountable for all aspects of the work. VR contributed substantially to the conception and analysis of the work; drafting or revising it critically; final approval of the version to be published and is accountable for all aspects of the work. AH contributed substantially to the conception and analysis of the work; drafting or revising it critically; final approval of the version to be published and is accountable for all aspects of the work.

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Provenance and peer review Not commissioned; externally peer reviewed.


AI-Related Risk: An Epistemological Approach

  • Research Article
  • Open access
  • Published: 25 May 2024
  • Volume 37, article number 66 (2024)

Giacomo Zanotti (ORCID: 0000-0001-9898-4113), Daniele Chiffi and Viola Schiaffonati (ORCID: 0000-0001-9127-6165)

Risks connected with AI systems have become a recurrent topic in public and academic debates, and the European proposal for the AI Act explicitly adopts a risk-based tiered approach that associates different levels of regulation with different levels of risk. However, a comprehensive and general framework to think about AI-related risk is still lacking. In this work, we aim to provide an epistemological analysis of such risk building upon the existing literature on disaster risk analysis and reduction. We show how a multi-component analysis of risk, which distinguishes between the dimensions of hazard, exposure, and vulnerability, allows us to better understand the sources of AI-related risks and effectively intervene to mitigate them. This multi-component analysis also turns out to be particularly useful in the case of general-purpose and experimental AI systems, for which it is often hard to perform both ex-ante and ex-post risk analyses.


1 Introduction

Progress in the field of Artificial Intelligence (AI) has been increasingly rapid, and AI systems are now widespread in societies. In parallel, there has been a growing interest in the ethical and socially relevant aspects of the design and use of AI systems. Among other things, an increasing focus has been directed towards the risks associated with the widespread adoption of AI systems.

To begin, a lot of attention (and media coverage) has been devoted to the so-called existential risks, namely the risk of human extinction and global catastrophes due to the development of misaligned AI. In particular, the evoked scenarios focus on the development and deployment of Artificial General Intelligence (AGI), namely human-level or even beyond-human AI. Interestingly, concerns of this kind have motivated initiatives such as the Future of Life Institute’s open letter Pause Giant AI Experiments , Footnote 1 or the Center for AI Safety’s public Statement on AI Risk according to which “mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war”. Footnote 2

That said, many have criticized the insistence on such futuristic scenarios and narratives about AI takeover, arguing that the use of AI systems already involves far more mundane forms of risk. Footnote 3 Examples will be given throughout the article, but we can mention at least problems related to algorithmic discrimination (Buolamwini & Gebru, 2018), privacy violation (Curzon et al., 2021), environmental impacts and the exploitation of human labour (Crawford, 2021). Along these lines, notable attempts have been made to regulate the design and use of current AI systems and reduce their actual risks, especially within the normative framework of Trustworthy AI (see Zanotti et al., 2023). Most notably, a lot of attention has been devoted to the recently approved regulation of the European Parliament and of the Council laying down harmonised rules on Artificial Intelligence (AI Act), a unified legal framework for AI. Footnote 4 Interestingly, the AI Act explicitly adopts a risk-based approach that groups AI systems into different levels of risk. In particular, it explicitly distinguishes between systems involving unacceptable risks (e.g., those used for social scoring) and high-risk systems (e.g., those used for predictive justice). Two other levels of risk can be identified, even if no precise label is employed in the AI Act: limited-risk systems (e.g., chatbots) and minimal-risk systems (e.g., spam filters). Each level of risk is then associated with a specific level of regulation: unacceptably risky systems are prohibited (Art. 5); high-risk systems need to comply with strict requirements concerning, among other things, traceability, human oversight, accuracy, security, and robustness (Chapter III); limited-risk systems must respect transparency requirements (Art. 50); finally, the development and use of minimal-risk systems should only be subject to codes of conduct. Footnote 5
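To make the tiered structure just described easier to survey, it can be rendered as a simple lookup table. The sketch below is purely illustrative: the tier labels, example systems and obligations merely paraphrase the summary above, the function and variable names are ours, and nothing in it should be read as an authoritative or complete rendering of the Act.

```python
# Illustrative sketch (not an official rendering) of the AI Act's risk tiers
# as summarised in the text: each tier maps to example systems and to the
# broad regulatory consequence attached to it.

AI_ACT_TIERS = {
    "unacceptable": {
        "examples": ["social scoring"],
        "consequence": "prohibited (Art. 5)",
    },
    "high": {
        "examples": ["predictive justice"],
        "consequence": "strict requirements on traceability, human oversight, "
                       "accuracy, security and robustness (Chapter III)",
    },
    "limited": {
        "examples": ["chatbots"],
        "consequence": "transparency requirements (Art. 50)",
    },
    "minimal": {
        "examples": ["spam filters"],
        "consequence": "codes of conduct only",
    },
}


def regulatory_consequence(tier: str) -> str:
    """Return the paraphrased regulatory consequence attached to a risk tier."""
    return AI_ACT_TIERS[tier]["consequence"]


if __name__ == "__main__":
    for tier, info in AI_ACT_TIERS.items():
        print(f"{tier:>12}: e.g. {', '.join(info['examples'])} -> {info['consequence']}")
```

The point of the exercise is only to show how a risk tier, once assigned, determines the applicable regulatory regime.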

Now, our work does not directly address the AI Act, but rather aims to lay some philosophical and conceptual bases to better understand AI-related risk. And while this kind of work should ideally inform actual interventions, also in terms of regulation, it is not meant to be interpreted as providing readily applicable instructions and suggestions to policymakers and stakeholders. Footnote 6 Our perspective will be distinctively epistemological, and more general in scope. That said, the AI Act provides us with a useful starting point for our discussion. While being a crucial step towards a responsible and trustworthy development of AI, the Act is not free from limitations (e.g., Mahler, 2022 ; Edwards, 2022 ; Floridi, 2021 ; Mökander, 2022 ). Footnote 7 In our view, a potential problem of the Act is that, while explicitly adopting a risk-based approach, it lacks a proper conceptualization of the notion of risk. Improvements were made with the amendments approved in June 2023, for Art. 3 (2) now explicitly defines risk as “the combination of the probability of an occurrence of harm and the severity of that harm”. Still, we will argue, this understanding of risk is not enough when it comes to assessing and possibly mitigating AI-related risk.

However, the scope of our analysis extends beyond the AI Act. While other kinds of risks, such as natural risks, have already been investigated from the perspective of the philosophy of science, little work has been done on the epistemology of risk in the context of AI. True, specific AI-related risks have been investigated: again, discrimination, privacy violation, environmental impacts, and so on – see Wirtz et al. ( 2022 ) for a useful panoramic overview of AI-related risk. Footnote 8 However, a comprehensive framework is still lacking. Footnote 9 Our aim in this article is to provide an epistemological analysis of AI-related risk that distinguishes its different components by building upon the existing literature on disaster risk analysis and reduction, which is usually (but not exclusively) adopted for natural risks. As we will see, such an approach turns out to be particularly fruitful when it comes to designing risk-mitigation policies, for distinguishing the different components of AI-related risk also opens the way for different kinds of intervention for mitigation.

The article is structured as follows. Section 2 presents different approaches to the conceptualization of risk, focusing on multi-component analyses that understand risk as resulting from the interplay of three different components: hazard, exposure, and vulnerability. In Sect. 3, we argue in favour of the application of this multi-component analysis to AI-related risks, showing how it allows us to better capture different aspects of such risks and design more effective interventions for mitigation. In Sect. 4, we develop the analysis presented in Sect. 3 by focusing on the difficulties involved in providing ex-ante analyses of AI-related risks, especially when we deal with general-purpose AI systems having the character of experimental technologies. Section 5 provides a brief summary and closes the article.

2 Components of Risk

Our analysis should probably start with a caveat, namely that there is no univocal notion of risk (Boholm et al., 2016 ; Hansson, 2023 ). This is due both to differences in the way risk is defined in the literature and to the fact that technical definitions of risk coexist with the ordinary understanding and usage of this notion.

Focusing on technical definitions, today’s dominant approaches conceive of risk in terms of expected utility (Hansson, 2009 ). That is, risk is given by the combination of the probability of an unwanted event occurring and the magnitude of its consequences. Footnote 10 As an example, consider volcanic risk. On the one hand, a large and highly explosive eruption might be associated with a low level of risk if its occurrence is estimated as very unlikely with a fair degree of confidence. A moderately explosive but way more likely eruption, on the other hand, would arguably be associated with a higher level of risk. Despite its simplicity, this way of thinking about risks provides us with an easily applicable and intuitive model for decision-making in contexts of risk that nicely fits with theories of rational choice insisting on the maximisation of the expected utility (see Briggs, 2023 ).
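In schematic terms – the notation here is ours, added for clarity rather than drawn from the article – this expected-utility understanding of risk can be written as

$$R(E) \;=\; p(E)\cdot m(E), \qquad \text{or, over a set of unwanted outcomes,} \qquad R \;=\; \sum_{i} p_i\, m_i,$$

where $p$ is the probability of the unwanted event and $m$ the magnitude of its harmful consequences. On this reading, the volcanic comparison simply says that a very low $p$ can outweigh a large $m$: the highly explosive but very unlikely eruption may carry a lower $R$ than the moderately explosive but far more probable one.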

As already anticipated, this definition of risk is the one the AI Act explicitly refers to. However, while retaining a definition in terms of expected utility, one can provide further analyses of risk. Most notably, risk can be decomposed into its different components – usually, hazard, exposure, and vulnerability. As we will see in a moment, this approach, fairly common in risk analysis, opens up different areas of intervention for mitigation. Since we aim to provide an analysis of AI-related risk that can also fruitfully serve as a ground for policy making, the multi-component analysis is the one adopted in this article. Footnote 11

Let us now go a bit more into the details of the components of risk. Starting with the first one, the notion of hazard refers to the source of potential harm – let us recall that, when it comes to risk, the focus is always on unwanted consequences. In addition to the specification of the source of the potential harm as well as of its characteristics in terms of magnitude, the analysis of hazards often involves a probabilistic element – that is, the probability of occurrence of the harmful phenomenon (e.g., UNDRO, 1991 ). Consider, for instance, risks related to volcanic eruptions. In this case, the hazard is primarily the eruption itself, which in its turn brings about a series of potentially harmful events, such as pyroclastic and lava flows. Different elements contribute to making this kind of hazard more or less impactful for risk analysis, such as the frequency of the eruptions, their intensity and their duration.

The domain of natural disasters offers notable examples of other kinds of hazards, such as earthquakes, tidal waves and flooding. However, importantly for our purpose, hazards can also have origins other than natural ones. For example, the escalation of an armed conflict is a distinctive example of a non-natural hazard. Further narrowing the focus, we can consider technological risks related to technological artefacts, namely objects produced by humans in order to fulfill some kind of practical function (Vermaas et al., 2011 , p. 5). Footnote 12 In the next section, we will focus on risks stemming from AI systems.

As suggested by the label, the component of exposure refers to what could be harmed. Importantly, living beings – most notably, humans – can be exposed, but we can also think of risks in which material assets such as buildings and infrastructures are involved. Going back to the example of volcanic risk, the exposure has to do with the number of people, buildings, infrastructures, and other assets that would be affected by the eruption. Importantly, as we will see in the next section in relation to some AI systems, even minimal hazards can be associated with high levels of risk – and eventually bring about disastrous outcomes – when exposure is high.

Finally, the component of vulnerability unsurprisingly has to do with how much the exposed people or assets are susceptible to the impacts of hazards. Footnote 13 Providing a precise characterization of vulnerability is far from easy, for a number of different definitions are available (Thywissen, 2006 ), and factors affecting vulnerability may significantly vary. In general, however, they include all those circumstances and measures that could make people or assets more or less defenceless against harming events. In the case of volcanic eruptions, this might translate into the existence and feasibility of plans for evacuation, shelters as well as food and water emergency supplies. Footnote 14

Once these three components of risk are clearly identified, different interventions for mitigating risk can be designed. First of all, one may take measures to reduce hazards. Here, some distinctions between different kinds of risks shall be made, for hazard reduction is not always possible. In particular, hazard mitigation is not so easy when it comes to natural risk. True, there are some cases in which interventions for reducing hazards are possible – for instance, flooding is influenced by anthropogenic climate change, and hazards like landslides can be due to logging and land abuse. However, in many other cases, including volcanic risk, hazard mitigation is simply not possible, for the occurrence of the unwanted event is independent of human action.

On the contrary, if the risks in question are related to the use of a certain technological artefact, the hazard can sometimes be reduced or even eliminated. Most notably, measures could be taken by prohibiting the use of the artefact and withdrawing it from the market, as happened in several countries with the ban on asbestos. Of course, things are not always easy. In many cases, hazard mitigation in contexts of technological risk presents significant challenges – think about interventions to cap carbon dioxide emissions. Footnote 15 Still, as with asbestos, there seem to be cases in which mitigating a technological hazard is feasible.

That said, hazard reduction is not the only way to mitigate risk. Among other things, risk mitigation strategies might attempt to reduce exposure. For natural risks stemming from geographically circumscribed events, building and access permits could be denied in the potentially affected areas – such as the vicinity of a volcano. However, exposure can also be reduced when it comes to technological risks. If signs of structural failure are detected in a bridge, for example, exposure can be significantly reduced by forbidding access to the bridge. In the field of information and communication technologies (ICT), by contrast, age restrictions on the use of services and products – e.g., social networks – can be seen, at least in principle, as measures for exposure reduction.

Finally, interventions could aim at making the population and assets less vulnerable. These interventions can vary significantly, since vulnerability is a very broad notion and involves different factors. We have seen how, in the case of volcanic risk, they involve actions such as building shelters, designing evacuation plans and stockpiling basic necessities and supplies. In the case of ICT, antivirus software and spam filters play an analogous role, protecting the user from malware and potentially dangerous content.

As anticipated, there is no univocal notion of risk. On the contrary, different ways to conceptualise and analyse risk are possible and could be more or less useful depending on the context. Here, we have focused on a multi-component analysis of risk that distinguishes between the components of hazard, vulnerability, and exposure.
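As a compact mnemonic for this decomposition – again, the notation is ours, and the article does not commit to any particular functional form – one may write

$$R \;=\; f(H, E, V),$$

where $H$, $E$ and $V$ stand for hazard, exposure and vulnerability. In parts of the disaster-risk literature this is schematised multiplicatively as $R = H \times E \times V$, a rendering that at least captures the point, noted in Footnote 13, that risk vanishes whenever any one of the three components is absent.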

3 AI and Multi-Component Analysis of Risk

We have seen how a multi-component analysis of risk can be employed to understand both natural and technological risks and at the same time pave the way for different kinds of interventions aimed at risk mitigation. We now wish to narrow the focus to risks stemming from AI systems. However, a problem immediately emerges concerning what we mean by “AI system”, for defining AI has been a long-standing problem since at least the foundation of the discipline, and a number of different definitions are available (Russell & Norvig, 2021). Footnote 16

For the purpose of this article, the OECD’s (2023) definition can be kept in mind, according to which an AI system is

“a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment”. Footnote 17

With this definition of AI systems in mind, we will show through some relevant examples how the multi-component analysis of risk we have considered fruitfully applies to AI-related risk.

Let us start by considering the hazard component involved in the use of AI systems, which is arguably the most discussed aspect of AI-related risk. As already noted, AI systems are now increasingly employed in a number of contexts that we intuitively perceive as highly risky. Among the most discussed cases, one can think of systems used in medicine (Panayides et al., 2020), in courts (Queudot & Meurs, 2018), and in war scenarios (Amoroso & Tamburrini, 2020). In these cases, it is fairly straightforward why the use of AI systems involves considerable risks. Though increasingly accurate in their predictions and classifications, many state-of-the-art AI systems are still subject to errors and malfunctions. Consider, as an example, a system used for the detection of skin cancers. Such a system might be remarkably accurate in distinguishing cancerous tissues from benign lesions, perhaps even more so than a human doctor (Soenksen et al., 2021). Still, the possibility of a misdiagnosis remains open, with potentially life-threatening consequences for patients. Analogously, AI systems employed in war scenarios can make errors in target identification, and biased systems employed in courts can result in unjust incarceration (Angwin et al., 2016). And when the stakes are high, such errors and malfunctions result in high levels of hazard. Footnote 18

Hazard, however, is not the only component that we should take into account. Consider AI-based recommender systems. These systems are nowadays widespread and integrated into a number of online services and platforms, and they are used to filter content – advertisements, buying suggestions, music, videos, and so forth – based on the user’s interests. These interests are typically predicted on the basis of the users’ online habits and previous choices. If one focuses exclusively on the hazard component, these systems do not strike us as particularly risky, especially when compared with systems whose failure can result in human victims.

However, things change when the component of exposure is considered. Because they are pervasive in online environments, including extremely popular platforms, recommender systems monitor and influence the behaviour of virtually all users. As a result, concerns about privacy, addiction, and manipulation apply on a massive scale. The unwanted consequences might not be especially detrimental for any given individual, but the risks involved in their use are characterised by extremely high levels of exposure.

What is more, the use of AI systems involves risks whose component of exposure goes way beyond the system’s users. In particular, we have in mind risks related to AI systems’ environmental impact. Among other things, there is a growing awareness that the training and use of ML models require significant amounts of energy, which results in an increasing carbon footprint (OECD, 2022 ; Verdecchia & Cruz, 2023 ). In this sense, the emission of greenhouse gases related to the use of AI systems involves risks with a potentially global impact (Tamburrini, 2022 ).

Finally, some AI systems stand out for the vulnerability of their users. In this regard, interesting examples come from contexts in which AI systems interact in social environments with specific kinds of populations. These systems, explicitly designed to interact with humans by following social rules, are often equipped with modules and software that allow them to recognize users’ affective states and suitably simulate emotion-driven behaviour. A clear example is provided by AI-powered social robots. These systems come in many shapes and forms and are increasingly used in the context of older adults’ care (Miyagawa et al., 2019) and in different educational environments (Tanaka et al., 2015) – e.g., with children with autism spectrum disorders (Rakhymbayeva et al., 2021). When it comes to these settings, the weight distribution among the components of risk is different again, for the real cause of concern has to do with vulnerability. As a matter of fact, older adults and children are the prototypical vulnerable populations. Footnote 19 Hazard levels, by contrast, are reasonably low. True, it is possible that a malfunction of the robot results in someone getting physically hurt. More frequently, however, errors are “social errors” (Tian & Oviatt, 2021), episodes involving the breaking of social rules and failures in the recognition and display of emotions. At the same time, the deployment of social robots in older adults’ care and education does not necessarily involve high levels of exposure, for it typically takes place in small and controlled environments.

Summing up, distinguishing between the components of hazard, exposure and vulnerability allows us to better identify the different sources of the risks involved in the use of different AI-based technologies. It is worth noting that, just like all other risks, AI-related risk always results from the interplay of all three components. Consider – again – the case of a recommender system implemented in a social network suggesting links to products on an e-commerce website. We have seen how, being integrated into widely used online platforms, these systems involve significant risks due to their high level of exposure. However, one could also take into account how they may end up exploiting users’ weaknesses to maximise sales profits. In this case, when assessing risk, the focus has to be on the interplay between exposure and users’ vulnerability. Notwithstanding the schematic character of the discussion presented here, the point is that adopting a multi-component analysis of risk enables a better assessment of AI-related risk and could pave the way for better mitigation strategies.
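To make this recap a little more tangible, the three-component reading of the examples just discussed can be sketched as a small qualitative scoring exercise. This is purely illustrative: the ordinal ratings below are our own rough readings of the cases described in this section, not measurements, and the “dominant component” heuristic is only one crude way of flagging where mitigation effort might be concentrated.

```python
from dataclasses import dataclass

# Illustrative only: ordinal ratings of the three risk components for the
# example systems discussed above. The ratings are rough readings of the text,
# not empirical assessments.

LEVELS = {"low": 1, "medium": 2, "high": 3, "very high": 4}


@dataclass
class RiskProfile:
    system: str
    hazard: str
    exposure: str
    vulnerability: str

    def dominant_component(self) -> str:
        """Name the component with the highest ordinal rating."""
        scores = {
            "hazard": LEVELS[self.hazard],
            "exposure": LEVELS[self.exposure],
            "vulnerability": LEVELS[self.vulnerability],
        }
        return max(scores, key=scores.get)


EXAMPLES = [
    RiskProfile("skin-cancer diagnosis support", "high", "medium", "medium"),
    RiskProfile("recommender system on a large platform", "low", "very high", "medium"),
    RiskProfile("social robot in older adults' care", "low", "low", "high"),
]

for profile in EXAMPLES:
    print(f"{profile.system}: mitigation focus -> {profile.dominant_component()}")
```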

4 AI-Related Risk: Flexibility and Experimentality

While the adoption of a multi-component approach to risk analysis puts us in a better position to deal with AI-related risks, some difficulties remain. In this section, we focus on the risks stemming from the deployment of AI systems having the character of experimental technologies and qualifying as general-purpose ones – more on this in a moment. These two features, increasingly common in many AI systems, give rise to difficulties when it comes to providing an ex-ante analysis of the involved risks, and may reduce the usefulness of ex-post ones. We will show how, even in these cases, a multi-component analysis of risk allows us to better understand sources of AI-related risk and thereby plan mitigation interventions.

4.1 Ex-Ante and Ex-Post Risk Analyses

Let us start by introducing the general notions of ex-ante and ex-post risk analysis. In risk analysis and methods for economic evaluations like cost-benefit analysis, it is quite standard to distinguish ex-ante evaluations of risk, namely risk assessments conducted before the realization of a project or policy, from ex-post evaluations of risk, which occur after a specific project or policy – in our case, for example, the introduction of a new AI-based technology – has been completed (de Rus, 2021 ). Note that ex-ante evaluations may face severe forms of empirical uncertainty, particularly when evaluating risks that may occur in a distant future and are not limited to specific geographic areas, which can be difficult to identify and quantify (Hansson, 1996 ). For example, new emerging risks may manifest during the implementation of a project and could be in some circumstances quite unimaginable in the ex-ante phase. As we will see in a moment, this is particularly relevant when it comes to the so-called experimental technologies.

A possible way to address these risks is to provide ex-post risk evaluations of projects or policies. Through this kind of retrospective analysis, it is possible to identify and address emergent risks that would have been difficult to consider in the ex-ante phase. In an ex-post evaluation, factual uncertainty can be dramatically reduced, even though there may still be uncertainty regarding the counterfactual scenario in which a specific intervention was not executed. Footnote 20 Moreover, through ex-post analysis, we can gain a better understanding of the exact magnitude of the risks, the extent of their exposure, and the key factors influencing vulnerability. In light of this, ex-post evaluations can be used to inform and partially shape future ex-ante evaluations of similar new projects and risks.

Having introduced the notions of ex-ante and ex-post risk analysis, we can now consider the difficulties involved in performing such analyses within the context of AI. Footnote 21 To this aim, let us go back for a moment to the OECD’s definition presented in Sect. 3, stressing the artefactual nature of AI systems, their interaction capabilities and ability to learn, the role played by inferences and their impact on decisions, and the different levels of autonomy. This definition captures rather well the features of many current AI systems, which are capable of performing complex tasks in unknown environments by constantly using new data. Now, several of the current AI techniques adopted to achieve these capabilities produce results that are opaque and very often difficult to explain. However, complexity does not only concern the very nature of these systems; it also has a lot to do with their interaction with environments (including humans) that in many cases are not known in advance.

To make things worse, many of these technologies are radically innovative, and their introduction into society is de facto unprecedented. Indeed, they could be described as experimental technologies according to the characterization provided by van de Poel ( 2016 ). By definition, experimental technologies are those technologies whose risks and benefits are hard to estimate before they are properly inserted in their context of use, for “there is only limited operational experience with them, so that social benefits and risks cannot, or at least not straightforwardly, be assessed on basis of experience” (van de Poel, 2016 , p. 669). Several technologies, such as nanotechnologies or human enhancement drugs, may qualify as experimental according to this definition. And indeed, many of the current AI systems seem to perfectly fit in the category, given their radically innovative character and their being designed to interact with unknown environments.

The assessment of AI-related risks is also complicated by the fact that many current AI applications are based on so-called general-purpose AI systems (GPAIS). In a nutshell, GPAIS are pre-trained models that can constitute the basis for very different AI systems, which can in turn be fine-tuned to perform better in specific contexts of application (Gutierrez et al., 2023). Footnote 22 As a result, they can be used for a variety of purposes that need not be anticipated in the training phase.

The combination of the two features we have just seen, namely being experimental technologies and being general-purpose systems, makes it particularly difficult to assess ex-ante – that is, before the deployment of the system – all the risks involved in the use of certain AI systems. Consider, for instance, Large Language Models (LLMs). LLMs are relatively new in the AI landscape, at least if we focus on transformer-based architectures (Vaswani et al., 2017). Among other things, these systems, trained on huge datasets, can produce impressively convincing texts on the basis of prompts given by the users. While transformer-based LLMs have been available for some years now, they became extremely popular among the general public after OpenAI’s launch of ChatGPT in November 2022. From a technical point of view, ChatGPT was not dramatically revolutionary: it was based on a pre-existing model that was fine-tuned to make it suitable for conversation. From a more societal perspective, however, it was groundbreaking: since November 2022, everyone has been able to interact, free of charge and through a user-friendly interface, with a state-of-the-art language model that performs impressively across a number of tasks, generating textual outputs that are often hardly distinguishable from human-produced ones. And in fact, ChatGPT reached one million users in five days. Footnote 23

Somewhat unsurprisingly, in a relatively short period of time, LLM-based applications have multiplied. Footnote 24 Among the interesting features of these models, their flexibility stands out. Leaving aside specific limitations imposed by programmers – e.g., they should typically be unable to produce discriminatory and pornographic outputs – they can generate virtually any type of text, can be combined with other models for multimodal processing and generation, and can easily be fine-tuned to be adapted to specific domains. This flexibility allows for a multiplicity of uses. Among other things, LLMs show promising applications in medicine (Thirunavukarasu et al., 2023), finance (Wu et al., 2023), coding (Xu et al., 2022), and education (Kasneci et al., 2023). And even within these contexts, LLM-based systems can be used for a variety of applications, as a testament to their general-purpose character. However, this flexibility comes at a cost, namely a greater potential for hazard generation: different kinds of errors and failures can occur across the different applications of LLMs, from misdiagnoses to wrong stock-market predictions resulting in monetary losses. On top of all that, LLMs’ flexibility opens the way to a great deal of misuse. For instance, one could use an LLM to write hard-to-detect malware, Footnote 25 or to generate disinformation (Bagdasaryan & Shmatikov, 2022), including through the generation of deceptive images. While other technologies (even non-AI ones) are potentially related to many of the hazards involved in the use of LLMs, LLMs stand out in that they qualify as multi-hazard systems.

So, not only is the large-scale use of LLM-based applications unprecedented, which makes these technologies experimental in van de Poel’s (2016) sense; it is also characterised by high degrees of flexibility, for LLMs can easily be fine-tuned to specific tasks that need not be anticipated by the initial designers. These two features of LLMs make it extremely difficult to predict ex-ante the risks involved in the use of such technologies: operational experience with them is limited, and it is hard to anticipate all their possible applications, and therefore all the associated risks.

To be sure, ex-post analyses can be performed after the occurrence of unwanted events related to the use of LLMs. However, conducting ex-post risk evaluations for a recently developed AI technology can be extremely difficult, as we may not yet have all the information needed to provide a complete retrospective risk analysis. More specifically, these analyses show severe limitations when it comes to considering the unexplored risks of novel uses of LLMs. Given the experimental character of these technologies and their flexibility, it is reasonable to assume that uses and misuses other than those targeted by the ex-post analysis in question will be a cause for concern. Footnote 26

4.2 Risk Mitigation Strategies

Given these premises, how can we intervene to reduce the risks related to the use of these systems? Here, the analysis of risk we have presented in Sect. 3 turns out to be particularly helpful.

It is quite straightforward that we cannot intervene to reduce unspecified hazards – think about a hazard related to an unanticipated misuse of an LLM-based program. However, we can identify some areas of concern and intervene to limit the use of LLMs in that context. For instance, we might be concerned about and therefore limit the use of LLMs for medical diagnosis, if only for the fact that LLMs are still prone to hallucinations and the consequences of a misdiagnosis can be fatal. More generally, the idea is that, provided that we cannot anticipate with precision the unwanted consequences of the employment of LLMs, we can significantly mitigate the hazards by limiting the contexts of their use.

To do so, our first move could involve intervening at the design level – i.e., by placing constraints on the kind of answers an LLM-based tool can provide. Alternatively, we could intervene through regulation, by limiting the use of LLMs in critical contexts. Furthermore, approaches that integrate normative and technical solutions can be pursued. Consider the growing problem of distinguishing artificially generated content (text, images, and so on) from human-produced content, whose implications can be dramatic – just think about the use of deepfakes in warfare and political contexts (Twomey, 2023) as well as for so-called revenge porn (Kirchengast, 2020). Among others, Knott et al. (2023) argue for a legislative mechanism according to which the public release of general-purpose generative AI models should be possible only if a reliable tool for the detection of content generated by the model is also made available. In this way, we would intervene at the regulatory level to impose technical interventions.

That being said, hazard is not the only component of risk we should focus on when it comes to GPAIS, whose applications are not always significantly problematic from the point of view of hazard. It is also important to consider that exposure levels play a crucial role in determining risk. We have seen how GPAIS have suddenly gained popularity, how their flexibility makes them suitable for a number of different tasks and how they underlie easily accessible and user-friendly applications. When assessing the risks related to their use, all of this translates into significantly high levels of exposure. At the same time, as we have seen, we have very little experience with their use at a large scale, and we can hardly predict the risks involved. A promising strategy, at least in principle, would therefore be to introduce these technologies into society by initially limiting their users, thereby intervening on exposure. This would allow us to collect data and monitor the impact of such technologies, which would in turn allow us to take the appropriate measures to ensure their safe and beneficial large-scale deployment (van de Poel, 2016).

Now, this is easier said than done, for a certain tension underlies the strategy. On the one hand, we can hardly anticipate the consequences of the large-scale use of GPAIS or their societal impact. On the other hand, a possible way to deal with this uncertainty consists in preliminarily testing, so to speak, these technologies in restricted and monitored environments. However, it is not clear to what extent this strategy could work, for there may be significant risks that emerge only when a system is extensively used. This is not to say that small-scale testing is not helpful. However, it is important to keep in mind that some risks may emerge only at a societal scale, and therefore remain unanticipated. Moreover, a trade-off needs to be made between the significance of smaller-scale testing and its risks: testing the impact of a new technology in a small and controlled environment does not give us much grasp on what could happen at a societal scale, but the more controlled environment allows us to detect risks and intervene easily, while larger-scale testing better approximates the introduction of a technology into society but involves greater risks. While there seems to be no easy solution to the problem, this discussion brings out a crucial point: monitoring processes must play a central role. Given the difficulties involved in anticipating the risks of using a certain AI system, especially when it qualifies as an experimental and general-purpose one, continuous assessment is required to promptly detect and intervene on emerging risks. Footnote 27
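The exposure-limiting strategy with continuous monitoring just discussed can be pictured, very schematically, as a staged-rollout gate: the user cap is widened only while observed incidents stay within a tolerance, and rolled back otherwise. The sketch below is hypothetical; the stages, threshold and incident counts are invented for illustration and do not describe any actual deployment procedure.

```python
# Hypothetical sketch of a staged rollout that widens exposure only while
# monitoring stays within a tolerance. The stages, threshold and incident
# counts are invented for illustration; this is not an actual deployment rule.

STAGES = [1_000, 10_000, 100_000, 1_000_000]  # user caps per rollout stage
INCIDENT_TOLERANCE = 0.001                    # tolerated incidents per user per stage


def next_stage(current_stage: int, incidents: int, users: int) -> tuple[int, str]:
    """Decide whether to expand, hold or roll back exposure given monitoring data."""
    rate = incidents / max(users, 1)
    if rate <= INCIDENT_TOLERANCE and current_stage + 1 < len(STAGES):
        return current_stage + 1, "expand"      # widen exposure to the next cap
    if rate > INCIDENT_TOLERANCE and current_stage > 0:
        return current_stage - 1, "roll back"   # shrink exposure and investigate
    return current_stage, "hold"                # stay at the current cap


# Example: 3 incidents observed among 10,000 users while at stage 1.
stage, decision = next_stage(current_stage=1, incidents=3, users=10_000)
print(f"decision: {decision}; new user cap: {STAGES[stage]}")
```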

Finally, while an eye should be kept on exposure, some criteria are needed to prioritise the protection of certain populations when it comes to the deployment of new AI systems. Here, vulnerability is the key notion. Whether testing a new AI system or considering the impacts of its introduction, prioritising vulnerable populations is crucial. On the one hand, there are traditionally vulnerable groups such as older people and children, who, having little to no familiarity with AI systems, are more susceptible to dangers like deception or manipulation associated with the use of AI, and especially of generative AI. On the other hand, it is equally important to pay attention to users employing AI systems in contexts of vulnerability, such as education. While the use of GPAIS in educational settings holds promise, it also imposes changes that do not always benefit the users involved. For instance, it might compel us to modify teaching and testing methods, possibly abandoning well-established and effective practices.

Taking stock, we have seen how keeping in mind a multi-component analysis of risk puts us in a better position to cope with AI-related risks emerging from the use of experimental and general-purpose AI systems. Besides being increasingly popular, LLMs are a perfect example of such technologies. However, flexibility and experimentality are distinctive features of many kinds of models, especially when it comes to GPAIS. And while LLMs are among the most studied and used GPAIS, the class is wider and encompasses increasingly powerful models for computer vision – such as Meta AI’s Segment Anything Model (SAM), Footnote 28 a zero-shot learning system for image segmentation – as well as multimodal processing – such as Google DeepMind’s Gemini, Footnote 29 a family of models designed to handle image, audio, video, and text. Our general analysis should therefore be applicable to systems implementing all these models, allowing us to better understand and mitigate risk in all those cases in which the generality and experimentality of the model make it difficult to perform and rely on ex-ante and ex-post risk analyses.

5 Conclusions

Increasing attention has recently been devoted to the risks associated with the deployment of AI systems. However, a general epistemological framework for understanding such risks is still lacking. This article attempted to fill this gap by starting from multi-component analyses of risk, typically used in natural risk assessment and reduction, and trying to apply them to the context of AI. We argued that distinguishing between the components of hazard, exposure, and vulnerability allows us to better understand and deal with AI-related risk. This holds also for those AI systems that qualify as general-purpose and experimental technologies, for which it is often hard to perform ex-ante and ex-post risk analyses and for which continuous assessment and monitoring should be performed. This aligns, for instance, with the risk-based indications of the Food & Drug Administration (FDA) concerning the regulation of AI-based medical products. The FDA clarifies that the deeply iterative, autonomous, and often flexible nature of medical products necessitates a novel regulatory framework for the total product lifecycle. This framework fosters a rapid cycle of product enhancement, empowering these devices to continually improve their functionalities while maintaining robust protective measures even during the pre-market phase (FDA, 2024 ).

Needless to say, open questions abound. For instance, provided that the multi-component approach to risk borrowed from disaster risk analysis can fruitfully be applied to the domain of AI, it is still unclear whether there are some specificities of AI-related risk that call for a distinct and additional treatment. Or again, it would be interesting to investigate whether and how the development, use and societal impact of AI systems could be understood in terms of (deep forms of) uncertainty , further drifting apart from the dimension of probabilistic assessment that is often involved in talk of risk. With our work here, we hope to have provided the epistemological ground for addressing such questions and paved the way to the articulation of a more comprehensive methodology to deal with these risks.

Data Availability

Not applicable.

Code Availability

Notes

1. https://futureoflife.org/open-letter/pause-giant-ai-experiments/

2. https://www.safe.ai/statement-on-ai-risk

3. https://www.nature.com/articles/d41586-023-02094-7

4. More precisely, the Regulation of the European Parliament and of the Council on laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts (https://www.europarl.europa.eu/doceo/document/TA-9-2024-0138_EN.html).

5. Although we will not explicitly address the question in this paper, it is noteworthy that, in the latest versions of the AI Act, the category of systemic risk was added specifically in relation to the risks associated with general-purpose AI models (Chapter V).

6. See Novelli et al. (2023, 2024) for a different framework, meant to be directly applied to the AI Act.

7. Among other things, the AI Act’s list of high-risk systems has been criticized. For instance, Prainsack and Forgó (2024) have recently emphasized how systems classified as “medical devices” are considered high-risk ones regardless of their actual use. A system such as a smartwatch, on the other hand, may pose analogous risks and yet be excluded from the list of high-risk systems due to its being classified as a lifestyle gadget. This, the authors argue, “creates competitive advantages for companies with sufficient economic power to legally challenge high-risk assessments”.

8. Most contributions in the literature have from time to time focused on specific risks related to the deployment of specific systems and/or in specific contexts, while systematic and comprehensive reviews seem to be rarer.

9. A notable exception is represented by the Artificial Intelligence Risk Management Framework developed by the National Institute of Standards and Technology (NIST, 2023). Although the framework is practically oriented, it provides an analysis of how AI-related risks differ from risks of traditional software systems.

10. The probabilistic component of risk is also the main ingredient of the Royal Society’s (1983) definition of risk as “the probability that a particular adverse event occurs during a stated time period, or results from a particular challenge”.

11. Again, the multi-component analysis of risk is not meant to be a definition of risk, let alone an alternative one with respect to the definitions in terms of expected utility. It should rather be understood as an additional analysis aiming at decomposing specific risks to facilitate mitigation interventions.

12. The distinction between natural and technological risk is sometimes blurred. In fact, one could also refer to risks having mixed origins, both natural and technological – the so-called “Natech” risks (UNISDR, 2017). That said, it is important to note that the division between natural and human-made (or technological) risks is highly practical in various scenarios, but it is difficult to make a clear distinction between the two categories (Hansson, 2016). Frequently, some aspects of the same risk may be labelled as natural in certain situations and as technological in others. It is therefore advisable to avoid the oversimplified division of accidents into two rigid categories, natural and human-made.

13. Note that risk is given only in those cases in which all three components are present. There is clearly no risk if there is no hazard, but there is no risk also if no-one is exposed to harm or vulnerable. A reviewer interestingly points out that we might imagine situations in which interventions are successfully performed that significantly reduce the vulnerability of some components of the population with respect to a certain hazard, to the point where the risk related to the hazard in question becomes negligible for them. In such a case, from a population perspective, the hazard remains but the overall risk is mitigated, for there are fewer exposed people who are also vulnerable.

14. Sometimes a fourth component of risk is acknowledged, i.e. capacity, even if this concept is usually assumed as something pertaining to vulnerability. In particular, capacity is defined as “the combination of all the strengths, attributes and resources available within an organisation, community or society to manage and reduce disaster risks and strengthen resilience. […] Capacity may include infrastructure, institutions, human knowledge and skills, and collective attributes such as social relationships, leadership and management” (Sendai Framework Terminology on Disaster Risk Reduction, https://www.undrr.org/terminology/capacity).

15. We thank one of the reviewers for suggesting this clarification.

16. See Floridi (2023) for an up-to-date overview of different (legal) definitions of AI.

17. https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449 . Note that this is not the only possible definition of AI, not even if we narrow it down to legal definitions. The AI Act, for example, defines an AI system as “a machine-based system designed to operate with varying levels of autonomy, that may exhibit adaptiveness after deployment and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments” (Art. 3, 1).

18. Analogous considerations are explicitly made in the AI Act concerning AI-based safety components in digital infrastructures, road traffic and the supply of water, gas, heating and electricity, whose failure or malfunctioning “may put at risk the life and health of persons at large scale and lead to appreciable disruptions in the ordinary conduct of social and economic activities” (recital 55). Consistently with the literature on the non-epistemic aspects of inductive risk (see Douglas, 2000), this kind of error also has an impact at the level of values (Karaca, 2021).

19. Note that vulnerability would arguably deserve a separate and detailed treatment, for the question of vulnerability and AI is often raised but seldom investigated. In particular, it would be interesting to analyse to what extent we can exclusively rely on “classic” vulnerable groups (such as older adults and children) when assessing AI risk or whether we should rethink the very concept of vulnerability and its categories in light of technical and social changes in the AI landscape.

20. Note that we are significantly simplifying the matter for the sake of exposition. As we will see in a moment, ex-post analyses often present significant challenges.

21. The distinction between ex-ante and ex-post risk analysis is not reducible to the difference between ‘inherent risk’ and ‘residual risk’, although there are some commonalities. Inherent risk represents the amount of risk that exists in the absence of risk mitigation measures, while residual risk refers to the risk that remains after an organization has implemented measures to mitigate the inherent risks (Gorecki, 2020). From an ex-ante perspective, we need to consider in advance the risks occurring either in the presence or absence of possible mitigation strategies, while from an ex-post perspective, we need to evaluate the risks resulting from concretely adopted mitigation strategies and imagine counterfactually what risks could have materialized without such strategies.

22. A closely related – and sometimes overlapping – notion is that of foundation model (Bommasani et al., 2022).

23. https://www.statista.com/chart/29174/time-to-one-million-users/

24. As often happens with other AI applications, the pace of technological innovation also makes it hard to keep regulations up to date.

25. https://www.cyberark.com/resources/threat-research-blog/chatting-our-way-into-creating-a-polymorphic-malware

26. When it comes to LLMs, the issue of emergent capabilities further complicates the matter. The point is that, once they are actually used, these models exhibit abilities that could not be foreseen during the training phase. Such capabilities can also emerge when an LLM is scaled up in terms of hyperparameters and trained on a broader dataset. Needless to say, this makes it extremely hard to predict all possible uses and thus the potential risks of these systems.

27. Although we cannot afford to get into the details, incremental approaches based on trial-and-error and small-steps testing represent a promising way to deal with these issues (Woodhouse & Collingridge, 1993; van de Poel, 2016).

28. https://segment-anything.com/

29. https://deepmind.google/technologies/gemini/#introduction

Amoroso, D., & Tamburrini, G. (2020). Autonomous weapons systems and meaningful human control: Ethical and legal issues. Current Robotics Reports , 1 , 187–194. https://doi.org/10.1007/s43154-020-00024-3 .

Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias. Pro Publica . https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing .

Bagdasaryan, E., & Shmatikov, V. (2022). Spinning language models: Risks of propaganda-as-a-service and countermeasures. 2022 IEEE Symposium on Security and Privacy (SP) , San Francisco (CA), 769–786, https://doi.org/10.1109/SP46214.2022.9833572 .

Boholm, M., Möller, N., & Hansson, S. O. (2016). The concepts of risk, safety, and security application in everyday language. Risk Analysis , 36 (2), 320–338. https://doi.org/10.1111/risa.12464 .


Bommasani, R., Hudson, D. A., Adeli, E., et al. (2022). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 .

Briggs, R. A. (2023). Normative theories of rational choice: Expected utility. In Edward N. Zalta & U. Nodelman (Eds.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/fall2023/entries/rationality-normative-utility/ .

Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Conference on Fairness, Accountability and Transparency , New York: PMLR, 77–91.

Crawford, K. (2021). The atlas of AI: Power, politics, and the planetary costs of artificial intelligence . Yale University Press.

Curzon, J., Kosa, T. A., Akalu, R., & El-Khatib, K. (2021). Privacy and artificial intelligence. IEEE Transactions on Artificial Intelligence , 2 (2), 96–108. https://doi.org/10.1109/TAI.2021.3088084 .

de Rus, G. (2021). Introduction to cost benefit analysis: Looking for reasonable shortcuts . Edward Elgar Publishing.

Douglas, H. (2000). Inductive risk and values in science. Philosophy of Science , 67 (4), 559–579. https://doi.org/10.1086/392855 .

Edwards, L. (2022). Regulating AI in Europe: Four problems and four solutions . Ada Lovelace Institute.

FDA. (2024). Artificial Intelligence and medical products: How CBER, CDER, CDRH, and OCP are working together. https://www.fda.gov/media/177030/download?attachment .

Floridi, L. (2021). The European legislation on AI: A brief analysis of its philosophical approach. Philosophy and Technology , 34 , 215–222. https://doi.org/10.1007/s13347-021-00460-9 .

Floridi, L. (2023). On the Brussels-Washington consensus about the legal definition of Artificial Intelligence. Philosophy and Technology , 36 , 87. https://doi.org/10.1007/s13347-023-00690-z .

Gorecki, A. (2020). Cyber breach response that actually works: Organizational approach to managing residual risk . Wiley.

Gutierrez, C. I., Aguirre, A., Uuk, R., Boine, C. C., & Franklin, M. (2023). A proposal for a definition of general purpose Artificial Intelligence systems. Digital Society , 2 , 36. https://doi.org/10.1007/s44206-023-00068-w .

Hansson, S. O. (1996). Decision making under great uncertainty. Philosophy of the Social Sciences , 26 (3), 369–386. https://doi.org/10.1177/004839319602600304 .

Hansson, S. O. (2009). From the casino to the jungle: Dealing with uncertainty in technological risk management. Synthese , 168 (3), 423–432. https://doi.org/10.1007/s11229-008-9444-1 .

Hansson, S. O. (2016). Managing risks of the unknown. In P. Gardoni, C. Murphy, & A. Rowell (Eds.), Risk analysis of natural hazards (pp. 155–172). Springer.

Hansson, S. O. (2023). Risk. In E. Zalta & U. Nodelman (Eds.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/sum2023/entries/risk .

Karaca, K. (2021). Values and inductive risk in machine learning modelling: The case of binary classification models. European Journal of Philosophy of Science , 11 , 102. https://doi.org/10.1007/s13194-021-00405-1 .

Kasneci, E., Seßler, K., Küchemann, S., et al. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences , 103 , 102274. https://doi.org/10.1016/j.lindif.2023.102274 .

Kirchengast, T. (2020). Deepfakes and image manipulation: Criminalisation and control. Information & Communications Technology Law , 29 (3), 308–323. https://doi.org/10.1080/13600834.2020.1794615 .

Knott, A., Pedreschi, D., Chatila, R., et al. (2023). Generative AI models should include detection mechanisms as a condition for public release. Ethics and Information Technology , 25 , 55. https://doi.org/10.1007/s10676-023-09728-4 .

Mahler, T. (2022). Between risk management and proportionality: The risk-based approach in the EU’s Artificial Intelligence Act proposal. Nordic Yearbook of Law and Informatics 2020–2021 , 247–270. https://doi.org/10.53292/208f5901.38a67238 .

Miyagawa, M., Kai, Y., Yasuhara, Y., Ito, H., Betriana, F., Tanioka, T., & Locsin, R. (2019). Consideration of safety management when using Pepper, a humanoid robot for care of older adults. Intelligent Control and Automation , 11 , 15–24. https://doi.org/10.4236/ica.2020.111002 .

Mökander, J., Juneja, P., Watson, D. S., et al. (2022). The US algorithmic accountability act of 2022 vs the EU Artificial Intelligence Act: What can they learn from each other? Minds & Machines , 32 , 751–758. https://doi.org/10.1007/s11023-022-09612-y .

National Institute of Standards and Technology (NIST) (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1 .

Novelli, C., Casolari, F., Rotolo, A., Taddeo, M., & Floridi, L. (2023). Taking AI risks seriously: A new assessment model for the AI act. AI & SOCIETY , 1–5. https://doi.org/10.1007/s00146-023-01723-z .

Novelli, C., Casolari, F., Rotolo, A., Taddeo, M., & Floridi, L. (2024). AI risk assessment: A scenario-based, proportional methodology for the AI act. Digital Society , 3 (1), 1–29. https://doi.org/10.1007/s44206-024-00095-1 .

OECD (2022). Measuring the environmental impacts of artificial intelligence compute and applications: The AI footprint. OECD Digital Economy Papers , 341. Paris: OECD Publishing. https://doi.org/10.1787/7babf571-en .

OECD (2023). Recommendation of the Council on Artificial Intelligence . https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0449 .

Panayides, A. S., et al. (2020). AI in medical imaging informatics: Current challenges and future directions. IEEE Journal of Biomedical and Health Informatics , 24 (7), 1837–1857. https://doi.org/10.1109/JBHI.2020.2991043 .

Prainsack, B., & Forgó, N. (2024). New AI regulation in the EU seeks to reduce risk without assessing public benefit. Nature Medicine . https://doi.org/10.1038/s41591-024-02874-2 .

Queudot, M., & Meurs, M. J. (2018). Artificial Intelligence and predictive justice: Limitations and perspectives. In M. Mouhoub, S. Sadaoui, & O. Ait Mohamed (Eds.), Recent trends and future technology in applied intelligence . Springer. https://doi.org/10.1007/978-3-319-92058-0_85 .

Rakhymbayeva, N., Amirova, A., & Sandygulova, A. (2021). A long-term engagement with a social robot for autism therapy. Frontiers in Robotics and AI , 8 , 669972. https://doi.org/10.3389/frobt.2021.669972 .

Russell, S. J., & Norvig, P. (2021). Artificial intelligence: A modern approach (4th ed.). Pearson.

Soenksen, L. R., Kassis, T., Conover, S. T., Marti-Fuster, B., et al. (2021). Using deep learning for dermatologist-level detection of suspicious pigmented skin lesions from wide-field images. Science Translational Medicine , 13 (581), eabb3652. https://doi.org/10.1126/scitranslmed.abb3652 .

Tamburrini, G. (2022). The AI carbon footprint and responsibilities of AI scientists. Philosophies , 7 (1), 4. https://doi.org/10.3390/philosophies7010004 .

Tanaka, F., Isshiki, K., Takahashi, F., Uekusa, M., Sei, R., & Hayashi, K. (2015). Pepper learns together with children: Development of an educational application. 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids) , 270–275. https://doi.org/10.1109/HUMANOIDS.2015.7363546 .

Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., et al. (2023). Large language models in medicine. Nature Medicine , 29 , 1930–1940. https://doi.org/10.1038/s41591-023-02448-8 .

Thywissen, K. (2006). Components of risk: a comparative glossary. Source , 2 . Bonn: UNU-EHS.

Tian, L., & Oviatt, S. (2021). A taxonomy of social errors in human-robot interaction. ACM Transactions on Human-Robot Interaction (THRI) , 10 (2), 1–32. https://doi.org/10.1145/3439720 .

Twomey, J., Ching, D., Aylett, M. P., Quayle, M., Linehan, C., & Murphy, G. (2023). Do deepfake videos undermine our epistemic trust? A thematic analysis of tweets that discuss deepfakes in the Russian invasion of Ukraine. Plos One , 18 (10), e0291668. https://doi.org/10.1371/journal.pone.0291668 .

UNDRO. (1991). Mitigating natural disasters. Phenomena, effects and options. A manual for policy makers and planners . United Nations.

UNISDR (2017). Natech Hazard and Risk Assessment. https://www.undrr.org/quick/11674 .

Van de Poel, I. (2016). An ethical framework for evaluating experimental technology. Science and Engineering Ethics , 22 (3), 667–686. https://doi.org/10.1007/s11948-015-9724-3 .

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems , 30 .

Verdecchia, R., Sallou, J., & Cruz, L. (2023). A systematic review of Green AI. WIREs Data Mining and Knowledge Discovery , 13 (4), e1507. https://doi.org/10.1002/widm.1507 .

Vermaas, P., Kroes, P., Van de Poel, I., Franssen, M., & Houkes, W. (2011). A philosophy of technology: From technical artefacts to sociotechnical systems. Springer .

Wirtz, B. W., Weyerer, J. C., & Kehl, I. (2022). Governance of artificial intelligence: a risk and guideline-based integrative framework. Government Information Quarterly , 39 (4), 101685.

Woodhouse, E. J., & Collingridge, D. (1993). Incrementalism, intelligent trial-and-error, and political decision theory. In H. Redner (Ed.), An heretical heir of the enlightenment: science, politics and policy in the work of Charles E. Lindblom (pp. 131–154). Westview.

Wu, S., Irsoy, O., & Lu, S. (2023). BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564 .

Xu, F. F., Alon, U., Neubig, G., & Hellendoorn, V. J. (2022). A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS 2022) . New York: Association for Computing Machinery, 1–10. https://doi.org/10.1145/3520312.3534862 .

Zanotti, G., Petrolo, M., Chiffi, D., & Schiaffonati, V. (2023). Keep trusting! A plea for the notion of trustworthy AI. AI & Society . https://doi.org/10.1007/s00146-023-01789-9 .

The research is supported by (1) Italian Ministry of University and Research, PRIN Scheme (Project BRIO, no. 2020SSKZ7R); (2) Italian Ministry of University and Research, PRIN Scheme (Project NAND no. 2022JCMHFS); (3) PNRR-RETURN-NextGeneration EU program, PE0000005; (4) PNRR-PE-AI FAIR-NextGeneration EU program.

Open access funding provided by Politecnico di Milano within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy

Giacomo Zanotti & Viola Schiaffonati

Department of Architecture and Urban Studies, Politecnico di Milano, Milan, Italy

Daniele Chiffi

Contributions

All authors have contributed substantially to the manuscript.

Corresponding author

Correspondence to Giacomo Zanotti.

Ethics declarations

Ethical Approval

Ethics approval was not required for this study.

Consent to Publish

Consent to Participate

Competing Interests

The authors declare they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Rights and Permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Zanotti, G., Chiffi, D. & Schiaffonati, V. AI-Related Risk: An Epistemological Approach. Philos. Technol. 37, 66 (2024). https://doi.org/10.1007/s13347-024-00755-7

Received: 22 January 2024

Accepted: 06 May 2024

Published: 25 May 2024

DOI: https://doi.org/10.1007/s13347-024-00755-7

  • Components of risk
  • General-purpose AI systems
  • Experimental technologies

Open access | Published: 18 May 2024

AI-enhanced integration of genetic and medical imaging data for risk assessment of Type 2 diabetes

Yi-Jia Huang 1,2, Chun-houh Chen (ORCID: 0000-0003-0899-7477) 2 & Hsin-Chou Yang (ORCID: 0000-0001-6853-7881) 1,2,3,4

Nature Communications, volume 15, Article number: 4230 (2024)

  • Disease prevention
  • Genetics research
  • Medical genetics
  • Predictive medicine

Type 2 diabetes (T2D) presents a formidable global health challenge, highlighted by its escalating prevalence, underscoring the critical need for precision health strategies and early detection initiatives. Leveraging artificial intelligence, particularly eXtreme Gradient Boosting (XGBoost), we devise robust risk assessment models for T2D. Drawing upon comprehensive genetic and medical imaging datasets from 68,911 individuals in the Taiwan Biobank, our models integrate Polygenic Risk Scores (PRS), Multi-image Risk Scores (MRS), and demographic variables, such as age, sex, and T2D family history. Here, we show that our model achieves an Area Under the Receiver Operating Characteristic curve (AUC) of 0.94, effectively identifying high-risk T2D subgroups. A streamlined model featuring eight key variables also maintains a high AUC of 0.939. This high accuracy for T2D risk assessment promises to catalyze early detection and preventive strategies. Moreover, we introduce an accessible online risk assessment tool for T2D, facilitating broader applicability and dissemination of our findings.

Introduction

Type 2 diabetes (T2D) is a prevalent global health concern, comprising almost 90% of diabetes mellitus (DM) cases 1 . T2D is associated with severe complications such as retinopathy, nephropathy, and cardiovascular diseases, significantly impacting health and quality of life and increasing healthcare expenses 2 . Early detection and risk assessment of T2D are crucial for effective health management. T2D has a global prevalence of 6% 3 . However, in Taiwan, the prevalence is even higher, at approximately 10%. The mortality and economic burden in medical care among T2D patients increase significantly over time 4 . T2D has a polygenic and multifactorial mode of inheritance 5 , 6 . The significant risk factors include genetic components, food intake, and environmental exposures 7 , 8 .

Genome-wide association studies (GWAS) have identified T2D susceptibility loci and genes, which have been used to develop T2D prediction models 9 , 10 , 11 . Polygenic risk scores (PRS) and weighted PRS have attracted attention for the genetic prediction of T2D 12 , 13 , 14 . However, prediction accuracy still needs to improve for clinical use 15 . Recent studies have combined single nucleotide polymorphisms (SNPs) from multi-ethnic GWAS to calculate PRS and improve prediction accuracy 16 , 17 . Methods such as PRS-CSx have been developed to integrate GWAS summary statistics from multiple ethnic groups and combine multiple PRSs with weights that account for linkage disequilibrium 18 , 19 , 20 . The use of PRS for T2D risk assessment and prediction is crucial in clinical application and precision medicine 21 .
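
To make the PRS idea concrete, the sketch below computes a PRS as a weighted sum of risk-allele dosages. This is a minimal illustration rather than the PRS-CSx pipeline used in the study: the array shapes, the toy weights, and the standardization step are assumptions for demonstration only.

```python
import numpy as np

def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0/1/2) across SNPs.

    dosages: (n_individuals, n_snps) array of allele counts
    effect_sizes: (n_snps,) per-SNP weights, e.g. log-odds from GWAS
    summary statistics or posterior effects from PRS-CS/PRS-CSx.
    """
    return dosages @ effect_sizes

# Toy example: 3 individuals, 4 SNPs, hypothetical weights
dosages = np.array([[0, 1, 2, 1],
                    [2, 0, 1, 0],
                    [1, 1, 1, 2]])
weights = np.array([0.12, -0.05, 0.30, 0.08])
prs = polygenic_risk_score(dosages, weights)

# Standardize so the score is comparable across analyses
prs_std = (prs - prs.mean()) / prs.std()
print(prs_std)
```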

Recent smart medicine and precision health studies have highlighted the utility of medical imaging analysis in disease diagnosis and prediction, in addition to genetic markers. Moreover, previous research has demonstrated the association of several diseases with T2D 22 , 23 , some of which can be diagnosed using medical imaging techniques. For instance, nonalcoholic fatty liver can be diagnosed through abdominal (ABD) ultrasonography 24 , osteoporosis through bone mineral density (BMD) 25 , and cardiovascular disease through electrocardiography (ECG) 26 . These T2D-associated diseases can be effectively diagnosed and detected using medical imaging analysis. Considering this, our study incorporates genetic markers and medical imaging data to assess the risk of T2D. This approach enables a comprehensive evaluation and potential improvement in T2D prediction and risk assessment.

Artificial intelligence, which encompasses machine learning and deep learning, has found extensive applications in genetic research, including disease diagnosis, classification, and prediction using supervised learning 27 , 28 . Extreme Gradient Boosting (XGBoost), a supervised tree-based machine learning approach 29 , has demonstrated superior performance in classification and prediction. Successful applications of XGBoost in precision medicine include chronic kidney disease diagnosis 30 , orthopedic auxiliary classification 31 , chronic obstructive pulmonary prediction 32 , and multiple phenotypes prediction 33 .
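
As a rough illustration of this kind of supervised workflow, the snippet below trains an XGBoost classifier on synthetic stand-ins for the study's predictors and reports a validation AUC. The hyperparameters and the simulated data are assumptions, not the configuration used by the authors.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for predictors such as age, sex, family history, PRS
X = rng.normal(size=(5000, 4))
logit = 0.8 * X[:, 0] + 0.5 * X[:, 3] - 1.0
y = (rng.random(5000) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                      subsample=0.8, colsample_bytree=0.8, eval_metric="auc")
model.fit(X_tr, y_tr)
print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```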

Taiwan Biobank (TWB), established in 2012, is a valuable resource for the integrative analysis of genetic and medical imaging data 34 . The TWB enrolled participants aged over 20 from the Han Chinese population in Taiwan and collected baseline questionnaires, blood, urine samples, and their biomarkers of lab tests, as well as genotyping data from all participants. Follow-up data, including repeated questionnaires, biomarker measurements, and medical imaging data, were collected every two to four years. Medical imaging data includes ABD, carotid artery ultrasonography (CAU), BMD, ECG, and thyroid ultrasonography (TU). The integrative analysis of genetic and medical imaging data holds great promise for disease risk assessment and prediction, as demonstrated by recent studies 35 , 36 , 37 , 38 . Here, we present a study integrating genome-wide SNPs and multimodality imaging data from the TWB for T2D risk assessment, marking an advancement in the field. We developed machine learning models incorporating genetic information, medical imaging, demographic variables, and other risk factors. Furthermore, we identified high-risk subgroups for T2D, providing insights into T2D precision medicine.

This study comprised two primary analyses: a genetic-centric analysis (Analysis 1; detailed in Fig.  1 and the Methods section) and a genetic-imaging integrative analysis (Analysis 2; detailed in Fig.  2 and the Methods section). Data used in the two analyses are summarized (Supplementary Table  S1 ). A total of 68,911 participants from the TWB were included in the analysis (Fig. S 1 ).

Figure 1

The dataset containing information from 60,747 individuals after data quality control (QC) was divided into several subsets: (i) The genome-wide association study (GWAS) samples (Dataset 1, N  = 35,688), training samples (Dataset 2, N  = 12,236; Dataset 4, N  = 40,787), and validation samples (Dataset 3, N  = 3060; Dataset 5, N  = 10,197). For classification analysis, testing samples comprised Dataset 6 ( N  = 8827) and Dataset 7 ( N  = 936), while for prediction analysis, they were represented as Datasets 6’ ( N  = 8827) and Dataset 7’ ( N  = 936); B Sample size . Total sample size, along with the number of cases and the number of controls, are shown for each of the four phenotype definitions in Datasets 1 – 7; C Phenotype definition criteria . The definition and sample size for the four Type 2 Diabetes (T2D) phenotype definitions is shown. D Analysis flowchart . The analysis flow comprises three steps, starting with selecting T2D-associated single nucleotide polymorphisms (SNPs) and polygenic risk score (PRS), then selecting demographic and environmental covariates, and the best XGBoost model was established using the selected features. As to the first step, SNPs can be chosen from A our own GWAS with an adjustment for age, sex, and top ten principal components (PCs), B published studies based on single ethnic populations, and C published studies based on multiple ethnic populations. Source data are provided as a Source Data file.

Figure 2

Phenotype Definition IV was used as an example to illustrate the process. The data containing information from 7,786 individuals were divided into four subsets: a training dataset ( N  = 4689), a validation dataset ( N  = 1175), and two independent testing datasets ( N  = 1469 for the first dataset and N  = 444 for the second independent dataset). Subsequently, the best XGBoost model was established. B Flowchart of PRS construction . The Polygenic Risk Score (PRS) was constructed using PRS-CSx, utilizing genome-wide association study (GWAS) summary statistics from the European (EUR), East Asian (EAS), and South Asian (SAS) populations obtained from the analysis of the DIAGRAM Project. Source data are provided as a Source Data file.

Genetic-centric analysis – Comparison of prediction models

We evaluated the prediction performance under different scenarios hierarchically (the best scenario for a previous variable was fixed when discussing the next variable) in the following order: the sources and significance levels of T2D-associated SNPs (Fig. 3A and Fig. S 2 ), T2D phenotype definitions (Fig. 3B ), family history variable combinations (Fig. 3C and Fig. S 3 ), demographic variable combinations (Fig. 3D ), demographic and genetic variable combinations (Fig. 3E ), and SNP and PRS combinations (Figs. 3F and 3G ). The findings are summarized as follows: First, using T2D-associated SNPs from the previous large-sample-size GWAS 11 as predictors had the highest AUC of 0.557, but this AUC was not significantly higher than those obtained using the SNPs identified by our smaller-sample-size GWAS under different thresholds of statistical significance (Fig. 3A ), although our GWASs did identify some T2D-associated SNPs (Fig. S 4 ). Second, the phenotype defined by self-reported T2D with HbA1C ≥ 6.5% or fasting glucose ≥126 mg/dL (i.e., T2D Definition IV) had the highest AUC of 0.640; its AUC was significantly higher than the AUCs of the other three T2D definitions (Fig. 3B ). Third, sibs’ disease history had a significantly higher AUC (0.732) than parents’ disease history (0.670; p = 0.009). Moreover, additive parent-and-sib disease history had the highest AUC of 0.758, significantly higher than parent-only history ( p < 0.001) (Fig. 3C ). Fourth, the joint effect of age, sex, and additive parent-sib disease history had the highest AUC of 0.884, significantly higher than other demographic variable combinations, except for the combination of age and additive parent-sib disease history (Fig. 3D ). Fifth, whether or not SNPs were included, models combining demographic variables and PRS outperformed the models without PRS (Fig. 3E ), although genetic factors improved AUC by at most 3% conditional on demographic characteristics (age, sex, and family history of T2D). Finally, given T2D-associated SNPs, AUC significantly increased if PRS was included (Fig. 3F ); T2D-associated SNPs provided a limited additional effect if PRS was already included (Fig. 3G ).

Figure 3

A bar chart displays AUC. The two-sided DeLong test examined the difference between AUCs. Bonferroni’s correction was applied to control for a family-wise error rate in multiple comparisons. Symbols *, **, and *** indicate p -values < 0.05, 0.01, and 0.001, respectively. A SNP selection . Model predictors were SNPs selected from published studies or our GWAS under different p-value thresholds, where our GWAS association test is a two-sided Wald test for the slope coefficient in a logistic regression. The average AUCs of prediction models for four phenotype definitions were compared. B T2D Phenotype Definition . In addition to including the selected variables in Fig. 3A, the AUCs of four phenotype definitions were compared. C Family history of T2D . In addition to including the selected variables in Fig. 3A, B, the AUCs of the four types of T2D family history (i.e., (i): parents (binary factor), (ii) sibs (binary factor), (iii) either parents or sibs (binary factors), and (iv) both parents and sibs (ordinal factor)) were compared. D Demographic variables . In addition to including the selected variables in Fig. 3A–C, the AUCs of different combinations of demographic factors, including age, sex, and family history of T2D, are compared. E PRS and demographic variables . In addition to including the selected variables in Fig. 3A–D, the AUCs of different combinations of genetic variables, including SNPs, PRS-CS, and PRS-CSx, and demographic variables, including age, sex, and family history of T2D, are compared. F Impact of including PRS after SNPs . The AUCs of the models that consider SNPs, SNPs+PRS-CS, and SNPs+PRS-CSx as predictors are compared. G Impact of including additional SNPs after PRS . The additional 137 SNPs were collected from published studies (Supplemental Text  2 ). The AUCs of the models that consider additional SNPs given PRS in the model are compared. Source data are provided as a Source Data file.

Among the different prediction models, the model with predictors PRS-CSx, age, sex, and family history of T2D had the highest AUC of 0.915 (Fig. 4A ) for Definition IV of T2D based on the first testing dataset (i.e., Dataset 6’ in Fig. 1 ). The optimal threshold on the fitted value used to predict T2D or non-T2D in the XGBoost model, determined by the Youden index, was 0.16. The accuracy, sensitivity, specificity, and F1 indices were 0.843, 0.844, 0.843, and 0.672, respectively. Furthermore, the model was tested on the second independent testing dataset (i.e., Dataset 7’ in Fig. 1 ), and a result similar to the first testing dataset was found: AUC = 0.905, accuracy = 0.843, sensitivity = 0.846, specificity = 0.842, and F1 = 0.644. AUCs are also provided for the other three T2D definitions (Fig. S 5 ).
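
The reported cutoff and classification metrics can be reproduced in principle with a routine like the following. It is a minimal sketch assuming binary labels and predicted probabilities; the function and variable names are illustrative, not taken from the study code.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

def youden_threshold_and_metrics(y_true, y_score):
    """Pick the cutoff maximizing Youden's J = sensitivity + specificity - 1,
    then report accuracy, sensitivity, specificity, and F1 at that cutoff."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    cutoff = thresholds[np.argmax(tpr - fpr)]
    y_pred = (y_score >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "auc": roc_auc_score(y_true, y_score),
        "cutoff": cutoff,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
```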

Figure 4

A AUCs of all models based on Phenotype Definition IV. A heatmap summarizes the AUCs of all models based on Phenotype Definition IV (i.e., T2D was defined by self-reported T2D, HbA1c, and fasting glucose). The genetic variables are shown on the X-axis, and the demographic variables are shown on the Y-axis. B Positive correlation between PRS and T2D odds ratio. In each decile of PRS based on PRS-CSx, the odds ratio of T2D risk and its 95% confidence interval were calculated based on an unadjusted model (blue line) and an adjusted model considering age, sex, and T2D family history (red line). The reference group was the PRS group in the 40–60% decile. The horizontal bars represent the odds ratio estimates (square symbol) +/- their 95% confidence intervals (left and right ends) at each PRS decile. C High-risk group. In the chart, the figures from the inner to the outer represent (i) the case-to-control ratio, (ii) the number of cases, and (iii) the number of controls in the PRS decile subgroups. D Association of age, sex, T2D family history, and PRS with T2D. In the univariate analysis, the p-values for age, sex, family history, and PRS were 4.17 × 10^-20, 7.08 × 10^-7, 9.41 × 10^-13, and 2.06 × 10^-13, respectively. In the multivariate analysis, the p-values for age, sex, family history, and PRS were 2.00 × 10^-16, 5.56 × 10^-5, 1.43 × 10^-10, and 5.49 × 10^-13, respectively. E Risk factors for T2D. Kaplan-Meier curves reveal that age (older individuals), sex (males), T2D family history (a larger number of parents and siblings with T2D), and PRS (high PRS decile group) are risk factors for T2D. F Median event time of T2D. Examples of the median event time for developing T2D are provided based on a multivariate Cox regression model, both without and with incorporating PRS. NA indicates not assessable. Source data are provided as a Source Data file.

The importance of each predictor was evaluated through a backward elimination procedure. The optimal model incorporating age, sex, family history of T2D, and PRS achieved an AUC of 0.915. The AUC reductions upon removing individual variables are as follows: (a) omitting the age variable resulted in an AUC of 0.839, a reduction of 0.076; (b) excluding the sex variable resulted in an AUC of 0.905, a decrease of 0.010; (c) removing the family history of T2D variable yielded an AUC of 0.881, a reduction of 0.034; (d) eliminating the PRS variable resulted in an AUC of 0.884, a reduction of 0.031. Based on the decrease in AUC, the impact size appears to be in the order of age > family history > PRS > sex. Additionally, we evaluated feature importance (see the Methods section), and the order of feature importance is family history > age > PRS > sex. Our findings consistently highlight age and family history as the most crucial risk factors for T2D.
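
One simple way to estimate these AUC reductions is to refit the model with each predictor left out, as sketched below. This assumes pandas DataFrames and an XGBoost configuration chosen for illustration; it is not the authors' exact procedure.

```python
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def auc_drop_per_feature(X_tr, y_tr, X_te, y_te, feature_names):
    """Refit the model with each feature removed and report the AUC reduction
    relative to the full model (a larger drop suggests a more important feature)."""
    def fit_auc(cols):
        m = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
        m.fit(X_tr[cols], y_tr)
        return roc_auc_score(y_te, m.predict_proba(X_te[cols])[:, 1])

    full_auc = fit_auc(list(feature_names))
    drops = {f: full_auc - fit_auc([c for c in feature_names if c != f])
             for f in feature_names}
    return full_auc, drops
```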

Genetic-centric analysis – Assessment of family history of T2D

Family history encompasses both genetics and environment. We examined the connection between the family history of T2D – treated as a graded scale (0, 1, 2, 3, and 4) – and the genetic component represented by the PRS. Through ordinal logistic regression, we observed a beta coefficient of 0.808 and an associated odds ratio (OR) of 2.24 ( p = 1.65 × 10^-296). The remarkably small p-value signals a substantial association between the PRS and familial T2D status: for each one-unit rise in an individual’s PRS, the odds of belonging to a higher family history category for T2D increase by a factor of 2.24. The findings underscore a strong statistical link between genetic predisposition, as captured by the PRS, and the gradation of family history of T2D.
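
An ordinal (proportional-odds) logistic regression of the graded family-history variable on the PRS could look like the sketch below, which uses statsmodels' OrderedModel. The column names and the choice of library are assumptions; the odds ratio is recovered as exp(beta).

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

def prs_vs_family_history(df):
    """Proportional-odds model of family-history grade (0-4) on standardized PRS.
    Assumes df has columns 'family_history' (0-4) and 'prs'."""
    endog = pd.Categorical(df["family_history"], ordered=True)
    model = OrderedModel(endog, df[["prs"]], distr="logit")
    res = model.fit(method="bfgs", disp=False)
    beta = res.params["prs"]
    return beta, np.exp(beta)  # odds ratio per one-unit increase in PRS
```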

Furthermore, we calculated the Population Attributable Risk (PAR) by dichotomizing PRS into a high-risk group (above the 80th PRS percentile) and a non-high-risk group (below the 80th percentile). Among the 59,811 participants, the breakdown was as follows: high PRS with family history ( N = 5473), high PRS without family history ( N = 6489), non-high PRS with family history ( N = 16,054), and non-high PRS without family history ( N = 31,795). The PAR estimate was 10.17%, indicating that 10.17% of the family history of T2D is attributable to genetic heritability. When a broader definition of the high-risk group (above the 60th PRS percentile) and non-high-risk group (below the 60th percentile) was considered, the PAR estimate increased to 18.41%.

Further consideration of environmental factors, including education level, drinking experience, exercise habits, the number of exercise types, and SNP-SNP interactions with and without SNPs’ main effect, did not improve T2D prediction (Supplementary Table  S2 ). Considering model parsimony, the final model did not include these environmental factors and SNP-SNP interactions. In addition to prediction models, classification models were also established. The AUCs in classification models (Fig. S 6 ) were generally slightly higher than those in prediction models (Fig. S 5 ).

Genetic-centric analysis – Assessment of PRS

The positive association between PRS and T2D risk is shown (Fig. 4B ). Compared to the participants in the 40–60% PRS decile group, those in the top 10% decile group had a 4.738-fold risk of developing T2D (95% confidence interval: 3.147–7.132, p < 0.001) and a 4.660-fold risk (95% confidence interval: 2.682–8.097, p < 0.001) after adjusting for age, sex, and family history. In addition, we performed a stratified analysis across combinations of age, sex, and family history subgroups to identify high-risk subgroups, where age was stratified into four subgroups based on quartiles: 0–25%, 25–50%, 50–75%, and 75–100%, corresponding to ≤43, 43–52, 52–59, and >59 years of age, respectively (Fig. S 7 ). We identified a high-risk subgroup of women who were older than 59 and had a family history of T2D: the case-to-control ratio was as high as 7.3–13.0 in the 80–100% PRS decile groups (Fig. 4C ), much higher than the 1.6 obtained when PRS was not considered (i.e., PRS at 0–100%) (Fig. 4C ). Due to ambiguity or instability in the evidence for other combinations, we chose not to report them.
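
Decile-specific odds ratios of this kind are commonly obtained from a logistic regression with indicator variables for each PRS decile and the 40–60% group as the reference, as in the sketch below. The column names and the statsmodels implementation are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def odds_ratio_by_prs_decile(df, adjust=("age", "sex", "family_history")):
    """Odds ratio of T2D for each PRS decile versus the 40-60% reference group.
    Assumes df has columns 'prs', 't2d' (0/1), and the adjustment covariates."""
    df = df.copy()
    df["decile"] = pd.qcut(df["prs"], 10, labels=False)       # codes 0..9
    dummies = pd.get_dummies(df["decile"], prefix="d")
    dummies = dummies.drop(columns=["d_4", "d_5"])             # 40-60% as reference
    X = sm.add_constant(pd.concat([dummies, df[list(adjust)]], axis=1).astype(float))
    res = sm.Logit(df["t2d"].astype(float), X).fit(disp=False)
    ors = np.exp(res.params.filter(like="d_")).rename("OR")
    ci = np.exp(res.conf_int().filter(like="d_", axis=0)).rename(columns={0: "2.5%", 1: "97.5%"})
    return pd.concat([ors, ci], axis=1)
```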

Genetic-centric analysis – Risk of developing T2D

Among the 8347 non-T2D participants at baseline in the first testing dataset of 8827 participants, 220 reported T2D at follow-up. The Cox regression analyses considered two types of time scale and three treatments of the sex variable and obtained consistent results (Supplementary Table S3 ). For illustration, using the analysis in which time-on-study was the time scale, with age at baseline, sex, family history of T2D, and PRS as covariates, all four covariates were significantly associated with T2D ( p < 0.001) (Fig. 4D ). Increased age, higher PRS, and stronger T2D family history were associated with higher T2D risk. Elderly males with a strong family history and high PRS had the most severe T2D risk (Fig. 4E for multivariate Cox regression and Fig. S 8 for univariate Cox regression). We also provide the predicted time-to-event in weeks (Fig. 4F ). For example, a 50-year-old man with one family member who had T2D reaches the median T2D-free time after 460 weeks (95% CI, 384–NA). The median time to develop T2D shortens to 419 weeks (95% CI, 384–NA) after considering a standardized PRS of 0.66 (equivalent to a PRS risk subgroup in the top 25% of the population).
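
A time-on-study Cox model of this kind, including predicted median time to T2D for a given covariate profile, can be sketched with the lifelines package as below. The library choice, column names, and the example profile are assumptions, not the authors' implementation.

```python
import pandas as pd
from lifelines import CoxPHFitter

def fit_cox_and_median_time(df, new_profiles):
    """Cox model with follow-up weeks as the time scale and age at baseline, sex,
    family history, and PRS as covariates; returns the predicted median time to
    T2D for new covariate profiles."""
    cph = CoxPHFitter()
    cph.fit(df[["weeks", "t2d_event", "age", "sex", "family_history", "prs"]],
            duration_col="weeks", event_col="t2d_event")
    return cph.predict_median(pd.DataFrame(new_profiles))

# Hypothetical profile: 50-year-old man, one affected relative, PRS at +0.66 SD
# fit_cox_and_median_time(df, [{"age": 50, "sex": 1, "family_history": 1, "prs": 0.66}])
```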

A linear regression analysis was performed to assess the impact of exercise on HbA1c. Multiple testing across 110 analyses was corrected using the Bonferroni correction, and the significance level was set at 4.5 × 10^-4. Individuals engaging in regular exercise showed a significant reduction in HbA1c, by an average of 0.09% ( p < 0.001), compared to those who did not exercise regularly. Moreover, individuals with a high PRS who exercised demonstrated a greater reduction in HbA1c (0.13%) than those with a low PRS (0.08%). The results also suggested that T2D patients who regularly engaged in exercise showed a noteworthy improvement of 0.32% in HbA1c compared to T2D patients who did not exercise regularly. In addition, among the various types of exercise, walking for fitness exhibited the most robust reduction in HbA1c across all samples, including high- and low-risk subgroups and both T2D and non-T2D groups (Fig. S 9 ). On average, participants engaged in walking for fitness 18.30 times a month (standard deviation = 8.64) for approximately 48.13 minutes per session (standard deviation = 22.92).
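
The exercise analysis amounts to an ordinary least squares regression with a Bonferroni-adjusted significance threshold, roughly as sketched below. The formula, the 0/1 coding of exercise, and the age/sex adjustment are assumptions for illustration.

```python
import statsmodels.formula.api as smf

def exercise_effect_on_hba1c(df, n_tests=110, alpha=0.05):
    """OLS of HbA1c on regular-exercise status (coded 0/1), with a Bonferroni
    threshold of alpha / n_tests (4.5e-4 for 110 analyses)."""
    res = smf.ols("hba1c ~ exercise + age + sex", data=df).fit()
    coef, pval = res.params["exercise"], res.pvalues["exercise"]
    return coef, pval, pval < alpha / n_tests
```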

Genetic-centric analysis – The ability of our model to detect T2D early

To investigate the early detection capability of our model for T2D, we performed an analysis focusing on 550 women participants older than 59 years, all of whom had a family history of T2D. We identified them as high risk if they possessed a high PRS, even though they were initially reported as non-T2D at baseline. Thirty-six had converted to T2D at follow-up, and 514 remained non-T2D. We predicted their T2D status. G1–G4 are the groups of participants classified as true positive, false negative, false positive, and true negative, respectively (Fig. 5A ). We then evaluated whether G3 was indeed misclassified by our prediction model or whether our prediction had instead corrected problems in the self-reported T2D status, by further investigating: (1) their follow-up time and current risk in the Cox regression model; (2) HbA1c and fasting glucose; and (3) the accuracy of self-reported disease status.

Figure 5

A Four subgroups (N = 550). B Survival rate (N = 550). C Median survival time (N = 550). P-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 0.092, 0.0014 (**), and 2.22 × 10^-16 (***), respectively. D Follow-up time (N = 550). P-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 0.056, 0.32, and 0.14, respectively. E T2D risk (N = 550). P-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 0.018 (*), 0.073, and 0.0039 (**), respectively. F HbA1c (N = 550). At baseline, p-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 2.21 × 10^-14, 0.0039, and 3.00 × 10^-5, respectively; at follow-up, they were 1.50 × 10^-13, 6.01 × 10^-4, and 4.55 × 10^-6, respectively. G Fasting glucose (N = 550). At baseline, p-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 2.06 × 10^-12, 6.66 × 10^-4, and 1.63 × 10^-2, respectively; at follow-up, they were 8.30 × 10^-8, 1.38 × 10^-3, and 1.84 × 10^-2, respectively. H Phenotype definition in G3 (N = 395). Many individuals in G3 do not satisfy T2D Phenotype Definition IV. In Fig. 5C–G, two-sided Wilcoxon rank-sum tests were applied to compare group differences. The box plots’ center lines indicate the medians, the lower and upper boundaries of the boxes represent the first and third quartiles, and the whiskers extend to cover a range of 1.5 interquartile distances from the edges. The violin plots’ upper and lower bounds depict the minimum and maximum values. Source data are provided as a Source Data file.

The Kaplan-Meier curve for each subgroup is depicted (Fig.  5B ). The distributions of median survival time for each subgroup are illustrated (Fig.  5C ). The distributions of the time period from baseline to follow-up for each subgroup are presented (Fig.  5D ). The distributions of Type 2 diabetes (T2D) risk at follow-up for each subgroup are shown (Fig.  5E ). The distributions of HbA1c levels at baseline and follow-up for each subgroup are displayed (Fig.  5F ). The distributions of fasting glucose levels at baseline and follow-up are demonstrated (Fig.  5G ).

First, compared to G4 (true negative), G3 had a significantly lower T2D-free probability (Fig. 5B ), shorter median survival time (Fig. 5C ), higher T2D risk under similar follow-up time (Fig. 5D and 5E ), higher HbA1c (Fig. 5F ), and higher fasting glucose (Fig. 5G ). Second, compared to G1 (true positive), G3 had a comparable survival rate (Fig. 5B ), median survival time (Fig. 5C ), and T2D risk under similar follow-up time (Fig. 5D and 5E ), but lower HbA1c (Fig. 5F ) and fasting glucose (Fig. 5G ). We did not compare G2 and G3 because of the small sample size in G2. Finally, among the 395 participants in G3, 80.76% had been removed from our previous analysis because their baseline HbA1c and fasting glucose violated the criteria of the phenotype definition (Fig. 1C ); 339 participants were removed because their follow-up HbA1c and fasting glucose violated the formal T2D criteria; only 34 self-reported non-T2D participants were truly non-T2D, with HbA1c < 6.5% and fasting glucose < 126 mg/dL (Fig. 5H ). Overall, the results consistently indicate that G3 represents individuals in a pre-T2D stage, which our model can detect early.

Genetic-imaging integrative analysis – Model performance and essential features

The model that combined all four types of image features performed best. Moreover, the model based on BMD image features exhibited a higher AUC, accuracy, specificity, and F1 than the models based on any of the other three types of images (Fig. 6A ). The model based on image features had an AUC of 0.898, higher than those of genetic information (AUC = 0.677) and demographic factors (AUC = 0.843). Integrating image features, genetic information, and demographic factors increased the AUC to 0.949 in the first testing data (Fig. 6B ); the results for each of the four image types are also provided (Fig. S 10 ). The accuracy, sensitivity, specificity, and F1 of the model in the first testing data were 0.871, 0.878, 0.870, and 0.663, respectively, based on a classification threshold of 0.03. The model also performed reasonably well in the second testing dataset, with AUC = 0.929, accuracy = 0.854, sensitivity = 0.789, specificity = 0.862, and F1 = 0.558. The results of a prediction model using tuned parameters are also provided (Supplementary Table S4 ). As no significant improvement was observed, this paper discusses the default model. According to the estimated feature importance in the best XGBoost model, genetic factors (PRS), all four types of medical images, and demographic variables provided informative features for risk assessment, such as PRS (genetics), family history and age (demographic factors), fatty liver (ABD images), end-diastolic velocity in the right common carotid artery (CAU images), RR interval (ECG images), and spine thickness (BMD images). Of the 152 medical imaging features, 125 were selected in the final model (Fig. 6C ).

Figure 6

A Performance comparison of medical imaging data analysis . The area under the receiver operating characteristic (ROC) curve (AUC), accuracy (ACC), sensitivity (SEN), specificity (SPEC), and F1 score are compared for the integrative analysis of four types of medical images (All) and individual medical image analyses, including BMD, ECG, CAU, and ABD. B The model that combines four types of medical imaging, PRS, and demographic variables shows the highest AUC of 0.949 . ROC plots and the corresponding AUC for the models considering medical image features (I), genetic PRS (G), and demographic variables, including age, sex, T2D family history (D), and their combinations. C An optimal model combining medical imaging, PRS, and demographic variables . The best model’s top 20 features with a high feature impact include the medical image, genetic, and demographic features. D Positive correlation between MRS and T2D odds ratio . In each decile of MRS based on four types of medical images, the odds ratio of T2D risk and its 95% confidence interval were calculated based on an unadjusted model (blue line) and an adjusted model considering age, sex, and T2D family history (red line), with the MRS group in the 40–60% decile serving as the reference group. The horizontal bars are presented as the odds ratio estimates (square symbol) +/– its 95% confidence intervals (left and right ends). E High-risk group . The figures from the inner to the outer in the chart display (i) the case-to-control ratio, (ii) the number of cases, and (iii) the number of controls in the MRS decile subgroups. F Input page of the online T2D prediction website . Personal information, including age, sex, family history of T2D, PRS, and MRS, is input to calculate T2D risk. PRS and MRS are optional, and a reference distribution is provided. G Output page of the online T2D prediction website . Source data are provided as a Source Data file.

To address the challenges of practical clinical implementation of the best XGBoost model, we propose an alternative model that requires a limited number of features. We systematically calculated each feature’s incremental contribution to the AUC by sequentially including the features with the highest importance, and we kept the top features showing a positive AUC increment. The analysis revealed that a sub-model incorporating only the following eight crucial variables maintains a commendable AUC of 0.939 (Fig. S 11 ): family history (from the questionnaire), age (from the questionnaire), fatty liver (from ABD images), spine thickness (from BMD images), PRS (from genetic data), end-diastolic velocity in the right common carotid artery (R_CCA_EDV, from CAU images), RR interval (from ECG images), and end-diastolic velocity in the left common carotid artery (L_CCA_EDV, from CAU images). This streamlined model significantly reduces the number of risk predictors while preserving high prediction accuracy, demonstrating promising potential for practical application in clinical settings. Moreover, the reduced number of risk predictors alleviates concerns about model overfitting.
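
A greedy forward selection of this kind, adding features in order of importance and keeping only those that raise the validation AUC, might be sketched as follows; the XGBoost settings and DataFrame interface are assumptions.

```python
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def greedy_auc_selection(X_tr, y_tr, X_val, y_val, ranked_features):
    """Add features in order of importance; keep a feature only if it increases
    the validation AUC, yielding a streamlined model."""
    def fit_auc(cols):
        m = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
        m.fit(X_tr[cols], y_tr)
        return roc_auc_score(y_val, m.predict_proba(X_val[cols])[:, 1])

    selected, best_auc = [], 0.5
    for feat in ranked_features:   # e.g. family history, age, fatty liver, ...
        auc = fit_auc(selected + [feat])
        if auc > best_auc:
            selected.append(feat)
            best_auc = auc
    return selected, best_auc
```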

Genetic-imaging integrative analysis – Multi-image risk score (MRS)

Each participant’s multi-image risk score (MRS) was calculated as the predicted probability of being a T2D case from the XGBoost model that analyzed the medical imaging features for T2D prediction. The odds ratio and its confidence interval for the association between MRS and T2D are shown by MRS percentile (Fig. 6D ). Compared to the participants in the 40–60% MRS decile group, the risk of T2D increased with MRS. Importantly, we further identified that, for men older than 54 years with a family history of T2D, the case-to-control ratio was 9.3 in the 90–100% MRS decile group, much higher than the 1.3 obtained when MRS was not considered (Fig. 6E ).
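
In other words, the MRS is the predicted case probability from an imaging-only model, which could be computed roughly as follows; the function and column names are illustrative assumptions.

```python
import pandas as pd
from xgboost import XGBClassifier

def multi_image_risk_score(img_train, y_train, img_all):
    """MRS = predicted T2D probability from an XGBoost model trained on
    medical-imaging features only (ABD, CAU, BMD, ECG)."""
    model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(img_train, y_train)
    mrs = pd.Series(model.predict_proba(img_all)[:, 1], index=img_all.index, name="mrs")
    mrs_decile = pd.qcut(mrs, 10, labels=False)   # decile bins, as for the PRS
    return mrs, mrs_decile
```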

Online T2D-risk assessment

We have established a website where users can calculate their T2D risk online. To obtain the risk assessment, users are required to provide age, sex, and family history of T2D, and they can optionally provide PRS and MRS (Fig.  6F ). PRS and MRS can be entered manually or uploaded as a file (Supplemental Text  1 ). Additionally, we have provided PRS and MRS risk percentages based on the study population as a reference. The online risk assessment offers information, including the risk of developing T2D over 3, 5, and 7 years, T2D-free probability, and T2D risk with and without considering PRS (Fig.  6G ). The assessment takes into account both PRS and MRS (Fig.  6G ). For example, consider a 50-year-old male with a family history of T2D and PRS 1.5 and MRS 1.5. Without considering PRS, the risk (probability) of developing T2D after a 7-year follow-up is 0.23. However, when PRS is considered, the risk increases to 0.37. Furthermore, considering MRS further increases the risk to 0.81. The online tool provides these valuable insights to users based on their input data.

In this study, we conducted a comparative analysis of two prediction models based on GWAS data. The first method utilized T2D-associated SNPs derived from our GWAS with a limited sample size, either as individual predictors or in combination to construct a PRS for the prediction model. The second method incorporated T2D-associated SNPs from previously published GWASs with a significantly larger sample size or utilized summary statistics of whole-genome SNPs from GWASs with a considerably larger sample size to construct the PRS. Notably, the latter approach yielded a higher prediction AUC. These findings underscore the substantial impact of sample size in GWAS, PRS construction, and subsequent classification and prediction analyses, aligning with previous research 39 . Consequently, in situations where the sample size is limited, we propose utilizing external genetic information such as SNPs and summary statistics from published studies with larger sample sizes, which not only facilitates the development of a more predictive PRS and model but also reduces computational overhead 40 .

Our study investigated the importance of employing a precision phenotype definition to evaluate disease risk. We also addressed a potential limitation associated with the prevalent use of self-reported disease status. Utilizing four T2D definitions, Type IV, which integrates self-reported T2D with measurements of HbA1c and GLU-AC, emerged as a definition closely aligned with clinical practice. Our results demonstrate that the model based on T2D Definition IV exhibits the highest prediction accuracy. Consequently, in this study, superior diagnostic accuracy corresponds with higher prediction accuracy. Furthermore, the application of self-reported T2D (Definition Type I) yields an AUC significantly lower than the AUC of Definition Type IV. This outcome underscores a potential limitation associated with the commonly used self-report disease status, which functions as a convenient phenotype in the analysis of TWB data.

Our study emphasizes the superiority of disease family history as a predictor of T2D compared to T2D-associated SNPs and PRS. The inclusion of genetic factors such as significant SNPs and PRS as additional predictors, given family history, results in only modest improvements in the model’s predictive capability. Family history encompasses genetic, epigenetic, and shared environmental influences, which are crucial in understanding the etiology of T2D 41 . Additionally, we observed that the disease history of siblings is more informative for prediction than the disease history of parents 42 .

T2D subgrouping can facilitate the implementation of precision medicine in clinical practice, particularly when utilizing complex data 43 . This study demonstrated a positive association between PRS and MRS with T2D risk. Notably, we identified a high-risk subgroup of women older than 59 years with a family history of T2D, where the case vs. control ratio of sample size in the 80–100% PRS decile group ranged from 7 to 13, significantly higher than the overall population. Similarly, for MRS, we found a high-risk subgroup of men older than 54 years with a family history of T2D, where the case vs. control ratio of sample size in the 90–100% MRS decile group was 9.3, considerably higher than the ratio of 1.3 when MRS was not considered. These results demonstrate the utility of PRS and MRS in identifying high-risk subgroups for T2D.

In the PRS-CSx method, we considered three weighting methods to combine several population-specific PRSs into the final PRS: (1) an equal weight, (2) the population-specified weight, and (3) the meta-effect size for each SNP. Our results showed that the meta-effect size obtained a worse performance. The population-specified weight performed best; however, the result may vary between cohorts.

In this study, our PRS based on PRS-CSx achieved an AUC of 0.732 for T2D prediction. The AUC increased to 0.915 after further including age, sex, and family history of T2D in the prediction model. When comparing our results with previous publications, Khera, Chaffin 44 achieved an AUC of 0.725 using a logistic regression that included age, sex, and a PRS constructed with LDpred 45 . Imamura, Shigemizu 14 achieved an AUC of 0.648 with a PRS built from 49 T2D-associated SNPs with LD weights, and the AUC increased to 0.787 after including age, sex, and BMI. Ge, Irvin 18 achieved an AUC of 0.694 with a PRS constructed using summary statistics from three large-scale GWASs. Walford, Porneala 46 achieved an AUC of 0.641 with a PRS built from 63 SNPs, age, and sex. In summary, our study utilized phenotype refinement through HbA1c and fasting glucose, employed XGBoost with superior performance, and considered the family history of T2D as a critical factor for T2D prediction, leading to improved performance compared to previous studies.

Including environmental factors such as education level, drinking level, exercise habit, and the number of exercise types in our models increased prediction accuracy for non-T2D participants but decreased accuracy for T2D cases. The overall improvement in prediction performance achieved by including these environmental factors was relatively modest and did not reach statistical significance. Similarly, including SNP-SNP interactions in the models did not lead to a significant improvement. While SNP-SNP interactions have been proposed as a potential explanation for missing heritability 47 , our findings indicate that incorporating these interactions does not provide additional benefits when PRS is already included in the model. This could be attributed to PRS already capturing a substantial portion of the genetic component, making incorporating SNP main effects and SNP-SNP interactions less impactful.

This study demonstrates good ability in detecting T2D cases, but we observed that some self-reported non-T2D individuals might be misclassified as T2D cases. Further investigation suggested that these cases represent individuals in a pre-T2D stage. Firstly, their T2D risk at the follow-up time was higher than true non-T2D participants but lower than the confirmed T2D cases, indicating an elevated but not fully developed risk. Secondly, these individuals exhibited higher HbA1c and fasting glucose levels than true non-T2D participants, albeit lower than confirmed T2D cases, suggesting a pre-T2D stage. Lastly, when redefining the phenotype using HbA1c and fasting glucose, most of these participants did not meet the inclusion criteria for the control group, further suggesting that they may not be genuinely non-T2D participants. Considering these factors, it is evident that although these participants are self-reported as non-T2D, they are likely in a pre-T2D stage, with an increased risk of developing T2D in the future. It is crucial to follow up with these individuals, monitor their condition closely, and implement preventive interventions to mitigate the risk of T2D development.

The integration of genetics and medical imaging data into risk assessment shows excellent potential for enabling early T2D detection and prevention, albeit at a higher cost. Practical examples from health examinations and screenings, such as the MJ Health Survey Database in Taiwan 48 , provide compelling evidence for successfully incorporating these data into real-world practices. These examples highlight the valuable role that genetics and medical imaging data can play in enhancing risk assessment and underscore the potential benefits of integrating these approaches for improved disease management and prevention. Notably, the performance of T2D classification and prediction in the established models was validated in a second independent dataset, yielding equally impressive results, thus demonstrating that the results are not due to overfitting.

While our current study presents a robust proof-of-principle model for disease risk evaluation based on genetic and multi-modality medical imaging variables within the TWB, we recognize the importance of external validation for broader generalizability. However, the current landscape poses challenges in accessing publicly available datasets encompassing genetic and all four types of medical imaging data. Despite the inherent limitations in readily available datasets with comparable characteristics, we are actively collaborating with a medical center to collect external validation data for future studies.

Another limitation of our study is that due to limited follow-up time in the TWB, only a limited number of participants experienced a change in T2D status from baseline to follow-up, particularly for redefining the phenotype using HbA1c and fasting glucose. To assess the early detection capability of our model for T2D, we are currently addressing this issue by monitoring the participants who exhibited changes in self-reported T2D status from baseline to follow-up in our Cox regression model. This limitation can be overcome in future studies as the TWB continues to track these samples. In addition, conducting a cohort survey or clinical trial is warranted to evaluate the high-risk subgroups our PRS and MRS identified for future precision T2D medicine.

In conclusion, our study surpassed previous research in predicting and classifying T2D. We successfully developed artificial intelligence models that effectively combined genetic markers, medical imaging features, and demographic variables for early detection and risk assessment of T2D. PRS and MRS were instrumental in identifying high-risk subgroups for T2D risk assessment. To facilitate online T2D risk evaluation, we have also created a dedicated website.

Inclusion and ethics declarations

The TWB collected written informed consent from all participants. The TWB (TWBR10911-01 and TWBR11005-04) and the Institute Review Board at Academia Sinica approved our data application and use (AS-IRB01-17049 and AS-IRB01-21009).

Study participants and variables

This study included a genetic-centric analysis (Analysis 1) and a genetic-imaging integrative analysis (Analysis 2). A total of 68,911 participants in the TWB were analyzed.

In the genetic-centric analysis, 50,984 participants with only baseline data (i.e., without follow-up data) were used as the training and validation samples; they consisted of 2531 self-reported T2D patients and 48,453 self-reported non-T2D controls (Fig. 1A and Fig. S1). We assigned 80% of the data to training and 20% to validation. These participants were organized into GWAS samples (Dataset 1, N = 35,688), training samples (Dataset 2, N = 12,236; Dataset 4, N = 40,787), and validation samples (Dataset 3, N = 3060; Dataset 5, N = 10,197). For classification analysis, the testing samples comprised two independent datasets, Dataset 6 (N = 8827) and Dataset 7 (N = 936); for prediction analysis, the corresponding datasets were denoted Dataset 6’ (N = 8827) and Dataset 7’ (N = 936) (Fig. 1A and Fig. S1). The 9763 participants with both baseline and follow-up data were used as the testing samples, with 8827 and 936 participants recruited into the first and second testing datasets, respectively; they consisted of 528 self-reported T2D patients and 9235 self-reported non-T2D controls at baseline, and 767 self-reported T2D patients and 8996 self-reported non-T2D controls at follow-up (Fig. 1A and Fig. S1A).

In addition to the self-reported T2D, hemoglobin A1C (HbA1c) and fasting glucose (GLU-AC) collected in both baseline and follow-up were used to refine the self-reported T2D phenotype. In total, we considered four definitions for T2D as follows: (1) Self-reported T2D: The case and control information was collected from the questionnaire directly; (2) Self-reported T2D + HbA1C: A case was defined as self-reported T2D and HbA1C ≥ 6.5% and control was defined as self-reported non-T2D and HbA1C ≤ 5.6%; (3) Self-reported T2D + GLU-AC: A case was defined as self-reported T2D and GLU-AC ≥ 126 and control was defined as self-reported non-T2D and GLU-AC < 100; (4) Self-reported T2D + HbA1c + GLU-AC: A case was defined as self-reported T2D, HbA1C ≥ 6.5%, or GLU-AC ≥ 126 and control was defined as self-reported non-T2D, HbA1C ≤ 5.6%, and GLU-AC < 100 (Figs.  1 B and 1C ).
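
As a minimal, hypothetical sketch of how these four case-control definitions could be encoded, the Python function below assumes a pandas DataFrame with columns self_t2d (self-reported T2D), hba1c (%), and glu_ac (mg/dL); the column names and data layout are illustrative and not those of the TWB analysis pipeline.

```python
import numpy as np
import pandas as pd

def define_t2d(df: pd.DataFrame, definition: int) -> pd.Series:
    """Return 1 for cases, 0 for controls, NaN for participants excluded by the definition.

    Assumes hypothetical columns: 'self_t2d' (bool), 'hba1c' (%), and 'glu_ac' (mg/dL).
    """
    case = df["self_t2d"].astype(bool)
    control = ~df["self_t2d"].astype(bool)
    if definition == 2:    # Definition II: self-report + HbA1c
        case = case & (df["hba1c"] >= 6.5)
        control = control & (df["hba1c"] <= 5.6)
    elif definition == 3:  # Definition III: self-report + fasting glucose
        case = case & (df["glu_ac"] >= 126)
        control = control & (df["glu_ac"] < 100)
    elif definition == 4:  # Definition IV: self-report, HbA1c, or fasting glucose
        case = case | (df["hba1c"] >= 6.5) | (df["glu_ac"] >= 126)
        control = (~df["self_t2d"].astype(bool)) & (df["hba1c"] <= 5.6) & (df["glu_ac"] < 100)
    label = pd.Series(np.nan, index=df.index)
    label[control] = 0
    label[case] = 1
    return label

# Toy example under Definition IV: one case, one control, one excluded participant.
toy = pd.DataFrame({"self_t2d": [True, False, False],
                    "hba1c": [7.1, 5.4, 6.0],
                    "glu_ac": [140, 92, 110]})
print(define_t2d(toy, definition=4))
```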

The demographic characteristics of the study population under the four phenotype definitions are shown in Table 1. Participants in the T2D case group were older than those in the control group and included a higher proportion of males.

The other variables in the genetic-centric analysis (Fig. 1D) were as follows. Demographic variables included age, sex, and family history of T2D. Four family history variables were used: T2D in either parent (yes or no), T2D in any sibling (yes or no), T2D in any of the father, mother, brother, and sister (yes or no), and the number of T2D cases among father, mother, brother, and sister (0, 1, 2, 3, or 4). Environmental exposures included education level, drinking level, exercise habits, and the number of exercise types.

Whole-genome genotyping was performed on the baseline samples using one of two SNP arrays: the TWBv1.0 array with approximately 650,000 SNP markers or the TWBv2.0 array with approximately 750,000 SNP markers. Imputation was performed with the 1KG-EAS reference panel 49 , and SNPs with an info score below 0.9 were removed 50 . Sample and marker quality controls followed the procedures of Yang et al. 51 ; related samples were removed in the quality control procedure based on identity-by-descent estimates. External information on T2D-associated SNP sets and on effect sizes from GWAS summary statistics of T2D was also collected (Supplemental Text 2).

In the genetic-imaging integrative analysis, 17,785 participants who had both genetic data and medical imaging data were analyzed (Fig. 2A and Fig. S1B); they consisted of 1366 self-reported T2D patients and 16,419 self-reported non-T2D controls (Fig. 2A). Here, cases and controls were defined based on the questionnaire at follow-up rather than at baseline. For example, based on T2D Definition IV (Fig. 1C), 7786 participants, consisting of 1118 cases and 6668 controls, were analyzed (Fig. 2A). The entire dataset was split into training + validation and testing sets at an 8:2 ratio, and the training + validation set was then further randomized into distinct training and validation datasets, again at an 8:2 ratio. Imaging report variables in the genetic-imaging integrative analysis (upper left in Fig. 2A) consisted of 28 ABD features, 29 CAU features, 85 BMD features, and 10 ECG features (Supplemental Data 1). TU features were not included because of the small sample size. Details of the medical imaging protocols are available from the TWB ( https://www.biobank.org.tw/about_value.php ). In the flowchart of PRS calculation, external information about T2D-associated SNP sets and GWAS summary statistics from DIAGRAM is provided (Fig. 2B). This paper reports only aggregate numerical data and summary statistics; no individuals can be identified.
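
The nested 8:2 splits can be illustrated with the following sketch, which uses synthetic stand-in data; the text does not state whether the splits were stratified or how they were seeded, so the random_state here is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and binary T2D labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

# First split: 80% training + validation versus 20% testing.
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Second split: the training + validation portion is again divided 8:2.
X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.2, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 640, 160, 200
```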

Polygenic risk score for T2D

PRS were constructed using PRS-CS 52 and PRS-CSx 19 . PRS-CS 52 was run on the meta-GWAS summary statistics of T2D in East Asians from the DIAGRAM Consortium 53 with the linkage disequilibrium (LD) reference from the EAS population of the 1000 Genomes Project 49 . The PRS was calculated with PLINK (--score command) by applying the 884,327 SNP effects estimated by PRS-CS to our genotype data and was standardized to mean = 0 and standard deviation = 1. PRS-CSx 19 was run on the meta-GWAS summary statistics of T2D in multiple populations from the DIAGRAM Consortium 53 : (a) East Asian, 56,268 cases and 227,155 controls; (b) European, 80,154 cases and 853,816 controls; and (c) South Asian, 16,540 cases and 32,952 controls, together with the LD reference from each of the three populations (EAS, EUR, and SAS). The 884,327, 880,098, and 900,047 SNP effects for EAS, EUR, and SAS, respectively, were applied to our data with PLINK (--score command) to calculate a population-specific PRS for each individual. The three population-specific PRS were combined with equal weights to obtain a final PRS, which was standardized in R to mean = 0 and standard deviation = 1.
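
The population-specific scores themselves come from PLINK's --score command; the equal-weight combination and standardization can be sketched in Python as below (the paper performed the standardization in R). Whether the population-specific scores are standardized before averaging is an assumption of this sketch, and the score values are synthetic.

```python
import numpy as np
import pandas as pd

def standardize(score: pd.Series) -> pd.Series:
    """Scale a PRS to mean 0 and standard deviation 1."""
    return (score - score.mean()) / score.std()

# Hypothetical per-individual scores, stand-ins for PLINK --score output per ancestry panel.
rng = np.random.default_rng(0)
prs = pd.DataFrame({
    "EAS": rng.normal(0.010, 0.002, 500),
    "EUR": rng.normal(0.020, 0.004, 500),
    "SAS": rng.normal(0.015, 0.003, 500),
})

# Standardize each population-specific PRS (an assumption of this sketch), average them
# with equal weights, and standardize the combined score once more.
prs_std = prs.apply(standardize)
final_prs = standardize(prs_std.mean(axis=1))
print(final_prs.describe())
```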

Classification and prediction for T2D

The XGBoost algorithm 29 , implemented in Python, was applied to classify and predict T2D based on the following features: genetic variables, demographic variables, environmental exposures, and imaging report variables. Both classification and prediction models were trained and validated on the baseline data (Datasets 2–5 in Fig. 1A). Final classification models were built and tested on the baseline phenotype data (Dataset 6 in Fig. 1A) and further replicated on the second independent testing dataset (Dataset 7 in Fig. 1A). Final prediction models were built and tested on the follow-up phenotype data (Dataset 6’ in Fig. 1A) and replicated on the second independent testing dataset (Dataset 7’ in Fig. 1A). An overview of the data used for the classification and prediction tasks in Analysis 1 and Analysis 2 is provided in Supplementary Table S1.

The XGBoost models were trained with the following default parameter settings: a maximum tree depth of 6, a learning rate of 0.3, a regularization parameter alpha (L1) of 0, a regularization parameter lambda (L2) of 1, 100 boosting stages, and an early-stop parameter of 30. Feature importance was calculated based on the average of three importance metrics: weight, gain, and cover indices for each variable within a single tree and then averaged across all the trees in a model 54 . Feature selection was performed based on the feature importance score. Parameter tuning was conducted to establish the best model (Supplementary Table  S4 ).
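
A minimal sketch of this training setup, assuming a recent xgboost release and synthetic stand-in data, is shown below. The feature-importance step simply averages the raw weight, gain, and cover scores per feature; the exact averaging and normalization scheme of reference 54 may differ.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the predictor matrix and binary T2D labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Parameter settings reported in the text (defaults plus early stopping on a validation set).
model = xgb.XGBClassifier(
    max_depth=6, learning_rate=0.3, reg_alpha=0, reg_lambda=1,
    n_estimators=100, early_stopping_rounds=30, eval_metric="auc")
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# Feature importance: average the weight, gain, and cover scores for each feature.
booster = model.get_booster()
metrics = ["weight", "gain", "cover"]
scores = {m: booster.get_score(importance_type=m) for m in metrics}
features = sorted({f for s in scores.values() for f in s})
avg_importance = {f: np.mean([scores[m].get(f, 0.0) for m in metrics]) for f in features}
top10 = sorted(avg_importance, key=avg_importance.get, reverse=True)[:10]
```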

The area under the receiver operating characteristic curve (AUC) was calculated to evaluate overall model performance. Differences between AUCs were examined with the two-sided DeLong test (DeLong et al., 1988), and Bonferroni correction 55 was applied to control the family-wise error rate in multiple comparisons. For the best model, accuracy, sensitivity, specificity, and F1-score were calculated, with the optimal cut-off value of the XGBoost model determined by the Youden index 56 in the validation data.
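
The AUC and the Youden-index cut-off (J = sensitivity + specificity - 1) can be computed as in the sketch below, which uses synthetic validation labels and predicted probabilities as stand-ins for model output; the DeLong test and Bonferroni correction are not shown.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, roc_curve

# Synthetic validation labels and predicted probabilities, stand-ins for model output.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)
p_val = np.clip(0.35 * y_val + rng.normal(0.3, 0.2, size=500), 0, 1)

auc = roc_auc_score(y_val, p_val)

# Youden index J = sensitivity + specificity - 1; take the cut-off that maximizes J.
fpr, tpr, thresholds = roc_curve(y_val, p_val)
best_cutoff = thresholds[np.argmax(tpr - fpr)]

y_pred = (p_val >= best_cutoff).astype(int)
accuracy = accuracy_score(y_val, y_pred)
sensitivity = recall_score(y_val, y_pred)
specificity = recall_score(1 - y_val, 1 - y_pred)  # recall of the negative class
f1 = f1_score(y_val, y_pred)
print(round(auc, 3), round(best_cutoff, 3))
```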

Event history analysis and online risk assessment

In the genetic-centric analysis, multivariate Cox regression 57 was applied to identify important risk factors for the T2D event time and estimate the T2D-free probability in the testing datasets. The event was defined as the occurrence of T2D in the follow-up for non-T2D participants at baseline. The Cox regression analysis considered two types of time scales (i.e., time-on-study and age) and three types of sex variable treatment (i.e., adjusting for sex as a covariate, conducting sex-specific analysis with an assumption of a common sex effect, and performing sex-specific analysis with different sex effects), resulting in six analyses (refer to Supplementary Table  S3 ).

The first three analyses used time-on-study as the time scale, with age at baseline included as a covariate, and applied the following sex variable treatments: (1) Model 1, sex treated as a covariate; (2) Model 2, sex-specific analysis assuming a common effect for males and females; (3) Model 3, sex-specific analysis with different effects for males and females. The remaining three analyses used age as the time scale, with age at baseline treated as left truncation, and applied the same three sex variable treatments, yielding Models 4–6.

The time-on-study scale analysis calculated the event time as the duration from baseline to follow-up. In the age scale analysis, the event time was age-at-follow-up. In each model, the median event time in weeks was also calculated. Because medical imaging data were only available in the follow-up, the genetic-imaging integrative analysis applied multivariate logistic regression 58 to identify important risk factors for T2D events and estimate the T2D-free probability in the testing datasets. In addition, we established a website at https://hcyang.stat.sinica.edu.tw/software/T2D_web/header.php to provide an online risk assessment for T2D.
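
A minimal sketch of the time-on-study Cox model (Model 1: baseline age and sex as covariates) using the lifelines package and synthetic stand-in data is shown below; the covariates, follow-up times, and event indicators are illustrative only, and the age-scale models with left truncation are not reproduced here.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Synthetic stand-in data: time from baseline to follow-up (years), T2D event indicator,
# and covariates (baseline age, sex, standardized PRS).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "time": rng.exponential(5, n) + 0.1,
    "t2d_event": rng.integers(0, 2, n),
    "age_baseline": rng.normal(50, 10, n),
    "sex": rng.integers(0, 2, n),
    "prs": rng.normal(0, 1, n),
})

# Model 1: time-on-study scale with baseline age and sex as covariates.
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="t2d_event")
print(cph.summary[["coef", "exp(coef)", "p"]])

# T2D-free (survival) probabilities for the first five individuals at 1, 3, and 5 years.
surv = cph.predict_survival_function(df.head(5), times=[1, 3, 5])
```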

Web resources

We established a website at https://hcyang.stat.sinica.edu.tw/software/T2D_web/header.php to provide an online risk assessment for T2D.

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

The data analyzed in this study were obtained from the Taiwan Biobank with proper approval. As the data are subject to ownership rights held by the Taiwan Biobank, they have not been deposited in a public repository. Researchers interested in accessing the data must do so through a formal application process, subject to approval by the Taiwan Biobank. Detailed instructions on requesting data access can be found on the Taiwan Biobank’s official website ( https://www.twbiobank.org.tw/index.php ). Source data are provided in the Supplementary Information and Source Data files with this paper. In addition to the TWB data, a set of 137 highly significant T2D-associated SNPs from the AGEN can be downloaded from https://blog.nus.edu.sg/agen/summary-statistics/t2d-2020/ . Meta-GWAS summary statistics of T2D in multiple populations from the DIAGRAM Consortium can be obtained from https://diagram-consortium.org/downloads.html . The linkage disequilibrium reference from various populations of the 1000 Genomes Project is available for download at https://github.com/getian107/PRScsx .

Code availability

The repository at https://github.com/yjhuang1119/Risk-assessment-model contains code for constructing a disease risk assessment model using eXtreme Gradient Boosting (XGBoost). The code also computes performance metrics for model evaluation and feature importance scores for model explainability. A README is provided.

References

Laakso, M. Biomarkers for type 2 diabetes. Mol. Metab. 27, S139–S146 (2019).


Morrish, N. J., Wang, S. L., Stevens, L. K., Fuller, J. H. & Keen, H. and the WHOMSG. Mortality and causes of death in the WHO multinational study of vascular disease in diabetes. Diabetologia 44 , S14 (2001).


Khan, M. A. B. et al. Epidemiology of Type 2 Diabetes - Global burden of disease and forecasted trends. J. Epidemiol. Glob. Health 10 , 107–111 (2020).


Chen, H.-Y., Kuo, S., Su, P.-F., Wu, J.-S. & Ou, H.-T. Health care costs associated with macrovascular, microvascular, and metabolic complications of type 2 diabetes across time: estimates from a population-based cohort of more than 0.8 million individuals with up to 15 years of follow-up. Diabetes Care 43 , 1732–1740 (2020).

Prasad, R. B. & Groop, L. Genetics of Type 2 diabetes—pitfalls and possibilities. Genes 6 , 87–123 (2015).


Bonnefond, A. & Froguel, P. Rare and common genetic events in type 2 diabetes: what should biologists know? Cell Metab. 21 , 357–368 (2015).


Meigs, J. B., Cupples, L. A. & Wilson, P. W. Parental transmission of type 2 diabetes: the Framingham Offspring Study. Diabetes 49 , 2201–2207 (2000).

Lyssenko, V. et al. Predictors of and longitudinal changes in insulin sensitivity and secretion preceding onset of type 2 diabetes. Diabetes 54 , 166–174 (2005).

Xue, A. et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat. Commun. 9 , 2941 (2018).


Suzuki, K. et al. Identification of 28 new susceptibility loci for type 2 diabetes in the Japanese population. Nat. Genet. 51 , 379–386 (2019).

Spracklen, C. N. et al. Identification of type 2 diabetes loci in 433,540 East Asian individuals. Nature 582 , 240–245 (2020).


van Hoek, M. et al. Predicting type 2 diabetes based on polymorphisms from genome-wide association studies: a population-based study. Diabetes 57 , 3122–3128 (2008).

Talmud, P. J. et al. Utility of genetic and non-genetic risk factors in prediction of type 2 diabetes: Whitehall II prospective cohort study. BMJ 340 , b4838 (2010).

Imamura, M. et al. Assessing the clinical utility of a genetic risk score constructed using 49 susceptibility alleles for type 2 diabetes in a Japanese population. J. Clin. Endocrinol. Metab. 98 , E1667–E1673 (2013).

Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51 , 584–591 (2019).

Polfus, L. M. et al. Genetic discovery and risk characterization in type 2 diabetes across diverse populations. Hum. Genet. Genomics Adv. 2 , 100029 (2021).


Ishigaki, K. et al. Multi-ancestry genome-wide association analyses identify novel genetic mechanisms in rheumatoid arthritis. Nat. Genet. 54 , 1640–1651 (2022).

Ge, T. et al. Development and validation of a trans-ancestry polygenic risk score for type 2 diabetes in diverse populations. Genome Med 14 , 70 (2022).

Ruan, Y. et al. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 54 , 573–580 (2022).

Tsuo, K. et al. Multi-ancestry meta-analysis of asthma identifies novel associations and highlights the value of increased power and diversity. Cell Genomics 2 , 100212 (2022).

Shojima, N. & Yamauchi, T. Progress in genetics of type 2 diabetes and diabetic complications. J. Diabetes Investig. 14 , 503–515 (2023).

Robertson, R. P. Prevention of type 2 diabetes mellitus. In: UpToDate (eds Nathan, D. & Rubinow, K.). (Wolters Kluwer, 2022). https://pro.uptodatefree.ir/show/1774 .

Isaia, G. et al. Osteoporosis in type II diabetes. Acta Diabetol. Lat. 24 , 305–310 (1987).

Ballestri, S. et al. Nonalcoholic fatty liver disease is associated with an almost twofold increased risk of incident type 2 diabetes and metabolic syndrome. Evidence from a systematic review and meta‐analysis. J. Gastroenterol. Hepatol. 31 , 936–944 (2016).

Lin, H.-H. et al. Association between type 2 diabetes and osteoporosis risk: A representative cohort study in Taiwan. Plos One 16 , e0254451 (2021).

Nabel, E. G. Cardiovascular disease. N. Engl. J. Med. 349 , 60–72 (2003).

Quazi, S. Artificial intelligence and machine learning in precision and genomic medicine. Med. Oncol. 39 , 120 (2022).

Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16 , 321–332 (2015).

Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference On Knowledge Discovery And Data Mining 785–794 (ACM, New York, NY, USA, 2016). https://doi.org/10.1145/2939672.2939785 .

Ogunleye, A. & Wang, Q. G. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinforma. 17 , 2131–2140 (2020).


Li, S. & Zhang, X. Research on orthopedic auxiliary classification and prediction model based on XGBoost algorithm. Neural Comput. Appl. 32 , 1971–1979 (2020).

Ma, X. et al. Comparison and development of machine learning tools for the prediction of chronic obstructive pulmonary disease in the Chinese population. J. Transl. Med. 18 , 146 (2020).

Elgart, M. et al. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun. Biol. 5 , 856 (2022).

Lin, J.-C., Hsiao, W. W.-W. & Fan, C.-T. Managing “incidental findings” in biobank research: Recommendations of the Taiwan biobank. Comput. Struct. Biotechnol. J. 17 , 1135–1142 (2019).

Bi, X.-a. et al. IHGC-GAN: influence hypergraph convolutional generative adversarial network for risk prediction of late mild cognitive impairment based on imaging genetic data. Brief. Bioinforma. 23 , bbac093 (2022).

Chen, R. J. et al. Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Trans. Med. Imaging 41 , 757–770 (2022).

Perkins, B. A. et al. Precision medicine screening using whole-genome sequencing and advanced imaging to identify disease risk in adults. Proc. Natl Acad. Sci. USA 115 , 3686–3691 (2018).

Hou, Y. C. et al. Precision medicine integrating whole-genome sequencing, comprehensive metabolomics, and advanced imaging. Proc. Natl Acad. Sci. USA 117 , 3053–3062 (2020).

Wray, N. R., Goddard, M. E. & Visscher, P. M. Prediction of individual genetic risk of complex disease. Curr. Opin. Genet. Dev. 18 , 257–263 (2008).

Brown, B. C., Ye, C. J., Price, A. L. & Zaitlen, N. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 99 , 76–88 (2016).

Cornelis, M. C., Zaitlen, N., Hu, F. B., Kraft, P. & Price, A. L. Genetic and environmental components of family history in type 2 diabetes. Hum. Genet. 134 , 259–267 (2015).

Chien, K. L. et al. Sibling and parental history in type 2 diabetes risk among ethnic Chinese: the Chin-Shan Community Cardiovascular Cohort Study. Eur. J. Cardiovasc Prev. Rehabil. 15 , 657–662 (2008).

Misra, S. et al. Precision subclassification of type 2 diabetes: a systematic review. Commun. Med. 3 , 138 (2023).

Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet 50 , 1219–1224 (2018).

Privé, F., Arbel, J. & Vilhjálmsson, B. J. LDpred2: better, faster, stronger. Bioinformatics 36 , 5424–5431 (2020).


Walford, G. A. et al. Metabolite traits and genetic risk provide complementary information for the prediction of future type 2 diabetes. Diabetes Care 37 , 2508–2514 (2014).

Wu, S.-J. et al. Particle swarm optimization algorithm for analyzing SNP–SNP interaction of renin-angiotensin system genes against hypertension. Mol. Biol. Rep. 40 , 4227–4233 (2013).

Wu, X. et al. Cohort Profile: The Taiwan MJ Cohort: half a million Chinese with repeated health surveillance data. Int. J. Epidemiol. 46 , 1744–1744g (2017).

Auton, A. et al. A global reference for human genetic variation. Nature 526 , 68–74 (2015).


Krithika, S. et al. Evaluation of the imputation performance of the program IMPUTE in an admixed sample from Mexico City using several model designs. BMC Med Genomics 5 , 12 (2012).

Yang, H.-C. et al. Genome-wide pharmacogenomic study on methadone maintenance treatment identifies SNP rs17180299 and multiple Haplotypes on CYP2B6, SPON1, and GSG1L associated with plasma concentrations of Methadone R- and S-enantiomers in Heroin-dependent patients. PLOS Genet. 12 , e1005910 (2016).

Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10 , 1776 (2019).

Mahajan, A. et al. Multi-ancestry genetic study of type 2 diabetes highlights the power of diverse populations for discovery and translation. Nat. Genet 54 , 560–572 (2022).

Zhang, B., Zhang, Y. & Jiang, X. C. Feature selection for global tropospheric ozone prediction based on the BO-XGBoost-RFE algorithm. Sci. Rep. 12 , 9244 (2022).

Bonferroni, C. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del. R. Istituto Super. di Sci. Economiche e Commericiali di Firenze 8 , 3–62 (1936).


Youden, W. J. Index for rating diagnostic tests. Cancer 3 , 32–35 (1950).

Cox, D. R. Regression models and life‐tables. J. R. Stat. Soc.: Ser. B (Methodol.) 34 , 187–202 (1972).


Agresti, A. Categorical data analysis . 3rd edn. (John Wiley & Sons Inc., Hoboken, 2013).


Acknowledgements

This work was supported by research grants from Academia Sinica (AS-PH-109-01 and AS-SH-112-01). We gratefully acknowledge the Taiwan Biobank for providing the data used in this research and extend our thanks to all its participants for their invaluable contributions. The National Center for Genome Medicine of Taiwan also provided technical support in genotyping. Part of this paper was completed during the first author’s master’s studies, which were supported by the Mr. Samuel Yin New Students Scholarship and a scholarship offered by the Institute of Statistical Science, Academia Sinica. We thank Mr. Chia-Wei Chen and Dr. Shih-Kai Chu in our research team for providing the genetic data with quality control.

Author information

Authors and affiliations

Institute of Public Health, National Yang-Ming Chiao-Tung University, Taipei, Taiwan

Yi-Jia Huang & Hsin-Chou Yang

Institute of Statistical Science, Academia Sinica, Taipei, Taiwan

Yi-Jia Huang, Chun-houh Chen & Hsin-Chou Yang

Biomedical Translation Research Center, Academia Sinica, Taipei, Taiwan

Hsin-Chou Yang

Department of Statistics, National Cheng Kung University, Tainan, Taiwan


Contributions

H.C.Y. conceptualized and supervised the study. Y.J.H. performed data curation and applied software. Y.J.H. & H.C.Y. conducted formal data analysis and result visualization and wrote the paper. C.h.C. & H.C.Y. provided funding acquisition and resources. H.C.Y., Y.J.H., and C.h.C. validated the results.

Corresponding author

Correspondence to Hsin-Chou Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Kerby Shedden and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information
  • Peer Review File
  • Description of Additional Supplementary Files
  • Supplementary Data 1
  • Reporting Summary
  • Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Huang, YJ., Chen, Ch. & Yang, HC. AI-enhanced integration of genetic and medical imaging data for risk assessment of Type 2 diabetes. Nat Commun 15 , 4230 (2024). https://doi.org/10.1038/s41467-024-48618-1


Received: 29 September 2023

Accepted: 08 May 2024

Published: 18 May 2024

DOI: https://doi.org/10.1038/s41467-024-48618-1


Will Knight

OpenAI’s Long-Term AI Risk Team Has Disbanded

Photo: Ilya Sutskever, OpenAI’s chief scientist.

In July last year, OpenAI announced the formation of a new research team that would prepare for the advent of supersmart artificial intelligence capable of outwitting and overpowering its creators. Ilya Sutskever, OpenAI’s chief scientist and one of the company’s cofounders, was named as the colead of this new team. OpenAI said the team would receive 20 percent of its computing power.

Now OpenAI’s “superalignment team” is no more, the company confirms. That comes after the departures of several researchers involved, Tuesday’s news that Sutskever was leaving the company, and the resignation of the team’s other colead. The group’s work will be absorbed into OpenAI’s other research efforts.

Sutskever’s departure made headlines because although he’d helped CEO Sam Altman start OpenAI in 2015 and set the direction of the research that led to ChatGPT, he was also one of the four board members who fired Altman in November . Altman was restored as CEO five chaotic days later after a mass revolt by OpenAI staff and the brokering of a deal in which Sutskever and two other company directors left the board . Hours after Sutskever’s departure was announced on Tuesday, Jan Leike, the former DeepMind researcher who was the superalignment team’s other colead, posted on X that he had resigned.

Neither Sutskever nor Leike responded to requests for comment. Sutskever did not offer an explanation for his decision to leave but offered support for OpenAI’s current path in a post on X . “The company’s trajectory has been nothing short of miraculous, and I’m confident that OpenAI will build AGI that is both safe and beneficial” under its current leadership, he wrote.

Leike posted a thread on X on Friday explaining that his decision came from a disagreement over the company’s priorities and how much resources his team was being allocated.

“I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point,” Leike wrote. “Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.”

The dissolution of OpenAI’s superalignment team adds to recent evidence of a shakeout inside the company in the wake of last November’s governance crisis. Two researchers on the team, Leopold Aschenbrenner and Pavel Izmailov, were dismissed for leaking company secrets, The Information reported last month. Another member of the team, William Saunders, left OpenAI in February, according to an internet forum post in his name.

Two more OpenAI researchers working on AI policy and governance also appear to have left the company recently. Cullen O'Keefe left his role as research lead on policy frontiers in April, according to LinkedIn. Daniel Kokotajlo, an OpenAI researcher who has coauthored several papers on the dangers of more capable AI models, “quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI,” according to a posting on an internet forum in his name. None of the researchers who have apparently left responded to requests for comment.


OpenAI declined to comment on the departures of Sutskever or other members of the superalignment team, or the future of its work on long-term AI risks. Research on the risks associated with more powerful models will now be led by John Schulman, who coleads the team responsible for fine-tuning AI models after training.

The superalignment team was not the only team pondering the question of how to keep AI under control, although it was publicly positioned as the main one working on the most far-off version of that problem. The blog post announcing the superalignment team last summer stated: “Currently, we don't have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue.” OpenAI’s charter binds it to developing so-called artificial general intelligence, or technology that rivals or exceeds humans, safely and for the benefit of humanity. Sutskever and other leaders there have often spoken about the need to proceed cautiously. But OpenAI has also been quick to develop and release experimental AI projects to the public.

OpenAI was once unusual among prominent AI labs for the eagerness with which research leaders like Sutskever talked of creating superhuman AI and of the potential for such technology to turn on humanity. That kind of doomy AI talk became much more widespread last year, after ChatGPT turned OpenAI into the most prominent and closely-watched technology company on the planet. As researchers and policymakers wrestled with the implications of ChatGPT and the prospect of vastly more capable AI, it became less controversial to worry about AI harming humans or humanity as a whole .

The existential angst has since cooled—and AI has yet to make another massive leap—but the need for AI regulation remains a hot topic. And this week OpenAI showcased a new version of ChatGPT that could once again change people’s relationship with the technology in powerful and perhaps problematic new ways.

The departures of Sutskever and Leike come shortly after OpenAI’s latest big reveal—a new “multimodal” AI model called GPT-4o that allows ChatGPT to see the world and converse in a more natural and humanlike way. A livestreamed demonstration showed the new version of ChatGPT mimicking human emotions and even attempting to flirt with users. OpenAI has said it will make the new interface available to paid users within a couple of weeks.

There is no indication that the recent departures have anything to do with OpenAI’s efforts to develop more humanlike AI or to ship products. But the latest advances do raise ethical questions around privacy, emotional manipulation, and cybersecurity risks. OpenAI maintains another research group called the Preparedness team which focuses on these issues.

Update 5/17/24 12:23 pm ET: This story has been updated to include comments from posts on X by Jan Leike.



Published on 27.5.2024 in Vol 11 (2024)

The Impact of Performance Expectancy, Workload, Risk, and Satisfaction on Trust in ChatGPT: Cross-Sectional Survey Analysis

Authors of this article:


Original Paper

  • Avishek Choudhury, PhD   ; 
  • Hamid Shamszare, MSc  

Industrial and Management Systems Engineering, Benjamin M. Statler College of Engineering and Mineral Resources, West Virginia University, Morgantown, WV, United States

Corresponding Author:

Avishek Choudhury, PhD

Industrial and Management Systems Engineering

Benjamin M. Statler College of Engineering and Mineral Resources

West Virginia University

321 Engineering Sciences Building

1306 Evansdale Drive

Morgantown, WV, 26506

United States

Phone: 1 3042939431

Email: [email protected]

Background: ChatGPT (OpenAI) is a powerful tool for a wide range of tasks, from entertainment and creativity to health care queries. There are potential risks and benefits associated with this technology. In the discourse concerning the deployment of ChatGPT and similar large language models, it is sensible to recommend their use primarily for tasks a human user can execute accurately. As we transition into the subsequent phase of ChatGPT deployment, establishing realistic performance expectations and understanding users’ perceptions of risk associated with its use are crucial in determining the successful integration of this artificial intelligence (AI) technology.

Objective: The aim of the study is to explore how perceived workload, satisfaction, performance expectancy, and risk-benefit perception influence users’ trust in ChatGPT.

Methods: A semistructured, web-based survey was conducted with 607 adults in the United States who actively use ChatGPT. The survey questions were adapted from constructs used in various models and theories such as the technology acceptance model, the theory of planned behavior, the unified theory of acceptance and use of technology, and research on trust and security in digital environments. To test our hypotheses and structural model, we used the partial least squares structural equation modeling method, a widely used approach for multivariate analysis.

Results: A total of 607 people responded to our survey. A significant portion of the participants held at least a high school diploma (n=204, 33.6%), and the majority had a bachelor’s degree (n=262, 43.1%). The primary motivations for participants to use ChatGPT were for acquiring information (n=219, 36.1%), amusement (n=203, 33.4%), and addressing problems (n=135, 22.2%). Some participants used it for health-related inquiries (n=44, 7.2%), while a few others (n=6, 1%) used it for miscellaneous activities such as brainstorming, grammar verification, and blog content creation. Our model explained 64.6% of the variance in trust. Our analysis indicated a significant relationship between (1) workload and satisfaction, (2) trust and satisfaction, (3) performance expectations and trust, and (4) risk-benefit perception and trust.

Conclusions: The findings underscore the importance of ensuring user-friendly design and functionality in AI-based applications to reduce workload and enhance user satisfaction, thereby increasing user trust. Future research should further explore the relationship between risk-benefit perception and trust in the context of AI chatbots.

Introduction

ChatGPT (OpenAI) [ 1 ] is a powerful tool for a wide range of tasks, from entertainment and creativity to health care queries [ 2 ]. There are potential benefits associated with this technology. For instance, it can help summarize large amounts of text data [ 3 , 4 ] or generate programming code [ 5 ]. There is also the notion that ChatGPT may potentially assist with health care tasks [ 6 - 9 ]. However, the risks associated with using ChatGPT can hinder its adoption in various high-risk domains. These risks include the potential for inaccuracies and lack of citation relevance in scientific content generated by ChatGPT [ 10 ], ethical issues (copyright, attribution, plagiarism, and authorship) [ 11 ], the risk of hallucination (inaccurate information that sounds scientifically plausible) [ 12 ], and the possibility of biased content and inaccurate information due to the quality of training data sets generated prior to the year 2021 [ 4 ].

In the discourse concerning the deployment of ChatGPT and similar artificial intelligence (AI) technologies, it is sensible to recommend their use primarily for tasks a human user can execute accurately. A few studies have advocated using the technology under human supervision [ 13 , 14 ]. Encouraging users to rely on such tools for tasks beyond their competence is risky, as they may not be able to evaluate the AI’s output effectively. The strength of ChatGPT lies in its ability to automate more straightforward, mundane tasks, freeing human users to invest their time and cognitive resources into critical tasks (not vice versa). This approach to technology use maintains a necessary balance, leveraging AI for efficiency gains while ensuring that critical decision-making remains within the purview of human expertise.

As we transition into the subsequent phase of ChatGPT deployment, establishing realistic performance expectations and understanding users’ perceptions of the risk associated with its use are crucial in determining the successful integration of this AI technology. Thus, understanding users’ perceptions of ChatGPT becomes essential, as these perceptions significantly influence their usage decisions [ 2 ]. For example, if users believe that ChatGPT’s capabilities surpass human knowledge, they may be tempted to use it for tasks such as self-diagnosis, which could lead to harmful outcomes if the generated information is mistaken or misleading. Conversely, a realistic appraisal of the strengths and limitations of the technology would encourage its use in low-risk, routine tasks and foster a safer, more effective integration into everyday life.

Building upon the importance of user perceptions and expectations, we must also consider that the extent to which users trust ChatGPT hinges mainly on the perception of its accuracy and reliability. As users witness the technology’s ability to perform tasks effectively and generate correct, helpful information, their trust in the system grows. This, in turn, allows them to offload routine tasks to the AI and focus their energies on more complex or meaningful endeavors. Similarly, instances where the AI generates inaccurate or misleading information can quickly erode users’ perception of the technology. Users may become dissatisfied and lose trust if they perceive the technology as unreliable or potentially harmful, particularly if they have previously overestimated its capabilities. This underlines the importance of setting realistic expectations and accurately understanding the strengths and limitations of ChatGPT, which can help foster a healthy level of trust and satisfaction among users. Ultimately, establishing and maintaining trust and satisfaction are not a onetime event but an ongoing process of validating the AI’s outputs, understanding and acknowledging its limitations, and making the best use of its capabilities within a framework of informed expectations and continuous learning. This dynamic balance is pivotal for the effective and safe integration of AI technologies such as ChatGPT into various sectors of human activity.

In our prior work, we explored the impact of trust in the actual use of ChatGPT [ 15 ]. This study aims to explore a conceptual framework delving deeper into the aspects influencing user trust in ChatGPT.

As shown in Figure 1 , the proposed conceptual model is grounded in the well-established theories of technology acceptance and use, incorporating constructs such as performance expectancy, workload, satisfaction, risk-benefit perception, and trust to comprehensively evaluate user interaction with technology. Performance expectancy, derived from the core postulates of the technology acceptance model (TAM) [ 16 ] and the unified theory of acceptance and use of technology (UTAUT) [ 17 ], posits that the perceived usefulness of the technology significantly predicts usage intentions. Workload, akin to effort expectancy, reflects the perceived cognitive and physical effort required to use the technology, where a higher workload may inversely affect user satisfaction, a construct that encapsulates the fulfillment of user expectations and needs through technology interaction. The risk-benefit perception embodies the user’s assessment of the technology’s potential advantages against its risks, intricately influencing both user satisfaction and trust. Trust, a pivotal determinant of technology acceptance [ 15 ], signifies the user’s confidence in the reliability and efficacy of the technology. This theoretical framework thus serves to elucidate the multifaceted process by which users come to accept and use a technological system, highlighting the critical role of both cognitive appraisals and affective responses in shaping the technology adoption landscape.

We explore the following hypotheses:

  • Hypothesis 1: Perceived workload of using ChatGPT negatively correlates with user trust in ChatGPT.
  • Hypothesis 2: Perceived workload of using ChatGPT negatively correlates with user satisfaction with ChatGPT.
  • Hypothesis 3: User satisfaction with ChatGPT positively correlates with trust in ChatGPT.
  • Hypothesis 4: User trust in ChatGPT is positively correlated with the performance expectancy of ChatGPT.
  • Hypothesis 5: The risk-benefit perception of using ChatGPT is positively correlated with user trust in ChatGPT.


Ethical Considerations

The study obtained ethics approval from West Virginia University, Morgantown (protocol 2302725983). The study was performed in accordance with relevant guidelines and regulations. No identifiers were collected during the study, and all users were compensated for completing the survey through an audience paneling service. In compliance with ethical research practices, informed consent was obtained from all participants before initiating the survey. Attached to the survey was a comprehensive cover letter outlining the purpose of the study, the procedure involved, the approximate time to complete the survey, and assurances of anonymity and confidentiality. It also emphasized that participation was completely voluntary, and participants could withdraw at any time without any consequences. The cover letter also included the contact information of the researchers for any questions or concerns the participants might have regarding the study. Participants were asked to read through the cover letter information carefully and were instructed to proceed with the survey only if they understood and agreed to the terms described, effectively providing their consent to participate in the study.

Study Design

A semistructured, web-based questionnaire was disseminated to adult individuals within the United States who engaged with ChatGPT (version 3.5) at least once per month. Data collection took place between February and March 2023. The questionnaire was crafted using Qualtrics (Qualtrics LLC), and its circulation was handled by Centiment (Centiment LLC), a provider of audience-paneling services. Centiment’s services were used due to their extensive reach and ability to connect with a diverse and representative group via their network and social media. Their fingerprinting technology, which uses IP address, device type, screen size, and cookies, was used to guarantee the uniqueness of the survey respondents. Prior to the full-scale dissemination, a soft launch was carried out with 40 responses gathered. The purpose of a soft launch, a limited-scale trial of the survey, is to pinpoint any potential problems, such as ambiguity or confusion in questions, technical mishaps, or any other factors that might affect the quality of data obtained. The survey was made available to a larger audience following the successful soft launch.

Table 1 shows the descriptive statistics of the survey questions used in this study. From these questions we developed 3 latent constructs (trust, workload, and performance expectancy) and 2 single-item variables (satisfaction and risk-benefit perception). Participant responses to all the questions were captured on a 4-point Likert scale ranging from 1=strongly disagree to 4=strongly agree. The questions were adapted from constructs used in various models and theories such as the TAM, the theory of planned behavior, UTAUT, and research on trust and security in digital environments.

  • Trust: questions T1-T7, related to trust in AI systems, were adapted from the trust building model [ 18 ].
  • Workload: questions WL1 and WL2 were adapted from the National Aeronautics and Space Administration Task Load Index for measuring perceived workload [ 19 ].
  • Performance expectancy: questions PE1-PE4 concern the perceived benefits of using the system, a central concept in TAM and UTAUT.
  • Satisfaction: a single item capturing overall user satisfaction, a common measure in information systems success models [ 20 ].
  • Risk-benefit perception: a single item addressing the user’s assessment of benefits relative to potential risks, an aspect often discussed in the context of technology adoption and use [ 21 ].

These references provide a starting point for understanding the theoretical underpinnings of the survey used in this study. They are adapted from foundational works in information systems, human-computer interaction, and psychology that address trust, workload, performance expectancy, satisfaction, and the evaluation of benefits versus risks in technology use.

Statistical Analysis and Model Validation

To test our hypotheses and structural model, we used the partial least squares structural equation modeling (PLS-SEM) method, a widely used approach for multivariate analysis. PLS-SEM enables the estimation of complex models with multiple constructs, indicator variables, and structural paths, without making assumptions about the data’s distribution [ 22 ]. This method is beneficial for studies with small sample sizes that involve many constructs and items [ 23 ]. PLS-SEM is a suitable method because of its flexibility and ability to allow for interaction between theory and data in exploratory research [ 24 ]. The analyses were performed using the SEMinR package in R (R Foundation for Statistical Computing) [ 25 ]. We started by loading the data set collected for this study using the reader package in R. We then defined the measurement model. This consisted of 5 composite constructs: trust, performance expectancy, workload, risk-benefit perception, and satisfaction. Trust was measured with 7 items (T1 through T7), performance expectancy with 4 items (PE1 through PE4), and workload with 2 items (WL1 and WL2), while risk-benefit perception and satisfaction were each measured with a single item. We also evaluated the convergent and discriminant validity of the latent constructs, which we assessed using 3 criteria: factor loadings (>0.50), composite reliability (>0.70), and average variance extracted (>0.50). We used the Heterotrait-Monotrait ratio (<0.90) to assess discriminant validity [ 26 ].

Next, we defined the structural model, which captured the hypothesized relationships between the constructs. The model included paths from risk-benefit perception, performance expectancy, workload, and satisfaction to trust, as well as a path from workload to satisfaction. We then estimated the model’s parameters using the partial least squares method, implemented with the estimate_pls function in the SEMinR package. The partial least squares method was preferred due to its ability to handle complex models and its robustness to violations of normality assumptions. We performed a bootstrap resampling procedure with 10,000 iterations to obtain robust parameter estimates and compute 95% CIs, and the bootstrapped model was plotted to visualize the estimates and their 95% CIs.
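
The structural model itself was estimated with the SEMinR package in R and is not reproduced here; as a language-consistent illustration of the bootstrap step alone, the hypothetical Python sketch below resamples participants 10,000 times to obtain a percentile 95% CI for a single ordinary least-squares path coefficient on synthetic data. It is not a PLS-SEM implementation.

```python
import numpy as np

# Synthetic stand-ins: one predictor (e.g., a satisfaction score) and an outcome (trust).
rng = np.random.default_rng(0)
n = 607
x = rng.normal(size=n)
y = 0.165 * x + rng.normal(size=n)  # slope loosely mirrors the reported satisfaction-trust path

def path_coef(x, y):
    """Ordinary least-squares slope of y on x, a stand-in for one structural path."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))

# Resample participants with replacement 10,000 times and re-estimate the coefficient.
boot = np.empty(10_000)
for b in range(boot.size):
    idx = rng.integers(0, n, n)
    boot[b] = path_coef(x[idx], y[idx])

estimate = path_coef(x, y)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(round(estimate, 3), round(ci_low, 3), round(ci_high, 3))
```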

Of 607 participants who completed the survey, 29.9% (n=182) used ChatGPT at least once per month, 26.1% (n=158) used it weekly, 24.5% (n=149) accessed it more than once per week, and 19.4% (n=118) interacted with it almost daily. A substantial portion of the participants held at least a high school diploma (n=204, 33.6%), and the majority had a bachelor’s degree (n=262, 43.1%). The primary motivations for participants to use ChatGPT were for acquiring information (n=219, 36%), amusement (n=203, 33.4%), and addressing problems (n=135, 22.2%). Some participants used it for health-related inquiries (n=44, 7.2%), while a few others (n=6, 1%) used it for miscellaneous activities such as brainstorming, grammar verification, and blog content creation. Table 2 shows the factor loading of the latent constructs in the model.

The model explained 2% and 64.6% of the variance in “satisfaction” and “trust,” respectively. Reliability estimates, as shown in Table 3 , indicated high levels of internal consistency for all 5 latent variables, with Cronbach α and ρ values exceeding the recommended threshold of 0.7. The average variance extracted for the latent variables also exceeded the recommended threshold of 0.5, indicating that these variables are well-defined and reliable. Based on the root mean square error of approximation (RMSEA) fit index, our PLS-SEM model demonstrates a good fit for the observed data. The calculated RMSEA value of 0.07 falls below the commonly accepted threshold of 0.08, indicating an acceptable fit. The RMSEA estimates the average discrepancy per degree of freedom in the model, capturing how the proposed model aligns with the population covariance matrix. With a value below the threshold, it suggests that the proposed model adequately represents the relationships among the latent variables. This finding provides confidence in the model’s ability to explain the observed data and support the underlying theoretical framework.

Table 4 shows the estimated paths in our model. Hypothesis 1 postulated that as the perceived workload of using ChatGPT increases, user trust in ChatGPT decreases. Our analysis indicated a negative estimate for the path from workload to trust (–0.047); however, the magnitude of the T statistic (–1.674) did not exceed the critical value, and the 95% CI (–0.102 to 0.007) includes 0, so the effect is not statistically significant. Therefore, we do not have sufficient evidence to support hypothesis 1.

Hypothesis 2 stated that perceived workload is negatively correlated with user satisfaction with ChatGPT. The results supported this hypothesis, as the path from workload to satisfaction showed a negative estimate (–0.142), a T statistic (–3.416) beyond the critical value, and a 95% CI (–0.223 to –0.061).

The data confirmed this relationship for hypothesis 3, which proposed a positive correlation between satisfaction with ChatGPT and trust in ChatGPT. The path from satisfaction to trust had a positive estimate (0.165), a T statistic (4.478) beyond the critical value, and a 95% CI (0.093-0.237).

Hypothesis 4 suggested that user trust in ChatGPT increases with the performance expectancy of the technology. The analysis supported this hypothesis: the path from performance expectancy to trust displayed a positive estimate (0.598), a large T statistic (15.554), and a 95% CI of 0.522-0.672. Finally, we examined hypothesis 5, which posited that user trust in ChatGPT increases as the risk-benefit perception of using the technology increases. The path from risk-benefit perception to trust showed a positive estimate (0.114), with a T statistic of 3.372 and a 95% CI of 0.048-0.179, indicating a significant relationship; the positive sign suggests that as the perceived benefits outweigh the risks, trust in ChatGPT increases. Therefore, hypothesis 5 is supported. Figure 2 illustrates the structural model with all path coefficients.



Main Findings

This study represents one of the initial attempts to investigate how human factors such as workload, performance expectancy, risk-benefit perception, and satisfaction influence trust in ChatGPT. Our results showed that these factors significantly influenced trust in ChatGPT, with performance expectancy showing the strongest association, highlighting its critical role in fostering trust. Additionally, we found that satisfaction was a mediator in the relationship between workload and trust, and a positive correlation was observed between trust in ChatGPT and the risk-benefit perception. Our findings align with the Biden-Harris Administration’s May 23, 2023, initiatives to advance responsible AI research, development, and deployment [ 27 ]. The Administration recognizes that managing AI’s risks is crucial and prioritizes protecting individuals’ rights and safety. One of the critical actions highlighted by the administration is the artificial intelligence risk management framework (AI RMF). The AI RMF builds on the importance of trustworthiness in AI systems and is a framework for strengthening AI trustworthiness and promoting the trustworthy design, development, deployment, and use of AI systems, underscoring the relevance of our research [ 28 ]. Our findings reveal the importance of performance expectancy, satisfaction, and risk-benefit perception in determining users’ trust in AI systems. By addressing these factors, AI systems can be designed and developed to be more user-centric, aligning with the AI RMF’s emphasis on human-centricity and responsible AI.

Workload and Trust in ChatGPT

Moreover, we found that reducing user workload is vital for enhancing user satisfaction, which in turn improves trust. This finding aligns with the AI RMF’s focus on creating AI systems that are equitable and accountable and that mitigate inequitable outcomes. Additionally, our research emphasizes the need for future exploration of other factors impacting user trust in AI technologies. Such endeavors align with the AI RMF’s vision of managing AI risks comprehensively and holistically, considering technical and societal factors. Understanding these factors is crucial for fostering public trust and enhancing the overall trustworthiness of AI systems, as outlined in the AI RMF [ 28 ].

This study also extends and complements existing literature. Consistent with the observed patterns in studies on flight simulators, dynamic multitasking environments, and cyberattacks [ 29 - 31 ], we also found that higher perceived workload in using ChatGPT led to lower levels of trust in this technology. Our findings align with the existing research indicating a negative correlation between workload and user satisfaction [ 32 ]. We observed that as the perceived workload of using ChatGPT increased, user satisfaction with the technology decreased. This outcome echoes the consensus within the literature that a high workload can lead to user dissatisfaction, particularly if the technology requires too much effort or time [ 33 ]. The literature reveals that perceived workload balance significantly influences job satisfaction in work organizations [ 25 ], and similar patterns are found in the well-being studies of nurses, where perceived workload negatively impacts satisfaction with work-life balance [ 34 ]. While this study does not directly involve the workplace environment or work-life balance, the parallels between workload and satisfaction are evident. Furthermore, our research parallels the study suggesting that when providing timely service, AI applications can alleviate perceived workload and improve job satisfaction [ 35 ]. ChatGPT, as an AI-powered chatbot, could potentially contribute to workload relief when it performs effectively and efficiently, thereby boosting user satisfaction.

Satisfaction and Trust in ChatGPT

Our findings corroborate the existing literature, which suggests a strong positive correlation between user satisfaction and trust in a technology or service provider [ 23 , 24 , 26 , 36 - 38 ]. Users who expressed higher satisfaction with ChatGPT were more likely to trust the system, strengthening the premise that satisfaction can predict trust in a technology or service provider. Similar to a study on digital transaction services, our research indicates that higher satisfaction with ChatGPT corresponded with higher trust in the AI system [ 37 ], suggesting that when users are satisfied with the performance and results provided by ChatGPT, they tend to trust the technology more. Research on mobile transaction apps mirrors our findings: satisfaction with ChatGPT use was a significant predictor of trust in the system [ 36 ], underscoring the importance of ensuring user satisfaction to foster trust in innovative technologies such as AI chatbots. A study on satisfaction with digital assistants, which observed a positive relationship between trust and satisfaction [ 26 ], further aligns with our finding of a positive correlation between trust in ChatGPT and user satisfaction with this AI assistant.

Performance Expectancy and Trust in ChatGPT

Our finding of a strong positive correlation between performance expectancy and trust in ChatGPT extends prior literature. Similar results have been reported in previous studies on wearables and mobile banking [ 39 , 40 ], where performance expectancy was positively correlated with trust. However, our results diverge from those of a recent study that did not find a significant impact of performance expectancy on trust in chatbots [ 41 ]. Moreover, the observed mediating role of satisfaction in the relationship between workload and trust in ChatGPT is a notable contribution to the literature. While previous studies have demonstrated a positive correlation between workload reduction by chatbots and trust, as well as between trust and user satisfaction [ 42 - 44 ], the role of satisfaction as a mediator between workload and trust had not been explored. Finally, the positive correlation between the risk-benefit perception of using ChatGPT and trust aligns with the findings of previous studies [ 45 - 47 ]. Studies on the intention to use chatbots for digital shopping and customer service have found that trust in chatbots both affects and is affected by the perceived risk of using them [ 46 , 47 ]. Our study adds to this body of research by confirming the same positive relationship in the context of ChatGPT.
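
To make the mediation result concrete, the sketch below shows one common way an indirect effect of workload on trust through satisfaction can be estimated: two regression paths and a bootstrap confidence interval for their product. The data, variable names, and coefficients are simulated placeholders for illustration only; this is not the study’s data set or the authors’ analysis code.

```python
# Illustrative mediation sketch (simulated data, not the study's data).
# Tests whether satisfaction carries part of the workload -> trust effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
workload = rng.normal(size=n)
satisfaction = -0.5 * workload + rng.normal(scale=0.8, size=n)  # higher workload, lower satisfaction
trust = 0.6 * satisfaction - 0.1 * workload + rng.normal(scale=0.8, size=n)
df = pd.DataFrame({"workload": workload, "satisfaction": satisfaction, "trust": trust})

# Path a: workload -> satisfaction; path b: satisfaction -> trust (controlling for workload)
a = smf.ols("satisfaction ~ workload", df).fit().params["workload"]
model_b = smf.ols("trust ~ satisfaction + workload", df).fit()
b, c_prime = model_b.params["satisfaction"], model_b.params["workload"]

# Bootstrap the indirect effect a*b for a rough 95% confidence interval
indirect = []
for _ in range(1000):
    s = df.sample(len(df), replace=True)
    a_s = smf.ols("satisfaction ~ workload", s).fit().params["workload"]
    b_s = smf.ols("trust ~ satisfaction + workload", s).fit().params["satisfaction"]
    indirect.append(a_s * b_s)
lo, hi = np.percentile(indirect, [2.5, 97.5])
print(f"indirect effect a*b = {a * b:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]; direct effect c' = {c_prime:.3f}")
```

A bootstrap interval excluding zero would be consistent with mediation; the study’s own modeling approach may differ from this simplified regression-based sketch.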

Limitations

Despite the valuable insights provided by this study, several limitations should be acknowledged. First, our research focused explicitly on ChatGPT and may not be generalizable to other AI-powered conversational agents or chatbot technologies. Different chatbot systems may have unique characteristics and user experiences that could influence the factors affecting trust. Second, this study relied on self-reported data from survey responses, which may be subject to response biases and limitations inherent to self-report measures. Participants’ perceptions and interpretations of the constructs under investigation could vary, leading to potential measurement errors. Third, this study was cross-sectional, capturing data at a specific point in time. Longitudinal studies that track users’ experiences and perceptions over time would provide a more comprehensive understanding of the dynamics between trust and the factors investigated. Finally, the sample of participants in this study consisted of individuals who actively use ChatGPT, which may introduce a self-selection bias. The perspectives and experiences of nonusers or individuals with limited exposure to AI-powered conversational agents may differ, and their insights could provide additional valuable perspectives.

Conclusions

This study examined the factors influencing trust in ChatGPT, an AI-powered conversational agent. Our analysis found that performance expectancy, satisfaction, workload, and risk-benefit perceptions significantly influenced users’ trust in ChatGPT. These findings contribute to understanding trust dynamics in the context of AI-powered conversational agents and provide insights into the factors that can enhance user trust. By addressing the factors influencing trust, we contribute to the broader goal of fostering responsible AI practices that prioritize user-centric design and protect individuals’ rights and safety. Future research should consider longitudinal designs to capture the dynamics of trust over time. Additionally, incorporating perspectives from diverse user groups and examining the impact of contextual factors on trust would further enrich our understanding of trust in AI technologies.

Data Availability

The data sets generated and analyzed during this study are available from the corresponding author on reasonable request.

Authors' Contributions

AC, the lead researcher, was responsible for the study’s conceptualization, the survey’s development, figure illustration, data collection and analysis, and manuscript writing. HS, the student author, was responsible for manuscript writing and conducting the literature review. Both authors collaborated throughout the research process and approved the final version of the manuscript for submission.

Conflicts of Interest

None declared.

  • OpenAI: Models GPT-3. OpenAI. URL: https://platform.openai.com/docs/models/gpt-4 [accessed 2024-05-07]
  • Shahsavar Y, Choudhury A. User intentions to use ChatGPT for self-diagnosis and health-related purposes: cross-sectional survey study. JMIR Hum Factors. 2023;10:e47564. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Editorial N. Tools such as ChatGPT threaten transparent science; here are our ground rules for their use. Nature. 2023;613(7945):612. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Moons P, Van Bulck L. ChatGPT: can artificial intelligence language models be of value for cardiovascular nurses and allied health professionals. Eur J Cardiovasc Nurs. 2023;22(7):e55-e59. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Aljanabi M, Ghazi M, Ali AH, Abed SA. ChatGpt: open possibilities. Iraqi J Sci Comput Sci Math. 2023;4(1):62-64. [ CrossRef ]
  • D'Amico RS, White TG, Shah HA, Langer DJ. I asked a ChatGPT to write an editorial about how we can incorporate chatbots into neurosurgical research and patient care…. Neurosurgery. 2023;92(4):663-664. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Holzinger A, Keiblinger K, Holub P, Zatloukal K, Müller H. AI for life: trends in artificial intelligence for biotechnology. N Biotechnol. 2023;74:16-24. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Sharma G, Thakur A. ChatGPT in drug discovery. ChemRxiv. 2023. URL: https://chemrxiv.org/engage/chemrxiv/article-details/63d56c13ae221ab9b240932f [accessed 2024-05-07]
  • Mann DL. Artificial intelligence discusses the role of artificial intelligence in translational medicine: an interview with ChatGPT. JACC Basic Transl Sci. 2023;8(2):221-223. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Chen TJ. ChatGPT and other artificial intelligence applications speed up scientific writing. J Chin Med Assoc. 2023;86(4):351-353. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health. 2023;5(3):e105-e106. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Shen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G, et al. ChatGPT and other large language models are double-edged swords. Radiology. 2023;307(2):e230163. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lubowitz JH. ChatGPT, an artificial intelligence chatbot, is impacting medical literature. Arthroscopy. 2023;39(5):1121-1122. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Jianning L, Amin D, Jens K, Jan E. ChatGPT in healthcare: a taxonomy and systematic review. medRxiv. Preprint posted online on March 30, 2023. [ CrossRef ]
  • Choudhury A, Shamszare H. Investigating the impact of user trust on the adoption and use of ChatGPT: survey analysis. J Med Internet Res. 2023;25:e47184. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Hu PJ, Chau PYK, Sheng ORL, Tam KY. Examining the technology acceptance model using physician acceptance of telemedicine technology. J Manage Inform Syst. 2015;16(2):91-112. [ FREE Full text ] [ CrossRef ]
  • Williams MD, Rana NP, Dwivedi YK. The unified theory of acceptance and use of technology (UTAUT): a literature review. J Enterp Inf Manag. 2015;28(3):443-448. [ FREE Full text ] [ CrossRef ]
  • McKnight DH, Choudhury V, Kacmar C. The impact of initial consumer trust on intentions to transact with a web site: a trust building model. J Strateg Inf Syst. 2002;11(3-4):297-323. [ FREE Full text ] [ CrossRef ]
  • Hart SG, Staveland LE. Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. Adv Psychol. 1988;52:139-183. [ FREE Full text ] [ CrossRef ]
  • Delone WH, McLean ER. The DeLone and McLean model of information systems success: a ten-year update. J Manag Inf Syst. 2014;19(4):9-30. [ FREE Full text ] [ CrossRef ]
  • Featherman MS, Pavlou PA. Predicting e-services adoption: a perceived risk facets perspective. Int J Hum Comput Stud. 2003;59(4):451-474. [ FREE Full text ] [ CrossRef ]
  • Hair JF, Sarstedt M, Ringle CM, Gudergan SP. Advanced Issues in Partial Least Squares Structural Equation Modeling. 2nd Edition. Thousand Oaks, CA. SAGE Publications; 2017.
  • Eriksson K, Hermansson C, Jonsson S. The performance generating limitations of the relationship-banking model in the digital era—effects of customers' trust, satisfaction, and loyalty on client-level performance. Int J Bank Mark. 2020;38(4):889-916. [ FREE Full text ] [ CrossRef ]
  • Al-Ansi A, Olya HGT, Han H. Effect of general risk on trust, satisfaction, and recommendation intention for halal food. Int J Hosp Manag. 2019;83:210-219. [ FREE Full text ] [ CrossRef ]
  • Inegbedion H, Inegbedion E, Peter A, Harry L. Perception of workload balance and employee job satisfaction in work organisations. Heliyon. 2020;6(1):e03160. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Marikyan D, Papagiannidis S, Rana OF, Ranjan R, Morgan G. "Alexa, let’s talk about my productivity": the impact of digital assistants on work productivity. J Bus Res. 2022;142:572-584. [ FREE Full text ] [ CrossRef ]
  • FACT SHEET: Biden-⁠Harris Administration takes new steps to advance responsible artificial intelligence research, development, and deployment. The White House. 2023. URL: https://tinyurl.com/bdfnb97b [accessed 2024-05-07]
  • Tabassi E. Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology. 2023. URL: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf [accessed 2024-05-07]
  • Sato T, Yamani Y, Liechty M, Chancey ET. Automation trust increases under high-workload multitasking scenarios involving risk. Cogn Tech Work. 2019;22(2):399-407. [ FREE Full text ] [ CrossRef ]
  • Karpinsky ND, Chancey ET, Palmer DB, Yamani Y. Automation trust and attention allocation in multitasking workspace. Appl Ergon. 2018;70:194-201. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Gontar P, Homans H, Rostalski M, Behrend J, Dehais F, Bengler K. Are pilots prepared for a cyber-attack? A human factors approach to the experimental evaluation of pilots' behavior. J Air Transp Manag. 2018;69:26-37. [ FREE Full text ] [ CrossRef ]
  • Tentama F, Rahmawati PA, Muhopilah P. The effect and implications of work stress and workload on job satisfaction. Int J Sci Technol Res. 2019;8(11):2498-2502. [ FREE Full text ]
  • Kim C, Mirusmonov M, Lee I. An empirical examination of factors influencing the intention to use mobile payment. Comput Human Behav. 2010;26(3):310-322. [ FREE Full text ] [ CrossRef ]
  • Holland P, Tham TL, Sheehan C, Cooper B. The impact of perceived workload on nurse satisfaction with work-life balance and intention to leave the occupation. Appl Nurs Res. 2019;49:70-76. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Nguyen TM, Malik A. A two-wave cross-lagged study on AI service quality: the moderating effects of the job level and job role. Br J Manag. 2021;33(3):1221-1237. [ FREE Full text ] [ CrossRef ]
  • Kumar A, Adlakaha A, Mukherjee K. The effect of perceived security and grievance redressal on continuance intention to use M-wallets in a developing country. Int J Bank Mark. 2018;36(7):1170-1189. [ FREE Full text ] [ CrossRef ]
  • Chen X, Li S. Understanding continuance intention of mobile payment services: an empirical study. J Comput Inf Syst. 2016;57(4):287-298. [ FREE Full text ] [ CrossRef ]
  • Fang Y, Qureshi I, Sun H, McCole P, Ramsey E, Lim KH. Trust, satisfaction, and online repurchase intention. MIS Q. 2014;38(2):407-428. [ FREE Full text ]
  • Gu Z, Wei J, Xu F. An empirical study on factors influencing consumers' initial trust in wearable commerce. J Comput Inf Syst. 2015;56(1):79-85. [ CrossRef ]
  • Oliveira T, Faria M, Thomas MA, Popovič A. Extending the understanding of mobile banking adoption: when UTAUT meets TTF and ITM. Int J Inform Manage. 2014;34(5):689-703. [ FREE Full text ] [ CrossRef ]
  • Mostafa RB, Kasamani T. Antecedents and consequences of chatbot initial trust. Eur J Mark. 2021;56(6):1748-1771. [ FREE Full text ] [ CrossRef ]
  • Wang X, Lin X, Shao B. Artificial intelligence changes the way we work: a close look at innovating with chatbots. J Assoc Inf Sci Technol. 2022;74(3):339-353. [ FREE Full text ] [ CrossRef ]
  • Hsiao KL, Chen CC. What drives continuance intention to use a food-ordering chatbot? An examination of trust and satisfaction. Libr Hi Tech. 2021;40(4):929-946. [ FREE Full text ] [ CrossRef ]
  • Pesonen JA. ‘Are You OK?’ Students’ trust in a chatbot providing support opportunities. Springer; 2021. Presented at: Learning and Collaboration Technologies: Games and Virtual Environments for Learning: 8th International Conference, LCT 2021, Held as Part of the 23rd HCI International Conference, HCII 2021; July 24-29, 2021; Virtual Event. [ CrossRef ]
  • Dwivedi YK, Kshetri N, Hughes L, Slade EL, Jeyaraj A, Kar AK, et al. Opinion paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inform Manage. 2023;71:102642. [ FREE Full text ] [ CrossRef ]
  • Silva SC, De Cicco R, Vlačić B, Elmashhara MG. Using chatbots in e-retailing—how to mitigate perceived risk and enhance the flow experience. Int J Retail Distrib Manag. 2022;51(3):285-305. [ FREE Full text ] [ CrossRef ]
  • Nordheim CB, Følstad A, Bjørkli CA. An initial model of trust in chatbots for customer service? Findings from a questionnaire study. Interact Comput. 2019;31(3):317-335. [ FREE Full text ] [ CrossRef ]

Abbreviations

Edited by A Kushniruk, E Borycki; submitted 11.12.23; peer-reviewed by P Radanliev, G Farid; comments to author 17.01.24; revised version received 25.03.24; accepted 07.04.24; published 27.05.24.

©Avishek Choudhury, Hamid Shamszare. Originally published in JMIR Human Factors (https://humanfactors.jmir.org), 27.05.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Human Factors, is properly cited. The complete bibliographic information, a link to the original publication on https://humanfactors.jmir.org, as well as this copyright and license information must be included.

Pilot study on large language models for risk-of-bias assessments in systematic reviews: A(I) new type of bias?

  • Joseph Barsby 1,2 (http://orcid.org/0000-0001-7989-6994)
  • Samuel Hume 1,3 (http://orcid.org/0000-0003-4417-0370)
  • Hamish AL Lemmey 1,4 (http://orcid.org/0000-0002-7679-9125)
  • Joseph Cutteridge 5,6 (http://orcid.org/0000-0002-2867-7393)
  • Regent Lee 7 (http://orcid.org/0000-0001-8621-5165)
  • Katarzyna D Bera 3,7 (http://orcid.org/0000-0003-3795-6762)
  • 1 Oxford Medical School, Oxford, UK
  • 2 Newcastle Upon Tyne Hospitals NHS Foundation Trust, Newcastle Upon Tyne, UK
  • 3 University of Oxford St Anne's College, Oxford, UK
  • 4 University of Oxford Magdalen College, Oxford, UK
  • 5 York and Scarborough Teaching Hospitals NHS Foundation Trust, York, UK
  • 6 Hull University Teaching Hospitals NHS Trust, Hull, UK
  • 7 Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK
  • Correspondence to Dr Katarzyna D Bera, University of Oxford Nuffield Department of Surgical Sciences, Oxford, UK; katarzyna.bera{at}st-annes.ox.ac.uk

https://doi.org/10.1136/bmjebm-2024-112990



Risk-of-bias (RoB) assessment is used to assess randomised control trials for systematic errors. Developed by Cochrane, it is considered the gold standard of assessing RoB for studies included within systematic reviews, representing a key part of Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. 1 The RoB tool comprises six domains that may signify bias: random sequence generation, allocation concealment, blinding of participants and personnel, attrition bias, reporting bias and other potential biases. 2 This assessment is an integral component of evaluating the quality of evidence; however, it is a time-consuming and labour-intensive process.

Large language models (LLMs) are a form of generative artificial intelligence (AI) trained on large volumes of data. ChatGPT is an LLM developed by OpenAI, capable of generating a wide variety of outputs in response to user prompts. Concerns exist around the application of such AI tools in research, including ethical, copyright, plagiarism and cybersecurity risks. 3 However, LLMs are increasingly popular with investigators seeking to streamline analyses. Studies have begun investigating the potential role of LLMs in the RoB assessment process. 4 5 Given the flexibility and rapidly evolving nature of LLMs, our goal was to explore whether ChatGPT could be used to automate the RoB assessment process without sacrificing accuracy. This study offers an assessment of the applicability of LLMs in systematic reviews (SRs) as of December 2023.

This study sits within an SR (PROSPERO CRD420212479050). Two reviewers (SH and HALL) performed RoB assessment on n=15 full-length papers in portable document format (PDF) ( table 1 ). Six domains were assessed independently, alongside an added ‘overall assessment’ domain, with each ranked as high, low or unclear RoB. Alongside the RoB assessment, reviewers recorded author name, DOI and publication year using a Microsoft Office form. Any conflicts were resolved by discussion, with a third reviewer (KDB) available for arbitration, although this was not required.


Table 1. List of included papers (n=15), numbered 1–15

In parallel, a fourth reviewer (JB) uploaded the same PDF files to ChatPDF.com, a plug-in powered by ChatGPT3.5 that facilitates the upload of PDF files for analysis. ChatGPT3.5 was then prompted to assess the RoB within each paper, with prompts pertaining to each of the six domains. Responses were recorded in a separate, but identical, Microsoft Office form. Reviewer 4 (JB) was blinded to the assessments of reviewers 1 and 2 throughout. JB then repeated this process using ChatGPT4.
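
For readers curious how a similar workflow might be reproduced programmatically, the hedged sketch below queries one RoB domain over pre-extracted trial text. It is an assumption-laden illustration only: the study used the ChatPDF.com web interface rather than an API, and the client library, model name, prompt wording and the assess_domain helper are hypothetical choices, not the authors’ setup.

```python
# Hypothetical sketch only: the study used the ChatPDF.com web interface,
# not an API. This shows how one RoB domain might be queried from
# pre-extracted trial text using the OpenAI Python client (an assumption).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROB_DOMAINS = [
    "random sequence generation",
    "allocation concealment",
    "blinding of participants and personnel",
    "attrition bias",
    "reporting bias",
    "other potential biases",
]

def assess_domain(trial_text: str, domain: str) -> str:
    """Ask the model to rate one Cochrane RoB domain as high, low or unclear."""
    prompt = (
        f"Assess the risk of bias for the domain '{domain}' in the randomised "
        "controlled trial below. Answer with exactly one of: high, low, unclear, "
        "followed by a one-sentence justification.\n\n"
        f"{trial_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage (trial_text would come from a separate PDF text-extraction step):
# print(assess_domain(trial_text, ROB_DOMAINS[0]))
```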

Decisions across all domains were compared in terms of the percentages of concurrent, opposite and indeterminable decisions. All data were analysed and stored in Microsoft Excel. Gemini was trialled but was unable to perform a RoB assessment in its current form.
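
As an illustration of this comparison step, the snippet below tallies concurrent, opposite and indeterminable decisions between a human rater and an LLM. The labels and toy data are assumptions made here for clarity; the actual comparison in the study was performed in Microsoft Excel.

```python
# Illustrative tally of agreement categories (toy data, not study data).
from collections import Counter

human = ["low", "high", "unclear", "low", "high", "low"]
llm = ["low", "low", "indeterminate", "low", "unclear", "indeterminate"]

def classify(h: str, l: str) -> str:
    """Bucket one paired decision into an agreement category."""
    if l == "indeterminate":
        return "indeterminate"        # LLM could not reach a usable decision
    if l == h:
        return "concurrent"           # same judgement as the human rater
    if {h, l} == {"low", "high"}:
        return "opposite"             # directly contradictory judgement
    return "other disagreement"       # e.g. unclear vs a definite rating

counts = Counter(classify(h, l) for h, l in zip(human, llm))
total = len(human)
for category, n in counts.items():
    print(f"{category}: {n}/{total} ({100 * n / total:.1f}%)")
```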

In total, n=105 decisions were undertaken by GPT3.5, GPT4 and human reviewers. ChatGPT3.5 was concurrent with human (gold standard) assessment in 41/105 (39.0%) decisions and disagreed on 10/105 (9.5%). ChatGPT4 agreed with human assessment on 34/105 (34.3%) decisions and disagreed on 15/105 (14.3%). Both ChatGPT3.5 and ChatGPT4 delivered an indeterminate response on 54/105 (51.4%) decisions. ChatGPT3.5 outperformed or matched ChatGPT4 in 6/7 (85.7%) domains (all except selective reporting), performing best in assessing sequence generation and completeness of data, with 8/15 (53.3%) of decisions concurrent with human assessment. ChatGPT4 performed best in assessing selective reporting, returning a correct decision in 14/15 (93.3%) cases versus 10/15 (66.7%) for ChatGPT3.5. Results by domain are summarised in table 2 .

Table 2. Summary of risk-of-bias assessment outcome per domain assessed by ChatGPT3.5 and ChatGPT4 compared with human assessment

When assessing Karanikolas (2011) and Mann (1983), ChatGPT3.5 returned assessments of ‘moderate’ and ‘low to moderate’, respectively, in the overall domain; both were recorded as ‘unclear’ as a substitute. On one occasion, ChatGPT identified an author as ‘P. C. Weaver’, who is not an author on any included paper. When discussing Lastoria (2006), ChatGPT initially responded in Portuguese.

We explored the potential of ChatGPT as a tool for assessing RoB. Overall, ChatGPT demonstrated moderate agreement and minor disagreement with gold standard (human) assessment. While encouraging, this suboptimal performance precludes us from recommending that ChatGPT be used in real-world RoB assessment.

To emulate end-users, prompts were not standardised, and some questions were repeated to ensure accuracy. When responses were ‘moderate’, ChatGPT was prompted to reassess. Similarly, ChatGPT often declined to perform assessments, which required further prompting with alternate question structure or wording. When prompted to assess allocation concealment for Choksy (2006), ChatGPT summarised the process as follows: ‘randomisation was performed using a random number table and randomisation details were placed in sealed envelopes which were opened in the operating theatre’. Human assessment ranked this as low RoB, whereas ChatGPT ranked this as unclear stating ‘it [was] unclear whether these envelopes were opaque’, demonstrating a literal interpretation of the literature not seen in human assessment.

In this analysis, ChatGPT4 did not improve on ChatGPT3.5 in any domain aside from selective reporting, where ChatGPT4 returned a correct decision in 14/15 (93.3%) cases and ChatGPT3.5 in 10/15 (66.7%). However, human assessment returned a decision of unclear on 14/15 of these assessments. ChatGPT4’s inability to give a definitive assessment is perhaps best illustrated by its response: ‘To make a definitive assessment, one would ideally compare the outcomes reported in the paper with those specified in the study’s protocol registration’. This could represent an improvement through appreciation of the wider context of a given paper.

Our findings have limitations: the sample of included papers was small and fairly homogeneous. Additionally, we chose not to use standardised prompts, and prompt variability has been demonstrated to introduce output variability. 5 Future research could benefit from a larger and more diverse dataset, with standardised prompts, to assess and improve the consistency of responses.


Ethics statements

Patient consent for publication

Not applicable.

Ethics approval

  • McKenzie JE ,
  • Bossuyt PM , et al
  • Higgins JPT ,
  • Altman DG ,
  • Gøtzsche PC , et al
  • Arshad HB ,
  • Khan SU , et al
  • Talukdar JR

Supplementary materials

Supplementary data.

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

  • Data supplement 1

X @scullingmonkey

Contributors Conception of study: JB and KDB. Data collection: JB, JC, KDB, SH, HALL. Data analysis: JB, JC and KDB. Data interpretation: JB, KDB and LR. Drafting of paper: JB, JC and KDB. Approval of final version: all authors. ChatGPT is the subject of this study and was only used as described in methods and results. ChatGPT was not used in analysing, drafting or rewriting of the paper.

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Provenance and peer review Not commissioned; externally peer reviewed.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.



X-Risk Analysis for AI Research


2022, arXiv (Cornell University)

Related Papers

Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

AI advances represent a great technological opportunity, but also possible perils. This paper undertakes an ethical and systematic evaluation of those risks in a pragmatic, analytical form of questions, which we term ‘Conceptual AI Risk analysis’. We then look at a topical case example in an actual industrial setting and apply that methodology in outline. The case involves deep learning black boxes and their risk issues in an environment that requires compliance with legal rules and industry best practices. We examine a technological means of attempting to solve the black-box problem for this case, referred to as “Really Useful Machine Learning” (RUMLSM). DARPA has identified such cases as being the “Third Wave of AI.” Conclusions about its efficacy are drawn.


Martin Ciupa

Aryeh Englander

The Next Wave of Sociotechnical Design

Per Rådberg Nagbøl

AI &amp; SOCIETY

Claudio Novelli

The EU Artificial Intelligence Act (AIA) defines four risk categories: unacceptable, high, limited, and minimal. However, as these categories statically depend on broad fields of application of AI, the risk magnitude may be wrongly estimated, and the AIA may not be enforced effectively. This problem is particularly challenging when it comes to regulating general-purpose AI (GPAI), which has versatile and often unpredictable applications. Recent amendments to the compromise text, though introducing context-specific assessments, remain insufficient. To address this, we propose applying the risk categories to specific AI scenarios, rather than solely to fields of application, using a risk assessment model that integrates the AIA with the risk approach arising from the Intergovernmental Panel on Climate Change (IPCC) and related literature. This integrated model enables the estimation of AI risk magnitude by considering the interaction between (a) risk determinants, (b) individual driv...

arXiv (Cornell University)

Premkumar Devanbu

Social Science Research Network

ALEX R MATHEW

AI technology is integrated with the hope of reducing human errors, improving efficiency, and providing faster results. Nonetheless, research indicates that AI may become hazardous over time and fail to perform its intended roles. To recognize the perilous effects of AI, the study draws on the dynamic programming theory developed by Richard Bellman. The study argues that while AI is programmed to do a useful task, a malfunction may lead to negative consequences for the entire system. It may also bring indirect consequences such as traffic jams. The best way to explain this correlation is via dynamic programming theory, which asserts that as systems become complex, they may sway from their targeted objective. Computer-related dangers may emanate from the cognitive complexity of the system as opposed to the specific functions the system was meant to perform. The immediate risks of AI are autonomous weapons, although other long-term effects, such as the gradual replacement of humans by robots and AI drug addiction, may also result.

BRAIN. Broad Research in Artificial Intelligence and Neuroscience ISSN 2067-3957

Academia EduSoft

Nowadays, there is serious anxiety about the existence of dangerous intelligent systems, and it is not just a science-fiction idea of evil machines like the ones in the well-known Terminator movie or other films featuring intelligent robots that threaten the existence of humankind. Thus, there is great interest in alternative research under the topics of Machine Ethics and Artificial Intelligence Safety, and in associated research topics such as the Future of Artificial Intelligence and Existential Risks. The objective of this study is to provide a general discussion of these research topics and to try to find some answers to the question 'Are we safe enough in the future of Artificial Intelligence?'. In detail, the discussion includes a comprehensive focus on 'dystopic' scenarios, enables interested researchers to think about some 'moral dilemmas', and finally offers some ethical outputs worth considering when developing good intelligent systems. From a general perspective, the discussion taken up here is a good opportunity to improve awareness of these remarkable research topics, which are associated not only with Artificial Intelligence but also with many other natural and social sciences that play a role for humankind.




