
Database Search

What is Database Search?

Harvard Library licenses hundreds of online databases, giving you access to academic and news articles, books, journals, primary sources, streaming media, and much more.

The contents of these databases are only partially included in HOLLIS. To make sure you're really seeing everything, you need to search in multiple places. Use Database Search to identify and connect to the best databases for your topic.

In addition to digital content, you will find specialized search engines used in specific scholarly domains.



Distributed Database Systems: Recently Published Documents


ER-Store: A Hybrid Storage Mechanism with Erasure Coding and Replication in Distributed Database Systems

In distributed database systems, efficiency and availability become critical considerations as cluster scales grow. A common approach to high availability is replication, but it is inefficient because of its low storage utilization. Erasure coding can provide data reliability while maintaining high storage utilization; however, the large amount of CPU-intensive encoding and decoding it requires makes it unsuitable for frequently updated data. To optimize storage efficiency without affecting data availability, this paper proposes a data temperature recognition algorithm that classifies data tablets as cold, warm, or hot according to their access frequency. Combining three-way replication with erasure coding, the paper proposes ER-Store, a hybrid storage mechanism for the different data types. It also uses the read-write separation architecture of the distributed database system to design a data temperature conversion cycle, which reduces the computational overhead caused by frequent updates under erasure coding. The design is implemented on the CBase database system, which is based on a read-write separation architecture, and experimental results show that it saves 14.6%–18.3% of storage space while preserving efficient access performance.
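As a rough illustration of the cold/warm/hot idea, the sketch below classifies tablets by access frequency and picks a storage scheme. The thresholds and the mapping from temperature to scheme are illustrative assumptions, not values taken from the ER-Store paper.

```python
def classify_temperature(accesses_per_hour, hot=1000, warm=100):
    """Toy data-temperature rule; the thresholds are illustrative only."""
    if accesses_per_hour >= hot:
        return "hot"
    if accesses_per_hour >= warm:
        return "warm"
    return "cold"

def storage_scheme(temperature):
    """Assumed mapping: frequently updated (hot/warm) tablets keep three replicas,
    while cold tablets are converted to erasure coding for better storage utilization."""
    return "three-replica" if temperature in ("hot", "warm") else "erasure-coded"

# example: a tablet read only a few times an hour would be stored erasure-coded
print(storage_scheme(classify_temperature(5)))
```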

Efficiently Supporting Adaptive Multi-Level Serializability Models in Distributed Database Systems

A Review on Fault Tolerance in Distributed Databases

In this paper, we study the different types of fault tolerance techniques used in various distributed database systems. The main focus of this research is how the data are stored in the servers, the fault detection techniques, and the recovery techniques used. A fault can occur for many reasons, for example system failure, resource failure, or network failure between servers. These faults must be addressed to make sure the system can work smoothly without any problem. A proper failure detector and a reliable fault tolerance technique can prevent data loss and keep the system from failing.

A Survey On Fragmentation In Distributed Database Systems

Abstract One of the most critical aspects of distributed database design and management is fragmentation. If the fragmentation is done properly, we can expect better throughput from such systems. The primary concern of DBMS design is the fragmentation and allocation of the underlying database. Distributing data across the various sites of a computer network requires making proper fragmentation and placement decisions. The first phase in distributing a database is fragmentation, which clusters information into fragments. This is followed by the allocation phase, which distributes, and if necessary replicates, the generated fragments among the nodes of a computer network. The use of data fragmentation to improve performance is not new and appears commonly in the file design and optimization literature. The efficient functioning of any distributed database system depends heavily on a proper design in terms of the adopted fragmentation and allocation methods. Large, global databases are fragmented by dividing the database horizontally, vertically, or by a combination of both. To enable distributed database systems to work efficiently, the fragments have to be allocated across the available sites in a way that reduces the communication cost of data. In this article, we describe and review the existing methods of database fragmentation. Finally, we conclude with suggestions for using machine learning to solve the problem of overlapping fragments.
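To make the horizontal/vertical distinction concrete, here is a minimal sketch (table contents and column names are invented for illustration): horizontal fragmentation selects whole rows by a predicate, while vertical fragmentation projects a subset of columns and keeps the key so fragments can be rejoined.

```python
customers = [
    {"id": 1, "region": "EU", "name": "Ada", "balance": 120.0},
    {"id": 2, "region": "US", "name": "Bob", "balance": 80.0},
]

def horizontal_fragment(rows, predicate):
    """All columns, but only the rows matching the predicate (e.g. one site per region)."""
    return [row for row in rows if predicate(row)]

def vertical_fragment(rows, columns, key="id"):
    """All rows, but only selected columns; the key is kept so fragments can be joined back."""
    keep = [key] + columns
    return [{c: row[c] for c in keep} for row in rows]

eu_rows = horizontal_fragment(customers, lambda r: r["region"] == "EU")
name_cols = vertical_fragment(customers, ["name"])
```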

Survey on Deadlocks in Distributed Database Systems

Other recently published titles include:

  • A New Multi-Resource Deadlock Detection Algorithm Using Directed Graph Requests in Distributed Database Systems
  • A Communication-Induced Checkpointing Algorithm for Consistent-Transaction in Distributed Database Systems
  • RDMA-Based Performance Optimization on Distributed Database Systems: A Case Study with GoldenX
  • Formal Development of Fault Tolerance by Replication of Distributed Database Systems
  • Distributed Database Systems: The Case for NewSQL


Database Management Systems (DBMS)

Database group website: db.cs.berkeley.edu

Declarative languages and runtime systems

Design and implementation of declarative programming languages with applications to distributed systems, networking, machine learning, metadata management, and interactive visualization; design of query interface for applications.

Scalable data analysis and query processing

Scalable data processing in new settings, including interactive exploration, metadata management, cloud and serverless environments, and machine learning; query processing on compressed, semi-structured, and streaming data; query processing with additional constraints, including fairness, resource utilization, and cost.

Consistency, concurrency, coordination and reliability

Coordination avoidance, consistency and monotonicity analysis; transaction isolation levels and protocols; distributed analytics and data management, geo-replication; fault tolerance and fault injection.

Data storage and physical design

Hot and cold storage; immutable data structures; indexing and data skipping; versioning; new data types; implications of hardware evolution.

Metadata management

Data lineage and versioning; usage tracking and collective intelligence; scalability of metadata management services; metadata representations; reproducibility and debugging of data pipelines.

Systems for machine learning and model management

Distributed machine learning and graph analytics; physical and logical optimization of machine learning pipelines; online model management and maintenance; prediction serving; real-time personalization; latency-accuracy tradeoffs and edge computing for large-scale models; machine learning lifecycle management.

Data cleaning, data transformation, and crowdsourcing

Human-data interaction including interactive transformation, query authoring, and crowdsourcing; machine learning for data cleaning; statistical properties of data cleaning pipelines; end-to-end systems for crowdsourcing.

Interactive data exploration and visualization

Interactive querying and direct manipulation; scalable spreadsheets and data visualization; languages and interfaces for interactive exploration; progressive query visualization; predictive interaction.

Secure data processing

Data processing under homomorphic encryption; data compression and encryption; differential privacy; oblivious data processing; databases in secure hardware enclaves.

Foundations of data management

Optimal trade-offs between storage, quality, latency, and cost, with applications to crowdsourcing, distributed data management, stream data processing, version management; expressiveness, complexity, and completeness of data representations, query languages, and query processing; query processing with fairness constraints.

Research Centers

  • EPIC Data Lab
  • Sky Computing Lab

Faculty

  • Alvin Cheung
  • Natacha Crooks
  • Joseph Gonzalez
  • Joseph M. Hellerstein (coordinator)
  • Jiantao Jiao
  • Aditya Parameswaran
  • Matei Zaharia
  • Eric Brewer
  • Michael Lustig
  • Jelani Nelson

Faculty Awards

  • ACM Prize in Computing: Eric Brewer, 2009.
  • National Academy of Engineering (NAE) Member: Ion Stoica, 2024. Eric Brewer, 2007.
  • American Academy of Arts and Sciences Member: Eric Brewer, 2018.
  • Sloan Research Fellow: Aditya Parameswaran, 2020. Alvin Cheung, 2019. Jelani Nelson, 2017. Michael Lustig, 2013. Ion Stoica, 2003. Joseph M. Hellerstein, 1998. Eric Brewer, 1997.

Related Courses

  • CS 186. Introduction to Database Systems
  • CS 262A. Advanced Topics in Computer Systems


10 Current Database Research Topic Ideas in 2024


As we head towards the second half of 2024, the world of technology continues to evolve at a rapid pace. With the rise of AI and blockchain, the demand for data, for its management, and for its security is growing rapidly. A logical consequence of these changes is that fields like database security and DBMS research have become the need of the hour.

With new technologies and techniques emerging day-by-day, staying up-to-date with the latest trends in database research topics is crucial. Whether you are a student, researcher, or industry professional, we recommend taking our Database Certification courses to stay current with the latest research topics in DBMS.

In this blog post, we will introduce you to 10 current database research topic ideas that are likely to be at the forefront of the field in 2024. From blockchain-based database systems to real-time data processing with in-memory databases, these topics offer a glimpse into the exciting future of database research.

So, get ready to dive into the exciting world of databases and discover the latest developments in database research topics of 2024!

Blurring the Lines between Blockchains and Database Systems 

The intersection of blockchain technology and database systems offers fertile new ground for anyone interested in database research.

As blockchain gains popularity, many thesis topics in DBMS[1] explore ways to integrate the two fields, and this research is likely to yield innovative solutions for data management. Here are three ways in which these technologies are being combined to create powerful new solutions:

Immutable Databases: By leveraging blockchain technology, it is possible to create databases that are immutable: once data has been added, it cannot be modified or deleted. This is particularly useful in situations where data integrity is critical, such as financial transactions or supply chain management. A small code sketch of this idea follows the third item below.

Decentralized Databases: Blockchain technology enables the creation of decentralized databases. Here data is stored on a distributed network of computers rather than in a central location. This can help to improve data security and reduce the risk of data loss or corruption.

Smart Contracts: Smart contracts are self-executing contracts with the terms of the agreement between buyer and seller being directly written into lines of code. By leveraging blockchain technology, it is possible to create smart contracts that are stored and executed on a decentralized database, making it possible to automate a wide range of business processes.
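The immutability described above is commonly realized by hash-chaining records so that any retroactive change breaks verification. The sketch below is a minimal, single-node illustration of that idea, not an implementation of any particular blockchain platform.

```python
import hashlib
import json

class AppendOnlyLedger:
    """Each record stores the hash of the previous record, so edits to history are detectable."""

    def __init__(self):
        self.records = []

    def _digest(self, body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, payload):
        prev_hash = self.records[-1]["hash"] if self.records else "0" * 64
        body = {"payload": payload, "prev_hash": prev_hash}
        self.records.append({**body, "hash": self._digest(body)})

    def verify(self):
        prev_hash = "0" * 64
        for rec in self.records:
            body = {"payload": rec["payload"], "prev_hash": rec["prev_hash"]}
            if rec["prev_hash"] != prev_hash or rec["hash"] != self._digest(body):
                return False
            prev_hash = rec["hash"]
        return True

ledger = AppendOnlyLedger()
ledger.append({"tx": "A pays B 10"})
ledger.append({"tx": "B pays C 4"})
assert ledger.verify()  # tampering with an earlier record would make this fail
```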

Childhood Obesity: Data Management 

Childhood obesity is a growing public health concern, with rates of obesity among children and adolescents rising around the world. To address this issue, it’s crucial to have access to comprehensive data on childhood obesity. Analyzing information on prevalence, risk factors, and interventions is a popular research topic in DBMS these days.

Effective data management is essential for ensuring that this information is collected, stored, and analyzed in a way that is useful and actionable. This is one of the hottest DBMS research paper topics. In this section, we will explore the topic of childhood obesity data management.

A key challenge to childhood obesity data management is ensuring data consistency. This is difficult as various organizations have varied methods for measuring and defining obesity. For example:

Some may use body mass index (BMI) as a measure of obesity.

Others may use waist circumference or skinfold thickness.

Another challenge is ensuring data security and preventing unauthorized access. To protect the privacy and confidentiality of individuals, it is important to ensure that appropriate safeguards are in place. This calls for database security research and its appropriate application.

Application of Computer Database Technology in Marketing

Leveraging data and analytics allows businesses to gain a competitive advantage in this digitized world today. With the rising demand for data, the use of computer databases in marketing has gained prominence.

The application of database capabilities in marketing has come into its own as one of the most popular recent research topics in DBMS[2]. In this section, we will explore how computer database technology is being applied in marketing, and the benefits this research can offer.

Customer Segmentation: Storage and analysis of customer data makes it possible to gain valuable insights. It allows businesses to identify trends in customer behavior, preferences and demographics. This information can be utilized to create highly targeted customer segments. This is how businesses can tailor their marketing efforts to specific groups of customers.

Personalization: Computer databases can be used to store and analyze customer data in real-time. In this way, businesses can personalize their marketing and offers based on individual customer preferences. This can help increase engagement and loyalty among customers, thereby driving greater revenue for businesses.

Predictive Analytics: Advanced analytics techniques such as machine learning and predictive modeling can shed light on patterns in customer behavior and can even be used to predict customers' future actions. This information can be used to create more targeted marketing campaigns, and to identify opportunities for cross-selling and upselling.

Database Technology in Sports Competition Information Management

Database technology has revolutionized the way in which sports competition information is managed and analyzed. With the increasing popularity of sports around the world, there is a growing need for effective data management systems that can collect, store, and analyze large volumes of relevant data. Thus, researching database technologies[3] is vital to streamlining operations, improving decision-making, and enhancing the overall quality of events.

Sports organizations can use database technology to collect and manage a wide range of competition-related data, such as:

  • athlete and team information,
  • competition schedules and results,
  • performance metrics, and
  • spectator feedback.

Collating this data in a distributed database lets sports organizations easily analyze and derive valuable insights. This is emerging as a key DBMS research paper topic.

Database Technology for the Analysis of Spatio-temporal Data

Spatio-temporal data refers to data that has both a geographic and a temporal component. Meteorological readings, GPS data, and social media content are prime examples of this diverse field. This data can provide valuable insights into patterns and trends across space and time. However, its multidimensional nature makes analysis especially challenging. It's no surprise that this has become a hot topic for distributed database research[4].

In this section, we will explore how database technology is being used to analyze spatio-temporal data, and the benefits this research offers.

Data Storage and Retrieval: Spatio-temporal data tends to be very high-volume. Advances in database technology are needed to make storage, retrieval and consumption of such information more efficient. A solution to this problem will make such data more available. It will then be easily retrievable and usable by a variety of data analytics tools.

Spatial Indexing: Database technology can create spatial indexes to enable faster queries on spatio-temporal data. This allows analysts to quickly retrieve data for specific geographic locations or areas of interest, and to analyze trends across these areas. A minimal sketch of this idea appears after this list.

Temporal Querying: Distributed database research can also enable analysts to analyze data over specific time periods. This facilitates the identification of patterns over time. Ultimately, this enhances our understanding of how these patterns evolve over various seasons.
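The spatial-indexing idea above can be illustrated with a toy grid index. Production systems typically use R-trees, quadtrees or geohashes, but even a fixed grid shows how an index limits a bounding-box query to the few cells it overlaps instead of scanning every point.

```python
from collections import defaultdict

class GridIndex:
    """Buckets points into fixed-size cells; a query only inspects the overlapping cells."""

    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, item, x, y):
        self.cells[self._cell(x, y)].append((x, y, item))

    def query(self, xmin, ymin, xmax, ymax):
        (cx0, cy0), (cx1, cy1) = self._cell(xmin, ymin), self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for x, y, item in self.cells.get((cx, cy), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(item)
        return hits

index = GridIndex(cell_size=0.5)
index.insert("sensor-42", x=12.34, y=56.78)
print(index.query(12.0, 56.5, 12.5, 57.0))  # -> ['sensor-42']
```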

Artificial Intelligence and Database Technology

Artificial intelligence (AI) is another sphere of technology that’s just waiting to be explored. It hints at a wealth of breakthroughs which can change the entire world. It’s unsurprising that the combination of AI with database technology is such a hot topic for database research papers[5] in modern times. 

By using AI to analyze data, organizations can identify patterns and relationships that might not be apparent through traditional data analysis methods. In this section, we will explore some of the ways in which AI and database technology are being used together. We’ll also discuss the benefits that this amalgamation can offer.

Predictive Analytics: By analyzing large volumes of organizational and business data, AI can generate predictive models to forecast outcomes. For example, AI can go through customer data stored in a database and predict who is most likely to make a purchase in the near future.

Natural Language Processing: All businesses have huge, untapped wells of valuable information in the form of customer feedback and social media posts. These types of data sources are unstructured, meaning they don’t follow rigid parameters. By using natural language processing (NLP) techniques, AI can extract insights from this data. This helps organizations understand customer sentiment, preferences and needs.

Anomaly Detection: AI can be used to analyze large volumes of data to identify anomalies and outliers. Then, a second round of analysis can be done to pinpoint potential problems or opportunities. For example, AI can analyze sensor data from manufacturing equipment and detect when equipment is operating outside of normal parameters.
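As a minimal stand-in for the ML-based detectors mentioned above, the sketch below flags sensor readings that deviate strongly from the mean; the threshold is an illustrative choice, not a recommendation.

```python
import statistics

def zscore_anomalies(readings, threshold=2.5):
    """Return indices of readings more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(readings)
    stdev = statistics.pstdev(readings) or 1.0  # avoid division by zero on constant data
    return [i for i, r in enumerate(readings) if abs(r - mean) / stdev > threshold]

vibration = [0.9, 1.0, 1.1, 1.0, 0.9, 1.0, 9.5, 1.1]
print(zscore_anomalies(vibration))  # -> [6], the spike in the sensor trace
```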

Data Collection and Management Techniques of a Qualitative Research Plan

Any qualitative research calls for the collection and management of empirical data. A crucial part of the research process, this step benefits from good database management techniques. Let’s explore some thesis topics in database management systems[6] to ensure the success of a qualitative research plan.

Interviews: This is one of the most common methods of data collection in qualitative research. Interviews can be conducted in person, over the phone, or through video conferencing. A standardized interview guide ensures the data collected is reliable and accurate. Relational databases, with their inherent structure, aid in this process. They are a way to enforce structure onto the interviews’ answers.

Focus Groups: Focus groups involve gathering a small group of people to discuss a particular topic. These generate rich data by allowing participants to share their views in a group setting. It is important to select participants who have knowledge or experience related to the research topic.

Observations: Observations involve observing and recording events in a given setting. These can be conducted openly or covertly, depending on the research objective and setting. To ensure that the data collected is accurate, it is important to develop a detailed observation protocol that outlines what behaviors or events to observe, how to record data, and how to handle ethical issues.

Database Technology in Video Surveillance System 

Video surveillance systems are used to monitor and secure public spaces, workplaces, and even homes. With the increasing demand for such systems, it's important to have an efficient and reliable way to store, manage and analyze the data generated. This is where database topics for research papers[7] come in.

By using database technology in video surveillance systems, it is possible to store and manage large amounts of video data efficiently. Database management systems (DBMS) can be used to organize video data in a way that is easily searchable and retrievable. This is particularly important in cases where video footage is needed as evidence in criminal investigations or court cases.

In addition to storage and management, database technology can also be used to analyze video data. For example, machine learning algorithms can be applied to video data to identify patterns and anomalies that may indicate suspicious activity. This can help law enforcement agencies and security personnel to identify and respond to potential threats more quickly and effectively.

Application of Java Technology in Dynamic Web Database Technology 

Java technology has proven its flexibility, scalability, and ease of use over the decades. This makes it widely used in the development of dynamic web database applications. In this section, we will explore research topics in DBMS[8] which seek to apply Java technology in databases.

Java Server Pages (JSP): JSP is a Java technology used to create dynamic web pages that can interact with databases. It allows developers to embed Java code within HTML, thereby enabling dynamic pages that can interact with databases in real time and support data collection and maintenance.

Java Servlets: Java Servlets are Java classes used to extend the functionality of web servers. They provide a way to handle incoming requests from web browsers and generate dynamic content that can interact with databases.

Java Database Connectivity (JDBC): JDBC is a Java API that provides a standard interface for accessing databases. It allows Java applications to connect to databases and issue SQL queries to read, modify or manage the backend database, enabling developers to create dynamic web applications.

Online Multi Module Educational Administration System Based on Time Difference Database Technology 

With the widespread adoption of remote learning post-COVID, online educational systems are gaining popularity at a rapid pace. A ubiquitous challenge these systems face is managing multiple modules across different time zones. This is one of the latest research topics in database management systems[9].

Time difference database technology is designed to handle time zone differences in online systems. By leveraging this, it’s possible to create a multi-module educational administration system that can handle users from different parts of the world, with different time zones.

This type of system can be especially useful for online universities or other educational institutions that have a global reach:

It makes it possible to schedule classes, assignments and other activities based on the user's time zone, ensuring that everyone can participate in real-time.

In addition to managing time zones, a time difference database system can also help manage student data, course materials, grades, and other important information.
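The usual way to handle the time-zone problem described above is to store every scheduled event once in UTC and convert to each user's local zone on display. A minimal sketch (the class time and zones are invented for illustration):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# one canonical record for the class, stored in UTC
class_start = datetime(2024, 9, 2, 14, 0, tzinfo=ZoneInfo("UTC"))

for user_zone in ("America/New_York", "Europe/Berlin", "Asia/Kolkata"):
    local = class_start.astimezone(ZoneInfo(user_zone))
    print(f"{user_zone}: {local:%Y-%m-%d %H:%M}")
```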

Why is it Important to Study Databases?

Databases are the backbone of many modern technologies and applications, making it essential for professionals in various fields to understand how they work. Whether you're a software developer, data analyst or a business owner, understanding databases is critical to success in today's world. Here are a few reasons why it is important to study databases and why more database research topics should be published:

Efficient Data Management

Databases enable the efficient storage, organization, and retrieval of data. By studying databases, you can learn how to design and implement effective data management systems that can help organizations store, analyze, and use data efficiently.

Improved Decision-Making

Data is essential for making informed decisions, and databases provide a reliable source of data for analysis. By understanding databases, you can learn how to retrieve and analyze data to inform business decisions, identify trends, and gain insights.

Career Opportunities

In today's digital age, many career paths require knowledge of databases. By studying databases, you can open up new career opportunities in software development, data analysis, database administration and related fields.

Needless to say, studying databases is essential for anyone who deals with data. Whether you're looking to start a new career or enhance your existing skills, studying databases is a critical step towards success in today's data-driven world.

Final Takeaways

In conclusion, as you are interested in database technology, we hope this blog has given you some insights into the latest research topics in the field. From blockchain to AI, from sports to marketing, there are a plethora of exciting database topics for research papers that will shape the future of database technology.

As technology continues to evolve, it is essential to stay up-to-date with the latest trends in the field of databases. Our curated KnowledgeHut Database Certification Courses will help you stay ahead of the curve and develop new skills.

We hope this blog has inspired you to explore the exciting world of database research in 2024. Stay curious and keep learning!

Frequently Asked Questions (FAQs)

There are several examples of databases, with the five most common ones being:

  • MySQL: An open-source RDBMS commonly used in web applications.
  • Microsoft SQL Server: A popular RDBMS used in enterprise environments.
  • Oracle: A trusted commercial RDBMS known for its high scalability and security.
  • MongoDB: A NoSQL document-oriented database optimized for storing large amounts of unstructured data.
  • PostgreSQL: An open-source RDBMS offering advanced features like high concurrency and support for multiple data types.

Structured Query Language (SQL) is a high-level language designed to communicate with relational databases. It’s not a database in and of itself. Rather, it’s a language used to create, modify, and retrieve data from relational databases such as MySQL and Oracle.

A primary key is a column (or a set of columns) that uniquely identifies each row in a table. In technical terms, the primary key is a unique identifier of records. It’s used as a reference to establish relationships between various tables.
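To make the last two answers concrete, here is a small, self-contained example using SQLite through Python's standard library. SQL creates and queries the tables; the primary key uniquely identifies each row, and the second table references it to establish the relationship. Table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,   -- uniquely identifies each row
        name TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- relationship via the key
        total       REAL
    )""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (id, customer_id, total) VALUES (10, 1, 42.0)")

query = """
    SELECT c.name, o.total
    FROM orders o JOIN customers c ON o.customer_id = c.id
"""
print(conn.execute(query).fetchall())  # -> [('Ada', 42.0)]
```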


Monica Gupta

I am Monica Gupta, with 19+ years of experience in the field of training and development. I have delivered over 500 corporate trainings and have been working as a freelancer for several years. My core areas of work are Java, C++, Angular, PHP, Python, and VBA.



  • Open access
  • Published: 19 June 2024

Detecting hallucinations in large language models using semantic entropy

  • Sebastian Farquhar (ORCID: orcid.org/0000-0002-9185-6415),
  • Jannik Kossen,
  • Lorenz Kuhn &
  • Yarin Gal (ORCID: orcid.org/0000-0002-2733-2078)

Nature volume 630, pages 625–630 (2024)


  • Computer science
  • Information technology

Large language model (LLM) systems, such as ChatGPT 1 or Gemini 2 , can show impressive reasoning and question-answering capabilities but often ‘hallucinate’ false outputs and unsubstantiated answers 3 , 4 . Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents 5 or untrue facts in news articles 6 and even posing a risk to human life in medical domains such as radiology 7 . Encouraging truthfulness through supervision or reinforcement has been only partially successful 8 . Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability.


‘Hallucinations’ are a critical problem 9 for natural language generation systems using large language models (LLMs), such as ChatGPT 1 or Gemini 2 , because users cannot trust that any given output is correct.

Hallucinations are often defined as LLMs generating “content that is nonsensical or unfaithful to the provided source content” 9 , 10 , 11 but they have come to include a vast array of failures of faithfulness and factuality. We focus on a subset of hallucinations which we call ‘confabulations’ 12 for which LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed. For example, when asked a medical question “What is the target of Sotorasib?” an LLM confabulates by sometimes answering KRAS G12C (correct) and other times KRAS G12D (incorrect) despite identical instructions. We distinguish this from cases in which a similar ‘symptom’ is caused by the following different mechanisms: when LLMs are consistently wrong as a result of being trained on erroneous data such as common misconceptions 13 ; when the LLM ‘lies’ in pursuit of a reward 14 ; or systematic failures of reasoning or generalization. We believe that combining these distinct mechanisms in the broad category hallucination is unhelpful. Our method makes progress on a portion of the problem of providing scalable oversight 15 by detecting confabulations that people might otherwise find plausible. However, it does not guarantee factuality because it does not help when LLM outputs are systematically bad. Nevertheless, we significantly improve question-answering accuracy for state-of-the-art LLMs, revealing that confabulations are a great source of error at present.

We show how to detect confabulations by developing a quantitative measure of when an input is likely to cause an LLM to generate arbitrary and ungrounded answers. Detecting confabulations allows systems built on LLMs to avoid answering questions likely to cause confabulations, to make users aware of the unreliability of answers to a question or to supplement the LLM with more grounded search or retrieval. This is essential for the critical emerging field of free-form generation in which naive approaches, suited to closed vocabulary and multiple choice, fail. Past work on uncertainty for LLMs has focused on simpler settings, such as classifiers 16 , 17 and regressors 18 , 19 , whereas the most exciting applications of LLMs relate to free-form generations.

The term hallucination in the context of machine learning originally comes from filling in ungrounded details, either as a deliberate strategy 20 or as a reliability problem 4 . The appropriateness of the metaphor has been questioned as promoting undue anthropomorphism 21 . Although we agree that metaphor must be used carefully with LLMs 22 , the widespread adoption of the term hallucination reflects the fact that it points to an important phenomenon. This work represents a step towards making that phenomenon more precise.

To detect confabulations, we use probabilistic tools to define and then measure the ‘semantic’ entropy of the generations of an LLM—an entropy that is computed over meanings of sentences. High entropy corresponds to high uncertainty 23 , 24 , 25 —so semantic entropy is one way to estimate semantic uncertainties. Semantic uncertainty, the broader category of measures we introduce, could be operationalized with other measures of uncertainty, such as mutual information, instead. Entropy in free-form generation is normally hard to measure because answers might mean the same thing (be semantically equivalent) despite being expressed differently (being syntactically or lexically distinct). This causes naive estimates of entropy or other lexical variation scores 26 to be misleadingly high when the same correct answer might be written in many ways without changing its meaning.

By contrast, our semantic entropy moves towards estimating the entropy of the distribution of meanings of free-form answers to questions, insofar as that is possible, rather than the distribution over the ‘tokens’ (words or word-pieces) which LLMs natively represent. This can be seen as a kind of semantic consistency check 27 for random seed variation. An overview of our approach is provided in Fig. 1 and a worked example in Supplementary Table 1 .

Fig. 1: a, Naive entropy-based uncertainty measures variation in the exact answers, treating ‘Paris’, ‘It’s Paris’ and ‘France’s capital Paris’ as different. But this is unsuitable for language tasks for which sometimes different answers mean the same things. Our semantic entropy clusters answers which share meanings before computing the entropy. A low semantic entropy shows that the LLM is confident about the meaning. b, Semantic entropy can also detect confabulations in longer passages. We automatically decompose a long generated answer into factoids. For each factoid, an LLM generates questions to which that factoid might have been the answer. The original LLM then samples M possible answers to these questions. Finally, we compute the semantic entropy over the answers to each specific question, including the original factoid. Confabulations are indicated by high average semantic entropy for questions associated with that factoid. Here, semantic entropy classifies Fact 1 as probably not a confabulation because generations often mean the same thing, despite very different wordings, which a naive entropy would have missed.

Intuitively, our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally 28 . That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster. We measure entailment using both general-purpose LLMs and natural language inference (NLI) tools developed specifically for detecting entailment for which we show direct evaluations in Supplementary Tables 2 and 3 and Supplementary Fig. 1 . Textual entailment has previously been shown to correlate with faithfulness 10 in the context of factual consistency 29 as well as being used to measure factuality in abstractive summarization 30 , especially when applied at the right granularity 31 .

Semantic entropy detects confabulations in free-form text generation across a range of language models and domains, without previous domain knowledge. Our evaluations cover question answering in trivia knowledge (TriviaQA 32 ), general knowledge (SQuAD 1.1; ref. 33 ), life sciences (BioASQ 34 ) and open-domain natural questions (NQ-Open 35 ) derived from actual queries to Google Search 36 . In addition, semantic entropy detects confabulations in mathematical word problems (SVAMP 37 ) and in a biography-generation dataset, FactualBio, accompanying this paper.

Our results for TriviaQA, SQuAD, BioASQ, NQ-Open and SVAMP are all evaluated context-free and involve sentence-length answers (96 ± 70 characters, mean ± s.d.) and use LLaMA 2 Chat (7B, 13B and 70B parameters) 38 , Falcon Instruct (7B and 40B) 39 and Mistral Instruct (7B) 40 . In the Supplementary Information , we further consider short-phrase-length answers. Results for FactualBio (442 ± 122 characters) use GPT-4 (ref. 1 ). At the time of writing, GPT-4 (ref. 1 ) did not expose output probabilities 41 or hidden states, although it does now. As a result, we propose a discrete approximation of our estimator for semantic entropy which allows us to run experiments without access to output probabilities, which we use for all GPT-4 results in this paper and which performs similarly well.

Our confabulation detection with semantic entropy is more robust to user inputs from previously unseen domains than methods which aim to ‘learn’ how to detect confabulations from a set of example demonstrations. Our method is unsupervised, meaning that we do not need labelled examples of confabulations. By contrast, supervised methods detect confabulations by learning patterns behind examples of confabulations, assuming that future questions preserve these patterns. But this assumption is often untrue in new situations or with confabulations that human overseers are unable to identify (compare Fig. 17 of ref. 24 ). As a strong supervised baseline, we compare to an embedding regression method inspired by ref. 24 which trains a logistic regression classifier to predict whether the model correctly answered a question on the basis of the final ‘embedding’ (hidden state) of the LLM. We also use the P (True) method 24 which looks at the probability with which an LLM predicts that the next token is ‘True’ when few-shot prompted to compare a main answer with ‘brainstormed’ alternatives.

Confabulations contribute substantially to incorrect answers given by language models. We show that semantic entropy can be used to predict many incorrect model answers and to improve question-answering accuracy by refusing to answer those questions the model is uncertain about. Corresponding to these two uses, we evaluate two main metrics. First, the widely used area under the receiver operating characteristic (AUROC) curve for the binary event that a given answer is incorrect. This measure captures both precision and recall and ranges from 0 to 1, with 1 representing a perfect classifier and 0.5 representing an un-informative classifier. We also show a new measure, the area under the ‘rejection accuracy’ curve (AURAC). This studies the case in which the confabulation detection score is used to refuse to answer the questions judged most likely to cause confabulations. Rejection accuracy is the accuracy of the answers of the model on the remaining questions and the area under this curve is a summary statistic over many thresholds (representative threshold accuracies are provided in Supplementary Material ). The AURAC captures the accuracy improvement which users would experience if semantic entropy was used to filter out questions causing the highest entropy.
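A hedged sketch of how these two metrics could be computed for a set of answered questions, given an uncertainty score and a 0/1 correctness label per question; this is a simplified reading of the definitions above, not the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def aurac(uncertainty, correct):
    """Area under the rejection-accuracy curve: accuracy on the questions that remain as
    those judged most likely to cause confabulations are refused, averaged over thresholds."""
    order = np.argsort(uncertainty)                     # most confident answers first
    correct = np.asarray(correct, dtype=float)[order]
    accuracies = [correct[:k].mean() for k in range(1, len(correct) + 1)]
    return float(np.mean(accuracies))

uncertainty = [0.1, 2.3, 0.4, 1.8]   # e.g. semantic entropy per question
correct     = [1,   0,   1,   0]     # 1 = the model's answer was right
auroc = roc_auc_score([1 - c for c in correct], uncertainty)  # event: "answer is incorrect"
print(auroc, aurac(uncertainty, correct))
```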

Detecting confabulations in QA and math

In Fig. 2 , we show that both semantic entropy and its discrete approximation outperform our best baselines for sentence-length generations. These results are averaged across datasets and provide the actual scores on the held-out evaluation dataset. We report the raw average score across held-out evaluation datasets without standard error because the distributional characteristics are more a property of the models and datasets selected than the method. Consistency of relative results across different datasets is a stronger indicator of variation in this case.

Fig. 2: Semantic entropy outperforms leading baselines and naive entropy. AUROC (scored on the y-axes) measures how well methods predict LLM mistakes, which correlate with confabulations. AURAC (likewise scored on the y-axes) measures the performance improvement of a system that refuses to answer questions which are judged likely to cause confabulations. Results are an average over five datasets, with individual metrics provided in the Supplementary Information.

Semantic entropy greatly outperforms the naive estimation of uncertainty using entropy: computing the entropy of the length-normalized joint probability of the token sequences. Naive entropy estimation ignores the fact that token probabilities also express the uncertainty of the model over phrasings that do not change the meaning of an output.

Our methods also outperform the supervised embedding regression method both in- and out-of-distribution. In pale-yellow bars we show that embedding regression performance deteriorates when its training data do not match the deployment distribution—which mirrors the common real-world case in which there is a distribution shift between training and deployment 42 —the plotted value is the average metric for embedding regression trained on one of the four ‘off-distribution’ datasets for that evaluation. This is critical because reliable uncertainty is most important when the data distribution shifts. Semantic entropy also outperforms P (True) which is supervised ‘in-context’; that is, it is adapted to the deployment task with a few training examples provided in the LLM prompt itself. The discrete variant of semantic entropy performs similarly to our standard estimator, despite not requiring exact output probabilities.

Averaged across the 30 combinations of tasks and models we study, semantic entropy achieves the best AUROC value of 0.790 whereas naive entropy (0.691), P (True) (0.698) and the embedding regression baseline (0.687) lag behind it. Semantic entropy performs well consistently, with stable performance (between 0.78 and 0.81 AUROC) across the different model families (LLaMA, Falcon and Mistral) and scales (from 7B to 70B parameters) which we study (we report summary statistics for each dataset and model as before). Although semantic entropy outperforms the baselines across all model sizes, P (True) seems to improve with model size, suggesting that it might become more competitive for very capable honest models in settings that the model understands well (which are, however, not the most important cases to have good uncertainty). We use ten generations to compute entropy, selected using analysis in Supplementary Fig. 2 . Further results for short-phrase generations are described in Supplementary Figs. 7 – 10 .

The results in Fig. 2 offer a lower bound on the effectiveness of semantic entropy at detecting confabulations. These evaluations determine whether semantic entropy and baseline methods can detect when the answers of the model are incorrect (which we validate against human correctness evaluations in Supplementary Table 4 ). In addition to errors from confabulations (arbitrary incorrectness), this also includes other types of mistakes for which semantic entropy is not suited, such as consistent errors learned from the training data. The fact that methods such as embedding regression are able to spot other kinds of errors, not just confabulations, but still are outperformed by semantic entropy, suggests that confabulations are a principal category of errors for actual generations.

Examples of questions and answers from TriviaQA, SQuAD and BioASQ, for LLaMA 2 Chat 70B, are shown in Table 1 . These illustrate how only semantic entropy detects when the meaning is constant but the form varies (the first row of the table) whereas semantic entropy and naive entropy both correctly predict the presence of confabulations when the form and meaning vary together (second row) and predict the absence of confabulations when the form and meaning are both constant across several resampled generations (third row). In the final row, we give an example in which semantic entropy is erroneously high as a result of overly sensitive semantic clustering relative to the reference answer. Our clustering method distinguishes the answers which provide a precise date from those which only provide a year. For some contexts that would have been correct but in this context the distinction between the specific day and the year is probably irrelevant. This highlights the importance of context and judgement in clustering, especially in subtle cases, as well as the shortcomings of evaluating against fixed reference answers which do not capture the open-ended flexibility of conversational deployments of LLMs.

Detecting confabulations in biographies

Semantic entropy is most natural for sentences that express a single proposition but the idea of semantic equivalence is trickier to apply to longer passages which express many propositions which might only agree partially 43 . Nevertheless, we can use semantic entropy to detect confabulations in longer generations, such as entire paragraphs of text. To show this, we develop a dataset of biographical generations from GPT-4 (v.0613) for 21 individuals notable enough to have their own Wikipedia page but without extensive online biographies. From each biography generated by GPT-4, we automatically extract propositional factual claims about the individual (150 factual claims in total), which we manually label as true or false.

Applying semantic entropy to this problem is challenging. Naively, one might simply regenerate each sentence (conditioned on the text so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next time profession. This is analogous to the original problem semantic entropy was designed to resolve: the model is uncertain about the right ordering of facts, not about the facts themselves. To address this, we break down the entire paragraph into factual claims and reconstruct questions which might have been answered by those claims. Only then do we apply semantic entropy (Fig. 1 ) by generating three new answers to each question (selected with analysis in Supplementary Figs. 3 and 4 ) and computing the semantic entropy over those generations plus the original factual claim. We aggregate these by averaging the semantic entropy over all the questions to get an uncertainty score for each proposition, which we use to detect confabulations. Unaggregated results are shown in Supplementary Figs. 5 and 6 .

As GPT-4 did not allow access to the probability of the generation at the time of writing, we use a discrete variant of semantic entropy which makes the further approximation that we can infer a discrete empirical distribution over semantic meaning clusters from only the generations ( Methods ). This allows us to compute semantic entropy using only the black-box outputs of an LLM. However, we were unable to compute the naive entropy baseline, the standard semantic entropy estimator or the embedding regression baseline for GPT-4 without output probabilities and embeddings.
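A minimal sketch of the discrete approximation described here, assuming each sampled answer has already been assigned to a meaning cluster: the probability of each cluster is estimated from counts alone, so no token probabilities are needed.

```python
from collections import Counter
from math import log

def discrete_semantic_entropy(cluster_ids):
    """cluster_ids: the meaning-cluster label of each of the M sampled answers."""
    m = len(cluster_ids)
    return -sum((n / m) * log(n / m) for n in Counter(cluster_ids).values())

# ten samples that all share one meaning -> entropy 0 (no confabulation signal)
print(discrete_semantic_entropy(["c1"] * 10))
# ten samples spread over five meanings -> high entropy (likely confabulation)
print(discrete_semantic_entropy(["c1", "c2", "c3", "c4", "c5"] * 2))
```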

In Fig. 3 we show that the discrete variant of semantic entropy effectively detects confabulations on this dataset. Its AUROC and AURAC are higher than either a simple ‘self-check’ baseline—which just asks the LLM whether the factoid is likely to be true—or a variant of P (True) which has been adapted to work for the paragraph-length setting. Discrete semantic entropy has better rejection accuracy performance until 20% of the questions have been rejected at which point P (True) has a narrow edge. This indicates that the questions predicted to cause confabulations are indeed more likely to be wrong.

Fig. 3: The discrete variant of our semantic entropy estimator outperforms baselines both when measured by AUROC and AURAC metrics (scored on the y-axis). The AUROC and AURAC are substantially higher than for both baselines. At above 80% of questions being answered, semantic entropy has the highest accuracy. Only when the top 20% of answers judged most likely to be confabulations are rejected does the answer accuracy on the remainder for the P (True) baseline exceed semantic entropy.

Our probabilistic approach, accounting for semantic equivalence, detects an important class of hallucinations: those that are caused by a lack of LLM knowledge. These are a substantial portion of the failures at present and will continue even as models grow in capabilities because situations and cases that humans cannot reliably supervise will persist. Confabulations are a particularly noteworthy failure mode for question answering but appear in other domains too. Semantic entropy needs no previous domain knowledge and we expect that algorithmic adaptations to other problems will allow similar advances in, for example, abstractive summarization. In addition, extensions to alternative input variations such as rephrasing or counterfactual scenarios would allow a similar method to act as a form of cross-examination 44 for scalable oversight through debate 45 .

The success of semantic entropy at detecting errors suggests that LLMs are even better at “knowing what they don’t know” than was argued by ref. 24 —they just don’t know they know what they don’t know. Our method explicitly does not directly address situations in which LLMs are confidently wrong because they have been trained with objectives that systematically produce dangerous behaviour, cause systematic reasoning errors or are systematically misleading the user. We believe that these represent different underlying mechanisms—despite similar ‘symptoms’—and need to be handled separately.

One exciting aspect of our approach is the way it makes use of classical probabilistic machine learning methods and adapts them to the unique properties of modern LLMs and free-form language generation. We hope to inspire a fruitful exchange of well-studied methods and emerging new problems by highlighting the importance of meaning when addressing language-based machine learning problems.

Semantic entropy as a strategy for overcoming confabulation builds on probabilistic tools for uncertainty estimation. It can be applied directly to any LLM or similar foundation model without requiring any modifications to the architecture. Our ‘discrete’ variant of semantic uncertainty can be applied even when the predicted probabilities for the generations are not available, for example, because access to the internals of the model is limited.

In this section we introduce background on probabilistic methods and uncertainty in machine learning, discuss how it applies to language models and then discuss our contribution, semantic entropy, in detail.

Uncertainty and machine learning

We aim to detect confabulations in LLMs, using the principle that the model will be uncertain about generations for which its output is going to be arbitrary.

One measure of uncertainty is the predictive entropy of the output distribution, which measures the information one has about the output given the input 25 . The predictive entropy (PE) for an input sentence x is the conditional entropy ( H ) of the output random variable Y with realization y given x , \({\rm{PE}}({\boldsymbol{x}})=H(Y| {\boldsymbol{x}})=-{\sum }_{{\bf{y}}}P({\bf{y}}| {\boldsymbol{x}})\log P({\bf{y}}| {\boldsymbol{x}})\).

A low predictive entropy indicates an output distribution which is heavily concentrated whereas a high predictive entropy indicates that many possible outputs are similarly likely.

Aleatoric and epistemic uncertainty

We do not distinguish between aleatoric and epistemic uncertainty in our analysis. Researchers sometimes separate aleatoric uncertainty (uncertainty in the underlying data distribution) from epistemic uncertainty (caused by having only limited information) 46 . Further advances in uncertainty estimation which separate these kinds of uncertainty would enhance the potential for our semantic uncertainty approach by allowing extensions beyond entropy.

Joint probabilities of sequences of tokens

Generative LLMs produce strings of text by selecting tokens in sequence. Each token is a wordpiece that often represents three or four characters (though especially common sequences and important words such as numbers typically get their own token). To compute entropies, we need access to the probabilities the LLM assigns to the generated sequence of tokens. The probability of the entire sequence, s , conditioned on the context, x , is the product of the conditional probabilities of new tokens given past tokens, whose resulting log-probability is \(\log P({\bf{s}}| {\boldsymbol{x}})={\sum }_{i}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\), where \({s}_{i}\) is the i th output token and \({{\bf{s}}}_{ < i}\) denotes the set of previous tokens.

Length normalization

When comparing the log-probabilities of generated sequences, we use ‘length normalization’, that is, we use an arithmetic mean log-probability, \(\frac{1}{N}{\sum }_{i}^{N}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , instead of the sum. In expectation, longer sequences have lower joint likelihoods because of the conditional independence of the token probabilities 47 . The joint likelihood of a sequence of length N shrinks exponentially in N . Its negative log-probability therefore grows linearly in N , so longer sentences tend to contribute more to entropy. We therefore interpret length-normalizing the log-probabilities when estimating the entropy as asserting that the expected uncertainty of generations is independent of sentence length. Length normalization has some empirical success 48 , including in our own preliminary experiments, but little theoretical justification in the literature.
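Concretely, length normalization replaces the sum in the previous sketch with an arithmetic mean; this is a minimal illustration of the idea, not the released implementation:

```python
from typing import List

def length_normalised_log_prob(token_log_probs: List[float]) -> float:
    """Arithmetic mean of the per-token conditional log-probabilities,
    so that longer generations are not down-weighted merely for
    containing more tokens."""
    return sum(token_log_probs) / len(token_log_probs)
```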

Principles of semantic uncertainty

If we naively calculate the predictive entropy directly from the probabilities of the generated sequence of tokens, we conflate the uncertainty of the model over the meaning of its answer with the uncertainty over the exact tokens used to express that meaning. For example, even if the model is confident in the meaning of a generation, there are still usually many different ways of phrasing that generation without changing its meaning. For the purposes of detecting confabulations, the uncertainty of the LLM over meanings is more important than the uncertainty over the exact tokens used to express those meanings.

Our semantic uncertainty method therefore seeks to estimate only the uncertainty the LLM has over the meaning of its generation, not the choice of words. To do this, we introduce an algorithm that clusters model generations by meaning and subsequently calculates semantic uncertainty. At a high level this involves three steps (a minimal code sketch follows the list):

Generation: sample output sequences of tokens from the predictive distribution of an LLM given a context x .

Clustering: cluster sequences by their meaning using our clustering algorithm based on bidirectional entailment.

Entropy estimation: estimate semantic entropy by summing the probabilities of sequences that share a meaning, following equation ( 2 ), and computing the entropy of the resulting distribution over meanings.
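The sketch below shows how these three steps fit together. The helpers `sample_answers`, `cluster_by_meaning` and `cluster_entropy` are hypothetical placeholders for the components described in the following subsections, not functions from the released code, and their signatures are illustrative:

```python
def semantic_entropy_for_context(context, sample_answers, cluster_by_meaning, cluster_entropy, m=10):
    """Orchestrates the three steps listed above."""
    answers = sample_answers(context, m)             # 1. generation
    clusters = cluster_by_meaning(context, answers)  # 2. clustering by bidirectional entailment
    return cluster_entropy(clusters)                 # 3. entropy over meanings
```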

Generating a set of answers from the model

Given some context x as input to the LLM, we sample M sequences, { s (1) , …,  s ( M ) } and record their token probabilities, { P ( s (1) ∣ x ), …,  P ( s ( M ) ∣ x )}. We sample all our generations from a single model, varying only the random seed used for sampling from the token probabilities. We do not observe the method to be particularly sensitive to details of the sampling scheme. In our implementation, we sample at temperature 1 using nucleus sampling ( P  = 0.9) (ref. 49 ) and top- K sampling ( K  = 50) (ref. 50 ). We also sample a single generation at low temperature (0.1) as an estimate of the ‘best generation’ of the model to the context, which we use to assess the accuracy of the model. (A lower sampling temperature increases the probability of sampling the most likely tokens).
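One way to reproduce this sampling set-up with the Hugging Face transformers library is sketched below. The model name is illustrative (Mistral 7B Instruct is one of the models listed later), the generation arguments mirror the temperature, nucleus and top-K settings described above, and recent library versions expose `compute_transition_scores` for recovering per-token log-probabilities; this is a sketch, not the released code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.1"  # illustrative choice of causal LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

prompt = ("Answer the following question in a single brief but complete sentence. "
          "Question: What is the capital of France? Answer:")
inputs = tok(prompt, return_tensors="pt").to(model.device)

# M = 10 high-temperature samples used for the entropy estimate.
out = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, top_k=50,
                     num_return_sequences=10, max_new_tokens=64,
                     return_dict_in_generate=True, output_scores=True)
texts = tok.batch_decode(out.sequences[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Per-token log-probabilities of the sampled continuations.
token_log_probs = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)

# One low-temperature generation used as the model's 'best' answer for accuracy assessment.
best = model.generate(**inputs, do_sample=True, temperature=0.1, top_p=0.9, top_k=50, max_new_tokens=64)
```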

Clustering by semantic equivalence

To estimate semantic entropy we need to cluster generated outputs from the model into groups of outputs that mean the same thing as each other.

This can be described using ‘semantic equivalence’, which is the relation that holds between two sentences when they mean the same thing. We can formalize semantic equivalence mathematically. Let the space of tokens in a language be \({\mathcal{T}}\) . The space of all possible sequences of tokens of length N is then \({{\mathcal{S}}}_{N}\equiv {{\mathcal{T}}}^{N}\) . Note that N can be made arbitrarily large to accommodate whatever size of sentence one can imagine, and one of the tokens can be a ‘padding’ token which occurs with certainty for each position after the end-of-sequence token. For some sentence \({\bf{s}}\in {{\mathcal{S}}}_{N}\) , composed of a sequence of tokens, \({s}_{i}\in {\mathcal{T}}\) , there is an associated meaning. Theories of meaning are contested 51 . However, for specific models and deployment contexts, many of these considerations can be set aside; care should be taken when comparing very different models and contexts.

Let us introduce a semantic equivalence relation, E (  ⋅  ,  ⋅  ), which holds for any two sentences that mean the same thing—we will operationalize this presently. Recall that an equivalence relation is any reflexive, symmetric and transitive relation and that any equivalence relation on a set corresponds to a set of equivalence classes. Each semantic equivalence class captures outputs that can be considered to express the same meaning. That is, for the space of semantic equivalence classes \({\mathcal{C}}\) the sentences in the set \(c\in {\mathcal{C}}\) can be regarded in many settings as expressing a similar meaning such that \(\forall {\bf{s}},{{\bf{s}}}^{{\prime} }\in c:E({\bf{s}},{{\bf{s}}}^{{\prime} })\) . So we can build up these classes of semantically equivalent sentences by checking if new sentences share a meaning with any sentences we have already clustered and, if so, adding them into that class.

We operationalize E (  ⋅  ,  ⋅  ) using the idea of bidirectional entailment, which has a long history in linguistics 52 and natural language processing 28 , 53 , 54 . A sequence, s , means the same thing as a second sequence, s ′, only if the sequences entail (that is, logically imply) each other. For example, ‘The capital of France is Paris’ entails ‘Paris is the capital of France’ and vice versa because they mean the same thing. (See later for a discussion of soft equivalence and cases in which bidirectional entailment does not guarantee equivalent meanings).

Importantly, we require that the sequences mean the same thing with respect to the context—key meaning is sometimes contained in the context. For example, ‘Paris’ does not entail ‘The capital of France is Paris’ because ‘Paris’ is not a declarative sentence without context. But in the context of the question ‘What is the capital of France?’, the one-word answer does entail the longer answer.

Detecting entailment has been the object of a great deal of research in natural language inference (NLI) 55 . We rely on language models to predict entailment, such as DeBERTa-Large-MNLI 56 , which has been trained to predict entailment, or general-purpose LLMs such as GPT-3.5 (ref. 57 ), which can predict entailment given suitable prompts.

We then cluster sentences according to whether they bidirectionally entail each other using the algorithm presented in Extended Data Fig. 1 . Note that, to check if a sequence should be added to an existing cluster, it is sufficient to check if the sequence bidirectionally entails any of the existing sequences in that cluster (we arbitrarily pick the first one), given the transitivity of semantic equivalence. If a sequence does not share meaning with any existing cluster, we assign it its own cluster.
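A compact version of this clustering loop is sketched below. `entails(context, premise, hypothesis)` is a hypothetical wrapper around whichever entailment predictor is used (DeBERTa or a prompted LLM); as described above, a new sequence is only compared against the first member of each cluster:

```python
def cluster_by_meaning(context, answers, entails):
    """Bidirectional entailment clustering (cf. Extended Data Fig. 1)."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(context, answer, representative) and entails(context, representative, answer):
                cluster.append(answer)
                break
        else:  # no existing cluster shares the meaning: start a new one
            clusters.append([answer])
    return clusters
```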

Computing the semantic entropy

Having determined the classes of generated sequences that mean the same thing, we can estimate the likelihood that a sequence generated by the LLM belongs to a given class by computing the sum of the probabilities of all the possible sequences of tokens which can be considered to express the same meaning, that is, \(P(c| {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}P({\bf{s}}| {\boldsymbol{x}})\) (equation ( 2 )).

Formally, this treats the output as a random variable whose event-space is the space of all possible meaning-classes, C , a sub- σ -algebra of the standard event-space S . We can then estimate the semantic entropy (SE) as the entropy over the meaning-distribution, \({\rm{SE}}({\boldsymbol{x}})=-{\sum }_{c}P(c| {\boldsymbol{x}})\log P(c| {\boldsymbol{x}})\) (equation ( 3 )).

There is a complication which prevents direct computation: we do not have access to every possible meaning-class c . Instead, we can only sample c from the sequence-generating distribution induced by the model. To handle this, we estimate the expectation in equation ( 3 ) using a Rao–Blackwellized Monte Carlo integration over the semantic equivalence classes C , \({\rm{SE}}({\boldsymbol{x}})\approx -{\sum }_{i=1}^{| C| }P({C}_{i}| {\boldsymbol{x}})\log P({C}_{i}| {\boldsymbol{x}})\) (equation ( 5 )),

where \(P({C}_{i}| {\boldsymbol{x}})=\frac{P({c}_{i}| {\boldsymbol{x}})}{{\sum }_{c}P(c| {\boldsymbol{x}})}\) estimates a categorical distribution over the cluster meanings, that is, ∑ i P ( C i ∣ x ) = 1. Without this normalization step cluster ‘probabilities’ could exceed one because of length normalization, resulting in degeneracies. Equation ( 5 ) is the estimator giving our main method that we refer to as semantic entropy throughout the text.
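The following sketch implements this estimator with numpy. The inputs are the length-normalised log-likelihoods of the sampled sequences, grouped by cluster (the names are ours, and the clustering step is assumed to have already been run):

```python
import numpy as np

def cluster_log_prob(member_log_probs):
    """log P(c | x): log of the summed (length-normalised) likelihoods of the
    sequences in one meaning cluster, as in equation (2)."""
    return np.logaddexp.reduce(member_log_probs)

def semantic_entropy(clustered_log_probs):
    """Entropy over the normalised meaning distribution, as in equation (5)."""
    log_p = np.array([cluster_log_prob(c) for c in clustered_log_probs])
    p = np.exp(log_p - np.logaddexp.reduce(log_p))  # P(C_i | x), normalised to sum to 1
    return float(-(p * np.log(p)).sum())

# Example: three clusters whose member sequences have these log-likelihoods.
print(semantic_entropy([[-1.2, -1.5], [-2.0], [-2.7]]))  # ≈ 0.77 nats
```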

For scenarios in which the sequence probabilities are not available, we propose a variant of semantic entropy which we call ‘discrete’ semantic entropy. Discrete semantic entropy approximates P ( C i ∣ x ) directly from the number of generations in each cluster, disregarding the token probabilities. That is, we approximate P ( C i ∣ x ) as \({\sum }_{1}^{M}\frac{{I}_{c={C}_{i}}}{M}\) , the proportion of all the sampled answers which belong to that cluster. Effectively, this just assumes that each output that was actually generated was equally probable—estimating the underlying distribution as the categorical empirical distribution. In the limit of large M the estimator converges to equation ( 5 ) by the law of large numbers. We find that discrete semantic entropy results in similar performance empirically.
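Discrete semantic entropy then needs only the cluster sizes. A minimal sketch:

```python
import numpy as np

def discrete_semantic_entropy(cluster_sizes):
    """'Discrete' semantic entropy: P(C_i | x) is estimated by the fraction of
    the M sampled answers that fell into cluster i, so no token probabilities
    are needed."""
    counts = np.asarray(cluster_sizes, dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# Example: 10 samples split 7/2/1 across three meaning clusters.
print(discrete_semantic_entropy([7, 2, 1]))  # ≈ 0.80 nats
```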

We provide a worked example of the computation of semantic entropy in Supplementary Note  1 .

Semantic entropy is designed to detect confabulations, that is, model outputs with arbitrary meaning. In our experiments, we use semantic uncertainty to predict model accuracy, demonstrating that confabulations make up a notable fraction of model mistakes. We further show that semantic uncertainty can be used to improve model accuracy by refusing to answer questions when semantic uncertainty is high. Last, semantic uncertainty can be used to give users a way to know when model generations are probably unreliable.

We use the datasets BioASQ 34 , SQuAD 33 , TriviaQA 32 , SVAMP 37 and NQ-Open 35 . BioASQ is a life-sciences question-answering dataset based on the annual challenge of the same name. The specific dataset we use is based on the QA dataset from Task B of the 2023 BioASQ challenge (11B). SQuAD is a reading comprehension dataset whose context passages are drawn from Wikipedia and for which the answers to questions can be found in these passages. We use SQuAD 1.1, which excludes the unanswerable questions added in v.2.0; those questions are deliberately constructed to induce mistakes and so do not, in practice, cause confabulations. TriviaQA is a trivia question-answering dataset. SVAMP is a word-problem maths dataset containing elementary-school mathematical reasoning tasks. NQ-Open is a dataset of realistic questions aggregated from Google Search which have been chosen to be answerable without reference to a source text. For each dataset, we use 400 train examples and 400 test examples randomly sampled from the original larger dataset. Note that only some of the methods require training; semantic entropy, for example, does not use the training data. If the datasets themselves are already split into train and test (or validation) samples, we sample our examples from within the corresponding split.

All these datasets are free-form, rather than multiple choice, because this better captures the opportunities created by LLMs to produce free-form sentences as answers. We refer to this default scenario as our ‘sentence-length’ experiments. In Supplementary Note  7 , we also present results for confabulation detection in a ‘short-phrase’ scenario, in which we constrain model answers on these datasets to be as concise as possible.

To make the problems more difficult and to induce confabulations, we do not provide the context passages for any of the datasets. When the context passages are provided, the accuracy of the latest generations of models on these datasets is too high to meaningfully study confabulations.

For sentence-length generations we use: Falcon 39 Instruct (7B and 40B), LLaMA 2 Chat 38 (7B, 13B and 70B) and Mistral 40 Instruct (7B).

In addition to reporting results for semantic entropy, discrete semantic entropy and naive entropy, we consider two strong baselines.

Embedding regression is a supervised baseline inspired by the P (IK) method 24 . In that paper, the authors fine-tune their proprietary LLM on a dataset of questions to predict whether the model would have been correct. This requires access to a dataset of ground-truth answers to the questions. Rather than fine-tuning the entire LLM in this way, we simply take the final hidden units of the LLM and train a logistic regression classifier to make the same prediction. In contrast to their method, this is much simpler because it does not require fine-tuning the entire language model, as well as being more reproducible because the solution to the logistic regression optimization problem is not as seed-dependent as the fine-tuning procedure. As expected, this supervised approach performs well in-distribution but fails when the distribution of questions is different from that on which the classifier is trained.
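A minimal sketch of this baseline using scikit-learn is shown below; `train_hidden` and `test_hidden` are assumed to hold the final-layer hidden states extracted from the LLM for each question, and `train_correct`/`test_correct` the ground-truth correctness labels (the names and helper are ours, not from the released code):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def embedding_regression_auroc(train_hidden, train_correct, test_hidden, test_correct):
    """Fit a logistic regression on the LLM's final hidden states to predict
    whether the model's answer will be correct, then report test AUROC."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_hidden, train_correct)
    confidence = clf.predict_proba(test_hidden)[:, 1]  # predicted probability of being correct
    return roc_auc_score(test_correct, confidence)
```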

The second baseline we consider is the P (True) method 24 , in which the model first samples M answers (identically to our semantic entropy approach) and then is prompted with the list of all answers generated followed by the highest probability answer and a question whether this answer is “(a) True” or “(b) False”. The confidence score is then taken to be the probability with which the LLM responds with ‘a’ to the multiple-choice question. The performance of this method is boosted with a few-shot prompt, in which up to 20 examples from the training set are randomly chosen, filled in as above, but then provided with the actual ground truth of whether the proposed answer was true or false. In this way, the method can be considered as supervised ‘in-context’ because it makes use of some ground-truth training labels but can be used without retraining the model. Because of context-size constraints, this method cannot fit a full 20 few-shot examples in the context when input questions are long or large numbers of generations are used. As a result, we sometimes have to reduce the number of few-shot examples to suit the context size and we note this in the  Supplementary Material .

Entailment estimator

Any NLI classification system could be used for our bidirectional entailment clustering algorithm. We consider two different kinds of entailment detector.

One option is to use an instruction-tuned LLM such as LLaMA 2, GPT-3.5 (Turbo 1106) or GPT-4 to predict entailment between generations. We use the following prompt:

We are evaluating answers to the question {question} Here are two possible answers: Possible Answer 1: {text1} Possible Answer 2: {text2} Does Possible Answer 1 semantically entail Possible Answer 2? Respond with entailment, contradiction, or neutral.

Alternatively, we consider using a language model trained for entailment prediction, specifically the DeBERTa-large model 56 fine-tuned on the NLI dataset MNLI 58 . This builds on past work towards paraphrase identification based on embedding similarity 59 , 60 and BERT-style models 61 , 62 . We template more simply, checking if DeBERTa predicts entailment between the concatenation of the question and one answer and the concatenation of the question and another answer. Note that DeBERTa-large is a relatively lightweight model with only 1.5B parameters which is much less powerful than most of the LLMs under study.
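A sketch of the DeBERTa-based check with the transformers library is shown below, using one publicly distributed MNLI checkpoint as an illustrative choice; the label index is read from the model configuration rather than hard-coded, and the concatenation of question and answer follows the templating described above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "microsoft/deberta-large-mnli"  # illustrative public NLI checkpoint
nli_tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

def predicts_entailment(question, answer_a, answer_b):
    """True if 'question + answer_a' is predicted to entail 'question + answer_b'."""
    inputs = nli_tokenizer(f"{question} {answer_a}", f"{question} {answer_b}", return_tensors="pt")
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    label = nli_model.config.id2label[int(logits.argmax())]
    return label.lower() == "entailment"
```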

In Supplementary Note 2 , we carefully evaluate the benefits and drawbacks of these methods for entailment prediction. We settle on using GPT-3.5 with the above prompt, as its entailment predictions agree well with human raters and lead to good confabulation detection performance.

In Supplementary Note  3 , we provide a discussion of the computational cost and choosing the number of generations for reliable clustering.

Prompting templates

We use a simple generation template for all sentence-length answer datasets:

Answer the following question in a single brief but complete sentence. Question: {question} Answer:

Metrics and accuracy measurements

We use three main metrics to evaluate our method: AUROC (area under the receiver operating characteristic curve), rejection accuracy and AURAC (area under the rejection accuracy curve). Each of these is grounded in an automated assessment of factual accuracy relative to the reference answers provided by the datasets that we use.

AUROC, rejection accuracy and AURAC

First, we use the AUROC, which measures the reliability of a classifier by trading off the true positive rate against the false positive rate across all confidence thresholds. The AUROC can be interpreted as the probability that a randomly chosen correct answer has been assigned a higher confidence score than a randomly chosen incorrect answer. For a perfect classifier, this is 1.

Second, we compute the ‘rejection accuracy at X %’, which is the question-answering accuracy of the model on the most-confident X % of the inputs as identified by the respective uncertainty method. If an uncertainty method works well, predictions on the confident subset should be more accurate than predictions on the excluded subset and the rejection accuracy should increase as we reject more inputs.

To summarize this statistic we compute the AURAC: the total area enclosed by the accuracies at all cut-off percentages X %. This should increase towards 1 as a given uncertainty method becomes more accurate and better at detecting likely-inaccurate responses, but it is more sensitive to the overall accuracy of the model than the AUROC metric.
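The two summary metrics can be computed along the following lines; this is a sketch with an evenly spaced grid of cut-offs, and the exact grid behind the reported numbers may differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rejection_accuracy_curve(confidence, correct, steps=20):
    """Accuracy on the most-confident X% of questions, for X on an even grid."""
    confidence, correct = np.asarray(confidence), np.asarray(correct, dtype=float)
    order = np.argsort(-confidence)  # most confident first
    fractions = np.linspace(1.0 / steps, 1.0, steps)
    return [correct[order[: max(1, int(round(f * len(order))))]].mean() for f in fractions]

def aurac(confidence, correct, steps=20):
    return float(np.mean(rejection_accuracy_curve(confidence, correct, steps)))

# AUROC uses correctness as the label and the confidence score (for example,
# negative semantic entropy) as the classifier output:
# auroc = roc_auc_score(correct, confidence)
```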

In Supplementary Note  5 , we provide the unaggregated rejection accuracies for sentence-length generations.

Assessing accuracy

For the short-phrase-length generation setting presented in Supplementary Note  7 , we simply assess the accuracy of the generations by checking if the F1 score of the commonly used SQuAD metric exceeds 0.5. There are limitations to such simple scoring rules 63 but this method is widely used in practice and its error is comparatively small on these standard datasets.
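For reference, a simplified version of that token-overlap F1 check (without the SQuAD script's article and punctuation normalisation) looks like this:

```python
from collections import Counter

def squad_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, simplified from the SQuAD evaluation script."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def is_correct_short_phrase(prediction, reference, threshold=0.5):
    return squad_f1(prediction, reference) > threshold
```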

For our default scenario, the longer sentence-length generations, this measure fails, as the overlap between the short reference answer and our long model answer is invariably too small. For sentence-length generations, we therefore automatically determine whether an answer to the question is correct or incorrect by using GPT-4 to compare the given answer to the reference answer. We use the template:

We are assessing the quality of answers to the following question: {question} The expected answer is: {reference answer} The proposed answer is: {predicted answer} Within the context of the question, does the proposed answer mean the same as the expected answer? Respond only with yes or no.

We make a small modification for datasets with several reference answers: line two becomes “The following are expected answers to this question:” and the final line asks “does the proposed answer mean the same as any of the expected answers?”.

In Supplementary Note 6 , we check the quality of our automated ground-truth evaluations against human judgement by hand. We find that GPT-4 gives the best results for determining model accuracy and thus use it in all our sentence-length experiments.

In this section we describe the application of semantic entropy to confabulation detection in longer model generations, specifically paragraph-length biographies.

We introduce a biography-generation dataset—FactualBio—available alongside this paper. FactualBio is a collection of biographies of individuals who are notable enough to have Wikipedia pages but not notable enough to have large amounts of detailed coverage, generated by GPT-4 (v.0613). To generate the dataset, we randomly sampled 21 individuals from the WikiBio dataset 64 . For each biography, we used GPT-4 to generate a list of the factual claims it contains, giving 150 factual claims in total (the total is only coincidentally a round number). For each of these factual claims, we manually determined whether the claim was correct or incorrect. Out of 150 claims, 45 were incorrect. As before, we apply confabulation detection to detect incorrect model predictions, even though there may be model errors which are not confabulations.

Prompting and generation

Given a paragraph-length piece of LLM-generated text, we apply the following sequence of steps (a minimal code sketch follows the list):

Automatically decompose the paragraph into specific factual claims using an LLM (not necessarily the same as the original).

For each factual claim, use an LLM to automatically construct Q questions which might have produced that claim.

For each question, prompt the original LLM to generate M answers.

For each question, compute the semantic entropy of the answers, including the original factual claim.

Average the semantic entropies over the questions to arrive at a score for the original factual claim.
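A sketch of this pipeline is given below. Every helper (`decompose`, `make_questions`, `sample_answers`, `semantic_entropy_of`) is a hypothetical wrapper around the prompts and the sentence-level entropy estimator described elsewhere in this section, not part of the released code:

```python
def paragraph_claim_scores(paragraph, decompose, make_questions, sample_answers,
                           semantic_entropy_of, num_questions=6, m=3):
    """Average semantic entropy per factual claim in a paragraph."""
    scores = {}
    for claim in decompose(paragraph):                                     # step 1
        entropies = []
        for question in make_questions(paragraph, claim, num_questions):  # step 2
            answers = sample_answers(question, m)                          # step 3
            answers.append(claim)                  # include the original factual claim
            entropies.append(semantic_entropy_of(question, answers))       # step 4
        scores[claim] = sum(entropies) / len(entropies)                    # step 5
    return scores
```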

We pursue this slightly indirect way of generating answers because we find that simply resampling each sentence creates variation unrelated to the uncertainty of the model about the factual claim, such as differences in paragraph structure.

We decompose the paragraph into factual claims using the following prompt:

Please list the specific factual propositions included in the answer above. Be complete and do not leave any factual claims out. Provide each claim as a separate sentence in a separate bullet point.

We found that we agreed with the decompositions in all cases in the dataset.

We then generate six questions for each of the facts from the decomposition. We generate these questions by prompting the model twice with the following:

Following this text: {text so far} You see the sentence: {proposition} Generate a list of three questions, that might have generated the sentence in the context of the preceding original text, as well as their answers. Please do not use specific facts that appear in the follow-up sentence when formulating the question. Make the questions and answers diverse. Avoid yes-no questions. The answers should not be a full sentence and as short as possible, e.g. only a name, place, or thing. Use the format “1. {question} – {answer}”.

These questions are not necessarily well-targeted and the difficulty of this step is the main source of errors in the procedure. We generate three questions with each prompt, as this encourages diversity of the questions, each question targeting a different aspect of the fact. However, we observed that the generated questions will sometimes miss obvious aspects of the fact. Executing the above prompt twice (for a total of six questions) can improve coverage. We also ask for brief answers because the current version of GPT-4 tends to give long, convoluted and highly hedged answers unless explicitly told not to.

Then, for each question, we generate three new answers using the following prompt:

We are writing an answer to the question “{user question}”. So far we have written: {text so far} The next sentence should be the answer to the following question: {question} Please answer this question. Do not answer in a full sentence. Answer with as few words as possible, e.g. only a name, place, or thing.

We then compute the semantic entropy over these answers plus the original factual claim. Including the original fact ensures that the estimator remains grounded in the original claim and helps detect situations in which the question has been interpreted completely differently from the original context. We make a small modification to handle the fact that GPT-4 generations often include refusals to answer questions; we did not commonly observe such refusals in our experiments with the LLaMA 2, Falcon or Mistral models. If more than half of the answers include one of the strings ‘not available’, ‘not provided’, ‘unknown’ or ‘unclear’, we treat the semantic uncertainty as maximal.

We then average the semantic entropies for each question corresponding to the factual claim to get an entropy for this factual claim.

Despite the extra assumptions and complexity, we find that this method greatly outperforms the baselines.

To compute semantic entailment between the original claim and regenerated answers, we rely on the DeBERTa entailment prediction model as we find empirically that DeBERTa predictions result in higher train-set AUROC than other methods. Because DeBERTa has slightly lower recall than GPT-3.5/4, we use a modified set-up for which we say the answers mean the same as each other if at least one of them entails the other and neither is seen to contradict the other—a kind of ‘non-defeating’ bidirectional entailment check rather than true bidirectional entailment. The good performance of DeBERTa in this scenario is not surprising as both factual claims and regenerated answers are relatively short. We refer to Supplementary Notes 2 and 3 for ablations and experiments regarding our choice of entailment estimator for paragraph-length generations.
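The relaxed check can be written as follows, where `nli_label(premise, hypothesis)` is a hypothetical wrapper returning 'entailment', 'neutral' or 'contradiction' (for example from the DeBERTa model sketched earlier):

```python
def non_defeating_equivalence(nli_label, text_a, text_b):
    """Texts share a meaning if at least one entails the other and neither
    is predicted to contradict the other."""
    forward, backward = nli_label(text_a, text_b), nli_label(text_b, text_a)
    one_entails = "entailment" in (forward, backward)
    no_contradiction = "contradiction" not in (forward, backward)
    return one_entails and no_contradiction
```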

We implement two baselines. First, we implement a variant of the P (True) method, which is adapted to the new setting. For each factoid, we generate a question with answers in the same way as for semantic entropy. We then use the following prompt:

Question: {question} Here are some brainstormed ideas: {list of regenerated answers} Possible answer: {original answer} Is the possible answer true? Respond with “yes” or “no”.

As we cannot access the probabilities GPT-4 assigns to predicting ‘yes’ and ‘no’ as the next token, we approximate this using Monte Carlo samples. Concretely, we execute the above prompt ten times (at temperature 1) and then take the fraction of answers which were ‘yes’ as our unbiased Monte Carlo estimate of the token probability GPT-4 assigns to ‘yes’.
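In code, this Monte Carlo approximation is just the empirical fraction of 'yes' responses; `ask_model` below is a hypothetical wrapper around the chat API:

```python
def monte_carlo_p_true(ask_model, prompt, n_samples=10):
    """Approximate the probability of a 'yes' answer when token probabilities
    are not exposed: run the prompt repeatedly at temperature 1 and take the
    empirical fraction of 'yes' responses."""
    answers = [ask_model(prompt, temperature=1.0) for _ in range(n_samples)]
    return sum(a.strip().lower().startswith("yes") for a in answers) / n_samples
```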

As a second, simpler, baseline we check if the model thinks the answer is true. We simply ask:

Following this text: {text so far} You see this statement: {proposition} Is it likely that the statement is true? Respond with ‘yes’ or ‘no’.

Interestingly, this method ought to perform very well if the model has good ‘self-knowledge’ (that is, if “models mostly know what they don’t know” 24 ), but in fact semantic entropy is much better at detecting confabulations.

Data availability

The data used for the short-phrase and sentence-length generations are publicly available and the released code details how to access it. We release a public version of the FactualBio dataset as part of the code base for reproducing the paragraph-length experiments.

Code availability

We release all code used to produce the main experiments. The code for short-phrase and sentence-length experiments can be found at github.com/jlko/semantic_uncertainty and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ). The code for paragraph-length experiments can be found at github.com/jlko/long_hallucinations and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ).

GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

Xiao, Y. & Wang, W. Y. On hallucination and predictive uncertainty in conditional language generation. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics 2734–2744 (Association for Computational Linguistics, 2021).

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E., Chiang, D., Hockenmaier, J. & Tsujii, J.) 4035–4045 (Association for Computational Linguistics, 2018).

Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times (8 Jun 2023).

Opdahl, A. L. et al. Trustworthy journalism through AI. Data Knowl. Eng . 146 , 102182 (2023).

Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307 , e230163 (2023).


Schulman, J. Reinforcement learning from human feedback: progress and challenges. Presented at the Berkeley EECS Colloquium. YouTube www.youtube.com/watch?v=hhiLw5Q_UFg (2023).

Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55 , 248 (2023).

Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 1906–1919 (Association for Computational Linguistics, 2020).

Filippova, K. Controlled hallucinations: learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 864–870 (Association for Computational Linguistics, 2020).

Berrios, G. Confabulations: a conceptual history. J. Hist. Neurosci. 7 , 225–241 (1998).


Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Transact. Mach. Learn. Res. (2022).

Evans, O. et al. Truthful AI: developing and governing AI that does not lie. Preprint at https://arxiv.org/abs/2110.06674 (2021).

Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).

Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? On the calibration of language models for question answering. Transact. Assoc. Comput. Linguist. 9 , 962–977 (2021).


Desai, S. & Durrett, G. Calibration of pre-trained transformers. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 295–302 (Association for Computational Linguistics, 2020).

Glushkova, T., Zerva, C., Rei, R. & Martins, A. F. Uncertainty-aware machine translation evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, M-F., Huang, X., Specia, L. & Yih, S.) 3920–3938 (Association for Computational Linguistics, 2021).

Wang, Y., Beck, D., Baldwin, T. & Verspoor, K. Uncertainty estimation and reduction of pre-trained models for text regression. Transact. Assoc. Comput. Linguist. 10 , 680–696 (2022).

Baker, S. & Kanade, T. Hallucinating faces. In Proc. Fourth IEEE International Conference on Automatic Face and Gesture Recognition . 83–88 (IEEE, Catalogue no PR00580, 2002).

Eliot, L. AI ethics lucidly questioning this whole hallucinating AI popularized trend that has got to stop. Forbes Magazine (24 August 2022).

Shanahan, M. Talking about large language models. Commun. Assoc. Comp. Machinery 67 , 68–79 (2024).

MacKay, D. J. C. Information-based objective functions for active data selection. Neural Comput. 4 , 590–604 (1992).

Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221 (2022).

Lindley, D. V. On a measure of the information provided by an experiment. Ann. Math. Stat. 27 , 986–1005 (1956).


Xiao, T. Z., Gomez, A. N. & Gal, Y. Wat zei je? Detecting out-of-distribution translations with variational transformers. In Workshop on Bayesian Deep Learning at the Conference on Neural Information Processing Systems (NeurIPS, Vancouver, 2019).

Christiano, P., Cotra, A. & Xu, M. Eliciting Latent Knowledge (Alignment Research Center, 2021); https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit .

Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D. & Marchetti, A. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proc. 2011 Conference on Empirical Methods in Natural Language Processing 670–679 (Association for Computational Linguistics, 2011).

Honovich, O. et al. TRUE: Re-evaluating factual consistency evaluation. In Proc. Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering 161–175 (Association for Computational Linguistics, 2022).

Falke, T., Ribeiro, L. F. R., Utama, P. A., Dagan, I. & Gurevych, I. Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2214–2220 (Association for Computational Linguistics, 2019).

Laban, P., Schnabel, T., Bennett, P. N. & Hearst, M. A. SummaC: re-visiting NLI-based models for inconsistency detection in summarization. Trans. Assoc. Comput. Linguist. 10 , 163–177 (2022).

Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics 1601–1611 (Association for Computational Linguistics, 2017).

Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J., Duh, K. & Carreras, X.) 2383–2392 (Association for Computational Linguistics, 2016).

Tsatsaronis, G. et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16 , 138 (2015).


Lee, K., Chang, M.-W. & Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 6086–6096 (Association for Computational Linguistics, 2019).

Kwiatkowski, T. et al. Natural questions: a benchmark for question answering research. Transact. Assoc. Comput. Linguist. 7 , 452–466 (2019).

Patel, A., Bhattamishra, S. & Goyal, N. Are NLP models really able to solve simple math word problems? In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 2080–2094 (Assoc. Comp. Linguistics, 2021).

Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

Penedo, G. et al. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In Proc. 36th Conference on Neural Information Processing Systems (eds Oh, A. et al.) 79155–79172 (Curran Associates, 2023)

Jiang, A. Q. et al. Mistral 7B. Preprint at https://arxiv.org/abs/2310.06825 (2023).

Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: Zero-Resource Black-Box hallucination detection for generative large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 9004–9017 (Assoc. Comp. Linguistics, 2023).

Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. & Gal, Y. Deep deterministic uncertainty: a new simple baseline. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 24384–24394 (Computer Vision Foundation, 2023).

Schuster, T., Chen, S., Buthpitiya, S., Fabrikant, A. & Metzler, D. Stretching sentence-pair NLI models to reason over long documents and clusters. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 394–412 (Association for Computational Linguistics, 2022).

Barnes, B. & Christiano, P. Progress on AI Safety via Debate. AI Alignment Forum www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1 (2020).

Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at https://arxiv.org/abs/1805.00899 (2018).

Der Kiureghian, A. & Ditlevsen, O. Aleatory or epistemic? Does it matter? Struct. Saf. 31 , 105–112 (2009).

Malinin, A. & Gales, M. Uncertainty estimation in autoregressive structured prediction. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=jN5y-zb5Q7m (2021).

Murray, K. & Chiang, D. Correcting length bias in neural machine translation. In Proc. Third Conference on Machine Translation (eds Bojar, O. et al.) 212–223 (Assoc. Comp. Linguistics, 2018).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=rygGQyrFvH (2020).

Fan, A., Lewis, M. & Dauphin, Y. Hierarchical neural story generation. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 889–898 (Association for Computational Linguistics, 2018).

Speaks, J. in The Stanford Encyclopedia of Philosophy (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford Univ., 2021).

Culicover, P. W. Paraphrase generation and information retrieval from stored text. Mech. Transl. Comput. Linguist. 11 , 78–88 (1968).


Padó, S., Cer, D., Galley, M., Jurafsky, D. & Manning, C. D. Measuring machine translation quality as semantic equivalence: a metric based on entailment features. Mach. Transl. 23 , 181–193 (2009).

Androutsopoulos, I. & Malakasiotis, P. A survey of paraphrasing and textual entailment methods. J. Artif. Intell. Res. 38 , 135–187 (2010).

MacCartney, B. Natural Language Inference (Stanford Univ., 2009).

He, P., Liu, X., Gao, J. & Chen, W. Deberta: decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations https://openreview.net/forum?id=XPZIaotutsD (2021).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33 , 1877–1901 (2020).

Williams, A., Nangia, N. & Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Walker, M. et al.) 1112–1122 (Assoc. Comp. Linguistics, 2018).

Yu, L., Hermann, K. M., Blunsom, P. & Pulman, S. Deep learning for answer sentence selection. Preprint at https://arxiv.org/abs/1412.1632 (2014).

Socher, R., Huang, E., Pennin, J., Manning, C. D. & Ng, A. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th Conference on Neural Information Processing Systems (eds Shawe-Taylor, J. et al.) (2011)

He, R., Ravula, A., Kanagal, B. & Ainslie, J. Realformer: Transformer likes residual attention. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (eds Zhong, C., et al.) 929–943 (Assoc. Comp. Linguistics, 2021).

Tay, Y. et al. Charformer: fast character transformers via gradient-based subword tokenization. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=JtBRnrlOEFN (2022).

Kane, H., Kocyigit, Y., Abdalla, A., Ajanoh, P. & Coulibali, M. Towards neural similarity evaluators. In Workshop on Document Intelligence at the 32nd conference on Neural Information Processing (2019).

Lebret, R., Grangier, D. & Auli, M. Neural text generation from structured data with application to the biography domain. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 1203–1213 (Association for Computational Linguistics, 2016).

Kossen, J., jlko/semantic_uncertainty: Initial release v.1.0.0. Zenodo https://doi.org/10.5281/zenodo.10964366 (2024).


Acknowledgements

We thank G. Irving, K. Perlin, J. Richens, L. Rimell and M. Turpin for their comments or discussion related to this work. We thank K. Handa for his help with the human evaluation of our automated accuracy assessment. We thank F. Bickford Smith and L. Melo for their code review. Y.G. is supported by a Turing AI Fellowship funded by the UK government’s Office for AI, through UK Research and Innovation (grant reference EP/V030302/1), and delivered by the Alan Turing Institute.

Author information

These authors contributed equally: Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn

Authors and Affiliations

OATML, Department of Computer Science, University of Oxford, Oxford, UK

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn & Yarin Gal


Contributions

S.F. led the work from conception to completion and proposed using bidirectional entailment to cluster generations as a way of computing entropy in LLMs. He wrote the main text, most of the Methods and Supplementary Information and prepared most of the figures. J.K. improved the mathematical formalization of semantic entropy; led the extension of semantic entropy to sentence- and paragraph-length generations; wrote the code for, and carried out, all the experiments and evaluations; wrote much of the Methods and Supplementary Information and prepared drafts of many figures; and gave critical feedback on the main text. L.K. developed the initial mathematical formalization of semantic entropy; wrote code for, and carried out, the initial experiments around semantic entropy and its variants which demonstrated the promise of the idea and helped narrow down possible research avenues to explore; and gave critical feedback on the main text. Y.G. ideated the project, proposing the idea to differentiate semantic and syntactic diversity as a tool for detecting hallucinations, provided high-level guidance on the research and gave critical feedback on the main text; he runs the research laboratory in which the work was carried out.

Corresponding author

Correspondence to Sebastian Farquhar .

Ethics declarations

Competing interests.

S.F. is currently employed by Google DeepMind and L.K. by OpenAI. For both, this paper was written under their University of Oxford affiliation. The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Mirella Lapata and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1: Algorithm outline for bidirectional entailment clustering.

Given a set of outputs in response to a context, the bidirectional entailment algorithm returns a set of sets of outputs which have been classified as sharing a meaning.

Supplementary information


Supplementary Notes 1–7, Figs. 1–10, Tables 1–4 and references. Includes a worked example of the semantic entropy calculation, a discussion of the limitations and computational cost of entailment clustering, ablations of the entailment prediction and clustering methods, a discussion of the automated accuracy assessment, unaggregated results for sentence-length generations and further results for short-phrase generations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Farquhar, S., Kossen, J., Kuhn, L. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630 , 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0


Received : 17 July 2023

Accepted : 12 April 2024

Published : 19 June 2024

Issue Date : 20 June 2024

DOI : https://doi.org/10.1038/s41586-024-07421-0




US professor charged with manipulating data for Alzheimer’s drug trial


Reporting by Luc Cohen in New York and Marisa Taylor in Washington; Editing by Daniel Wallis



The top list of research databases for medicine and healthcare


Web of Science and Scopus are interdisciplinary research databases and have a broad scope. For biomedical research, medicine, and healthcare there are a couple of outstanding academic databases that provide true value in your daily research.

Scholarly databases can help you find scientific articles, research papers , conference proceedings, reviews and much more. We have compiled a list of the top 5 research databases with a special focus on healthcare and medicine.

PubMed is the number one source for medical and healthcare research. It is hosted by the National Institutes of Health (NIH) and provides bibliographic information including abstracts and links to the full text publisher websites for more than 28 million articles.

  • Coverage: around 35 million items
  • Abstracts: ✔
  • Related articles: ✔
  • References: ✘
  • Cited by: ✘
  • Links to full text: ✔
  • Export formats: XML, NBIB

Search interface of PubMed

Pro tip: Use a reference manager like Paperpile to keep track of all your sources. Paperpile integrates with PubMed and many popular databases. You can save references and PDFs directly to your library using the Paperpile buttons and later cite them in thousands of citation styles:


EMBASE (Excerpta Medica Database) is a proprietary research database whose corpus also covers the records indexed in PubMed. It can also be accessed through other database providers such as Ovid .

  • Coverage: 38 million articles
  • References: ✔
  • Cited by: ✔
  • Full text: ✔ (requires institutional subscription to EMBASE and individual publishers)
  • Export formats: RIS

Search interface of Embase

The Cochrane Library is best known for its systematic reviews. There are 53 review groups around the world that ensure the published reviews are of high quality and evidence based. Articles are updated over time to reflect new research.

  • Coverage: several thousand high quality reviews
  • Full text: ✔
  • Export formats: RIS, BibTeX

Search interface of the Cochrane Library

PubMed Central is the free, open access branch of PubMed. It includes full-text versions for all indexed papers. You might also want to check out its sister site Europe PMC .

  • Coverage: more than 8 million articles
  • Export formats: APA, MLA, AMA, RIS, NBIB

Search interface of PMC

Like the Cochrane Library, UpToDate provides detailed reviews for clinical topics. Reviews are constantly updated to provide an up-to-date view.

  • Coverage: several thousand articles from over 420 peer-reviewed journals
  • Related articles: ✘
  • Full text: ✔ (requires institutional subscription)
  • Export formats: ✘

Search interface of UpToDate



Trends in cardiovascular disease incidence among 22 million people in the UK over 20 years: population based study

  • Geert Molenberghs , professor 4 ,
  • Geert Verbeke , professor 4 ,
  • Francesco Zaccardi , associate professor 5 ,
  • Claire Lawson , associate professor 5 ,
  • Jocelyn M Friday , data scientist 1 ,
  • Huimin Su , PhD student 2 ,
  • Pardeep S Jhund , professor 1 ,
  • Naveed Sattar , professor 6 ,
  • Kazem Rahimi , professor 3 ,
  • John G Cleland , professor 1 ,
  • Kamlesh Khunti , professor 5 ,
  • Werner Budts , professor 1 7 ,
  • John J V McMurray , professor 1
  • 1 School of Cardiovascular and Metabolic Health, British Heart Foundation Cardiovascular Research Centre, University of Glasgow, Glasgow, UK
  • 2 Department of Cardiovascular Sciences, KU Leuven, Leuven, Belgium
  • 3 Deep Medicine, Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
  • 4 Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat), Hasselt University and KU Leuven, Belgium
  • 5 Leicester Real World Evidence Unit, Diabetes Research Centre, University of Leicester, Leicester, UK
  • 6 College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK
  • 7 Congenital and Structural Cardiology, University Hospitals Leuven, Belgium
  • Correspondence to: N Conrad nathalie.conrad{at}kuleuven.be (or @nathalie_conrad on X)
  • Accepted 1 May 2024

Objective To investigate the incidence of cardiovascular disease (CVD) overall and by age, sex, and socioeconomic status, and its variation over time, in the UK during 2000-19.

Design Population based study.

Setting UK.

Participants 1 650 052 individuals registered with a general practice contributing to Clinical Practice Research Datalink and newly diagnosed with at least one CVD from 1 January 2000 to 30 June 2019.

Main outcome measures The primary outcome was incident diagnosis of CVD, comprising acute coronary syndrome, aortic aneurysm, aortic stenosis, atrial fibrillation or flutter, chronic ischaemic heart disease, heart failure, peripheral artery disease, second or third degree heart block, stroke (ischaemic, haemorrhagic, and unspecified), and venous thromboembolism (deep vein thrombosis or pulmonary embolism). Disease incidence rates were calculated individually and as a composite outcome of all 10 CVDs combined and were standardised for age and sex using the 2013 European standard population. Negative binomial regression models investigated temporal trends and variation by age, sex, and socioeconomic status.

Results The mean age of the population was 70.5 years and 47.6% (n=784 904) were women. The age and sex standardised incidence of all 10 prespecified CVDs declined by 19% during 2000-19 (incidence rate ratio 2017-19 v 2000-02: 0.80, 95% confidence interval 0.73 to 0.88). The incidence of coronary heart disease and stroke decreased by about 30% (incidence rate ratios for acute coronary syndrome, chronic ischaemic heart disease, and stroke were 0.70 (0.69 to 0.70), 0.67 (0.66 to 0.67), and 0.75 (0.67 to 0.83), respectively). In parallel, an increasing number of diagnoses of cardiac arrhythmias, valve disease, and thromboembolic diseases were observed. As a result, the overall incidence of CVDs across the 10 conditions remained relatively stable from the mid-2000s. Age stratified analyses further showed that the observed decline in coronary heart disease incidence was largely restricted to age groups older than 60 years, with little or no improvement in younger age groups. Trends were generally similar between men and women. A socioeconomic gradient was observed for almost every CVD investigated. The gradient did not decrease over time and was most noticeable for peripheral artery disease (incidence rate ratio most deprived v least deprived: 1.98 (1.87 to 2.09)), acute coronary syndrome (1.55 (1.54 to 1.57)), and heart failure (1.50 (1.41 to 1.59)).

Conclusions Despite substantial improvements in the prevention of atherosclerotic diseases in the UK, the overall burden of CVDs remained high during 2000-19. For CVDs to decrease further, future prevention strategies might need to consider a broader spectrum of conditions, including arrhythmias, valve diseases, and thromboembolism, and examine the specific needs of younger age groups and socioeconomically deprived populations.

Introduction

Since the 1970s, the prevention of coronary disease, both primary and secondary, has improved considerably, largely attributable to public health efforts to control risk factors, such as antismoking legislation, and the widespread use of drugs such as statins. 1 2

Improvements in mortality due to heart disease have, however, stalled in several high income countries, 3 and reports suggest that the incidence of heart disease might even be increasing among younger people. 4 5 6 Conversely, along with coronary heart disease, other cardiovascular conditions are becoming relatively more prominent in older people, altering the profile of cardiovascular disease (CVD) in ageing societies. The importance of non-traditional risk factors for atherosclerotic diseases, such as socioeconomic deprivation, has also been increasingly recognised. Whether socioeconomic deprivation is as strongly associated with other CVDs as with atherosclerosis is uncertain, but it is important to understand as many countries have reported an increase in socioeconomic inequalities. 7

Large scale epidemiological studies are therefore needed to investigate secular trends in CVDs to target future preventive efforts, highlight the focus for future clinical trials, and identify healthcare resources required to manage emerging problems. Existing comprehensive efforts, such as statistics on CVD from leading medical societies or the Global Burden of Diseases studies, have helped toward this goal, but reliable age standardised incidence rates for all CVDs, how these vary by population subgroups, and changes over time are currently not available. 8 9 10

We used a large longitudinal database of linked primary care, secondary care, and death registry records from a representative sample of the UK population 11 12 to assess trends in the incidence of 10 of the most common CVDs in the UK during 2000-19, and how these differed by sex, age, socioeconomic status, and region.

Data source and study population

We used anonymised electronic health records from the GOLD and AURUM datasets of Clinical Practice Research Datalink (CPRD). CPRD contains information on about 20% of the UK population and is broadly representative of age, sex, ethnicity, geographical spread, and socioeconomic deprivation. 11 12 It is also one of the largest databases of longitudinal medical records from primary care in the world and has been validated for epidemiological research for a wide range of conditions. 11 We used the subset of CPRD records that linked information from primary care, secondary care from Hospital Episodes Statistics (HES admitted patient care and HES outpatient) data, and death certificates from the Office for National Statistics (ONS). Linkage was possible for a subset of English practices, covering about 50% of the CPRD records. Data coverage dates were 1 January 1985 to 31 December 2019 for primary care data (including drug prescription data), 1 April 1997 to 30 June 2019 for secondary care data, and 2 January 1998 to 30 May 2019 for death certificates.

Included in the study were men and women registered with a general practice for at least one year during the study period (1 January 2000 to 30 June 2019) whose records were classified by CPRD as acceptable for use in research and approved for HES and ONS linkage.

Study endpoints

The primary endpoint was the first presentation of CVD as recorded in primary or secondary care. We investigated 10 CVDs: acute coronary syndrome, aortic aneurysm, aortic stenosis, atrial fibrillation or flutter, chronic ischaemic heart disease, heart failure, peripheral artery disease, second or third degree heart block, stroke (ischaemic, haemorrhagic, or unspecified), and venous thromboembolism (deep vein thrombosis or pulmonary embolism). We defined incident diagnoses as the first record of that condition in primary care or secondary care regardless of its order in the patient’s record.

Diseases were considered individually and as a composite outcome of all 10 CVDs combined. For the combined analyses, we calculated the primary incidence (considering only the first recorded CVD in each patient, reflecting the number of patients affected by CVDs) and the total incidence (considering all incident CVD diagnoses in each patient, reflecting the cumulative number of CVD diagnoses). We performed sensitivity analyses including diagnoses recorded on death certificates.
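
The distinction between primary and total incidence can be expressed compactly in code. The sketch below is illustrative only (it is not the authors' code, and the table and column names patient_id, condition, and diag_date are assumptions): total incidence counts every first-ever diagnosis of each condition, whereas primary incidence keeps only the first CVD of any kind per patient.

```python
# Illustrative sketch: primary v total incidence from a long table of
# incident diagnoses (one row per patient-condition diagnosis).
import pandas as pd

diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3],
    "condition": ["heart failure", "stroke", "atrial fibrillation",
                  "acute coronary syndrome", "heart failure", "stroke"],
    "diag_date": pd.to_datetime(["2005-03-01", "2011-07-15", "2008-01-10",
                                 "2003-05-20", "2009-02-02", "2014-11-30"]),
})

# Total incidence: every first-ever diagnosis of each condition counts once.
total_incident = (diagnoses.sort_values("diag_date")
                  .drop_duplicates(subset=["patient_id", "condition"], keep="first"))

# Primary incidence: only the first CVD of any kind per patient counts.
primary_incident = (total_incident.sort_values("diag_date")
                    .drop_duplicates(subset="patient_id", keep="first"))

print(len(total_incident), "incident diagnoses (total incidence)")    # 6
print(len(primary_incident), "patients with a first CVD (primary)")   # 3
```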

To identify diagnoses, we compiled a list of diagnostic codes based on the coding schemes in use in each data source following previously established methods. 13 14 15 We used ICD-10 (international classification of diseases, 10th revision) codes for diagnoses recorded in secondary care, ICD-9 (international classification of diseases, ninth revision) (in use until 31 December 2000) and ICD-10 codes for diagnoses recorded on death certificates (used in sensitivity analyses only), the UK Office of Population Censuses and Surveys classification (OPCS-4) for procedures performed in secondary care settings, and a combination of Read, SNOMED, and local EMIS codes for diagnoses recorded in primary care records (see supplementary table S1). 16 Supplementary texts S1, S2, and S3 describe our approach to the generation of the diagnostic code list as well as considerations and sensitivity analyses into the validity of diagnoses recorded in UK electronic health records.
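
As a rough illustration of how records can be flagged against such code lists, the sketch below matches diagnosis codes to conditions by coding scheme. The code prefixes shown are examples only and do not reproduce the study's code lists; the data structure and function names are invented for this example.

```python
# Illustrative sketch: flag a record's diagnosis code against per-condition
# code lists kept separately for each coding scheme (eg ICD-10, Read).
# The prefixes below are examples, not the study's actual code lists.
from typing import Optional

code_lists = {
    "heart failure": {"icd10": {"I50"}, "read": {"G58.."}},
    "atrial fibrillation or flutter": {"icd10": {"I48"}, "read": {"G573."}},
}

def match_condition(code: str, scheme: str) -> Optional[str]:
    """Return the condition whose code list covers this code, if any."""
    for condition, schemes in code_lists.items():
        for prefix in schemes.get(scheme, ()):
            if code.startswith(prefix.rstrip(".")):
                return condition
    return None

print(match_condition("I500", "icd10"))  # heart failure
print(match_condition("G573.", "read"))  # atrial fibrillation or flutter
```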

We selected covariates to represent a range of known cardiovascular risk factors. For clinical data, including systolic and diastolic blood pressure, smoking status, cholesterol (total:high density lipoprotein ratio), and body mass index (BMI), we abstracted data from primary care records as the most recent measurement within two years before the incident CVD diagnosis. BMI was categorised as underweight (<18.5), normal (18.5-24.9), overweight (25-29.9), and obesity (≥30). Information on the prevalence of chronic kidney disease, dyslipidaemia, hypertension, and type 2 diabetes was obtained as the percentage of patients with a diagnosis recorded in their primary care or secondary care record at any time up to and including the date of a first CVD diagnosis. Patients’ socioeconomic status was described using the index of multiple deprivation 2015, 17 a composite measure of seven dimensions (income, employment, education, health, crime, housing, living environment) and provided by CPRD. Measures of deprivation are calculated at small area level, covering an average population of 1500 people, and are presented in fifths, with the first 20% and last 20% representing the least and most deprived areas, respectively. We extracted information on ethnicity from both primary and secondary care records, and we used secondary care data when records differed. Ethnicity was grouped into four categories: African/Caribbean, Asian, white, and mixed/other. Finally, we extracted information on cardiovascular treatments (ie, aspirin and other antiplatelets, alpha adrenoceptor antagonists, aldosterone antagonists/mineralocorticoid receptor antagonists, angiotensin converting enzyme inhibitors, angiotensin II receptor antagonists, beta blockers, calcium channel blockers, diuretics, nitrates, oral anticoagulants, and statins) as the number of patients with at least two prescriptions of each drug class within six months after incident CVD, among patients alive and registered with a general practitioner 30 days after the diagnosis. Supplementary table S2 provides a list of substances included in each drug class. Prescriptions were extracted from primary care records up to 31 December 2019.
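
The "most recent measurement within two years before diagnosis" rule maps naturally onto a backward as-of join. The sketch below is a minimal illustration under assumed table and column names (events, bp, cvd_date, measure_date, systolic_bp); it is not the authors' pipeline.

```python
# Illustrative sketch: latest systolic blood pressure recorded within the
# two years (about 730 days) before each patient's incident CVD diagnosis.
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 2],
    "cvd_date": pd.to_datetime(["2010-06-01", "2012-03-15"]),
})
bp = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "measure_date": pd.to_datetime(["2009-01-10", "2010-05-20", "2008-01-01"]),
    "systolic_bp": [150, 142, 135],
})

# merge_asof needs both frames sorted by their join keys.
events = events.sort_values("cvd_date")
bp = bp.sort_values("measure_date")

linked = pd.merge_asof(
    events, bp,
    left_on="cvd_date", right_on="measure_date",
    by="patient_id",
    tolerance=pd.Timedelta(days=730),  # drop readings older than two years
    direction="backward",              # last reading on or before diagnosis
)
print(linked)
# Patient 1 keeps the 2010-05-20 reading (142); patient 2 gets NaN because
# the only reading falls outside the two year window.
```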

Statistical analyses

Categorical data for patient characteristics are presented as frequencies (percentages), and continuous data are presented as means and standard deviations (SDs) for symmetrically distributed data or medians and interquartile ranges (IQRs) for non-symmetrically distributed data, over the whole CVD cohort and stratified by age, sex, socioeconomic status, region, and calendar year of diagnosis. For variables with missing entries, we present numbers and percentages of records with missing data. For categorical variables, frequencies refer to complete cases.

Incidence rates of CVD were calculated by dividing the number of incident diagnoses by the number of patient years in the cohort. Category specific rates were computed separately for subgroups of age, sex, socioeconomic status, region, and calendar year of diagnosis. Age calculations were updated for each calendar year. To ensure calculations referred to incident diagnoses, we excluded individuals, from both the numerator and the denominator populations, with a disease of interest diagnosed before the study start date (1 January 2000), or within the first 12 months of registration with their general practice. Time at risk started at the latest of the patient’s registration date plus 12 months, 30 June of their birth year, or study start date; and stopped at the earliest of death, transfer out of practice, last collection date of the practice, incidence of the disease of interest, or linkage end date (30 June 2019). Disease incidence was standardised for age and sex 18 using the 2013 European standard population 19 in five year age bands up to age 90 years.
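
Direct standardisation amounts to weighting the category specific rates by a fixed standard population. A minimal sketch follows; the three age bands, counts, and standard weights are invented for illustration, whereas the study uses the full 2013 European standard population in five year bands up to age 90.

```python
# Illustrative sketch: directly age standardised incidence rate per
# 100 000 person years from stratum specific counts and person time.
import pandas as pd

strata = pd.DataFrame({
    "age_band": ["60-64", "65-69", "70-74"],
    "events": [120, 210, 340],                 # hypothetical incident diagnoses
    "person_years": [80_000, 70_000, 55_000],  # hypothetical person time at risk
    "std_weight": [6_000, 5_500, 5_000],       # illustrative standard weights
})

strata["rate"] = strata["events"] / strata["person_years"]

# Weighted average of stratum rates, using the standard population weights.
std_rate = (strata["rate"] * strata["std_weight"]).sum() / strata["std_weight"].sum()
print(f"{std_rate * 100_000:.1f} per 100 000 person years")
```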

Negative binomial regression models were used to calculate overall and category specific incidence rate ratios and corresponding 95% confidence intervals (CIs). 20 Models were adjusted for calendar year of diagnosis, age (categorised into five year age bands), sex, socioeconomic status, and region. We chose negative binomial models over Poisson models to account for potential overdispersion in the data. Sensitivity analyses comparing Poisson and negative binomial models showed similar results.
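
A minimal sketch of this kind of model, using simulated stratum level data and assumed variable names (it is not the study code): counts are modelled with a negative binomial generalised linear model, person years enter as an offset on the log scale, and exponentiated coefficients can be read as incidence rate ratios.

```python
# Illustrative sketch: negative binomial regression of event counts with a
# person years offset; exp(coefficients) are incidence rate ratios (IRRs).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "events": rng.poisson(20, n),                    # simulated counts per stratum
    "person_years": rng.uniform(5_000, 20_000, n),   # simulated person time
    "period": rng.choice(["2000-02", "2017-19"], n),
    "sex": rng.choice(["men", "women"], n),
    "imd_fifth": rng.integers(1, 6, n).astype(str),  # deprivation fifth
})

model = smf.glm(
    "events ~ C(period) + C(sex) + C(imd_fifth)",
    data=df,
    family=sm.families.NegativeBinomial(alpha=1.0),  # dispersion fixed for the sketch
    offset=np.log(df["person_years"]),
).fit()

irr = np.exp(model.params)      # incidence rate ratios
ci = np.exp(model.conf_int())   # 95% confidence intervals
print(pd.concat([irr.rename("IRR"),
                 ci.rename(columns={0: "2.5%", 1: "97.5%"})], axis=1))
```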

Study findings are reported according to the RECORD (reporting of studies conducted using observational routinely collected health data) recommendations. 21 We performed statistical analyses in R, version 4.3.3 (R Foundation for Statistical Computing, Vienna, Austria).

Patient and public involvement

No patients or members of the public were directly involved in this study owing to constraints on funding and time.

A total of 22 009 375 individuals contributed data between 1 January 2000 and 30 June 2019, with 146 929 629 person years of follow-up. Among those, we identified 2 906 770 new CVD diagnoses, affecting 1 650 052 patients. Mean age at first CVD diagnosis was 70.5 (SD 15.0) years, 47.6% (n=784 904) of patients were women, and 11.6% (n=191 421), 18.0% (n=296 554), 49.7% (n=820 892), and 14.2% (n=233 833) of patients had a history of chronic kidney disease, dyslipidaemia, hypertension, and type 2 diabetes, respectively, at the time of their first CVD diagnosis ( table 1 ).

Characteristics of patients with a first diagnosis of CVD, 2000-19. Values are number (percentage) unless stated otherwise

During 2017-19, the most common CVDs were atrial fibrillation or flutter (age-sex standardised incidence 478 per 100 000 person years), heart failure (367 per 100 000 person years), and chronic ischaemic heart disease (351 per 100 000 person years), followed by acute coronary syndrome (190 per 100 000 person years), venous thromboembolism (183 per 100 000 person years), and stroke (181 per 100 000 person years) ( fig 1 ).

Fig 1

Incidence of a first diagnosis of cardiovascular disease per 100 000 person years, 2000-19. Incidence rates are age-sex standardised to the 2013 European standard population. Any cardiovascular disease refers to the primary incidence of cardiovascular disease across the 10 conditions investigated (ie, number of patients with a first diagnosis of cardiovascular disease). See supplementary table S4 for crude incidence rates by age and sex groups. IRR=incidence rate ratio

Temporal trends

The primary incidence of CVDs (ie, the number of patients with CVD) decreased by 20% during 2000-19 (age-sex standardised incidence rate ratio 2017-19 v 2000-02: 0.80 (95% CI 0.73 to 0.88)). However, the total incidence of CVD (ie, the total number of new CVD diagnoses) remained relatively stable owing to an increasing number of subsequent diagnoses among patients already affected by a first CVD (incidence rate ratio 2017-19 v 2000-02: 1.00 (0.91 to 1.10)).

The observed decline in CVD incidence was largely due to declining rates of atherosclerotic diseases, in particular acute coronary syndrome, chronic ischaemic heart disease, and stroke, which decreased by about 30% during 2000-19. The incidence of peripheral artery disease also declined, although more modestly (incidence rate ratio 2017-19 v 2000-02: 0.89 (0.80 to 0.98)) ( fig 1 ).

The incidence of non-atherosclerotic heart diseases increased at varying rates, with incidence of aortic stenosis and heart block more than doubling over the study period (2017-19 v 2000-02: 2.42 (2.13 to 2.74) and 2.22 (1.99 to 2.46), respectively) ( fig 1 ). These increasing rates of non-atherosclerotic heart diseases balanced the reductions in ischaemic diseases so that the overall incidence of CVD across the 10 conditions appeared to reach a plateau and to remain relatively stable from 2007-08 (incidence rate ratio 2017-19 v 2005-07: 1.00 (0.91 to 1.10)) ( fig 2 ).

Fig 2

Age standardised incidence of cardiovascular disease by sex, 2000-19. Any cardiovascular disease refers to the primary incidence of cardiovascular disease across the 10 conditions investigated (ie, number of patients with a first diagnosis of cardiovascular disease). IRR=incidence rate ratio

Age stratified analyses further showed that the observed decrease in incidence of chronic ischaemic heart disease, acute coronary syndrome, and stroke was largely due to a reduced incidence in those aged >60 years, whereas incidence rates in those aged <60 years remained relatively stable ( fig 3 and fig 4 ).

Fig 3

Sex standardised incidence of cardiovascular disease in all age groups. Any cardiovascular disease refers to the primary incidence of cardiovascular disease across the 10 conditions investigated (ie, number of patients with a first diagnosis of cardiovascular disease)

Fig 4

Sex standardised incidence of cardiovascular diseases by age subgroups <69 years. Any cardiovascular disease refers to the primary incidence of cardiovascular disease across the 10 conditions investigated (ie, number of patients with a first diagnosis of cardiovascular disease)

Age at diagnosis

CVD incidence was largely concentrated towards the end of the life span, with a median age at diagnosis generally between 65 and 80 years. Only venous thromboembolism was commonly diagnosed before age 45 years ( fig 5 ). Over the study period, age at first CVD diagnosis declined for several conditions, including stroke (on average diagnosed 1.9 years earlier in 2019 than in 2000), heart block (1.3 years earlier in 2019 than in 2000), and peripheral artery disease (1 year earlier in 2019 than in 2000) (see supplementary figure S1). Adults with a diagnosis before age 60 years were more likely to be from lower socioeconomic groups and to have a higher prevalence of several risk factors, including obesity, smoking, and high cholesterol levels (see supplementary table S3).

Fig 5

Incidence rates of cardiovascular diseases calculated by one year age bands and divided into a colour gradient of 20 quantiles to reflect incidence density by age. IQR=interquartile range

Incidence by sex

Age adjusted incidence of all CVDs combined was higher in men (incidence rate ratio for men v women: 1.46 (1.41 to 1.51)), with the notable exception of venous thromboembolism, which was similar between men and women. The incidence of aortic aneurysms was higher in men (3.49 (3.33 to 3.65)) ( fig 2 ). The crude incidence of CVD, however, was similar between men and women (1069 per 100 000 person years and 1176 per 100 000 person years, respectively), owing to the higher number of women in older age groups. Temporal trends in disease incidence were generally similar between men and women ( fig 2 ).

Incidence by socioeconomic status

The most deprived socioeconomic groups had a higher incidence of any CVDs (incidence rate ratio most deprived v least deprived: 1.37 (1.30 to 1.44)) ( fig 6 ). A socioeconomic gradient was observed across almost every condition investigated. That gradient did not decrease over time, and it was most noticeable for peripheral artery disease (incidence rate ratio most deprived v least deprived: 1.98 (1.87 to 2.09)), acute coronary syndrome (1.55 (1.54 to 1.57)), and heart failure (1.50 (1.41 to 1.59)). For aortic aneurysms, atrial fibrillation, heart failure, and aortic stenosis, socioeconomic inequalities in disease incidence appeared to increase over time.

Fig 6

Age-sex standardised incidence rates of cardiovascular diseases by socioeconomic status (index of multiple deprivation 2015). Any cardiovascular disease refers to the primary incidence of cardiovascular disease across the 10 conditions investigated (ie, number of patients with a first diagnosis of cardiovascular disease). Yearly incidence estimates were smoothed using loess (locally estimated scatterplot smoothing) regression lines

Regional differences

Higher incidence rates were seen in northern regions (north west, north east, Yorkshire and the Humber) of England for all 10 conditions investigated, even after adjusting for socioeconomic status. Aortic aneurysms and aortic stenosis had the strongest regional gradients, with incidence rates about 30% higher in northern regions compared with London. Geographical variations remained modest, however, and did not appear to change considerably over time (see supplementary figure S2).

Sensitivity analyses

In sensitivity analyses that used broader disease definitions, that included diagnoses recorded on death certificates, that relied on longer lookback periods for exclusion of potentially prevalent diagnoses, or that were restricted to diagnoses recorded during hospital admissions, temporal trends in disease incidence appeared similar (see supplementary figures S3-S6).

Secondary prevention treatments

The proportion of patients using statins and antihypertensive drugs after a first CVD diagnosis increased over time, whereas the use of non-dihydropyridine calcium channel blockers, nitrates, and diuretics decreased. Non-vitamin K antagonist oral anticoagulants increasingly replaced vitamin K antagonists (see supplementary figure S7).

The findings of this study suggest that important changes occurred in the distribution of CVDs during 2000-19 and highlight several areas of concern: the incidence of non-atherosclerotic heart diseases increased, the decline in atherosclerotic disease stalled in younger people, and a socioeconomic gradient was evident across almost every CVD investigated.

Implications for clinical practice and policy

Although no causal inference can be made from our data, the decline in rates of ischaemic diseases coincided with reductions in the prevalence of risk factors such as smoking, hypertension, and raised cholesterol levels in the general population over the same period, 22 and this finding suggests that efforts in the primary and secondary prevention of atherosclerotic diseases have been successful. The decline in stroke was not as noticeable as that for coronary heart disease, which may reflect the rising incidence of atrial fibrillation. The variation in trends for peripheral artery disease could be due to differences in risk factors (eg, a stronger association with diabetes), the multifaceted presentations and causes, and the introduction of systematic leg examinations for people with diabetes. 23 24

All the non-atherosclerotic diseases, however, appeared to increase during 2000-19. For some conditions, such as heart failure, the observed increase remained modest, whereas for others, such as aortic stenosis and heart block, incidence rates doubled. All analyses in this study were standardised for age and sex, to illustrate changes in disease incidence independently of changes in population demographics. Whether these trends solely reflect increased awareness, access to diagnostic tests, or even screening (eg, for abdominal aortic aneurysm 25 ) and coding practices, is uncertain. Reductions in premature death from coronary heart disease may have contributed to the emergence of these other non-atherosclerotic CVDs. Regardless, the identification of increasing numbers of people with these problems has important implications for health services, especially the provision of more surgical and transcatheter valve replacement, pacemaker implantation, and catheter ablation for atrial fibrillation. Importantly, these findings highlight the fact that for many cardiovascular conditions such as heart block, aortic aneurysms, and non-rheumatic valvular diseases, current medical practice remains essentially focused on the management of symptoms and secondary prevention and that more research into underlying causes and possible primary prevention strategies is needed. 26 27

These varying trends also mean that the contribution of individual CVDs towards the overall burden has changed. For example, atrial fibrillation or flutter is now the most common CVD in the UK. Atrial fibrillation is also a cause (and consequence) of heart failure, and these two increasingly common problems may amplify the incidence of each other. Venous thromboembolism and heart block also appeared as important contributors to overall CVD burden, with incidence rates similar to those of stroke and acute coronary syndrome, yet both receive less attention in terms of prevention efforts.

The stalling decline in the rate of coronary heart disease in younger age groups is of concern, has also been observed in several other high income countries, and may reflect rising rates of physical inactivity, obesity, and type 2 diabetes in young adults. 4 6 28 The stalled decline suggests prevention approaches may need to be expanded beyond antismoking legislation, blood pressure control, and lipid lowering interventions to include the promotion of physical activity, weight control, and use of new treatments shown to reduce cardiovascular risk in people with type 2 diabetes. 29 Although CVD incidence is generally low in people aged <60 years, identifying those at high risk of developing CVD at a young age and intervening before problems occur could reduce premature morbidity and mortality and have important economic implications.

Our study further found that socioeconomic inequalities may contribute to CVD burden, and that this association is not restricted to selected conditions but is visible across most CVDs. The reasons behind the observed increase in risk in relation to socioeconomic inequalities are likely to be multifactorial and to include environmental, occupational, psychosocial, and behavioural risk factors, including established cardiovascular risk factors such as smoking, obesity, nutrition, air pollution, substance misuse, and access to care. 30 How these findings apply to different countries is likely to be influenced by socioeconomic structures and healthcare systems, although health inequalities have been reported in numerous countries. 30 One important factor in the present study is that access to care is free at the point of care in the UK, 31 and yet socioeconomic inequalities persist despite universal health coverage and they did not appear to improve over time. Independently of the specificities of individual countries, our findings highlight the importance of measuring and considering health inequalities and suggest that dealing with the social determinants of health—the conditions under which people are born, live, work, and age—could potentially bring substantial health improvements across a broad range of chronic conditions.

Finally, our results reflect disease incidence based on diagnostic criteria, screening practices, availability, and accuracy of diagnostic tests in place at a particular time and therefore must be interpreted within this context. 32 Several of the health conditions investigated are likely to have been sought and detected with increasing intensity over the study period. For example, during the study period the definition of myocardial infarction was revised several times, 33 34 35 and high sensitivity troponins were progressively introduced in the UK from 2010. These more sensitive markers of cardiac injury are thought to have increased the detection rates for less severe disease. 36 37 Similarly, increased availability of computed tomography may have increased detection rates for stroke. 38 These changes could have masked an even greater decline in these conditions than observed in the present study. Conversely, increased use of other biochemical tests (such as natriuretic peptides) and more sensitive imaging techniques might have increased the detection of other conditions. 39 40 41 The implementation of a screening programme for aortic aneurysm and incentive programmes aimed at improving coding practices, including the documentation of CVD, associated risk factors and comorbidities, and treatment of these, are also likely to have contributed to the observed trends. 25 42 43 As a result, the difference in incidence estimates and prevalence of comorbidities over time may not reflect solely changes in the true incidence but also differences in ascertainment of people with CVD. 44 Nonetheless, long term trends in large and unconstrained populations offer valuable insights for healthcare resource planning and for the design of more targeted prevention strategies, insights that could not otherwise be obtained from smaller cohorts, cross sectional surveys, or clinical trials; and precisely because they are based on routinely reported diagnoses they are more likely to capture the burden of disease as experienced by doctors and health services.

Strengths and limitations of this study

A key strength of this study is its statistical power, with >140 million person years of data. The large size of the cohort allowed us to perform incidence calculations for a broad spectrum of conditions, and to examine the influence of age, sex, and socioeconomic status as well as trends over 20 years. One important limitation of our study was the modest ethnic diversity in our cohort and the lack of information on ethnicity for the denominator population, which precluded us from stratifying incidence estimates by ethnic group. Our analyses were also limited by the unavailability or considerable missingness of additional variables potentially relevant to the development of CVD, such as smoking, body mass index, imaging data, women specific cardiovascular risk factors (eg, pregnancy associated hypertension and gestational diabetes), and blood biomarkers. Further research may also need to consider an even wider spectrum of CVDs, including individual types of valve disease, pregnancy related conditions, and infection related heart diseases. Research using databases with electronic health records is also reliant on the accuracy of clinical coding input by doctors in primary care as part of a consultation, or in secondary care as part of a hospital admission. We therefore assessed the validity of diagnoses in UK electronic health records data and considered it to be appropriate in accordance with the >200 independent validation studies reporting an average positive predictive value of about 90% for recorded diagnoses. 45 Observed age distributions were also consistent with previous studies and added to the validity of our approach. Nevertheless, our results must be interpreted within the context and limitations of routinely collected data from health records, diagnostic criteria, screening practices, the availability and accuracy of diagnostic tests in place at that time, and the possibility that some level of miscoding is present or that some bias could have been introduced by restricting the cohort to those patients with at least 12 months of continuous data.

Conclusions

Efforts to challenge the notion of the inevitability of vascular events with ageing, and evidence based recommendations for coronary heart disease prevention, have been successful and can serve as a model for other non-communicable diseases. Our findings show that it is time to expand efforts to improve the prevention of CVDs. Broadening research and implementation efforts in both primary and secondary prevention to non-atherosclerotic diseases, tackling socioeconomic inequalities, and introducing better risk prediction and management among younger people appear to be important opportunities to tackle CVDs.

What is already known on this topic

Recent data show that despite decades of declining rates of cardiovascular mortality, the burden from cardiovascular disease (CVD) appears to have stalled in several high income countries

What this study adds

This observational study of a representative sample of 22 million people from the UK during 2000-19 found that reductions in CVD incidence were largely restricted to ischaemic heart disease and stroke and were paralleled by a rising number of diagnoses of cardiac arrhythmias, valve disease, and thromboembolic events

Venous thromboembolism and heart block were important contributors to the overall burden of CVDs, with incidence rates similar to stroke and acute coronary syndromes

Improvements in rates of coronary heart disease almost exclusively appeared to benefit those aged >60 years, and the CVD burden in younger age groups appeared not to improve

Ethics statements

Ethical approval.

This study was approved by the Clinical Practice Research Datalink Independent Scientific Advisory Committee.

Data availability statement

Access to Clinical Practice Research Datalink (CPRD) data is subject to a license agreement and protocol approval process that is overseen by CPRD's research data governance process. A guide to access is provided on the CPRD website ( https://www.cprd.com/data-access ). To facilitate the subsequent use and replication of the findings from this study, aggregated data tables are provided with the number of events and person years at risk by individual condition and by calendar year, age (by five year age band), sex, socioeconomic status, and region (masking fields with fewer than five events, as per CPRD data security and privacy regulations) on our GitHub repository ( https://github.com/nathalieconrad/CVD_incidence ).

Acknowledgments

We thank Hilary Shepherd, Sonia Coton, and Eleanor L Axson from the Clinical Practice Research Datalink for their support and expertise in preparing the dataset underlying these analyses.

Contributors: NC and JJVM conceived and designed the study. NC, JJVM, GM, and GV designed the statistical analysis plan and NC performed the statistical analysis. All authors contributed to interpreting the results, drafting the manuscript, and the revisions. NC, GM, and GV had permission to access the raw data and NC and GM verified the raw data. All authors gave final approval of the version to be published and accept responsibility to submit the manuscript for publication. NC and JJVM accept full responsibility for the conduct of the study, had access to aggregated data, and controlled the decision to publish. They are the guarantors. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

Funding: This study was funded by a personal fellowship from the Research Foundation Flanders (grant No 12ZU922N), a research grant from the European Society of Cardiology (grant No App000037070), and the British Heart Foundation Centre of Research Excellence (grant No RE/18/6/34217). The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: NC is funded by a personal fellowship from the Research Foundation Flanders and a research grant from the European Society of Cardiology. JMF, PSJ, JGC, NS, and JJVM are supported by British Heart Foundation Centre of Research Excellence. PSJ and JJVM are further supported by the Vera Melrose Heart Failure Research Fund. JJVM has received funding to his institution from Amgen and Cytokinetics for his participation in the steering committee for the ATOMIC-HF, COSMIC-HF, and GALACTIC-HF trials and meetings and other activities related to these trials; has received payments through Glasgow University from work on clinical trials, consulting, and other activities from Alnylam, Amgen, AstraZeneca, Bayer, Boehringer Ingelheim, Bristol Myers Squibb, Cardurion, Dal-Cor, GlaxoSmithKline, Ionis, KBP Biosciences, Novartis, Pfizer, and Theracos; and has received personal lecture fees from the Corpus, Abbott, Hikma, Sun Pharmaceuticals, Medscape/Heart.Org, Radcliffe Cardiology, Alkem Metabolics, Eris Lifesciences, Lupin, ProAdWise Communications, Servier Director, and Global Clinical Trial Partners. NS declares consulting fees or speaker honorariums, or both, from Abbott Laboratories, Afimmune, Amgen, AstraZeneca, Boehringer Ingelheim, Lilly, Hanmi Pharmaceuticals, Janssen, Merck Sharp & Dohme, Novartis, Novo Nordisk, Pfizer, Roche Diagnostics, and Sanofi; and grant support paid to his university from AstraZeneca, Boehringer Ingelheim, Novartis, and Roche Diagnostics. KK has acted as a consultant or speaker or received grants for investigator initiated studies for Astra Zeneca, Bayer, Novartis, Novo Nordisk, Sanofi-Aventis, Lilly, Merck Sharp & Dohme, Boehringer Ingelheim, Oramed Pharmaceuticals, Roche, and Applied Therapeutics. KK is supported by the National Institute for Health and Care Research (NIHR) Applied Research Collaboration East Midlands (ARC EM) and the NIHR Leicester Biomedical Research Centre (BRC). CL is funded by an NIHR Advanced Research Fellowship (NIHR-300111) and supported by the Leicester BRC. PSJ has received speaker fees from AstraZeneca, Novartis, Alkem Metabolics, ProAdWise Communications, Sun Pharmaceuticals, and Intas Pharmaceuticals; has received advisory board fees from AstraZeneca, Boehringer Ingelheim, and Novartis; has received research funding from AstraZeneca, Boehringer Ingelheim, Analog Devices; his employer, the University of Glasgow, has been remunerated for clinical trial work from AstraZeneca, Bayer, Novartis, and Novo Nordisk; and is the Director of Global Clinical Trial Partners. HS is supported by the China Scholarship Council. Other authors report no support from any organisation for the submitted work, no financial relationships with any organisations that might have an interest in the submitted work in the previous three years, and no other relationships or activities that could appear to have influenced the submitted work.

Transparency: The lead author (NC) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

Dissemination to participants and related patient and public communities: Results from this study will be shared with patient associations and foundations dedicated to preventing cardiovascular diseases, such as the European Heart Network and the American Heart Association. To reach the public, findings will also be press released alongside publication of this manuscript. Social media (eg, X) will be used to draw attention to the work and stimulate debate about its findings. Finally, the underlying developed algorithms will be freely available for academic use at https://github.com/nathalieconrad/CVD_incidence .

Provenance and peer review: Not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/ .

  • ↵ Institute for Health Metrics and Evaluation. Global Burden of Diseases Viz Hub. 2023. https://vizhub.healthdata.org/gbd-compare/
  • Ananth CV ,
  • Rutherford C ,
  • Rosenfeld EB ,
  • O’Flaherty M ,
  • Allender S ,
  • Scarborough P ,
  • Andersson C ,
  • Abdalla SM ,
  • Mensah GA ,
  • Johnson CO ,
  • GBD-NHLBI-JACC Global Burden of Cardiovascular Diseases Writing Group
  • Almarzooq ZI ,
  • American Heart Association Council on Epidemiology and Prevention Statistics Committee and Stroke Statistics Subcommittee
  • Townsend N ,
  • Atlas Writing Group, European Society of Cardiology
  • Herrett E ,
  • Gallagher AM ,
  • Bhaskaran K ,
  • ↵ Wolf A, Dedman D, Campbell J, et al. Data resource profile: Clinical Practice Research Datalink (CPRD) Aurum. International Journal of Epidemiology. 2019 Mar 11 [cited 2019 Mar 22]; https://academic.oup.com/ije/advance-article/doi/10.1093/ije/dyz034/5374844
  • Verbeke G ,
  • Molenberghs G ,
  • Ferguson LD ,
  • ↵ Medicines and Healthcare products Regulatory Agency. What coding systems are used in CPRD data? 2023. https://www.cprd.com/defining-your-study-population
  • ↵ Department for Communities and Local Government (DCLG). The English Index of Multiple Deprivation 2015: Guidance. 2015; pp1-7. https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015
  • ↵ Kirkwood B, Sterne J. Essential medical statistics. 2010.
  • ↵ Eurostat’s task force. Revision of the European Standard Population Report. 2013.
  • Benchimol EI ,
  • Guttmann A ,
  • RECORD Working Committee
  • ↵ NHS Digital. Health Survey for England, 2021 part 1. https://digital.nhs.uk/data-and-information/publications/statistical/health-survey-for-england/2021
  • Criqui MH ,
  • ↵ Health and Social Care Information Centre. Quality and Outcomes Framework - Indicators 2011-12. 2011. https://digital.nhs.uk/data-and-information/publications/statistical/quality-and-outcomes-framework-achievement-prevalence-and-exceptions-data/quality-and-outcomes-framework-2011-12
  • Jacomelli J ,
  • Summers L ,
  • Stevenson A ,
  • Earnshaw JJ
  • Vahanian A ,
  • Beyersdorf F ,
  • ESC/EACTS Scientific Document Group
  • Glikson M ,
  • Nielsen JC ,
  • Kronborg MB ,
  • ESC Scientific Document Group
  • Stevenson C ,
  • Peeters A ,
  • Federici M ,
  • Schultz WM ,
  • Verbakel JY ,
  • Thygesen K ,
  • Alpert JS ,
  • Joint ESC/ACCF/AHA/WHF Task Force for the Redefinition of Myocardial Infarction
  • Joint ESC/ACCF/AHA/WHF Task Force for the Universal Definition of Myocardial Infarction
  • Sandoval Y ,
  • High-STEACS investigators
  • Camargo ECS ,
  • Singhal AB ,
  • Wiener RS ,
  • Schwartz LM ,
  • Sappler N ,
  • Roalfe AK ,
  • Lay-Flurrie SL ,
  • Ordóñez-Mena JM ,
  • Herbert A ,
  • Wijlaars L ,
  • Zylbersztejn A ,
  • Cromwell D ,
  • ↵ NHS Digital. Quality and Outcomes Framework (QOF), 2019-20. Indicator definitions. 2020. https://digital.nhs.uk/data-and-information/publications/statistical/quality-and-outcomes-framework-achievement-prevalence-and-exceptions-data/2019-20
  • Pronovost PJ
  • Thomas SL ,
  • Schoonen WM ,

One Cohort at a Time: A New Perspective on the Declining Gender Pay Gap

This paper studies the interaction between the decrease in the gender pay gap and the stagnation in the careers of younger workers, analyzing data from the United States, Italy, Canada, and the United Kingdom. We propose a model of the labor market in which a larger supply of older workers can crowd out younger workers from top-paying positions. These negative career spillovers disproportionately affect the career trajectories of younger men because they are more likely than younger women to hold higher-paying jobs at baseline. The data strongly support this cohort-driven interpretation of the shrinking gender pay gap. The whole decline in the gap originates from (i) newer worker cohorts who enter the labor market with smaller-than-average gender pay gaps and (ii) older worker cohorts who exit with higher-than-average gender pay gaps. As predicted by the model, the gender pay convergence at labor-market entry stems from younger men's larger positional losses in the wage distribution. Younger men experience the largest positional losses within higher-paying firms, in which they become less represented over time at a faster rate than younger women. Finally, we document that labor-market exit is the sole contributor to the decline in the gender pay gap after the mid-1990s, which implies no full gender pay convergence for the foreseeable future. Consistent with our framework, we find evidence that most of the remaining gender pay gap at entry depends on predetermined educational choices.

We thank Patricia Cortés, Gordon Dahl, Fabian Lange, Claudia Olivetti, Michael Powell, Uta Schönberg, as well as participants at various seminars and conferences for helpful comments. We thank Sergey Abramenko, Thomas Barden, Carolina Bussotti, Sean Chen, and Chengmou Lei for outstanding research assistance. The realization of this article was possible thanks to the sponsorship of the “VisitINPS Scholars” program. The views expressed in this paper are those of the authors only and should not be attributed to the Bank of Italy, the Eurosystem, or the National Bureau of Economic Research.

Computer Science > Computer Vision and Pattern Recognition

Title: Burst Image Super-Resolution with Base Frame Selection

Abstract: Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst Image (NEBI), that includes the burst frames at varying exposure times to obtain a broader range of irradiance and motion characteristics within a scene. As burst shots with non-uniform exposures exhibit varying levels of degradation, fusing information of the burst shots into the first frame as a base frame may not result in optimal image quality. To address this limitation, we propose a Frame Selection Network (FSN) for non-uniform scenarios. This network seamlessly integrates into existing super-resolution methods in a plug-and-play manner with low computational costs. The comparative analysis reveals the effectiveness of the non-uniform setting for the practical scenario and our FSN on synthetic and real NEBI datasets.
Comments: CVPR2024W NTIRE accepted
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Research: Using AI at Work Makes Us Lonelier and Less Healthy

  • David De Cremer
  • Joel Koopman

Employees who use AI as a core part of their jobs report feeling more isolated, drinking more, and sleeping less than employees who don’t.

The promise of AI is alluring — optimized productivity, lightning-fast data analysis, and freedom from mundane tasks — and companies and workers alike are fascinated (and more than a little dumbfounded) by how these tools allow them to do more and better work faster than ever before. Yet in the fervor to keep pace with competitors and reap the efficiency gains associated with deploying AI, many organizations have lost sight of their most important asset: the humans whose jobs are being fragmented into tasks that are increasingly becoming automated. Across four studies, employees who use AI as a core part of their jobs reported feeling lonelier, drinking more, and suffering from insomnia more than employees who don't.

Imagine this: Jia, a marketing analyst, arrives at work, logs into her computer, and is greeted by an AI assistant that has already sorted through her emails, prioritized her tasks for the day, and generated first drafts of reports that used to take hours to write. Jia (like everyone who has spent time working with these tools) marvels at how much time she can save by using AI. Inspired by the efficiency-enhancing effects of AI, Jia feels that she can be so much more productive than before. As a result, she gets focused on completing as many tasks as possible in conjunction with her AI assistant.

  • David De Cremer is a professor of management and technology at Northeastern University and the Dunton Family Dean of its D’Amore-McKim School of Business. His website is daviddecremer.com .
  • JK Joel Koopman is the TJ Barlow Professor of Business Administration at the Mays Business School of Texas A&M University. His research interests include prosocial behavior, organizational justice, motivational processes, and research methodology. He has won multiple awards from Academy of Management’s HR Division (Early Career Achievement Award and David P. Lepak Service Award) along with the 2022 SIOP Distinguished Early Career Contributions award, and currently serves on the Leadership Committee for the HR Division of the Academy of Management .

COMMENTS

  1. 19024 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATABASE MANAGEMENT SYSTEMS. Find methods information, sources, references or conduct a literature ...

  2. The best academic research databases [Update 2024]

    Organize your papers in one place. Try Paperpile. 1. Scopus. Scopus is one of the two big commercial, bibliographic databases that cover scholarly literature from almost any discipline. Besides searching for research articles, Scopus also provides academic journal rankings, author profiles, and an h-index calculator. 2.

  3. 8446 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATABASE RESEARCH. Find methods information, sources, references or conduct a literature review on ...

  4. 57687 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATABASE DESIGN. Find methods information, sources, references or conduct a literature review on ...

  5. Databases

    A database is one or more sets of data, for example numbers, characters and images, bundled together with software that enables data to be added, removed or retrieved. Databases can be used to ...

  6. data science Latest Research Papers

    Assessing the effects of fuel energy consumption, foreign direct investment and GDP on CO2 emission: New data science evidence from Europe & Central Asia. Fuel. 10.1016/j.fuel.2021.123098. 2022. Vol 314. pp. 123098. Author(s): Muhammad Mohsin, Sobia Naseem.

  7. Privacy Prevention of Big Data Applications: A Systematic Literature

    Big Data varies from traditional technology in four aspects, according to current research: volume, diversity, speed, and value. The velocity, diversity, and volume of massive data present new security concerns including a large cloud infrastructure, the distinction between data source and design, and the cascading nature of data collection.

  8. Academic Databases

    Academic research isn't difficult if you know where and how to search for scholarly articles and research papers. Here's how to do it. How to use Google Scholar: the ultimate guide ... Your research is stuck, and you need to find new sources. Take a look at our compilation of free academic search engines: Google Scholar BASE CORE Science.gov ...

  9. PubMed

    PubMed is a comprehensive database of biomedical literature from various sources, including MEDLINE, life science journals, and online books. You can search for citations, access full text content, and explore topics related to health, medicine, and biology. PubMed also provides advanced search options and tools for researchers and clinicians.

  10. Database Search

    What is Database Search? Harvard Library licenses hundreds of online databases, giving you access to academic and news articles, books, journals, primary sources, streaming media, and much more. The contents of these databases are only partially included in HOLLIS. To make sure you're really seeing everything, you need to search in multiple places.

  11. Academic research: how to search online databases [8 steps ...

    Find databases that are specifically related to your topic. 3. Set up the search parameters within a database to be as narrow as possible. 4. Ask a librarian for help. 5. Slowly expand your search to get additional results. 6. Use the pro features of the database.

  12. database security Latest Research Papers

    One way to maintain the security of the database is to use encryption techniques. The method used to secure the database is encryption using the ROT13 and Caesar Cipher methods. Both of these methods have advantages in processing speed. For this reason, the author will compare the use of the two algorithms above in terms of the encryption and ...

  13. Big Data Research

    About the journal. The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in …. View full aims & scope.

  14. JSTOR Home

    Harness the power of visual materials—explore more than 3 million images now on JSTOR. Enhance your scholarly research with underground newspapers, magazines, and journals. Explore collections in the arts, sciences, and literature from the world's leading museums, archives, and scholars. JSTOR is a digital library of academic journals ...

  15. Search eLibrary :: SSRN

    Definitions of Measures Associated with References, Cites, and Citations. Total References: Total number of references to other papers that have been resolved to date, for papers in the SSRN eLibrary. Total Citations: Total number of cites to papers in the SSRN eLibrary whose links have been resolved to date. Note: The links for the two pages containing a paper's References and Citation links ...

  16. distributed database systems Latest Research Papers

    Different Types. In this paper, we study the different types of fault tolerance techniques which are used in various distributed database systems. The main focus of this research is how the data are stored in the servers, the fault detection techniques, and the recovery techniques used. A fault can occur for many reasons.

  17. Research Area: DBMS

    Faculty and students at Berkeley have repeatedly defined and redefined the broad field of data management, combining deep intellectual impact with the birth of multi-billion dollar industries, including relational databases, RAID storage, scalable Internet search, and big data analytics. Berkeley also gave birth to many of the most widely-used ...

  18. 10 Current Database Research Topic Ideas in 2024

    10 Current Database Research Topic Ideas in 2024. As we head towards the second half of 2024, the world of technology evolves at a rapid pace. With the rise of AI and blockchain, the demand for data, its management and the need for security increases rapidly. A logical consequence of these changes is the way fields like database security ...

  19. Search

    Find the research you need | With 160+ million publications, 1+ million questions, and 25+ million researchers, this is where everyone can access science

  20. Detecting hallucinations in large language models using ...

    Large language model (LLM) systems, such as ChatGPT1 or Gemini2, can show impressive reasoning and question-answering capabilities but often 'hallucinate' false outputs and unsubstantiated ...

  21. Tirzepatide for the Treatment of Obstructive Sleep Apnea and Obesity

    At baseline, the mean AHI was 51.5 events per hour in trial 1 and 49.5 events per hour in trial 2, and the mean body-mass index (BMI, the weight in kilograms divided by the square of the height in ...

  22. Microsoft Research

    Explore research at Microsoft, a site featuring the impact of research along with publications, products, downloads, and research careers. ... A new framework for biological and artificial embodied agents . June 19, 2024 | Dongqi Han ... Data Science and Research: MSc & PhD Internship Opportunities . Posted: April 22, ...

  23. US professor charged with manipulating data for Alzheimer's drug trial

    A U.S. medical professor has been charged with fraud for allegedly submitting false data to get millions of dollars in public funds for research into a drug to treat Alzheimer's disease.

  24. The top list of research databases for medicine and healthcare

    For biomedical research, medicine, and healthcare there are a couple of outstanding academic databases that provide true value in your daily research. Scholarly databases can help you find scientific articles, research papers, conference proceedings, reviews and much more. We have compiled a list of the top 5 research databases with a special ...

  25. Star botanist likely made up data about nutritional supplements, new

    Star botanist likely made up data about nutritional supplements, new probe finds ... In a 43-page letter to UG, the group charged that data essential to the landmark paper and two others were "missing, fraudulent, or plagiarized," and said Newmaster did not disclose financial conflicts of interest. Two of Newmaster's co-authors on ...

  26. Trends in cardiovascular disease incidence among 22 million people in

    Objective To investigate the incidence of cardiovascular disease (CVD) overall and by age, sex, and socioeconomic status, and its variation over time, in the UK during 2000-19. Design Population based study. Setting UK. Participants 1 650 052 individuals registered with a general practice contributing to Clinical Practice Research Datalink and newly diagnosed with at least one CVD from 1 ...

  27. One Cohort at a Time: A New Perspective on the Declining Gender Pay Gap

    This paper studies the interaction between the decrease in the gender pay gap and the stagnation in the careers of younger workers, analyzing data from the United States, Italy, Canada, and the United Kingdom. We propose a model of the labor market in which a larger supply of older workers can crowd out younger workers from top-paying positions.

  28. Burst Image Super-Resolution with Base Frame Selection

    Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst ...

  29. Research: Using AI at Work Makes Us Lonelier and Less Healthy

    Joel Koopman is the TJ Barlow Professor of Business Administration at the Mays Business School of Texas A&M University. His research interests include prosocial behavior, organizational justice ...

  30. ResearchGate

    Access 160+ million publications and connect with 25+ million researchers. Join for free and gain visibility by uploading your research.