
Data-driven hypothesis generation in clinical research: what we learned from a human subject study


Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve it often goes unrecognized. Without an impactful hypothesis, the significance of any research project can be questionable, regardless of the rigor or diligence applied in other steps of the study, e.g., study design, data collection, and result analysis. In this perspective article, the authors first review the literature on scientific thinking, reasoning, medical reasoning, literature-based discovery, and a field study that explored scientific thinking and discovery. Over the years, research on scientific thinking has made substantial progress in cognitive science and its applied areas: education, medicine, and biomedical research. However, a review of the literature reveals a lack of original studies on hypothesis generation in clinical research. The authors then summarize their first human participant study, which explored data-driven hypothesis generation by clinical researchers in a simulated setting. The results indicate that a secondary data analytical tool, VIADS (a visual interactive analytic tool for filtering, summarizing, and visualizing large health data sets coded with hierarchical terminologies), can shorten the time participants need, on average, to generate a hypothesis and also requires fewer cognitive events per hypothesis. As a counterpoint, the hypotheses generated with VIADS received significantly lower ratings for feasibility. Despite its small scale, the study confirmed the feasibility of conducting a human participant study to explore the hypothesis generation process in clinical research directly. It provides supporting evidence for a larger-scale study with a specifically designed tool to facilitate hypothesis generation among inexperienced clinical researchers. A larger study could provide generalizable evidence, which in turn could improve clinical research productivity and the overall clinical research enterprise.



Hypothesis-generating research and predictive medicine

Leslie G. Biesecker

National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

Genomics has profoundly changed biology by scaling data acquisition, which has provided researchers with the opportunity to interrogate biology in novel and creative ways. No longer constrained by low-throughput assays, researchers have developed hypothesis-generating approaches to understand the molecular basis of nature—both normal and pathological. The paradigm of hypothesis-generating research does not replace or undermine hypothesis-testing modes of research; instead, it complements them and has facilitated discoveries that may not have been possible with hypothesis-testing research. The hypothesis-generating mode of research has been primarily practiced in basic science but has recently been extended to clinical-translational work as well. Just as in basic science, this approach to research can facilitate insights into human health and disease mechanisms and provide the crucially needed data set of the full spectrum of genotype–phenotype correlations. Finally, the paradigm of hypothesis-generating research is conceptually similar to the underpinning of predictive genomic medicine, which has the potential to shift medicine from a primarily population- or cohort-based activity to one that instead uses individual susceptibility, prognostic, and pharmacogenetic profiles to maximize the efficacy and minimize the iatrogenic effects of medical interventions.

The goal of this article is to describe how recent technological changes provide opportunities to undertake novel approaches to biomedical research and to practice genomic preventive medicine. Massively parallel sequencing is the primary technology that will be addressed here ( Mardis 2008 ), but the principles apply to many other technologies, such as proteomics, metabolomics, transcriptomics, etc. Readers of this journal are well aware of the precipitous fall of sequencing costs over the last several decades. The consequence of this fall is that we are no longer in a scientific and medical world where the throughput (and the costs) of testing is the key limiting factor around which these enterprises are organized. Once one is released from this limiting factor, one may ask whether these enterprises should be reorganized. Here I outline the principles of how these enterprises are organized, show how high-throughput biology can allow alternative organizations of these enterprises to be considered, and show how biology and medicine are in many ways similar. The discussion includes three categories of enterprises: basic research, clinical research, and medical practice.

The basic science hypothesis-testing paradigm

The classical paradigm for basic biological research has been to develop a specific hypothesis that can be tested by the application of a prospectively defined experiment (see Box 1 ). I suggest that one of the major (although not the only) factors that led to the development of this paradigm is that experimental design was limited by the throughput of available assays. This low throughput mandated that the scientific question had to be focused narrowly to make the question tractable. However, the paradigm can be questioned if the scientist has the ability to assay every potential attribute of a given type (e.g., all genes). If the hypothesis is only needed to select the assay, one does not need a hypothesis to apply a technology that assays all attributes. In the case of sequencing, the radical increase in throughput can release scientists from the constraint of the specific hypothesis because it has allowed them to interrogate essentially all genotypes in a genome in a single assay. This capability facilitates fundamental biological discoveries that were impossible or impractical with a hypothesis-testing mode of scientific inquiry. Examples of this approach are well demonstrated by several discoveries that followed the sequencing of a number of genomes. An example was the discovery that the human gene count was just over 20,000 ( International Human Genome Sequencing Consortium 2004 ), much lower than prior estimates. This result, although it was much debated and anticipated, was not a hypothesis that drove the human genome project, but nonetheless was surprising and led to insights into the nuances of gene regulation and transcriptional isoforms to explain the complexity of the human organism. The availability of whole genome sequence data from multiple species facilitated analyses of conservation. While it was expected that protein-coding regions, and to a lesser extent promoters and 5′- and 3′-untranslated regions of genes, would exhibit recognizable sequence conservation, it was unexpected that an even larger fraction of the genomes outside of genes are highly conserved ( Mouse Genome Sequencing Consortium 2002 ). This surprising and unanticipated discovery has spawned a novel field of scientific inquiry to determine the functional roles of these elements, which are undoubtedly important in physiology and pathophysiology. These discoveries demonstrate the power of hypothesis-generating basic research to illuminate important biological principles.

Box 1. Basic science hypothesis-testing and hypothesis-generating paradigms (table not reproduced here).

Clinical and translational research

The approach to clinical research grew out of the basic science paradigm as described above. The first few steps of selecting a scientific problem and developing a hypothesis are similar, with the additional step ( Box 2 ) of rigorously defining a phenotype and then carefully selecting research participants with and without that trait. As in the basic science paradigm, the hypothesis is tested by the application of a specific assay to the cases and controls. Again, this paradigm has been incredibly fruitful and should not be abandoned, but the hypothesis-generating approach can be used here as well. In this approach, a cohort of participants is consented, basic information is gathered on their health, and then a high-throughput assay, such as genome or exome sequencing, is applied to all of the participants. Again, because the assay tests all such attributes, the research design does not necessitate a priori selections of phenotypes and genes to be interrogated. Then, the researcher can examine the sequence data set for patterns and perturbations, form hypotheses about how such perturbations might affect the phenotype of the participants, and test that hypothesis with a clinical research evaluation. This approach has been used with data from genome-wide copy number assessments (array CGH and SNP arrays), but sequencing takes it to a higher level of interrogation and provides innumerable variants with which to work.

Box 2. Clinical research paradigms (table not reproduced here).

An example of this type of sequence-based hypothesis-generating clinical research started with a collaborative project in which we showed that mutations in the gene ACSF3 caused the biochemical phenotype of combined malonic and methylmalonic acidemia ( Sloan et al. 2011 ). At that time, the disorder was believed to be a classic pediatric, autosomal-recessive severe metabolic disorder with decompensation and sometimes death. We then queried the ClinSeq cohort ( Biesecker et al. 2009 ) to assess the carrier frequency, to estimate the population frequency of this rare disorder. Because ClinSeq is a cohort of adults with a range of atherosclerosis severity, we reasoned that this would serve as a control population for an unbiased estimate of ACSF3 heterozygote mutant alleles. Surprisingly, we identified a ClinSeq participant who was homozygous for one of the mutations identified in the children with the typical phenotype. Indeed, one potential interpretation of the data would be that the variant is, in fact, benign and was erroneously concluded to be pathogenic, based on finding it in a child with the typical phenotype. It has been shown that this error is common, with up to 20% of variants listed in databases as pathogenic actually being benign ( Bell et al. 2011 ). Further clinical research on this participant led to the surprising result that she had severely abnormal blood and urine levels of malonic and methylmalonic acid ( Sloan et al. 2011 ). This novel approach to translational research was a powerful confirmation that the mutation was indeed pathogenic, but there was another, even more important conclusion. We had conceptualized the disease completely incorrectly. Instead of being only a severe, pediatric metabolic disorder, it was instead a disorder with a wide phenotypic spectrum in which one component of the disease is a metabolic perturbation and another component is a susceptibility to severe decompensation and strokes. This research indeed raises many questions about the natural history of the disorder, whether the pediatric decompensation phenotype is attributable to modifiers, what the appropriate management of such an adult would be, etc.

Irrespective of these limitations, the understanding of the disease has markedly advanced, and the key to understanding the broader spectrum of this disease was the hypothesis-generating approach enabled by the massively parallel sequence data and the ability to phenotype patients iteratively from ClinSeq. The iterative phenotyping was essential because we could not have anticipated when the patients were originally ascertained that we would need to assay malonic and methylmalonic acid. Nor did we recognize prospectively that we should be evaluating apparently healthy patients in their seventh decade for this phenotype. Indeed, it is impossible to evaluate patients for all potential phenotypes prospectively, and it is essential to minimize ascertainment bias for patient recruitment in order to allow the discovery of the full spectrum of phenotypes associated with genomic variations. This latter issue has become a critical challenge for implementing predictive medicine, as described below.

Predictive genomic medicine in practice

The principles of scientific inquiry are parallel to the processes of clinical diagnosis ( Box 3 ). In the classic, hypothesis-testing paradigm, clinicians gather background information including the chief complaint,2 medical and family history, and physical examination, and use these data to formulate the differential diagnosis, which is a set of potential medical diagnoses that could explain the patient's signs and symptoms. Then, the clinician selects, among the myriad of available tests (imaging, biochemical, genetic, physiologic, etc.), a few tests whose results should distinguish among (or possibly exclude entirely) the disorders on the differential diagnosis. Like the scientist, the physician must act as a test selector, because each of the tests is low throughput, time consuming, and expensive.

Box 3. Clinical practice paradigms—hypothesis testing and hypothesis generating (table not reproduced here).

As in the basic and translational research discussion above, the question could be raised as to whether the differential diagnostic paradigm is necessary for genetic disorders. Indeed, the availability of clinical genome and exome sequencing heralds an era when the test could be ordered relatively early in the diagnostic process, with the clinician serving in a more interpretative role, rather than as a test selector ( Hennekam and Biesecker 2012 ). This approach has already been adopted for copy number variation, because whole genome array CGH- or SNP-based approaches have mostly displaced more specific single-gene or single-locus assays and standard chromosome analyses ( Miller et al. 2010 ). But the paradigm can be taken beyond hypothesis-generating clinical diagnosis into predictive medicine. One can now begin to envision how whole genome approaches could be used to assess risks prospectively for susceptibility to late-onset disorders or occult or subclinical disorders. The heritable cancer susceptibility syndromes are a good example of this. The current clinical approach is to order a specific gene test if a patient presents with a personal history of an atypical or early-onset form of a specific cancer syndrome, or has a compelling family history of the disease. As in the prior examples, this is because individual cancer gene testing is expensive and low throughput. One can ask whether this is the ideal approach or whether we could be screening for these disorders from genome or exome data. Again, we applied sequencing analysis for these genes to the ClinSeq cohort because the participants were not ascertained for those phenotypes. In a published study of 572 exomes ( Johnston et al. 2012 ), updated here to include 850 exomes, we have identified 10 patients with seven distinct cancer susceptibility syndrome mutations. These were mostly familial breast and ovarian cancer ( BRCA1 and BRCA2 ), with one patient each with paraganglioma and pheochromocytoma ( SDHC ) and one with Lynch syndrome ( MSH6 ). What is remarkable about these diagnoses is that only about half of them had a convincing personal or family history of the disease, and thus most would not have been offered testing under the current, hypothesis-testing clinical paradigm. These data suggest that screening for these disorders using genome or exome sequencing could markedly improve our ability to identify such families before they develop or die from these diseases—the ideal of predictive genomic medicine.

Despite these optimistic scenarios and examples, it remains true that our ability to perform true predictive medicine is limited. These limitations include technical factors such as incomplete sequence coverage, imperfect sequence quality, inadequate knowledge regarding the penetrance and expressivity of most variants, uncertain medical approaches and utility of pursuing variants from genomic sequencing, and the poor preparation of most clinicians for addressing genomic concerns in the clinic ( Biesecker 2013 ). Recognizing all of these limitations, it is clear that we are not prepared to launch broad-scale implementation of predictive genomic medicine, nor should all research be structured using the hypothesis-generating approach.

Hypothesis-testing approaches to science and medicine have served us well and should continue. However, the advent of massively parallel sequencing and other high-throughput technologies provides opportunities to undertake hypothesis-generating approaches to science and medicine, which in turn provide unprecedented opportunities for discovery in the research realm. This can allow the discovery of results that were not anticipated or intended by the research design, yet provide critical insights into biology and pathophysiology. Similarly, hypothesis-generating clinical research has the potential to provide these same insights and, in addition, has the potential to provide us with data that will illuminate the full spectrum of genotype–phenotype correlations, eliminating the biases that have limited this understanding in the past. Finally, applying these principles to clinical medicine can provide new pathways to diagnosis and provide the theoretical basis for predictive medicine that can detect disease susceptibility and allow health to be maintained, instead of solely focusing on the treatment of evident disease.

Article is online at http://www.genome.org/cgi/doi/10.1101/gr.157826.113 .

2 The chief complaint is a brief description of the problem that led the patient to the clinician, such as “I have a cough and fever.”

  • Bell CJ, Dinwiddie DL, Miller NA, Hateley SL, Ganusova EE, Mudge J, Langley RJ, Zhang L, Lee CC, Schilkey FD, et al. 2011. Carrier testing for severe childhood recessive diseases by next-generation sequencing. Sci Transl Med 3: 65ra64.
  • Biesecker LG. 2013. Incidental findings are critical for genomics. Am J Hum Genet 92: 648–651.
  • Biesecker LG, Mullikin JC, Facio FM, Turner C, Cherukuri PF, Blakesley RW, Bouffard GG, Chines PS, Cruz P, Hansen NF, et al. 2009. The ClinSeq Project: Piloting large-scale genome sequencing for research in genomic medicine. Genome Res 19: 1665–1674.
  • Hennekam RC, Biesecker LG. 2012. Next-generation sequencing demands next-generation phenotyping. Hum Mutat 33: 884–886.
  • International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931–945.
  • Johnston JJ, Rubinstein WS, Facio FM, Ng D, Singh LN, Teer JK, Mullikin JC, Biesecker LG. 2012. Secondary variants in individuals undergoing exome sequencing: Screening of 572 individuals identifies high-penetrance mutations in cancer-susceptibility genes. Am J Hum Genet 91: 97–108.
  • Mardis ER. 2008. The impact of next-generation sequencing technology on genetics. Trends Genet 24: 133–141.
  • Miller DT, Adam MP, Aradhya S, Biesecker LG, Brothman AR, Carter NP, Church DM, Crolla JA, Eichler EE, Epstein CJ, et al. 2010. Consensus statement: Chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am J Hum Genet 86: 749–764.
  • Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562.
  • Sloan JL, Johnston JJ, Manoli I, Chandler RJ, Krause C, Carrillo-Carrasco N, Chandrasekaran SD, Sysol JR, O'Brien K, Hauser NS, et al. 2011. Exome sequencing identifies ACSF3 as a cause of combined malonic and methylmalonic aciduria. Nat Genet 43: 883–886.


Hypothesis Generation with Large Language Models

Effective generation of novel hypotheses is instrumental to scientific progress. So far, researchers have been the main powerhouse behind hypothesis generation, through painstaking data analysis and thinking (also known as the Eureka moment). In this paper, we examine the potential of large language models (LLMs) to generate hypotheses. We focus on hypothesis generation based on data (i.e., labeled examples). To enable LLMs to handle arbitrarily long contexts, we generate initial hypotheses from a small number of examples and then update them iteratively to improve their quality. Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process. Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks, improving accuracy by 31.7% on a synthetic dataset and by 13.9%, 3.3%, and 24.9% on three real-world datasets. We also outperform supervised learning by 12.8% and 11.2% on two challenging real-world datasets. Furthermore, we find that the generated hypotheses not only corroborate human-verified theories but also uncover new insights for the tasks.

1 Introduction

Hypothesis generation drives scientific progress. Mendel's hypothesis on allele pairs laid the foundation for modern genetics; Einstein's hypothesis in the general theory of relativity led to the prediction and subsequent confirmation of gravitational waves. In the context of language modeling, the hypothesis of scaling laws has inspired recent progress in large language models (LLMs) (Chowdhery et al., 2022). Despite the importance of hypothesis generation, as Ludwig & Mullainathan (2024) point out, science has been curiously asymmetric. While many scientific publications present extensive formal and empirical evaluation of hypotheses, the generation of those hypotheses happens off-stage, by researchers. In order to generate novel hypotheses, researchers may read the literature, analyze data, pick each other's brains, and even "hallucinate" (see Kekulé's discovery of the structure of the benzene molecule (Rothenberg, 1995)). Given the rise of large language models (Brown et al., 2020; Anthropic, 2023; OpenAI, 2023b), we examine their potential to provide much-needed assistance in hypothesis generation.

In particular, we focus on hypothesis generation based on data, a common approach in the empirical sciences. Our main question is how we can enable LLMs to generate high-quality hypotheses. While one can easily prompt LLMs to generate hypotheses, LLMs may not be able to effectively leverage the input examples in a single long prompt. Moreover, it is important to have measures of quality in the generation process so that we can filter out bad hypotheses and come up with better ones. These two observations motivate us to start with a setup analogous to supervised learning. We can iteratively prompt an LLM to generate hypotheses based on the training examples and use training accuracy as a measure of quality to guide the generation process. Conveniently, we can also evaluate the quality of the final generated hypotheses by their performance on held-out examples, similar to supervised learning.

To generate high-quality hypotheses with LLMs, we propose an algorithm inspired by the upper confidence bound algorithm in multi-armed bandits (Auer, 2002) (HypoGeniC, Hypothesis Generation in Context; see Figure 1). Given initial hypotheses generated from a small number of examples, we need to assess their quality and propose new hypotheses to address their deficiencies. To navigate this exploration-exploitation tradeoff, we introduce a reward function and evaluate the top k hypotheses for each training example. We maintain a wrong example bank to capture the gaps in knowledge of the current hypothesis pool, and we generate new hypotheses based on the wrong example bank to close those gaps.


The generated hypotheses naturally enable an interpretable hypothesis-based classifier. We propose a suite of inference strategies given a set of hypotheses. We apply our method to one synthetic task, where there is a single known valid hypothesis, and three real-world tasks (Deceptive reviews, Headline popularity, and Tweet popularity). The real-world tasks focus on deception detection and message popularity prediction, which are known to be challenging even for humans (Ott et al., 2011; Salganik et al., 2006). Our algorithm can recover the hypothesis in the synthetic task and also provide useful hypotheses for the real-world tasks. In fact, our generated hypotheses consistently outperform few-shot in-context learning baselines across all four tasks (by 31.7% on Shoe sales, 13.9% on Deceptive reviews, 3.3% on Headline popularity, and 24.9% on Tweet popularity). The predictive performance matches and even outperforms oracle supervised learning with RoBERTa, except on Deceptive reviews.

It is important to emphasize that although the utility of hypotheses in assisting downstream classification serves as an indicator of LLMs' ability to generate hypotheses, our primary interest lies in the quality of the hypotheses. Thus, it is critical for the hypotheses to be interpretable beyond the LLM used to produce them. We show that hypotheses generated by one LLM (e.g., GPT-3.5-turbo) can be used to make accurate inferences by another LLM (e.g., Mixtral). On an out-of-distribution dataset for Deceptive reviews, we even outperform the oracle fine-tuned RoBERTa. Such cross-model generalization provides strong evidence that we are able to generate high-quality hypotheses. Furthermore, through a qualitative analysis, our generated hypotheses not only confirm theories from the existing literature but also provide new insights about the task. For instance, one novel hypothesis is that "reviews that mention personal experiences or special occasions, such as birthdays, anniversaries, or weddings, are more likely to be truthful". We encourage future research on deception detection to explore these novel hypotheses.

Our work is connected to many recent studies on using LLMs to propose "hypotheses", notably Qiu et al. (2024) and Zhong et al. (2023). Qiu et al. (2024) is motivated by testing the ability of LLMs to perform human-like inductive reasoning, and Zhong et al. (2023) aims to support open-ended exploration. While similar in spirit, we examine the case of generating theories relating inputs and labels for challenging problems where even researchers may struggle to propose new hypotheses.

To summarize, we highlight the following contributions:

We propose a novel computational framework for generating and evaluating hypotheses with large language models.

Our generated hypotheses enable interpretable hypothesis-based classifiers that outperform in-context learning and even supervised learning for one synthetic and three real-world datasets. These hypotheses are also robust across different LLMs and out-of-distribution datasets.

We find our generated hypotheses to corroborate existing findings while also providing new insights for the tasks.

We begin with a description of the problem formulation. Given a set S = {(x_1, y_1), ..., (x_n, y_n)}, where x_i is an example and y_i is the corresponding label, the goal is to learn a set of hypotheses H = {h_1, ..., h_m} that describe theories of the relationship between x and y. To this end, we prompt an LLM to summarize demonstration examples into high-level hypotheses (§ 2.1). Then, during inference, the LLM makes inferences based on the generated hypotheses (§ 2.2). Our code is available at https://github.com/ChicagoHAI/hypothesis_generation .

2.1 Hypothesis Generation

Our hypothesis generation algorithm (Algorithm 1) is inspired by the upper confidence bound (UCB) algorithm (Auer, 2002). Given a set of initial examples S_init ⊂ S, we first prompt an LLM to generate hypotheses for S_init, which serve as our initial hypothesis bank H. While the initial hypotheses may explain some portion of the data, they often fall short of encompassing the full scope of the examples. We thus introduce an update stage that serves a dual purpose: 1) it increases the percentage of the data explainable by the hypotheses, and 2) it replaces any hypotheses that are found to be inaccurate.

In the update stage, for a training example s, we select the top k high-reward hypotheses from the hypothesis bank H. The LLM is prompted to make a prediction with each of the top k high-reward hypotheses on s. Then we compute the accuracy of the inference and accordingly update the reward for each of the hypotheses. If w_hyp hypotheses predict incorrectly for the example s, then s is added to a wrong example pool W. Once the wrong example pool reaches a maximum size of w_max, the wrong examples in W are used to generate new hypotheses, as shown in Algorithm 1. The wrong example pool represents the gap in knowledge that the current pool of hypotheses has for the dataset. Thus, by generating new hypotheses, the algorithm fills in these gaps. We update H with the newly generated hypotheses according to the rewards.
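To make the update stage concrete, the following Python sketch implements the loop described above under assumed interfaces: generate_hypotheses and predict are hypothetical wrappers around the LLM prompts (not the authors' released code), and the reward follows the UCB form given in the next paragraph.

```python
import math

def ucb_reward(hyp, t, alpha):
    """Accuracy so far plus a UCB-style exploration bonus (see the formula below)."""
    n = len(hyp["examples"])              # |S_i|: examples used to evaluate this hypothesis
    if n == 0:                            # unevaluated hypotheses are explored first
        return float("inf")
    return hyp["correct"] / n + alpha * math.sqrt(math.log(t) / n)

def hypogenic_train(train_set, llm, k=3, w_hyp=2, w_max=10,
                    max_bank_size=20, n_init=5, alpha=0.5):
    """Sketch of the HypoGeniC update loop.

    Assumed (hypothetical) LLM wrappers, not the released implementation:
      llm.generate_hypotheses(examples, n) -> list of hypothesis strings
      llm.predict(hypothesis, x)           -> predicted label
    """
    # Initial hypothesis bank from a handful of examples.
    bank = [{"text": h, "correct": 0, "examples": []}
            for h in llm.generate_hypotheses(train_set[:n_init], n=max_bank_size)]
    wrong_pool = []

    for t, (x, y) in enumerate(train_set, start=1):
        # Evaluate the top-k hypotheses (by reward) on this training example.
        top_k = sorted(bank, key=lambda h: ucb_reward(h, t, alpha), reverse=True)[:k]
        num_wrong = 0
        for hyp in top_k:
            pred = llm.predict(hyp["text"], x)
            hyp["examples"].append((x, y))
            hyp["correct"] += int(pred == y)
            num_wrong += int(pred != y)

        # If at least w_hyp of the top hypotheses miss, record the example as a gap.
        if num_wrong >= w_hyp:
            wrong_pool.append((x, y))

        # Once the wrong-example pool is full, generate new hypotheses from it
        # and keep only the highest-reward hypotheses in the bank.
        if len(wrong_pool) >= w_max:
            bank += [{"text": h, "correct": 0, "examples": []}
                     for h in llm.generate_hypotheses(wrong_pool, n=k)]
            bank = sorted(bank, key=lambda h: ucb_reward(h, t, alpha),
                          reverse=True)[:max_bank_size]
            wrong_pool = []
    return bank
```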

Reward. As mentioned above, each hypothesis has an associated reward. In our algorithm, we use the reward function from the UCB algorithm because of the similarities between the multi-armed bandit problem and our problem formulation. In particular, we consider each hypothesis to be an arm and each training example to be a "pull". We note, however, that unlike the multi-armed bandit problem, multiple hypotheses are tested on a single training example. Moreover, new arms can appear after hypotheses are updated, altering the setting from the standard static-arms scenario to a dynamic-arms scenario. Formally, the reward is defined as
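(reconstructed here in the standard UCB form, consistent with the accuracy and exploration terms described below; the paper's exact formulation may differ slightly)

r_i = (1 / |S_i|) * Σ_{(x, y) ∈ S_i} 1[h_i(x) = y] + α * sqrt(log t / |S_i|)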

where S_i is the set of examples that have been used to evaluate the hypothesis h_i, t is the training time step, and α is a hyperparameter that controls the exploration term. The first term in the reward function denotes the accuracy of the hypothesis on all of S_i. The second term is the exploration term, which is computed based on the number of times the hypothesis has been selected and the number of training examples visited so far. The accuracy term urges the algorithm to use well-performing hypotheses, whereas the exploration term encourages the algorithm to explore hypotheses that have not been selected many times. Thus, the reward function strikes a balance between exploration and exploitation.

For more details on implementation of HypoGeniC , refer to §   B.1 .

2.2 Hypothesis-based Inference

For efficiency purposes, we use each hypothesis on its own during training, without accounting for combinatorial effects among hypotheses; during inference, however, we should leverage the set of hypotheses as a whole, for at least two reasons. First, some hypotheses may only apply to a subset of examples. Second, competing theories may require head-to-head comparisons. Hence, we develop multiple inference strategies to account for these different styles of reasoning (see Appendix A for prompts and § B.2 for implementation details).

Best-accuracy hypothesis. The hypothesis h with the highest accuracy from the hypothesis bank H is included in the prompt to guide the model to perform inference.

Filter and weighted vote. One hypothesis may not be enough to explain the data. Thus, this approach uses a combination of relevant hypotheses to make predictions for a single example. For each example, we first filter hypotheses by prompting an LLM to judge which hypotheses are relevant to the example. Next, an LLM is prompted to generate predictions for each of the relevant hypotheses, and these predictions are aggregated with weighted vote , where the weight is the training accuracy of the corresponding hypothesis.
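As an illustration, a minimal sketch of this strategy is given below; is_relevant and predict are assumed prompt wrappers (hypothetical names, not the released API), and train_acc is the training accuracy recorded for each hypothesis.

```python
from collections import Counter

def filter_and_weighted_vote(x, bank, llm):
    """Combine relevant hypotheses, weighting each prediction by training accuracy.

    `bank` is a list of dicts with keys "text" and "train_acc";
    `llm.is_relevant(hypothesis, x)` and `llm.predict(hypothesis, x)` are
    assumed prompt wrappers, not a specific released API.
    """
    votes = Counter()
    for hyp in bank:
        if not llm.is_relevant(hyp["text"], x):
            continue                          # skip hypotheses judged irrelevant
        pred = llm.predict(hyp["text"], x)
        votes[pred] += hyp["train_acc"]       # weight the vote by training accuracy
    if not votes:                             # no relevant hypothesis: fall back
        best = max(bank, key=lambda h: h["train_acc"])
        return llm.predict(best["text"], x)
    return votes.most_common(1)[0][0]
```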

Single-step adaptive inference. Similar to filter and weighted vote , this approach leverages contextual information to choose hypotheses. The difference, however, is that it selects the most applicable hypothesis for each test example. Specifically, for a given test example, the LLM is tasked with identifying the most applicable hypothesis from a set of options. For each hypothesis, we provide instances from the training set where the hypothesis was accurate. Then, the LLM selects the most relevant hypothesis by comparing the test example to these training examples and evaluating their similarity. Thereafter, we apply the hypothesis to the test example to perform inference. Please note that this is all done in one step with a long prompt.

Two-step adaptive inference. We divide the previous inference strategy into two steps:

The LLM determines the most relevant set of examples by comparing the test example with the corresponding examples of the hypotheses.

Then, the corresponding hypothesis is provided to the LLM, which it uses to perform inference on the test example in a second prompt.

3 Experiment Setup

Next, we introduce the experiment setup to evaluate HypoGeniC .

3.1 Tasks and Datasets

The choice of appropriate tasks is critical for evaluating the ability of LLMs to generate hypotheses. The focus of our work is on generating hypotheses based on observed data. A prerequisite is that potential hypotheses do exist. In the context of classification, this implies that the classification performance is non-trivial. In addition, we need to ensure that the hypotheses describing the data are likely not known a priori by LLMs, which rules out standard tasks such as sentiment analysis. Therefore, we use four datasets that satisfy these requirements: a synthetic task where we know the true hypothesis and three real-world datasets that exhibit complex underlying patterns and constitute widely studied social science problems.

Shoe sales is a synthetic task we created to investigate the scenario where there is only one single valid hypothesis. The task is to predict the color of the shoe that the customer will buy based on their appearance. The input provides appearance features, namely, age, height, gender, color of the hat, color of the shirt, color of the bag, and size of the bag. We construct this dataset such that the color of the shoe must match the color of the shirt. Since there are six colors in total, this becomes a 6-class classification problem.
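A generator along the following lines could produce such a dataset; the specific colors and value ranges are illustrative assumptions rather than the authors' exact choices, but the defining rule (shoe color equals shirt color) matches the description above.

```python
import random

COLORS = ["white", "red", "orange", "green", "blue", "black"]  # six classes (illustrative)

def make_shoe_example(rng=random):
    """One synthetic Shoe sales example: appearance features -> shoe-color label."""
    shirt = rng.choice(COLORS)
    features = {
        "age": rng.randint(18, 75),
        "height_cm": rng.randint(150, 200),
        "gender": rng.choice(["male", "female"]),
        "hat_color": rng.choice(COLORS),
        "shirt_color": shirt,
        "bag_color": rng.choice(COLORS),
        "bag_size": rng.choice(["small", "medium", "large"]),
    }
    return features, shirt  # the true hypothesis: shoe color matches shirt color

train_set = [make_shoe_example() for _ in range(200)]
```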

Deceptive review detection is an instance of deception detection, a widely studied phenomenon in psychology and other social sciences  (Granhag & Vrij, 2005 ) . This particular task ( Deceptive reviews ) requires distinguishing genuine reviews from fictitious ones (Ott et al., 2011 ) , where human performance is about chance  (Lai & Tan, 2019 ) . The dataset includes 800 genuine reviews and 800 fictitious reviews for 20 hotels in Chicago.

Predicting popularity is a notoriously challenging task in the social sciences because popularity is known to be affected by seemingly random factors (Salganik et al., 2006). We use two datasets in this work: Headline popularity and Tweet popularity. Headline popularity is derived from a dataset in the Upworthy Research Archive (Matias et al., 2021). The original dataset was collected through A/B testing, where each user was shown pairs of a headline and image for multiple packages (articles). Each user was exposed to only one of these pairs per package, and the clicks were recorded for each pair per package. (The Upworthy Research Archive only provides the image IDs instead of the graphics, so we only use the headlines for our dataset.) This process resulted in a total of 150,816 headlines across 22,666 packages. We construct a binary classification dataset by choosing for each package the headline that received the most clicks and the one that received the fewest. We remove all sets of duplicate headlines, which results in our version of the Headline popularity dataset. The task for this dataset is to deduce which headline in a pair had more clicks. Tweet popularity uses a dataset of 13,174 tweet pairs (Tan et al., 2014), which are matched by topic and author. Similar to Headline popularity, the task is to predict which tweet received more retweets.

3.2 Baselines, Oracles, and Evaluation Metrics

We use three different LLMs in our experiments (Mixtral (Mistral, 2023 ) , GPT-3.5-turbo (OpenAI, 2023a ) , and Claude-2.1 (Anthropic, 2023 ) ). We compare our approach with the following methods.

Zero-shot and few-shot prompting. We provide LLMs with task-specific instructions (zero-shot), optionally accompanied by three demonstration examples (few-shot).

No updates. To assess the value of the update stage in our algorithm, we evaluate the performance of the initialized hypotheses. In particular, we pick the best-performing hypothesis on the training set and use it for inference on the test set.

Supervised Learning. We fine-tune RoBERTa on each of the datasets to serve as a non-interpretable oracle. We include results for training on 200 examples and on 1000 examples. Since fine-tuning updates model weights, we expect RoBERTa to set the upper bound on in-distribution datasets.
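For reference, a standard fine-tuning recipe of this kind, sketched with Hugging Face Transformers, might look as follows; the hyperparameters are illustrative and not necessarily those used in the paper.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_roberta(train_texts, train_labels, num_labels, epochs=10):
    """Fine-tune a RoBERTa classifier to serve as the non-interpretable oracle."""
    tok = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=num_labels)

    ds = Dataset.from_dict({"text": train_texts, "label": train_labels})
    ds = ds.map(lambda batch: tok(batch["text"], truncation=True,
                                  padding="max_length"), batched=True)

    args = TrainingArguments(output_dir="roberta_oracle",
                             num_train_epochs=epochs,
                             per_device_train_batch_size=16,
                             learning_rate=2e-5,
                             logging_steps=50)
    Trainer(model=model, args=args, train_dataset=ds).train()
    return model, tok
```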

We randomly sample 200 training examples and 300 test examples for each dataset. Since all our datasets are classification tasks with ground truth labels, we use accuracy as our evaluation metric. To understand the effect of the number of training examples, we evaluate the performance of all methods at 10, 25, 50, 100, and 200 training examples. We also experiment with two different hypothesis bank sizes: 3 and 20 hypotheses to evaluate the impact of utilizing a larger number of hypotheses. The detailed hyperparameters of our approach can be found in §   B.3 .
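The evaluation protocol can be summarized by a sweep of the following shape, where evaluate_method is a placeholder for whichever method (HypoGeniC or a baseline) is being scored:

```python
import random

def accuracy(preds, labels):
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

def evaluation_sweep(dataset, evaluate_method, seed=42):
    """Score a method at increasing training sizes on a fixed held-out test set.

    `dataset` is a list of (x, y) pairs; `evaluate_method(train_examples, test_inputs)`
    is a placeholder returning predicted labels for the test inputs.
    """
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    train_pool, test_set = shuffled[:200], shuffled[200:500]  # 200 train / 300 test
    test_x = [x for x, _ in test_set]
    test_y = [y for _, y in test_set]
    return {n: accuracy(evaluate_method(train_pool[:n], test_x), test_y)
            for n in (10, 25, 50, 100, 200)}
```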

To demonstrate the effectiveness of our hypothesis generation approach, we present results via three evaluation methods. First, we show that in the standard supervised learning setup, our generated hypotheses enable more accurate predictions than baselines, and even than oracles when using a small set of examples. Second, we evaluate the generated hypotheses by checking whether they generalize across different inference LLMs and to out-of-distribution datasets. We find surprisingly consistent performance even when using a different LLM to make inferences from the generated hypotheses. Finally, we conduct a qualitative analysis to show that the generated hypotheses not only corroborate existing theories but also provide novel insights about the tasks at hand.

4.1 Performance on Heldout Test Sets

As discussed in the introduction, a side product of our approach is an interpretable hypothesis-based classifier. We compare its performance with standard supervised learning with fine-tuned RoBERTa and few-shot prompt learning ( Table   1 ).

Our generated hypotheses improve inference over standard zero-shot and few-shot inference. Across all LLMs, HypoGeniC outperforms zero-shot learning by an average of 60% on Shoe sales, 22.7% on Deceptive reviews, 5.1% on Headline popularity, and 30.6% on Tweet popularity. Similarly, we find that HypoGeniC shows an increase over few-shot learning of 31.7% on Shoe sales, 13.9% on Deceptive reviews, 3.3% on Headline popularity, and 24.9% on Tweet popularity. Note that these results are inflated on Tweet popularity because safety mode is triggered for Mixtral and Claude-2.1 in zero-shot and few-shot learning, respectively. The results show that hypothesis-based inference can increase the performance of LLMs significantly. One exception is that our method performs slightly worse (by 1%) than the few-shot baseline on Tweet popularity with GPT-3.5-turbo. One possible reason is that the few-shot demonstrations are effective at eliciting the pretraining knowledge in GPT-3.5-turbo, possibly due to the large amount of tweets in the pretraining data. For more detailed results, refer to Appendix C.

We also evaluate generated hypotheses with oracle inference, where the model retrospectively picks the best hypothesis for each prediction from the bank. With oracle inference, HypoGeniC achieves on average 88.6% on Deceptive reviews, 84.1% on Headline popularity, and 88% on Tweet popularity across all LLMs, which is superior to the results in Table 1. This result further suggests that hypotheses generated by HypoGeniC are of high quality and can lead to accurate predictions when the correct hypothesis is selected.

HypoGeniC matches or even exceeds RoBERTa with the same number of training examples on most datasets. Both HypoGeniC and RoBERTa yield 100% on the synthetic dataset. Moreover, HypoGeniC is 12.8% and 11.2% better than RoBERTa on Headline popularity and Tweet popularity, respectively, with 200 training examples. Since RoBERTa learns by updating model weights to minimize the cross-entropy loss, it tends to benefit from more training examples, so we increase the training examples to 1000 for RoBERTa. Despite the accuracy boost from more training examples, we find that HypoGeniC's best result still outperforms RoBERTa by 3.7% on Headline popularity and 0.7% on Tweet popularity. One exception, however, is the Deceptive reviews dataset. We suspect that because word-level features are very useful in this dataset (Ott et al., 2011), they could be tougher for LLMs to extract but easier for fine-tuned models to grasp.

Updating the hypothesis bank leads to hypotheses of higher quality. Comparing HypoGeniC with the "no updates" results, we find that updating hypotheses generally leads to better hypotheses, suggesting that our algorithm is effective at improving hypothesis quality. The improvement is on average 0.7% on Shoe sales, 5.8% on Deceptive reviews, 8.1% on Headline popularity, and 7% on Tweet popularity. Another advantage of HypoGeniC over "no updates" is that the training examples sometimes exceed the context window size of LLMs, which can lead to degraded performance (Figures 3 and 4).

Effect of inference strategy. Figure 2 shows HypoGeniC results with different inference strategies on Deceptive reviews. Single-step adaptive inference is the most effective. Generally, we find the hypotheses to be one-sided, focusing on characteristics of either truthful or deceptive reviews. We thus need to consider more than one hypothesis to make a correct prediction, so best-accuracy hypothesis or two-step adaptive inference would not be ideal for this dataset. On the other datasets, we find that the effect of inference strategy is much smaller. Best-accuracy hypothesis is sufficient for Shoe sales and Headline popularity, and filter and weighted vote works best for Tweet popularity. Results for all datasets are in § C.1. Whichever inference strategy we use, the trend of HypoGeniC against few-shot learning and RoBERTa remains largely the same.

Generally, having more training examples and a larger hypothesis pool improves performance. We show performance for different methods as the number of training examples increases in Figures 3, 4, and 5. We find that HypoGeniC accuracy steadily increases with training size on Shoe sales, suggesting that an LLM is more likely to generate the best hypothesis given more examples. For the real-world datasets, however, the performance sometimes peaks at a training size of 25 or 100 before reaching 200. We suspect that the evaluation of the hypothesis bank is less stable for the real-world datasets, since more than one correct hypothesis is needed for the task. We also find that using a hypothesis pool of size 20 leads to better performance than using a pool of size 3.


4.2 Generalization of the Generated Hypotheses

Our primary interest lies in the quality of the hypotheses. A good hypothesis should enable accurate inference by any AI model, or even a human, and should also generalize to unseen out-of-distribution datasets. In this subsection, we mix and match different LLMs for generation and inference. We also evaluate the hypotheses for deceptive review prediction on a new out-of-distribution (OOD) dataset (Li et al., 2013).

We find that the hypotheses generated by HypoGeniC generalize across models (Table 2). Generally, we find Claude-2.1 and Mixtral to be better at inference. Thus, substituting them as the inference model leads to better performance for hypotheses generated with GPT-3.5-turbo. Substituting Claude-2.1 and Mixtral as each other's inference model leads to small changes in performance. On Shoe sales, the performance remains high (>90%) regardless of the inference model used.

Performance even increases for Deceptive reviews and Headline popularity when using Claude-2.1 as the inference model. For the cases where performance drops from Claude-2.1 to Mixtral, the decrease is marginal: 2.3% on Deceptive reviews and 2.7% on Tweet popularity.

These results suggest that the hypotheses generated by HypoGeniC are generalizable across different LLMs, which somewhat contradicts the claim in Qiu et al. (2024) that LLMs cannot reliably interpret the hypotheses. We suspect that the reason may be that our tasks rely only on natural language, while their tasks rely on notions of worlds and can be fed into symbolic interpreters.

Our generated hypotheses generalize to an out-of-distribution dataset.

Table 3 presents an overview of the results on the OOD deceptive review dataset. This dataset differs from Deceptive reviews by including reviews from four cities sourced from different websites (Li et al., 2013). We find that HypoGeniC outperforms few-shot learning by an average of 19.1%. Despite the distribution shift, HypoGeniC surprisingly increases accuracy over Deceptive reviews by an average of 3.3%, suggesting that our hypotheses generalize well to this OOD dataset. Claude-2.1 remains the best-performing model. In comparison, the performance of RoBERTa drops by 11%. As a result, HypoGeniC with Claude-2.1 outperforms RoBERTa by 1.7%, demonstrating the robustness of hypothesis-based inference. Refer to § C.3 for more details.

4.3 Qualitative Analysis

For the synthetic dataset, all models are able to find the true underlying hypothesis for Shoe sales : “customers tend to buy shoes that match the color of their shirt.” For the real-world datasets, we compare our hypotheses with findings from the literature. We confirm the validity of some of our hypotheses and discover new insights about the tasks that previous studies did not touch upon. We show a few examples in Table   4 , and the full list of hypotheses can be found in Appendix   D .

Our hypotheses confirm useful features in the existing literature. For Deceptive reviews, we find that deceptive reviews are more likely to be emotional, use superlatives, or contain information that could not have been directly experienced. Similar findings have been reported in previous studies on Deceptive reviews (Lai et al., 2020; Anderson & Simester, 2014; Ott et al., 2011; Li et al., 2014). For Tweet popularity, we discover that tweets that are short and concise, with specific or relevant hashtags, or with emotional tones are more likely to be retweeted, aligning with prior studies (Tan et al., 2014; Gligorić et al., 2019). For Headline popularity, we find that revealing something new or using vivid language and imagery can drive readers to click on headlines. Previous studies also find that these rules apply to online news headlines (Banerjee & Urminsky, 2021; Sadoski et al., 2000).

We also discover new insights with our generated hypotheses. For the Deceptive reviews dataset, truthful reviews may mention the reviewer's purpose for staying at the hotel (e.g., business trip, vacation), whereas deceptive ones tend not to include this information. For Headline popularity, we find that headlines that frame the content in a personal or relatable way are clicked more. For Tweet popularity, tweets that mention influential individuals or organizations are more likely to be retweeted.

Intriguingly, one of our hypotheses contradicts a feature engineering result. Ott et al. ( 2011 ) find that the token “future” is associated with deceptive reviews, while one of our hypotheses says that mentions of “past experiences or future travel plans” are indicative of truthfulness. This discrepancy is interesting, because the context for the token “future” is unclear. It could be mentioned in the context of future plans but could also be mentioned as a complaint about “never going to stay at the hotel in the future.” Feature engineering is limited due to the contextual ambiguity, whereas our generated hypotheses and their interpretation by LLMs overcome such limitations.

Our automatic evaluation of hypothesis quality also reflects negative findings. Given mixed evidence from the previous literature on the effect of "reading ease" on headline clicks, Banerjee & Urminsky (2021) find, through careful feature engineering, that reading ease negatively impacts click-through rates in Headline popularity. Consistent with this result, we found that hypotheses claiming that "straightforward" and "clear" writing is indicative of higher click-through rates have relatively lower accuracies during training.

5 Additional Related Work

Concept/pattern discovery. In addition to Qiu et al. (2024) and Zhong et al. (2023), discussed in the introduction, other studies have worked along similar lines (Wang et al., 2023b; Singh et al., 2023; Piriyakulkij & Ellis, 2024). For example, similar to Qiu et al. (2024), Tenenbaum et al. (2011) is motivated by human inductive reasoning and examines concept induction in synthetic settings. Ellis et al. (2020) further learn to represent concepts as programs. Similar to Zhong et al. (2023), Pham et al. (2024) generate and refine a list of topics to achieve interpretable topic modeling for open-ended exploration. Relatedly, Honovich et al. (2022) explore the deduction of task descriptions from examples. Our work, in contrast, focuses on generating hypotheses relating inputs and labels for challenging real-world tasks and uses a UCB-style reward to propose novel algorithms.

Reasoning with LLMs. Although it is not our primary goal, our results show that hypothesis-based classifiers can outperform few-shot prompting. As hypotheses may be viewed as a form of reasoning, our work is related to reasoning with LLMs (Wei et al., 2022; Wang et al., 2023a, i.a.). In particular, our work differs from chain-of-thought reasoning because no predefined reasoning structure is available. Moreover, an important distinction between reasoning and hypothesis generation is that the former leverages established reasoning, while the latter requires both proposing and verifying hypotheses, with the goal of discovering unknown knowledge.

LLMs for (social) sciences. Increasing attention has been brought to the use of LLMs in social science research  (Ziems et al., 2024 ; Kim & Lee, 2023 , i.a.) . Our experiments demonstrate the potential of LLMs in generating hypotheses for social science research to discover unknown knowledge in the data. Furthermore, our approach can be extended to natural sciences for general scientific discovery.

6 Conclusion

In this work, we propose HypoGeniC, a novel method that leverages LLMs to generate hypotheses with the goal of discovering unknown knowledge. With HypoGeniC, we are not only able to generate human-interpretable hypotheses but also achieve better predictive performance than competitive baselines and even oracles. Furthermore, our method generalizes well across different models and datasets, including open models. Notably, with our generated hypotheses, we uncover new insights in real-world tasks that are widely studied in the social sciences. HypoGeniC can be directly applied to natural-language-related tasks in the social sciences. We encourage future work to explore hypothesis generation that requires additional modalities and/or leverages existing literature.

7 Limitations

In this paper, we aim to provide a robust framework for hypothesis generation, as opposed to focusing on the optimization of results. Thus, we did not perform an extensive hyperparameter search over the generation portion of HypoGeniC. We did not adjust the value of k, which determines H_top in Algorithm 1, in order to maintain efficiency. Additionally, we only considered hypothesis bank sizes of 3 and 20, i.e., an extremely small bank and a large one; the ideal hypothesis bank size may require further investigation. Finally, we only tested a wrong example bank size of w_max = 10, to strike a balance between context window sizes and the generation of good-quality hypotheses. We believe that a more thorough hyperparameter search could improve the performance of our methodology.

Additionally, HypoGeniC has high latency, specifically when using inference methods that require multiple prompts. For example, the filter and weighted vote inference policy requires iterating through the top hypotheses to determine relevance and then performing inference with each relevant hypothesis. For single-step adaptive inference and the best-accuracy hypothesis, however, HypoGeniC is efficient. We also note that the main bottleneck in inference lies in the API calls to the LLM, which is a limitation not directly tied to HypoGeniC. Given that we request reasoning for all inference prompts, the procedure can be time-consuming.

  • Anderson & Simester (2014) Anderson, E. T. and Simester, D. I. Reviews without a purchase: Low ratings, loyal customers, and deception. Journal of Marketing Research , 51(3):249–269, 2014.
  • Anthropic (2023) Anthropic. Claude 2 , 2023.
  • Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research , 3(Nov):397–422, 2002.
  • Banerjee & Urminsky (2021) Banerjee, A. and Urminsky, O. The language that drives engagement: A systematic large-scale analysis of headline experiments. Social Science Research Network , 2021.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners . In Proceedings of NeurIPS , volume 33, pp.  1877–1901, 2020.
  • Chowdhery et al. (2022) Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, N. PaLM: Scaling language modeling with pathways , April 2022.
  • Ellis et al. (2020) Ellis, K., Wong, C., Nye, M., Sablé-Meyer, M., Cary, L., Morales, L., Hewitt, L., Solar-Lezama, A., and Tenenbaum, J. B. DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning . Philosophical Transactions of the Royal Society A , 381, 2020.
  • Gligorić et al. (2019) Gligorić, K., Anderson, A., and West, R. Causal effects of brevity on style and success in social media. In Proceedings of ACM HCI , 2019.
  • Granhag & Vrij (2005) Granhag, P. A. and Vrij, A. Deception detection. Psychology and law: An empirical perspective , pp.  43–92, 2005.
  • Honovich et al. (2022) Honovich, O., Shaham, U., Bowman, S. R., and Levy, O. Instruction induction: From few examples to natural language task descriptions . In Proceedings of ACL , 2022.
  • Kim & Lee (2023) Kim, J. and Lee, B. AI-augmented surveys: Leveraging large language models and surveys for opinion prediction , 2023.
  • Lai & Tan (2019) Lai, V. and Tan, C. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of FAccT , 2019.
  • Lai et al. (2020) Lai, V., Liu, H., and Tan, C. "Why is ’Chicago’ deceptive?" Towards building model-driven tutorials for humans . In Proceedings of CHI , 2020.
  • Li et al. (2013) Li, J., Ott, M., and Cardie, C. Identifying manipulated offerings on review portals . In Proceedings of EMNLP , 2013.
  • Li et al. (2014) Li, J., Ott, M., Cardie, C., and Hovy, E. Towards a general rule for identifying deceptive opinion spam . In Proceedings of ACL , pp.  1566–1576, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
  • Ludwig & Mullainathan (2024) Ludwig, J. and Mullainathan, S. Machine learning as a tool for hypothesis generation. The Quarterly Journal of Economics, pp. qjad055, January 2024. ISSN 0033-5533.
  • Matias et al. (2021) Matias, J. N., Munger, K., Quere, M. A. L., and Ebersole, C. R. The upworthy research archive, a time series of 32,487 experiments in U.S. media . Scientific Data , 8, 2021.
  • Mistral (2023) Mistral. Mixtral of experts , 2023.
  • OpenAI (2023a) OpenAI. ChatGPT, 2023a.
  • OpenAI (2023b) OpenAI. GPT-4 technical report, 2023b.
  • Ott et al. (2011) Ott, M., Choi, Y., Cardie, C., and Hancock, J. T. Finding deceptive opinion spam by any stretch of the imagination . In Proceedings of ACL , 2011.
  • Pham et al. (2024) Pham, C. M., Hoyle, A., Sun, S., and Iyyer, M. TopicGPT: A prompt-based topic modeling framework. In Proceedings of NAACL, 2024.
  • Piriyakulkij & Ellis (2024) Piriyakulkij, T. and Ellis, K. Doing experiments and revising rules with natural language and probabilistic reasoning , 2024.
  • Qiu et al. (2024) Qiu, L., Jiang, L., Lu, X., Sclar, M., Pyatkin, V., Bhagavatula, C., Wang, B., Kim, Y., Choi, Y., Dziri, N., and Ren, X. Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement . In Proceedings of ICLR , 2024.
  • Rothenberg (1995) Rothenberg, A. Creative cognitive processes in Kekulé's discovery of the structure of the benzene molecule. The American Journal of Psychology, pp. 419–438, 1995.
  • Sadoski et al. (2000) Sadoski, M., Goetz, E. T., and Rodriguez, M. Engaging texts: Effects of concreteness on comprehensibility, interest, and recall in four text types. Journal of Educational Psychology , 92(1):85, 2000.
  • Salganik et al. (2006) Salganik, M. J., Dodds, P. S., and Watts, D. J. Experimental study of inequality and unpredictability in an artificial cultural market. Science , 311(5762):854–856, 2006.
  • Singh et al. (2023) Singh, C., Morris, J. X., Aneja, J., Rush, A. M., and Gao, J. Explaining patterns in data with language models via interpretable autoprompting , 2023.
  • Tan et al. (2014) Tan, C., Lee, L., and Pang, B. The effect of wording on message propagation: Topic- and author-controlled natural experiments on twitter. In Proceedings of ACL , 2014.
  • Tenenbaum et al. (2011) Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. How to grow a mind: Statistics, structure, and abstraction . Science , 331:1279 – 1285, 2011.
  • Wang et al. (2023a) Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models . In Proceedings of ICLR , 2023a.
  • Wang et al. (2023b) Wang, Z., Shang, J., and Zhong, R. Goal-driven explainable clustering via language descriptions . In Proceedings of EMNLP , 2023b.
  • Wei et al. (2022) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E. H., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models . In Proceedings of NeurIPS , 2022.
  • Zhong et al. (2023) Zhong, R., Zhang, P., Li, S., Ahn, J., Klein, D., and Steinhardt, J. Goal driven discovery of distributional differences via language descriptions . In Proceedings of NeurIPS , 2023.
  • Ziems et al. (2024) Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., and Yang, D. Can large language models transform computational social science? Computational Linguistics , pp.  1–55, 2024.

Appendix A Prompts

We follow the general prompt engineering guide from Claude (Anthropic, 2023) to craft the prompts. Specifically, for all the prompts we use with LLMs, we split them into an instruction prompt and a user prompt. In the instruction prompt, we first set a tone and context, followed by an explicit task description, and then specify the answer format. The user prompt then includes useful information such as past examples and learned hypotheses. At the end of the user prompt, we ask the LLM to make a prediction. At generation time, we pass the instruction prompt to the LLM as the system prompt, wrapped by the corresponding system prompt tokens for each model. Below are some example templates for the prompts associated with each task.

A.1 Shoe Sales

A.2 Deceptive Reviews

A.3 Headlines with More Clicks

A.4 Retweeted More

Appendix B Implementation Details

B.1 HypoGeniC Implementation

When initializing the rewards of newly generated hypotheses, we use the examples in the wrong example bank. Given that we work in a low-data regime, for hypotheses generated near the end of the training loop, the accuracies of hypotheses are likely to be biased. To counter this, we also allow the hypotheses to use the initial examples $\mathcal{S}_{\mathrm{init}}$ when initializing rewards. By initializing the reward with more examples, the accuracy estimate lies closer to its true value, allowing for a fair comparison between earlier generated hypotheses and newer ones.
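As a minimal illustration, the sketch below shows how such reward initialization might look in Python, assuming the UCB-style reward mentioned in the related-work discussion (training accuracy plus an exploration bonus scaled by the reward coefficient $\alpha$). The function names, the exact bonus form, and the `predict` callable are illustrative assumptions, not the authors' implementation.

```python
import math

def ucb_reward(correct, total, t, alpha=0.5):
    """UCB-style reward: training accuracy plus an exploration bonus.

    alpha matches the reward coefficient reported in Appendix B.3; the exact
    form of the bonus is an assumption based on standard UCB, not the paper's code.
    """
    accuracy = correct / total if total else 0.0
    bonus = alpha * math.sqrt(math.log(max(t, 1)) / max(total, 1))
    return accuracy + bonus

def init_reward(hypothesis, wrong_examples, init_examples, t, predict, alpha=0.5):
    """Initialize a new hypothesis's reward on the wrong-example bank plus the
    initial examples S_init, so that its accuracy estimate is less biased.

    predict(hypothesis, x) is a placeholder for the LLM call that applies the
    hypothesis to an input and returns a label.
    """
    examples = list(wrong_examples) + list(init_examples)   # (x, y) pairs
    correct = sum(predict(hypothesis, x) == y for x, y in examples)
    return ucb_reward(correct, len(examples), t, alpha)
```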

Dynamic hypotheses update

In Algorithm 1, we show how we generate and update the hypothesis pool $\mathcal{H}$. In particular, we add an example $s$ to the wrong example bank $\mathcal{W}$ if the number of hypotheses that incorrectly predict $s$ is greater than $w_{hyp}$. In our implementation, we use a linearly increasing $w_{hyp}$ as training time $t$ increases. This allows our algorithm to update the hypotheses more frequently early in training and less frequently at the end.
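A minimal sketch of this update rule, assuming a linear schedule for $w_{hyp}$; the schedule endpoints and the eviction policy for the bank are illustrative assumptions (only $w_{max}=10$ comes from Appendix B.3).

```python
def w_hyp(t, total_steps, w_start=1.0, w_end=5.0):
    """Linearly increasing threshold on how many hypotheses must mispredict an
    example before it enters the wrong-example bank. The endpoints are illustrative."""
    frac = t / max(total_steps - 1, 1)
    return w_start + frac * (w_end - w_start)

def maybe_add_to_wrong_bank(example, num_wrong, wrong_bank, t, total_steps, w_max=10):
    """Add `example` to the wrong-example bank W when more than w_hyp(t) hypotheses
    got it wrong, keeping at most w_max examples (w_max = 10 in Appendix B.3)."""
    if num_wrong > w_hyp(t, total_steps):
        wrong_bank.append(example)
        if len(wrong_bank) > w_max:
            wrong_bank.pop(0)   # drop the oldest example; the eviction policy is assumed
    return wrong_bank
```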

B.2 Inference method implementations

Filter and weighted vote.

To filter the hypotheses, we iterate through the top $k$ hypotheses ranked by reward. For each hypothesis, we ask the LLM whether it is relevant. Then, for each relevant hypothesis, the LLM is prompted to use that hypothesis to make a prediction. Finally, for each predicted label, we add up the accuracy scores of the hypotheses that output that label. The final label is the one with the highest total accuracy score.
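The procedure can be summarized in a short sketch; `is_relevant` and `predict_with` are hypothetical stand-ins for the two LLM calls described above, and `accuracies` maps each hypothesis to its training accuracy.

```python
from collections import defaultdict

def filter_and_weighted_vote(x, top_hypotheses, accuracies, is_relevant, predict_with):
    """Filter-and-weighted-vote inference, sketched under the assumptions above."""
    votes = defaultdict(float)
    for h in top_hypotheses:
        if not is_relevant(h, x):       # LLM call 1: is this hypothesis relevant to x?
            continue
        label = predict_with(h, x)      # LLM call 2: predict the label using h
        votes[label] += accuracies[h]   # weight the vote by training accuracy
    if not votes:
        return None                     # no relevant hypothesis; fallback is unspecified here
    return max(votes, key=votes.get)    # label with the highest total accuracy score
```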

One-step adaptive and two-step adaptive inference

The detailed framework of our adaptive inference methods is split into two parts: hypothesis pruning and hypothesis selection. When we have a large number of hypotheses, it is likely that some hypotheses in $\mathcal{H}$ overlap or are paraphrases of each other.

We address this issue with the following procedure (a code sketch follows the list):

During training, we record the examples that each hypothesis correctly predicts.

Then we create one-hot encodings for each hypothesis, where the $i$-th element of the encoding is 1 if the hypothesis correctly predicts the $i$-th example, and 0 otherwise. We subsequently compute a similarity matrix between each pair of hypotheses by taking the pairwise cosine similarities.

Lastly, we create a linear program with the objective of maximizing the sum of accuracies of the selected hypotheses, subject to the constraint that every pair of selected hypotheses has a similarity score below a predefined threshold $\gamma$.
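As a rough illustration of the pruning step, the sketch below builds the cosine-similarity matrix from the per-example correctness vectors and then selects hypotheses greedily by accuracy while enforcing the similarity threshold $\gamma$. The greedy selection is a simplification substituted for readability; the paper instead solves the linear program described above.

```python
import numpy as np

def prune_hypotheses(correct_matrix, accuracies, gamma):
    """Greedy stand-in for the LP-based pruning.

    correct_matrix: (n_hypotheses x n_examples) 0/1 array; row i marks the training
    examples hypothesis i predicted correctly. Hypotheses are kept in order of
    decreasing accuracy, skipping any whose cosine similarity to an already-kept
    hypothesis is gamma or higher.
    """
    norms = np.linalg.norm(correct_matrix, axis=1, keepdims=True)
    unit = correct_matrix / np.clip(norms, 1e-9, None)
    sim = unit @ unit.T                                  # pairwise cosine similarities
    kept = []
    for i in np.argsort(-np.asarray(accuracies)):        # most accurate first
        if all(sim[i, j] < gamma for j in kept):
            kept.append(int(i))
    return kept                                          # indices of selected hypotheses
```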

After pruning the set of hypotheses, we prompt the LLM to pick one hypothesis for its final prediction, as described in §2.2. For the single-step adaptive inference, we ask the LLM to select a hypothesis and make a prediction in one prompt. On the other hand, with the two-step adaptive inference, we first prompt the LLM to select a hypothesis and then prompt the LLM again to make a prediction based on the selected hypothesis.

B.3 Hyperparameters

For the training stage, we set a limit on the hypothesis bank size, experimenting with sizes $H=3$ and $H=20$ to determine the impact of utilizing a larger number of hypotheses. Throughout all the experiments, we use the reward coefficient $\alpha=0.5$, $w_{max}=10$, and $\mathtt{num\_init}=10$, and we have two different sets of the remaining hyperparameters for hypothesis bank sizes of 3 and 20.

With $H=3$, we use $k=2$ and generate 1 hypothesis per update. For inference, we employ all 3 hypotheses for filter and weighted vote. For single-step and two-step adaptive inference, we use all 3 hypotheses with $\gamma=0.3$ and provide 5 examples for each hypothesis.

In the case of $H=20$, we use $k=10$ and generate 5 hypotheses per update. Then we take the top 5 hypotheses, ranked by their training accuracies, for filter and weighted vote. For single-step and two-step adaptive inference, we use the top 5 hypotheses with $\gamma=0.7$ and provide 5 examples each.
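For reference, the two settings can be collected into a single configuration sketch; the key names are ours, while the values are the ones reported in this subsection.

```python
# Hyperparameter settings from Appendix B.3, restated as plain dicts.
# Key names are our own; only the values come from the paper.
COMMON = {"alpha": 0.5, "w_max": 10, "num_init": 10}

CONFIGS = {
    "small_bank": {
        **COMMON,
        "H": 3,                        # hypothesis bank size
        "k": 2,                        # size of H_top during training
        "hypotheses_per_update": 1,
        "inference_top_hypotheses": 3, # used for filter and weighted vote
        "gamma": 0.3,                  # similarity threshold for adaptive inference
        "examples_per_hypothesis": 5,
    },
    "large_bank": {
        **COMMON,
        "H": 20,
        "k": 10,
        "hypotheses_per_update": 5,
        "inference_top_hypotheses": 5,
        "gamma": 0.7,
        "examples_per_hypothesis": 5,
    },
}
```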

Appendix C Detailed Results

C.1 HypoGeniC performance across inference strategies

[Figure/table omitted: best results for each inference strategy, dataset, and hyperparameter configuration.]

The figure above presents the best results for all of our inference strategies, considering every dataset and all hyperparameter configurations.

For Shoe sales, we observe that all the models perform effectively when using the best-accuracy hypothesis inference strategy. Surprisingly, Mixtral is unable to perform perfectly: despite generating a hypothesis that fully describes the data, Mixtral opts not to apply it, preferring to choose a random label for the sake of “variety”. Both GPT-3.5-turbo and Mixtral display similar patterns across the inference strategies, with best-accuracy hypothesis, filter and weighted vote, and two-step adaptive inference all performing comparably. However, for all models we find that single-step adaptive inference drops in accuracy. Given that two-step adaptive inference performs well, the long prompt likely makes it difficult for the model to choose the correct hypothesis. For Claude-2.1, we see that filter and weighted vote drops in performance. As this method searches for relevant hypotheses, the model likely deems inaccurate patterns relevant, and these end up outweighing the inference of the best hypothesis.

For Deceptive reviews, Claude-2.1 is the best performing model across all inference policies. Across the models, we highlight that the single-step adaptive inference method works best for this dataset. In this inference method, the prompt explicitly includes the aim of determining whether a review is deceptive. This likely helps the model use the provided context to better decide which set of examples most resembles the test example. Hence, splitting up the prompt may have caused performance to suffer.

We find the headlines dataset to be the most challenging. As mentioned in §3.1, the original dataset was created with images and headlines paired together. In our version of the dataset, we only use the headlines, so we are missing a crucial variable that contributes to understanding click behavior. Therefore, based only on headlines, it is difficult to generate hypotheses that truly capture the data. Despite this challenge, we note that our hypotheses can still capture a large portion of the data, with 63.7% being our highest accuracy. Specifically, we find that the best-accuracy hypothesis strategy performs best. We also note that filter and weighted vote can provide strong performance, as in the case of Claude-2.1 and GPT-3.5-turbo, suggesting that hypotheses corroborating each other can lead to better performance. We observe that GPT-3.5-turbo is the best performing model here, with all inference policies (aside from single-step adaptive) achieving high accuracy.

Finally, over the Tweet popularity dataset, we find that the filter and weighted vote is the best choice for inference policy, with it being the best inference method for GPT-3.5-turbo and Mixtral. This indicates that using hypotheses in conjunction is useful as multiple variables together adeptly characterize the dataset. The performance of the rest of the inference policies has no clear pattern over this dataset.

Machine Learning as a Tool for Hypothesis Generation

While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a systematic procedure to generate novel hypotheses about human behavior, which uses the capacity of machine learning algorithms to notice patterns people might not. We illustrate the procedure with a concrete application: judge decisions about who to jail. We begin with a striking fact: The defendant’s face alone matters greatly for the judge’s jailing decision. In fact, an algorithm given only the pixels in the defendant’s mugshot accounts for up to half of the predictable variation. We develop a procedure that allows human subjects to interact with this black-box algorithm to produce hypotheses about what in the face influences judge decisions. The procedure generates hypotheses that are both interpretable and novel: They are not explained by demographics (e.g. race) or existing psychology research; nor are they already known (even if tacitly) to people or even experts. Though these results are specific, our procedure is general. It provides a way to produce novel, interpretable hypotheses from any high-dimensional dataset (e.g. cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series). A central tenet of our paper is that hypothesis generation is in and of itself a valuable activity, and we hope this encourages future work in this largely “pre-scientific” stage of science.

This is a revised version of Chicago Booth working paper 22-15 “Algorithmic Behavioral Science: Machine Learning as a Tool for Scientific Discovery.” We gratefully acknowledge support from the Alfred P. Sloan Foundation, Emmanuel Roman, and the Center for Applied Artificial Intelligence at the University of Chicago. For valuable comments we thank Andrei Shleifer, Larry Katz and five anonymous referees, as well as Marianne Bertrand, Jesse Bruhn, Steven Durlauf, Joel Ferguson, Emma Harrington, Supreet Kaur, Matteo Magnaricotte, Dev Patel, Betsy Levy Paluck, Roberto Rocha, Evan Rose, Suproteem Sarkar, Josh Schwartzstein, Nick Swanson, Nadav Tadelis, Richard Thaler, Alex Todorov, Jenny Wang and Heather Yang, as well as seminar participants at Bocconi, Brown, Columbia, ETH Zurich, Harvard, MIT, Stanford, the University of California Berkeley, the University of Chicago, the University of Pennsylvania, the 2022 Behavioral Economics Annual Meetings and the 2022 NBER summer institute. For invaluable assistance with the data and analysis we thank Cecilia Cook, Logan Crowl, Arshia Elyaderani, and especially Jonas Knecht and James Ross. This research was reviewed by the University of Chicago Social and Behavioral Sciences Institutional Review Board (IRB20-0917) and deemed exempt because the project relies on secondary analysis of public data sources. All opinions and any errors are of course our own. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.


Published Versions

Jens Ludwig & Sendhil Mullainathan, 2024. "Machine Learning as a Tool for Hypothesis Generation," The Quarterly Journal of Economics, vol 139(2), pages 751-827.


Demystifying Hypothesis Generation: A Guide to AI-Driven Insights

Hypothesis generation involves making informed guesses about various aspects of a business, market, or problem that need further exploration and testing. This article discusses the process you need to follow while generating hypotheses, and how an AI tool like Akaike's BYOB can help you work through that process more quickly and effectively.


What is Hypothesis Generation?

Hypothesis generation involves making informed guesses about various aspects of a business, market, or problem that need further exploration and testing. It's a crucial step while applying the scientific method to business analysis and decision-making. 

Here is an example from a popular B-school marketing case study: 

A bicycle manufacturer noticed that their sales had dropped significantly in 2002 compared to the previous year. The team investigating the reasons for this had many hypotheses. One of them was: “many cycling enthusiasts have switched to walking with their iPods plugged in.” The Apple iPod was launched in late 2001 and was an immediate hit among young consumers. Data collected manually by the team seemed to show that the geographies around Apple stores had indeed shown a sales decline.

Traditionally, hypothesis generation is time-consuming and labour-intensive. However, the advent of Large Language Models (LLMs) and Generative AI (GenAI) tools has transformed the practice altogether. These AI tools can rapidly process extensive datasets, quickly identifying patterns, correlations, and insights that might have slipped past human eyes, thus streamlining the stages of hypothesis generation.

These tools have also revolutionised experimentation by optimising test designs, reducing resource-intensive processes, and delivering faster results. LLMs' role in hypothesis generation goes beyond mere assistance, bringing innovation and easy, data-driven decision-making to businesses.

Hypotheses come in various types, such as simple, complex, null, alternative, logical, statistical, or empirical. These categories are defined based on the relationships between the variables involved and the type of evidence required for testing them. In this article, we aim to demystify hypothesis generation. We will explore the role of LLMs in this process and outline the general steps involved, highlighting why it is a valuable tool in your arsenal.

Understanding Hypothesis Generation

A hypothesis is born from a set of underlying assumptions and a prediction of how those assumptions are anticipated to unfold in a given context. Essentially, it's an educated, articulated guess that forms the basis for action and outcome assessment.

A hypothesis is a declarative statement that has not yet been proven true. Based on past scholarship, we could sum it up as follows:

  • A definite statement, not a question
  • Based on observations and knowledge
  • Testable and can be proven wrong
  • Predicts the anticipated results clearly
  • Contains a dependent and an independent variable where the dependent variable is the phenomenon being explained and the independent variable does the explaining

In a business setting, hypothesis generation becomes essential when people are made to explain their assumptions. This clarity from hypothesis to expected outcome is crucial, as it allows people to acknowledge a failed hypothesis if it does not provide the intended result. Promoting such a culture of effective hypothesising can lead to more thoughtful actions and a deeper understanding of outcomes. Failures become just another step on the way to success, and success brings more success.

Hypothesis generation is a continuous process where you start with an educated guess and refine it as you gather more information. You form a hypothesis based on what you know or observe.

Say you're a pen maker whose sales are down. You look at what you know:

  • I can see that pen sales for my brand are down in May and June.
  • I also know that schools are closed in May and June and that schoolchildren use a lot of pens.
  • I hypothesise that my sales are down because school children are not using pens in May and June, and thus not buying newer ones.

The next step is to collect and analyse data to test this hypothesis, like tracking sales before and after school vacations. As you gather more data and insights, your hypothesis may evolve. You might discover that your hypothesis only holds in certain markets but not others, leading to a more refined hypothesis.

Once your hypothesis is proven correct, there are many actions you may take - (a) reduce supply in these months (b) reduce the price so that sales pick up (c) release a limited supply of novelty pens, and so on.

Once you decide on your action, you will further monitor the data to see if your actions are working. This iterative cycle of formulating, testing, and refining hypotheses - and using insights in decision-making - is vital in making impactful decisions and solving complex problems in various fields, from business to scientific research.

How do Analysts generate Hypotheses? Why is it iterative?

A typical human working towards a hypothesis would start with:

    1. Picking the Default Action

    2. Determining the Alternative Action

    3. Figuring out the Null Hypothesis (H0)

    4. Inverting the Null Hypothesis to get the Alternate Hypothesis (H1)

    5. Hypothesis Testing

The default action is what you would naturally do, regardless of any hypothesis or in a case where you get no further information. The alternative action is the opposite of your default action.

The null hypothesis, or H0, is what brings about your default action. The alternative hypothesis (H1) is essentially the negation of H0.

For example, suppose you are tasked with analysing highway tollgate data (timestamp, vehicle number, toll amount) to see whether a raise in tollgate rates will increase revenue or cause a drop in volume. Following the above steps, we can determine the default action, the alternative action, and the null and alternative hypotheses.

Now, we can start looking at past data on tollgate traffic around rate increases for different tollgates. Some data might be irrelevant. For example, some tollgates might be much cheaper, so customers might not have cared about an increase. Or some tollgates are next to a large city, and customers have no choice but to pay.

Ultimately, you are looking for the level of significance of the relationship between traffic and rates for comparable tollgates. Significance is often expressed as a p-value, or probability value. The p-value is a way to measure how surprising your test results are, assuming that your H0 holds true.

The lower the p-value, the more convincing your data is to change your default action.

Usually, a p-value of less than 0.05 is considered statistically significant, meaning there is a need to reject your null hypothesis and change your default action. In our example, a low p-value would suggest that a 10% increase in the toll rate causes a significant dip in traffic (>3%). Thus, it is better to keep our rates as they are if we want to maintain revenue.

In other examples, where one has to explore the significance of different variables, we might find that some variables are not correlated at all. In general, hypothesis generation is an iterative process - you keep looking for data and keep considering whether that data convinces you to change your default action.
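As a minimal, self-contained illustration of this p-value logic, the sketch below compares made-up daily traffic counts before and after a rate increase at one tollgate using a two-sample t-test; the numbers, the built-in ~4% drop, and the choice of test are assumptions made purely for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical daily vehicle counts at a comparable tollgate:
# 60 days before and 60 days after a 10% rate increase (made-up data).
before = rng.normal(loc=10_000, scale=400, size=60)
after = rng.normal(loc=9_600, scale=400, size=60)   # ~4% drop built in

# H0: the rate increase does not change mean daily traffic.
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the drop in traffic is statistically significant.")
else:
    print("Fail to reject H0: no convincing evidence of a traffic drop.")
```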

Internal and External Data 

Hypothesis generation feeds on data. Data can be internal or external. In businesses, internal data is produced by company owned systems (areas such as operations, maintenance, personnel, finance, etc). External data comes from outside the company (customer data, competitor data, and so on).

Let’s consider a real-life hypothesis generated from internal data: 

Multinational company Johnson & Johnson was looking to enhance employee performance and retention.  Initially, they favoured experienced industry candidates for recruitment, assuming they'd stay longer and contribute faster. However, HR and the people analytics team at J&J hypothesised that recent college graduates outlast experienced hires and perform equally well.  They compiled data on 47,000 employees to test the hypothesis and, based on it, Johnson & Johnson increased hires of new graduates by 20% , leading to reduced turnover with consistent performance. 

For an analyst (or an AI assistant), external data is often hard to source - it may not be available as organised datasets (or reports), or it may be expensive to acquire. Teams might have to collect new data from surveys, questionnaires, customer feedback and more. 

Further, there is the problem of context. Suppose an analyst is looking at the dynamic pricing of hotels offered on his company’s platform in a particular geography. Suppose further that the analyst has no context of the geography, the reasons people visit the locality, or of local alternatives; then the analyst will have to learn additional context to start making hypotheses to test. 

Internal data, of course, is internal, meaning access is already guaranteed. However, it often adds up to staggering volumes of data.

Looking Back, and Looking Forward

Data analysts often have to generate hypotheses retrospectively, where they formulate and evaluate H0 and H1 based on past data. For the sake of this article, let's call it retrospective hypothesis generation.

Alternatively, a prospective approach to hypothesis generation could be one where hypotheses are formulated before data collection or before a particular event or change is implemented. 

For example: 

A pen seller has a hypothesis that during the lean periods of summer, when schools are closed, a Buy One Get One (BOGO) campaign will lead to a 100% sales recovery because customers will buy pens in advance.  He then collects feedback from customers in the form of a survey and also implements a BOGO campaign in a single territory to see whether his hypothesis is correct, or not.
The HR head of a multi-office employer realises that some of the company’s offices have been providing snacks at 4:30 PM in the common area, and the rest have not. He has a hunch that these offices have higher productivity. The leader asks the company’s data science team to look at employee productivity data and the employee location data. “Am I correct, and to what extent?”, he asks. 

These examples also reflect another nuance, in which the data is collected differently: 

  • Observational: Observational testing happens when researchers observe a sample population and collect data as it occurs without intervention. The data for the snacks vs productivity hypothesis was observational. 
  • Experimental: In experimental testing, the sample is divided into multiple groups, with one control group. The test for the non-control groups will be varied to determine how the data collected differs from that of the control group. The data collected by the pen seller in the single territory experiment was experimental.

Such data-backed insights are a valuable resource for businesses because they allow for more informed decision-making, leading to the company's overall growth. Taking a data-driven decision, from forming a hypothesis to updating and validating it across iterations, to taking action based on your insights reduces guesswork, minimises risks, and guides businesses towards strategies that are more likely to succeed.

How can GenAI help in Hypothesis Generation?

Of course, hypothesis generation is not always straightforward. Understanding the earlier examples is easy for us because we're already inundated with context. But, in a situation where an analyst has no domain knowledge, suddenly, hypothesis generation becomes a tedious and challenging process.

AI, particularly high-capacity, robust tools such as LLMs, has radically changed how we process and analyse large volumes of data. With its help, we can sift through massive datasets with precision and speed, regardless of context, whether it's customer behaviour, financial trends, medical records, or more. Generative AI tools, including LLMs, are trained on diverse text data, enabling them to comprehend and process a wide range of topics.

Now, imagine an AI assistant helping you with hypothesis generation. LLMs are not born with context. Instead, they are trained on vast amounts of data, enabling them to develop context even in a completely unfamiliar environment. This skill is instrumental when adopting a more exploratory approach to hypothesis generation. For example, the HR leader from earlier could simply ask an LLM tool: “Can you look at this employee productivity data, find high-productivity cohorts, and see if they correlate with any other employee data, such as location, pedigree, years of service, or marital status?”

For an LLM-based tool to be useful, it requires a few things:

  • Domain Knowledge: A human could take months to years to acclimatise to a particular field fully, but LLMs, when fed extensive information and utilising Natural Language Processing (NLP), can familiarise themselves in a very short time.
  • Explainability: the ability to explain its thought process and output, so that it ceases to be a "black box".
  • Customisation: for consistent improvement, contextual AI must allow tweaks, letting users change its behaviour to meet their expectations. Human intervention and validation remain necessary steps in adopting AI tools.

NLP allows these tools to discern context within textual data, meaning they can read, categorise, and analyse data with remarkable speed. LLMs can thus quickly develop contextual understanding and generate human-like text while processing vast amounts of unstructured data, making it easier for businesses and researchers to organise and utilise data effectively. LLMs have the potential to become indispensable tools for businesses. The future rests on AI tools that harness the powers of LLMs and NLP to deliver actionable insights, mitigate risks, inform decision-making, predict future trends, and drive business transformation across various sectors.

Together, these technologies empower data analysts to unravel hidden insights within their data. For our pen maker, for example, an AI tool could aid data analytics. It can look through historical data to track when sales peaked, or go through sales data to identify the pens that sold the most. It can refine a hypothesis across iterations, just as a human analyst would. It can even be used to brainstorm other hypotheses. Consider the situation where you ask the LLM, "Where do I sell the most pens?". It will go through all of the data you have made available - places where you sell pens, the number of pens you sold - to return the answer. Now, if we were to do this on our own, even if we were particularly meticulous about keeping records, it would take us at least five to ten minutes, and that is if we know how to query a database and extract the needed information. If we don't, there's the added effort required to find and train someone who does. An AI assistant, on the other hand, could share the answer with us in mere seconds. Its finely honed talents in sorting through data, identifying patterns, refining hypotheses iteratively, and generating data-backed insights enhance problem-solving and decision-making, supercharging our business model.
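Stripped of the natural-language interface, the question above reduces to a simple aggregation. The small pandas sketch below, with hypothetical column names and values, shows the query the assistant is effectively running:

```python
import pandas as pd

# Hypothetical sales records; the column names and values are illustrative.
sales = pd.DataFrame({
    "city": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune", "Delhi"],
    "units_sold": [1200, 800, 950, 600, 1100, 450],
})

# "Where do I sell the most pens?" as a one-line aggregation.
by_city = sales.groupby("city")["units_sold"].sum().sort_values(ascending=False)
print(by_city)
print("Top market:", by_city.idxmax())
```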

Top-Down and Bottom-Up Hypothesis Generation

As we discussed earlier, every hypothesis begins with a default action that determines your initial hypotheses and all your subsequent data collection. You look at data - a LOT of data. The significance of your data depends on the effect it has and its relevance to your default action. This would be a top-down approach to hypothesis generation.

There is also the bottom-up method , where you start by going through your data and figuring out if there are any interesting correlations that you could leverage better. This method is usually not as focused as the earlier approach and, as a result, involves even more data collection, processing, and analysis. AI is a stellar tool for Exploratory Data Analysis (EDA). Wading through swathes of data to highlight trends, patterns, gaps, opportunities, errors, and concerns is hardly a challenge for an AI tool equipped with NLP and powered by LLMs.

EDA can help with: 

  • Cleaning your data
  • Understanding your variables
  • Analysing relationships between variables

An AI assistant performing EDA can help you review your data, remove redundant data points, identify errors, note relationships, and more. All of this ensures ease, efficiency, and, best of all, speed for your data analysts.
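A minimal pandas sketch of the three EDA steps above, assuming a hypothetical employee_productivity.csv file:

```python
import pandas as pd

df = pd.read_csv("employee_productivity.csv")   # hypothetical dataset

# Cleaning your data: spot missing values and drop exact duplicates.
print(df.isna().sum())
df = df.drop_duplicates()

# Understanding your variables: types, ranges, and summary statistics.
print(df.dtypes)
print(df.describe(include="all"))

# Analysing relationships between variables: pairwise correlations of numeric columns.
print(df.select_dtypes("number").corr())
```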

Good hypotheses are extremely difficult to generate. They are nuanced and, without necessary context, almost impossible to ascertain in a top-down approach. On the other hand, an AI tool adopting an exploratory approach is swift, easily running through available data - internal and external. 

If you want to rearrange how your LLM looks at your data, you can also do that. Changing the weight you assign to the various events and categories in your data is a simple process. That’s why LLMs are a great tool in hypothesis generation - analysts can tailor them to their specific use cases. 

Ethical Considerations and Challenges

There are numerous reasons why you should adopt AI tools into your hypothesis generation process. But why are they still not as popular as they should be?

Some worry that AI tools can inadvertently pick up human biases through the data they are fed. Others fear AI and raise privacy and trust concerns. Data quality and availability are also often questioned. Since LLMs and Generative AI are developing technologies, such issues are bound to arise, but these are all obstacles researchers are earnestly tackling.

One oft-raised complaint against LLM tools (like OpenAI's ChatGPT) is that they 'fill in' gaps in knowledge, providing information where there is none, thus giving inaccurate, embellished, or outright wrong answers; this tendency to "hallucinate" was a major cause for concern. But, to combat this phenomenon, newer AI tools have started providing citations with the insights they offer so that their answers become verifiable. Human validation is an essential step in interpreting AI-generated hypotheses and queries in general. This is why we need a collaboration between the intelligent and artificially intelligent mind to ensure optimised performance.

Clearly, hypothesis generation is an immensely time-consuming activity. But AI can take care of all these steps for you. From helping you figure out your default action, determining all the major research questions, initial hypotheses and alternative actions, and exhaustively weeding through your data to collect all relevant points, AI can help make your analysts' jobs easier. It can take any approach - prospective, retrospective, exploratory, top-down, bottom-up, etc. Furthermore, with LLMs, your structured and unstructured data are taken care of, meaning no more worries about messy data! With the wonders of human intuition and the ease and reliability of Generative AI and Large Language Models, you can speed up and refine your process of hypothesis generation based on feedback and new data to provide the best assistance to your business.


Published: 15 January 2004

Functional genomic hypothesis generation and experimentation by a robot scientist

  • Ross D. King 1 ,
  • Kenneth E. Whelan 1 ,
  • Ffion M. Jones 1 ,
  • Philip G. K. Reiser 1 ,
  • Christopher H. Bryant 2 ,
  • Stephen H. Muggleton 3 ,
  • Douglas B. Kell 4 &
  • Stephen G. Oliver 5  

Nature volume 427, pages 247–252 (2004)


The question of whether it is possible to automate the scientific process is of both great theoretical interest [1, 2] and increasing practical importance because, in many scientific areas, data are being generated much faster than they can be effectively analysed. We describe a physically implemented robotic system that applies techniques from artificial intelligence [3–8] to carry out cycles of scientific experimentation. The system automatically originates hypotheses to explain observations, devises experiments to test these hypotheses, physically runs the experiments using a laboratory robot, interprets the results to falsify hypotheses inconsistent with the data, and then repeats the cycle. Here we apply the system to the determination of gene function using deletion mutants of yeast ( Saccharomyces cerevisiae ) and auxotrophic growth experiments [9]. We built and tested a detailed logical model (involving genes, proteins and metabolites) of the aromatic amino acid synthesis pathway. In biological experiments that automatically reconstruct parts of this model, we show that an intelligent experiment selection strategy is competitive with human performance and significantly outperforms, with a cost decrease of 3-fold and 100-fold (respectively), both cheapest and random-experiment selection.
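To make the cycle concrete, here is a toy Python sketch of the hypothesize-experiment-falsify loop the abstract describes. It is only an illustration of the general closed-loop idea, not the Robot Scientist's actual implementation (which uses inductive logic programming over a logical model of metabolism); `predict`, `run_experiment`, and `cost` are placeholder callables.

```python
def discovery_loop(hypotheses, experiments, run_experiment, predict, cost, budget):
    """Toy closed-loop discovery: repeatedly pick an informative, affordable experiment
    and discard every hypothesis inconsistent with its observed outcome.

    predict(h, e)      -> outcome hypothesis h expects for experiment e
    run_experiment(e)  -> observed outcome of experiment e
    cost(e)            -> monetary cost of experiment e
    """
    alive = set(hypotheses)
    spent = 0.0
    while len(alive) > 1 and experiments:
        def discrimination(e):
            # Number of distinct predicted outcomes: a crude proxy for how well
            # experiment e separates the surviving hypotheses.
            return len({predict(h, e) for h in alive})

        affordable = [e for e in experiments if spent + cost(e) <= budget]
        if not affordable:
            break
        e = max(affordable, key=discrimination)
        experiments.remove(e)
        observed = run_experiment(e)
        spent += cost(e)
        alive = {h for h in alive if predict(h, e) == observed}   # falsification step
    return alive, spent
```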



Popper, K. The Logic of Scientific Discovery (Hutchinson, London, 1972)


Sloman, A. The Computer Revolution in Philosophy (Harvester, Hassocks, Sussex, 1978); available online from 〈 http://www.cs.bham.ac.uk/research/cogaff/crp/ 〉


Buchanan, B. G., Sutherland, G. L. & Feigenbaum, E. A. in Machine Intelligence Vol. 4 (eds Meltzer, B. & Michie, D.) 209–254 (Edinburgh Univ. Press, 1969)

Langley, P., Simon, H. A., Bradshaw, G. L. & Zytkow, J. M. Scientific Discovery: Computational Explorations of the Creative Process (MIT Press, Cambridge, Massachusetts, 1987)

Żytkow, J. M., Zhu, J. & Hussam, A. Automated discovery in a chemistry laboratory. in Proceedings of the 8th National Conference on Artificial Intelligence (AAAI-1990) (eds Dietterich, T. & Swartout, W.) 889–894 (MIT, Cambridge, Massachusetts, 1990)

King, R. D., Muggleton, S. H., Srinivasan, A. & Sternberg, M. J. E. Structure–activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Natl Acad. Sci. USA 93 , 438–442 (1996)


Valdes-Perez, R. E. Discovery tools for science applications. Commun. ACM 42 , 37–41 (1999)


Langley, P. The computational support of scientific discovery. Int. J. Hum.–Comput. Stud. 53 , 393–410 (2000)

Beadle, G. W. & Tatum, E. I. Genetic control of biochemical reactions in Neurospora . Proc. Natl Acad. Sci. USA 27 , 499–506 (1941)

Peirce, C. S. Collected Papers of Charles Sanders Peirce (Harvard Univ. Press, 1958)

Mewes, H. W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 30 , 31–34 (2002); 〈 http://mips.gsf.de/proj/yeast/CYGD/db/index.html 〉


Reiser, P. G. K. et al. Developing a logical model of yeast metabolism. Electron. Trans. Artif. Intell. 5 , 223–244 (2001)

Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28 , 27–30 (2000); 〈 http://www.genome.ad.jp/kegg/ 〉

Mitchell, T. M. Machine Learning (McGraw-Hill, New York, 1997)

Langley, P. Elements of Machine Learning (Morgan Kaufmann, San Mateo, California, 1996)

Cohn, D. A., Ghahramani, Z. & Jordan, M. I. Active learning with statistical models. J. Artif. Intell. Res. 4, 129–145 (1996)

Lin, F.-R. & Shaw, M. J. Active training of backpropagation neural networks using the learning by experimentation methodology. Ann. Oper. Res. 75 , 129–145 (1997)

Fedorov, V. V. Theory of Optimal Experiments (Academic, London, 1972)

Muggleton, S. & Page, D. in Machine Intelligence Vol. 15 (eds Furukawa, K., Michie, D. & Muggleton, S.) 248–267 (Oxford Univ. Press, 1999)

Bryant, C. H., Muggleton, S. H., Oliver, S. G., Kell, D. B., Reiser, P. G. K. & King, R. D. Combining inductive logic programming, active learning, and robotics to discover the function of genes. Electron. Trans. Artif. Intell. 5 , 1–36 (2001)

Flach, P. & Kakas, A. Abduction and Induction (Kluwer, London, 2000)


Zupan, B. et al. in Proceedings of the Eighth European Conference on Artificial Intelligence in Medicine (eds Quaglini, S., Barahona, P. & Andreassen, S.) 304–313 (Springer, Berlin, 2001)

Muggleton, S. Inverse entailment and Progol. New Generation Comput. J. 13 , 245–286 (1995)

Rabitz, H., de Vivie-Riedle, R., Motzkus, M. & Kompa, K. Whither the future of controlling quantum phenomena? Science 288 , 824–828 (2000)


Cochran, W. G. & Cox, G. M. Experimental Designs (Wiley, New York, 1992)

Brachmann, C. B. et al. Designer deletion strains derived from Saccharomyces cerevisiae S288C: a useful set of strains and plasmids for PCR-mediated gene disruption and other applications. Yeast 14 , 115–132 (1998)


Giaever, G. et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418 , 387–391 (2002)


Acknowledgements

We thank D. Page, U. Sarkans, A. Tamaddoni, M. Sternberg, A. Sloman and D. Michie for their help and advice, and D. Struttman for technical assistance. The work was funded by the BBSRC, the EPSRC, the Wellcome Trust and PharmDM.

Author information

Authors and Affiliations

Department of Computer Science, University of Wales, Aberystwyth, SY23 3DB, UK

Ross D. King, Kenneth E. Whelan, Ffion M. Jones & Philip G. K. Reiser

School of Computing, The Robert Gordon University, Aberdeen, AB10 1FR, UK

Christopher H. Bryant

Department of Computing, Imperial College, London, SW7 2AZ, UK

Stephen H. Muggleton

Department of Chemistry, UMIST, P.O. Box 88, M60 1QD, Manchester, UK

Douglas B. Kell

School of Biological Sciences, University of Manchester, 2.205 Stopford Building, M13 9PT, Manchester, UK

Stephen G. Oliver


Corresponding author

Correspondence to Stephen G. Oliver .

Ethics declarations

Competing interests.

R.D.K. is a co-founder and member of the scientific board of PharmaDM, Kapeldreef 60, B-3001 Heverlee, Belgium.

Supplementary information

Supplementary methods and data (DOC 1370 kb)


About this article

Cite this article.

King, R., Whelan, K., Jones, F. et al. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427 , 247–252 (2004). https://doi.org/10.1038/nature02236


Received : 24 July 2003

Accepted : 14 November 2003

Issue Date : 15 January 2004

DOI : https://doi.org/10.1038/nature02236




Hypothesis Maker

AI-powered research hypothesis generator

  • Scientific Research: Generate a hypothesis for your experimental or observational study based on your research question.
  • Academic Studies: Formulate a hypothesis for your thesis, dissertation, or academic paper.
  • Market Research: Develop a hypothesis for your market research study to understand consumer behavior or market trends.
  • Social Science Research: Create a hypothesis for your social science research to explore societal or behavioral patterns.


Automated Hypothesis Generation


Automated hypothesis generation: when machine-learning systems produce ideas, not just test them.

Testing ideas at scale. Fast.

While algorithms are mostly used as tools to number-crunch and test-drive ideas, they have not yet been used to generate the ideas themselves - let alone at scale.

Rather than thinking up one idea at a time and testing it, what if a machine could generate millions of ideas automatically? What if this same machine would then proceed to autonomously test and rank the ideas, discovering which are better supported by the data? A machine that can even identify the type of data that could refute one’s theories and challenge existing practices.

This machine lies at the heart of SparkBeyond Discovery: its Hypothesis Engine. The engine automatically generates millions of ideas, many of them novel, and asks questions we would never think to ask.

This Hypothesis Engine integrates the world’s largest collection of algorithms and bypasses human cognitive bias to produce millions of ideas, hypotheses and questions in minutes. These hypotheses ensure that any meaningful signals in the data are surfaced. The signals are often immediately actionable and can be used as predictive features in machine learning models.
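As a toy illustration of the generate-test-rank idea (and emphatically not SparkBeyond's actual engine), the sketch below enumerates simple candidate feature hypotheses from a tabular dataset and ranks them by absolute correlation with a numeric target column; the column names and transformations are purely illustrative.

```python
import numpy as np
import pandas as pd

def enumerate_and_rank(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Toy hypothesis engine: enumerate simple candidate features and rank them by
    absolute correlation with a numeric target. Purely illustrative."""
    y = df[target]
    numeric = df.select_dtypes("number").drop(columns=[target], errors="ignore")

    candidates = {}
    for col in numeric.columns:
        candidates[col] = numeric[col]
        candidates[f"log1p({col})"] = np.log1p(numeric[col].clip(lower=0))
        candidates[f"{col}^2"] = numeric[col] ** 2
    for a in numeric.columns:                       # pairwise interaction "hypotheses"
        for b in numeric.columns:
            if a < b:
                candidates[f"{a}*{b}"] = numeric[a] * numeric[b]

    scores = {name: abs(feat.corr(y)) for name, feat in candidates.items()}
    return (pd.Series(scores, name="abs_corr_with_target")
              .sort_values(ascending=False)
              .to_frame())
```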

Going beyond the bias

Human ideation is inherently limited by cognitive bottlenecks and biases, which restrict us in generating and testing ideas at scale and high throughput. We're also limited by the speed at which we can communicate. We don’t have the capacity to read and comprehend the thousands of scientific articles and patents published every day. 

What’s more, the questions we ask are biased by our experience and knowledge, or even our mood.

In data science and research workflows, there are key bottlenecks that limit what a person or team can accomplish while working on a problem within a finite amount of time. 

For example, when exploring for useful patterns in data, a data scientist only has time to conceive, engineer, and evaluate a limited number of distinct hypotheses, leaving many areas unexplored. 

One of these areas is the gaps within an organization’s own data. This internal data may only reveal part of the story, whereas augmented external data sources can provide valuable contextual information. Without it, hypotheses based only on internal data don’t take into account the influence of external factors, such as weather and local events, or macro-economic factors and market conditions. 

Instead, by mapping out the entire spectrum of dynamics that happen on earth, SparkBeyond Discovery connects the dots between every data set that exists and offers a comprehensive viewpoint.

Tap into humanity's collective intelligence

Just like search engines crawl the web for text, our machine started indexing the code, data and knowledge on the web, and amassed one of the world's largest libraries of open-source code functions. 

Using both automation and AI, the Hypothesis Engine employs these functions to generate four million hypotheses per minute—a capacity that allows the technology to work through hundreds of good and bad ideas every second.


  • School Guide
  • Mathematics
  • Number System and Arithmetic
  • Trigonometry
  • Probability
  • Mensuration
  • Maths Formulas
  • Class 8 Maths Notes
  • Class 9 Maths Notes
  • Class 10 Maths Notes
  • Class 11 Maths Notes
  • Class 12 Maths Notes
  • Data Analysis with Python

Introduction to Data Analysis

  • What is Data Analysis?
  • Data Analytics and its type
  • How to Install Numpy on Windows?
  • How to Install Pandas in Python?
  • How to Install Matplotlib on python?
  • How to Install Python Tensorflow in Windows?

Data Analysis Libraries

  • Pandas Tutorial
  • NumPy Tutorial - Python Library
  • Data Analysis with SciPy
  • Introduction to TensorFlow

Data Visulization Libraries

  • Matplotlib Tutorial
  • Python Seaborn Tutorial
  • Plotly tutorial
  • Introduction to Bokeh in Python

Exploratory Data Analysis (EDA)

  • Univariate, Bivariate and Multivariate data and its analysis
  • Measures of Central Tendency in Statistics
  • Measures of spread - Range, Variance, and Standard Deviation
  • Interquartile Range and Quartile Deviation using NumPy and SciPy
  • Anova Formula
  • Skewness of Statistical Data
  • How to Calculate Skewness and Kurtosis in Python?
  • Difference Between Skewness and Kurtosis
  • Histogram | Meaning, Example, Types and Steps to Draw
  • Interpretations of Histogram
  • Quantile Quantile plots
  • What is Univariate, Bivariate & Multivariate Analysis in Data Visualisation?
  • Using pandas crosstab to create a bar plot
  • Exploring Correlation in Python
  • Mathematics | Covariance and Correlation
  • Factor Analysis | Data Analysis
  • Data Mining - Cluster Analysis
  • MANOVA Test in R Programming
  • Python - Central Limit Theorem
  • Probability Distribution Function
  • Probability Density Estimation & Maximum Likelihood Estimation
  • Exponential Distribution in R Programming - dexp(), pexp(), qexp(), and rexp() Functions
  • Mathematics | Probability Distributions Set 4 (Binomial Distribution)
  • Poisson Distribution | Definition, Formula, Table and Examples
  • P-Value: Comprehensive Guide to Understand, Apply, and Interpret
  • Z-Score in Statistics
  • How to Calculate Point Estimates in R?
  • Confidence Interval
  • Chi-square test in Machine Learning

Understanding Hypothesis Testing


Hypothesis testing involves formulating assumptions about population parameters based on sample statistics and rigorously evaluating these assumptions against empirical evidence. This article sheds light on the significance of hypothesis testing and the critical steps involved in the process.

What is Hypothesis Testing?

Hypothesis testing is a statistical method used to make a statistical decision based on experimental data. It starts from an assumption that we make about a population parameter and evaluates two mutually exclusive statements about the population to determine which statement is better supported by the sample data.

Example: You claim that the average height in the class is 30, or that a boy is taller than a girl. These are assumptions, and we need a statistical way to test them. Hypothesis testing provides the mathematical framework for deciding whether what we are assuming is supported by the data.

Defining Hypotheses

  • Null hypothesis (H0): the default assumption of no effect or no difference, e.g. H0: μ = μ0 (the population mean equals a specified value).
  • Alternative hypothesis (H1): the claim we want to test, e.g. H1: μ ≠ μ0.

Key Terms of Hypothesis Testing

  • Level of significance (α): the probability of rejecting the null hypothesis when it is actually true (a Type I error). A common choice is α = 0.05, i.e. we accept a 5% risk of wrongly rejecting H0.

  • P-value: The p-value, or calculated probability, is the probability of obtaining the observed (or more extreme) results when the null hypothesis (H0) is true. If the p-value is less than the chosen significance level, you reject the null hypothesis, i.e. the sample provides evidence in favour of the alternative hypothesis.
  • Test Statistic: The test statistic is a numerical value calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis. It is compared to a critical value or p-value to make decisions about the statistical significance of the observed results.
  • Critical value : The critical value in statistics is a threshold or cutoff point used to determine whether to reject the null hypothesis in a hypothesis test.
  • Degrees of freedom: Degrees of freedom describe the amount of independent information available for estimating a parameter. They are related to the sample size and determine the shape of the relevant sampling distribution (for example, the t-distribution).

Why do we use Hypothesis Testing?

Hypothesis testing is an important procedure in statistics. It evaluates two mutually exclusive statements about a population to determine which statement is better supported by the sample data. Whenever we say that findings are statistically significant, it is hypothesis testing that justifies the claim.

One-Tailed and Two-Tailed Test

One tailed test focuses on one direction, either greater than or less than a specified value. We use a one-tailed test when there is a clear directional expectation based on prior knowledge or theory. The critical region is located on only one side of the distribution curve. If the sample falls into this critical region, the null hypothesis is rejected in favor of the alternative hypothesis.

One-Tailed Test

There are two types of one-tailed test:

  • Left-tailed test: the alternative hypothesis states that the parameter is less than the hypothesised value, e.g. H0: μ ≥ 50 versus H1: μ < 50.
  • Right-tailed test: the alternative hypothesis states that the parameter is greater than the hypothesised value, e.g. H0: μ ≤ 50 versus H1: μ > 50.

Two-Tailed Test

A two-tailed test considers both directions, greater than and less than a specified value. We use a two-tailed test when there is no specific directional expectation and we want to detect any significant difference.

For example, H0: μ = 50 versus H1: μ ≠ 50; a sufficiently large deviation in either direction leads to rejection of the null hypothesis.
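To make the distinction concrete, here is a minimal sketch using SciPy's one-sample t-test; the sample values and the hypothesised mean of 50 are made-up assumptions, and the alternative argument of ttest_1samp requires a reasonably recent SciPy release:

```python
# Minimal sketch: two-tailed vs. one-tailed one-sample t-tests with SciPy.
# The sample values and the hypothesised mean (50) are illustrative assumptions.
import numpy as np
from scipy import stats

sample = np.array([52, 49, 55, 51, 53, 48, 54, 50, 56, 52])

# Two-tailed test: H0: mu = 50 versus H1: mu != 50.
t_two, p_two = stats.ttest_1samp(sample, popmean=50, alternative="two-sided")

# Right-tailed test: H0: mu <= 50 versus H1: mu > 50.
t_one, p_one = stats.ttest_1samp(sample, popmean=50, alternative="greater")

print(f"two-tailed: t = {t_two:.2f}, p = {p_two:.3f}")
print(f"one-tailed: t = {t_one:.2f}, p = {p_one:.3f}")
```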

What are Type 1 and Type 2 errors in Hypothesis Testing?

In hypothesis testing, Type I and Type II errors are two possible errors that researchers can make when drawing conclusions about a population based on a sample of data. These errors are associated with the decisions made regarding the null hypothesis and the alternative hypothesis.

  • Type I error (α): rejecting the null hypothesis when it is actually true. The significance level α is the probability of making this error.
  • Type II error (β): failing to reject the null hypothesis when it is actually false. The power of a test, 1 − β, is the probability of correctly rejecting a false null hypothesis.

How does Hypothesis Testing work?

Step 1 – Define the Null and Alternative Hypotheses

The null hypothesis (H0) states that there is no effect or no difference; the alternative hypothesis (H1) states the effect or difference we want to demonstrate.

We first identify the problem about which we want to make an assumption, keeping in mind that the two hypotheses must be mutually exclusive. In this walkthrough we assume normally distributed data.

Step 2 – Choose the Significance Level

The significance level (α) is the probability of rejecting the null hypothesis when it is actually true. A common choice is α = 0.05.

Step 3 – Collect and Analyze Data

Gather relevant data through observation or experimentation. Analyze the data using appropriate statistical methods to obtain a test statistic.

Step 4 – Calculate the Test Statistic

In this step the data are evaluated: we compute a score based on the characteristics of the data. The choice of test statistic depends on the type of hypothesis test being conducted.

There are several hypothesis tests, each appropriate for a different goal. The statistic could come from a Z-test, a chi-square test, a t-test, and so on.

  • Z-test: used when the population mean and standard deviation are known.
  • t-test: used when the population standard deviation is unknown and the sample size is small.
  • Chi-square test: used for categorical data, for example to test independence in contingency tables.
  • F-test: often used in analysis of variance (ANOVA) to compare variances or to test the equality of means across multiple groups.

Because the dataset in the worked example below is small, the t-test is the more appropriate choice for testing our hypothesis.

T-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.

Step 5 – Compare the Test Statistic

In this stage we decide whether to reject the null hypothesis or fail to reject it. There are two equivalent ways to make this decision.

Method A: Using Critical Values

Comparing the test statistic with the tabulated critical value:

  • If |test statistic| > critical value: reject the null hypothesis.
  • If |test statistic| ≤ critical value: fail to reject the null hypothesis.

Note: Critical values are predetermined threshold values used to make the decision in a hypothesis test. They are read from a statistical distribution table, such as the normal or t-distribution table, according to the test being used and the chosen significance level.

Method B: Using P-values

We can also come to a conclusion using the p-value:

  • If p ≤ α: reject the null hypothesis.
  • If p > α: fail to reject the null hypothesis.

Note: The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed in the sample, assuming the null hypothesis is true. It is obtained from statistical software or from the appropriate distribution table (normal, t, and so on) for the test being used.

Step 6 – Interpret the Results

At last, we can conclude our experiment using method A or B.
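The two methods always agree, because the critical value and the p-value are two views of the same sampling distribution. Here is a minimal sketch of both decision rules, where the t-statistic, degrees of freedom, and α are assumed values for illustration:

```python
# Minimal sketch: the two decision methods for a two-tailed t-test.
# The t-statistic, degrees of freedom, and alpha are assumed values.
from scipy import stats

t_stat = 2.5   # assumed test statistic
df = 24        # assumed degrees of freedom
alpha = 0.05

# Method A: compare |t| with the critical value of the t-distribution.
t_critical = stats.t.ppf(1 - alpha / 2, df)
reject_by_critical = abs(t_stat) > t_critical

# Method B: compare the two-tailed p-value with alpha.
p_value = 2 * stats.t.sf(abs(t_stat), df)
reject_by_p = p_value <= alpha

print(f"critical value = {t_critical:.3f}, reject H0: {reject_by_critical}")
print(f"p-value = {p_value:.4f}, reject H0: {reject_by_p}")
```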

Calculating test statistic

To validate a hypothesis about a population parameter we use statistical functions. For normally distributed data we use the test statistic (such as the z-score), the p-value, and the level of significance (α) to weigh the evidence for or against the hypothesis.

1. Z-statistics:

Used when the population mean and standard deviation are known:

z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}

  • x̄ is the sample mean,
  • μ represents the population mean,
  • σ is the population standard deviation,
  • and n is the size of the sample.
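A minimal sketch of this calculation in code; the sample mean, population parameters, and sample size below are assumed values for illustration:

```python
# Minimal sketch: z-statistic and two-tailed p-value for a one-sample z-test.
# The sample mean, population parameters, and sample size are assumed values.
import math
from scipy import stats

x_bar = 52.0   # sample mean (assumed)
mu = 50.0      # population mean under H0 (assumed)
sigma = 5.0    # known population standard deviation (assumed)
n = 30         # sample size (assumed)

z = (x_bar - mu) / (sigma / math.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))   # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```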

2. T-Statistics

The t-test is used when the sample size is small (n < 30) and the population standard deviation is unknown.

The t-statistic is calculated as:

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

  • t = t-score,
  • x̄ = sample mean
  • μ = population mean,
  • s = standard deviation of the sample,
  • n = sample size
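As a quick numerical sketch, with made-up sample values and a made-up hypothesised mean, the t-statistic computed by hand from this formula can be cross-checked against SciPy's one-sample t-test:

```python
# Minimal sketch: one-sample t-statistic computed by hand and via SciPy.
# The sample and the hypothesised population mean are assumed values.
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])
mu = 12.0  # hypothesised population mean (assumed)

x_bar = sample.mean()
s = sample.std(ddof=1)   # sample standard deviation
n = len(sample)
t_manual = (x_bar - mu) / (s / np.sqrt(n))

# Cross-check against SciPy's built-in one-sample t-test.
t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu)

print(f"manual t = {t_manual:.3f}, scipy t = {t_scipy:.3f}, p-value = {p_value:.3f}")
```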

3. Chi-Square Test

The chi-square test for independence is used for categorical data (no normality assumption):

\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

  • i and j are the row and column indices, respectively,
  • O_{ij} is the observed frequency in cell (i, j),
  • E_{ij} is the expected frequency in cell (i, j) under the assumption of independence.
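As a minimal sketch, assuming a made-up 2x2 contingency table of counts, SciPy's chi2_contingency computes the expected frequencies E_{ij} and the statistic above:

```python
# Minimal sketch: chi-square test of independence on an assumed 2x2 table.
import numpy as np
from scipy import stats

# Rows: group A / group B; columns: outcome present / absent (made-up counts).
observed = np.array([[30, 20],
                     [15, 35]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
print("expected frequencies:")
print(expected)
```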

Real life Hypothesis Testing example

Let’s examine hypothesis testing using two real life situations,

Case A: Does a New Drug Affect Blood Pressure?

Imagine a pharmaceutical company has developed a new drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to market, they need to conduct a study to assess its impact on blood pressure.

  • Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
  • After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

Step 1 : Define the Hypothesis

  • Null hypothesis (H0): The new drug has no effect on blood pressure.
  • Alternate hypothesis (H1): The new drug has an effect on blood pressure.

Step 2: Define the Significance level

Let’s set the significance level at 0.05: we will reject the null hypothesis if the evidence suggests less than a 5% chance of observing the results due to random variation alone.

Step 3 : Compute the test statistic

Using a paired t-test, we analyze the data to obtain a test statistic and a p-value.

The test statistic (e.g., T-statistic) is calculated based on the differences between blood pressure measurements before and after treatment.

t = \frac{m}{s/\sqrt{n}}

  • m = mean of the paired differences, where d_i = X_after,i − X_before,i,
  • s = standard deviation of the differences d_i,
  • n = number of pairs.

For these data, m = −3.9, s ≈ 1.37, and n = 10.

Plugging these into the paired t-test formula gives a t-statistic of approximately −9.

Step 4: Find the p-value

With a calculated t-statistic of −9 and df = 9 degrees of freedom, the p-value can be found using statistical software or a t-distribution table.

Here, the two-tailed p-value ≈ 8.54e-06.

Step 5: Result

  • If the p-value is less than or equal to 0.05, the researchers reject the null hypothesis.
  • If the p-value is greater than 0.05, they fail to reject the null hypothesis.

Conclusion: Since the p-value (8.538051223166285e-06) is less than the significance level (0.05), the researchers reject the null hypothesis. There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.

Python Implementation of Hypothesis Testing

Let’s implement the hypothesis test in Python, testing whether a new drug affects blood pressure. For this example we will use a paired t-test from the scipy.stats library.

SciPy is a scientific computing library for Python that provides, among other things, a broad collection of statistical functions.

We will implement our first real-life problem (Case A) in Python.
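A minimal sketch of such an implementation, using scipy.stats.ttest_rel on the Case A measurements, could look like this:

```python
# Minimal sketch: paired t-test for Case A using scipy.stats.ttest_rel.
import numpy as np
from scipy import stats

before = np.array([120, 122, 118, 130, 125, 128, 115, 121, 123, 119])
after = np.array([115, 120, 112, 128, 122, 125, 110, 117, 119, 114])

# Paired (dependent-samples) t-test on the before/after measurements.
t_stat, p_value = stats.ttest_rel(after, before)
print(f"T-statistic: {t_stat:.2f}")   # approximately -9.0
print(f"P-value: {p_value:.2e}")      # approximately 8.5e-06

alpha = 0.05
if p_value <= alpha:
    print("Reject the null hypothesis: the drug appears to affect blood pressure.")
else:
    print("Fail to reject the null hypothesis.")
```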

In the above example, given the T-statistic of approximately -9 and an extremely small p-value, the results indicate a strong case to reject the null hypothesis at a significance level of 0.05. 

  • The results suggest that the new drug, treatment, or intervention has a significant effect on lowering blood pressure.
  • The negative T-statistic indicates that the mean blood pressure after treatment is significantly lower than the assumed population mean before treatment.

Case B: Cholesterol Level in a Population

Data: A sample of 25 individuals is taken, and their cholesterol levels are measured.

Cholesterol Levels (mg/dL): 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205.

Population mean under the null hypothesis (μ): 200 mg/dL

Population standard deviation (σ): 5 mg/dL (given for this problem)

Step 1: Define the Hypothesis

  • Null Hypothesis (H 0 ): The average cholesterol level in a population is 200 mg/dL.
  • Alternate Hypothesis (H 1 ): The average cholesterol level in a population is different from 200 mg/dL.

Step 2: Define the Significance Level

As the direction of deviation is not given, we assume a two-tailed test at a significance level of 0.05. Based on the standard normal (z) table, the critical values are approximately −1.96 and 1.96.

Step 3: Compute the Test Statistic

Since the population standard deviation is known and n = 25, we use the z-statistic. The sample mean is 202.04 mg/dL, so

z = \frac{202.04 - 200}{5/\sqrt{25}} = \frac{2.04}{1} = 2.04

Step 4: Result

Since the absolute value of the test statistic (2.04) is greater than the critical value (1.96), we reject the null hypothesis and conclude that there is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.
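A minimal sketch of the Case B calculation in code (SciPy itself does not provide a one-sample z-test, so the statistic is computed directly from the formula):

```python
# Minimal sketch: one-sample z-test for Case B (cholesterol levels).
import numpy as np
from scipy import stats

cholesterol = np.array([205, 198, 210, 190, 215, 205, 200, 192, 198, 205,
                        198, 202, 208, 200, 205, 198, 205, 210, 192, 205,
                        198, 205, 210, 192, 205])
mu_0 = 200      # hypothesised population mean
sigma = 5       # known population standard deviation
alpha = 0.05

z = (cholesterol.mean() - mu_0) / (sigma / np.sqrt(len(cholesterol)))
z_critical = stats.norm.ppf(1 - alpha / 2)   # about 1.96 for a two-tailed test
p_value = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.2f}, critical value = {z_critical:.2f}, p-value = {p_value:.4f}")
if abs(z) > z_critical:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```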

Limitations of Hypothesis Testing

  • Although a useful technique, hypothesis testing does not offer a comprehensive grasp of the topic being studied. Without fully reflecting the intricacy or whole context of the phenomena, it concentrates on certain hypotheses and statistical significance.
  • The accuracy of hypothesis testing results is contingent on the quality of available data and the appropriateness of statistical methods used. Inaccurate data or poorly formulated hypotheses can lead to incorrect conclusions.
  • Relying solely on hypothesis testing may cause analysts to overlook significant patterns or relationships in the data that are not captured by the specific hypotheses being tested. This limitation underscores the importance of complementing hypothesis testing with other analytical approaches.

Hypothesis testing stands as a cornerstone in statistical analysis, enabling data scientists to navigate uncertainties and draw credible inferences from sample data. By systematically defining null and alternative hypotheses, choosing significance levels, and leveraging statistical tests, researchers can assess the validity of their assumptions. The article also elucidates the critical distinction between Type I and Type II errors, providing a comprehensive understanding of the nuanced decision-making process inherent in hypothesis testing. The real-life example of testing a new drug’s effect on blood pressure using a paired T-test showcases the practical application of these principles, underscoring the importance of statistical rigor in data-driven decision-making.

Frequently Asked Questions (FAQs)

1. What are the 3 types of hypothesis tests?

There are three types of hypothesis tests: right-tailed, left-tailed, and two-tailed. Right-tailed tests assess if a parameter is greater, left-tailed if lesser. Two-tailed tests check for non-directional differences, greater or lesser.

2. What are the 4 components of hypothesis testing?

Null hypothesis (H0): no effect or difference exists. Alternative hypothesis (H1): an effect or difference exists. Significance level (α): the risk of rejecting the null hypothesis when it is true (Type I error). Test statistic: a numerical value representing the observed evidence against the null hypothesis.

3. What is hypothesis testing in ML?

A statistical method to evaluate the performance and validity of machine learning models. It tests specific hypotheses about model behavior, such as whether features influence predictions or whether a model generalizes well to unseen data.

4. What is the difference between pytest and Hypothesis in Python?

pytest is a general-purpose testing framework for Python code, while Hypothesis is a property-based testing framework for Python that generates test cases from specified properties of the code.


Hypothesis generation guided by co-word clustering

  • Published: January 2003
  • Volume 56, pages 111–135 (2003)


  • Johannes Stegmann
  • Guenter Grohmann


Co-word analysis was applied to keywords assigned to MEDLINE documents contained in sets of complementary but disjoint literatures. In strategical diagrams of disjoint literatures, based on internal density and external centrality of keyword-containing clusters, intermediate terms (linking the disjoint partners) were found in regions of below-median centrality and density. Terms representing the disjoint literature themes were found in close vicinity in strategical diagrams of intermediate literatures. Based on centrality-density ratios, characteristic values were found which allow a rapid identification of clusters containing possible intermediate and disjoint partner terms. Applied to the already investigated disjoint pairs Raynaud's Disease - Fish Oil, Migraine - Magnesium, the method readily detected known and unknown (but relevant) intermediate and disjoint partner terms. Application of the method to the literature on Prions led to Manganese as a possible disjoint partner term. It is concluded that co-word clustering is a powerful method for literature-based hypothesis generation and knowledge discovery.



Author information

Authors and Affiliations

  • Johannes Stegmann, Medical Library, University Hospital Benjamin Franklin, Free University Berlin, 12203 Berlin, Germany
  • Guenter Grohmann, Institute of Medical Informatics, Biometry and Epidemiology, University Hospital Benjamin Franklin, Free University Berlin, Berlin, Germany


About this article

Stegmann, J., Grohmann, G. Hypothesis generation guided by co-word clustering. Scientometrics 56, 111–135 (2003). https://doi.org/10.1023/A:1021954808804


Hate "The Phantom Menace"? The Ewok Line theory could explain why

What if a "how i met your mother" hypothesis also applies to our divided opinion about the "star wars" prequels, by melanie mcfarland.

Twenty-five years onward from the theatrical debut of " Star Wars: Episode I – The Phantom Menace ," nearly every one of its haters has a story about how George Lucas wrecked their childhood.

Maybe it was the acting, or more accurately, its absence. Many cite the introduction of the whole midichlorians pseudo-science which, as my still-traumatized husband explained during a recent rewatch, negated the mystical wonder of the Force connection. "The podracing . . . the podracing . . . " he muttered under his breath with all the resignation of Colonel Kurtz gasping out his last words.

This man loved "Star Wars" well into his 20s . . . until Lucas brought Jar Jar Binks and the Gungans into his orbit. The Naboo natives’ barely intelligible patois moved critics like NPR’s Bob Mondello to wonder what Lucas was thinking in introducing "a race of idol-worshiping primitives who speak with Caribbean accents and behave like refugees from 'Amos n Andy.'" The Jar Jar hatred ran so deep and fierce that it brought years of virulent harassment upon Ahmed Best, the actor who voiced him.

To the manchild I love, that character and most of what surrounded it marked the death of any nostalgia he held for “Star Wars.” 

This explains why the bulk of the movie’s silver anniversary coverage breaks down to measuring how we feel about “The Phantom Menace” all these years later as opposed to appreciating what it contributes to moviemaking or the franchise canon. Our love or hatred frequently boils down to what age we were in 1999. 

Less often examined is the mechanics of “Star Wars” as a brand with emotional staying power and Gen X’s insidiously possessive attitude concerning the original trilogy. This existed long before Lucas wrote and directed “The Phantom Menace,” the opening act to the prequel trilogy that arrived 16 years after “The Return of the Jedi.”

Indeed, the origins of the middle-aged "Star Wars" fan's signature smug dismissal may be  yub-nubbing  their way through that installment, the same one that forced Carrie Fisher's Princess Leia to fight in a bikini .

The 25th anniversary of “The Phantom Menace” coincides with the 41st anniversary of a fictionally established but sound theory first presented by Neil Patrick Harris’ Barney Stinson in a seventh season episode of “How I Met Your Mother” called “Field Trip” that first aired in 2011. 

He called it “The Ewok Line,” a demographic border established on May 25, 1983. Those who turned 10 before that date were “too old for something so cloying and cute,” said Barney. Anyone who turned 10 afterward loves the Ewoks “because, why? . . . They reminded you of your Teddys.”

The sitcom father of all F-boys is right. Well, not quite . . . it’s writer Jamie Rhonheimer who verbalized the source of the first schism within the “Star Wars” congregation. 


Frankly, I was not aware of it until college, when I attended a marathon screening of the original movies in the campus' largest lecture hall. By the time the “Jedi” reel was in the projector about half the audience was inebriated, setting up the roar that met the Imperial AT-STs firing on Endor’s fuzzy cuties. Everyone who detested them cheered while the rest of us sat there in horror. (Fun fact: this is the first movie my husband and I attended together, only we hadn’t met yet. Guess which team he was on.)

“HIMYM” presented The Ewok Line as one of its many jokes that rings true because it solidifies a generational quirk that many 30- and 40-somethings didn’t recognize as a commonality. It’s also one of the rare times that Barney was correct instead of purely ridiculous, although for him the Line was a secret metric he used to guess a woman’s age.

Regarding “Star Wars,” it was the smaller fault line that predicted the chasm created when people who had waited the lifespan of a driver's-licensed teenager for a new “Star Wars” chapter were greeted by Jake Lloyd listlessly yelling “Yippee!” That's the actor who had the unenviable task of playing the nine-year-old boy who would become Darth Vader.

I remain in the thumbs-down camp concerning the prequels, by the way; instead of rehashing the same quibbles, I’ll simply direct you  to Charles Taylor’s 1999 review for Salon .  What he said.

At the same time, today I can better appreciate that Lucas didn’t make those films for me. He made them for 1999’s children. As such the love/hate division has become less binary as the franchise’s mythology has expanded and we’ve all matured. To varying degrees. 

If people feel better about those prequels than we once did, credit Dave Filoni’s contributions to the “Star Wars” mythos by way of “The Clone Wars” and “Star Wars Rebels.”

That acclaimed pair of animated series thoughtfully filled in the gaping potholes left between the prequels, and fleshed out Anakin Skywalker’s backstory and psychological profile. Thanks to them, old-school devotees have a respectable consolation prize to enjoy with their children and grandchildren.

We might also contemplate our collective recognition of the dangerously intoxicating effects of nostalgia and the ways one’s loyalty to the original trilogy exemplifies that. “A New Hope,” “ The Empire Strikes Back” and “Jedi” collectively became a kind of morality North Star for Gen X, augmented by Joseph Campbell’s authentication of the films as spiritual parables.

Several essayists have described common characteristics of the “Star Wars” generation in terms of its existence at a technological turning point. We remember having to physically dial phone numbers using devices that plugged into walls, or when only a handful of movies came out each summer. The pioneering special effects Lucas used in “A New Hope” are part of setting those expectations. So were the action figures – the disenchanted devotees' Teddys. 

Remaining tethered to playthings and the imagination surrounding them for that long led millions of us to build backstories and worlds in our heads, some described on classic Kenner packaging and others teased into reality in official novels and comics. Which we also read – how else were we expected to quench our thirst for all those years? 


And the marketplace assigned value to all that. EBay came online a few years before Lucas reawakened the Force in theaters, letting the masses know our old lightsabers and plastic Tauntauns had actual monetary value above and beyond the price tag. 


Our disappointment may also be a function of how movie consumption has evolved. Those feelings of ownership over “Star Wars” probably have something to do with the fact that we actually owned copies of the original movies on VHS. Home video systems enabled superfans to rewatch the confrontation between Han Solo and Greedo repeatedly to determine who shot first. (It was Han, dammit!)

That also meant we could pick apart the smallest details about each scene to amplify some sense of profundity that, truth be told, probably wasn’t there in the first place. “The Phantom Menace” left no doubt of that, revealing Lucas to be less of some space opera guru than a guy more skilled at whiz-bang effects than character development or thoughtful exposition. 

The man gave the role of Queen Amidala to Natalie Portman, an actor who went on to win a best actress Oscar in the same year as “Field Trip,” and dulled down her abilities to the level of taxidermied fish puppetry. But you know who wasn’t scrutinizing Portman or Liam Neeson or Samuel L. Jackson for emotional range? People born on the post-1983 side of The Ewok Line.


Many “Phantom Menace” lovers embrace it as the first “Star Wars” film they saw in theaters, at an age close to Padme Amidala’s, 14, or Anakin’s; he was nine. Genre fantasy can be especially empowering for the young, enabling them to relate to somebody like themselves tossed into a position of trust and power instead of relegated to the booster seat.

Those folks loved the podracing scenes and the fact that boy Anakin saves the day by, in effect, hitting the off switch on a massive remote. Most embrace the silliness of Jar Jar as opposed to expanding his clumsy slapstick into something more sinister than his maker intended.

Did critics misjudge their cinematic worth back then? From a canonical perspective, perhaps. In 2024 we’re awash in “Star Wars” stories, some better than others, thanks to the seeds Lucas planted in those prequels. The best of them map new roads through this universe that call upon mature, thoughtful perspectives differing extensively from what Lucas endeavored to do two and half decades ago.

In terms of their overall execution . . . they’re still really, really not good. That said, not every critic bludgeoned “The Phantom Menace.”  This is what the late, revered Roger Ebert wrote in 1999 :

 At the risk of offending devotees of the Force, I will say that the stories of the "Star Wars" movies have always been space operas, and that the importance of the movies comes from their energy, their sense of fun, their colorful inventions and their state-of-the-art special effects. I do not attend with the hope of gaining insights into human behavior. Unlike many movies, these are made to be looked at more than listened to, and George Lucas and his collaborators have filled "The Phantom Menace" with wonderful visuals.

Some of them were cuddly and meant to be kid-friendly. That doesn’t exempt them from disdain or criticism but maybe all these years later we naysayers might observe more closely which side of the line we stand on and what that position says about us.

“Star Wars: Episode I – The Phantom Menace" is streaming on Disney+. "How I Met Your Mother" is streaming on Hulu.

about "Star Wars"

  • The "Star Wars" kids aren't alright
  • "Solo" tips the balance: There have now been more bad "Star Wars" movies than good
  • Star Wars creatures taught me empathy

Melanie McFarland is Salon's award-winning senior culture critic. Follow her on Twitter: @McTelevision


Deciphering Generation Names, Birth Years and Stereotypes


From sock hops and bell bottoms to low-rise jeans and TikTok dance challenges, each generation has many characteristics and trends that set it apart from the next.

Somewhere along the line, these eras picked up a bundle of different monikers and start and end dates. These generation names and years can be confusing, but there is a method to the madness. Well, most of the time.

For example, Generation Z — aka Gen-Z, aka Post-Millennials, aka iGeneration — begins in either 1994, 1996 or 1997, depending on who you're talking to. However, '97 is the most widely accepted starting year for Gen Zers.

Read on to learn more about the ways we've come to name and define different generations, from those who were children during World War I to younger generations who never knew life without social media platforms.

Defining Generation Names and Dates

  • The Greatest Generation (GI Generation): born 1901 to 1927
  • The Silent Generation: born 1928 to 1945
  • Baby Boomer Generation: born 1946 to 1964
  • Generation X: born 1965 to 1980
  • Millennial Generation: born 1981 to 1996
  • Generation Z or iGen: born 1997 to 2012
  • Generation Alpha: born 2013 to 2025

A generation is usually defined as a group of people born and living around the same time, typically spanning about 15 to 20 years. This grouping is based on shared historical, social and cultural experiences that shape their attitudes, values and behaviors.

These shared experiences forge a collective identity that sets each generation apart. For example, the start of the baby boom generation is often tied to the end of World War II , while millennials are typically marked by the rise of the internet and the new millennium.

These generation labels and dates aren't set in stone and can vary slightly depending on the source, but they generally reflect periods of substantial change that influence each generation's formative years.

Setting Generational Boundaries

One major contributor to defining each generation's boundaries is the Pew Research Center , which conducts extensive research and analysis on demographic, social and economic trends.

They establish generational boundaries based on significant historical events, technological advancements and cultural shifts, providing a framework for understanding how different cohorts experience and influence society.

Researchers, media and policymakers widely use definitions and reports from the Pew Research Center to analyze generational differences and their impacts on various aspects of life.

Now, let's look at each generation and its characteristics.

The GI generation is renowned for its resilience and civic duty, shaped by the profound challenges and major historical events of the early 20th century. In one generation, they experienced two world wars and a major economic downturn.

The Greatest Generation came of age during the Great Depression , which began in 1929 and lasted through much of the 1930s. The hardships they faced during this time — such as widespread unemployment and poverty — instilled values of frugality, diligence and perseverance.

Many members of this generation were children during World War I (1914 to 1918), which also influenced their early life experiences. Though too young to participate directly in the First World War, the global impact of the war and its aftermath (including economic instability and societal changes) would have been part of their formative environment.

Their significant involvement in World War II, either on the battlefield or on the home front, further defined their lives, cementing their legacy of resilience and sacrifice.

This generation's experiences of economic and global instability helped shape the mid-20th-century world, laying the foundations for modern societal structures. Their enduring influence is marked by their commitment to duty and ability to thrive despite early adversities.

While resilience is a hallmark, this group is sometimes seen as overly traditional and resistant to change. They have received criticism for clinging to established norms and authority without question and being too skeptical of new technologies and modern innovations.

The Silent Generation grew up during the Great Depression and World War II, events that significantly shaped their attitudes and behaviors. These early experiences instilled a sense of frugality, hard work and duty, which defined much of their approach to life.

So what's with the "silent" label? Well, this moniker is due to its members' perceived cultural and social traits. A 1951 Time magazine article coined the term, observing the generation's tendency to be more cautious, conformist and less vocal about their political and social views than their predecessors.

Growing up during the Depression and Second World War and coming of age during the early Cold War era, many members of this generation prioritized job security and stability, often shying away from activism and public dissent. They are usually viewed as a stabilizing force during times of change, unlike the boomers that followed, who are viewed as more vocal and rebellious.

Baby boomers are the demographic group born between 1946 and 1964. They are defined by the post-World War II baby boom . During this period, birth rates skyrocketed because of economic prosperity, returning soldiers eager to start families, supportive government policies, cultural optimism and the expansion of suburban housing.

This generation grew up during a time of widespread economic prosperity and rapid social change, including the Civil Rights Movement, the Sexual Revolution and the Vietnam War . These events shaped boomers into a generation known for challenging and reshaping societal structures.

Professionally, they are often credited with fueling the economic prosperity of the late 20th century. Boomers are known for their strong work ethic, which is usually characterized as work-centric, competitive and goal-oriented. As they entered the workforce, they also gained a reputation for changing the norms of work and retirement, pushing for more flexibility and a focus on work-life balance .

Socially, baby boomers have been described positively and negatively: They are seen as a generation that values individual freedom and responsibility. However, they are frequently accused of prioritizing their own financial security, contributing to housing market inflation and environmental degradation.

Their impact on politics and economics continues to be significant as they age, especially regarding social security systems and healthcare services, given their vast numbers and active involvement in civic duties.

In recent years, many baby boomers have begun to work past the traditional retirement age , either by necessity or choice. This shift impacts societal views on aging and retirement, setting new standards for future generations.

This generation came of age during shifting societal values and technological advancements, notably the rise of personal computing and the internet.

Growing up during the 1970s and 1980s, Gen Xers witnessed significant political and economic changes, including the end of the Cold War and the 1987 stock market crash (and the implosion of the dot-com bubble as young adults).

Often characterized as "latchkey" kids , many Gen Xers were raised in dual-income or single-parent households. This factor contributed to their reputation for being independent, resourceful and self-reliant. This independence is sometimes seen as cynicism or skepticism, mainly because they experienced several economic downturns and corporate downsizing during their formative years.

In the workplace, Gen Xers are known for valuing a work-life balance, pushing back against the work-centric mentality of their boomer parents. They were among the first to challenge the corporate ladder concept, favoring a more flexible and results-oriented work environment.

Culturally, Gen X has made substantial contributions to music, art and technology. They drove the grunge music movement and the growth of independent film. Gen Xers were also the first generation to grow up with video games and embrace digital technology on a significant scale.

Today, as they move into middle age, Generation Xers are often considered the "middle child" of generations, overshadowed by the larger boomer and millennial generations.

More rarely referred to as Generation Y, this generation has been shaped by unique circumstances, including technological advancements, economic volatility and global connectivity, which have significantly influenced their values, behaviors and lifestyle choices.

Millennials grew up during the rapid expansion of the internet and digital technology, making them highly adept at communicating and processing information through digital platforms. This tech-savvy generation has driven significant changes in how people connect, work and consume media, leading to the rise of social media, the gig economy and streaming entertainment.

Economically, many millennials entered the job market during the Great Recession of the late 2000s, which has had long-lasting effects on their career paths and financial stability. This timing has often led to challenges such as higher levels of student debt and difficulties in achieving traditional milestones like home ownership and marriage.

This generation has faced criticism for being perceived as entitled, overly dependent on technology, lacking work ethic and exhibiting a sense of impatience and craving for instant gratification, often attributed to their upbringing in a rapidly evolving digital age and economic challenges.

Socially and politically, millennials are known for their progressive values. They prioritize issues like climate change , social justice and inclusivity. They are also more likely than previous generations to advocate for government intervention in areas such as health care and environmental regulation.

As millennials mature into key societal roles, their influence continues to grow, reshaping politics, culture and the economy. Their approach to life and work, including a preference for flexible work arrangements and a desire for a meaningful career that aligns with their values, is slowly changing traditional norms.

Generation Z, often called Gen Z, was raised in the era of smartphones and social media, which profoundly influences their communication habits, information consumption and social interactions.

Gen Zers have come of age during significant social, environmental and technological change. These shifts include global challenges such as climate change, economic inequality and political polarization, which have shaped this generation's worldview to be pragmatic and inclusive yet cautious about the future.

In terms of technology, they fully embrace the digital age and seamlessly integrate digital tools into their daily lives for education, entertainment and socializing.

Educationally and economically, Gen Z faces unique challenges, including the high costs of education and the uncertainties of job markets influenced by automation and the gig economy. These factors drive many in Gen Z to value practical skills and job security, pushing them toward entrepreneurship and side hustles as ways to achieve financial stability.

Socially and politically, Gen Z is characterized by a strong sense of justice and a commitment to advocacy. They often use digital platforms to mobilize around issues such as climate change, mental health and inclusivity. Their activism is frequently aimed at effecting change at both the grassroots and global levels, illustrating their commitment to positively impacting the world.

Gen Zers are often criticized for being overly reliant on technology, having short attention spans, displaying entitlement and impatience, being overly sensitive in their focus on social justice and lacking the strong work ethic of previous generations.

Gen Alpha is the first generation born entirely in the 21st century, and its upbringing is deeply intertwined with technology. From a very young age, Gen A has been exposed to smartphones, tablets and AI-driven technologies, making it the most technologically immersed generation from the outset.

There is speculation that the digital natives' overdependence on digital devices could reduce face-to-face social skills and attention spans.

The parenting and education of Generation Alpha are significantly influenced by the experiences of millennials, who are their primary parents. This generational connection emphasizes values like inclusivity, environmental awareness and the use of technology for socializing and learning.

With the pervasive presence of advanced technology, Gen Alpha children are likely to experience personalized learning environments and digital play as integral components of their development.

Socially and culturally, Generation Alpha is growing up in a world of global connectivity and diverse communities. Issues like climate change, sustainability and social justice are expected to be central themes in their educational and developmental narratives.

Moreover, the COVID-19 pandemic has marked a significant part of their early years, likely impacting their schooling, social interactions and family dynamics in profound ways.

Gen Alpha is anticipated to blend digital and physical experiences further as they mature, leveraging technology in innovative ways that will shape their work, entertainment and social relations. Their potential influence on future cultural, technological and environmental advancements is vast, as they will continue to build on the digital foundation laid by older generations.

We created this article in conjunction with AI technology, then made sure it was fact-checked and edited by a HowStuffWorks editor.



What if aliens exist—but they're just hiding from us? The Dark Forest theory, explained

The chilling theory depicted in the Netflix series ‘3 Body Problem’ is just one explanation for our lack of encounters with extraterrestrial intelligence.

The famed Fermi paradox has bewitched astronomers for more than half a century. To put it concisely: If the cosmos is nearly 14 billion years old, then where are all the interstellar societies? Why haven’t they popped over to say hello? Myriad solutions to this conundrum have been proposed, but perhaps none more chilling than the Dark Forest theory.

As the supposition holds, the reason we can’t see these alien civilizations is because they’re all in hiding. Unlike humanity—whose radio transmissions have long echoed throughout our local galactic neighborhood—these societies have all concluded that it’s simply too dangerous to broadcast their location to potentially hostile neighbors.

It's a sobering thought, and one that’s gaining attention now that it’s being depicted in 3 Body Problem, a Netflix adaptation of author Cixin Liu’s literary trilogy. But is it a plausible solution to the Fermi paradox? Of all the posited answers, experts say the Dark Forest hypothesis seems among the less likely to be correct.

It’s possible that several extraterrestrial intelligences, or ETIs, would conceal themselves. But it’s improbable that all of them will come to the same fear-based conclusion and hide away.

( In the hunt for alien life, this planet just became a top suspect .)

“We don’t even see that same behavior on cultures here on Earth,” says Moiya McTier , an astrophysicist, author, and folklorist. Some ETIs might have members that all act in perfect unison. But others will have divergent, independently behaving groups—some who trend more toward aggression or pacifism, curiosity, or reclusiveness. If one of them waves hello, then that Dark Forest will get a brightly lit campfire for us to see.

But technically anything is possible considering we don’t have any evidence for ETIs to begin with. Perhaps everyone really is hiding. Maybe there truly is a threat lurking out there, somewhere in the dark. And maybe humanity just hasn’t realized it yet.

The case for the Dark Forest theory

The Fermi paradox was casually raised by physicist Enrico Fermi during a lunchtime chat way back in 1950. It has many nuances, but at its heart is this central premise: Our solar system is just 4.6 billion years old, whereas the universe is 13.8 billion years old. That is plenty of time for life on other planets to develop into technologically advanced societies, ones that could set forth across the sea of stars and create outposts or new societies on countless worlds.
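
The timescale argument behind the paradox is easy to check with rough arithmetic. The sketch below is ours, not Fermi’s or the article’s; the probe speed, galaxy size, and settlement overhead are assumed round numbers chosen only to show the orders of magnitude involved.

```python
# Back-of-envelope estimate: how long would it take to spread across the
# Milky Way, compared with the time available? All inputs are assumptions.
GALAXY_DIAMETER_LY = 100_000    # approximate Milky Way diameter, light-years
PROBE_SPEED_FRACTION_C = 0.01   # assume ships travel at 1% of light speed
SETTLEMENT_OVERHEAD = 10        # assume stops and rebuilding stretch the trip tenfold

crossing_time_yr = GALAXY_DIAMETER_LY / PROBE_SPEED_FRACTION_C
colonization_time_yr = crossing_time_yr * SETTLEMENT_OVERHEAD

universe_age_yr = 13.8e9
solar_system_age_yr = 4.6e9
head_start_yr = universe_age_yr - solar_system_age_yr

print(f"Galaxy crossing time:        {crossing_time_yr:.1e} years")     # ~1e7
print(f"With settlement overhead:    {colonization_time_yr:.1e} years")  # ~1e8
print(f"Head start of older systems: {head_start_yr:.1e} years")        # ~9.2e9
```

Even under these deliberately slow assumptions, a colonization wave needs on the order of a hundred million years, a small fraction of the roughly nine billion years by which older planetary systems predate our own. That mismatch is what gives the paradox its force.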

But we have yet to find any sign of these societies. So where is everybody?

“There are so many possible overlapping solutions to the Fermi paradox,” says McTier. Is space simply too vast for alien societies to have reached Earth yet? Do all of them destroy themselves before becoming interstellar? Are we the only technologically advanced society in our corner of the cosmos? Is the evolution of life vanishingly rare?

(This man launched the quest to find alien intelligence. It changed astronomy.)

“All the Fermi paradox tells you is that civilizations are rare. It doesn’t tell you why they’re rare,” says Ian Crawford, a planetary scientist and astrobiologist at Birkbeck, University of London. “One of the solutions is: yeah, they’re all out there, but they’re all hiding. If they give themselves away, someone will come and destroy them.”

The idea that these spacefaring aliens are simply reluctant to reveal themselves has featured in sci-fi storytelling for many decades. Liu, in his 2008 book, gave the hypothesis a catchy name. He describes the universe as a dark forest, wherein each alien society is like a fearful, armed hunter gingerly moving forth. If that hunter finds “other life—another hunter, an angel or a demon, a delicate infant or a tottering old man, a fairy or a demigod—there’s only one thing he can do: open fire and eliminate them. In this forest, hell is other people.”

Being fearful has its evolutionary benefits: We may flinch at a strange noise in the night, and although most of the time it’s harmless, our caution may save our lives the one time it’s coming from a genuine threat.

“It can’t be denied that there is some survival value in being aggressive,” says Seth Shostak, a senior astronomer at the Search for Extraterrestrial Intelligence Institute in California. Preemptively take out the competition, and you may sleep more securely while getting extra resources. The history of humanity—and its present—is littered with grim examples of this.

[Image: a massive cluster of bright yellow and white stars.]

The case against the Dark Forest theory

Thankfully, the Dark Forest theory has a plethora of issues that are hard to resolve—the most obvious being that it’s extremely difficult to conceal a technologically advanced world.

Since long before actively searching for ETIs became a global scientific practice, radio signals from Earth’s quotidian intraspecies communications have been emanating into the void—something that a nearby alien society hoping to find a new ally, or a fresh target, could handily spot.

(How many alien civilizations are out there? A galactic survey holds a clue.)

Even as we’ve begun to grasp the hypothetical threat, it's not like we’re about to go completely silent, either. “We’ve never given the slightest thought to turning off all the radars because it might be dangerous,” says Shostak. “It’s just not gonna happen.”

Even if an ETI tried to conceal itself, its efforts may not be sophisticated enough to work. Some alien societies may have found a way to stamp out all their noise, but others may still be accidentally giving the game away without realizing it. “The way cavemen might hide is quite different from the way Klingons might hide,” says Shostak.

The forest analogy also falls apart when you consider the true nature of the universe—or simply our own ginormous galaxy. The woods can seem huge and endless in the dark, but that’s peanuts compared to space.

“There may be hostile aliens out there,” says Shostak. But the distances between them are likely to be unfathomably vast, so much so that the idea they would feel the need to preemptively attack one another seems odd. Even if they feared each other, the expanse between them means that they wouldn’t likely need to compete for resources; each would have near-limitless worlds, asteroids, and even stars to exploit.
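
That “unfathomably vast” claim can also be put into rough numbers. The sketch below is ours, with made-up inputs: it assumes some number of civilizations scattered evenly across the Milky Way’s disk and estimates the typical spacing between neighbors as the square root of the area each one would occupy.

```python
import math

# Illustrative only: N and the disk radius are assumed round numbers.
N_CIVILIZATIONS = 10_000
DISK_RADIUS_LY = 50_000

disk_area = math.pi * DISK_RADIUS_LY ** 2
typical_separation_ly = math.sqrt(disk_area / N_CIVILIZATIONS)

print(f"Typical separation: {typical_separation_ly:,.0f} light-years")
print(f"One-way trip at 1% of light speed: {typical_separation_ly / 0.01:,.0f} years")
```

Even with ten thousand neighbors sharing the galaxy, the nearest one would typically sit hundreds of light-years away, and a preemptive strike crawling along at a hundredth of light speed would take on the order of a hundred thousand years to arrive; competing over resources across such gulfs makes little strategic sense.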

The fact that Earth is, by universal standards, a young, noisy, and vulnerable technological society also implies, by default, that if there are ETIs out there, they cannot all be instinctively aggressive; otherwise, we would likely have been silenced already.

“If there are so many civilizations, and some of them could destroy us, then we have to explain how that has not happened,” says Karim Jebari, a researcher at the Institute for Futures Studies in Stockholm, Sweden. “Maybe there’s a Galactic Empire that keeps the hostilities down, or maybe it's really difficult… to attack each other over interstellar distances.”

(What we know from decades of UFO government investigations.)

Or, as Jebari has suggested in a recent paper, ETIs may have reached the same logical conclusion: that they still exist because other advanced alien societies have chosen not to smite them, perhaps hoping instead to have a mutually beneficial conversation. “We have no reason to attack them in a preemptive strike,” says Jebari. “If they’re smart… maybe they’re thinking the same thing about us.”

That all ETIs would share the very human instinct of assuming the worst about an unknown entity is also a massive presumption.

“For me, [the Dark Forest] is one of the less compelling explanations for the Fermi paradox, because it relies on a few anthropocentric assumptions that I don’t think are fair,” says McTier. Fear is a powerful thing. But so is curiosity.

The nightmare scenario

That doesn’t necessarily mean that the Dark Forest hypothesis is a nonstarter. The problem is that addressing the holes in the theory requires amping up the terror factor.

“The nightmare scenario is that suppose those that are hiding are right,” says Crawford. “Suppose that, sometime in the history of the galaxy, a technological civilization… decided that whenever planets with life or technology were found, they were going to destroy it.”

In other words, if extermination for extermination’s sake was the goal, then the Dark Forest seems more plausible. “If something like that has been going on in the history of the galaxy, then yes it would explain the Fermi paradox,” Crawford says.

So sure, maybe our corner of the cosmos is quiet because life getting started in the first place is an extreme rarity. Perhaps it’s lonely out here because alien societies have a bad habit of annihilating themselves once they discover something like atomic weapons.

Or, just maybe, “we don’t see them because they’re not there,” says Crawford—because a slaughtering entity is going from star to star extinguishing any sign of life. “That’s the really scary thing.”
