Lingualyzer: A computational linguistic tool for multilingual and multidimensional text analysis

  • Original Manuscript
  • Open access
  • Published: 29 November 2023


  • Guido M. Linders (ORCID: orcid.org/0000-0002-2252-6260) 1,2 &
  • Max M. Louwerse (ORCID: orcid.org/0000-0003-0328-7070) 1


A Publisher Correction to this article was published on 13 December 2023


Most natural language models and tools are restricted to one language, typically English. For researchers in the behavioral sciences investigating languages other than English, and for those researchers who would like to make cross-linguistic comparisons, hardly any computational linguistic tools exist, particularly none for those researchers who lack deep computational linguistic knowledge or programming skills. Yet, for interdisciplinary researchers in a variety of fields, ranging from psycholinguistics, social psychology, cognitive psychology, and education to literary studies, there certainly is a need for such a cross-linguistic tool. In the current paper, we present Lingualyzer (https://lingualyzer.com), an easily accessible tool that analyzes text at three different text levels (sentence, paragraph, document) and includes 351 multidimensional linguistic measures, available in 41 different languages. This paper gives an overview of Lingualyzer, categorizes its hundreds of measures, demonstrates how it distinguishes itself from other text quantification tools, explains how it can be used, and provides validations. Lingualyzer is freely accessible for scientific purposes using an intuitive and easy-to-use interface.


Introduction

For most research in cognitive and social psychology, psycholinguistics, and cognitive science at large, text analysis has primarily focused on a very small and very specific part of human language, that of formal, written English from a WEIRD (Western, Educated, Industrialized, Rich and Democratic) population (Blasi et al., 2022; Henrich et al., 2010; Kučera & Mehl, 2022; Levisen, 2019). Most experiments conducted in these disciplines use English stimuli, and most linguistic analyses and computational linguistic tools are based on the English language. Yet, the focus on English is rather surprising. English is only one of over 7000 languages in the world, and not even the one most commonly used by native speakers (Eberhard et al., 2022). Moreover, the overwhelming focus on English might even hinder progress in these fields, due to premature generalizations across languages based on just English-language studies (Blasi et al., 2022), and the Anglocentric bias in creating and testing hypotheses and theories (Levisen, 2019). It is therefore unlikely that all findings in the behavioral sciences can be generalized across languages (Evans & Levinson, 2009), as demonstrated, for example, in cross-linguistic reading experiments (Li et al., 2022; Share, 2008), cross-linguistic visual perception experiments (Lupyan et al., 2020), and analyses of backchannel behavior across languages (Maynard, 1986; Zellers, 2021). At the very least, whether findings obtained are generalizable beyond English requires an investigation into the extent to which, and the ways in which, languages differ from one another. Because languages vary widely in their statistical regularities, and cultures strongly influence the interpretation of the results of (quantitative) linguistic analyses, such analyses do not necessarily extend beyond the findings for the specific language or linguistic population under investigation (Blasi et al., 2022; Kučera & Mehl, 2022; Levisen, 2019; Louwerse, 2021).

Fortunately, there are some, albeit few, computational tools that focus on individual languages other than English. For instance, several language-specific tools have been created that can perform a large range of general natural language processing (NLP) tasks, such as word tokenization and segmentation, part-of-speech (PoS) tagging and named-entity recognition. Examples are CAMeL Tools for Arabic (Obeid et al., 2020), BNLP for Bengali (Sarker, 2021), FudanNLP for Chinese (Qiu et al., 2013), and EstNLTK for Estonian (Laur et al., 2020). However, these tools tend to be very language-specific. Extending these tools to other languages or comparing texts across different languages is difficult (Bender, 2009). The Linguistic Inquiry and Word Count (LIWC) tool, for instance, quantifies word use through a dictionary of (English) words which are grouped into primarily psychologically based dimensions (Tausczik & Pennebaker, 2010). LIWC thus heavily relies on a handcrafted dictionary that is only available in English. Attempts have been made to manually translate this dictionary into many other languages: Arabic, Brazilian Portuguese, Chinese, Dutch, French, German, Italian, Japanese, Serbian, Spanish, Romanian and Russian (see Kučera & Mehl, 2022, for an overview). However, manual translations are non-trivial and time-consuming (Boyd et al., 2022). Moreover, dictionaries in different languages vary significantly in terms of the number of words they contain (Kučera & Mehl, 2022). Perhaps more importantly, it is unclear to what extent these dictionaries are really comparable across languages. For example, a parallel corpus of TED talks was analyzed in four different languages using four different translations of the LIWC dictionary to investigate the comparability across languages (Dudău & Sava, 2021). The results varied across language pairs and even across word groups, raising the question of to what extent this variation reflects cross-linguistic differences rather than differences across the dictionaries.

The solution to the problem of discrepancies across languages is to not use a (top-down) dictionary approach but to rely on a (bottom-up) data-driven approach. However, data-driven tools can also be hard to extend beyond English, due to the lack of natural language training data that these tools need, and differences in annotation across languages. For example, recent neural network language models rely on very large amounts of language data. Yet data quality and quantity are both important factors in the performance of these models in different natural language processing tasks, highlighting the importance of collecting and annotating large amounts of high-quality natural language data for languages beyond English and especially for low-resource languages (Artetxe et al., 2022 ; Magueresse et al., 2020 ; Rae et al., 2021 ).

Fortunately, in recent years more resources for languages other than English have been made available. Most notable is the creation of the Universal Dependencies (UD) treebank collection (Nivre et al., 2020). This collection contains natural language data in many languages, annotated with a universal set of part-of-speech tags and morphological features, and following a universal approach to tokenization, PoS tagging, morphological feature annotation, lemmatization, dependency parsing, and named entity recognition. Several tools have been trained on the UD treebanks to automatically process and annotate new texts. Some prominent ones, covering over 60 languages, are Stanza (Qi et al., 2020) and UDPipe (Straka & Straková, 2017; Straka et al., 2016). Other resources that have been made available are multilingual word vectors, available in 157 languages (Grave et al., 2018), and large-scale multilingual masked language models trained on 100 languages (Conneau et al., 2020). These resources utilize large amounts of publicly available data in many languages, such as data from Wikipedia and Common Crawl.

What these multilingual NLP tools lack, however, is an interface that allows users who do not have a strong background in programming and NLP to extract relevant information from (multilingual) language datasets and configure them in such a way that they can serve as measures of interest. It is exactly for that reason that quantitative text analysis tools such as LIWC (Tausczik & Pennebaker, 2010 ) and Coh-Metrix (Graesser et al., 2004 ; McNamara et al., 2014 ) were developed.

Quantitative text analysis converts unstructured text into quantifiable (i.e., countable or measurable) variables (Roberts, 2000 ), thereby leveraging the many statistical regularities that are present in human language (Gibson et al., 2019 ). These statistical regularities are fundamental in understanding language (Louwerse, 2011 , 2018 ). However, these regularities are not static and differ across writers and speakers (Pennebaker & King, 1999 ), as well as across language registers and genres (Biber, 1988 ; Louwerse et al., 2004 ).

The quantification of language use can provide important insights into psychological processes (Linders & Louwerse, 2023) and the mental state of the language user (Tausczik & Pennebaker, 2010), but also into other characteristics of the language user, such as age and gender (Maslennikova et al., 2019; Schler et al., 2005), the idiolect and sociolect of an author (Louwerse, 2004), the native language of a writer (Malmasi et al., 2017), and even demographic information (Alvero et al., 2021). What’s more, the regularities in language differ enough between language users that it is possible to identify the author of a piece of text (Juola, 2008; Türkoğlu et al., 2007). Quantitative text analysis is also used for stimulus creation (Cruz Neri & Retelsdorf, 2022a), validation (Trevisan & García, 2019) and analysis (Dodell-Feder et al., 2011). Finally, the resulting quantification is used in computational and statistical models to infer and understand latent properties of texts, such as the truth value of political statements (Mihalcea & Strapparava, 2009; Rashkin et al., 2017), whether social media texts contain humor or irony (Barbieri & Saggion, 2014; Reyes et al., 2012) or hate speech (Fortuna & Nunes, 2018), and the readability of a text (McNamara et al., 2012). In sum, there is a variety of research purposes for which quantitative text analysis tools are desirable.

To support multilingual and multidimensional text analysis, we created the computational linguistic tool Lingualyzer (https://lingualyzer.com). Specifically, we had five goals in mind. First, the tool had to support languages beyond English and beyond the Indo-European language family (as many languages as were feasible), and allow for comparable output in all languages. Second, the tool had to be accessible to researchers who do not necessarily have knowledge of NLP or programming. Concretely, this meant providing users with an interface where they can enter unstructured and unprocessed text and with a few clicks obtain the values for a large number of different measures at different text levels (i.e., sentence, paragraph, and document) and linguistic dimensions (e.g., lexical, syntactic, and semantic dimensions). Third, we strived to include a large and varied set of reliable linguistic dimensions and linguistic measures to maximize the value of the tool for different purposes (e.g., cross-linguistic comparisons, text characterization, stimulus validation). Fourth, the linguistic features included in the tool needed to be motivated theoretically. Finally, to make the tool readily available, we aimed for a web interface, freely accessible for scientific purposes.

The current paper is structured into three parts. First, we provide an overview of the existing tools in the literature to position Lingualyzer. Next, we present an overview of Lingualyzer and the linguistic measures and dimensions it covers. Finally, we provide an evaluation of Lingualyzer in terms of instrument reliability and instrument validity.

Tools for a quantitative text analysis

What text analysis tools are already available for researchers working in the behavioral sciences? It is difficult to provide an exhaustive overview of existing tools, given the variety in measures and the focus of the available tools, the variety of single languages they cover, and the variety of publications in journals from different disciplines and in different languages (other than English). Without aiming for an exhaustive overview, but rather to get a general idea of the variety of the available quantitative text analysis tools, we provide an overview whereby we restricted ourselves to only include those tools (1) that had a clear focus on quantitative text analysis, (2) that covered the whole processing pipeline from processing the raw, unstructured text to the quantitative analysis, (3) with an interface accessible to the user, rather than the developer of the tools, (4) that were not derivatives or subsets of other tools, for example tools that were translated into other languages (e.g., Scarton & Aluísio, 2010 ; Van Wissen & Boot, 2017 ), and (5) that had more than three measures, to exclude tools that focus on a single natural language processing or text classification task (e.g., Thelwall et al., 2010 ).

An overview of quantitative text analysis tools is given in Table 1. The foci of these tools vary. Some focus on text characterization to measure the variation in language use with respect to different populations (Brunato et al., 2020; Francis & Pennebaker, 1992; McTavish & Pirro, 1990), language registers (Biber, 1988, as reimplemented in Nini, 2019) or aspects of cognition (Tuckute et al., 2022). Others focus on specific text characteristics, such as sentiment and verbal tone (Crossley et al., 2017; North et al., 1972). Yet other tools focus on text complexity in terms of readability (Bengoetxea & Gonzales-Dios, 2021; Dascalu et al., 2013), text cohesion (Crossley et al., 2016; Dascalu et al., 2013; Graesser et al., 2004) or syntactic complexity (Kyle, 2016; Lu, 2010). It is, however, important to note that tools can be used for multiple purposes. For example, T-Scan (Pander Maat et al., 2014) and LATIC (Cruz Neri et al., 2022b) have been designed to quantify both text characteristics and complexity, while Coh-Metrix has been used both to characterize variation in language registers (Louwerse et al., 2004) and for authorship attribution (McCarthy et al., 2006).

Table 1 marks whether the approach of each tool is primarily dictionary-based or data-driven. Dictionary-based tools are those that include a dictionary or database to categorize words and consequently categorize a text. Data-driven methods instead rely on patterns in the text and quantify those using computational linguistic and statistical models. Hybrid approaches use a mixture of both dictionary-based and data-driven approaches. In general, more recently developed tools tend to be more data-driven or hybrid, whereas older tools tend to be more dictionary-based. While it is difficult to create a dictionary-based tool in multiple languages because individual dictionaries or databases need to be constructed for each language, it is at the same time difficult to create data-driven tools in multiple languages because that would require natural language data annotated in a unified manner across the different languages.

The number of measures in Table 1 indicates the number of quantifiable values or linguistic variables that each tool measures. These numbers vary widely between tools, ranging from five (Diction) to approximately 472 (T-Scan). Most tools contain between 50 and 200 measures. A comparison of absolute numbers between tools is not very meaningful, because the measures differ widely in complexity, ranging from simple word or word group counts to average semantic similarity scores of adjacent paragraphs.

Table 1 furthermore marks the languages supported by the original version of each tool. Note that this does not include any separate translations of the original tools or derivative tools that use part of the measures from the original tool. Most tools only support English, and some are even more specific, focusing only on written English (Crossley et al., 2016 ) or even written English as a second language (Kyle, 2016 ; Lu, 2010 ). There are, however, some tools that cover more than just the English language (Bengoetxea & Gonzales-Dios, 2021 ; Brunato et al., 2020 ; Cruz Neri et al., 2022b ; Dascalu et al., 2013 ). These tools may support one other language, but with the exception of Profiling-UD (Brunato et al., 2020 ), the languages beyond English are only supported by a subset of the measures. Moreover, due to differences in annotation algorithms and tagsets being used for the different languages, it is virtually impossible to compare the output of the overlapping measures across the languages.

The tools in Table 1 are ordered by the year the first version was released. Many tools have seen improvements over time, and as such, we have also added the most recent reference that provides information on the most recent changes or additions.

Lingualyzer

Lingualyzer is a multilingual and multidimensional text analysis tool available for scientific purposes, benefiting behavioral science researchers who may not have a strong NLP programming background or otherwise would like to use an easily accessible tool. Lingualyzer computes a large number of linguistic measures across different dimensions and levels of analysis. This section explains Lingualyzer, starting with an overview of the languages for which it is available. We then outline how Lingualyzer processes texts and give an overview of the different dimensions that Lingualyzer captures. We end this section with an explanation of how Lingualyzer can be used.

Table 2 summarizes the 41 languages, spanning ten language families, for which Lingualyzer is currently available. Within the Indo-European family alone, the tool covers seven different branches. One of the core principles of Lingualyzer is a uniform treatment of text regardless of its language. This means that all measures are available in all languages. Consequently, all values are calculated using exactly the same computations, and annotations are performed based on the same strategies and schemes, which is not the case for most other multilingual quantitative text analysis tools, such as ReaderBench (Dascalu et al., 2013), MultiAzterTest (Bengoetxea & Gonzales-Dios, 2021) and LATIC (Cruz Neri et al., 2022b). The output of all measures is therefore comparable across languages.

Natural language processing resources

For a multilingual text analysis tool that can be used for cross-linguistic analyses, consistency across languages is important. Text processing therefore needs to be unified across languages, using an NLP pipeline that covers a diversity of languages. Fortunately, two such NLP pipelines already exist: Stanza (Qi et al., 2020) and UDPipe (Straka & Straková, 2017). Both Stanza and UDPipe are open-source toolkits available in Python that can perform many different natural language processing tasks on raw texts, including word tokenization, lemmatization, PoS tagging, named entity recognition, and dependency parsing. Both were developed with the goal of creating a language-agnostic tool that is available in as many languages as possible (currently over 60). The models of both toolkits are trained on the Universal Dependencies (UD) treebanks (Nivre et al., 2020) and come with a framework such that new models can be trained for new languages with relative ease.

Despite their similarities, there are differences between the two frameworks that make Stanza preferable. Most importantly, Stanza makes use of deep neural network models to reach state-of-the-art performance on the core natural language processing tasks, whereas UDPipe combines deep neural networks with other machine learning methods, reaching very similar, but generally slightly lower, performance compared with Stanza on all NLP tasks (cf. Qi et al., 2020). Moreover, because UDPipe is slightly older, its models were trained on an older version of the UD treebanks.

The UD framework provides a universal approach to annotating texts on different morphosyntactic levels. First, word boundaries are determined. Words are then annotated with a lemma, a bare form of the word with all morphology removed, and with a syntactic label. The UD framework makes use of a universal set of 17 different PoS tags, which can be summarized into open class tags (i.e., adjectives, adverbs, interjections, nouns, proper nouns and verbs), closed class tags (i.e., adpositions, auxiliaries, coordinating and subordinating conjunctions, determiners, numerals, particles, and pronouns) and a group of other tags, including punctuation markers, symbols and a rest category. Finally, words are annotated with one or multiple morphosyntactic properties. These are labels that indicate different lexical and grammatical features that are overtly marked on the words. The UD framework supports 24 different classes which are further subdivided into individual features. Features are annotated using a presence value, such that a word either does contain a certain feature or does not. Because not all features are present or annotated in every language, we have made a selection of the most universal features and included those in Lingualyzer. These include personal, demonstrative, and interrogative pronouns, singular and plural words, definite and indefinite words, finite verbs, infinitives and verbal adjectives, present and past tense markers, and passive voice markers.
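To make the feature annotation concrete: UD treebanks serialize these morphosyntactic properties as a compact feature string, such as `Definite=Def|Number=Sing`, with `_` denoting the absence of features. A minimal Python sketch of reading such a string — illustrative only, not Lingualyzer's actual code — could look as follows:

```python
def parse_feats(feats):
    """Parse a UD morphological feature string such as
    'Definite=Def|Number=Sing' into a dict.

    '_' (or an empty value) means the word carries no annotated
    features. Multi-valued features (e.g., 'Case=Acc,Nom') are kept
    as comma-separated strings, as in the UD format itself.
    """
    if feats in ("", "_", None):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))
```

A measure such as the count of passive voice markers can then be computed by checking, per word, whether `parse_feats(...)` contains `Voice=Pass`.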

Importantly, the UD framework is word- or token-based, with all lemmas, part-of-speech tags, and morphosyntactic properties annotated at the word level. Consequently, Stanza is also word-based, and in turn Lingualyzer quantifies most units at a word or token level. In most languages words and tokens are identical and can be identified by whitespaces around the word. There are, however, two exceptions. First, there are languages, such as Chinese, Japanese, and Vietnamese, that do not use whitespaces to mark word boundaries. Footnote 1 In these cases, Stanza uses a word segmentation algorithm to decide the word boundaries. Words and tokens are therefore still identical, but they cannot be directly observed from the text through whitespace boundaries. Word segmentation is also used in other languages, albeit on a much smaller scale. For example, in English, possessive markers (marked by ’s) are seen as separate word tokens by Stanza, and some compound words, such as government-friendly, are split into two separate word tokens. Second, some languages, such as French, Italian, and Spanish, use contractions or other mechanisms to combine multiple words into a single whitespace-bounded word. In such languages, Stanza by default requires a multiword expression token identifier (Qi et al., 2020). In such cases, words and tokens will differ from each other, since each multiword comprises multiple tokens. Each token, instead of each word, will then be annotated with a lemma, a PoS tag, and morphosyntactic features. Note that this differs from the segmentation of, for example, compound words in English, where each token in the compound is seen as a separate word, and hence no distinction between words and tokens is made.

Even though Stanza is a useful tool on its own, it is not directly accessible for behavioral science researchers with no or limited background in NLP or programming. Moreover, Stanza does not provide insights into different linguistic dimensions. Lingualyzer is therefore not a copy of Stanza but computes a large set of measures based on the Stanza tool in order to quantify text (sentence, paragraph, document) on a wide variety of linguistic dimensions. In other words, Stanza provides the linguistic annotations, such as tokenization and PoS tagging, that Lingualyzer then uses to compute different measures that give insights into different linguistic dimensions.

We enhanced the processed text from Stanza with two other language-specific sources to maximize the dimensions of text analysis: word vectors and a database of standardized word frequencies. Word vectors contain valuable information on the distributional properties of words. We used the 300-dimensional fastText word vectors, which were created from Wikipedia and Common Crawl data (Grave et al., 2018). These word vectors are trained on character n-grams and can therefore also generate an approximate vector representation for words for which no word vector is stored. However, storing the trained word vector model in memory for each language is not feasible because of the large size of the models and the many languages in Lingualyzer. We therefore opted for storing all word vectors for complete words in a database and decided not to approximate vectors for words that are not in the database.
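Lingualyzer's exact vector computations are not spelled out here, but a common way such stored vectors are used for semantic similarity measures (e.g., similarity of adjacent paragraphs) is to average the vectors of the words in each segment and compare the averages with cosine similarity. The following is a minimal, self-contained sketch under that assumption; the function names and toy vectors are our own:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def segment_vector(words, vectors):
    """Average the stored vectors of the words in a segment,
    skipping words not found in the vector database (mirroring the
    decision above not to approximate missing vectors)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```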

Standardized word frequencies provide valuable information on the general use of words beyond the text under investigation. We used the frequency lists from WorldLex for each available language (Gimenes & New, 2016). These frequency lists are based on three different large language sources: news articles, blogs, and Twitter data. Because the Twitter data were not available for all languages, we decided not to use them, to maintain consistency across languages. For the remaining sources, absolute and normalized frequencies and contextual diversity measures were at our disposal. We only used the normalized values, which represent the frequency per million words and the percentage of documents a word occurred in, respectively. The frequency lists for each source contained a minimum of 1.8 million words and 41,000 documents. We removed all words in the lists that did not reach a threshold frequency of once per million words, to mitigate any effects of the size of the frequency lists and to keep the quantification steps entirely similar and unbiased across languages. The advantage of having a standardized list with a frequency threshold is thus that the lists, and in turn the output of the measures using these lists, are comparable across languages.
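The normalization and thresholding described above can be sketched as follows; this is a simplified illustration with invented toy values, not the WorldLex data or Lingualyzer's own code:

```python
def normalize_frequencies(counts):
    """Convert raw word counts into frequencies per million words."""
    total = sum(counts.values())
    return {w: c / total * 1_000_000 for w, c in counts.items()}

def filter_frequency_list(freq_per_million, threshold=1.0):
    """Keep only words occurring at least `threshold` times per
    million words, so that differences in corpus size across
    languages affect the resulting lists less."""
    return {w: f for w, f in freq_per_million.items() if f >= threshold}
```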

Levels of analysis

A text can be described as a complex collection of smaller linguistic segments, from morphological units, to words, sentences, and paragraphs, to entire documents. We implemented three different levels of analysis on which a text can be analyzed: the sentence level, the paragraph level, and the document level. Because sentence boundaries are denoted differently across languages, Stanza is used here for their identification. Paragraphs on the other hand are identified through a double newline separator in the document.
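Paragraph identification via the double-newline separator is straightforward to illustrate (a minimal sketch; Stanza handles the harder, language-dependent problem of sentence segmentation):

```python
def split_paragraphs(text):
    """Split a document into paragraphs at blank lines (double
    newlines), discarding empty fragments."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```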

Importantly, all measures can be computed on each of those three levels, thus making no distinction in how the value for each measure is calculated based on these levels. However, since a value is calculated for each paragraph or sentence, returning each value individually is not feasible, nor desirable, since the number of values returned to the user would then be dependent on the text size and would not be uniform across analyses. For this reason, Lingualyzer summarizes the values for the paragraph and sentence levels using different statistics. More specifically, the values for the paragraph and sentence levels are summarized into average values over the different paragraphs or sentences, the standard deviation from this average, and the largest (maximum) and smallest value (minimum) across paragraphs or sentences.
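The four summary statistics described above can be sketched as follows. Whether Lingualyzer uses the sample or the population standard deviation is not stated here, so the sample formula below is an assumption on our part:

```python
from statistics import mean, stdev

def summarize(values):
    """Summarize a per-sentence (or per-paragraph) list of measure
    values into the four reported statistics: mean, standard
    deviation, minimum, and maximum."""
    return {
        "mean": mean(values),
        "sd": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }
```

This keeps the number of output values per measure constant (four), regardless of how many sentences or paragraphs the text contains.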

Overview of Lingualyzer measures

Due to the large number of measures, it is impossible within the scope of this paper to describe each Lingualyzer measure individually. Instead, below we categorize the 351 measures (Fig. 1) and provide a description of the individual measures in the instructions in the online interface. A categorization of measures needs to be independent of the text segment (i.e., sentence, paragraph or document) being analyzed. Furthermore, a categorization based on just the calculation method that is used to determine the value of a measure (e.g., count or ratio) does not suffice, because in many cases the same calculation method can be applied to many different aspects of a text, resulting in categories that are not necessarily very meaningful.

Fig. 1 Categorization of measures

Based on the type of information captured, we distinguish three categories of measures: (1) descriptive measures, (2) complexity measures, and (3) distribution measures. Descriptive measures describe the surface level or directly observable patterns in a text segment. Complexity measures, on the other hand, target the variability or internal complexity of a text segment. These measures can also describe the relationship between different descriptive measures within a text segment. Distribution measures capture the temporal aspects of a text segment. At a non-linguistic level, these measures describe the temporal distribution of an aspect, while at the linguistic level, these measures describe the distributional relationships between different text segments.

The descriptive, complexity, and distributional measures can be subdivided into whether or not they are language-specific. If the measure is not dependent on language-specific annotations such as the PoS tag, lemma, morphological features, or frequency and word vector databases, it is considered to be general , otherwise the measure is labeled linguistic .

The resulting six categories (descriptive, complexity, and distribution measures × general and linguistic measures) can be further subdivided by the text units quantified by the measures. We have defined measures quantifying (1) morphological, (2) lexical, and (3) syntagmatic units. Morphological measures capture patterns within the boundaries of individual words, such as morphemes or characters. Lexical measures quantify the individual words themselves. Finally, syntagmatic measures capture patterns in groups of words that share a (morpho)syntactic or semantic feature, such as a PoS tag or plural words. Note that this categorization is independent from the text level (i.e., sentence, paragraph, document) being investigated, as all morphological, lexical, and syntagmatic measures can be computed on each of these three text levels.

Altogether the Lingualyzer taxonomy encompasses (3 × 2 × 3 =) 18 categories. The taxonomy is illustrated in Fig. 1, with the categories discussed in more detail next.

Measures by information type and language-specificity

General descriptive measures

General descriptive measures describe the composition or the surface-level characterization of a text for which no linguistic knowledge is required. They may serve as proxies for other measures, but deviate from them in that they are directly observable in a text and language independent. Footnote 2 Examples of general descriptive measures are the letter and word count, measuring the total number of letters and words in a text segment, respectively, and the hapax legomena incidence, which counts the number of words occurring only once in a text per 1000 words.
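As an illustration, the hapax legomena incidence described above might be computed as follows (a sketch; whether Lingualyzer case-folds words before counting is an assumption on our part):

```python
from collections import Counter

def hapax_incidence(words, per=1000):
    """Number of word types occurring exactly once, expressed per
    `per` words of running text. Case-folding is assumed here."""
    if not words:
        return 0.0
    counts = Counter(w.lower() for w in words)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(words) * per
```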

Linguistic descriptive measures

Linguistic descriptive measures describe the surface level observable linguistic patterns. Linguistic descriptive measures are not necessarily generalizable, and hence language dependent in the sense that they require language-specific algorithms or resources to extract the required information. Examples of linguistic descriptive measures are the counts of individual part-of-speech tags (e.g., nouns, verbs, adverbs) in a text segment, or morphosyntactic features, such as the number of definite words and the number of passive voice markers in a text segment.

General complexity measures

General complexity measures compute the level of variability or internal complexity of a variable in terms of cognitive or computational resources, independent of the language. While general descriptive measures only describe the surface level, complexity measures look beyond it, targeting latent variables of a text. Examples of general complexity measures are the type-token ratio, i.e., the number of distinct words in the text relative to the total number of words; sentence length, i.e., the average number of words in a sentence; and word entropy, i.e., the average information content of the word types.
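As an illustration, the first two of these measures can be computed in a few lines. The sketch below uses naive whitespace tokenization on an invented toy text, not Lingualyzer's actual implementation (which tokenizes with Stanza):

```python
# Minimal sketch of two general complexity measures, using naive
# whitespace tokenization (Lingualyzer itself tokenizes with Stanza).

def type_token_ratio(tokens):
    """Number of distinct word forms divided by total word tokens."""
    return len(set(tokens)) / len(tokens)

def mean_sentence_length(sentences):
    """Average number of word tokens per sentence."""
    return sum(len(s.split()) for s in sentences) / len(sentences)

sentences = ["the cat sat on the mat", "the dog barked"]
tokens = " ".join(sentences).split()

print(round(type_token_ratio(tokens), 3))  # 7 types / 9 tokens = 0.778
print(mean_sentence_length(sentences))     # (6 + 3) / 2 = 4.5
```

Note that the type-token ratio is sensitive to text length, which is one reason normalized scores are preferred when comparing texts of different sizes.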

Linguistic complexity measures

Linguistic complexity measures compute the variability or complexity of variables in terms of linguistic variation and structure. Differing from the general complexity measures, these measures target language-specific or linguistic aspects, thus describing the variability between different linguistic markers or the complexity of the linguistic structure. Different than the linguistic descriptive measures, linguistic complexity measures describe the latent or internal structure of linguistic aspects, rather than surface level or directly observable linguistic aspects. Note that because these aspects are language-specific, there might be substantial variation in the internal linguistic structures of different linguistic variables across languages. Most notably, some linguistic aspects can be completely unmarked in a language, such that no linguistic structure is present at all. For example, definiteness is not marked in Chinese and Russian. Examples of measures in this category are the first-to-third pronoun ratio and the definite-indefinite word ratio.

General distribution measures

General distribution measures describe the temporal patterns in the surface level aspects of a text segment. These measures differ from the general descriptive measures because they investigate specifically where a surface level aspect of a text occurs, rather than describing a descriptive property of that aspect. General distribution measures furthermore differ from the general complexity measures because they investigate the surface level aspects of a text, rather than the latent aspects or internal complexity. Note that these measures also describe the surface level temporal patterns of different linguistic markers. Even though linguistic or language-specific information is needed to determine where those markers are, no linguistic information is needed for describing their temporal patterns. Hence, we have classified these as general distribution measures. Examples of such measures are the first-person pronoun burstiness, measuring the interval distribution of first-person pronouns, and the average position of future tense markers in a text segment.

Linguistic distribution measures

Linguistic distribution measures describe the temporal relationships between different text segments, thereby extending beyond the analysis of an individual text segment. What sets these measures apart from the general distribution measures is that they do not describe the temporal patterns within a text segment, but between different text segments, thereby describing the distributional relationships between different text segments. Whereas linguistic complexity measures describe the linguistic structure within a text segment, the linguistic distribution measures describe the similarities and differences in linguistic structure between text segments. We can compare consecutive paragraphs (paragraph-paragraph) or sentences (sentence-sentence) and calculate the values of the linguistic distribution measures for each comparison. Moreover, because we treat each text segment similarly, regardless of whether it is a document, paragraph, or sentence, we can calculate the linguistic distribution measures between text segments of different levels. Hence, we can compare each sentence to the rest of the document (sentence-document), or to the rest of the paragraph (sentence-paragraph) it occurs in, as well as compare each paragraph to the rest of the document (paragraph-document). One example of such a measure is the average word vector cosine similarity, which measures the semantic similarity between two text segments. Another example is the lemma overlap between two text segments, which measures the proportion of lemmas in the smaller text segment that also occur in the larger segment.
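A lemma-overlap measure of this kind can be sketched as follows. The lemma lists here are hand-written toy input; in Lingualyzer they would come from the Stanza lemmatizer, so this illustrates only the proportion calculation:

```python
# Toy sketch of lemma overlap: the proportion of lemma types in the
# smaller text segment that also occur in the larger one. The lemma
# lists are hand-written here; Lingualyzer derives lemmas with Stanza.

def lemma_overlap(lemmas_a, lemmas_b):
    small, large = sorted((set(lemmas_a), set(lemmas_b)), key=len)
    return len(small & large) / len(small)

sentence = ["the", "cat", "chase", "the", "mouse"]
document = ["cat", "be", "predator", "mouse", "be", "prey", "chase"]
print(lemma_overlap(sentence, document))  # 3 of 4 types shared: 0.75
```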

Measures by linguistic unit

For all six categories (i.e., descriptive, complexity and distribution, each subdivided into general and linguistic) we identify three additional categories of measures. These measures target different units in the text, namely within the word boundaries (morphological), at word level (lexical) and within a group of words that share a syntactic or semantic characteristic (syntagmatic).

Morphological measures

Morphological measures target information quantified in the word form, thus describing patterns that occur typically within the boundaries of individual words. One example is letter entropy, which measures the average information content of letters. Another example is the Levenshtein distance, which measures the distance between two text segments in terms of how many letter substitutions, additions and deletions are minimally needed to transform one text segment to the other.

Lexical measures

Lexical measures target information quantified at the level of individual word tokens, describing the composition, complexity or distribution of words, where individual words are the quantified units. These measures specifically target properties that are unique to a word, and thus do not target syntactic or semantic properties. Examples are the word count, hapax legomena count (number of words occurring only once in the text segment), type-token ratio, unknown word count (words not occurring in the standardized frequency list), average of the standardized frequencies of all words and word entropy.

Syntagmatic measures

Syntagmatic measures target information quantified at the level of a group of words that share a morphosyntactic, syntactic or semantic feature. These measures describe the behavior and distribution of these groups of words. Examples are the verb count, the first-to-third-person pronoun ratio, the burstiness (temporal behavior in terms of periodic, random and “bursty” recurrence) of passive voice markers, and the cosine distance between the average word vectors of two text segments.

Calculation methods

This section describes how the values of the 351 linguistic measures in the 18 categories are computed. Lingualyzer includes (1) raw counts, (2) ratios, and (3) normalized counts. The simplest method is a (raw) count, which counts the total number of occurrences of a quantifiable unit, such as a word token. In the calculation of ratio scores, typically the count of one quantifiable unit is divided by the count of another. An example of a ratio is the number of nouns divided by the number of lexical items. A specific variant of the ratio scores are normalized counts or incidence scores. These scores divide the count of a quantifiable unit by the text length (i.e., the number of word tokens in the text) to represent the density of a quantifiable unit. This is a score that is independent of the length of a text and allows for comparison across texts. Because the resulting scores can get very small, we multiply them by 1000 to represent a count per 1000 words, a more readable representation commonly used in quantitative linguistic tools for normalized counts (Bengoetxea & Gonzales-Dios, 2021; Biber, 1988; Graesser et al., 2004). Incidence scores therefore always range between 0 and 1000. An example is the noun incidence score, i.e., the number of noun tags divided by the total number of words, multiplied by 1000. Raw counts, ratios and normalized counts are calculated for a large variety of descriptive, complexity and even linguistic distribution measures.
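The ratio and incidence calculations are straightforward; a sketch with invented counts (not Lingualyzer's code):

```python
# Sketch of two of the three calculation methods: ratio and
# normalized count (incidence per 1000 words). Counts are invented.

def ratio(count_a, count_b):
    """E.g., number of nouns divided by number of lexical items."""
    return count_a / count_b

def incidence(count, total_tokens):
    """Normalized count: occurrences per 1000 word tokens (0-1000)."""
    return 1000 * count / total_tokens

print(ratio(12, 60))       # 0.2
print(incidence(12, 150))  # 80.0, i.e., 12 nouns in 150 words
```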

Even though the majority of the measures are calculated using one of those three methods, there still is a variety of measures that quantify aspects of the text through different methods. We specifically discuss the least familiar ones: Levenshtein distance, entropy, Zipf frequency and contextual diversity, Zipf’s law, burstiness and other dispersion measures, and cosine similarity.

Levenshtein distance

The Levenshtein distance denotes the minimal number of additions, deletions and replacements needed to transform one string into another (Levenshtein, 1966 ). There are multiple variants of this measure implemented. The Levenshtein character distance is a linguistic distribution measure that calculates the distance between different text segments in terms of character additions, deletions and replacements, while the Levenshtein word, lemma and PoS distance do the same, but for the words, lemmas and syntactic structure by looking at the word, lemma and PoS sequences of two text segments, respectively. The word-lemma Levenshtein distance, a linguistic complexity measure, uses the Levenshtein algorithm to denote the distance between each word and its lemma in a text segment through letter changes.
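The distance itself is a standard dynamic-programming computation. The sketch below (the Wagner–Fischer algorithm, not necessarily Lingualyzer's exact implementation) works on any sequence, so one function covers the character, word, lemma, and PoS variants:

```python
# Sequence-agnostic Levenshtein distance via Wagner-Fischer dynamic
# programming (a sketch, not necessarily Lingualyzer's implementation).
# Strings give the character variant; lists of words, lemmas, or PoS
# tags give the other variants.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("running", "runner"))             # 3
print(levenshtein(["DET", "NOUN", "VERB"],
                  ["DET", "ADJ", "NOUN", "VERB"]))  # 1
```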

Entropy scores

Entropy scores denote the average information content of a linguistic unit (Bentz et al., 2017; Gibson et al., 2019). We calculate the entropy of the words and characters in the text, using word unigrams and character unigrams, respectively. These measures give an estimate of the predictability (and hence complexity) of the words and characters in a text segment.
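Unigram entropy follows directly from the Shannon formula; a minimal sketch on a toy token list (illustrative only):

```python
import math
from collections import Counter

def unigram_entropy(units):
    """Shannon entropy in bits of the unigram distribution, i.e., the
    average information content of a unit (word token or character)."""
    counts = Counter(units)
    n = len(units)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

tokens = "the cat and the dog and the bird".split()
print(round(unigram_entropy(tokens), 3))  # ~2.156 bits per word
print(unigram_entropy("aaaa") == 0.0)     # True: fully predictable
```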

Zipf frequency and contextual diversity

Since these measures are somewhat related and both based on the same external source (i.e., the word frequency databases), we discuss them together (Gimenes & New, 2016). The Zipf frequency of a word denotes its general or standardized frequency of usage (Van Heuven et al., 2014). The frequency is logarithmically scaled in order to be more readable, given the large differences in frequency between frequent and infrequent words. Zipf frequencies are calculated by taking the base-10 logarithm of a word's frequency of occurrence per billion words. Because we only include words that occur at least once per million words, the Zipf frequencies range between 3 and 9. Footnote 3 We assign a Zipf frequency of 0 to words that do not occur in the frequency lists. Contextual diversity represents the percentage of all documents a word occurs in (Adelman et al., 2006). For both the Zipf frequency and contextual diversity, we included measures that calculate their word average in a text segment.
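The scale conversion can be written out as follows (a sketch of the published formula; the corpus counts are invented for illustration):

```python
import math

def zipf_frequency(count, corpus_size):
    """Zipf scale (Van Heuven et al., 2014): base-10 logarithm of a
    word's frequency per billion words. Once per million words -> 3;
    once per thousand words -> 6."""
    return math.log10(count / corpus_size * 1e9)

# a word observed 50 times in a one-million-token corpus
print(round(zipf_frequency(50, 1_000_000), 2))  # log10(50,000) = 4.7
```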

We also included information on the fit of Zipf’s power law to the word frequency distribution of a text segment (Zipf, 1949 ). The resulting fit is quantified by two values, namely (1) the estimated steepness of the slope of the distribution, and (2) the goodness-of-fit of the observed frequency distribution with the law, quantified through the R 2 determination coefficient. These values tell us something about how word frequencies are distributed and how well that distribution adheres to the law. It has been argued that the steepness of the curve is negatively correlated with the number of cognitive resources available to the language user (Linders & Louwerse, 2023 ; Zipf, 1949 ).
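One common way to obtain these two fit values is a least-squares regression of log frequency on log rank; whether Lingualyzer fits the law in exactly this way is an assumption here:

```python
import math
from collections import Counter

def zipf_fit(tokens):
    """Least-squares fit of log10(frequency) against log10(rank).
    Returns (slope, r_squared); under Zipf's law the slope is near -1
    and r_squared near 1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log10(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    ss_res = sum((y - my - slope * (x - mx)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, 1 - ss_res / ss_tot

# A perfectly Zipfian frequency list: frequency = 60 / rank
counts = {"a": 60, "b": 30, "c": 20, "d": 15, "e": 12, "f": 10}
tokens = [w for w, c in counts.items() for _ in range(c)]
slope, r2 = zipf_fit(tokens)
print(round(slope, 2), round(r2, 2))  # -1.0 1.0
```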

Dispersion measures

Dispersion measures calculate how a group of words is distributed across a text segment. The burstiness measure indicates the temporal distribution of words based on their position in a text segment (Abney et al., 2018 ). Scores of +1 indicate “bursty” behavior, which means that words in a group tend to cluster together in smaller clusters, with long distances between these clusters. Scores of -1 indicate a more even distribution of the words across the text, i.e., the occurrence of words within this group is more periodic. Scores around 0 indicate random behavior. Because the original burstiness formula assumes a temporal sequence to be infinitely long (Abney et al., 2018 ), the formula does not approximate finite temporal sequences well into the “bursty” direction, especially for shorter sequences (Kim & Jo, 2016 ). Since texts by definition are finite sequences of words and can be arbitrarily short, we have therefore used the alternative formulation described by Kim and Jo ( 2016 ) that approximates finite and shorter sequences better.
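In its original formulation the burstiness parameter is B = (σ − μ)/(σ + μ), computed over the gaps between successive occurrences. The sketch below implements this original formula for clarity; Lingualyzer instead uses Kim and Jo's (2016) finite-size variant, which rescales the same σ/μ ratio and is not reproduced here:

```python
import statistics

def burstiness(positions):
    """Original burstiness parameter B = (sigma - mu) / (sigma + mu),
    computed over the gaps between successive occurrences.
    B -> +1: bursty; B ~ 0: random; B -> -1: periodic."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return (sigma - mu) / (sigma + mu)

periodic = [0, 10, 20, 30, 40, 50]       # evenly spaced occurrences
clustered = [0, 1, 2, 3, 40, 41, 42, 80]
print(burstiness(periodic))         # -1.0: perfectly periodic
print(burstiness(clustered) > 0)    # True: bursty behavior
```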

Another measure used to assess the dispersion of a group of words is the average position in a text and its standard deviation. The average position is rescaled to a score between 0 and 1, with 0 denoting the start of the text segment and 1 the end. Finally, we have implemented a measure of dispersion that compares the number of occurrences in the first half of the text segment with the number of occurrences in the second half. This ratio is scaled to a number between −1, indicating that all items occur in the first half of the text, and +1, indicating that all items occur in the second half. A score of 0 indicates that the items are equally distributed between the two halves of the text.
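These two dispersion measures are arithmetically simple. A sketch, with the caveat that the exact rescaling convention (here, dividing token indices by text length − 1) is an assumption:

```python
def average_position(positions, text_length):
    """Mean token index of a word group, rescaled to [0, 1]
    (0 = start of segment, 1 = end; rescaling convention assumed)."""
    return sum(p / (text_length - 1) for p in positions) / len(positions)

def half_ratio(positions, text_length):
    """-1: all occurrences in the first half; +1: all in the second
    half; 0: evenly split between the halves."""
    first = sum(1 for p in positions if p < text_length / 2)
    second = len(positions) - first
    return (second - first) / (first + second)

# a marker at token indices 0 and 99 of a 100-token segment
print(average_position([0, 99], 100))         # 0.5
print(half_ratio([60, 70, 80], 100))          # 1.0: all in second half
print(round(half_ratio([2, 3, 95], 100), 2))  # -0.33
```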

Word vectors

The word vectors can be used to calculate a semantic representation by taking the average vector over all words in the text segment. For each word, we retrieved the word vector and averaged the vectors of all words in a sentence to create a semantic representation of that sentence. If a vector is not available for a word, we approximated the word by taking its lemma. Words for which neither the word nor the lemma is available are ignored. We furthermore removed all words with an occurrence of more than 4000 times per million words on both the news and blogs word frequency lists from Worldlex (Gimenes & New, 2016). This roughly corresponds to the 20 most frequent words in English, including words such as and, to, and the, but the exact set of extremely frequent words varies by language, with only four words being removed in Telugu and Korean, but 29 in French. Removing high-frequency words, typically grammatical items, is a common procedure to optimize distributional semantic measures (Landauer et al., 2007). The average vectors are then used to calculate the semantic similarity between different text segments. This is done by calculating the cosine similarity between the two vectors. A score of 1 indicates perfect similarity, meaning that the contents of the two text segments are identical, while a score of 0 indicates that the text segments are completely semantically distinct. Because these average vectors only look at content and do not take into account the size of a text segment, they can be used to compare text segments at different levels and of different lengths.
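The averaging and similarity computation can be sketched with toy vectors; the numbers are invented and real fastText vectors have 300 dimensions:

```python
import math

def average_vector(vectors):
    """Centroid of the word vectors in a segment."""
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

def cosine_similarity(u, v):
    """1 = same direction (semantically alike), 0 = fully distinct."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# toy 3-dimensional "embeddings" (real fastText vectors have 300 dims)
seg1 = average_vector([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
seg2 = average_vector([[1.0, 0.2, 1.0], [0.2, 1.0, 0.8]])
print(round(cosine_similarity(seg1, seg2), 2))    # 0.99: similar content
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: fully distinct
```

Because the centroid normalizes by the number of words, segments of very different lengths remain directly comparable, which is what allows the sentence-document and paragraph-document comparisons described above.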

General overview of measures

Our multidimensional setup with the analysis of different text levels, quantifying different units in the text, using a varied set of measures naturally leads to a large number of measures and an even larger number of values. To be precise, Lingualyzer computes 3118 different values for 351 different measures, spanning 18 categories of measures described above, at document, paragraph and sentence levels of analysis. Footnote 4 These numbers are summarized by category in Table 3 .

From Table 3, one might conclude that there is a strong bias towards syntagmatic measures. This is especially true when looking at the number of linguistic descriptive and complexity measures. Because Lingualyzer quantifies units primarily at the word level, there are only a few measures at the morphological level. This is, however, compensated for by the syntagmatic measures, a large part of which capture morphosyntactic properties. These properties are expressed at the morphological level, but summarized by morphosyntactic feature and hence defined at the syntagmatic level. The overwhelming presence of syntagmatic measures is furthermore caused by the fact that multiple measures are defined for each PoS tag and each morphosyntactic feature. For example, burstiness and all other general distribution measures are defined for each PoS tag and morphological feature, leading to a disproportionately large set of measures in this category.

It is important to note that, even though general measures are language-independent by design, they do use the tokenized and word-segmented representations from Stanza for the quantification of words. While tokenization and word segmentation are straightforward in most languages, they are not in some, such as Chinese and Vietnamese, where word boundaries are not marked by whitespace. Moreover, even though general distribution measures do not rely on language-specific information in their calculation, they do rely on language-specific resources for the definition of the word groups. In other words, general measures still in essence quantify linguistic information in a text segment, demonstrating that strict demarcations between the 18 categories are difficult to make.

Measures from the different information type categories (i.e., descriptive, complexity, distributional) are not necessarily fully mutually exclusive. Measures from different categories might correlate, and measures from one category might also be informative for measures in another category. The same applies to the categorization of the quantified text unit (i.e., morphological, lexical, syntagmatic). The main goal of the categorization was not to create a theory-informed categorization, but to summarize and describe the variety in the different measures in an understandable way.

Comparison with existing tools

With the 18 categories that Lingualyzer distinguishes, we can now better compare the tool to the other available tools presented earlier in this paper. This comparison is presented in Table 4 . As the table shows, very few tools contain general distribution measures or measures at the paragraph level. Yet almost all existing tools contain general complexity and linguistic descriptive measures. This is not surprising since the quantification of different word groups in the dictionary-based tools is primarily on a linguistic descriptive level. Most tools also contain at least one general complexity measure, such as the type-token ratio.

Lingualyzer differentiates itself from existing tools in a number of ways. First, it treats all 41 languages equally and uniformly, so that all measures and all dimensions can be analyzed in each language. Together with Profiling-UD, Lingualyzer supports a significantly larger number of languages than any of the other tools. This uniformity entails that measures are comparable across languages and that different languages can be compared with each other. Lingualyzer can furthermore analyze several general distributional properties of texts, something that is not possible in any of the other tools. Consequently, Lingualyzer has the largest variety of measures, closely matched by Coh-Metrix, ReaderBench, TAACO and MultiAzterTest. Finally, Lingualyzer is the first quantitative text analysis tool that, in addition to the document level, can easily summarize all measures at the paragraph and sentence level as well.

Profiling-UD seems to be very similar to Lingualyzer, as it also supports multiple languages, is data-driven, is very accessible, contains a large variety of measures and can be applied to answer a large array of research questions. The NLP pipeline is furthermore trained on the same data, namely the UD treebank (Nivre et al., 2020 ), although Profiling-UD uses a slightly older NLP tool (UDPipe). Lingualyzer targets a larger range of dimensions (in addition to morphological and syntactic dimensions, also semantic dimensions). Most notably, Lingualyzer captures general distributional aspects and distributional semantic aspects of language, whereas it does not capture syntactic complexity and syntactic relations in as much detail as Profiling-UD. Lingualyzer furthermore targets multiple text levels and is trained on a slightly newer set of models and version of the UD treebanks.

There are, however, also some limitations to Lingualyzer in comparison with other tools. Firstly, Lingualyzer only performs a surface-level syntactic analysis. For example, unlike Coh-Metrix or Profiling-UD, Lingualyzer does not construct a dependency parse tree for a deeper syntactic analysis. We excluded this analysis due to the heavy computation required and the generally lower quality of dependency parse annotations. Furthermore, an in-depth lexical semantic analysis is not possible due to the absence of cross-linguistic databases. Hence, word-specific properties such as semantic categories of words and rating scores on polarity and concreteness are currently impossible to incorporate.

Usage of Lingualyzer

Lingualyzer is a data-driven tool that analyzes texts in terms of general and linguistic content and quantifies this content into a large range of values at the sentence, paragraph, and document level. Because Lingualyzer is data-driven, it does not make any prior assumptions that are text-specific or language-specific. Hence, it can analyze any text, regardless of whether it is a large or small document and whether it consists of multiple paragraphs or sentences.

Because Lingualyzer is data-driven, it can be used for many different purposes, including register analysis, (author) profiling, readability assessment, as well as cross-linguistic analyses such as typology studies and text comparisons across languages. However, the large number of 351 measures, totaling 3118 values, might not be practical for all applications. We provide the user with two ways to reduce this seemingly combinatorial explosion of measures. First, the user can select a reduced set of 33 values that cover most of the 18 categories, providing a comprehensive summary of the measures that will likely be most frequently used by the average user. This summary is the most basic version of Lingualyzer and is generally recommended for less experienced users and users new to Lingualyzer. An overview of these selected measures is given in Table 5. These measures cover only the document level, with the exception of one linguistic distribution measure, namely the cosine distance, which covers the paragraph–document, sentence–document, paragraph–paragraph, and sentence–sentence levels. For a full description of these (as well as all other) measures, we refer to the online documentation of the tool ( https://lingualyzer.com ). For users who prefer more flexibility in choosing the types of measures presented to them, but would like a comprehensive albeit not overwhelming overview, we provide the possibility to filter on the six categories, the three text levels themselves, as well as on the statistics used to summarize the values at the sentence and paragraph levels.

Lingualyzer is free to use for researchers in the scientific community. It is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). If you use Lingualyzer in your research, please cite the current paper.

The dependencies of Lingualyzer are the Stanza tool, the Universal Dependencies Treebank, the wordfreq tool and the fastText word vectors. Stanza and wordfreq are licensed under the Apache License 2.0, and the fastText word vectors are licensed under the Creative Commons Attribution-Share-Alike License 3.0. All Universal Dependencies Treebanks that are used, with three exceptions, are licensed under Creative Commons licenses with about a third not allowing any commercial use (Creative Commons Attribution-NonCommercial-ShareAlike). Footnote 5 The Catalan, Polish, and Spanish treebanks are licensed under the GNU General Public License, Version 3.

How to use Lingualyzer

Lingualyzer is accessible for free through an online interface at https://lingualyzer.com/ . The interface was developed in such a way that it is intuitive and easy to use, also for users new to Lingualyzer. An illustration of the interface is shown in Fig. 2 . The user can either enter the text to be analyzed in a textbox, or can upload a text file consisting of the individual text to be analyzed. Uploaded texts must be submitted in text format. Users can enter texts that are up to approximately 40,000 characters long. Adhering to privacy concerns, Lingualyzer does not store or use any of the texts that are processed, nor does it store or use any of the processed output. User texts are deleted from the server when the analysis is completed.

figure 2

Illustration of the web interface of Lingualyzer

Next, the user can select any filters needed to provide a (more) concise output of the Lingualyzer measures. Lingualyzer automatically detects the language of the text, but the user can also select the language manually prior to the analysis. Processing is generally fast: results are typically returned within seconds, though larger texts require more time given the larger number of computations. Larger documents can take up to 2 min to analyze, and documents above the recommended limit might take very long to complete. The Lingualyzer results are shown in a table consisting of a column with the title of the text, the language in which the text was analyzed, and the measures selected by the user. For each additional text that is analyzed, a column is added to make comparisons of results across different texts straightforward. The results can be copied and pasted into a spreadsheet, or downloaded in ".txt" format. The user is given the choice of downloading the full or just the filtered results.

Potential applications

The primary goal of Lingualyzer is to provide researchers with the possibility of analyzing texts across a large number of different languages. Lingualyzer supports 41 different languages from ten different language families, allowing researchers across a large and varied language landscape to perform quantitative linguistic text analyses. Many findings for English could potentially be validated across other languages and many new research questions can be investigated for new languages.

Exploring the possibilities of performing cross-linguistic analyses is a promising direction, because Lingualyzer computes the exact same measures across all supported languages. Moreover, the models were trained using the same algorithms, with the underlying data based on a unified annotation framework. This means that the output of Lingualyzer is comparable across individual languages. While not all measures are meaningful when compared across languages, for example due to the absence of a morphological feature in some languages (e.g., definiteness not being marked in Chinese and Russian), the unified annotation framework of the UD treebanks does seem to enable cross-linguistic comparisons, with general and linguistic complexity having been investigated across languages using the UD treebank corpora and annotations (Bentz et al., 2023; Berdicevskis et al., 2018).

Lingualyzer captures a wide variety of different aspects in texts on different dimensions using a language-agnostic and text type-agnostic approach, and could therefore potentially be used in many “classic” quantitative text analysis applications, such as text and author characterization (Biber, 1988 ; Juola, 2008 ; Tausczik & Pennebaker, 2010 ), and readability and complexity assessment (Dascalu et al., 2013 ; McNamara et al., 2012 ). The relative simplicity of the measures (e.g., no complicated and error-prone computations, such as dependency parsing; cf. Qi et al., 2020 ) is likely an advantage as they might be more robust across different text types.

Lingualyzer furthermore provides summary statistics on a paragraph and sentence level and is unique in providing information consistently at three different text levels (i.e., document, paragraph, and sentence). These summarization statistics provide more localized information and could therefore potentially be very useful in for example (linguistic) stimuli creation, validation and analysis (Cruz Neri & Retelsdorf, 2022a ; Dodell-Feder et al., 2011 ; Trevisan & García, 2019 ).

Finally, because of the large number of values computed by Lingualyzer, it can serve as feature input for computational algorithms. For example, such feature input can be used to train computational models that classify the truthfulness of (political) statements (Mihalcea & Strapparava, 2009; Rashkin et al., 2017), or detect humor and irony (Barbieri & Saggion, 2014; Reyes et al., 2012). The input can also be used to investigate if and how complex neural networks encode linguistic information, with the goal of making such models insightful and explainable (Miaschi et al., 2020; Tuckute et al., 2022).

Validation of Lingualyzer

Any computational linguistic tool ideally needs to be validated. McNamara et al. ( 2014 , p. 165) distinguish intrinsic validation (testing that the tool does what it is supposed to) and extrinsic validation (evidence in terms of widespread use and acceptance by a community, for instance the discourse community).

Most tools given in Table 1 have been validated intrinsically. Most notably, Coh-Metrix was validated by comparing the output of texts with a high versus low cohesion, considering relative differences between the measures for the two conditions (McNamara et al., 2010 ). Coh-Metrix has furthermore been validated as a measure for differentiating several text characteristics, such as language registers (Louwerse et al., 2004 ) and authorship (McCarthy et al., 2006 ). Each version of LIWC was evaluated on a corpus with different text genres, where the consistency of word use within a dictionary category was measured across texts from different genres (Boyd et al., 2022 ). Apart from an evaluation by the authors themselves, LIWC has been incredibly popular and has been validated in numerous psychological domains (Tausczik & Pennebaker, 2010 ). MultiAzterTest was evaluated using yet another validation technique. The authors evaluated the correctness of the readability assessments made by their tool and compared them to the same assessments made by Coh-Metrix, taking the latter as a baseline (Bengoetxea & Gonzales-Dios, 2021 ).

These types of intrinsic validation are called instrument validity . The tool under investigation is validated for a particular purpose, for example readability assessment or register analysis. A prerequisite for any instrument validity, however, is to prove that the individual measures are both reliable and consistent. This is called instrument reliability . This is perhaps the most critical type of evaluation. Even though one may assume that the developers of an analysis tool have taken all care to make sure that the produced values are correct, instrument reliability is generally not reported. In fact, from all the tools in Table 1 , we only know of two that have been validated using an instrument reliability study: L2SCA and LATIC. The syntactic annotations and the 14 measures in L2SCA, a tool for measuring the syntactic complexity in texts from non-native speakers of English, were verified by two annotators on a small subset of a corpus of English essays written by native Chinese speakers, demonstrating a high reliability of the tool both at the level of automatic annotations and the measures (Lu, 2010 ). The evaluation procedure of LATIC was very similar. Part-of-speech annotations in LATIC were manually verified in English and German through human annotations on a small sample from corpora containing fiction and news articles respectively (Cruz Neri et al., 2022b ). Next, using five short texts, taken from introductory texts of questions from a science assessment, the measures were calculated by human annotators and correlations between the annotators and the output from LATIC were computed. No significant differences between the measures calculated by the annotators and the LATIC output were found.

Data-driven or hybrid tools with a large number of linguistic features have not reported instrument reliability, or such reports are at least not distributed through academic outlets. The likely reason is that computing the reliability is a tedious and time-intensive process, due to the large number of measures in these tools and the complex nature of those measures. It is therefore no surprise that instrument reliability has only been investigated for tools with a relatively small number of easy-to-compute measures, such as L2SCA and LATIC (cf. Table 1).

We provide both types of intrinsic validation: an instrument reliability study and an instrument validity study.

Instrument reliability

Lingualyzer dependencies.

In order to compute the output of the measures, Lingualyzer uses several computational linguistic resources. The reliability and validity of any computational tool depends directly on the quality of its external resources. The Stanza toolkit is used as an NLP pipeline to process (i.e., segment and annotate) the raw text and is thus used in all calculations. The word frequency and contextual diversity information from WorldLex is used in some measures to determine the general use of words, and fastText word vectors are used for comparing the semantic similarity across text segments.

For a reliable text quantification system, it is important to have accurate annotations on all relevant tasks of the NLP pipeline. Stanza models have been intrinsically validated, with each model individually evaluated on each NLP task. Across all pre-trained models, Stanza typically performs well, with average F1 scores above 85% on all tasks (Qi et al., 2020).

It is, however, difficult to assess the general robustness of Stanza and its applicability to different text registers and genres across languages, due to differences in the genres and registers contained in the training data for each model, the size of the corpus, and the quality of the annotations. Unfortunately, only a few studies have investigated the annotations of Stanza for different registers and genres (Păiș et al., 2021; Sadvilkar & Neumann, 2020; Sirts & Peekman, 2020). Despite being a relatively new tool, Stanza has already been widely used as a processing pipeline in many studies with texts from very different genres. For example, it has been used for detecting phishing e-mails (Gualberto et al., 2020), identifying comparative questions (Bondarenko et al., 2022), and investigating statistical tendencies in transcribed spoken dialog (Linders & Louwerse, 2023).

To ensure the highest possible reliability of the Stanza models on texts from registers other than those they were trained on, we considered additional selection criteria for the inclusion of Stanza models and languages, and for the use of Stanza annotations. First, we only included languages for which an accuracy of more than 80% was achieved on all relevant NLP tasks (i.e., tokenization, sentence segmentation, multi-word token expansion, lemmatization, PoS tagging, and morphological feature annotation). Second, we only included models that were trained on at least 40,000 word tokens. Third, if multiple models were available for a single language, we preferred the largest model, or the model trained on the most varied corpus data in case the differences in corpus size were small. Finally, we excluded measures based on annotations of certain morphological features, such as abbreviations, mood, and aspect, due to reliability concerns: these features are not consistently annotated across languages. For the same reason, we also did not use dependency parses, since they have a demonstrated lower performance (Qi et al., 2020).
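A minimal sketch of these model selection criteria; the `model` dictionary structure here is a hypothetical illustration, not Stanza's actual model metadata format:

```python
# Thresholds from the selection criteria described above.
MIN_TASK_ACCURACY = 0.80   # tokenization, sentence segmentation, MWT
                           # expansion, lemmatization, PoS, morphology
MIN_TRAIN_TOKENS = 40_000  # minimum training corpus size in word tokens

def eligible(model):
    """Return True if a candidate model meets both inclusion thresholds.

    `model` is a hypothetical dict with per-task accuracy scores and the
    number of word tokens the model was trained on.
    """
    return (min(model["task_scores"].values()) >= MIN_TASK_ACCURACY
            and model["train_tokens"] >= MIN_TRAIN_TOKENS)
```

When several eligible models exist for a language, the paper's third criterion (largest model, or most varied corpus at comparable size) would then break the tie.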

For a small subset of the available languages, the WorldLex word frequency and contextual diversity databases were validated on a lexical decision task, showing significant correlations between reaction times to individual words and their frequency and contextual diversity (Gimenes & New, 2016), variables that have been hypothesized to correlate strongly in the psycholinguistic literature (Brysbaert & New, 2009). A subset of the fastText vectors was validated on a word analogy task in ten different languages (Grave et al., 2018), a common way to validate word vectors, given their rather abstract representation (Schnabel et al., 2015), though not without problems (Faruqui et al., 2016). The fastText vectors, albeit not the best-performing word vector model in most tasks (Wang et al., 2019), are widely used in many areas of natural language processing, owing to the unique availability of vectors in many different languages and the ability to represent unseen words (Lauriola et al., 2022).

Lingualyzer measures

In addition to a validation of its dependencies, the instrument reliability of Lingualyzer itself needs to be assessed. Here we report the instrument reliability for both English and Dutch, the two languages for which the authors who performed the manual verification had (near) native proficiency. Investigating all 41 languages is not necessary, because the implementation of the measures is independent of the selected language. The evaluation was done on the document level for the measures that analyze individual text segments, and on a sentence-sentence level for the measures that analyze and compare different text segments (i.e., linguistic distribution measures). A human validation of all 3118 values is also superfluous, due to the repetition of calculations and code at each of the text levels – yet we did check for any discrepancies across values. Values at the document level are, where possible, calculated using a bottom-up approach, combining the values from the lower levels. Thus, any calculation error at a lower level will cascade to the document level. For the linguistic distribution measures, the sentence–sentence level was chosen, due to the short texts used in the validation.

Manually computing the values on a single text large enough to cover all measures included in Lingualyzer is prone to human error, and it is difficult for peers to evaluate the results. We therefore opted for generating individual sentences that specifically target a measure. To avoid any biases on our end, we queried OpenAI’s ChatGPT (OpenAI, 2023), which we prompted for a sentence or short text with multiple instances of a characteristic specific to each measure. Footnote 6 We performed the same sentence generation process for both English and Dutch. Some of these sentences were re-generated, adapted manually, or substituted with an earlier generated sentence in case ChatGPT did not yield a sentence that included instances of the unit quantified by the measure. Hence, for each of the 351 measures, we generated a single targeted sentence or short text on which the respective measure was evaluated. Generated texts consisted primarily of a single sentence of approximately ten words, except for the sentences that needed to be embedded in paragraphs, which included 2–4 sentences, and for the cases where two sentences were needed, i.e., the linguistic distribution measures (see details below).

For all generated sentences, the Lingualyzer value was calculated by hand, without using Lingualyzer. External scripts were used for calculations that were infeasible to do by hand or that needed a specific resource (e.g., a word vector or word frequency). These values were then compared with the values generated by Lingualyzer, and any discrepancies were investigated and resolved. We removed measures that yielded inconsistent annotations, such as measures based on negation markers, aspect, and mood. Because many of these annotations are re-used in different measures, this resulted in the removal of 84 measures. The removal of these measures guaranteed consistency within a language, but more importantly consistency across languages (i.e., a measure may have worked well for one language, but not for another), albeit at the cost of a reduced number of measures. Consequently, a perfect correlation was obtained between the Lingualyzer output and the human computations for all Lingualyzer measures for both Dutch and English. The dataset with the artificially generated sentences and the corresponding human-validated values can be viewed on the Lingualyzer website under “Instructions”. In addition to verifying Lingualyzer, these sentences can also serve as examples that make the measures more insightful. The statistics of the English texts can be found in Table 6. Footnote 7

Human-validated sentences

Having established a perfect match between the Lingualyzer output and the human performance, we next compared these findings with a subset of the existing tools reported in Table 1: Coh-Metrix (Graesser et al., 2004; McNamara et al., 2014), the re-implementation of the Biber Tagger, the Multi-dimensional Analysis Tagger or MAT (Nini, 2019), Profiling-UD (Brunato et al., 2020), MultiAzterTest (Bengoetxea & Gonzales-Dios, 2021), and LATIC (Cruz Neri et al., 2022b). These tools were chosen because they (1) were publicly available and (2) contained at least five measures that could be mapped onto a Lingualyzer measure. ReaderBench would also qualify for inclusion in the analysis, but was unfortunately unavailable at the time the analysis was conducted. We created a mapping from Lingualyzer measures to the measures in each of these tools. Where a match was less apparent, we made adjustments, as follows. First, MAT, Profiling-UD, and LATIC use incidence scores that represent the occurrence per 100 words, while all other tools represent the same scores per 1000 words. Hence, we multiplied the incidence scores of Profiling-UD and LATIC by a factor of 10 to match those in Lingualyzer. Second, some values in Lingualyzer were represented by two individual values in other tools. For example, Coh-Metrix contains separate incidence scores for first-person singular and first-person plural pronouns, while Lingualyzer contains a single incidence score covering both. In these cases, we added up the scores of the individual values.
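The two adjustments can be sketched as follows; the counts and scores are made up for illustration, and the helper names are ours, not Lingualyzer's:

```python
def incidence_per_1000(count, n_words):
    """Occurrences of a category per 1,000 word tokens (Lingualyzer's scale)."""
    return 1000.0 * count / n_words

def rescale_per_100(score):
    """Convert a per-100-words incidence score to the per-1,000-words scale."""
    return score * 10.0

# Merging split categories, e.g. separate first-person singular and
# plural pronoun scores into one combined first-person score:
combined = incidence_per_1000(3, 500) + incidence_per_1000(2, 500)
```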

We made a distinction between measures where an exact correspondence was expected and measures where an approximate correspondence was expected. Approximate correspondences were expected when NLP processing algorithms were trained on different datasets with different tagsets, leading to slight differences in the resulting scores. Moreover, some measures that represent the same information were calculated slightly differently. One example is the type-token ratio over the first 100 words in Profiling-UD, which can only approximate the more holistic moving-average type-token ratio in Lingualyzer, calculated over all possible windows of 100 words in a text segment. Another example is the calculation of the cosine distance, which in Coh-Metrix is based on latent semantic analysis, while it is based on averaged word vectors in Lingualyzer.
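Both approximate correspondences can be illustrated with small stdlib-only sketches (not the actual Lingualyzer implementation):

```python
import math

def mattr(tokens, window=100):
    """Moving-average type-token ratio: the mean TTR over every contiguous
    window of `window` tokens; falls back to a plain TTR for texts shorter
    than the window."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

def average_vector(vectors):
    """Element-wise mean of word vectors, one way to represent a segment."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine similarity between two (e.g., averaged word) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))
```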

Because only one of the tools (Profiling-UD) included the Dutch language, in this analysis we only compared the output of the tools on the English sentences, unlike in our previous instrument reliability assessment. Of the 351 measures in Lingualyzer, 56 had an equivalent in one or multiple existing tools. MAT had the smallest number of equivalent measures with 12, while MultiAzterTest had the largest with 38.

For each tool, we calculated the percentage of the measures that returned the correct value, based on the human-validated gold standard (see Section 3.1.2, Lingualyzer measures). Here we mitigated possible effects of different rounding strategies by allowing for a very small margin of error. The correctness percentages are summarized in Table 7. The performance of Lingualyzer for the 351 sentences equals human performance. All other tools made only minor mistakes when compared with the manually computed output (and hence the Lingualyzer output), resulting in correctness percentages between 88 and 100%, supporting their general reliability. The only notable exception is Profiling-UD, which yielded a low correctness percentage of 22%. This, however, can almost exclusively be traced back to the fact that punctuation marks are counted as individual word tokens, so that all measures that rely on this count, such as all incidence scores and word length, return an incorrect value. Note, however, that a meaningful comparison of the correctness percentages across tools is not possible due to differences in the exact nature of the measures and in the number of measures that overlap with Lingualyzer.
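The correctness computation with a rounding margin can be sketched as follows; the tolerance values are our assumptions, as the paper does not specify the exact margin:

```python
import math

def correctness(tool_values, gold_values, rel_tol=1e-3, abs_tol=1e-6):
    """Fraction of measures whose tool output matches the human-validated
    gold value within a small margin that absorbs rounding differences."""
    hits = sum(math.isclose(t, g, rel_tol=rel_tol, abs_tol=abs_tol)
               for t, g in zip(tool_values, gold_values))
    return hits / len(gold_values)
```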

Like the sentences used for the validation of the Lingualyzer measures (Section 3.1.2), the dataset with the mapping of the Lingualyzer measures to the measures of the tools used in the comparison can be found on the Lingualyzer website under “Instructions”.

Actual texts

The instrument reliability analysis using short sentences that target individual measures is welcome, as it (1) allows us to verify the accuracy of the measures, and (2) provides examples of the measures to the user. However, one may argue that such an analysis does not represent a naturalistic scenario in which Lingualyzer would be used. The results from this evaluation can therefore only be interpreted as validating that the measures reliably calculate the correct value.

To evaluate Lingualyzer with naturalistic data, we compared the output of the Lingualyzer measures with the same tools as in the previous analysis on texts that more closely represent actual use cases. All five tools are available for English. The next most common language among the tools in Table 1 is Spanish; three of the five tools support Spanish, which is why we included it in this analysis as well. Moreover, we investigated three texts from three different genres: a fiction book, a very recent news article, and a transcript of a free-flow spoken dialog between two participants. From Project Gutenberg, we retrieved the first chapters of the following fiction books in English and Spanish, respectively: “Alice's Adventures in Wonderland” and “El idilio de un enfermo”. Footnote 8 We selected the following news articles: “Diana knew she wouldn’t be queen — and doubted Charles wanted the crown” from The Washington Post and “El misterioso asesinato de Guillermo Castillo, el chef del pueblo” from El Mundo. Footnote 9 Finally, for the spoken dialog transcripts, we selected the dialog with id “sw_0243_3513” from the Switchboard Dialog Act Corpus (Stolcke et al., 2000) and, from the Spanish CallFriend corpus, the dialog with id “4057” (MacWhinney, 2007). Texts were converted to a “.txt” format and encoded using UTF-8. In addition, all newline characters that solely served to enhance readability were removed to ensure homogeneity in formatting. For the spoken conversations, annotations were removed, and each turn was separated into a single paragraph.

Given the nature of the values, we computed a non-parametric Spearman rank-order correlation for each text and tool, correlating the values for each tool with the corresponding Lingualyzer values. The results are shown in Table 8. Note that at most 42 values were compared across the tools, only a small subset of all values computed in Lingualyzer, and that a comparison across tools is again not possible, due to differences in the measures that are correlated. Note further that in this analysis, next to the measures with an exact correspondence to a Lingualyzer measure, we also included measures with an approximate correspondence. Overall, correlations are very high, with most correlations r > .95, showing consistency in the measures across tools, languages, and genres, with the exception of dialog. Even though the sample size is small, with only one text per language and genre, it is clear that correlations are lowest for dialog, and especially low for MAT in English and LATIC in Spanish. In sum, this highlights that, despite the differences in annotations and possible mistakes, tools are comparable to Lingualyzer on the small subset of overlapping measures for fiction and news articles, and to a lesser extent for spoken dialog transcripts.
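In practice one would likely use `scipy.stats.spearmanr`; for illustration, a self-contained sketch of the Spearman rank-order correlation (with average ranks for ties) is:

```python
import math

def _ranks(xs):
    """Ranks of xs (1-based), with tied values receiving their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```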

Instrument validity

In addition to the instrument reliability – comparing the outcome of Lingualyzer measures with those of human raters and existing tools – instrument validity is relevant. One of the primary aims of Lingualyzer is to open up the possibilities for researchers in the behavioral science community to study one or multiple languages beyond English. To facilitate this, we aimed to make all measures available in all languages and to make each measure as comparable as possible across languages. However, for Lingualyzer to be a reliable tool for analyzing or comparing multiple languages, differences in output across languages need to be systematic. We selected a parallel corpus, the translations of the Universal Declaration of Human Rights (UDHR). Footnote 10 Because the content of the document is supposedly identical across translations, we expected the differences in output to be caused by linguistic differences between languages. We predicted that these linguistic differences – and therefore the “linguistic distances” – would be smaller for more closely related languages (Chiswick & Miller, 2005; Wichmann et al., 2010). We correlated the linguistic differences extrapolated from a bottom-up approach (i.e., the Lingualyzer output from the UDHR translations) with linguistic differences extrapolated from a database of language typology, which we will call a top-down approach.

The fundamental difference between a bottom-up and a top-down approach to comparing languages is that a bottom-up approach relies on actual corpus data and the statistical patterns found in these data, while a top-down approach relies on descriptions of generalized patterns in language based on expert judgments. In the field of language typology, the two approaches are referred to as token-based and type-based typology, respectively (Levshina, 2019). A bottom-up approach has been used to show and explain the universality of several quantitative or statistical linguistic laws (Bentz & Ferrer-i-Cancho, 2016; Bentz et al., 2017; Piantadosi et al., 2011), while a top-down approach has been used to explain how languages differ from and relate to each other (Bickel, 2007; Comrie, 1989; Georgi et al., 2010).

For the bottom-up approach, the UDHR corpus was chosen as our data source because the information is formal and leaves little room for ambiguity. It is therefore less susceptible to differences in meaning or content across translations, for instance due to stylistic differences or figurative language use. The UDHR corpus is rather small, consisting of only roughly 63 paragraphs and 86 sentences. We removed all metadata and used Lingualyzer to compute the results for all 351 measures (and the resulting 3118 values) for the translations in each of the 41 languages. A comparison of the similarities and differences in the output, quantified through a linguistic distance calculation, would then allow for identifying how similar the languages are.

But how do we quantify this linguistic distance between languages? Due to the widely varying scales of the values of the Lingualyzer output, simply computing a Euclidean distance would bias the distance towards the measures with the larger scale. We therefore normalized the data by computing z scores for each of the values. The advantage of this normalization is that the resulting values are not only centered around 0, but are also comparable in their deviation from the mean across languages. We then removed values that had a perfect correlation with another value, when looking at the values across languages, since these values are redundant and therefore uninformative. Similarly, we removed the measures where all values were the same across languages. Finally, we computed the Euclidean distance between the z scores of the different values for each language pair.
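The normalization and distance computation described above can be sketched as follows (toy numbers; in the actual analysis each row would hold the 3118 Lingualyzer values for one language):

```python
import math

def zscore_columns(matrix):
    """z-normalize each column (value) of a languages x values matrix.

    Constant columns would yield sd = 0; the paper removes such columns,
    but we guard against them here by mapping them to 0.0.
    """
    n = len(matrix)
    cols = list(zip(*matrix))
    out_cols = []
    for col in cols:
        mu = sum(col) / n
        sd = math.sqrt(sum((x - mu) ** 2 for x in col) / n)
        out_cols.append([(x - mu) / sd if sd else 0.0 for x in col])
    return [list(row) for row in zip(*out_cols)]

def euclidean(u, v):
    """Euclidean distance between two languages' z-scored value vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```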

For the top-down approach, we used the World Atlas of Language Structures (WALS) to extract typological features for all languages available in Lingualyzer (Dryer & Haspelmath, 2013). The current version of WALS has 192 different features, each of which can take between two and 28 values. Defining distances between languages based on the typological features is not straightforward. Here we closely followed Rama and Kolachina (2012), who created a binary feature matrix from the typological features, which was subsequently used to quantify the distances between languages. Unfortunately, not every feature is defined for all languages, leaving many feature values undefined. Therefore, similar to Rama and Kolachina (2012), we removed features that were not shared by at least 10% of the languages, to avoid creating feature vectors that are too sparse. We then converted each feature with k different values into k different binary features, marking the presence or absence of that particular feature value in a language, similar to Georgi et al. (2010) and Rama and Kolachina (2012). Since Serbian and Croatian are combined in WALS, we used the same binary feature vector for both languages. In total we had 515 binary features, covering 66.1% of all possible values across languages. Afrikaans and Slovak had to be removed from further analysis, because these languages contained too few features, resulting in language pairs with no overlapping features, for which no distance could be defined. For the remaining language pairs, we quantified the distances using the normalized Hamming distance, that is, the proportion of shared feature values that differ between the two vectors. All feature values that were undefined for one or both of the languages were removed prior to the calculation of the distance.
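The binarization of WALS features and the Hamming distance with missing values can be sketched as follows (the feature values are illustrative):

```python
def one_hot(value, all_values):
    """Expand a k-valued WALS feature into k binary features; an undefined
    value (None) stays undefined in every binary slot."""
    if value is None:
        return [None] * len(all_values)
    return [int(value == v) for v in all_values]

def hamming_distance(u, v):
    """Normalized Hamming distance over the binary features defined for
    both languages: the fraction of shared feature values that differ."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        # e.g., the Afrikaans/Slovak situation described in the text
        raise ValueError("no overlapping features for this language pair")
    return sum(a != b for a, b in pairs) / len(pairs)
```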

We then performed a hierarchical clustering. Figures 3 and 4 summarize the hierarchical clusters into dendrograms for the Lingualyzer distances (Fig. 3) and the typological distances (Fig. 4). The similarities between the Lingualyzer and the language typology dendrograms are readily apparent: in both dendrograms, languages from the same family or branch tend to cluster together. Note that the Lingualyzer clustering cannot be explained by the language script. For instance, Hindi and Urdu, two languages that are similar yet do not share the same script, cluster together.

figure 3

Dendrogram created from a distance matrix based on differences in Lingualyzer output between languages

figure 4

Dendrogram created from a distance matrix based on typological differences between languages

In order to compare the similarity between the hierarchical clusters, we computed the cophenetic correlation coefficient, a technique that allows for comparing the similarity of clusters created through hierarchical clustering, i.e., dendrograms (Sokal & Rohlf, 1962). Similar to an “ordinary” correlation, its values range between −1 and +1, indicating the strength of a negative or positive correlation. The correlations between the hierarchical clusters created from the Lingualyzer distances and the typological distances are shown in Table 9. Here, we report not only the results where all Lingualyzer measures were used in calculating the distances between language pairs, but also those where we used subsets of the Lingualyzer measures. These subsets include measures from the following categories: document level, general descriptive, linguistic descriptive, general complexity, linguistic complexity, general distribution, and linguistic distribution. Despite the very different approaches to establishing the distances between language pairs (i.e., a bottom-up, token-based, data-driven approach versus a top-down, type-based, expert judgment-based approach), a moderate to strong correlation between the resulting clusters of both approaches is found, showing that a significant portion of the variance in the distances between languages on a parallel corpus, calculated using the Lingualyzer measures, can be attributed to typological or linguistic differences between languages.
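Assuming the cophenetic distance matrices have already been extracted from the two dendrograms (e.g., with `scipy.cluster.hierarchy.cophenet`), the final step reduces to a Pearson correlation over the matrices' upper triangles; a stdlib-only sketch:

```python
import math

def matrix_correlation(d1, d2):
    """Pearson correlation between the upper triangles of two distance
    matrices over the same set of languages. Applied to cophenetic
    distance matrices, this yields the cophenetic correlation."""
    xs, ys = [], []
    for i in range(len(d1)):
        for j in range(i + 1, len(d1)):
            xs.append(d1[i][j])
            ys.append(d2[i][j])
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```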

Zooming in on the different subsets of Lingualyzer measures, we observe a slightly lower correlation for the document measures compared with all other measures. The linguistic descriptive and complexity measures result in a significantly higher correlation than their general counterparts. This is not surprising, given that the WALS features are linguistic by definition. For the same reason, the general distribution measures result in a similarly high correlation, as they quantify the distribution of primarily linguistic variables and are typically language-specific. Because the linguistic distribution measures quantify similarities between different text segments, they are less indicative of variation between languages; still, we find a low-to-moderate correlation for this subset.

These results, albeit only illustrative for a language typology study, demonstrate an example of instrument validity, paving the way for the use of Lingualyzer in cross-linguistic studies and comparisons. The moderate-to-strong correlations indicate systematicity in the variation across languages and thus also consistency in the measures across languages, an important prerequisite for any analysis involving multiple languages. Finally, this validation study also demonstrates one potential use case of Lingualyzer, namely investigating cross-linguistic generalizations. One exciting potential extension of this study is to investigate whether the output of Lingualyzer can predict the presence or absence of a typological feature in a language based on usage-based language data alone.

Discussion and conclusion

This paper presented Lingualyzer, an easy-to-use multilingual computational linguistic tool for text analysis across a multitude of features. Lingualyzer analyzes text on 351 measures, categorized into 18 categories, resulting in 3118 values across 41 languages. Compared with other available computational linguistic tools, Lingualyzer is unique in covering such a large number of languages with such a large number of computational measures, all of which are available and comparable in every language.

As with every tool, Lingualyzer has some limitations. First and foremost, Lingualyzer does not yet support batch processing: each document has to be entered individually. This may not be practical when a large number of documents need to be processed, but to save resources and to avoid misuse of the tool, this is for now the most feasible option. Internally (i.e., not through the public web interface), Lingualyzer does allow for batch processing, and we are evaluating options to make batch processing available to a larger audience. Second, the number of features that Lingualyzer uses to analyze text at different levels is large, but could be larger. However, because Lingualyzer allows for cross-linguistic analyses, consistency across languages is more critical than obtaining a maximum number of features. Moreover, to avoid overwhelming the user with a multitude of features, we provide a common-features-only option. Finally, Lingualyzer covers less than 1% of the living languages in the world today. That is the disappointing news. However, the 41 languages Lingualyzer does cover are among the most commonly used. As with the measures Lingualyzer includes, the languages it covers are based on a selection that ensures consistency in cross-linguistic analyses.

Even though most computational linguistic tools are presumably validated intrinsically – removing any bugs or inconsistencies – instrument reliability often tends not to be reported. For Lingualyzer, we have provided several examples of instrument reliability: comparing the results of Lingualyzer with human performance (for two languages), and comparing its results with those of other tools. Evaluating the instrument reliability of a computational linguistic tool such as Lingualyzer is an immense task, which is virtually impossible to do across all languages and across all measures and values. Individual measures were validated on a representative set of sentences and (where applicable) compared to similar measures in existing tools across different genres. The validations reported in the current paper demonstrate that Lingualyzer measures are reliable. It is, however, important to stress that the measures were not validated across all 41 languages. The potential for errors in the languages not included in the validation has nevertheless been minimized: first, because the implementation is shared across languages, an error in one language would surface in multiple languages, and such errors have been removed for Dutch and English; second, a careful pass through the selection of the measures (e.g., by not considering more error-prone annotations such as dependency parses) has minimized the chance of errors. Similarly, just as it was not feasible to validate all 41 languages, not all 3118 values were individually considered. Here, too, errors occurring at one level must propagate to other levels, and careful investigation at the sentence and paragraph level has minimized (and as far as we can tell eliminated) errors at the other levels.

In addition to reporting the instrument reliability, we also reported the instrument validity of Lingualyzer by comparing its cross-linguistic output with that of a language typology. While the similarities between the hierarchical relationships from the Lingualyzer output and those from the language typology are evident, some caveats are in order. First, the typological features contain many missing binary values, so the distance for each language pair is based on different typological features. While the results are interesting and the Lingualyzer and typological dendrograms are comparable, our baseline, the typological distances, is thus at best an approximation. Moreover, the visualization through dendrograms purely illustrates similarities between languages, which do not necessarily correspond to genealogical relationships between languages. For example, while Welsh is an Indo-European language, in both dendrograms it is close to the Afro-Asiatic languages Arabic and Hebrew, possibly because these languages share some unique grammatical features, such as the widespread use of infixes. Similarly, while Romanian is overall typologically very similar to the other Romance languages, its isolation from the other Romance languages might have led to differences that are more apparent in the Lingualyzer measures than in the language typology. The analysis presented here is illustrative and should not be used as a full typology study; yet, it does provide some useful insights into similarities and differences across languages.

We hope that with the availability of Lingualyzer, the behavioral science community has a useful computational linguistic tool at its disposal. Many areas within the behavioral sciences and related fields do not necessarily have the computational linguistic expertise or programming skills to extract linguistic features from texts. Lingualyzer aims to fill this gap by providing behavioral scientists, including linguists, psycholinguists, corpus linguists, literary scholars, anthropologists, sociologists, and economists, with the opportunity to easily analyze text across different levels and a multitude of dimensions. Because of its scope, Lingualyzer can be used for a variety of purposes. Most importantly, Lingualyzer extends research that is often limited to languages spoken by a WEIRD community, and more specifically the English language community, to languages spoken by a far larger community. We specifically hope that Lingualyzer allows for novel and innovative research in the behavioral sciences, pushes the boundaries of findings obtained for one language to 40 other languages, and offers explorations of similarities and differences across those languages.

Data availability

Lingualyzer can be accessed at the following webpage: https://lingualyzer.com/ . The data used in the evaluation can also be found on this website. All other data sources, including the ones Lingualyzer uses, are reported in this article and are free to use for scientific purposes.

Change history

13 December 2023

A Correction to this paper has been published: https://doi.org/10.3758/s13428-023-02314-y

Chinese and Japanese do not use whitespace at all, while Vietnamese marks syllable boundaries, rather than word boundaries, with whitespace.

Note that language-independent measures still require a language-specific word segmentation algorithm to separate the raw text into countable word tokens. Language-independent measures, however, do not rely on linguistically informed annotations on these word tokens.
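As an illustration of this distinction, here is a minimal sketch of one commonly used language-independent measure, the type–token ratio, which operates on an already-segmented token list (the function name is ours, not Lingualyzer's). The measure itself needs no linguistic annotation, but producing the token list for a language such as Chinese would require a language-specific segmenter:

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Type-token ratio: distinct word forms divided by total tokens.

    Language-independent once tokens exist; segmentation into tokens
    is the only language-specific step.
    """
    return len(set(tokens)) / len(tokens)

# English segmentation is trivially whitespace-based:
# 6 tokens, 5 types ("the" occurs twice) -> 5/6
print(type_token_ratio("the cat sat on the mat".split()))
```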

The theoretical upper bound is a word that occurs a billion times per billion words, which corresponds to a Zipf frequency of 9. In practice, however, no language has words with a Zipf frequency of 8 or higher, which would require a word to occur at least 100 million times per billion words.
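The Zipf scale works out as follows; this is a minimal sketch of the standard definition (the base-10 logarithm of a word's occurrences per billion words), not Lingualyzer's internal implementation:

```python
import math

def zipf_frequency(count: int, corpus_size: int) -> float:
    """log10 of a word's occurrences per billion words."""
    return math.log10(count / corpus_size * 1e9)

# One occurrence per million words corresponds to Zipf 3.
print(zipf_frequency(1, 1_000_000))
# Theoretical ceiling: a billion occurrences per billion words, Zipf 9.
print(zipf_frequency(10**9, 10**9))
```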

It should be noted, however, that although we have defined 18 categories, measures are implemented for only 15 of them. The reason is not that no measures exist for the remaining three categories, but that such measures cannot be reliably implemented across languages.

See: https://stanfordnlp.github.io/stanza/available_models.html for more details about the licenses.

ChatGPT cannot be used for computing the outcomes of the measures, as it frequently yields erroneous results on high-precision tasks; it can only help with generating sentences.

Note that the statistics for the Dutch sentences and paragraphs were comparable with those for English: for number of words (M = 11.35, SD = 9.14, range 1–141), word length (M = 5.41, SD = .98, range 3–11), number of sentences (M = 1.13, SD = .70, range 1–10), and sentence length (M = 10.07, SD = 2.41, range 1–21).

Alice’s Adventures in Wonderland can be found here: https://www.gutenberg.org/ebooks/11 , and El idilio de un enfermo can be found here: https://www.gutenberg.org/ebooks/25777 .

Diana knew she wouldn’t be queen — and doubted Charles wanted the crown can be found here: https://www.washingtonpost.com/history/2023/05/06/diana-coronation-king-charles-queen/ , and El misterioso asesinato de Guillermo Castillo, el chef del pueblo can be found here: https://www.elmundo.es/espana/2023/05/05/6453e904e4d4d8c94a8b4584.html .

The translations can be found here: http://www.unicode.org/udhr/ .



Acknowledgments

This research has been made possible by funding from the European Union, OP Zuid, and the Ministry of Economic Affairs awarded to the second author. We would like to thank Kiril O. Mitev for his help with the computational implementation of the tool and Peter Hendrix for his valuable comments on early versions of the draft and the tool. The usual exculpations apply.

Open practices

Lingualyzer can be accessed free of charge at the following webpage: https://lingualyzer.com/ . We adopt an open practices approach so that all data sources that Lingualyzer uses are reported in this article and are free to be used for scientific purposes. We have furthermore made the data created for the evaluation available on the website under “Instructions” on https://lingualyzer.com/ .

Open access funding provided by University of Zurich.

Author information

Authors and Affiliations

Department of Cognitive Science & Artificial Intelligence, Tilburg University, Tilburg, Netherlands

Guido M. Linders & Max M. Louwerse

Department of Comparative Language Science, University of Zurich, Zurich, Switzerland

Guido M. Linders

Corresponding author

Correspondence to Guido M. Linders .

Ethics declarations

Conflicts of interest

There are no known conflicts of interest.

Ethics approval

Not applicable.

Consent to participate

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Linders, G.M., Louwerse, M.M. Lingualyzer: A computational linguistic tool for multilingual and multidimensional text analysis. Behav Res (2023). https://doi.org/10.3758/s13428-023-02284-1

Download citation

Accepted : 30 October 2023

Published : 29 November 2023

DOI : https://doi.org/10.3758/s13428-023-02284-1


Keywords

  • Text analysis
  • Multilingual
  • Computational linguistics
  • Quantitative linguistics
  • Cross-linguistic
The Oxford Handbook of Linguistic Analysis (2nd edn)

9 Conversation Analysis

Jack Sidnell (PhD Toronto, 1998) is an Associate Professor of Anthropology at the University of Toronto with a cross-appointment to the Department of Linguistics. His research focuses on the structures of talk and interaction. In addition to research in the Caribbean and Vietnam, he has examined talk in court and among young children. He is the author of Conversation Analysis: An Introduction (2010), the editor of Conversation Analysis: Comparative Perspectives (2009) and co-editor (with Makoto Hayashi and Geoffrey Raymond) of Conversational Repair and Human Understanding (2013) and (with Tanya Stivers) of The Handbook of Conversation Analysis (2012).

  • Published: 09 July 2015

Conversation analysis is an approach to social interaction that grew out of sociology but connects in various ways to concerns in other fields, including linguistics. Because interaction among humans is accomplished in large part through the medium of language, conversation analysis focuses primarily on talk-in-interaction. In this brief overview of the field I outline the basic methods and analytic techniques of this approach and review some of the major findings of the last forty years of research. I conclude with some discussion of the relationship between conversation analysis and linguistics.

9.1 Introduction

Conversation Analysis (hereafter CA) is an approach to language and social interaction that emerged in the mid-to-late 1960s through the collaboration of sociologists Harvey Sacks and Emmanuel Schegloff as well as a number of their students, most importantly Gail Jefferson (see Lerner 2004). Although it originated in the United States within sociology, today working conversation analysts can be found in Australia, Canada, England, Finland, France, Germany, Japan, Korea, the Netherlands, and elsewhere, in departments of anthropology, communication studies, education, and linguistics. In their earliest studies, Sacks, Schegloff, and Jefferson worked out a rigorous method for the empirical study of talk-in-interaction and, as a result, their findings have proven robust and cumulative. Indeed, these pioneering studies (e.g. Schegloff 1968; Schegloff and Sacks 1973; Sacks et al. 1974; Schegloff et al. 1977; inter alia) from the 1960s and 1970s have provided a foundation for subsequent research such that we now have a large body of strongly interlocking findings about fundamental domains of human social interaction such as turn-taking, action sequencing, and repair.

Conversation Analysis is often identified, within linguistics at least, with pragmatics or discourse analysis. However, CA differs in a basic way from these approaches in so far as it takes action in interaction as the primary focus of study rather than language per se. Because language figures so centrally in human social interaction, the vast majority of work in CA is concerned with talk. But, importantly, the ultimate goal of CA is to discover and to describe interactional rather than linguistic structure. A basic finding of CA is that interaction is in fact finely structured and, as such, amenable to formal analysis.

In the present context, it is important to note that most, if not all, work in CA is premised on the idea that a language constitutes the kind of normative, symbolic structure that linguists have described since the founding work of Saussure, Jakobson, Sapir, Bloomfield, and so on (see Dixon 2009 for an updated and detailed account of linguistics from this perspective). In common with these pioneers of linguistics and in contrast to much of the work done under the heading of generative linguistics today, CAsts typically understand language to be fundamentally social, rather than biological or mental, in nature. Linguistic rules from this perspective are first and foremost social rules which are maintained in and through talk-in-interaction.

CAsts do not propose that this linguistic structure is reducible to, and an artifact of, a more basic underlying structure of interaction (cf. Levinson 2005; on irreducibility, see Hanks 1996). That said, some recent work in conversation analysis suggests that language structure and interactional structure do exert some influence on one another. So, for instance, the grammatical patterns of a particular language may bear on the organization of turn-taking (see Fox et al. 1996; Tanaka 2000; Sidnell 2010). Running in the other direction, certain near-universal features of language (or features that exhibit highly constrained variation across languages) may reflect the basic properties of interaction in the species (Levinson 2006).

In this brief overview of CA I begin by outlining the main goals and principles of the field. I discuss how CA emerged out of a convergence of ethnomethodology, Goffman's work on social interaction, and a number of other research frameworks of the late 1960s, suggesting that a pivotal and transformative moment came when Sacks, Schegloff, and Jefferson realized that analysts could use the same methods in studying conversation that conversationalists used in producing and understanding it. I then turn to consider a single fragment of conversation in some detail, suggesting that it, or any other such fragment, can be seen as the product of multiple, intersecting "machineries" or "organizations of practice." In the final section I consider some ways research in CA bears on a few central issues in linguistics.

9.2 A Brief History and Some Key Ideas

The history of CA begins with the sociologists Erving Goffman and Harold Garfinkel. Goffman's highly original and innovative move was to direct attention to the fundamentally social character of co-present interaction—the ordinary and extraordinary ways in which people interact with one another (see especially Goffman 1964, 1981). Goffman insisted that this, what he (1983) later described as the "interaction order," constituted a social institution that both formed the foundation of society at large and exhibited properties specific to it. Very early in his career (e.g. Goffman 1957), Goffman showed that interaction constituted its own system with its own specific properties, quite irreducible to anything else, be that language, individual psychology, culture, or "external characteristics" such as race, class, and gender.

In a more or less independent but parallel movement, in the late 1950s and early 1960s, Harold Garfinkel was developing a critique of mainstream sociological thinking that was to become ethnomethodology (see Garfinkel 1967 , 1974 ). Garfinkel challenged the conventional wisdom of the time by arguing that, to the extent that social life is regulated by norms, this rests upon a foundation of practical reasoning. People, Garfinkel suggested, must determine what norms, precedents, traditions, and so on apply to any given situation. As such, an explanation of human conduct that involves citing the rules or norms being followed is obviously inadequate since the question remains as to how it was decided that these were the relevant rules or norms to follow! By the early to mid-1960s, Harvey Sacks was deeply immersed in themes that Garfinkel and Goffman had developed and it is common and not entirely inaccurate to say that conversation analysis emerged as a synthesis of these two currents—it was the study of practical reasoning (à la Garfinkel) applied to the special and particular topic of social interaction (à la Goffman).

One of the key insights of early CA was that conversationalists’ methods of practical reasoning are founded upon the unique properties of conversation as a system. For instance, conversationalists inspect next turns to see if and how their own talk has been understood (see Sacks et al. 1974 ). That is, they exploit the systematic properties of conversation in reasoning about it. As analysts we can exploit the same resource. Consider the following fragment from one of Sacks’ recordings of the Group Therapy Sessions.

(Sacks 1995a, vI: 281) [transcript not reproduced]

Sacks (1995a, 1995b) draws attention to "the prima facie evidence afforded by a subsequent speaker's talk" in his analysis of the therapist's turns at 8 and 11 as recognizable introductions (Schegloff 1992: xliii). Thus, when, at line 12, Roger responds to the:

utterance with his name […] not with “What” [as in an answer to a summons], indeed not with an utterance to the therapist at all, but with a greeting to the newly arrived Jim, he shows himself (to the others there assembled as well as to us, the analytic overhearers) to have attended and analyzed the earlier talk, to have understood that an introduction sequence was being launched, and to be prepared to participate by initiating a greeting exchange in the slot in which it is he who is being introduced. (Schegloff 1992: xliii)

Thus a response displays a hearing or analysis of the utterance to which it responds. Such a hearing or analysis is “publicly available as the means by which previous speakers can determine how they were understood” (Heritage 1984). The third position in a sequence is then a place to accept the recipient’s displayed understanding or, alternatively, to repair it. Consider the following case taken from a talk show in which Ellen DeGeneres is interviewing Rashida Jones. Where this fragment begins, DeGeneres is raising a next topic: Jones’s new television show with comedian Amy Poehler, Parks and Recreation. DeGeneres initiates the topic by inviting Jones to tell the audience about the show. She then gives the title before concluding the turn with “an’ you an’ Amy Poehler how—how great is that.” Notice then that this final part of the turn can be heard as a real information question—a request for Jones to specify how great “that” is. At the same time, this construction “How X is that?” is a familiar, idiomatic expression that, by virtue of the presupposition it carries, conveys “it’s X” or, in this case, “it’s great.” Notice what happens.

The talk at line 03 (the A arrow) takes the form of a wh-question (“How great is that.”) and Rashida Jones treats it as one by answering “It’s pretty great” (at the B arrow). This response, by treating “How great is that.=” as an information-requesting question, reveals a problematic understanding which Ellen subsequently goes on to repair at lines 09–10 and 13 (the C arrows). There are a few details of the turn starting at line 10 and continuing on line 13 of which we should take note. First, by emphasizing the first person singular pronoun (“I”) Ellen implies a contrast with “you” (so Ellen, not Jones). Second, with “I: say it’s really great.” Ellen makes the illocutionary force of her utterance explicit (i.e. she is “saying” not “asking”). Here then Ellen indicates that “How great is that.=” was not in fact meant as a question but rather as an assertion (or more specifically an assessment).2

So it is the very sequential organization of turns-at-talk in conversation that provides for the maintenance of intersubjectivity between persons. Heritage writes:

By means of this framework, speakers are released from what would otherwise be an endless task of confirming and reconfirming their understandings of each other’s actions …a context of publicly displayed and continuously up-dated intersubjective understandings is systematically sustained. … Mutual understanding is thus displayed … ‘incarnately’ in the sequentially organized details of conversational interaction. (Heritage 1984: 259)

In his lectures Sacks made a series of penetrating arguments about the importance of basing a study of conversation on recorded examples (see Sacks 1984; Heritage 1984; Jefferson 1985 for discussion of this issue). This is not simply a matter of finding examples that will illustrate the point one is trying to make, but rather of beginning with the stubborn, recalcitrant, complex details of actual conversation and using them to locate and define whatever argument one ends up with. Recordings provided Sacks with a terra firma on which to base a rigorously empirical discipline in which any analysis was accountable to the details of actual occurrences in the world. He writes:

I started to work with tape-recorded conversations. Such materials had a single virtue, that I could replay them. I could transcribe them somewhat and study them extendedly—however long it might take. The tape-recorded materials constituted a “good-enough” record of what happened. Other things, to be sure, happened, but at least what was on the tape had happened. (Sacks 1984: 26)

As Sacks goes on to note, we do not have very good intuitions about conversation (as we seem to for syntax, which is apparently the contrast he was making), nor are we capable of remembering or imagining the details of what happens in conversation. For these reasons and others, conversation analysts insist on working from actual recordings of conversation rather than imagined, remembered, or experimentally produced examples.

9.3 Intersecting Organizations of Practices

Given these considerations, we should now turn to some actual bit of recorded conversation and attempt to analyze it, even if, within the constraints of an overview chapter, we can only give it cursory attention. The following is the transcript of the first few seconds of a telephone conversation between Deb, a woman in her fifties, and her boyfriend, Dick. The call comes the morning after Deb had hosted a party with some guests attending from out of town.

I’m going to suggest that this fragment of conversation—indeed, any fragment of conversation—can be usefully understood as the product of multiple intersecting “machineries” or “organizations of practice.” A term like “machineries” or a phrase such as “organizations of practice” may seem a bit obscure, but what I mean is actually fairly straightforward. Basically, there is an organized set of practices involved in, first, getting and, second, constructing a turn; another such organized set of practices involved in producing a sequence of actions; another set of practices involved in the initiation and execution of repair; and so on. Sacks sometimes used the metaphor of machines or machinery to describe this.

In a way, our aim is … to get into a position to transform, in what I figure is almost a literal, physical sense, our view of what happened here as some interaction that could be treated as the thing we’re studying, to interactions being spewed out by machinery, the machinery being what we’re trying to find; where, in order to find it we’ve got to get a whole bunch of its products. (Sacks 1995a: 169)

If we think about this little fragment in these terms—that is, as the product of multiple, simultaneously operative and relevant organizations of practice or “machineries” for short—we can get some good analytic leverage on what may at first seem quite opaque.

9.3.1 Overall Structural Organization

Let us start by noting that there is an organization relating to occasions or encounters taken as wholes—this is what we refer to as “overall structural organization” or, simply, “overall organization.” For a given occasion there are specific places within it at which particular actions are relevantly done. An obvious example is that greetings are properly done at the beginning of an encounter rather than at its conclusion. Similarly, introductions between participants who do not know one another are relevant at the outset of an exchange. Sometime after an event—a job interview, an exam, a dinner party, etc.—a discussion or report of “how it went” may become relevant. And, of course, this is precisely what Deb understands Dick to be inviting at line 06 with “Howditgo.”

There is another sense in which the overall organization of talk bears on what happens here. Think then about where this question “Howditgo” comes not in relation to these people’s lives (i.e. after Dick supposes the party is over) but rather in relation to this call. Specifically, the talk that immediately precedes this question is devoted to a series of tasks—getting the attention of the recipient via the ringing of the telephone and subsequently displaying that attention via “hello” (i.e. summons–answer, see Schegloff 1968), identifying, recognizing, and aligning the participants (Schegloff 1979), and so-called “personal state inquiries” (Sacks 1975; Schegloff 1986). Taken together, we can see that the talk up to and including line 05 constitutes an “opening.” So what does that mean for the utterance we are now concerned with? Where can this “howditgo” be said to occur? Briefly, this is what Schegloff calls “anchor position”—precisely because whatever is said here is vulnerable to being heard as “why I’m calling,” as “the reason for the call,” and thus as something its speaker accords some importance (see Schegloff 1986; Couper-Kuhlen 2001). We cannot go into a detailed analysis of this here, but let us note that where participants reach this position (and there are many calls in which they never do for one reason or another) and the caller does not indicate what they are calling about, that may be oriented to as an absence. Consider then the following opening from a conversation between two close friends:

In this fragment, Hyla has called Nancy. A reciprocal exchange of personal state inquiries ends with Nancy’s assessment “good” at line 08. Here then the participants have reached “anchor position” but instead of the caller raising a first topic there is silence and some audible breathing from Hyla at lines 09–10. This occasions Nancy’s “What’s doin,” at line 11. With, “What’s doin,” Nancy invites Hyla (the caller) to raise a first topic and thereby displays an orientation to this as a place to do just that. And notice when Hyla responds with “Ah nothin” Nancy pursues a specific topic by asking “Y’didn’t go meet Grahame?”

9.3.2 Turn Organization

So those are two ways in which this little fragment of conversation or some part of it (e.g. the utterance “Howditgo”) is organized by reference to its place in a larger overall structure. Now let us consider the same bit of talk in terms of turn-taking and turn construction. Although Dick’s question is made up of four words, in a basic respect, this is produced as a single unit. Of course it is a single sentence but, more relevant for current purposes, it is a single turn. In their classic paper on turn-taking, Sacks et al. (1974) argued that turns at talk are made up of turn constructional units (TCUs) and that, in English at least, there is a sharply delimited set of possible unit-types. In English, TCUs are single words, phrases, clauses, and sentences. Consider the following example.

Shelley’s talk at line 36 exemplifies the use of a sentential turn constructional unit. Debbie’s turns at lines 37 and 39 are both composed of single lexical items. Shelley’s turn at 38 illustrates the use of a single phrase to construct a turn. And going back to our example: “Howdit go?” is similarly a sentential turn constructional unit.

Sacks et al. (1974: 702) suggested that these TCUs have a feature of “projectability.” They write that lexical, phrasal, clausal, and sentential TCUs “allow a projection of the unit-type under way, and what, roughly, it will take for an instance of that unit-type to be completed.” This means, of course, that a recipient (and potential next speaker) need not wait for a current speaker to come to the actual completion of her talk before starting their own turn. Rather, because TCUs have a feature of projectability, next speakers/recipients can anticipate—or project—possible points of completion within the emerging course of talk and target those points as places to start their own contribution. We can see this very clearly in an example such as the following:

In this example, at lines 05–06, Parky begins an incipient next turn at the first point of possible completion in Old Man’s talk. Parky starts up here, and again at the next point of possible completion, not by virtue of any silence (by the time he starts there is no hearable silence) but by virtue of the projected possible completion of the turn constructional unit, which constitutes a potential transition relevance place. Evidence such as this leads to the conclusion that “transfer of speakership is coordinated by reference to such transition-relevance places” (Sacks et al. 1974: 703).

Returning to the fragment from the conversation between Deb and Dick, notice that the transitions between speakers are managed in such a way as to minimize both gap and overlap. We now have a partial account of how participants are able to achieve this. Co-participants monitor the syntactic, prosodic, and, broadly speaking, pragmatic features of the current turn to find that it is about to begin, now beginning, continuing, now coming to completion—they anticipate, that is, points at which it is possibly complete (see also Ford et al. 1996). There is of course much more that could relevantly be said about this fragment along these lines, but since this is merely meant to introduce the different “organizations of practice” that go into a single fragment, we now move on to consider the organization of talk into sequences. Before we are done we will return briefly to issues of turn-taking.

9.3.3 Action and Sequence Organization

It is obvious enough that in conversation actions often come in pairs and that a first action such as a complaint, a request, or an invitation makes relevant a next, responsive action (or a delimited range of actions). If that action is not produced it can be found, by the participants, to be missing, whereas any number of other things that did not happen are nevertheless not missing in the same sense. Schegloff (1968) described this relationship between a first and second action as one of “conditional relevance” and the unit itself as an “adjacency pair” (see Schegloff and Sacks 1973).

What kind of organization is the adjacency pair? It is neither a statistical probability nor a categorical imperative. Rather, the organization described is a norm to which conversationalists hold one another accountable. The normative character of the adjacency pair is displayed in participants’ own conduct in interaction. For example, as the principle of conditional relevance implies, when a question does not receive an answer, questioners treat the answer as “noticeably” absent. A questioner’s orientation to a missing answer can be seen in three commonly produced types of subsequent conduct: pursuit, inference, and report. In the following example (from Drew 1981), Mother asks the child, Roger, what time it is.

After Roger produces something other than an answer at line 2, Mother repeats the question at line 3. Here then a failure to answer prompts the pursuit of a response. When this second question is met with three seconds of silence, Mother transforms the question, now asking, “what number’s that?” Notice that the first question, “What’s the time?,” poses a complex, multi-faceted task for the child: he must first identify the numbers to which the hands are pointing and subsequently use those numbers to calculate the time. In response to a failure to answer this question, Mother takes this complex task and breaks it down into components. Thus, in her subsequent conduct Mother displays an inference that the child did not answer because he was not able to do so. Although it does not happen here, questioners may also report an absent answer, saying such things as “you are not answering my question,” “he didn’t answer the question,” or “she didn’t reply,” etc. In public inquiries, for instance, lawyers commonly suggest that the witness is not answering the question that has been asked of them (see Sidnell 2010).

Would-be answerers also orient to missing answers. Thus, the non-occurrence of an answer may occasion an account for not answering. One particularly common account for not answering is not knowing, as illustrated in Extracts 8 and 9 (see Heritage 1984).

Here, the recipient of a question accounts for not answering by saying s/he does not know. That is, in (8) Dee does not simply not answer the question—she treats not answering as something worthy of explanation and provides that explanation in the form of a claim not to know, as is also the case in (9). Further evidence of the participants’ own orientations to the norm-violation inherent in a failure to provide an answer is found in cases such as (10). Here the operator not only provides an account (explaining, in effect, that it is not her job to know the information requested) but furthermore apologizes for the failure to answer with “I’m sorry.”

Here then we have evidence, internal to these cases, for the claim that a question imposes on its recipient an obligation to provide an answer. Orientation to the norm is displayed in the participants’ own conduct of pursuing an answer, drawing inferences from an answer’s absence and accounting for the absence by claiming not to know. The point here is that the first pair part of an adjacency pair has the capacity to make some particular types of conduct noticeably or relevantly absent such that their non-occurrence is just as much an event as their occurrence.

9.3.4 Repair Organization

Whenever persons talk together they encounter problems of speaking, hearing, and/or understanding. Speakers are fallible and even the most eloquent among us sometimes make mistakes. The environments in which we interact are sometimes characterized by the presence of ambient noise. Recipients may be distracted or may suffer from hearing loss. A word may not be known by a recipient or it may fail to uniquely identify a referent. A lexical expression or grammatical construction may be ambiguous. These factors and others result in essentially ubiquitous troubles.

What we term “repair” refers to an organized set of practices through which participants are able to address and potentially resolve such troubles in the course of interaction—repair is a self-righting mechanism usable wherever troubles of speaking, hearing, and understanding are encountered, but also usable elsewhere and for other purposes than simply fixing problems.

Repair is organized in three basic ways. First, it is organized by a distinction between repair initiation and repair execution (or simply initiation and repair proper). Second, it is organized by position, where position is calibrated relative to the source of trouble: same turn, transition space between turns, next turn, third position. Third, it is organized by a distinction between self (i.e. the one who produced the trouble source) and other. With these distinctions we can describe the basic organization of repair. By virtue of the turn-taking system—which allocates to the current speaker the right to produce a single TCU through to its first point of possible completion—the speaker of the trouble source has the first chance to initiate and to execute repair. Consider the quite subtle case in the example from Deb and Dick. In the second unit here, Deb produces a minor hitch over the word after “everybody” (possibly going for “stayed”) and self-repairs with “still here.”

Or in (12) Boo is on the way to saying “Friday” but cuts off its production and ends up saying “Sunday.”

We can make several observations based on this case. First, the repair is “premonitored” by a hesitation with “ah” before the word that eventually becomes the trouble source. Second, the repair is initiated by cut-off (phonetically close to a glottal stop) indicated by the dash in “Fri:-.” Third, the repair itself is “framed” by a repetition of the preposition “on.” By framing the repair in this way the speaker locates where in the prior talk the replacement belongs. In this case then the repair replaces a word in the prior talk. In other cases, the repair operates not to replace but rather to insert a word. For instance in the following, Bee inserts “Fat ol’” into the prior talk, resulting in the referential expression “Fat ol’ Vivian.”

If the speaker reaches the possible completion of a TCU, she may initiate repair in the transition space before the next speaker begins. Consider for instance the following case from a radio interview:

Here the interviewer asks “So what has the rest of the press gallery: (.) thought about this”. But before the interviewee can answer, she goes on to replace “thought” with “done.” Notice again the way the repair is framed by a repeat (“about this”). The repair here is done in the transition space between turns.

In the next turn, the other is presented with her first opportunity to initiate repair, and by and large that is all the other does—that is, although the other is likely often capable of repairing the trouble, typically and normatively she only initiates repair and leaves it up to the speaker of the trouble source to actually fix the problem (Schegloff et al. 1977). The other has available a range of formats by which repair may be initiated in the next turn. These can be arranged on a scale according to their relative strength in locating the repairable. So there are repair initiation formats which do little more than indicate the presence of a trouble in the prior turn (see Drew 1997). This is illustrated by the following:

Here, after Jim assesses the height of the waves by saying “Christ thirty fee:t.” at line 05, Frank initiates repair with “He::h?” (line 09). Jim then redoes the assessment in a modified form, saying “Thirty fee(h)eet”. We can notice then that the repair initiation—“He::h?”—indicates only that there is a problem with the prior talk, and not what the nature of the problem is or where, specifically, it is located.

In contrast, there are repair initiation formats that precisely locate the trouble source and, at the same time, propose a candidate understanding. Consider line 05 of the following example from a conversation between brother Stan and sister Joyce. When Joyce suggests a particular Bullocks location where Stan might be able to find a hat, Stan initiates repair of the referential expression with “Bullocks? ya mean that one right u:m (1.1) tch! (.) right by thee: u:m (.) whazit the plaza? theatre::=”. Here then he offers an understanding of what Joyce means, prefaced by “ya mean”.

And we can go on to note that the turn here is itself marked by a self-repair operation which we describe as searching for a word: “that one right u:m (1.1) tch! (.) right by thee: u:m (.) whazit the plaza? theatre::=”. Stan eventually finds the word—“plaza theatre”—and Joyce confirms the candidate understanding with “=Uh huh,” in line 07. Notice further that when Stan pursues the issue in line 11, asking “Why that Bullocks. Is there something about it?”, Joyce attempts to answer the question, saying they have some pretty nice things. Stan then treats his own question as a trouble source and repairs it in third position, replacing what he said in line 11 with, “Well I mean uh: do they have a good selection of hats?”

At each position within the unfolding structure of interaction, participants are presented with an opportunity to address potential problems of speaking, hearing, and/or understanding. This set of practices is clearly crucial to the maintenance of understanding in conversation and other forms of interaction. We can also see that human language would be very different from what it is if its users did not have recourse to the practices of repair—for instance, the presence of homonyms and ambiguous grammatical constructions would threaten to derail even the simplest exchanges.

9.3.5 Intersecting Organizations in a Single Case

We can see that “Howdit go” is a sequence initiating first action—the first part of an adjacency pair which makes relevant a second, here an answer. Before turning to consider the response that is produced, we need to first consider in some more detail the design of this question. Note specifically that Dick employs the past tense thus locating the party at a time prior to the point at which this conversation is taking place. In this context, past tense conveys that the thing being talked about (the “it”/the party) is over and complete.

So there is a problem with the way in which Dick has formulated his question since, as it turns out, it is not quite right to say that the party is over (the guests have stayed and thereby continued the event). At the same time, the question is answerable as it stands—Dick has asked how it went, the party sensu stricto is over. In asking this question Dick creates a position for Deb to produce an answer. Thus there are two different actions relevant next:

Answer the question.

Address the problem with how the question has been formulated.

By virtue of the conditional relevance established by the question, anything that occurs in this slot may be inspected for how it answers the question (e.g. “They’re still here” meaning it went so well, they didn’t want to leave). If whatever is in the sequentially next position after a question cannot be heard as answering, it may be inspected by the recipient for how it accounts for not answering the question (e.g. “They’re still here” meaning I can’t talk about it right now). In short, anything that occurs here can be inspected for its relevance to the question asked and can thus serve as the basis for further inference. Imagine this pair of utterances without the “just great”— such that “everybody’s still here” comes as a response to “Howdit go?” Simplifying things somewhat, the problem with this is that “everybody’s still here” could easily be heard by a recipient as implying “it didn’t go well” or “it went too long” or “I’m trying to get them out.” There is then a built-in reason for answering this question in a straightforward way simply because any other way of responding might suggest a negative assessment and invite further inquiries.

At the same time, if she chooses simply to answer Dick’s question and respond with “just great” alone, Deb lets a mistaken assumption go unchallenged and uncorrected. This too is something to be avoided. As we’ve already noted, there are certain things that become relevant at the completion of an event—a report to interested parties, an assessment, the reporting of news, and so on. Dick’s question, by locating the event in the past, proposes the relevance of those activities; indeed, it invites them. But to the extent that the event is not, in fact, over, these activities are not the relevant ones to do. There are then a number of intersecting reasons why Deb would want to produce this assessment, “just great,” first, as a response to Dick’s question, while at the same time not allowing the misunderstanding contained in Dick’s question to pass uncorrected.

So what in fact happens? Deb produces the assessment, “Oh: just grea:t,” without releasing the obstruent at the end of “just great.” Sounds (i.e. phonetic units) such as the last consonant in “great” can be produced either with or without a release of air. Here, rather than produce this last sound (the aspirated release) of the last segment (“t”) of the last word (“great”) of this turn unit, Deb moves immediately into the first sound of “everybody.”3 So one practice for talking through a possible completion is to withhold the production of the actual completion of the turn constructional unit and instead move directly into the next component of the turn. A speaker can thus talk in such a way that a projectable point of completion never actually occurs.

In the example with Deb and Dick, we can see that Deb uses this practice to get two relevant tasks done in a single turn-at-talk without risking the possibility of Dick self-selecting at the first possible completion. We thus have some interactional motivation for this compressed transition space. Moreover, we can see that the organization of action into sequences, the organization of talk into turns (and into TCUs), and the organization of talk into an overall structure do not operate independently of one another. Although we can think of these heuristically as semi-autonomous organizations, in practice they are thoroughly interdigitated. This is what I mean when I say the utterance (or the turn-at-talk) is a product of multiple, intersecting, concurrently operative organizations of practice or machineries.

9.4 Interaction and Language Structure

In the remainder of this chapter I will attempt to describe some areas of overlapping interest between CA and linguistics. I will concentrate on intersections of turn organization and grammar or sentential syntax. A more thorough review would also discuss work on prosody in conversation as well as that on semantics and reference (see Enfield 2012; Fox et al. 2012; Walker 2012).

We have already seen that, according to Sacks et al. (1974), sentence grammar plays a crucial role in the projection of a turn’s possible completion. Along the same lines we can note a number of other ways in which conversationalists draw on their knowledge of grammar in order to accomplish a range of turn-construction tasks (Ono and Thompson 1996).

For instance, Gene Lerner (1991, 1996a) has described the syntactic resources that the recipient of some bit of talk can use to project, and to produce, its completion. Lerner shows that there are particular grammatical structures which provide resources to a recipient and which they routinely exploit in producing such completions. For instance, there are single-TCU turns which consist of two components, for example an if-clause (protasis) and a then-clause (apodosis).

Similarly with when-then structures as in:

And there are also cases in which the initial component is a matrix clause projecting a finite complement:

In quite complex ways, then, conversationalists treat the normative structures of grammar as a resource to build and to recognize turns-at-talk. Moreover, the evidence suggests that turn building—which encapsulates the use of sentence grammar in the ways just described—is a product of interaction between speaker and recipient. This can be seen in a variety of ways (see Sidnell 2010 for additional evidence), but it goes to a fundamental point about the nature of language. Specifically, research in CA suggests that turns, and the sentences which they house, are not constructed in the speaker’s mind and simply “delivered” by the mouth. Rather, as a speaker is producing a sentence (or a turn) she is monitoring its recipient. If that recipient is not gazing at the speaker (Goodwin 1979), or gasps as the turn is being produced, or, alternatively, does nothing where she might or should have, the speaker may alter the course of the turn, or repair it, or extend it, and so on. As such, actual sentences are not the product of isolated individual speakers but of an interaction between speaker and recipient. As an illustrative example, consider the following case from a telephone call in which Dee is telling cousin Mark how much her daughter and son-in-law have had to pay for a house:

At lines 39–42 in extract 21, Dee tells Mark that her daughter and son-in-law are paying 500 pounds a month in mortgage installments. There are several places in the course of this turn at which Mark might have produced an assessment conveying his understanding that five hundred pounds is “a lot of money” (see Goodwin 1986). By the time Dee has produced the first syllable of “hundred” the content of the turn is projectable. At the completion of “pounds,” Dee has come to a point of possible turn completion, and again at the completion of “month.” However, in both cases, when Dee reaches these places within the unfolding course of her turn she has no evidence to suggest that Mark has recognized something assessable in her talk. When Mark fails to produce the assessment, Dee delays the progress of the turn by producing first “um:” and then a micropause (lines 39–40). Such features of the talk may alert the recipient to the fact that a display of recognition is missing, while at the same time extending the turn-at-talk so as to allow the recipient further opportunities to produce such a response before the current speaker’s turn reaches completion. Notice that immediately after Mark produces a gasp (which is subsequently elaborated with “’OW TH’ ELL D’ YO’U ↑DO ↓it”) Dee completes the turn constructional unit with “repayments.” We can see then that Dee’s turn is carefully engineered to elicit a specific response at a particular place and that it is adjusted to ensure that such a response is indeed produced. The more general point here is thus that a single turn-at-talk is the product of an interaction between speaker and recipient.

Participants in conversation clearly rely on their tacit knowledge of grammar both to construct turns and to analyze them in the course of their production—to find that they are now beginning, now continuing, now nearing completion. But participants also adapt the normative structures of grammar to interactional ends in various ways. Consider for instance the following question asked in the midst of a telephone call:

Here what apparently begins as a Yes–No question is altered in the course of its production such that the utterance ultimately produced is an in-situ wh-question. Whereas the Yes–No question “Are you going to be at my house on Sunday” asks whether A will be present, the wh-question “What time are you going to be at my house on Sunday” presupposes it. So this appears to be a repair operation in which the speaker adjusts the turn in progress to make it more accurately reflect what she assumes is taken for granted or already in common ground—that is, that A will come to her house. The point for present purposes is that B ends up producing a sentence which violates a grammatical norm/rule in English which links morphosyntactic inversion with wh-movement (so in-situ wh-questions do not feature inversion) (see Lakoff 1974). However, this is not oriented to as a problem or error by either participant. Indeed, the response that A provides, “what time am I to be there at,” orients to the very difference in presupposition that we noted between the Yes–No question and the wh-question—that is, by employing the BE + [infinitive] construction in “what time am I to be there at” A confers on B an entitlement not only to presuppose her attendance but, moreover, to specify the details of her arrival.

Another example of the way in which normative grammatical rules are adapted to interactional purposes has been described in work by Sun-Young Oh on zero-anaphora in English (Oh 2005, 2006). Native intuitions and most linguistic descriptions alike suggest that an overt subject is required for finite declarative (as opposed to imperative) sentences in English. Generative accounts of grammar propose that the “pro-drop parameter” for English disallows “null-subjects” in contrast to languages such as Italian, Japanese, Korean, and so on. Where the subject of a declarative sentence is nevertheless not produced, this is explained according to a “situational ellipsis” in which weakly stressed, initial words of a sentence are elided via a process of phonological reduction where their referents are recoverable from the “extralinguistic context” (Quirk et al. 1985; see Oh 2006). Through a detailed distributional analysis, Oh shows that zero-anaphora (in subject position) is a stable practice of speaking deployed to achieve a delimited range of tasks in conversation and not simply the product of phonological reduction. Specifically, Oh (2005, 2006) shows that the practice is used in:

Second sayings: A zero anaphora may be used where a pair of linked turn constructional units are produced in which the second is a resaying of the first.

Pursuing recognition display: Where a speaker is pursuing a display of recognition of a referent from a recipient, a subsequent description of that referent may be produced in a clause with zero-anaphora.

Resumption of prior TCU following parenthetical: Where a TCU is resumed following a parenthetical insert, the resumption may be produced as a clause with zero-anaphora.

Highlighting the maximum continuity of the actions/events being described: Where the speaker is concerned to highlight the continuity of the events being described, a zero-anaphora may be used.

Avoiding reference-form selection: Where a speaker is faced with a choice between alternative reference forms, s/he may employ zero-anaphora and thereby avoid having to select one or the other.

This last practice occurs across a wide range of interactional circumstances. Consider, as an example, the following case in which wife Linda has called husband Jerry to release him from a request she had made earlier and also, perhaps, to remind him of a social obligation to which they, as a couple, are committed for the evening. At line 01 Jerry complains that he had a chance to work overtime that evening. The utterance is clearly a complaint, as evidenced by the stance-marking “boy” which begins the turn, the formulation of the time as four or five hours, which suggests a lot (see note 4), as well as the use of the construction “had a chance to” as opposed to “was going to,” “might have,” etc. A complaint necessarily involves someone who suffered an unhappy consequence (e.g. “There’s no more cake left!”) and, often at least, someone who caused the situation (“You ate all the cake!”). Here both aspects of the complainable matter are somewhat unclear. Although Linda is reminding Jerry of the obligation that will prevent him from working overtime, it is not entirely obvious whether she is responsible for making the plan or whether these are primarily his friends or hers (or equally friends of both). More importantly, there is some ambiguity as to who stands to lose by Jerry’s not working. Given that Linda and Jerry are a married couple, it is likely that the financial well-being of one cannot be disentangled from that of the other. So consider in this respect the talk at line 20. Here, after Linda has first sympathized (line 10) and subsequently proposed a remedy to the complaint (line 12), Jerry reinvokes its relevance by articulating the unhappy consequence at line 20.

Jerry’s “c’da used the money” at line 20 is a finite declarative clause and thus, according to most descriptions of English, should have an overt subject. Notice, though, that if he were to have produced an overt subject, Jerry would have been forced to select between saying “we c’da used the money” or “I c’da used the money.” The latter would make little sense in this context, given that Jerry is talking to his own wife. Alternatively, it might have led Linda to wonder whether Jerry was squirreling money away. If, on the other hand, Jerry were to have said “we c’da used the money”, he would have undercut the grounds for the complaint he is trying to bring off by implying that Linda has also suffered by his not being able to work overtime. The solution for Jerry is simply to produce the turn without an overtly expressed subject.

Participants in interaction clearly orient to the grammar of the language they are speaking as a system of norms that in some sense constrains what they may do. We see this not only in the fact that speakers, typically, construct turns at talk that accord with these norms but also in the response that norm-violating talk elicits—for instance repair initiation and correction. At the same time, speakers routinely “work around” the normative constraints imposed by a given grammatical system to suit their interactional and communicative purposes. This raises fundamental questions about the nature of the linguistic system—just what kind of a thing it is—answers to which are unfortunately well beyond the scope of the present chapter.

9.4.1 Interaction and Language Structure: A Comparative, Typological Perspective

So far we have considered only work on English. Recently conversation analysts have begun to ask whether well-established differences in the grammatical and lexical structure of languages have any consequences for the organization of talk-in-interaction. The basic issue here may be summarized as follows:

whatever happens in interaction happens through the medium of some specific set of locally available semiotic resources…. conversation analysts have shown that actions in talk-in-interaction are formed through the use of distinctive prosodic patterns, lexical collocations, word order patterns as well as language-specific objects…. Of course, these semiotic resources vary significantly and systematically across different languages and communities…. Because every turn-at-talk is fashioned out of the linguistic resources of some particular language, the rich and enduring semiotic structures of language must be consequential in a basic way for social interaction. So although the problems are generic and the abilities apparently universal, the actual forms that interaction takes are shaped by and adapted to the particular resources that are locally available for their expression. ( Sidnell 2009 b : 3–4)

Initial attempts to address this issue focused largely on Japanese for at least two reasons. First, by the mid-1990s, there were several conversation analysts who were also native speakers of Japanese. Second, Japanese, with its agglutinating morphology, elaborate system of particles, and verb-final basic word order, differs from English in ways that could potentially be quite consequential for the organization of interaction and specifically for turn-projection and thus turn-taking (see Fox et al. 1996 ; Tanaka 2000 ).

Of the many studies that have been published since the late 1990s I will focus on just one. Hayashi and Hayano (2013) note that “[i]n every language for which we have adequate description, speakers have available a relatively stable set of turn-constructional practices that can be used to initiate repair on an utterance produced by a prior speaker” but that “the formatting of initiator techniques is sensitive to the grammatical inventory of the language in which they are produced” (2013: 293). These authors go on to contrast an other-initiation format in Japanese which they term “proffering an insertable element” (PIE) with the English practice which Sacks (1995a) described as an “appendor question.” As an example of the latter consider:

Here Dan checks his understanding of the reference to “they” in Roger’s turn by producing talk that is grammatically continuous with the trouble source turn. The case below is an example of the practice Hayashi and Hayano describe as “proffering an insertable element” (PIE):

[BB] ((A conversation between a barber and his customer. ‘Backward sham-poo’ in line 2 refers to the method of shampooing with the customer reclining backwards into a sink while facing up.)) 5

Like the appendor question in (24), the customer’s “bakkushanpuu o:?,” articulates a candidate understanding of the prior utterance—specifically what it is that elderly people dislike. The authors note that “the customer’s utterance is formatted in such a way as to be structurally insertable into the barber’s turn in line 1, as in yappari nenpaisha no hito bakkushanpuu o iyagaru yone .”

While the English and Japanese practices are similar in that both initiate repair by articulating a candidate understanding and by doing so with talk that is grammatically dependent upon the turn that contains the trouble source, they also differ in a number of ways. These differences reflect basic differences in the structure of English and Japanese. Appendor questions exploit the fact that “English does not mark the end of a syntactic unit” such that elements can be tacked on indefinitely (at least in principle) to recomplete the preceding unit. In Japanese, on the other hand, with its predicate-final structure, “closure of a clausal TCU is strongly marked with the clause-final predicate…. Thus, additional elements tacked on to the end of a preceding clausal TCU are ‘out of place’ in most cases because their ‘canonical’ position is before the clause-final predicate” ( Hayashi and Hayano 2013 : 300).

Moreover, appendor questions typically take the form of grammatically “optional” adjuncts (e.g. prepositional phrases) whereas PIEs include not only adjuncts, but also “core arguments” of the clause. In (19), for instance, “backward shampoo” is the direct object of the clause. The authors explain:

This difference stems from the fact that clauses in Japanese can be syntactically complete with unarticulated but contextually-recoverable core arguments (i.e., so-called ellipsis or zero-anaphora), whereas in English core arguments are typically expressed overtly and, in fact, this may be required if the turn is to be heard as syntactically complete. ( Hayashi and Hayano 2013 : 294)

Hayashi and Hayano’s study thus illustrates how the expression or realization of the generic organization of other-initiated repair is shaped by the available grammatical resources of particular languages. The contrasting forms that repair initiation takes in English and Japanese clearly reflect structural differences between those two languages in terms of clause structure, basic word order patterns, and the degree to which overt core arguments are required.

As we have seen, turns-at-talk are produced to accomplish actions—to question, to tell, to complain, to excuse, to agree, and so on. We can ask then whether the language-specific patterns described here and elsewhere might have some bearing on the actions they are used to implement resulting in a form of linguistic relativity (see Sidnell and Enfield 2012 for an initial attempt to address this issue).

9.5 Conclusion

Given the constraints of an overview chapter, I have not been able to describe or properly exemplify conversation analytic methods (see Sidnell 2012 ). This involves careful analysis of multiple instances across a collection so as to reveal the context-independent and generic features of a practice or phenomenon. Instead, in this chapter, I have tried to review some of the basic findings of CA and illustrate these with particularly clear examples. Those who wish to further explore CA would do well to read a study (such as Schegloff 1996 ) that will give a better sense of what is involved in developing an analysis of some particular practice. In the preceding discussion I have tried to introduce some of the main concerns of CA with a focus on the guiding principles and underlying assumptions of analysis, the key findings relating to different domains of organization (e.g. overall structural organization, turn-taking organization, sequence organization), and some intersections with topics in linguistics.

Examples are presented using the transcription conventions originally developed by Gail Jefferson. For present purposes, the most important symbols are the period (“.”), which indicates falling and final intonation; the question mark (“?”), indicating rising intonation; and brackets (“[” and “]”), marking the onset and resolution of overlapping talk between two speakers. Equal signs, which come in pairs—one at the end of a line and another at the start of the next line or one shortly thereafter—indicate that the second line followed the first with no discernible silence between them, i.e. it was ‘latched’ to it. Numbers in parentheses (e.g. (0.5)) indicate silence, represented in tenths of a second. Finally, colons indicate prolongation or stretching of the sound preceding them; the more colons, the longer the stretching. For an explanation of other symbols, see Sacks et al. (1974) and Sidnell (2009a).

Notice that Rashida Jones also repairs her answer “it’s pretty great” by means of what Schegloff (1997) describes as “third turn repair”. So when Rashida Jones says, “It’s-uhm- it-I just mean it-ek- experientially for me it’s pr(h)etty gr(h)ea(h)t(h)”, she is speaking after Ellen has responded to her initial answer (with “=mm mhm.” at line 5). However, Ellen’s response here, unlike Rashida’s at line 04, does not reveal a problematic understanding of the prior turn and thus does not prompt the repair that Rashida produces. In that respect, instances of third turn repair are more akin to transition space repair (such as Ellen’s “The two of you.=” in line 13) than they are to instances of third position repair.

Deb also prefaces the answer to the question with “Oh” which can mark the preceding question as inapposite (see Heritage 1998 ).

To see this, consider that Jerry need not have indicated the length of time at all.

In the example from Japanese the following abbreviations are used:

lk : linking particle

fp : final particle

q : question particle





  • Open access
  • Published: 26 February 2024

Prosody in linguistic journals: a bibliometric analysis

  • Mengzhu Yan 1 &
  • Xue Wu   ORCID: orcid.org/0000-0001-6454-4208 1  

Humanities and Social Sciences Communications, volume 11, Article number: 311 (2024)


The present study provides a systematic review of prosody research in linguistic journals through a bibliometric analysis. Using the bibliographic data from 2001 to 2021 in key linguistic journals that publish prosody-related research, this study adopted co-citation analysis and keyword analysis to investigate the state of the intellectual structure and the emerging trends of research on prosody in linguistics over the past 21 years. Additionally, this study identified the highly cited authors, articles and journals in the field of prosody. The results offer a better understanding of how research in this area has evolved and where the boundaries of prosody research might be pushed in the future.


Introduction

Prosody, often referred to as the music of speech, is defined as the organizational structure of speech, including linguistic functions such as tone, intonation, stress, and rhythm (Gussenhoven and Chen, 2020 ; Ladd, 2008). It has been well-established that prosody plays a key role in sentence processing in both L1 (first or native language) and L2 (second or non-native languages), including lexical activation and segmentation (e.g., Cutler and Butterfield, 1992 ; Cutler and Norris, 1988 ; Norris et al., 2006 ; Sanders et al., 2002 ), syntactic parsing (e.g., Cole et al., 2010a ; Frazier et al., 2006 ; Hwang and Schafer, 2006 ; Ip and Cutler, 2022 ; Lee and Watson, 2011 ; O’Brien et al., 2014 ; Roncaglia-Denissen et al., 2014 ; Schafer et al., 2000 ), information structure marking (e.g., Birch and Clifton, 1995 ; Breen et al., 2010 ; Calhoun, 2010 ; Clopper and Tonhauser, 2013 ; Katz and Selkirk, 2011 ; Kügler and Calhoun, 2020 ; Namjoshi and Tremblay, 2014 ; Steedman, 2000 ; Welby, 2003 ; Xu, 1999 ), and pragmatic information signaling such as speech attitudes, acts and emotion (e.g., Braun et al., 2019 ; Lin et al., 2020 ; Pell et al., 2011 ; Prieto, 2015 ; Repp, 2020 ).

Prosody has been investigated extensively, given its significant status in language processing and its highly interdisciplinary nature involving linguistics, psychology, cognitive science, and computer science, beginning with two early reviews: Shattuck-Hufnagel and Turk (1996) and Cutler et al. (1997). A decade later, further publications provided comprehensive, state-of-the-art overviews of the theoretical and experimental advances in prosody (Cole, 2015; Wagner and Watson, 2010). However, to the best of our knowledge, no bibliometric overview of prosody research has been conducted to offer a better understanding of how research in this area has evolved and where its boundaries might be pushed in the future.

The present study used a bibliometric approach, which was initially used in library and information sciences for the analysis and classification of bibliographic material by sketching representative summaries of the extant literature (Broadus, 1987; Pritchard, 1969). Based on a large volume of bibliometric information, the mathematical and statistical methods of bibliometric analysis make it possible to extract patterns that reveal the characteristics of publications in a specific discipline. In addition, with the assistance of network mapping techniques, bibliometric analysis can also be used to visualize the intellectual structure of a specific research topic or field. In this study, we have used such an approach to perform co-citation analysis and keyword analysis on publications about prosody in linguistic journals. Co-citation is a measure that gauges the connection between frequently referenced documents, with the intensity of co-citation determined by the frequency at which two documents have been jointly cited (Small and Sweeney, 1985). Co-citation analysis is important in bibliometric studies because “co-citation identifies relationships between papers which are regarded as important by authors in the specialty, but which are not identified by such techniques as bibliographic coupling or direct citation” (Small and Sweeney, 1985, p. 19). Keyword analysis involves comparing the frequency of keywords across different periods to identify significant changes in key topics, which is helpful in predicting the emerging trends of a research field (e.g., Lee, 2023; Lei and Liu, 2019a; Zhang, 2019).
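The co-citation measure described here can be sketched as a simple pair count over the reference lists of citing papers. This is an illustrative reconstruction under stated assumptions, not the authors' actual pipeline; the reference lists below are invented:

```python
from itertools import combinations
from collections import Counter

def co_citation_counts(reference_lists):
    """Count how often each pair of references is cited together.

    reference_lists: one list of cited references per citing paper.
    Returns a Counter mapping (ref_a, ref_b) tuples (in alphabetical
    order) to the number of papers that cite both references.
    """
    counts = Counter()
    for refs in reference_lists:
        # Every unordered pair of references within one paper's
        # bibliography contributes one co-citation.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts

# Invented reference lists for three hypothetical citing papers.
papers = [
    ["Ladd 2008", "Selkirk 1984", "Cutler 1997"],
    ["Ladd 2008", "Selkirk 1984"],
    ["Ladd 2008", "Cutler 1997"],
]
print(co_citation_counts(papers)[("Ladd 2008", "Selkirk 1984")])  # 2
```

Network mapping tools such as VOSviewer then build a graph whose edge weights are exactly these pair counts, and cluster nodes that are densely co-cited.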

Bibliometric analysis has been widely used in different areas of linguistics. For instance, Zhang (2019) used this method to examine the field of second language acquisition (SLA); Lei and Liu (2019a) presented a bibliometric analysis of research trends in applied linguistics from 2005 to 2016; and Fu et al. (2021) employed this approach to analyze the evolution of the visual world recognition literature between 1999 and 2018. Since no bibliometric analysis has been conducted on prosody in linguistics, the present study takes a bibliometric approach to describe the intellectual structure and the emerging trends of research on prosody in linguistics. The following research questions are addressed:

What is the research productivity of linguistic journals on prosody?

What is the intellectual structure in the field of prosody in terms of influential authors, references, and venues of publications?

What are the research trends of works on prosody in linguistics?

Methodology

The bibliometric data used in this study were retrieved from Web of Science (henceforth WoS) on 14 June 2022. There are three reasons why we used the database of WoS. First, WoS is a more widely used library resource than other databases such as Scopus and Google Scholar. For instance, the number of subscribers of WoS is two times larger than that of Scopus (Zhang, 2019 ). Second, only academic citations are provided by WoS. That is, compared to databases such as Google Scholar which provides mixed information of both academic and non-academic citations, WoS is more appropriate for calculating the scholarly values of the publications. Third, the availability of co-citation information in WoS makes it possible to conduct co-citation analysis which is one of the important bibliometric methods used in this study. As for the search terms, the present study used a combination of “prosod*” (In regular expressions, the asterisk (*) is used as a quantifier that specifies “zero or more” occurrences of the preceding character or group.), “autosegmental-metrical”, “metrical structure”, “accent”, “intonation*”, “stress”, “suprasegment*”, “F0” (fundamental frequency), “rhythm”, and “pitch” as the search query, and the Boolean OR operator was used to separate these terms. Moreover, the Boolean NOT operator was used to exclude research on “semantic prosody” Footnote 1 . The timeframe for the search was from January 2001 to December 2021.
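The Boolean query described above can be assembled programmatically. The `TS=` (topic) field tag and the exact quoting are assumptions about Web of Science advanced-search syntax, not details taken from the paper; only the term list and the OR/NOT operators come from the text:

```python
# Search terms listed in the text; the multi-word term is quoted so it
# is matched as a phrase.
include_terms = [
    "prosod*", "autosegmental-metrical", '"metrical structure"', "accent",
    "intonation*", "stress", "suprasegment*", "F0", "rhythm", "pitch",
]

# Join with the Boolean OR operator and exclude "semantic prosody"
# with NOT, as described in the text.
query = "TS=(" + " OR ".join(include_terms) + ') NOT TS=("semantic prosody")'
print(query)
```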

Since the present study focuses on prosody research in linguistic journals, only English research articles (excluding book reviews, editorial reviews, etc.) published in high-quality journals in the field of linguistics were included. Only published research articles were included, to guarantee the quality and reliability of the publications under a strict quality-control mechanism such as peer review (Zhu and Lei, 2021). As for the selection of high-quality journals, the current study chose SSCI-indexed Footnote 2 international journals in the field of linguistics for two reasons. First, those journals have rigorous peer-review processes. Second, most SSCI-indexed journals are accessible to worldwide academia. More than 200 SSCI-indexed journals in the field of linguistics have published research articles on prosody. However, some of those journals published fewer than three prosody-related articles in the past 21 years. A cut-off point of 30 articles per journal over the past 21 years was set to ensure that the majority of linguistic journals publishing research on prosody were included for analysis. This cut-off point was set for two reasons. First, with this criterion, publications in the included journals cover more than 70% of the total publications. Second, the choice of 30 papers (as a rule of thumb) ensured a sufficient number of data points for robust statistical analysis and a focus on journals with a more substantial presence in the field of prosody. The list of journals included for and excluded from analysis in the present study can be found in the supplementary data.

Data cleaning

To avoid coding errors, data cleaning was performed using the measures proposed by Zhang (2019). Specifically, if different author names were used to refer to the same author, they were recoded to one unique version. For instance, “Elisabeth O. Selkirk”, “Selkirk, E. O.”, “Selkirk, EO”, “Selkirk, E.”, and “Selkirk, E” were all recoded as “Selkirk, E.”. Similarly, different keywords used to refer to the same concept were also recoded. For instance, “Event-related Potential”, “Event-related Potential (ERP)”, and “ERP” were all recoded as “Event-related Potential”. In addition, singular and plural forms of the same concept were identified and recoded as one. For instance, “boundary tone” and “boundary tones” refer to the same concept, hence all instances of “boundary tones” were recoded to “boundary tone”. However, keywords that merely share a degree of similarity were not recoded as the same, since their meanings can differ. For instance, “bilinguals” and “bilingualism” were kept as separate keywords, since the former refers to people who speak two or more languages while the latter refers to an ability of individuals or a characteristic of a community.
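The recoding step can be implemented as a lookup table from observed variants to canonical forms. The tables below contain only the examples given in the text; a real cleaning pass would of course be far larger:

```python
# Canonical forms for author-name variants (examples from the text).
AUTHOR_CANON = {
    "Elisabeth O. Selkirk": "Selkirk, E.",
    "Selkirk, E. O.": "Selkirk, E.",
    "Selkirk, EO": "Selkirk, E.",
    "Selkirk, E": "Selkirk, E.",
}

# Canonical forms for keyword variants and plurals (examples from the text).
KEYWORD_CANON = {
    "Event-related Potential (ERP)": "Event-related Potential",
    "ERP": "Event-related Potential",
    "boundary tones": "boundary tone",
}

def recode(value, table):
    """Map a raw string to its canonical form; unknown values pass through."""
    return table.get(value, value)

print(recode("Selkirk, EO", AUTHOR_CANON))    # Selkirk, E.
print(recode("bilingualism", KEYWORD_CANON))  # bilingualism (kept distinct)
```

Keeping unknown values unchanged matters here: similar-but-distinct keywords such as “bilinguals” and “bilingualism” simply have no entry in the table and therefore survive unmerged.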

Data analysis

In this study, co-citation analysis and keyword analysis were performed. The data, which spanned 21 years, were divided into three periods (i.e. the 2001–2007 period, the 2008–2014 period, and the 2015–2021 period), and the results of the two forms of bibliometric analysis in the three periods were compared with each other to reveal important changes over the last 21 years.

Co-citation analysis assumes that if publications are frequently cited together, they probably share similar themes (Hjørland, 2013). This technique has frequently been used in previous bibliometric studies to reveal the intellectual structure of a particular research field (Rossetto et al., 2018; Zhang, 2019). Based on the references (i.e. papers that are cited by the publications retrieved for the present study) in the surveyed articles, the co-citation network clusters two publications together if they are co-cited by a third publication. The greatest strength of co-citation analysis is that, apart from identifying the most influential authors, references, and venues of publication, it is also capable of discovering thematic clusters. It should be noted that clusters in the present bibliometric study are groups or sets of closely related items; co-cited items fall into the same cluster through the clustering techniques implemented in VOSviewer (detailed information can be found in van Eck and Waltman, 2010).

The prosody-related articles published in linguistic journals between 2001 and 2021 cited more than 50,000 unique references. It would be impossible to interpret such a massive number of nodes if all the cited references were included in a network map. Therefore, when constructing the network maps in VOSviewer (van Eck and Waltman, 2017), we set a cut-off point at values that include the top 50 most-cited items in each map to restrict the number of nodes, following Zhang (2019).

Keyword analysis was used to identify important topics in the publications retrieved for the present study in each period, and a cross-period comparison of the frequencies of those topics was conducted to determine whether significant diachronic changes existed. The following four steps were used to conduct the keyword analysis. First, author-supplied keywords and keywords-from-abstracts Footnote 3 were retrieved from each article. Keywords extracted from abstracts were used to augment the information provided by author-supplied keywords, compensating for papers that lacked specified keywords or whose authors furnished only a restricted set. The approach for extracting these keywords from abstracts was adapted from the methodology outlined in Zhang (2019). It is crucial to underscore that the keywords extracted from abstracts serve as supplementary additions to the author-provided keywords. Second, the raw frequency of each keyword was computed. Third, the raw frequencies of each topic were normalized for the statistical test in the next step. Normalized frequencies were a prerequisite for a valid comparison, since there were substantial differences in the number of publications across the three periods. We adopted the method proposed by Lei and Liu (2019b) for the normalization: for example, the normalized frequency of an author-supplied keyword in a period was calculated using the formula normalized frequency = (raw frequency in that period / total number of publications in that period) * 10,000. Last, a one-way chi-square test of the normalized frequencies of each topic in the three periods was conducted to identify significant cross-period differences between the research topics.
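The normalization and cross-period comparison can be sketched as follows. The raw keyword counts and period totals below are invented for illustration; only the formula itself, (raw frequency / total publications in the period) * 10,000, comes from the text:

```python
def normalized_frequency(raw, total_pubs):
    """Lei and Liu (2019b) normalization: occurrences per 10,000 papers."""
    return raw / total_pubs * 10_000

def chi_square(observed):
    """One-way chi-square statistic against a uniform expected distribution."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical raw counts of one keyword in the three periods, and
# hypothetical total publication counts per period.
raw = [12, 30, 75]
totals = [900, 1500, 2100]

norm = [normalized_frequency(r, t) for r, t in zip(raw, totals)]
print([round(n, 1) for n in norm])  # [133.3, 200.0, 357.1]
print(round(chi_square(norm), 2))
```

A statistic exceeding the chi-square critical value (df = 2 for three periods) would indicate a significant cross-period change for that keyword.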

Results and discussion

In this section, we first present information about the productivity, authors, and affiliations of the retrieved publications, followed by a co-citation analysis that visualizes the intellectual structure in terms of influential publications, references, and authors, and finally a keyword analysis that identifies prominent topics in the field.

Annual volume of publications, authors, and their affiliations

A total of 4598 publications on prosody in the SSCI-indexed key linguistic journals were retrieved. Figure 1 shows the annual productivity of prosody articles in linguistic journals. From 2001 onward, publications exhibited an upward trend and have remained above 300 per year since 2019 (see Fig. 1 ), despite a dip in 2020, likely due to the COVID-19 pandemic, which slowed research and publication processes across the board.

figure 1

Distribution of prosody articles from 2001 to 2021.

Table 1 shows the top 25 most prolific authors, regions, and institutions (authors’ affiliations) associated with the publications. In terms of authors, 58 published ten or more articles, whereas 5279 authors published a single article; the former account for only 0.81% of all authors. Regarding regions, the USA, Germany, England, Canada, the Netherlands, Australia, France, China, Spain, and Japan (in descending order of involvement in prosody-related publications) made up the top ten, each associated with more than 100 publications. As for affiliations, the top five most prolific institutions each published more than 100 articles. Unsurprisingly, the most prolific authors were strongly associated with the prolific regions and institutions.

Co-citation analysis and network mapping

Top-cited sources of publications.

Figures 2 – 4 show the network maps of the top 50 most-cited sources in the three periods (2001–2007, 2008–2014, 2015–2021), respectively, constructed using the smart local moving algorithm (Waltman and van Eck, 2013 ). The term “sources” denotes the academic journals or books in which the references were published. A density view accompanies each map to illustrate the most-cited sources of publications in the respective period. The network maps show the major clusters of the top 50 most-cited sources in the three periods.

figure 2

Network map of the most-cited sources of publications (2001‒2007).

figure 3

Network map of the most-cited sources of publications (2008‒2014).

figure 4

Network map of the most-cited sources of publications (2015‒2021).

Firstly, it is important to note that, as shown in Table 2 , the Journal of the Acoustical Society of America and the Journal of Phonetics have remained the top two most-cited journals across the three periods. Citations to these two journals have increased sharply (2329 citations in the first period, 6249 in the second, and 9978 in the third). Five other journals (i.e., Journal of Memory and Language , Journal of Speech, Language, and Hearing Research , Language and Speech , Cognition , and Phonetica ) have remained among the top 10 most-cited sources throughout. These journals publish work on the production and comprehension of language and speech (including prosody), and thus serve as valuable venues for novice researchers pursuing research in this area. It is also worth noting that ‘thesis’ is one of the most-cited source categories, probably because certain doctoral theses (e.g., Rooth, 1985 ) by influential experts on prosody have received substantial and continuous attention over the years.

These network maps indicate not only sub-areas of prosody research but also an interesting merging and splitting of research areas across the three periods. The first period (Fig. 2 ) shows five major clusters Footnote 4 representing five main areas in prosody. From left to right, the first relates to linguistic investigation (e.g., Journal of Phonetics , Journal of the Acoustical Society of America , Language and Speech ); the second, small cluster at the top relates to L2 learning (e.g., TESOL Quarterly , Language Learning ); the third cluster at the bottom concerns psycholinguistic aspects (e.g., Journal of Memory and Language , Cognition , Language and Cognitive Processes ); the fourth, widely spread cluster at the top right of the map concerns language development and language disorders (e.g., Journal of Child Language and Journal of Speech and Hearing Research ); and the last cluster, at the bottom right, represents neurolinguistic research on prosody (e.g., Brain and Language , Neuropsychologia ). The clusters in the second (Fig. 3 ) and third periods (Fig. 4 ) are similar to those in the first period (Fig. 2 ). However, several changes are worth noting. For example, the second period witnessed a merger of the psycholinguistic and neurolinguistic journals (top), which then became the largest cluster of all, dominating the whole map. In addition, the third period again separated the experimental approach from the formal/theoretical approach (e.g., Laboratory Phonology vs. Linguistic Inquiry ). Consistent with the increase in L2-related references and authors identified in the sections below, the L2 prosody cluster of the network map expanded slightly from 2001–2007 (top of Fig. 2 ) to 2015–2021 (top of Fig. 4 ).

The highly influential references

Figures 5 – 7 show the network maps of the top 50 most-cited references in the three periods (2001–2007, 2008–2014, and 2015–2021), respectively. The network map in Fig. 5 shows four major clusters of the top 50 (out of 24,383) most-cited references, each cited 16 times or more between 2001 and 2007. Figure 6 shows five clusters of the top 50 (out of 51,107) cited references, each cited 37 times or more. Figure 7 shows four clusters of the top 50 (out of 82,660) cited references, each cited at least 46 times.

figure 5

Network map of the most-cited references (2001‒2007).

figure 6

Network map of the most-cited references (2008‒2014).

figure 7

Network map of the most-cited references (2015‒2021).

Fifteen references appeared on the top 50 list across all three periods, with Beckman and Pierrehumbert ( 1986 ), Chomsky and Halle ( 1968 ), Hayes ( 1995 ), Ladd ( 1996 ; 2008 ), Pierrehumbert ( 1990 ), Selkirk ( 1984 ), and Nespor and Vogel ( 1986 ) remaining in the top 20 throughout (the top 10 references are listed in Table 3 ). Five publications by Cutler and colleagues ( 1986 ; 1987 ; 1988 ; 1992 ; 1997 ) and four publications by Ladd ( 1996 , 2008; Ladd et al., 1999 , 2000 ) were on the list. Two clusters led by Ladd ( 1996 / 2008 ) Footnote 5 and Nespor and Vogel ( 1986 ), representing intonational phonology and prosodic phonology, were among the top three most-cited references between 2001 and 2007 and between 2008 and 2014, and continued to be popularly cited between 2015 and 2021, ranking 4th (Ladd) and 8th (Nespor and Vogel, 1986 ). Another important reference in the same cluster as Ladd across the three periods is Pierrehumbert and Beckman ( 1988 ), whose work is associated with the “Autosegmental-Metrical” (AM) approach that describes prosody on autonomous tiers for metrical structure and tones. The same cluster in the third period (Fig. 7 ) also covered publications on information structure (e.g., Rooth, 1992 ) and the use of prosody in marking information structure (Breen et al., 2010 ). These publications made their first appearance on the top 50 list only between 2015 and 2021, possibly owing to the recent interest in the acoustic realization of focus as well as in testing, in a range of languages, the Roothian theory that focus indicates the presence of alternatives relevant for the interpretation of discourse (e.g., Braun et al., 2019 ; Braun and Tagliapietra, 2010 ; Gotzner, 2017 ; Repp and Spalek, 2021 ; Spalek et al., 2014 ; Tjuka et al., 2020 ; Yan and Calhoun, 2019 ; Yan et al., 2023 ).

The largest cluster is located on the right side of the 2015–2021 map (Fig. 7 ) and appears to be the only cluster that emerged in the last period, representing statistical methods and tools such as mixed-effects modeling with crossed random effects for subjects and items (Baayen et al., 2008 ) and logit mixed models (Jaeger, 2008 ) Footnote 6 . This newly emerged cluster indicates the importance of applying state-of-the-art statistics in prosody research. These two references have been influential in motivating researchers, especially psycholinguists and cognitive psychologists, to switch from ANOVA to mixed-effects model (MEM) analysis, with the latter now being the dominant type of analysis. Some of the most-cited references in this cluster concern tools commonly used in prosodic research and analysis, such as R (R Core Team, 2017 ) and Praat (Boersma and Weenink, 2018 ) Footnote 7 . Others focus on model-fitting procedures, e.g., parsimonious mixed models (Bates et al., 2015 ) and ‘maximal’ models (Barr et al., 2013 ). Although Baayen et al. ( 2008 ) was already cited 56 times between 2008 and 2014, ranking 14th, its citations more than doubled to 118 in the recent period, ranking 5th between 2015 and 2021, with Bates et al. ( 2015 ) and Barr et al. ( 2013 ) ranking second and third with 287 and 177 citations, respectively.

The most influential authors

Table 4 shows the top 50 most-cited authors across the three periods. Unsurprisingly, some authors of the most-cited references discussed in the previous section are also among the most-cited authors overall (e.g., D.R. Ladd, M. Beckman, J. Pierrehumbert, E. Selkirk). Twenty-one of the top 50 authors remained highly influential across the three periods, among whom nine were in the top 20 in all three periods (i.e., A. Cutler, J. Flege, C. Gussenhoven, D.R. Ladd, M.J. Munro, J. Pierrehumbert, E. Selkirk, L.D. Shriberg, Y. Xu). A. Cutler and J. Flege remained among the top five most-cited authors across all three periods.

With the trend of applying mixed-effects models in R in prosody research, the authors of Bates et al. ( 2015 ), Barr et al. ( 2013 ), Baayen et al. ( 2008 ), and R Core Team ( 2017 ) moved onto the most-cited authors list in the third period, i.e., 2015–2021 (see the middle cluster in Fig. 10 , the network map of the most-cited authors). Among the many researchers who became influential authors, S.A. Jun joined the bottom right cluster (see Fig. 10 ), and the other influential authors in this cluster remained highly cited across the three periods (i.e., Y. Xu, L.D. Shriberg, J. Pierrehumbert, C. Gussenhoven, D.R. Ladd, P. Prieto, A. Arvaniti, M. Beckman, and E. Grabe). Notably, at the bottom of the 2001–2007 map (Fig. 8 , the network map of the most-cited authors) is the smallest cluster, represented by J. Flege and M.J. Munro, most likely the L2 prosody cluster. This cluster expanded continuously across the three periods (left side of the 2008–2014 map, Fig. 9 , and of the 2015–2021 map, Fig. 10 ) and was joined by other researchers in similar fields, such as T.M. Derwing, P.K. Kuhl, and K. Saito. It is interesting to note that some researchers are notably prolific within specific areas, such as Flege and Munro in the realm of L2 prosody, while others, like Ladd and Pierrehumbert, hold influence across the broader spectrum of the field. This divergence can probably be detected through cluster analysis: for instance, the former have citations concentrated within a single cluster, while the latter are cited across multiple clusters (see Fig. 10 ).

figure 8

Network map of the most-cited authors (2001‒2007).

figure 9

Network map of the most-cited authors (2008–2014).

figure 10

Network map of the most-cited authors (2015–2021).

Keyword analysis

Keywords in the retrieved publications whose number of occurrences across the three periods was equal to or greater than 10 were submitted to chi-square analysis to test for significant changes across the periods. This yielded a total of 207 author-supplied keywords (out of 7269, 2.85%) and 37 keywords-from-abstracts (out of 821, 4.50%). The cutoff of 10 was chosen because the p -values of nearly all keywords with frequencies below 10 were larger than 0.05, indicating that the frequencies of these keywords remained stable across all three stages and did not undergo significant changes.
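The frequency cutoff amounts to a simple filter applied before testing. The sketch below uses toy data and reads “occurrences” as the total across the three periods, which is one plausible interpretation of the criterion above:

```python
def select_for_testing(keyword_counts, cutoff=10):
    """Keep keywords whose occurrences, summed over the periods,
    reach the cutoff; only these are submitted to the chi-square test."""
    return {kw: counts for kw, counts in keyword_counts.items()
            if sum(counts) >= cutoff}

# Toy keyword counts per period (invented for illustration)
counts = {
    "intonation": [40, 60, 80],   # well above the cutoff
    "tone sandhi": [0, 5, 9],     # 14 in total, passes
    "rhotics": [1, 2, 3],         # 6 in total, filtered out
}
print(sorted(select_for_testing(counts)))  # prints ['intonation', 'tone sandhi']
```

Filtering first keeps the chi-square tests away from sparse counts, whose expected frequencies would be too small for the test to be reliable.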

The results revealed that 61 keywords experienced a significant change in frequency ( p  < 0.05), whereas the remaining keywords showed no significant change ( p  ≥ 0.05) across the three periods. The keyword analysis thus uncovered important research trends in the field of prosody over the past 21 years. Firstly, it is unsurprising that the ten most frequent author-supplied keywords (see Table 5 ) are closely related to (1) the concept of prosody (including prosody itself, intonation , phonology , stress , and accent / accents ), (2) the two main aspects of the investigation of prosody (i.e., speech perception and speech production ), (3) the notion in information structure (i.e., focus ) that is usually signaled by prosody and widely studied by prosody researchers, (4) the language that is probably most commonly investigated (i.e., English ), and (5) bilingualism , which appears to be widely researched, especially from the second period (2008 onward). Notably, in Table 5 bilingualism was the only keyword on the top ten list whose frequency increased throughout the three periods, indicating the growing significance of bilingual prosody research. Seven of the ten topics remained among the most discussed throughout, while the other two displayed a downward trend. Possible reasons for these trends are discussed further below.

As mentioned above, the biggest group in the keyword analysis contains keywords whose normed frequencies remained unchanged across the three periods, suggesting sustained attention to these topics. One important finding is that areas closely related to prosody, such as syntax (total count of author-supplied keywords across the three periods: 50), morphology (53), lexical stress (52), and conversation analysis (59), were frequently discussed (≥30) research topics across the three periods. This suggests these areas have received constant attention in prosody research, given the importance of prosody in them (Cole et al., 2010b ; Fodor, 1998 ; Harley et al., 1995 ; Kjelgaard and Speer, 1999 ; Pratt, 2018 ; Pratt and Fernandez, 2016 ; Selkirk, 2011 , 1984 ). Another key point is that keywords such as English , French , Dutch , Mandarin Chinese , and Japanese denote languages in which researchers in the field have maintained interest throughout the history of prosody research. Among these, English occurs most frequently, probably owing to the importance of prosody in language comprehension in English (as reviewed in Calhoun et al., 2023 ; Cole, 2015 ; Cutler et al., 1997 ) as well as its status as a lingua franca, which yields large numbers of both L1 and L2 speakers. Other languages such as Mandarin Chinese (a tonal language) and Japanese (a pitch-accent language) were also frequently investigated, owing to their typical prosodic features and their large numbers of speakers (see Kügler and Calhoun, 2020 ). More importantly, intonation , fundamental frequency ( f0 ), accent ( s ), pitch accent Footnote 8 , stress , and tone , which are expected to be key topics in prosody research, have indeed been the most discussed throughout and continue to be the focus of prosody research.

We now turn to the keywords that experienced a significant change, which can be further divided into three groups; a sample of each group is provided in Table 6 . Group 1 displays a general increase across the three periods; Group 2 a rise across the first two periods followed by a fall in the third (although all normed frequencies in the last period remained higher than in the first); and Group 3 a general decrease. Group 1 concerns topics that involve a second language or more than one language: bilingualism , second language ( L2 ), second language acquisition , foreign accent ( s ), and cross-linguistic influence ( CLI ), suggesting that studying prosody in L2 or multilingual speakers beyond their native languages gained popularity across the three periods. This is probably due to the introduction of the L2 intonation learning theory (Mennen, 2015 ), which has proven a useful model for predicting the difficulties L2 learners encounter, based on intonation differences between learners’ L1 and L2. Group 1 also contains topics that may reflect newly developed directions in the last two decades: language attitudes , voice onset time ( VOT ), sound change , tone sandhi ( TS ), and syntax-phonology interface . Electroencephalography ( EEG ) is also in Group 1, indicating its rising importance in prosody research and the field’s growing connection to neuroscience in investigating brain activity in response to prosodic stimuli. This reflects the increasingly interdisciplinary nature of prosody research.

While some topics gained attention across the three periods, others experienced a drop from the second to the third period, following a rise from the first to the second (Group 2). Representative topics, in descending order of frequency, are gesture , aphasia , language contact ( s ), and Cantonese . Many of these rose from no occurrence in the first period and maintained 10 or more occurrences in the second and third periods. It is expected that gesture , as a key visual cue, became more popular in the second period, probably because research in this area was boosted by the publication of the special issue Audiovisual Prosody edited by Krahmer and Swerts ( 2009 ); the special issue seems to have had a lasting effect on this topic, as it was still more frequent in the third period than in the first. Further, aphasia , a communicative disorder resulting from brain damage, was one of the topics receiving less attention from the second to the third period. Inspection of the entire keyword list shows that in the third period, aphasia appeared in a number of other forms: primary progressive aphasia , aphasia rehabilitation , aphasia severity , deep dysphasia , fluent aphasia , and music and aphasia .

A number of topics seem to have become less interesting to prosody researchers, exhibiting a decreasing trend in frequency (Group 3). Although prosody and phonology were highly frequent across the three periods, their normed frequencies nevertheless showed a downward trend. At first glance it may seem implausible that these two became less important; it is more likely that prosody and phonology were replaced as keywords by more specific terms such as F0 , pitch , or stress , which were preferred in later empirical investigations of prosody. The frequencies of syllable and syllable structure decreased, possibly for similar reasons: these terms are relatively general and may have been substituted by more specific keywords such as onset , coda , or rhyme , which are used in linguistic analysis to describe the components and organization of syllables.

Conclusion and implications

The present study has provided a systematic review of prosody research in linguistic journals from 2001 to 2021 through a bibliometric analysis. Several significant implications for prosody research emerge from the key findings. First, our results show a general rise in prosody-related publications over the past two decades, reflecting the topic’s increased significance in broader linguistic research. Second, the co-citation analysis identified the most-cited authors, references, and journals, providing valuable information for scholars, especially novice researchers in the prosody field, on where the most influential prosody research can be found, who is doing that research, and what areas it covers. Another finding worth noting is that prosody research witnessed a significant increase in the use of statistical methods, especially mixed-effects models, in the two later periods (2008–2014, 2015–2021) compared with the first (2001–2007). This increase is likely due to the influential special issue Emerging Data Analysis published in the Journal of Memory and Language in 2008. It is therefore reasonable to argue that, as a unique mode of communication in academia, special issues are effective in highlighting essential or emerging research topics in a specific discipline.

Additionally, the findings of the present bibliometric analysis shed light on research trends in prosody. For example, they reveal that intonation , stress , and accent remained the most-discussed topics across the three periods, given their high relevance to prosody. It is also unsurprising that speech perception and production are among the most-discussed topics. Other trends emerged from comparing the normed frequencies across the three periods. For instance, bilingualism gained popularity as a research topic from the second period onward, showing researchers’ increased focus on it as more people become bilingual or multilingual through globalization. Meanwhile, some languages (e.g., English, Chinese, and Japanese) have remained the most researched. The prevalence of English and Chinese might be partially attributed to their extensive speaker and learner bases and the extensive existing literature on their prosody, while the prominence of Japanese in prosody research could be attributed to the pioneering contributions of Pierrehumbert and Beckman.

The bibliometrics-based method has gained popularity in the past decade as a way to offer systematic reviews of research trends in many fields (e.g., Fu et al., 2021 ; Lei and Liu, 2019a ; Wu, 2022 ; Zhang, 2019 ). Although this quantitative analysis is based on a substantial number of research papers and reveals developmental patterns of a research topic across different periods, the present study has some limitations. First, as this paper aims to review a large number of prosody-related studies and identify major trends in prosody research, we acknowledge that the literature search cannot guarantee coverage of every piece of relevant literature, owing to the selection of search terms and the authors’ choices of keywords in their publications. In particular, our choice of search terms may not encapsulate the entire landscape of prosody-related concepts. For instance, concepts such as “duration” and “emphasis” play pivotal roles in prosody analysis but were not used as search terms because they might have led to an overly broad search with many irrelevant results; nevertheless, they appeared in our list of keywords, given their relevance to prosody. Future studies could explore a broader spectrum of prosodic elements, thereby further advancing the field of prosody research.

Another possible limitation concerns the sources of the prosody-related articles: some prominent journals were excluded from the publication or keyword analysis because they did not meet the criteria of the filtering process in the present study. For example, the Journal of the Acoustical Society of America is a common venue for prosodists to publish high-quality research, Frontiers in Psychology publishes extensively on linguistics, and Linguistic Inquiry might be a major source of citations. It should be noted that such journals do appear in the co-citation analysis if they are frequently cited, and their inclusion in the publication/keyword analyses of future studies might be beneficial. Additionally, given the quantitative nature of the study, a more detailed qualitative analysis is needed to complement it; given the space limitations of the paper, it is not possible to delve into every aspect of the significant trends observed in prosody research.

Furthermore, although the qualitative analysis of the research trends was supported by quantitative data, some degree of subjectivity was still involved. Therefore, the interpretations of research trends in our paper need to be confirmed or substantiated by other experts on prosody. Bibliometric reviews are best read together with traditional reviews to gain a fuller picture of research on prosody.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Semantic prosody refers to the phenomenon in linguistics where certain words or phrases evoke a specific positive or negative connotation due to their consistent co-occurrence with particular words or in certain contexts (Hunston, 2007 ; Omidian and Siyanova‐Chanturia, 2020 ). This term was deliberately excluded from the search since our research focuses on speech prosody rather than the connotation of words in the context that ‘semantic prosody’ refers to.

It should be noted that, for practical purposes, not all prestigious journals could be included. We therefore chose to gather data from SSCI journals, as they are generally more relevant to the field of linguistics than SCI journals.

Keywords-from-abstracts are nouns and noun phrases extracted from abstracts. There were three steps in extracting keywords from abstracts. First, n-grams (up to four) of nouns and noun phrases were extracted from the POS-tagged abstract in each article. During this step, instances of an n-gram appearing multiple times within an abstract were consolidated into a single occurrence to prevent an overabundance of counts. Second, the authors manually checked and filtered the n-grams to identify keywords. Third, the keywords identified in the preceding step were refined through the removal of duplicated items found in the author-supplied keywords.
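The first extraction step can be sketched as follows. This is a simplified illustration operating on (token, POS-tag) pairs; the tag set and helper names are our assumptions, not the authors’ code, and the manual filtering of the second step is not modeled.

```python
def noun_ngrams(tagged_tokens, max_n=4):
    """Collect n-grams (n <= max_n) whose tokens are all nouns from one
    POS-tagged abstract; a set collapses repeated n-grams within the
    abstract into a single occurrence, as described above."""
    def is_noun(tag):
        return tag.startswith("NN")  # Penn Treebank noun tags (assumption)

    grams = set()
    for i in range(len(tagged_tokens)):
        for n in range(1, max_n + 1):
            window = tagged_tokens[i:i + n]
            if len(window) == n and all(is_noun(tag) for _, tag in window):
                grams.add(" ".join(word for word, _ in window))
    return grams

# Toy POS-tagged fragment; the repeated "speech prosody" counts once
tagged = [("speech", "NN"), ("prosody", "NN"), ("is", "VBZ"),
          ("speech", "NN"), ("prosody", "NN")]
print(sorted(noun_ngrams(tagged)))  # prints ['prosody', 'speech', 'speech prosody']
```

The candidates produced this way would then be manually checked and de-duplicated against the author-supplied keywords, per the second and third steps above.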

It is important to note that the identification of common themes within journals does not imply that the journals confined to a particular cluster exclusively cover those themes. Journals may also publish content that is relevant to other thematic clusters. For instance, research focusing on the second language (L2) acquisition of prosody through psycholinguistic methods could potentially find its place in more than one thematic cluster, reflecting the interdisciplinary nature of prosodic research and its capacity to contribute to multiple areas of study.

Ladd ( 2008 ) is the second edition of Ladd ( 1996 ).

These two seminal works were published in the special issue Emerging Data Analysis of the Journal of Memory and Language , edited by Kenneth I. Forster and Michael E.J. Masson and dedicated to mixed-effects models (MEMs).

It is important to note that Praat exists in many versions; we amalgamated all versions in the analysis and cite the 2018 version here, as it received the most citations.

As the terms “accent(s)” (as used in L2 accent or foreign accent ) and “pitch accent” (as used in the AM model) may denote different concepts, they were listed separately.

Baayen RH, Davidson DJ, Bates DM (2008) Mixed-effects modeling with crossed random effects for subjects and items. J Mem Lang 59(4):390–412

Barr DJ, Levy R, Scheepers C, Tily HJ (2013) Random effects structure for confirmatory hypothesis testing: keep it maximal. J Mem Lang 68(3):255–278

Bates D, Mächler M, Bolker B, Walker S (2015) Fitting linear mixed-effects models using Lme4. J Stat Softw 67(1):1–48. https://doi.org/10.18637/jss.v067.i01

Beckman ME, Pierrehumbert JB (1986) Intonational structure in Japanese and English. Phonology 3:255–309

Birch S, Clifton Jr C (1995) Focus, accent, and argument structure: effects on language comprehension. Lang Speech 38(Pt 4):365–391. https://doi.org/10.1177/002383099503800403

Boersma P, Weenink D (2018) Praat: doing phonetics by computer [computer program]. Version 6.0.37. Accessed 3 Feb 2018. http://www.praat.org

Braun B, Tagliapietra L (2010) The role of contrastive intonation contours in the retrieval of contextual alternatives. Lang Cogn Process 25(7–9):1024–1043. https://doi.org/10.1080/01690960903036836

Braun B, Dehé N, Neitsch J, Wochner D, Zahner K (2019) The prosody of rhetorical and information-seeking questions in German. Lang Speech 62(4):779–807. https://doi.org/10.1177/0023830918816351

Breen M, Fedorenko E, Wagner M, Gibson E (2010) Acoustic correlates of information structure. Lang Cogn Process 25(7–9):1044–1098

Broadus RN (1987) Toward a definition of “bibliometrics”. Scientometrics 12(5–6):373–379. https://doi.org/10.1007/BF02016680

Calhoun S (2010) The centrality of metrical structure in signaling information structure: a probabilistic perspective. Language 86(1):1–42. https://doi.org/10.1353/lan.0.0197

Calhoun S, Warren P, Yan M (2023) Cross-language influences in the processing of L2 prosody. In: Elgort I, Siyanova-Chanturia A, Brysbaert M (eds) Cross-language influences in bilingual processing and second language acquisition. John Benjamins Publishing Company, pp. 47–73

Chomsky N, Halle M (1968) The sound pattern of English. Harper & Row

Clopper CG, Tonhauser J (2013) The prosody of focus in Paraguayan Guaraní. Int J Am Linguist 79(2):219–251

Cole J (2015) Prosody in context: a review. Lang Cogn Neurosci 30(1–2):1–31. https://doi.org/10.1080/23273798.2014.963130

Cole J, Mo Y, Baek S (2010a) The role of syntactic structure in guiding prosody perception with ordinary listeners and everyday speech. Lang Cogn Process 25(7–9):1141–1177

Cole J, Mo Y, Baek S (2010b) The role of syntactic structure in guiding prosody perception with ordinary listeners and everyday speech. Lang Cogn Process 25(7–9):1141–1177

Cutler A, Carter DM (1987) The predominance of strong initial syllables in the English vocabulary. Comput Speech Lang 2(3–4):133–142

Cutler A, Norris D (1988) The role of strong syllables in segmentation for lexical access. J Exp Psychol: HPP 14:113–121

Cutler A, Butterfield S (1992) Rhythmic cues to speech segmentation: evidence from juncture misperception. J Mem Lang 31(2):218–236

Cutler A, Dahan D, van Donselaar W (1997) Prosody in the comprehension of spoken language: a literature review. Lang Speech 40(2):141–201. https://doi.org/10.1177/002383099704000203

Cutler A, Mehler J, Norris D, Segui J (1986) The syllable’s differing role in the segmentation of French and English. J Mem Lang 25(4):385–400

Cutler A, Mehler J, Norris D, Segui J (1992) The monolingual nature of speech segmentation by bilinguals. Cogn Psychol 24:381–410

Fodor JD (1998) Learning to parse? J Psycholinguist Res 27(2):285–319

Frazier L, Carlson K, Clifton C Jr (2006) Prosodic phrasing is central to language comprehension. Trends Cogn Sci 10(6):244–249

Fu Y, Wang H, Guo H, Bermúdez-Margaretto B, Domínguez Martínez A (2021) What, where, when and how of visual word recognition: a bibliometrics review. Lang Speech 64(4):900–929. https://doi.org/10.1177/0023830920974710

Gotzner N (2017) Alternative sets in language processing: how focus alternatives are represented in the mind. Springer

Gussenhoven C (2004) The phonology of tone and intonation. Cambridge University Press

Gussenhoven C, Chen A (eds) (2020) The Oxford handbook of language prosody. Oxford University Press, Oxford, UK

Harley B, Howard J, Hart D (1995) Second language processing at different ages: do younger learners pay more attention to prosodic cues to sentence structure? Lang Learn 45(1):43–71. https://doi.org/10.1111/j.1467-1770.1995.tb00962.x

Hayes B (1995) Metrical stress theory: principles and case studies. University of Chicago Press, Chicago

Hjørland B (2013) Facet analysis: the logical approach to knowledge organization. Inf Process Manag 49(2):545–557. https://doi.org/10.1016/j.ipm.2012.10.001

Hunston S (2007) Semantic prosody revisited. Int J corpus Linguist 12(2):249–268

Hwang H, Schafer AJ (2006) Prosodic effects in parsing early vs. late closure sentences by second language learners and native speakers. In: Hoffmann R, Mixdorff H (eds) Speech prosody 2006, third international conference, paper 091. International Speech Communication Association, Dresden, Germany, May 2–5, 2006

Ip MHK, Cutler A (2022) Juncture prosody across languages: similar production but dissimilar perception. Lab Phonol 13(1)

Jaeger TF (2008) Categorical data analysis: away from ANOVAs (transformation or not) and towards logit mixed models. J Mem Lang 59(4):434–446

Article   PubMed   PubMed Central   Google Scholar  

Katz J, Selkirk E (2011) Contrastive focus vs. discourse-new: evidence from phonetic prominence in English. Language 87:771–816

Kjelgaard MM, Speer SR (1999) Prosodic facilitation and interference in the resolution of temporary syntactic closure ambiguity. J Mem Lang 40(2):153–194. https://doi.org/10.1006/jmla.1998.2620

Krahmer E, Swerts M (2009) Audiovisual prosody—introduction to the special issue. Lang Speech 52(2–3):129–133. https://doi.org/10.1177/0023830909103164

Kügler F, Calhoun S (2020) Prosodic encoding of information structure. In: Gussenhoven C, Chen A (eds) The Oxford handbook of language prosody. Oxford University Press, Oxford, UK, pp. 454–467

Ladd DR, Mennen I, Schepman A (2000) Phonological conditioning of peak alignment in rising pitch accents in Dutch. J Acoust Soc Am 107(5):2685–2696

Article   ADS   CAS   PubMed   Google Scholar  

Ladd DR, Dan F, Hanneke F, Schepman A (1999) Constant “segmental anchoring” of F0 movements under changes in speech rate. J Acoust Soc Am 106(3):1543–1554

Ladd DR (1996, 2008) Intonational phonology, 2nd edn. Cambridge studies in linguistics. Cambridge University Press, Cambridge, New York

Lee D (2023) Bibliometric analysis of Asian ‘language and linguistics’ research: a case of 13 countries. Humanit Soc Sci Commun 10(1):1–23

Article   MathSciNet   CAS   Google Scholar  

Lee E-K, Watson DG (2011) Effects of pitch accents in attachment ambiguity resolution. Lang Cogn Process 26(2):262–297

Lehiste I (1970) Suprasegmentals. Massachusetts Institute of Technology, Diss

Lei L, Liu D (2019a) Research trends in applied linguistics from 2005 to 2016: a bibliometric analysis and its implications. Appl Linguist 40(3):540–561

Article   ADS   Google Scholar  

Lei L, Liu D (2019b) The research trends and contributions of system’s publications over the past four decades (1973–2017): a bibliometric analysis. System 80:1–13

Levelt WJM, Roelofs A, Meyer AS (1999) A theory of lexical access in speech production. Behav Brain Sci 22(1):1–38

Lin Y, Ding H, Zhang Y (2020) Prosody dominates over semantics in emotion word processing: evidence from cross-channel and cross-modal stroop effects. J Speech Language Hear Res 63(3):896–912. https://doi.org/10.1044/2020_JSLHR-19-00258

Mennen I (2015) Beyond segments: towards a L2 intonation learning theory. In: Delais-Roussarie E, Mathieu A, Sophie H (eds) Prosody and language in contact. Prosody, phonology and phonetics. Springer, Berlin, Heidelberg

Namjoshi J, Tremblay A (2014) The processing of prosodic focus in French. Columbus, OH, USA

Nespor M, Vogel I (1986) Prosodic phonology. Foris publications, Dordrecht

Norris D, Cutler A, McQueen JM, Butterfield S (2006) Phonological and conceptual activation in speech comprehension. Cogn Psychol 53(2):146–193

O’Brien MG, Jackson CN, Gardner CE (2014) Cross-linguistic differences in prosodic cues to syntactic disambiguation in German and English. Appl Psycholinguist 35(1):27–70. https://doi.org/10.1017/S0142716412000252

Omidian T, Siyanova‐Chanturia A (2020) Semantic prosody revisited: Implications for language learning. TESOL Q 54(2):512–524

Pell MD, Jaywant A, Monetta L, Kotz SA (2011) Emotional speech processing: disentangling the effects of prosody and semantic cues. Cogn Emotion 25(5):834–853. https://doi.org/10.1080/02699931.2010.516915

Pierrehumbert J (1980) The phonology and phonetics of English intonation. Dissertation, Massachusetts Institute of Technology

Pierrehumbert J (1990) The meaning of intonational contours in the interpretation of discourse Janet Pierrehumbert and Julia Hirschberg. Intentions Commun 271

Pierrehumbert J, Beckman M (1988) Japanese tone structure. Linguistic Inquiry Monogr (15):1–282

Pratt E, Fernandez EM (2016) Implicit prosody and cue-based retrieval: L1 and L2 agreement and comprehension during reading. Front Psychol 7:1922. https://doi.org/10.3389/fpsyg.2016.01922

Pratt E (2018) Prosody in sentence processing. In: Fernandez EM, Smith Cairns H (eds) The handbook of psycholinguistics. John Wiley & Sons, Inc., Hoboken, NJ, pp. 365–391

Prieto P (2015) Intonational meaning. Wiley Interdiscip Rev: Cogn Sci 6(4):371–381. https://doi.org/10.1002/wcs.1352

Pritchard A (1969) Statistical bibliography or bibliometrics. J Documentation 25(4):348

R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria

Repp S (2020) The prosody of Wh-exclamatives and Wh-questions in German: speech act differences, information structure, and sex of speaker. Language Speech 63(2):306–361. https://doi.org/10.1177/0023830919846147

Repp S, Spalek K (2021) The role of alternatives in language. Front Commun 6:682009

Roncaglia-Denissen MP, Schmidt-Kassow M, Heine A, Kotz SA (2014) On the impact of L2 speech rhythm on syntactic ambiguity resolution. Second Lang Res 31(2):157–178. https://doi.org/10.1177/0267658314554497

Rooth M (1992) A theory of focus interpretation. Nat Lang Semant 1(1):75–116

Rooth M (1985) Association with focus. PhD thesis, University of Massachusetts, MA, USA

Rossetto DE, Bernardes RC, Borini FM, Gattaz CC (2018) Structure and evolution of innovation research in the last 60 years: review and future trends in the field of business through the citations and co-citations analysis. Scientometrics 115(3):1329–1363

Sanders LD, Helen JN, Marty GW (2002) Speech segmentation by native and non-native speakers: the use of lexical, syntactic, and stress-pattern cues. J Speech, Lang, Hearing Res 45(3):519–530

Schafer A, Carlson K, Clifton Jr H, Frazier L (2000) Focus and the interpretation of pitch accent: disambiguating embedded questions. Lang Speech 43(1):75–105

Selkirk E (2011) The syntax–phonology interface. In: Goldsmith JA, Riggle J, Yu ACL (eds) The handbook of phonological theory, vol 2. pp. 435–483

Selkirk E (1984) Phonology and syntax: the relation between sound and structure. The MIT Press, Cambridge

Shattuck-Hufnagel S, Turk AE (1996) A prosody tutorial for investigators of auditory sentence processing. J Psycholinguist Res 25(2):193–247

Small H, Sweeney E (1985) Clustering the science citation index® using co-citations: I. A comparison of methods. Scientometrics 7:391–409

Spalek K, Gotzner N, Wartenburger I (2014) Not only the apples: focus sensitive particles improve memory for information-structural alternatives. J Mem Lang 70:68–84

Steedman M (2000) Information structure and the syntax-phonology interface. Linguist Inq 31(4):649–689. https://doi.org/10.1162/002438900554505

Strange W (1995) Speech perception and linguistic experience: issues in cross-language research

Tjuka A, Nguyen HTT, Spalek K (2020) Foxes, deer, and hedgehogs: the recall of focus alternatives in Vietnamese

van Eck NJ, Waltman L (2010) Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 84(2):523–538

van Eck NJ, Waltman L (2017) Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics 111(2):1053–1070. https://doi.org/10.1007/s11192-017-2300-7

Wagner M, Watson DG (2010) Experimental and theoretical advances in prosody: a review. Lang Cogn Process 25(7–9):905–945

Waltman L, van Eck NJ (2013) A smart local moving algorithm for large-scale modularity-based community detection. Eur Phys J B 86(11):1–14

Welby P (2003) Effects of pitch accent position, type, and status on focus projection. Lang Speech 46(1):53–81

Wu X (2022) Motivation in second language acquisition: a bibliometric analysis between 2000 and 2021. Front Psychol 13:1032316

Xu Y (1999) Effects of tone and focus on the formation and alignment of F0 contours. J Phon 27(1):55–105

Yan M, Calhoun S (2019) Priming effects of focus in Mandarin Chinese. Front Psychol 10:1985

Yan M, Calhoun S, Warren P (2023) The role of prominence in activating focused words and their alternatives in mandarin: evidence from lexical priming and recognition memory. Lang Speech 66(3):678–705. 00238309221126108

Zhang X (2019) A bibliometric analysis of second language acquisition between 1997 and 2018. Stud Second Lang Acquis 42(1):199–222. https://doi.org/10.1017/S0272263119000573

Zhu H, Lei L (2021) A dependency-based machine learning approach to the identification of research topics: a case in COVID-19 studies. Library Hi Tech

Download references

Acknowledgements

The study was supported by The National Social Science Fund of China (21CYY014).

Author information

Authors and Affiliations

School of Foreign Languages, Huazhong University of Science and Technology, 1037 Luoyu Road, Hongshan District, 430074, Wuhan, Hubei, China

Mengzhu Yan & Xue Wu


Contributions

These authors contributed equally to this work.

Corresponding author

Correspondence to Xue Wu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Yan, M., Wu, X. Prosody in linguistic journals: a bibliometric analysis. Humanit Soc Sci Commun 11, 311 (2024). https://doi.org/10.1057/s41599-024-02825-9


Received: 09 May 2023

Accepted: 13 February 2024

Published: 26 February 2024

DOI: https://doi.org/10.1057/s41599-024-02825-9


Linguistic Analysis

Welcome to Linguistic Analysis

A peer-reviewed research journal publishing articles in formal phonology, morphology, syntax and semantics. The journal has been in continuous publication since 1976. ISSN: 0098-9053

Please note that Volumes, Issues, and Individual Articles, as well as a yearly Unlimited Access Pass (via IP authentication or username-and-password) to Linguistic Analysis, are now available for purchase and download on this website. For more information on rates and ordering options, please visit the Rates page. We will continue to add new material, so come back to visit. Please contact us if you are interested in specific back issues.

Current Issue

Linguistic Analysis Volume 43 Issues 1 & 2 (2022)

Barcelona Conference on Syntax, Semantics, & Phonology, edited by Anna Paradis & Lorena Castillo-Ros.

This issue brings together a selection of ten papers presented at the 15th Workshop on Syntax, Semantics, and Phonology (WoSSP), held at the Universitat Autònoma de Barcelona on June 28-29, 2018. WoSSP is a series of ongoing workshops organized by PhD students for students working in any domain of generative linguistics, offering them a forum to share their work in progress. One of the main aims of the WoSSP conference is to provide a space where graduate students who wish to present their work may exchange ideas within different formal approaches to linguistic phenomena.

Read the Introduction

Issues in Preparation

Volume 43, 3-4: Dependency Grammars

This issue, edited by Timothy Osborne, brings together a selection of papers that examine dependency grammars from a variety of perspectives.

Volume 44, 1-2: Pot-pourri

A selection of orthodox and alternate linguistic perspectives, including an in-depth examination of phonology in classical Arabic poetry, and three article-length studies of English grammar by Michael Menaugh.

Note: Volume 43, 3-4, will be the last issue of the journal published in print. Beginning with Volume 44, 1-2, all issues will be available in electronic form only on this website <www.linguisticanalysis.com>. Interested parties will be able to purchase single articles or whole issues, or take advantage of the annual All-Access Pass to everything.

Note: We are also uploading all past volumes and issues of the journal and expect this process to be completed by the end of 2023.

Thank you for your patience and continued support.


Computer Science > Computation and Language

Title: Evaluating Telugu Proficiency in Large Language Models: A Comparative Analysis of ChatGPT and Gemini

Abstract: The growing prominence of large language models (LLMs) necessitates the exploration of their capabilities beyond English. This research investigates the Telugu language proficiency of ChatGPT and Gemini, two leading LLMs. Through a designed set of 20 questions encompassing greetings, grammar, vocabulary, common phrases, task completion, and situational reasoning, the study delves into their strengths and weaknesses in handling Telugu. The analysis aims to identify the LLM that demonstrates a deeper understanding of Telugu grammatical structures, possesses a broader vocabulary, and exhibits superior performance in tasks like writing and reasoning. By comparing their ability to comprehend and use everyday Telugu expressions, the research sheds light on their suitability for real-world language interaction. Furthermore, the evaluation of adaptability and reasoning capabilities provides insights into how each LLM leverages Telugu to respond to dynamic situations. This comparative analysis contributes to the ongoing discussion on multilingual capabilities in AI and paves the way for future research in developing LLMs that can seamlessly integrate with Telugu-speaking communities.
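The comparative protocol the abstract describes — a fixed question set spanning six categories, scored per category for two models — can be sketched as a small harness. Everything below (function names, the pass/fail judgement format, the toy data) is our own illustrative assumption; the paper's actual questions and scoring rubric are not reproduced here.

```python
from collections import defaultdict

# Category names mirror the abstract; the scoring logic is a hypothetical
# sketch, not the authors' published rubric.
CATEGORIES = ["greetings", "grammar", "vocabulary",
              "common phrases", "task completion", "situational reasoning"]

def category_scores(judgements):
    """judgements: list of (category, is_correct) pairs for one model.
    Returns {category: fraction of answers judged correct}."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, ok in judgements:
        totals[category] += 1
        correct[category] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}

def compare(model_a, model_b):
    """Label each category by which model scored higher."""
    a, b = category_scores(model_a), category_scores(model_b)
    return {c: ("A" if a[c] > b[c] else "B" if b[c] > a[c] else "tie")
            for c in a}

if __name__ == "__main__":
    # Invented judgements purely for illustration.
    chatgpt = [("grammar", True), ("grammar", True), ("vocabulary", False)]
    gemini = [("grammar", True), ("grammar", False), ("vocabulary", True)]
    print(compare(chatgpt, gemini))  # → {'grammar': 'A', 'vocabulary': 'B'}
```

In a real run, the judgements would come from human raters marking each model's answer to the same 20-question set, which is what makes the per-category comparison meaningful.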


Journal of Language Teaching and Research

Pragma-Multimodal Discourse Analysis of Environmental Slogans

  • Wasan N. Fadhil, Kerbala University

Framing slogans within images is one of the most effective ways to raise people's awareness of caring for the environment and keeping it serene. This paper presents a pragma-multimodal analysis of environmental slogans with images created on different social media platforms. It aims to identify the illocutionary act of each text and to explore how each text cooperates with its image to create a comprehensive meaning. The dataset selected for this paper comprises ten slogans with images. The study was conducted qualitatively using a descriptive-analytical approach. Findings reveal that these slogans convey various illocutionary acts, such as requesting, inviting, and asserting, and that there is an interconnectedness between each text and its image that helps construct a successful meaning.

Author Biography

Wasan N. Fadhil, Kerbala University

English Department



Existing work includes frameworks like Woodpecker, which focuses on extracting key concepts for hallucination diagnosis and mitigation in large language models. Models like AlpaGasus leverage fine-tuning on high-quality data to enhance effectiveness and accuracy, and related methodologies use similar fine-tuning techniques to improve the factuality of outputs. These efforts collectively address critical issues in reliability and control, setting the groundwork for further advancements in the field.

Researchers from Huazhong University of Science and Technology, the University of New South Wales, and Nanyang Technological University have introduced HalluVault. This novel framework employs logic programming and metamorphic testing to detect Fact-Conflicting Hallucinations (FCH) in Large Language Models (LLMs). This method stands out by automating the update and validation of benchmark datasets, which traditionally rely on manual curation. By integrating logic reasoning and semantic-aware oracles, HalluVault ensures that the LLM’s responses are not only factually accurate but also logically consistent, setting a new standard in evaluating LLMs.


HalluVault’s methodology rigorously constructs a factual knowledge base primarily from Wikipedia data. The framework applies five unique logic reasoning rules to this base, creating a diversified and enriched dataset for testing. Test case-oracle pairs generated from this dataset serve as benchmarks for evaluating the consistency and accuracy of LLM responses. Two semantic-aware testing oracles are integral to the framework, assessing the semantic structure and logical consistency between the LLM outputs and the established truths. This systematic approach ensures that LLMs are evaluated under stringent conditions that mimic real-world data processing challenges, effectively measuring their reliability and factual accuracy.
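The core mechanism described above — logic rules that derive test case-oracle pairs from known facts, plus an oracle that checks the model's answers for consistency — can be illustrated with a toy metamorphic test. The rule and oracle below are simplified stand-ins for HalluVault's logic-programming machinery over a Wikipedia-derived knowledge base; the function names, the symmetry rule, and the yes/no answer format are all our own assumptions.

```python
def symmetry_rule(fact):
    """Derive a metamorphic pair of questions from one (subject, relation,
    object) fact, assuming the relation is symmetric: both orderings must
    receive the same answer."""
    s, r, o = fact
    return [f"Is it true that {s} {r} {o}?",
            f"Is it true that {o} {r} {s}?"]

def consistency_oracle(answers):
    """Metamorphic oracle: all answers to rule-derived variants must agree."""
    return len(set(answers)) == 1

def run_test(fact, llm):
    """Query the model on each derived variant; False flags a potential
    fact-conflicting hallucination."""
    answers = [llm(q) for q in symmetry_rule(fact)]
    return consistency_oracle(answers)

if __name__ == "__main__":
    # Toy "LLM" that contradicts itself when the question is reversed.
    flaky = lambda q: "yes" if q.startswith("Is it true that Paris") else "no"
    print(run_test(("Paris", "borders", "Saint-Denis"), flaky))  # → False
```

The appeal of the metamorphic setup is that no gold answer is needed for each question individually: the rule itself guarantees a relation between the answers, so any disagreement is evidence of error regardless of which answer is factually right.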

The evaluation of HalluVault revealed significant improvements in detecting factual inaccuracies in LLM responses. Through systematic testing, the framework reduced the rate of hallucinations by up to 40% compared to previous benchmarks. In trials, LLMs using HalluVault’s methodology demonstrated a 70% increase in accuracy when responding to complex queries across varied knowledge domains. Furthermore, the semantic-aware oracles successfully identified logical inconsistencies in 95% of test cases, ensuring robust validation of LLM outputs against the enhanced factual dataset. These results validate HalluVault’s effectiveness in enhancing the factual reliability of LLMs.


To conclude, HalluVault introduces a robust framework for enhancing the factual accuracy of LLMs through logic programming and metamorphic testing. The framework ensures that LLM outputs are factually and logically consistent by automating the creation and updating of benchmarks with enriched data sources like Wikipedia and employing semantic-aware testing oracles. The significant reduction in hallucination rates and improved accuracy in complex queries underscore the framework’s effectiveness, marking a substantial advancement in the reliability of LLMs for practical applications.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.



  18. Language and Research: Exploring the Impact of Linguistic Analysis in

    Once research has been conducted, findings must be disseminated effectively, a task heavily reliant on language. In writing research papers, language precision is crucial to accurately convey ...

  19. Evaluating Telugu Proficiency in Large Language Models_ A Comparative

    The growing prominence of large language models (LLMs) necessitates the exploration of their capabilities beyond English. This research investigates the Telugu language proficiency of ChatGPT and Gemini, two leading LLMs. Through a designed set of 20 questions encompassing greetings, grammar, vocabulary, common phrases, task completion, and situational reasoning, the study delves into their ...

  20. Pragma-Multimodal Discourse Analysis of Environmental Slogans

    One of the most effective ways to create awareness among people to care for the environment and keep it serene is framing slogans in images. This paper is a pragma-multimodal analysis of environmental slogans with images created on different social media platforms. It aims to discover the illocutionary act of each text and explore how each text cooperates with its image to create a ...

  21. This AI Paper Introduces HalluVault for Detecting Fact-Conflicting

    The fundamental problem tackled by contemporary research is the inefficiency of existing data analysis methods. Traditional tools often need to catch up when tasked with processing large-scale data due to limitations in speed and adaptability. This inefficiency can significantly hinder progress, especially when real-time data analysis is crucial.

  22. Information Systems IE&IS

    In order to do that, the IS group helps organizations to: (i) understand the business needs and value propositions and accordingly design the required business and information system architecture; (ii) design, implement, and improve the operational processes and supporting (information) systems that address the business need, and (iii) use advanced data analytics methods and techniques to ...