Introduction to Large Language Models
New to language models or large language models? Check out the resources below.
What is a language model?
A language model is a machine learning model that aims to predict and generate plausible language. Autocomplete is a language model, for example.
These models work by estimating the probability of a token or sequence of tokens occurring within a longer sequence of tokens. Consider a sentence with a single word blanked out.
If you assume that a token is a word, then a language model determines the probabilities of different words or sequences of words that could fill that blank, assigning higher probability to completions that are more plausible in context.
A "sequence of tokens" could be an entire sentence or a series of sentences. That is, a language model could calculate the likelihood of different entire sentences or blocks of text.
Estimating the probability of what comes next in a sequence is useful for all kinds of things: generating text, translating languages, and answering questions, to name a few.
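To make this concrete, here is a minimal sketch (not from the original text) that scores a short sequence by multiplying the probability of each token given the tokens before it. The conditional probabilities below are invented for illustration; a real language model would compute them from learned parameters.

```python
# Minimal sketch: scoring a sequence with hypothetical next-token probabilities.
# The probability table is made up; a real model would produce these values.

toy_next_token_probs = {
    ("<s>",): {"the": 0.4, "a": 0.3},
    ("<s>", "the"): {"dog": 0.2, "cat": 0.1},
    ("<s>", "the", "dog"): {"barked": 0.3, "slept": 0.2},
}

def sequence_probability(tokens):
    """P(sequence) = product of P(token_i | tokens before it)."""
    prob = 1.0
    for i, token in enumerate(tokens):
        context = ("<s>",) + tuple(tokens[:i])
        prob *= toy_next_token_probs.get(context, {}).get(token, 1e-6)
    return prob

print(sequence_probability(["the", "dog", "barked"]))  # 0.4 * 0.2 * 0.3 = 0.024
```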
What is a large language model?
Modeling human language at scale is a highly complex and resource-intensive endeavor. The path to reaching the current capabilities of language models and large language models has spanned several decades.
As models are built bigger and bigger, their complexity and efficacy increase. Early language models could predict the probability of a single word; modern large language models can predict the probability of sentences, paragraphs, or even entire documents.
The size and capability of language models have exploded over the last few years as computer memory, dataset size, and processing power have increased, and as more effective techniques for modeling longer text sequences have been developed.
How large is large?
The definition is fuzzy, but "large" has been used to describe BERT (110M parameters) as well as PaLM 2 (up to 340B parameters).
Parameters are the weights the model learned during training, used to predict the next token in the sequence. "Large" can refer either to the number of parameters in the model, or sometimes the number of words in the dataset.
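As a toy illustration of what counts as a parameter (the layer sizes below are invented and only loosely BERT-like; they are not from the original text), every entry of an embedding table or weight matrix is one learned number:

```python
# Toy parameter count: one embedding table plus one dense layer.
vocab_size, embed_dim, hidden_dim = 30_000, 768, 3_072

embedding_params = vocab_size * embed_dim            # one weight per (word, dimension) pair
dense_params = embed_dim * hidden_dim + hidden_dim   # weight matrix + bias vector
print(embedding_params + dense_params)               # ~25.4 million for just these two pieces;
                                                     # full LLMs stack many such layers.
```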
Transformers
A key development in language modeling was the introduction in 2017 of Transformers, an architecture designed around the idea of attention. This made it possible to process longer sequences by focusing on the most important part of the input, solving memory issues encountered in earlier models.
Transformers are the state-of-the-art architecture for a wide variety of language model applications, such as translators.
If the input is "I am a good dog.", a Transformer-based translator transforms that input into the output "Je suis un bon chien.", which is the same sentence translated into French.
Full Transformers consist of an encoder and a decoder. An encoder converts input text into an intermediate representation, and a decoder converts that intermediate representation into useful text.
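As a hedged illustration of that example (not code from the original page), the Hugging Face transformers library exposes pretrained encoder-decoder models through a pipeline. The "t5-small" checkpoint is just one small, publicly available choice, and its exact output may differ from the sentence above.

```python
# Sketch: translating with a pretrained encoder-decoder Transformer.
# Assumes the Hugging Face `transformers` library is installed; "t5-small"
# is one publicly available checkpoint, and outputs may vary.
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("I am a good dog.")
print(result[0]["translation_text"])  # e.g. "Je suis un bon chien."
```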
Self-attention
Transformers rely heavily on a concept called self-attention. The self part of self-attention refers to the "egocentric" focus of each token in a corpus. Effectively, on behalf of each token of input, self-attention asks, "How much does every other token of input matter to me?" To simplify matters, let's assume that each token is a word and the complete context is a single sentence. Consider the following sentence:
The animal didn't cross the street because it was too tired.
There are 11 words in the preceding sentence, so each of the 11 words is paying attention to the other ten, wondering how much each of those ten words matters to it. For example, notice that the sentence contains the pronoun it. Pronouns are often ambiguous. The pronoun it typically refers to a recent noun, but in the example sentence, which recent noun does it refer to: the animal or the street?
The self-attention mechanism determines the relevance of each nearby word to the pronoun it.
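To make the mechanism concrete, here is a minimal sketch of scaled dot-product self-attention using NumPy. The embeddings and projection matrices are random stand-ins for what a trained Transformer would have learned; the point is only the shape of the computation.

```python
# Sketch of scaled dot-product self-attention over a tiny "sentence".
# Vectors and projection matrices are random stand-ins for learned weights.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                 # 5 tokens, 8-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))

W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)               # how much each token attends to every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
output = weights @ V                              # context-aware representation of each token

print(weights.round(2))  # row i: attention of token i over all tokens
```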
What are some use cases for LLMs?
LLMs are highly effective at the task they were built for, which is generating the most plausible text in response to an input. They are even beginning to show strong performance on other tasks, such as summarization, question answering, and text classification. These are called emergent abilities. LLMs can even solve some math problems and write code (though it's advisable to check their work).
LLMs are excellent at mimicking human speech patterns. Among other things, they're great at combining information with different styles and tones.
However, LLMs can also be components of systems that do more than just generate text. Recent LLMs have been used to build sentiment detectors and toxicity classifiers, and to generate image captions.
LLM Considerations
Models this large are not without their drawbacks.
The largest LLMs are expensive. They can take months to train, and as a result consume lots of resources.
They can also usually be repurposed for other tasks, a valuable silver lining.
Training models with upwards of a trillion parameters creates engineering challenges. Special infrastructure and programming techniques are required to coordinate the flow of data to the chips and back again.
There are ways to mitigate the costs of these large models. Two approaches are offline inference and distillation.
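As a hedged sketch of the distillation idea (a smaller student model is trained to match a larger teacher's output distribution): this is not Google's specific recipe, and the temperature value and KL-divergence loss below are common illustrative choices rather than requirements.

```python
# Sketch of a knowledge-distillation loss in PyTorch: the student is trained
# to match the teacher's softened next-token distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Example with random logits over a 100-token vocabulary.
student_logits = torch.randn(4, 100)
teacher_logits = torch.randn(4, 100)
print(distillation_loss(student_logits, teacher_logits))
```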
Bias can be a problem in very large models and should be considered in training and deployment.
Because these models are trained on human language, they can introduce numerous potential ethical issues, including the misuse of language and bias related to race, gender, religion, and more.
As these models continue to get bigger and perform better, there is a continuing need to be diligent about understanding and mitigating their drawbacks. Learn more about Google's approach to responsible AI.
Learn more about LLMs
Interested in a more in-depth introduction to large language models? Check out the new Large language models module in Machine Learning Crash Course.
COS 597G (Fall 2022): Understanding Large Language Models
Instructor | Danqi Chen (danqic AT cs.princeton.edu)
Teaching assistant | Alexander Wettig (awettig AT cs.princeton.edu)
Lectures | Monday/Wednesday 10:30-11:50am
Location |
Pre-lecture feedback meetings | Monday 3:30-4pm for Wednesday lectures, Friday 4:45-5:15pm for Monday lectures, COS 412
Office hours | Danqi's office hour: Monday 2:30-3:30pm, COS 412; Alex's office hour: Wednesday 3-4pm, Friend Center (student space lobby)
Feedback form |
We will use a Slack team for most communications this semester (no Ed!). We will add you to the Slack team after the first lecture; if you join the class late, just email us and we will add you. Once you are on Slack, we prefer Slack messages over emails for all logistical questions. We also encourage students to use Slack for discussion of lecture content and projects.
Large language models (LLMs) have utterly transformed the field of natural language processing (NLP) in the last 3-4 years. They form the basis of state-of-the-art systems and have become ubiquitous in solving a wide range of natural language understanding and generation tasks. Along with this unprecedented potential and capability, these models also give rise to new ethical and scalability challenges. This course aims to cover cutting-edge research topics centered around pre-trained language models. We will discuss their technical foundations (BERT, GPT, T5 models, mixture-of-expert models, retrieval-based models), emerging capabilities (knowledge, reasoning, few-shot learning, in-context learning), fine-tuning and adaptation, system design, as well as security and ethics. We will cover each topic and discuss important papers in depth. Students will be expected to routinely read and present research papers and complete a research project at the end.
This is an advanced graduate course. All students are expected to have taken machine learning and NLP courses before and to be familiar with deep learning models such as Transformers.
Learning goals:
- This course is intended to prepare you for performing cutting-edge research in natural language processing, especially topics related to pre-trained language models. We will discuss the state of the art, its capabilities, and its limitations.
- Practice your research skills, including reading research papers, conducting literature surveys, giving oral presentations, and providing constructive feedback.
- Gain hands-on experience through the final project, from brainstorming ideas to implementation, empirical evaluation, and writing the final paper.
Course structure
- Class participation (25%): In each class, we will cover 1-2 papers. You are required to read these papers in depth and answer around 3 pre-lecture questions (see "pre-lecture questions" in the schedule table) before 11:59pm the day prior to the lecture. These questions are designed to test your understanding and stimulate your thinking on the topic, and they will count towards class participation (we will not grade correctness; as long as you do your best to answer these questions, you will be fine). In the last 20 minutes of class, we will review and discuss these questions in small groups.
- 1-2 papers have already been chosen for each topic. We also encourage you to include background or useful materials from the "recommended reading" when you see a good fit.
- You are also required to meet with the instructor before the lecture (Monday 3:30-4pm for Wednesday lectures and Friday 4:45-5:15pm for Monday lectures). Please send your draft slides on Slack before 11:59pm the day prior to the meeting and we will go over your slides during the meeting.
- You are expected to present 1-2 times and you will receive feedback on your presentation from 3-4 classmates.
- Lecture feedback (5%): In addition to giving lectures, you are also required to provide written feedback to the presenter(s) on their lecture, 1+ pages in length, commenting on the content, delivery, clarity, completeness, etc. There is no need for complete sentences; bullet points are fine, but the feedback should be thorough and constructive. These notes should be sent to the instructor/TA on Slack within a day of the lecture (a Google Doc link is preferred). You are expected to do this 2-3 times throughout the semester.
- Final project: Complete a research project and write up your findings in a final paper. Possible directions include:
- Train or fine-tune a medium-sized language model (e.g., BERT/RoBERTa, T5, GPT-2) yourself for a task of your interest. You will probably need to access pre-trained models on HuggingFace's hub. If you don't have decent compute resources, we will provide a certain compute budget for you to execute your project.
- Prompt and evaluate a very large language model (e.g., GPT-3, Codex) to understand its capabilities, limitations, or risks. We will provide a certain budget for you to access these large models if needed.
Useful materials:
- Jurafsky & Martin, Speech and Language Processing (3rd ed. draft, slp3) is an NLP textbook you can use to look up specific topics in NLP.
- On the Opportunities and Risks of Foundation Models (published by Stanford researchers in July 2021) surveys a range of topics on foundation models (large language models are a large part of them).
- A Primer in BERTology: What we know about how BERT works provides an excellent overview of what we understand about BERT (last update: Nov 2020).
A Beginner’s Guide to Language Models
A language model is a probability distribution over words or word sequences. Learn more about different types of language models and what they can do.
Extracting information from textual data has changed dramatically over the past decade. As the term natural language processing has overtaken text mining as the name of the field, the methodology has changed tremendously, too. One of the main drivers of this change was the emergence of language models as a basis for many applications aiming to distill valuable insights from raw text.
Language Model Definition
In learning about natural language processing, I've been fascinated by the evolution of language models over the past years. You may have heard about GPT-3 and the potential threats it poses, but how did we get this far? How can a machine produce an article that mimics a journalist?
What Is a Language Model?
A language model is a probability distribution over words or word sequences. In practice, it gives the probability of a certain word sequence being "valid." Validity in this context does not refer to grammatical validity. Instead, it means that the sequence resembles how people write, which is what the language model learns. This is an important point. There's no magic to a language model (or to other machine learning models, particularly deep neural networks); it's just a tool to incorporate abundant information in a concise manner that's reusable in an out-of-sample context.
What Can a Language Model Do?
The abstract understanding of natural language, which is necessary to infer word probabilities from context, can be used for a number of tasks. Lemmatization or stemming aims to reduce a word to its most basic form, thereby dramatically decreasing the number of tokens. These algorithms work better if the part-of-speech role of the word is known. A verb's suffixes can be different from a noun's suffixes, hence the rationale for part-of-speech tagging (or POS tagging), a common task for a language model.
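As a quick illustration of the POS-tagging task itself (this uses NLTK's off-the-shelf tagger rather than any model discussed in the article, and assumes the relevant NLTK resources have been downloaded):

```python
# Sketch: part-of-speech tagging with NLTK's pretrained tagger.
# Assumes `nltk` is installed and the "punkt" and "averaged_perceptron_tagger"
# resources have been fetched via nltk.download(...).
import nltk

tokens = nltk.word_tokenize("The striped bats are hanging on their feet.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ...]
```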
With a good language model, we can perform extractive or abstractive summarization of texts. If we have models for different languages, a machine translation system can be built easily. Less straightforward use cases include answering questions (with or without context; see the example at the end of the article). Language models can also be used for speech recognition, OCR, handwriting recognition and more. There's a whole spectrum of opportunities.
Types of Language Models
There are two types of language models:
- Probabilistic methods
- Neural network-based modern language models
It’s important to note the difference between them.
Probabilistic Language Model
A simple probabilistic language model is constructed by calculating n-gram probabilities. An n-gram is a sequence of n words, where n is an integer greater than zero. An n-gram's probability is the conditional probability that the n-gram's last word follows the (n-1)-gram formed by the preceding words; in practice, it is estimated as the proportion of occurrences of that (n-1)-gram that are followed by the last word. This rests on a Markov assumption: given the (n-1)-gram (the present), the n-gram probability (the future) does not depend on the n-2, n-3, etc. grams that came before (the past).
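To make the counting concrete, here is a minimal sketch for bigrams (n = 2) on a toy corpus; the corpus, function, and variable names are invented for illustration.

```python
# Sketch: estimating bigram (n = 2) probabilities by counting.
# P(next_word | word) = count(word, next_word) / count(word, *)
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])   # counts of words that have a following word

def bigram_probability(word, next_word):
    return bigram_counts[(word, next_word)] / unigram_counts[word]

print(bigram_probability("the", "cat"))  # 2 occurrences of "the cat" / 3 of "the" ≈ 0.67
```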
There are evident drawbacks to this approach. Most importantly, only the preceding n-1 words affect the probability distribution of the next word. Complicated texts have deep context that may have a decisive influence on the choice of the next word. Thus, the next word might not be evident from the previous n words, even if n is 20 or 50. Moreover, a later word can influence an earlier word choice: the word United is much more probable if it is followed by States of America. Let's call this the context problem.
On top of that, it's evident that this approach scales poorly. As n increases, the number of possible word combinations skyrockets, even though most of them never occur in the text. And all the occurring probabilities (or all n-gram counts) have to be calculated and stored. In addition, non-occurring n-grams create a sparsity problem: the granularity of the probability distribution can be quite low, word probabilities take only a few distinct values, and therefore most words end up with the same probability.
Neural Network-Based Language Models
Neural network-based language models ease the sparsity problem by the way they encode inputs. Word embedding layers create an arbitrarily sized vector for each word that incorporates semantic relationships as well. These continuous vectors create the much-needed granularity in the probability distribution of the next word. Moreover, the language model is effectively a function, as all neural networks are, with lots of matrix computations, so it's not necessary to store all n-gram counts to produce the probability distribution of the next word.
Evolution of Language Models
Even though neural networks solve the sparsity problem, the context problem remains. First, language models were developed to solve the context problem more and more efficiently, bringing more and more context words in to influence the probability distribution. Second, the goal was to create an architecture that gives the model the ability to learn which context words are more important than others.
The first model, which I outlined previously, is a dense (or hidden) layer and an output layer stacked on top of a continuous bag-of-words (CBOW) Word2Vec model. A CBOW Word2Vec model is trained to guess a word from its context. (A Skip-Gram Word2Vec model does the opposite, guessing the context from a word.) In practice, a CBOW Word2Vec model requires a lot of training examples of the following structure: the inputs are the n words before and/or after the word, and the output is the word itself. We can see that the context problem is still intact.
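A hedged PyTorch sketch of that kind of fixed-window model follows; all sizes are placeholders, and for simplicity the embedding layer here is trained from scratch rather than initialized from a pretrained Word2Vec model as described above.

```python
# Sketch: a fixed-window feed-forward neural language model.
# An embedding layer feeds a dense (hidden) layer, and the output layer
# produces a probability distribution over the whole vocabulary.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size=10_000, context_size=4, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                   # (batch, context_size)
        vectors = self.embed(context_ids).flatten(1)  # (batch, context_size * embed_dim)
        hidden = torch.tanh(self.hidden(vectors))
        return torch.log_softmax(self.output(hidden), dim=-1)  # log P(next word | context)

model = FeedForwardLM()
fake_context = torch.randint(0, 10_000, (2, 4))       # two examples, 4 context-word ids each
print(model(fake_context).shape)                       # torch.Size([2, 10000])
```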
Recurrent Neural Networks (RNN)
Recurrent neural networks (RNNs) are an improvement in this regard. Whether built from long short-term memory (LSTM) or gated recurrent unit (GRU) cells, RNNs take all previous words into account when choosing the next word. AllenNLP's ELMo takes this notion a step further, utilizing a bidirectional LSTM that takes into account the context both before and after the word.
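For comparison with the fixed-window model above, here is a hedged sketch of an LSTM language model (placeholder sizes again); the recurrent hidden state carries information from all previous words rather than a fixed window of n of them.

```python
# Sketch: an LSTM language model. The recurrent hidden state summarizes
# all previous words, rather than a fixed window of n of them.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                  # (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))
        return torch.log_softmax(self.output(states), dim=-1)  # log P(next word) at every position

model = LSTMLanguageModel()
print(model(torch.randint(0, 10_000, (2, 12))).shape)  # torch.Size([2, 12, 10000])
```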
Transformers
The main drawback of RNN-based architectures stems from their sequential nature. As a consequence, training times soar for long sequences because there is no possibility of parallelization. The solution to this problem is the transformer architecture.
The GPT models from OpenAI and Google’s BERT utilize the transformer architecture, as well. These models also employ a mechanism called “Attention,” by which the model can learn which inputs deserve more attention than others in certain cases.
In terms of model architecture, the main quantum leaps were, first, RNNs (specifically LSTM and GRU cells), which solved the sparsity problem and reduced the disk space language models use, and subsequently the transformer architecture, which made parallelization possible and introduced attention mechanisms. But architecture is not the only aspect in which a language model can excel.
Compared to the GPT-1 architecture, GPT-3 has virtually nothing novel. But it's huge. It has 175 billion parameters, and it was trained on the largest corpus a model had ever been trained on at the time, drawn largely from Common Crawl. This is partly possible because of the self-supervised training strategy of a language model: a text can be used as a training example with some words omitted. The incredible power of GPT-3 comes from the fact that it has read more or less all of the text that has appeared on the internet over the past years, and it has the capability to reflect most of the complexity natural language contains.
Trained for Multiple Purposes
Finally, I'd like to review the T5 model from Google. Previously, language models were used for standard NLP tasks, like part-of-speech (POS) tagging or machine translation, with slight modifications. With a little retraining, BERT can be a POS tagger because of its abstract ability to understand the underlying structure of natural language.
With T5, there is no need for any modifications for NLP tasks. If it gets a text with some <M> tokens in it, it knows that those tokens are gaps to fill with the appropriate words. It can also answer questions. If it receives some context after the question, it searches the context for the answer. Otherwise, it answers from its own knowledge. Fun fact: it beat its own creators in a trivia quiz.
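As a hedged sketch of this text-to-text interface (the "t5-small" checkpoint is an assumption, and its answers will be much weaker than the full-size T5 results described in the article), the same model can be given different tasks purely through the input text:

```python
# Sketch: T5-style text-to-text usage. One model, different tasks expressed
# as text prompts; "t5-small" is a small public checkpoint used for illustration.
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")

print(t5("translate English to German: The house is wonderful."))
print(t5("question: What is the capital of France? context: Paris is the capital of France."))
```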
Future of Language Models
Personally, I think this is the field in which we are closest to creating AI. There's a lot of buzz around AI, and many simple decision systems and almost any neural network get called AI, but this is mainly marketing. By definition, artificial intelligence involves human-like intelligence capabilities performed by a machine. While transfer learning shines in the field of computer vision, and the notion of transfer learning is essential for an AI system, the very fact that the same model can do a wide range of NLP tasks and can infer what to do from the input is itself spectacular. It brings us one step closer to actually creating human-like intelligence systems.