Data Representation: Definition, Types, Examples

Data Representation: Data representation is a technique for organising and analysing numerical data. It depicts the relationships between facts, ideas, information, and concepts in a diagram. It is a fundamental learning strategy that is simple and easy to understand, and the appropriate form always depends on the type of data in a specific domain. Graphical representations are available in many different shapes and sizes.

In mathematics, a graph is a chart in which statistical data are represented by curves or lines drawn across coordinate points plotted on its surface. It aids in the investigation of a relationship between two variables by allowing one to evaluate the change in one variable's value relative to another over time. It is useful for analysing series and frequency distributions in a given context. On this page, we will go through several types of graphs that can be used to display data graphically. Continue reading to learn more.


Data Representation in Maths

Definition: After collecting data, the investigator has to condense it into tabular form to study its salient features. Such an arrangement is known as the presentation of data.

Any information gathered may be organised in a frequency distribution table, and then shown using pictographs or bar graphs. A bar graph is a representation of numbers made up of equally wide bars whose lengths are determined by the frequency and scale you choose.

The collected raw data can be arranged in any one of the following ways:

  • Serial order or alphabetical order
  • Ascending order
  • Descending order

Data Representation Example

Example: Let the marks obtained by \(30\) students of class VIII in a class test, out of \(50\), according to their roll numbers, be:

\(39,\,25,\,5,\,33,\,19,\,21,\,12,\,41,\,12,\,21,\,19,\,1,\,10,\,8,\,12\)

\(17,\,19,\,17,\,17,\,41,\,40,\,12,\,41,\,33,\,19,\,21,\,33,\,5,\,1,\,21\)

Data in this form is known as raw data or ungrouped data. The given data can be arranged in serial order as shown below:

Data Representation Example

Now, suppose you want to analyse the students' standard of achievement. Arranging the marks in ascending or descending order gives a much better picture.

Ascending order:

\(1,\,1,\,5,\,5,\,8,\,10,\,12,\,12,\,12,\,12,\,17,\,17,\,17,\,19,\,19\)

\(19,\,19,\,21,\,21,\,21,\,21,\,25,\,33,\,33,\,33,\,39,\,40,\,41,\,41,\,41\)

Descending order:

\(41,\,41,\,41,\,40,\,39,\,33,\,33,\,33,\,25,\,21,\,21,\,21,\,21,\,19,\,19\)

\(19,\,19,\,17,\,17,\,17,\,12,\,12,\,12,\,12,\,10,\,8,\,5,\,5,\,1,\,1\)

Raw data arranged in ascending or descending order of magnitude is known as an array or arrayed data.
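As an illustrative aside, arranging such marks into ascending and descending arrays can be done programmatically; the C++ sketch below uses the 30 marks from the example above:

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <vector>

    int main() {
        // The 30 marks from the example above, in raw (roll-number) order.
        std::vector<int> marks = {39, 25, 5, 33, 19, 21, 12, 41, 12, 21,
                                  19, 1, 10, 8, 12, 17, 19, 17, 17, 41,
                                  40, 12, 41, 33, 19, 21, 33, 5, 1, 21};

        std::sort(marks.begin(), marks.end());   // ascending array
        for (int m : marks) std::cout << m << ' ';
        std::cout << '\n';

        std::sort(marks.begin(), marks.end(), std::greater<int>());   // descending array
        for (int m : marks) std::cout << m << ' ';
        std::cout << '\n';
    }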

Graphical Representation of Data

A few of the common graphical representations of data are given below:

  • Bar chart
  • Frequency distribution table
  • Histogram
  • Pie chart
  • Line graph

Pictorial Representation of Data: Bar Chart

A bar graph represents qualitative data visually. The information is displayed horizontally or vertically and compares items such as amounts, characteristics, times, and frequencies.

The bars are arranged in order of frequency, so the most important categories are emphasised. By looking at all the bars, it is easy to tell which categories in a set of data dominate the others. Bar graphs come in several forms, such as single, stacked, or grouped.

Bar Chart

Graphical Representation of Data: Frequency Distribution Table

A frequency table or frequency distribution is a method of presenting raw data so that the information it contains can be easily understood.

The frequency distribution table is constructed using tally marks. Tally marks are a form of numeral system in which vertical lines are used for counting: a line is crossed over each group of four vertical lines to make a total of \(5\).
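For readers who program, the same tallying can be sketched in C++ with a std::map that counts occurrences (an illustration, using the first ten marks from the earlier example):

    #include <iostream>
    #include <map>
    #include <vector>

    int main() {
        std::vector<int> values = {39, 25, 5, 33, 19, 21, 12, 41, 12, 21};

        std::map<int, int> frequency;        // value -> number of occurrences
        for (int v : values) ++frequency[v];

        for (const auto& [value, count] : frequency)
            std::cout << value << " occurs " << count << " time(s)\n";
    }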

Frequency Distribution Table

Consider a jar containing pieces of bread of different colours, as shown below:

Frequency Distribution Table Example

Construct a frequency distribution table for the data mentioned above.

Frequency Distribution Table Example

Graphical Representation of Data: Histogram

The histogram is another kind of graph that uses bars in its display. It is used for quantitative data: ranges of values, known as classes, are listed at the bottom, and classes with greater frequencies have taller bars.

A histogram and a bar graph look very similar; however, they differ in the level of measurement of the data. Bar graphs measure the frequency of categorical data, where a categorical variable has two or more categories, such as gender or hair colour.

Histogram

Graphical Representation of Data: Pie Chart

The pie chart is used to represent the numerical proportions of a dataset. This graph involves dividing a circle into sectors, where each sector represents a particular element's share of the whole. Thus, it is also known as a circle chart or circle graph.

Pie Chart

Graphical Representation of Data: Line Graph

A line graph uses points and lines to represent change over time. In other words, it is a chart that shows a line joining multiple points, or a line that shows the relationship between the points.

The diagram illustrates quantitative data between two changing variables with a straight line or curve that joins a series of successive data points. Line charts compare the two variables plotted on the vertical and horizontal axes.

Line Graph

General Rules for Visual Representation of Data

We have a few rules to present the information in the graphical representation effectively, and they are given below:

  • Suitable Title: Ensure that an appropriate title is given to the graph, indicating the subject of the presentation.
  • Measurement Unit: State the unit of measurement used in the graph.
  • Proper Scale: Choose an appropriate scale to represent the data accurately.
  • Index: In the index, explain the colours, shades, lines, and designs used in the graph for better understanding.
  • Data Sources: Include the source of the information at the bottom of the graph wherever necessary.
  • Keep it Simple: Construct the graph so that everyone can understand it easily.
  • Neat: Choose the correct size, fonts, colours, etc., so that the graph serves as a model presentation of the information.

Solved Examples on Data Representation

Q.1. Construct the frequency distribution table for the data on heights (in \({\rm{cm}}\)) of \(20\) boys using the class intervals \(130-135\), \(135-140\), and so on. The heights of the boys in \({\rm{cm}}\) are:

Data Representation Example 1

Ans: The frequency distribution for the above data can be constructed as follows:

Data Representation Example

Q.2. Write the steps for the construction of a bar graph.
Ans: To construct a bar graph, follow the given steps:
1. Take graph paper and draw two lines perpendicular to each other; call them the horizontal and vertical axes.
2. Mark the information given in the data, such as days, weeks, months, years, places, etc., at uniform gaps along the horizontal axis.
3. Choose a suitable scale to decide the heights of the rectangles (the bars), and mark those heights on the vertical axis.
4. On the horizontal axis, draw bars of equal width and of the heights marked in the previous step, with equal spacing between them.
The figure so obtained is the bar graph representing the given numerical data.

Q.3. Read the bar graph, and then answer the given questions:
I. What information does the given bar graph provide?
II. What is the order of change in the number of students over the years?
III. In which year is the increase in the number of students the maximum?
IV. State whether true or false: the enrolment during \(1996-97\) is double that of \(1995-96\).

pictorial representation of data

Ans:
I. The bar graph represents the number of students in class \({\rm{VI}}\) of a school during the academic years \(1995-96\) to \(1999-2000\).
II. The number of students changes in increasing order, as the heights of the bars grow from year to year.
III. The increase in the number of students is uniform, since the bars grow in height by the same amount each year; hence, the increase is not maximum in any particular year. The enrolment in \(1996-97\) is \(200\), and the enrolment in \(1995-96\) is \(150\).
IV. Since \(200\) is not double \(150\), the enrolment in \(1996-97\) is not double that of \(1995-96\). So the statement is false.

Q.4. Write the frequency distribution for the given information on the ages of \(25\) students of class VIII in a school:
\(15,\,16,\,16,\,14,\,17,\,17,\,16,\,15,\,15,\,16,\,16,\,17,\,15\)
\(16,\,16,\,14,\,16,\,15,\,14,\,15,\,16,\,16,\,15,\,14,\,15\)
Ans: Frequency distribution of the ages of \(25\) students:

Data Representation Example

Q.5. There are \(20\) students in a classroom. The teacher asked the students to talk about their favourite subjects. The results are listed below:

Data Representation Example

By looking at the above data, which is the most liked subject?
Ans: Representing the above data in a frequency distribution table using tally marks gives the following:

Data Representation Example

From the above table, we can see that the maximum number of students \((7)\) likes mathematics.


In this article, we discussed data representation with an example. We then covered graphical representations such as the bar graph, frequency table, and pie chart, and later discussed the general rules for graphical representation. Finally, you can find solved examples along with a few FAQs. These will help you gain further clarity on this topic.


FAQs on Data Representation

Q.1: How is data represented?
A: Collected data can be expressed in various ways, such as bar graphs, pictographs, frequency tables, line graphs, pie charts and many more. It depends on the purpose of the data, and accordingly, the type of graph can be chosen.

Q.2: What are the different types of data representation?
A: A few types of data representation are given below:
1. Frequency distribution table
2. Bar graph
3. Histogram
4. Line graph
5. Pie chart

Q.3: What is data representation, and why is it essential?
A: After collecting data, the investigator has to condense it into tabular form to study its salient features. Such an arrangement is known as the presentation of data.
Importance: Data visualisation gives us a clear understanding of what the information means by displaying it visually through maps or graphs. Visualised data is easier for the mind to comprehend, and it makes it easier to identify trends and outliers within large data sets.

Q.4: What is the difference between data and data representation?
A: The term data refers to a collection of specific quantitative facts, such as heights or numbers of children, whereas data representation is that data after it has been processed, arranged and presented in a form that gives it meaning.

Q.5: Why do we use data representation?
A: Data visualisation gives us a clear understanding of what the information means by displaying it visually through maps or graphs. Visualised data is easier for the mind to comprehend, and it makes it easier to identify trends and outliers within large data sets.



Data Representation in Computer Science

Dive deep into the realm of Computer Science with this comprehensive guide about data representation. Data representation, a fundamental concept in computing, refers to the various ways that information can be expressed digitally. The interpretation of this data plays a critical role in decision-making procedures in businesses and scientific research. Gain an understanding of binary data representation, the backbone of digital computing. 


Binary data representation uses a system of numerical notation that has just two possible states represented by 0 and 1 (also known as 'binary digits' or 'bits'). Grasp the practical applications of binary data representation and explore its benefits.

Finally, explore the vast world of data model representation. Different types of data models offer a variety of ways to organise data in databases. Understand the strategic role of data models in data representation, and explore how they are used to design efficient database systems. This comprehensive guide positions you at the heart of data representation in Computer Science.

Understanding Data Representation in Computer Science

In the realm of Computer Science, data representation plays a paramount role. It refers to the methods or techniques used to represent or express information in a computer system. This encompasses everything from text and numbers to images, audio, and beyond.

Basic Concepts of Data Representation

Data representation in computer science concerns how a computer interprets and works with different types of information. Different information types require different representation techniques; for instance, a video is represented differently from a text document.

When working with various forms of data, it is important to grasp a fundamental understanding of:

  • Binary system
  • Bits and Bytes
  • Number systems: decimal, hexadecimal
  • Character encoding: ASCII, Unicode

Data in a computer system is represented in binary format, as a sequence of 0s and 1s denoting 'off' and 'on' states respectively. The smallest component of this binary representation is known as a bit, which stands for 'binary digit'.

A byte, on the other hand, generally encompasses 8 bits. Essential tools for expressing numbers and text in a computer system are the decimal and hexadecimal number systems and character encodings like ASCII and Unicode.
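As a quick illustrative sketch (ours, not from the original text), the following C++ program prints one value in the decimal, hexadecimal, and binary notations just mentioned:

    #include <bitset>
    #include <iostream>

    int main() {
        int value = 255;
        std::cout << "decimal: " << std::dec << value << '\n'       // 255
                  << "hex:     " << std::hex << value << '\n'       // ff
                  << "binary:  " << std::bitset<8>(value) << '\n';  // 11111111
    }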

Role of Data Representation in Computer Science

Data Representation is the foundation of computing systems and affects both hardware and software designs. It enables both logic and arithmetic operations to be performed in the binary number system, on which computers are based.

An illustrative example of the importance of data representation is when you write a text document. The characters you type are represented in ASCII code - a set of binary numbers. Each number is sent to the memory, represented as electrical signals; everything you see on your screen is a representation of the underlying binary data.

Computing operations and functions, like searching, sorting or adding, rely heavily on appropriate data representation for efficient execution. Also, computer programming languages and compilers require a deep understanding of data representation to successfully interpret and execute commands.

As technology evolves, so too do our data representation techniques. Quantum computing, for example, uses quantum bits or "qubits". A qubit can represent a 0, a 1, or both at the same time, thanks to the phenomenon of quantum superposition.

Types of Data Representation

In computer systems, various data representation techniques are utilised:

Numbers can be represented in integer, real, and rational formats. Text is represented using different encodings, such as ASCII or Unicode. Images can be represented in various formats like JPG, PNG, or GIF, each having its specific rendering algorithm and compression techniques.

Tables are another important way of representing data, especially in the realm of databases. For example:

    Name        Email
    John Doe    [email protected]
    Jane Doe    [email protected]

This approach is particularly effective in storing structured data, making information readily accessible and easy to handle. By understanding the principles of data representation, you can better appreciate the complexity and sophistication behind our everyday interactions with technology.

Data Representation and Interpretation

To delve deeper into the world of Computer Science, it is essential to study the intricacies of data representation and interpretation. While data representation is about the techniques through which data are expressed or encoded in a computer system, data interpretation refers to a computing machine's ability to understand and work with this encoded data.

Basics of Data Representation and Interpretation

The core of data representation and interpretation is founded on the binary system. Represented by 0s and 1s, the binary system signifies the 'off' and 'on' states of electric current, seamlessly translating them into a language comprehensible to computing hardware.

For instance, \(1101\) in binary is equivalent to \(13\) in decimal. This interpretation happens consistently in the background during all of your interactions with a computer system.

Now, try imagining a vast array of these binary numbers. It could get overwhelming swiftly. To bring order and efficiency to this chaos, binary digits (or bits) are grouped into larger sets like bytes, kilobytes, and so on. A single byte, the most commonly used set, contains eight bits. Here's a simplified representation of how bits are grouped:

  • 1 bit = Binary Digit
  • 8 bits = 1 byte
  • 1024 bytes = 1 kilobyte (KB)
  • 1024 KB = 1 megabyte (MB)
  • 1024 MB = 1 gigabyte (GB)
  • 1024 GB = 1 terabyte (TB)

However, the binary system isn't the only number system pivotal for data interpretation. Both decimal (base 10) and hexadecimal (base 16) systems play significant roles in processing numbers and text data. Moreover, translating human-readable language into computer interpretable format involves character encodings like ASCII (American Standard Code for Information Interchange) and Unicode.

These systems interpret alphabetic characters, numerals, punctuation marks, and other common symbols into binary code. For example, the ASCII value for capital 'A' is 65, which corresponds to \(01000001\) in binary.
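A small C++ sketch (illustrative, added here) makes this concrete by printing a character's ASCII code and its bit pattern:

    #include <bitset>
    #include <iostream>

    int main() {
        char ch = 'A';
        // A char is stored as its character code; view it as a number and as bits.
        std::cout << ch << " has ASCII code " << int(ch)
                  << ", binary " << std::bitset<8>(ch) << '\n';
        // Output: A has ASCII code 65, binary 01000001
    }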

In the world of images, different encoding schemes interpret pixel data; JPG, PNG, and GIF are common examples of such encoded formats. Similarly, audio files utilise encoding formats like MP3 and WAV to store sound data.

Importance of Data Interpretation in Computer Science

Understanding data interpretation in computer science is integral to unlocking the potential of any computing process or system. When coded data is input into a system, your computer must interpret this data accurately to make it usable.

Consider typing a document in a word processor like Microsoft Word. As you type, each keystroke is converted to an ASCII code by your keyboard. Stored as binary, these codes are transmitted to the active word processing software. The word processor interprets these codes back into alphabetic characters, enabling the correct letters to appear on your screen, as per your keystrokes.

Data interpretation is not just an isolated occurrence, but a recurring necessity - needed every time a computing process must deal with data. This is no different when you're watching a video, browsing a website, or even when the computer boots up.

Rendering images and videos is an ideal illustration of the importance of data interpretation.

Digital photos and videos are composed of tiny dots, or pixels, each encoded with specific numbers to denote colour composition and intensity. Every time you view a photo or play a video, your computer interprets the underlying data and reassembles the pixels to form a comprehensible image or video sequence on your screen.

Data interpretation further extends to more complex territories like facial recognition, bioinformatics, data mining, and even artificial intelligence. In these applications, data from various sources is collected, converted into machine-acceptable format, processed, and interpreted to provide meaningful outputs.

In summary, data interpretation is vital for the functionality, efficiency, and progress of computer systems and the services they provide. Understanding the basics of data representation and interpretation, thereby, forms the backbone of computer science studies.

Delving into Binary Data Representation

Binary data representation is the most fundamental and elementary form of data representation in computing systems. At the lowermost level, every piece of information processed by a computer is converted into a binary format.

Understanding Binary Data Representation

Binary data representation is based on the binary numeral system. This system, also known as the base-2 system, uses only two digits, 0 and 1, to represent all kinds of data. The concept dates back to early 18th-century mathematics and has since found its place as the bedrock of modern computers. In computing, the binary system's digits are called bits (short for 'binary digit'), and they are the smallest indivisible units of data.

Each bit can be in one of two states representing 0 ('off') or 1 ('on'). Formally, the binary number \( b_n b_{n-1} ... b_2 b_1 b_0 \), is interpreted using the formula: \[ B = b_n \times 2^n + b_{n-1} \times 2^{n-1} + ... + b_2 \times 2^2 + b_1 \times 2^1 + b_0 \times 2^0 \] Where \( b_i \) are the binary digits and \( B \) is the corresponding decimal number.

For example, for the binary number 1011, the process looks like this: \[ B = 1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = 8 + 0 + 2 + 1 = 11 \]

This mathematical translation makes it possible for computing machines to perform complex operations even though they understand only the simple language of 'on' and 'off' signals.
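The positional formula above translates directly into code. Here is a minimal C++ sketch (the function name is ours) that converts a string of binary digits to its decimal value:

    #include <iostream>
    #include <string>

    // Apply B = b_n*2^n + ... + b_1*2 + b_0 one digit at a time:
    // each step shifts the accumulated value left and adds the next digit.
    unsigned long binary_to_decimal(const std::string& bits) {
        unsigned long value = 0;
        for (char bit : bits)
            value = value * 2 + (bit - '0');
        return value;
    }

    int main() {
        std::cout << binary_to_decimal("1011") << '\n';  // prints 11
        std::cout << binary_to_decimal("1101") << '\n';  // prints 13
    }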

When representing character data, computing systems use binary-encoded formats. ASCII and Unicode are common examples. In ASCII, each character is assigned a unique 7-bit binary code. For example, the binary representation of the uppercase letter 'A' is 1000001 (written 01000001 as a full byte). Interpreting such encoded data back into a human-readable format is a core responsibility of computing systems and forms the basis for the exchange of digital information globally.

Practical Application of Binary Data Representation

Binary data representation is used across every single aspect of digital computing. From simple calculations performed by a digital calculator to the complex animations rendered in a high-definition video game, binary data representation is at play in the background.

Consider a simple calculation like 7+5. When you input this into a digital calculator, the numbers and the operation get converted into their binary equivalents. The microcontroller inside the calculator processes these binary inputs, performs the sum operation in binary, and finally, returns the result as a binary output. This binary output is then converted back into a decimal number which you see displayed on the calculator screen.

When it comes to text files, every character typed into the document is converted to its binary equivalent using a character encoding system, typically ASCII or Unicode. It is then saved onto your storage device as a sequence of binary digits.

Similarly, for image files, every pixel is represented as a binary number. This grid of binary numbers, called a 'bit map', specifies the colour and intensity of each pixel. When you open the image file, the computer reads the binary data and presents it on your screen as a colourful, coherent image. The concept extends even further into internet and network communications, data encryption, data compression, and more.

When you are downloading a file over the internet, it is sent to your system as a stream of binary data. The web browser on your system receives this data, recognizes the type of file and accordingly interprets the binary data back into the intended format.

In essence, every operation that you can perform on a computer system, no matter how simple or complex, essentially boils down to large-scale manipulation of binary data. And that sums up the practical application and universal significance of binary data representation in digital computing.

Binary Tree Representation in Data Structures

Binary trees occupy a central position in data structures, especially in algorithms and database design. As a non-linear data structure, a binary tree is essentially a tree-like model where each node has a maximum of two children, often distinguished as the 'left child' and the 'right child'.

Fundamentals of Binary Tree Representation

A binary tree is a tree data structure where each parent node has no more than two children, typically referred to as the left child and the right child. Each node in the binary tree contains:

  • A data element
  • Pointer or link to the left child
  • Pointer or link to the right child

The topmost node of the tree is known as the root. The nodes without any children, usually dwelling at the tree's last level, are known as leaf nodes or external nodes. Binary trees are fundamentally differentiated by their properties and the relationships among the elements. Some types include:

  • Full Binary Tree: A binary tree where every node has 0 or 2 children.
  • Complete Binary Tree: A binary tree where all levels are completely filled except possibly the last level, which is filled from left to right.
  • Perfect Binary Tree: A binary tree where all internal nodes have two children and all leaves are at the same level.
  • Skewed Binary Tree: A binary tree where every node has only a left child or only a right child.

In a binary tree, the maximum number of nodes \( N \) at any level \( L \) can be calculated using the formula \( N = 2^{L-1} \). Conversely, for a tree with \( N \) nodes, the minimum height or minimum number of levels is \( \lceil \log_2(N+1) \rceil \).

Binary tree representation employs arrays and linked lists. Sometimes, an implicit array-based representation suffices, especially for complete binary trees. The root is stored at index 0, while for each node at index \( i \), the left child is stored at index \( 2i + 1 \), and the right child at \( 2i + 2 \).

However, the most common representation is the linked-node representation that utilises a node-based structure. Each node in the binary tree is a data structure that contains a data field and two pointers pointing to its left and right child nodes.
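A minimal C++ sketch of this linked-node representation (illustrative; the type and field names are ours) looks like this:

    #include <iostream>

    // Each node holds a data element and pointers to its two children;
    // a null pointer marks a missing child.
    struct Node {
        int data;
        Node* left = nullptr;
        Node* right = nullptr;
        explicit Node(int value) : data(value) {}
    };

    int main() {
        Node root(1);        // the topmost node (the root)
        Node l(2), r(3);     // its left and right children; both are leaves
        root.left = &l;
        root.right = &r;
        std::cout << root.data << ' ' << root.left->data << ' '
                  << root.right->data << '\n';   // prints: 1 2 3
    }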

Usage of Binary Tree in Data Structures

Binary trees are typically used for expressing hierarchical relationships, and thus find application across various areas in computer science. In mathematical applications, binary trees are ideal for expressing certain elements' relationships.

For example, binary trees are used to represent expressions in arithmetic and Boolean algebra.

Consider an arithmetic expression like (4 + 5) * 6. This can be represented using a binary tree where the operators are parent nodes, and the operands are children. The expression gets evaluated by performing operations in a specific tree traversal order.
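A sketch of that evaluation in C++ (hypothetical names; operators limited to '+' and '*' for brevity) uses a post-order traversal, evaluating children before applying the parent's operator:

    #include <iostream>

    struct ExprNode {
        char op = 0;               // '+' or '*'; 0 marks a leaf operand
        int value = 0;             // used only when the node is a leaf
        ExprNode* left = nullptr;
        ExprNode* right = nullptr;
    };

    // Evaluate children first, then apply the operator (post-order traversal).
    int eval(const ExprNode* n) {
        if (n->op == 0) return n->value;
        int l = eval(n->left), r = eval(n->right);
        return n->op == '+' ? l + r : l * r;
    }

    int main() {
        // Build (4 + 5) * 6: '*' at the root, '+' as its left subtree.
        ExprNode four{0, 4}, five{0, 5}, six{0, 6};
        ExprNode plus{'+', 0, &four, &five};
        ExprNode times{'*', 0, &plus, &six};
        std::cout << eval(&times) << '\n';   // prints 54
    }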

Among the more complex usages, binary search trees, a variant of binary trees, are employed in database engines and file systems.

  • Binary Heaps, a type of binary tree, are used as an efficient priority queue in many algorithms like Dijkstra's algorithm and the Heap Sort algorithm.
  • Binary trees are also used in creating binary space partition trees, which are used for quickly finding objects in games and 3D computer graphics.
  • Syntax trees used in compilers are a direct application of binary trees. They help translate high-level language expressions into machine code.
  • Huffman Coding Trees, which are used in data compression algorithms, are another variant of binary trees.

The theoretical underpinnings of all these binary tree applications are the traversal methods and operations, such as insertion and deletion, which are intrinsic to the data structure.

Binary trees are also used in advanced machine-learning algorithms. Decision Tree is a type of binary tree that uses a tree-like model of decisions. It is one of the most successful forms of supervised learning algorithms in data mining and machine learning.

The advantages of binary trees lie in their efficient organisation and quick data access, making them a cornerstone of many complex data structures and algorithms. Understanding the workings and fundamentals of binary tree representation will give you a stronger foundation in the world of data structures and computer science in general.

Grasping Data Model Representation

When dealing with vast amounts of data, organising and understanding the relationships between different pieces of data is of utmost importance. This is where data model representation comes into play in computer science. A data model provides an abstract, simplified view of real-world data. It defines the data elements and the relationships among them, providing an organised and consistent representation of data.

Exploring Different Types of Data Models

Understanding the intricacies of data models will equip you with a solid foundation in making sense of complex data relationships. Some of the most commonly used data models include:

  • Hierarchical Model
  • Network Model
  • Relational Model
  • Entity-Relationship Model
  • Object-Oriented Model
  • Semantic Model

The Hierarchical Model presents data in a tree-like structure, where each record has one parent record and many children. This model is largely applied in file systems and XML documents. Its limitation is that a child cannot have multiple parents, which restricts its real-world applications.

The Network Model, an enhancement of the hierarchical model, allows a child node to have multiple parent nodes, resulting in a graph structure. This model is suitable for representing complex relationships but comes with its own challenges such as iteration and navigation, which can be intricate.

The Relational Model, created by E.F. Codd, uses a tabular structure to depict data and their relationships. Each row represents a collection of related data values, and each column represents a particular attribute. This is the most widely used model due to its simplicity and flexibility.

The Entity-Relationship Model illustrates the conceptual view of a database. It uses three basic concepts: entities, attributes (the properties of these entities), and relationships among entities. This model is most commonly used in database design.

The Object-Oriented Model goes a step further and adds methods (functions) to the entities besides attributes. This data model integrates the data and the operations applicable to the data into a single component known as an object. Such an approach enables encapsulation, a significant characteristic of object-oriented programming.

The Semantic Model aims to capture more meaning of data by defining the nature of data and the relationships that exist between them. This model is beneficial in representing complex data interrelations and is used in expert systems and artificial intelligence fields.

The Role of Data Models in Data Representation

Data models provide a method for the efficient representation and interaction of data elements, thus forming an integral part of any database system. They provide the theoretical foundation for designing databases, thereby playing an essential role in the development of applications.

A data model is a set of concepts and rules for formally describing and representing real-world data. It serves as a blueprint for designing and implementing databases and assists communication between system developers and end-users.

Databases serve as vast repositories, storing a plethora of data. Such vast data needs effective organisation and management for optimal access and usage. Here, data models come into play, providing a structural view of data, thereby enabling the efficient organisation, storage and retrieval of data.

Consider a library system. The system needs to record data about books, authors, publishers, members, and loans. All these items represent different entities. Relationships exist between these entities. For example, a book is published by a publisher, an author writes a book, or a member borrows a book. Using an Entity-Relationship Model, we can effectively represent all these entities and relationships, aiding the library system's development process.
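As a concrete (and entirely hypothetical) sketch, those entities and relationships could be modelled in C++ as plain records, with ID fields standing in for the relationships:

    #include <string>
    #include <vector>

    // Entities: each struct is an entity type; its fields are attributes.
    struct Publisher { int id; std::string name; };
    struct Author    { int id; std::string name; };
    struct Member    { int id; std::string name; };
    // A book refers to its author and publisher by ID ("written by", "published by").
    struct Book      { int id; std::string title; int author_id; int publisher_id; };
    // Relationship: a loan links one member to one book ("borrows").
    struct Loan      { int member_id; int book_id; std::string due_date; };

    int main() {
        std::vector<Publisher> publishers = {{1, "Acme Press"}};
        std::vector<Author>    authors    = {{1, "A. Writer"}};
        std::vector<Book>      books      = {{1, "Data Models", 1, 1}};
        std::vector<Member>    members    = {{1, "J. Reader"}};
        std::vector<Loan>      loans      = {{1, 1, "2024-01-31"}};
        return 0;
    }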

Designing such a model requires careful consideration of what data is required to be stored and how different data elements relate to each other. Depending on their specific requirements, database developers can select the most suitable data model representation. This choice can significantly affect the functionality, performance, and scalability of the resulting databases.

From decision-support systems and expert systems to distributed databases and data warehouses, data models find a place in various applications.

Modern NoSQL databases often use several models simultaneously to meet their needs. For example, a document-based model for unstructured data and a column-based model for analyzing large data sets. In this way, data models continue to evolve and adapt to the burgeoning needs of the digital world.

Therefore, acquiring a strong understanding of data model representations and their roles forms an integral part of the database management and design process. It empowers you with the ability to handle large volumes of diverse data efficiently and effectively.

Data Representation - Key takeaways

  • Data representation refers to techniques used to express information in computer systems, encompassing text, numbers, images, audio, and more.
  • Data Representation is about how computers interpret and function with different information types, including binary systems, bits and bytes, number systems (decimal, hexadecimal) and character encoding (ASCII, Unicode).
  • Binary Data Representation is the conversion of all kinds of information processed by a computer into binary format.
  • Binary trees express hierarchical relationships across various areas of computer science.
  • Binary trees represent relationships in mathematical applications and are used in database engines, file systems, and priority queues in algorithms.
  • Data Model Representation is an abstract, simplified view of real-world data that defines the data elements, and their relationships and provides a consistently organised way of representing data.


Frequently Asked Questions about Data Representation in Computer Science

What is data representation?

Data representation is the method used to encode information into a format that can be used and understood by computer systems. It involves the conversion of real-world data, such as text, images, sounds, numbers, into forms like binary or hexadecimal which computers can process. The choice of representation can affect the quality, accuracy and efficiency of data processing. Precisely, it's how computer systems interpret and manipulate data.

What does data representation mean?

Data representation refers to the methods or techniques used to express, display or encode data in a readable format for a computer or a user. This could take different forms, such as binary, decimal, or alphabetic. It's crucial in computer science since it links the abstract world of thought and concept to the concrete domain of signals, signs and symbols. It forms the basis of information processing and storage in contemporary digital computing systems.

Why is data representation important?

Data representation is crucial as it allows information to be processed, transferred, and interpreted in a meaningful way. It helps in organising and analysing data effectively, providing insights for decision-making processes. Moreover, it facilitates communication between the computer system and the real world, enabling computing outcomes to be understood by users. Finally, accurate data representation ensures integrity and reliability of the data, which is vital for effective problem solving.

How to make a graphical representation of data?

To create a graphical representation of data, first collect and organise your data. Choose a suitable form of data representation such as bar graphs, pie charts, line graphs, or histograms depending on the type of data and the information you want to display. Use a data visualisation tool or software such as Excel or Tableau to help you generate the graph. Always remember to label your axes and provide a title and legend if necessary.

What is data representation in statistics?

Data representation in statistics refers to the various methods used to display or present data in meaningful ways. This often includes the use of graphs, charts, tables, histograms or other visual tools that can help in the interpretation and analysis of data. It enables efficient communication of information and helps in drawing statistical conclusions. Essentially, it's a way of providing a visual context to complex datasets, making the data easily understandable.



  • Data representation
  • Bytes of memory
  • Abstract machine
  • Unsigned integer representation
  • Signed integer representation
  • Pointer representation
  • Array representation
  • Compiler layout
  • Array access performance
  • Collection representation
  • Consequences of size and alignment rules
  • Uninitialized objects
  • Pointer arithmetic
  • Undefined behavior
  • Computer arithmetic
  • Arena allocation

This course is about learning how computers work, from the perspective of systems software: what makes programs work fast or slow, and how properties of the machines we program impact the programs we write. We want to communicate ideas, tools, and an experimental approach.

The course divides into six units:

  • Data representation
  • Assembly & machine programming
  • Storage & caching
  • Kernel programming
  • Process management
  • Concurrency

The first unit, data representation, is all about how different forms of data can be represented in terms the computer can understand.

Computer memory is kind of like a Lite Brite.

Lite Brite

A Lite Brite is a big black backlit pegboard coupled with a supply of colored pegs, in a limited set of colors. You can plug in the pegs to make all kinds of designs. A computer's memory is like a vast pegboard where each slot holds one of 256 different colors. The colors are numbered 0 through 255, so each slot holds one byte. (A byte is a number between 0 and 255, inclusive.)

A slot of computer memory is identified by its address. On a computer with M bytes of memory, and therefore M slots, you can think of the address as a number between 0 and M−1. My laptop has 16 gibibytes of memory, so M = 16×2^30 = 2^34 = 17,179,869,184 = 0x4'0000'0000—a very large number!

The problem of data representation is the problem of representing all the concepts we might want to use in programming—integers, fractions, real numbers, sets, pictures, texts, buildings, animal species, relationships—using the limited medium of addresses and bytes.

Powers of ten and powers of two. Digital computers love the number two and all powers of two. The electronics of digital computers are based on the bit, the smallest unit of storage, which is a base-two digit: either 0 or 1. More complicated objects are represented by collections of bits. This choice has many scale and error-correction advantages. It also refracts upwards to larger choices, and even into terminology. Memory chips, for example, have capacities based on large powers of two, such as 2^30 bytes. Since 2^10 = 1024 is pretty close to 1,000, 2^20 = 1,048,576 is pretty close to a million, and 2^30 = 1,073,741,824 is pretty close to a billion, it's common to refer to 2^30 bytes of memory as "a gigabyte," even though that actually means 10^9 = 1,000,000,000 bytes. But for greater precision, there are terms that explicitly signal the use of powers of two. 2^30 is a gibibyte: the "-bi-" component means "binary."
Virtual memory. Modern computers actually abstract their memory spaces using a technique called virtual memory. The lowest-level kind of address, called a physical address, really does take on values between 0 and M−1. However, even on a 16GiB machine like my laptop, the addresses we see in programs can take on values like 0x7ffe'ea2c'aa67 that are much larger than M−1 = 0x3'ffff'ffff. The addresses used in programs are called virtual addresses. They're incredibly useful for protection: since different running programs have logically independent address spaces, it's much less likely that a bug in one program will crash the whole machine. We'll learn about virtual memory in much more depth in the kernel unit; the distinction between virtual and physical addresses is not as critical for data representation.

Most programming languages prevent their users from directly accessing memory. But not C and C++! These languages let you access any byte of memory with a valid address. This is powerful; it is also very dangerous. But it lets us get a hands-on view of how computers really work.

C++ programs accomplish their work by constructing, examining, and modifying objects. An object is a region of data storage that contains a value, such as the integer 12. (The standard specifically says "a region of data storage in the execution environment, the contents of which can represent values".) Memory is called "memory" because it remembers object values.

In this unit, we often use functions called hexdump to examine memory. These functions are defined in hexdump.cc . hexdump_object(x) prints out the bytes of memory that comprise an object named x , while hexdump(ptr, size) prints out the size bytes of memory starting at a pointer ptr .

For example, in datarep1/add.cc , we might use hexdump_object to examine the memory used to represent some integers:
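(The file's code isn't reproduced on this page; here is a minimal sketch of the relevant part, with variable names taken from the discussion below.)

    // sketch of datarep1/add.cc; hexdump_object comes from the course's
    // hexdump.cc helpers
    int main() {
        int a = 1;
        int b = 2;
        int c = 3;
        hexdump_object(a);
        hexdump_object(b);
        hexdump_object(c);
    }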

This display reports that a , b , and c are each four bytes long; that a , b , and c are located at different, nonoverlapping addresses (the long hex number in the first column); and shows us how the numbers 1, 2, and 3 are represented in terms of bytes. (More on that later.)

The compiler, hardware, and standard together define how objects of different types map to bytes. Each object uses a contiguous range of addresses (and thus bytes), and objects never overlap (objects that are active simultaneously are always stored in distinct address ranges).

Since C and C++ are designed to help software interface with hardware devices, their standards are transparent about how objects are stored. A C++ program can ask how big an object is using the sizeof keyword. sizeof(T) returns the number of bytes in the representation of an object of type T, and sizeof(x) returns the size of object x. The result of sizeof is a value of type size_t, which is an unsigned integer type large enough to hold any representable size. On 64-bit architectures, such as x86-64 (our focus in this course), size_t can hold numbers between 0 and 2^64−1.

Qualitatively different objects may have the same data representation. For example, the following three objects have the same data representation on x86-64, which you can verify using hexdump :
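(A sketch of the kind of example meant: these three differently-typed objects each comprise four bytes of 0xFF on x86-64. The course's actual example may use different values.)

    int i = -1;                                     // bytes: ff ff ff ff
    unsigned u = 4294967295U;                       // bytes: ff ff ff ff
    unsigned char a[4] = {0xFF, 0xFF, 0xFF, 0xFF};  // bytes: ff ff ff ff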

In C and C++, you can’t reliably tell the type of an object by looking at the contents of its memory. That’s why tricks like our different addf*.cc functions work.

An object can have many names. For example, here, local and *ptr refer to the same object:
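(The original code isn't reproduced on this page; a minimal sketch.)

    int f() {
        int local = 1;
        int* ptr = &local;
        *ptr = 2;        // modifies the object also named `local`
        return local;    // returns 2
    }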

The different names for an object are sometimes called aliases .
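The next example's code is missing from this page; here is a sketch consistent with the list below (initializer values are assumed).

    char ch1 = 'a';                 // global
    const char ch2 = 'b';           // constant global
    void f() {
        char ch3 = 'c';             // local
        char* ch4 = new char{'d'};  // local pointer; *ch4 is dynamically allocated
    }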

There are five objects here:

  • ch1 , a global variable
  • ch2 , a constant (non-modifiable) global variable
  • ch3 , a local variable
  • ch4 , a local variable
  • the anonymous storage allocated by new char and accessed by *ch4

Each object has a lifetime , which is called storage duration by the standard. There are three different kinds of lifetime.

  • static lifetime: The object lasts as long as the program runs. ( ch1 , ch2 )
  • automatic lifetime: The compiler allocates and destroys the object automatically as the program runs, based on the object’s scope (the region of the program in which it is meaningful). ( ch3 , ch4 )
  • dynamic lifetime: The programmer allocates and destroys the object explicitly. ( *ch4 )

Objects with dynamic lifetime aren’t easy to use correctly. Dynamic lifetime causes many serious problems in C programs, including memory leaks, use-after-free, double-free, and so forth. Those serious problems cause undefined behavior and play a “disastrously central role” in “our ongoing computer security nightmare” . But dynamic lifetime is critically important. Only with dynamic lifetime can you construct an object whose size isn’t known at compile time, or construct an object that outlives the function that created it.

The compiler and operating system work together to put objects at different addresses. A program’s address space (which is the range of addresses accessible to a program) divides into regions called segments . Objects with different lifetimes are placed into different segments. The most important segments are:

  • Code (also known as text or read-only data ). Contains instructions and constant global objects. Unmodifiable; static lifetime.
  • Data . Contains non-constant global objects. Modifiable; static lifetime.
  • Heap . Modifiable; dynamic lifetime.
  • Stack . Modifiable; automatic lifetime.

The compiler decides on a segment for each object based on its lifetime. The final compiler phase, which is called the linker , then groups all the program’s objects by segment (so, for instance, global variables from different compiler runs are grouped together into a single segment). Finally, when a program runs, the operating system loads the segments into memory. (The stack and heap segments grow on demand.)

We can use a program to investigate where objects with different lifetimes are stored. (See cs61-lectures/datarep2/mexplore0.cc .)
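(A sketch in the spirit of mexplore0.cc; the handout's exact code may differ.)

    #include <cstdio>

    const int const_global = 1;
    int global = 2;

    int main() {
        int local = 3;
        int* dynamic = new int{4};
        printf("constant global: %p\n", (const void*) &const_global);
        printf("global:          %p\n", (void*) &global);
        printf("local:           %p\n", (void*) &local);
        printf("dynamic:         %p\n", (void*) dynamic);
        delete dynamic;
    }

Running such a program shows address ranges like this: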

Object declaration (C++ program text) | Lifetime (abstract machine) | Segment | Example address range (runtime location in x86-64 Linux, non-PIE)
Constant global | Static | Code (or Text) | 0x40'0000 (≈1 × 2^22)
Global | Static | Data | 0x60'0000 (≈1.5 × 2^22)
Local | Automatic | Stack | 0x7fff'448d'0000 (≈2^47 = 2 × 2^46)
Anonymous, returned by new | Dynamic | Heap | 0x1a0'0000 (≈1.6 × 2^24)

Constant global data and global data have the same lifetime, but are stored in different segments. The operating system uses different segments so it can prevent the program from modifying constants. It marks the code segment, which contains functions (instructions) and constant global data, as read-only, and any attempt to modify code-segment memory causes a crash (a “Segmentation violation”).

An executable is normally at least as big as the static-lifetime data (the code and data segments together). Since all that data must be in memory for the entire lifetime of the program, it’s written to disk and then loaded by the OS before the program starts running. There is an exception, however: the “bss” segment is used to hold modifiable static-lifetime data with initial value zero. Such data is common, since all static-lifetime data is initialized to zero unless otherwise specified in the program text. Rather than storing a bunch of zeros in the object files and executable, the compiler and linker simply track the location and size of all zero-initialized global data. The operating system sets this memory to zero during the program load process. Clearing memory is faster than loading data from disk, so this optimization saves both time (the program loads faster) and space (the executable is smaller).

Abstract machine and hardware

Programming involves turning an idea into hardware instructions. This transformation happens in multiple steps, some you control and some controlled by other programs.

First you have an idea , like “I want to make a flappy bird iPhone game.” The computer can’t (yet) understand that idea. So you transform the idea into a program , written in some programming language . This process is called programming.

A C++ program actually runs on an abstract machine . The behavior of this machine is defined by the C++ standard , a technical document. This document is supposed to be so precisely written as to have an exact mathematical meaning, defining exactly how every C++ program behaves. But the document can’t run programs!

C++ programs run on hardware (mostly), and the hardware determines what behavior we see. Mapping abstract machine behavior to instructions on real hardware is the task of the C++ compiler (and the standard library and operating system). A C++ compiler is correct if and only if it translates each correct program to instructions that simulate the expected behavior of the abstract machine.

This same rough series of transformations happens for any programming language, although some languages use interpreters rather than compilers.

A bit is the fundamental unit of digital information: it’s either 0 or 1.

C++ manages memory in units of bytes —8 contiguous bits that together can represent numbers between 0 and 255. C’s unit for a byte is char : the abstract machine says a byte is stored in char . That means an unsigned char holds values in the inclusive range [0, 255].

The C++ standard actually doesn’t require that a byte hold 8 bits, and on some crazy machines from decades ago , bytes could hold nine bits! (!?)

But larger numbers, such as 258, don’t fit in a single byte. To represent such numbers, we must use multiple bytes. The abstract machine doesn’t specify exactly how this is done—it’s the compiler and hardware’s job to implement a choice. But modern computers always use place–value notation , just like in decimal numbers. In decimal, the number 258 is written with three digits, the meanings of which are determined both by the digit and by their place in the overall number:

\[ 258 = 2\times10^2 + 5\times10^1 + 8\times10^0 \]

The computer uses base 256 instead of base 10. Two adjacent bytes can represent numbers between 0 and \(255\times256+255 = 65535 = 2^{16}-1\) , inclusive. A number larger than this would take three or more bytes.

\[ 258 = 1\times256^1 + 2\times256^0 \]

On x86-64, the ones place, the least significant byte, is on the left, at the lowest address in the contiguous two-byte range used to represent the integer. This is the opposite of how decimal numbers are written: decimal numbers put the most significant digit on the left. The representation choice of putting the least-significant byte in the lowest address is called little-endian representation. x86-64 uses little-endian representation.

Some computers actually store multi-byte integers the other way, with the most significant byte stored in the lowest address; that’s called big-endian representation. The Internet’s fundamental protocols, such as IP and TCP, also use big-endian order for multi-byte integers, so big-endian is also called “network” byte order.

The C++ standard defines five fundamental unsigned integer types, along with relationships among their sizes. Here they are, along with their actual sizes and ranges on x86-64:

Type | Size (abstract machine) | Size (x86-64) | Range (x86-64)
unsigned char (byte) | 1 | 1 | [0, 255] = [0, 2^8−1]
unsigned short | ≥1 | 2 | [0, 65,535] = [0, 2^16−1]
unsigned (or unsigned int) | ≥ sizeof(unsigned short) | 4 | [0, 4,294,967,295] = [0, 2^32−1]
unsigned long | ≥ sizeof(unsigned) | 8 | [0, 18,446,744,073,709,551,615] = [0, 2^64−1]
unsigned long long | ≥ sizeof(unsigned long) | 8 | [0, 18,446,744,073,709,551,615] = [0, 2^64−1]

Other architectures and operating systems implement different ranges for these types. For instance, on IA32 machines like Intel’s Pentium (the 32-bit processors that predated x86-64), sizeof(long) was 4, not 8.

Note that all values of a smaller unsigned integer type can fit in any larger unsigned integer type. When a value of a larger unsigned integer type is placed in a smaller unsigned integer object, however, not every value fits; for instance, the unsigned short value 258 doesn’t fit in an unsigned char x . When this occurs, the C++ abstract machine requires that the smaller object’s value equals the least -significant bits of the larger value (so x will equal 2).
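For example (a two-line illustration of the truncation rule):

    unsigned short s = 258;  // 0x0102
    unsigned char x = s;     // keeps the least-significant byte: x == 2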

In addition to these types, whose sizes can vary, C++ has integer types whose sizes are fixed. uint8_t , uint16_t , uint32_t , and uint64_t define 8-bit, 16-bit, 32-bit, and 64-bit unsigned integers, respectively; on x86-64, these correspond to unsigned char , unsigned short , unsigned int , and unsigned long .

This general procedure is used to represent a multi-byte integer in memory.

  • Write the large integer in hexadecimal format, including all leading zeros required by the type size. For example, the unsigned value 65534 would be written 0x0000FFFE . There will be twice as many hexadecimal digits as sizeof(TYPE) .
  • Divide the integer into its component bytes, which are its digits in base 256. In our example, they are, from most to least significant, 0x00, 0x00, 0xFF, and 0xFE.

In little-endian representation, the bytes are stored in memory from least to most significant. If our example was stored at address 0x30, we would have:
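(An illustration of the layout described; the byte values follow from the example above.)

    address:  0x30  0x31  0x32  0x33
    contents: 0xFE  0xFF  0x00  0x00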

In big-endian representation, the bytes are stored in the reverse order.

Computers are often fastest at dealing with fixed-length numbers, rather than variable-length numbers, and processor internals are organized around a fixed word size . A word is the natural unit of data used by a processor design . In most modern processors, this natural unit is 8 bytes or 64 bits , because this is the power-of-two number of bytes big enough to hold those processors’ memory addresses. Many older processors could access less memory and had correspondingly smaller word sizes, such as 4 bytes (32 bits).

The best representation for signed integers—and the choice made by x86-64, and by the C++20 abstract machine—is two’s complement . Two’s complement representation is based on this principle: Addition and subtraction of signed integers shall use the same instructions as addition and subtraction of unsigned integers.

To see what this means, let's think about what -x should mean when x is an unsigned integer. Wait, negative unsigned?! This isn't an oxymoron because C++ uses modular arithmetic for unsigned integers: the result of an arithmetic operation on unsigned values is always taken modulo 2^B, where B is the number of bits in the unsigned value type. Thus, on x86-64, -x is simply the number that, when added to x, yields 0 (mod 2^B). For example, when unsigned x = 0xFFFFFFFFU, then -x == 1U, since x + -x equals zero (mod 2^32).

To obtain -x, we flip all the bits in x (an operation written ~x) and then add 1. To see why, consider the bit representations. What is x + (~x + 1)? Well, (~x)_i (the ith bit of ~x) is 1 whenever x_i is 0, and vice versa. That means that every bit of x + ~x is 1 (there are no carries), and x + ~x is the largest unsigned integer, with value 2^B−1. If we add 1 to this, we get 2^B. Which is 0 (mod 2^B)! The highest "carry" bit is dropped, leaving zero.

Two’s complement arithmetic uses half of the unsigned integer representations for negative numbers. A two’s-complement signed integer with B bits has the following values:

  • If the most-significant bit is 1, the represented number is negative. Specifically, the represented number is −(~x + 1), where the outer negative sign is mathematical negation (not computer arithmetic).
  • If every bit is 0, the represented number is 0.
  • If the most-significant bit is 0 but some other bit is 1, the represented number is positive.

The most significant bit is also called the sign bit , because if it is 1, then the represented value depends on the signedness of the type (and that value is negative for signed types).

Another way to think about two's-complement is that, for B-bit integers, the most-significant bit has place value 2^(B−1) in unsigned arithmetic and −2^(B−1) in signed arithmetic. All other bits have the same place values in both kinds of arithmetic.

The two's-complement bit pattern for x + y is the same whether x and y are considered as signed or unsigned values. For example, in 4-bit arithmetic, 5 has representation 0b0101, while the representation 0b1100 represents 12 if unsigned and −4 if signed (since ~0b1100 + 1 == 0b0011 + 1 == 0b0100 == 4). Let's add those bit patterns and see what we get:
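(The worked addition, reconstructed from the text.)

      0b0101      (5)
    + 0b1100      (12 unsigned; -4 signed)
    ---------
     0b10001  ->  0b0001 after the carry bit is dropped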

Note that this is the right answer for both signed and unsigned arithmetic : 5 + 12 = 17 = 1 (mod 16), and 5 + -4 = 1.

Subtraction and multiplication also produce the same results for unsigned arithmetic and signed two’s-complement arithmetic. (For instance, 5 * 12 = 60 = 12 (mod 16), and 5 * -4 = -20 = -4 (mod 16).) This is not true of division. (Consider dividing the 4-bit representation 0b1110 by 2. In signed arithmetic, 0b1110 represents -2, so 0b1110/2 == 0b1111 (-1); but in unsigned arithmetic, 0b1110 is 14, so 0b1110/2 == 0b0111 (7).) And, of course, it is not true of comparison. In signed 4-bit arithmetic, 0b1110 < 0 , but in unsigned 4-bit arithmetic, 0b1110 > 0 . This means that a C compiler for a two’s-complement machine can use a single add instruction for either signed or unsigned numbers, but it must generate different instruction patterns for signed and unsigned division (or less-than, or greater-than).

There are a couple quirks with C signed arithmetic. First, in two's complement, there are more negative numbers than positive numbers. A representation whose sign bit is 1 and whose other bits are all 0 has no positive counterpart at the same bit width: for this number, -x == x. (In 4-bit arithmetic, -0b1000 == ~0b1000 + 1 == 0b0111 + 1 == 0b1000.) Second, and far worse, is that arithmetic overflow on signed integers is undefined behavior.

Type | Size (abstract machine) | Size (x86-64) | Range (x86-64)
signed char | 1 | 1 | [−128, 127] = [−2^7, 2^7−1]
short (or signed short) | = sizeof(unsigned short) | 2 | [−32,768, 32,767] = [−2^15, 2^15−1]
int (or signed int) | = sizeof(unsigned) | 4 | [−2,147,483,648, 2,147,483,647] = [−2^31, 2^31−1]
long | = sizeof(unsigned long) | 8 | [−9,223,372,036,854,775,808, 9,223,372,036,854,775,807] = [−2^63, 2^63−1]
long long | = sizeof(unsigned long long) | 8 | [−9,223,372,036,854,775,808, 9,223,372,036,854,775,807] = [−2^63, 2^63−1]

The C++ abstract machine requires that signed integers have the same sizes as their unsigned counterparts.

We distinguish pointers , which are concepts in the C abstract machine, from addresses , which are hardware concepts. A pointer combines an address and a type.

The memory representation of a pointer is the same as the representation of its address value. The size of that integer is the machine’s word size; for example, on x86-64, a pointer occupies 8 bytes, and a pointer to an object located at address 0x400abc would be stored as:
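(An illustration; x86-64 is little-endian, so the least-significant byte of the address comes first.)

    bc 0a 40 00 00 00 00 00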

The C++ abstract machine defines an unsigned integer type uintptr_t that can hold any address. (You have to #include <inttypes.h> or <cinttypes> to get the definition.) On most machines, including x86-64, uintptr_t is the same as unsigned long . Cast a pointer to an integer address value with syntax like (uintptr_t) ptr ; cast back to a pointer with syntax like (T*) addr . Casts between pointer types and uintptr_t are information preserving, so this assertion will never fail:
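(A sketch of the assertion described.)

    #include <cassert>
    #include <cinttypes>

    void check(int* ptr) {
        uintptr_t addr = (uintptr_t) ptr;  // pointer -> integer address
        int* ptr2 = (int*) addr;           // integer address -> pointer
        assert(ptr == ptr2);               // never fails
    }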

Since it is a 64-bit architecture, the size of an x86-64 address is 64 bits (8 bytes). That’s also the size of x86-64 pointers.

To represent an array of integers, C++ and C allocate the integers next to each other in memory, in sequential addresses, with no gaps or overlaps. Here, we put the integers 0, 1, and 258 next to each other, starting at address 1008:
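(An illustration of the layout described, in little-endian bytes.)

    address:  1008 1009 1010 1011   1012 1013 1014 1015   1016 1017 1018 1019
    contents: 0x00 0x00 0x00 0x00   0x01 0x00 0x00 0x00   0x02 0x01 0x00 0x00
              (the int 0)           (the int 1)           (the int 258 = 0x102)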

Say that you have an array of N integers, and you access each of those integers in order, accessing each integer exactly once. Does the order matter?

Computer memory is random-access memory (RAM), which means that a program can access any bytes of memory in any order—it’s not, for example, required to read memory in ascending order by address. But if we run experiments, we can see that even in RAM, different access orders have very different performance characteristics.

Our arraysum program sums up all the integers in an array of N integers, using an access order based on its arguments, and prints the resulting delay. Here’s the result of a couple experiments on accessing 10,000,000 items in three orders, “up” order (sequential: elements 0, 1, 2, 3, …), “down” order (reverse sequential: N , N −1, N −2, …), and “random” order (as it sounds).

order | trial 1 | trial 2 | trial 3
up | 8.9 ms | 7.9 ms | 7.4 ms
down | 9.2 ms | 8.9 ms | 10.6 ms
random | 316.8 ms | 352.0 ms | 360.8 ms

Wow! Down order is just a bit slower than up, but random order seems about 40 times slower. Why?

Random order is defeating many of the internal architectural optimizations that make memory access fast on modern machines. Sequential order, since it’s more predictable, is much easier to optimize.

Foreshadowing. This part of the lecture is a teaser for the Storage unit, where we cover access patterns and caching, including the processor caches that explain this phenomenon, in much more depth.

The C++ programming language offers several collection mechanisms for grouping subobjects together into new kinds of object. The collections are arrays, structs, and unions. (Classes are a kind of struct. All library types, such as vectors, lists, and hash tables, use combinations of these collection types.) The abstract machine defines how subobjects are laid out inside a collection. This is important, because it lets C/C++ programs exchange messages with hardware and even with programs written in other languages: messages can be exchanged only when both parties agree on layout.

Array layout in C++ is particularly simple: The objects in an array are laid out sequentially in memory, with no gaps or overlaps. Assume a declaration like T x[N], where x is an array of N objects of type T, and say that the address of x is a. Then the address of element x[i] equals a + i * sizeof(T), and sizeof(x) == N * sizeof(T).
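For example (a small check of these formulas; the variable names are illustrative):

    #include <cassert>
    #include <cinttypes>

    void check_array_layout() {
        int x[4] = {0, 1, 2, 3};
        // &x[i] == (address of x) + i * sizeof(int)
        assert((uintptr_t) &x[2] == (uintptr_t) x + 2 * sizeof(int));
        assert(sizeof(x) == 4 * sizeof(int));
    }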

Sidebar: Vector representation

The C++ library type std::vector defines an array that can grow and shrink. For instance, this function creates a vector containing the numbers 0 up to N in sequence:
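(A sketch of the kind of function described; the function name is assumed.)

    #include <vector>

    std::vector<int> make_sequence(int n) {
        std::vector<int> v;
        for (int i = 0; i != n; ++i) {
            v.push_back(i);
        }
        return v;
    }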

Here, v is an object with automatic lifetime. This means its size (in the sizeof sense) is fixed at compile time. Remember that the sizes of static- and automatic-lifetime objects must be known at compile time; only dynamic-lifetime objects can have varying size based on runtime parameters. So where and how are v ’s contents stored?

The C++ abstract machine requires that v ’s elements are stored in an array in memory. (The v.data() method returns a pointer to the first element of the array.) But it does not define std::vector ’s layout otherwise, and C++ library designers can choose different layouts based on their needs. We found these to hold for the std::vector in our library:

sizeof(v) == 24 for any vector of any type, and the address of v is a stack address (i.e., v is located in the stack segment).

The first 8 bytes of the vector hold the address of the first element of the contents array—call it the begin address . This address is a heap address, which is as expected, since the contents must have dynamic lifetime. The value of the begin address is the same as that of v.data() .

Bytes 8–15 hold the address just past the contents array—call it the end address . Its value is the same as &v.data()[v.size()] . If the vector is empty, then the begin address and the end address are the same.

Bytes 16–23 hold an address greater than or equal to the end address. This is the capacity address. As a vector grows, it will sometimes outgrow its current location and move its contents to new memory addresses. To reduce the number of copies, vectors usually request more memory from the operating system than they immediately need; this additional space, which is called "capacity," supports cheap growth. Often the capacity doubles on each growth spurt, since this allows operations like v.push_back() to execute in O(1) time on average.

Compilers must also decide where different objects are stored when those objects are not part of a collection. For instance, consider this program:
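(The program itself is missing from this page; a sketch consistent with the discussion below. The header name is assumed.)

    #include "hexdump.hh"   // assumed header for the course's hexdump helpers

    int main() {
        int i1 = 1;
        int i2 = 2;
        char c1 = 'a';
        char c2 = 'b';
        char c3 = 'c';
        hexdump_object(i1);
        hexdump_object(i2);
        hexdump_object(c1);
        hexdump_object(c2);
        hexdump_object(c3);
    }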

The abstract machine says these objects cannot overlap, but does not otherwise constrain their positions in memory.

On Linux, GCC will put all these variables into the stack segment, which we can see using hexdump. But it can put them in the stack segment in any order, as we can see by reordering the declarations (try declaration order i1, c1, i2, c2, c3), by changing optimization levels, or by adding different scopes (braces). The abstract machine gives the programmer no guarantees about how object addresses relate. In fact, the compiler may move objects around during execution, as long as it ensures that the program behaves according to the abstract machine. Modern optimizing compilers often do this, particularly for automatic objects.

But what order does the compiler choose? With optimization disabled, the compiler appears to lay out objects in decreasing order by declaration, so the first declared variable in the function has the highest address. With optimization enabled, the compiler follows roughly the same guideline, but it also rearranges objects by type—for instance, it tends to group char s together—and it can reuse space if different variables in the same function have disjoint lifetimes. The optimizing compiler tends to use less space for the same set of variables. This is because it’s arranging objects by alignment.

The C++ compiler and library restrict the addresses at which some kinds of data appear. In particular, the address of every int value is always a multiple of 4, whether it's located on the stack (automatic lifetime), in the data segment (static lifetime), or in the heap (dynamic lifetime).

A bunch of observations will show you these rules:

Type | Size | Address restriction | alignof(T)
char (signed char, unsigned char) | 1 | No restriction | 1
short (unsigned short) | 2 | Multiple of 2 | 2
int (unsigned int) | 4 | Multiple of 4 | 4
long (unsigned long) | 8 | Multiple of 8 | 8
float | 4 | Multiple of 4 | 4
double | 8 | Multiple of 8 | 8
long double | 16 | Multiple of 16 | 16
T* (any pointer) | 8 | Multiple of 8 | 8

These are the alignment restrictions for an x86-64 Linux machine.

These restrictions hold for most x86-64 operating systems, except that on Windows, the long type has size and alignment 4. (The long long type has size and alignment 8 on all x86-64 operating systems.)

Just like every type has a size, every type has an alignment. The alignment of a type T is a number a ≥1 such that the address of every object of type T must be a multiple of a . Every object with type T has size sizeof(T) —it occupies sizeof(T) contiguous bytes of memory; and has alignment alignof(T) —the address of its first byte is a multiple of alignof(T) . You can also say sizeof(x) and alignof(x) where x is the name of an object or another expression.

Alignment restrictions can make hardware simpler, and therefore faster. For instance, consider cache blocks. CPUs access memory through a transparent hardware cache. Data moves from primary memory, or RAM (which is large—a couple gigabytes on most laptops—and uses cheaper, slower technology) to the cache in units of 64 or 128 bytes. Those units are always aligned: on a machine with 128-byte cache blocks, the bytes with memory addresses [127, 128, 129, 130] live in two different cache blocks (with addresses [0, 127] and [128, 255]). But the 4 bytes with addresses [4n, 4n+1, 4n+2, 4n+3] always live in the same cache block. (This is true for any small power of two: the 8 bytes with addresses [8n,…,8n+7] always live in the same cache block.) In general, it’s often possible to make a system faster by leveraging restrictions—and here, the CPU hardware can load data faster when it can assume that the data lives in exactly one cache line.

The compiler, library, and operating system all work together to enforce alignment restrictions.

On x86-64 Linux, alignof(T) == sizeof(T) for all fundamental types (the types built in to C: integer types, floating point types, and pointers). But this isn’t always true; on x86-32 Linux, double has size 8 but alignment 4.

It’s possible to construct user-defined types of arbitrary size, but the largest alignment required by a machine is fixed for that machine. C++ lets you find the maximum alignment for a machine with alignof(std::max_align_t) ; on x86-64, this is 16, the alignment of the type long double (and the alignment of some less-commonly-used SIMD “vector” types ).

We now turn to the abstract machine rules for laying out all collections. The sizes and alignments for user-defined types—arrays, structs, and unions—are derived from a couple simple rules or principles. Here they are. The first rule applies to all types.

1. First-member rule. The address of the first member of a collection equals the address of the collection.

Thus, the address of an array is the same as the address of its first element. The address of a struct is the same as the address of the first member of the struct.

The next three rules depend on the class of collection. Every C abstract machine enforces these rules.

2. Array rule. Arrays are laid out sequentially as described above.

3. Struct rule. The second and subsequent members of a struct are laid out in order, with no overlap, subject to alignment constraints.

4. Union rule. All members of a union share the address of the union.

In C, every struct follows the struct rule, but in C++, only simple structs follow the rule. Complicated structs, such as structs with some public and some private members, or structs with virtual functions, can be laid out however the compiler chooses. The typical situation is that C++ compilers for a machine architecture (e.g., “Linux x86-64”) will all agree on a layout procedure for complicated structs. This allows code compiled by different compilers to interoperate.

The next rule defines the operation of the malloc library function.

5. Malloc rule. Any non-null pointer returned by malloc has alignment appropriate for any type. In other words, assuming the allocated size is adequate, the pointer returned from malloc can safely be cast to T* for any T .

Oddly, this holds even for small allocations. The C++ standard (the abstract machine) requires that malloc(1) return a pointer whose alignment is appropriate for any type, including types that don’t fit.

And the final rule is not required by the abstract machine, but it’s how sizes and alignments on our machines work.

6. Minimum rule. The sizes and alignments of user-defined types, and the offsets of struct members, are minimized within the constraints of the other rules.

The minimum rule, and the sizes and alignments of basic types, are defined by the x86-64 Linux “ABI” —its Application Binary Interface. This specification standardizes how x86-64 Linux C compilers should behave, and lets users mix and match compilers without problems.

Consequences of the size and alignment rules

From these rules we can derive some interesting consequences.

First, the size of every type is a multiple of its alignment .

To see why, consider an array with two elements. By the array rule, these elements have addresses a and a+sizeof(T) , where a is the address of the array. Both of these addresses contain a T , so they are both a multiple of alignof(T) . That means sizeof(T) is also a multiple of alignof(T) .

We can also characterize the sizes and alignments of different collections .

  • The size of an array of N elements of type T is N * sizeof(T) : the sum of the sizes of its elements. The alignment of the array is alignof(T) .
  • The size of a union is the maximum of the sizes of its components (because the union can only hold one component at a time). Its alignment is also the maximum of the alignments of its components.
  • The size of a struct is at least as big as the sum of the sizes of its components. Its alignment is the maximum of the alignments of its components.

Thus, the alignment of every collection equals the maximum of the alignments of its components.

It’s also true that the alignment equals the least common multiple of the alignments of its components. You might have thought lcm was a better answer, but the max is the same as the lcm for every architecture that matters, because all fundamental alignments are powers of two.

The size of a struct might be larger than the sum of the sizes of its components, because of alignment constraints. Since the compiler must lay out struct components in order, and it must obey the components’ alignment constraints, and it must ensure different components occupy disjoint addresses, it must sometimes introduce extra space in structs. Here’s an example: the struct will have 3 bytes of padding after char c , to ensure that int i2 has the correct alignment.
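(The struct described, reconstructed from the member names in the text; the name of the first member is assumed.)

    struct example {
        int i1;   // offset 0, 4 bytes
        char c;   // offset 4, 1 byte
                  // offsets 5-7: 3 bytes of padding so i2 is 4-byte aligned
        int i2;   // offset 8, 4 bytes
    };            // sizeof(example) == 12, alignof(example) == 4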

Thanks to padding, reordering struct components can sometimes reduce the total size of a struct. Padding can happen at the end of a struct as well as the middle. Padding can never happen at the start of a struct, however (because of Rule 1).

The rules also imply that the offset of any struct member —which is the difference between the address of the member and the address of the containing struct— is a multiple of the member’s alignment .

To see why, consider a struct s with member m at offset o . The malloc rule says that any pointer returned from malloc is correctly aligned for s . Every pointer returned from malloc is maximally aligned, equalling 16*x for some integer x . The struct rule says that the address of m , which is 16*x + o , is correctly aligned. That means that 16*x + o = alignof(m)*y for some integer y . Divide both sides by a = alignof(m) and you see that 16*x/a + o/a = y . But 16/a is an integer—the maximum alignment is a multiple of every alignment—so 16*x/a is an integer. We can conclude that o/a must also be an integer!

Finally, we can also derive the necessity for padding at the end of structs. (How?)

What happens when an object is uninitialized? The answer depends on its lifetime.

  • static lifetime (e.g., int global; at file scope): The object is initialized to 0.
  • automatic or dynamic lifetime (e.g., int local; in a function, or int* ptr = new int ): The object is uninitialized and reading the object’s value before it is assigned causes undefined behavior.

Compiler hijinks

In C++, most dynamic memory allocation uses special language operators, new and delete , rather than library functions.

Though this seems more complex than the library-function style, it has advantages. A C compiler cannot tell what malloc and free do (especially when they are redefined to debugging versions, as in the problem set), so a C compiler cannot necessarily optimize calls to malloc and free away. But the C++ compiler may assume that all uses of new and delete follow the rules laid down by the abstract machine. That means that if the compiler can prove that an allocation is unnecessary or unused, it is free to remove that allocation!

For example, we compiled this program in the problem set environment (based on test003.cc ):
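(The program isn't reproduced here; a sketch consistent with the description below, where ptrs is a local array of allocated pointers and m61_printstatistics() prints the problem set allocator's statistics.)

    int main() {
        int* ptrs[10];
        for (int i = 0; i != 10; ++i) {
            ptrs[i] = new int[i + 1];
        }
        for (int i = 0; i != 10; ++i) {
            delete[] ptrs[i];
        }
        m61_printstatistics();
    }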

The optimizing C++ compiler removes all calls to new and delete , leaving only the call to m61_printstatistics() ! (For instance, try objdump -d testXXX to look at the compiled x86-64 instructions.) This is valid because the compiler is explicitly allowed to eliminate unused allocations, and here, since the ptrs variable is local and doesn’t escape main , all allocations are unused. The C compiler cannot perform this useful transformation. (But the C compiler can do other cool things, such as unroll the loops .)

One of C’s more interesting choices is that it explicitly relates pointers and arrays. Although arrays are laid out in memory in a specific way, they generally behave like pointers when they are used. This property probably arose from C’s desire to explicitly model memory as an array of bytes, and it has beautiful and confounding effects.

We’ve already seen one of these effects. The hexdump function has this signature (arguments and return type):
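(The exact declaration lives in hexdump.cc; this is the shape described earlier in the unit.)

    void hexdump(const void* ptr, size_t size);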

But we can just pass an array as argument to hexdump :
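(An illustrative call; the array's contents don't matter.)

    int array[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    hexdump(array, sizeof(array));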

When used in an expression like this—here, as an argument—the array magically changes into a pointer to its first element. The above call has the same meaning as this:
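(Using the same array as above.)

    hexdump(&array[0], sizeof(array));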

C programmers transition between arrays and pointers very naturally.

A confounding effect is that, unlike all other types, in C arrays are passed to and returned from functions by reference rather than by value. C is a call-by-value language except for arrays. This means that all function arguments and return values are copied, so that parameter modifications inside a function do not affect the objects passed by the caller—except for arrays. For instance:

    #include <cstdio>

    void f(int a[2]) {
        a[0] = 1;
    }

    int main() {
        int x[2] = {100, 101};
        f(x);
        printf("%d\n", x[0]);   // prints 1!
    }

If you don't like this behavior, you can get around it by using a struct or a C++ std::array.

    #include <array>
    #include <cstdio>

    struct array1 {
        int a[2];
    };

    void f1(array1 arg) {
        arg.a[0] = 1;
    }

    void f2(std::array<int, 2> a) {
        a[0] = 1;
    }

    int main() {
        array1 x = {{100, 101}};
        f1(x);
        printf("%d\n", x.a[0]);   // prints 100

        std::array<int, 2> x2 = {100, 101};
        f2(x2);
        printf("%d\n", x2[0]);    // prints 100
    }

C++ extends the logic of this array–pointer correspondence to support arithmetic on pointers as well.

Pointer arithmetic rule. In the C abstract machine, arithmetic on pointers produces the same result as arithmetic on the corresponding array indexes.

Specifically, consider an array T a[n] and pointers T* p1 = &a[i] and T* p2 = &a[j] . Then:

Equality : p1 == p2 if and only if (iff) p1 and p2 point to the same address, which happens iff i == j .

Inequality : Similarly, p1 != p2 iff i != j .

Less-than : p1 < p2 iff i < j .

Also, p1 <= p2 iff i <= j ; and p1 > p2 iff i > j ; and p1 >= p2 iff i >= j .

Pointer difference : What should p1 - p2 mean? Using array indexes as the basis, p1 - p2 == i - j . (But the type of the difference is always ptrdiff_t , which on x86-64 is long , the signed version of size_t .)

Addition : p1 + k (where k is an integer type) equals the pointer &a[i + k] . ( k + p1 returns the same thing.)

Subtraction : p1 - k equals &a[i - k] .

Increment and decrement : ++p1 means p1 = p1 + 1 , which means p1 = &a[i + 1] . Similarly, --p1 means p1 = &a[i - 1] . (There are also postfix versions, p1++ and p1-- , but C++ style prefers the prefix versions.)

No other arithmetic operations on pointers are allowed. You can’t multiply pointers, for example. (You can multiply addresses by casting the pointers to the address type, uintptr_t —so (uintptr_t) p1 * (uintptr_t) p2 —but why would you?)

From pointers to iterators

Let’s write a function that can sum all the integers in an array.
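(A sketch of the function described; the course's version may differ in details.)

    int sum(int a[], int size) {
        int total = 0;
        for (int i = 0; i != size; ++i) {
            total += a[i];
        }
        return total;
    }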

This function can compute the sum of the elements of any int array. But because of the pointer–array relationship, its a argument is really a pointer . That allows us to call it with subarrays as well as with whole arrays. For instance:
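(Illustrative calls, assuming the sum function above and a ten-element array.)

    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int s1 = sum(a, 10);       // the whole array
    int s2 = sum(a, 5);        // the first five elements
    int s3 = sum(a + 5, 5);    // the last five elements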

This way of thinking about arrays naturally leads to a style that avoids sizes entirely, using instead a sentinel or boundary argument that defines the end of the interesting part of the array.
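(A boundary-style version of sum, sketched to match the description.)

    int sum(int* first, int* last) {
        int total = 0;
        while (first != last) {
            total += *first;
            ++first;
        }
        return total;
    }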

These expressions compute the same sums as the above:
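(Again assuming the ten-element array a from above.)

    int s1 = sum(a, a + 10);
    int s2 = sum(a, a + 5);
    int s3 = sum(a + 5, a + 10);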

Note that the data from first to last forms a half-open range. In mathematical notation, we care about elements in the range [first, last): the element pointed to by first is included (if it exists), but the element pointed to by last is not. Half-open ranges give us a simple and clear way to describe empty ranges, such as zero-element arrays: if first == last, then the range is empty.

Note that given a ten-element array a , the pointer a + 10 can be formed and compared, but must not be dereferenced—the element a[10] does not exist. The C/C++ abstract machines allow users to form pointers to the “one-past-the-end” boundary elements of arrays, but users must not dereference such pointers.

So in C, two pointers naturally express a range of an array. The C++ standard template library, or STL, brilliantly abstracts this pointer notion to allow two iterators , which are pointer-like objects, to express a range of any standard data structure—an array, a vector, a hash table, a balanced tree, whatever. This version of sum works for any container of int s; notice how little it changed:
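(A sketch of the iterator version; the course's code may differ slightly.)

    template <typename Iterator>
    int sum(Iterator first, Iterator last) {
        int total = 0;
        while (first != last) {
            total += *first;
            ++first;
        }
        return total;
    }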

Some example uses:
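(Illustrative uses with two different containers.)

    #include <list>
    #include <vector>

    std::vector<int> v = {1, 2, 3};
    std::list<int> lst = {4, 5, 6};
    int s1 = sum(v.begin(), v.end());      // 6
    int s2 = sum(lst.begin(), lst.end());  // 15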

Addresses vs. pointers

What’s the difference between these expressions? (Again, a is an array of type T , and p1 == &a[i] and p2 == &a[j] .)
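(The two expressions under discussion, reconstructed from the text below.)

    ptrdiff_t d1 = p1 - p2;                          // pointer arithmetic
    uintptr_t d2 = (uintptr_t) p1 - (uintptr_t) p2;  // address arithmetic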

The first expression is defined analogously to index arithmetic, so d1 == i - j. But the second expression performs the arithmetic on the addresses corresponding to those pointers, so we expect d2 to equal sizeof(T) * d1. Always be aware of which kind of arithmetic you're using. Generally arithmetic on pointers should not involve sizeof, since the sizeof is included automatically according to the abstract machine; but arithmetic on addresses almost always should involve sizeof.

Although C++ is a low-level language, the abstract machine is surprisingly strict about which pointers may be formed and how they can be used. Violate the rules and you’re in hell because you have invoked the dreaded undefined behavior .

Given an array a[N] of N elements of type T :

Forming a pointer &a[i] (or a + i ) with 0 ≤ i ≤ N is safe.

Forming a pointer &a[i] with i < 0 or i > N causes undefined behavior.

Dereferencing a pointer &a[i] with 0 ≤ i < N is safe.

Dereferencing a pointer &a[i] with i < 0 or i ≥ N causes undefined behavior.

(For the purposes of these rules, objects that are not arrays count as single-element arrays. So given T x , we can safely form &x and &x + 1 and dereference &x .)

What “undefined behavior” means is horrible. A program that executes undefined behavior is erroneous. But the compiler need not catch the error. In fact, the abstract machine says anything goes : undefined behavior is “behavior … for which this International Standard imposes no requirements.” “Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).” Other possible behaviors include allowing hackers from the moon to steal all of a program’s data, take it over, and force it to delete the hard drive on which it is running. Once undefined behavior executes, a program may do anything, including making demons fly out of the programmer’s nose.

Pointer arithmetic, and even pointer comparisons, are also affected by undefined behavior. It's undefined to go beyond an array's bounds using pointer arithmetic. And pointers may be compared for equality or inequality even if they point to different arrays or objects, but if you try to compare different arrays via less-than, like this:
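(An illustrative comparison of pointers into two unrelated arrays.)

    int a1[10];
    int a2[10];
    bool b = a1 < a2;   // comparing pointers into different arrays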

that causes undefined behavior.

If you really want to compare pointers that might be to different arrays—for instance, you’re writing a hash function for arbitrary pointers—cast them to uintptr_t first.

Undefined behavior and optimization

A program that causes undefined behavior is not a C++ program . The abstract machine says that a C++ program, by definition, is a program whose behavior is always defined. The C++ compiler is allowed to assume that its input is a C++ program. (Obviously!) So the compiler can assume that its input program will never cause undefined behavior. Thus, since undefined behavior is “impossible,” if the compiler can prove that a condition would cause undefined behavior later, it can assume that condition will never occur.

Consider this program:
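(The program isn't reproduced here; a sketch consistent with the discussion below, where the pointer value is taken from the command line.)

    #include <cassert>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char* argv[]) {
        char* x = (char*) strtoul(argv[1], nullptr, 0);
        assert(x + 1 > x);
        printf("x + 1 = %p is greater than x = %p\n",
               (void*) (x + 1), (void*) x);
    }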

If we supply a value equal to (char*) -1 , we’re likely to see output like this:
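(Illustrative output; the exact text depends on the program's printf format. Note that (char*) -1 is 0xffffffffffffffff and x + 1 wraps to 0.)

    x + 1 = (nil) is greater than x = 0xffffffffffffffff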

with no assertion failure! But that’s an apparently impossible result. The printout can only happen if x + 1 > x (otherwise, the assertion will fail and stop the printout). But x + 1 , which equals 0 , is less than x , which is the largest 8-byte value!

The impossible happens because of undefined behavior reasoning. When the compiler sees an expression like x + 1 > x (with x a pointer), it can reason this way:

“Ah, x + 1 . This must be a pointer into the same array as x (or it might be a boundary pointer just past that array, or just past the non-array object x ). This must be so because forming any other pointer would cause undefined behavior.

“The pointer comparison is the same as an index comparison. x + 1 > x means the same thing as &x[1] > &x[0] . But that holds iff 1 > 0 .

“In my infinite wisdom, I know that 1 > 0 . Thus x + 1 > x always holds, and the assertion will never fail.

“My job is to make this code run fast. The fastest code is code that’s not there. This assertion will never fail—might as well remove it!”

Integer undefined behavior

Arithmetic on signed integers also has important undefined behaviors. Signed integer arithmetic must never overflow. That is, the compiler may assume that the mathematical result of any signed arithmetic operation, such as x + y (with x and y both int ), can be represented inside the relevant type. It causes undefined behavior, therefore, to add 1 to the maximum positive integer. (The ubexplore.cc program demonstrates how this can produce impossible results, as with pointers.)

Arithmetic on unsigned integers is much safer with respect to undefined behavior. Unsigned integers are defined to perform arithmetic modulo their size. This means that if you add 1 to the maximum positive unsigned integer, the result will always be zero.
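The contrast in code (a minimal illustration; the signed addition is the undefined one):

    #include <climits>

    unsigned u = UINT_MAX;
    unsigned u2 = u + 1;   // well defined: u2 == 0 (modulo 2^32 on x86-64)

    int i = INT_MAX;
    int i2 = i + 1;        // undefined behavior: signed overflow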

Dividing an integer by zero causes undefined behavior whether or not the integer is signed.

Sanitizers, which in our makefiles are turned on by supplying SAN=1 , can catch many undefined behaviors as soon as they happen. Sanitizers are built in to the compiler itself; a sanitizer involves cooperation between the compiler and the language runtime. This has the major performance advantage that the compiler introduces exactly the required checks, and the optimizer can then use its normal analyses to remove redundant checks.

That said, undefined behavior checking can still be slow. Undefined behavior allows compilers to make assumptions about input values, and those assumptions can directly translate to faster code. Turning on undefined behavior checking can make some benchmark programs run 30% slower [link] .

Signed integer undefined behavior

File cs61-lectures/datarep5/ubexplore2.cc contains the following program.
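(A sketch of the program, reconstructed from the discussion below; the handout's exact code may differ.)

    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char* argv[]) {
        int n1 = strtol(argv[1], nullptr, 0);
        int n2 = strtol(argv[2], nullptr, 0);
        for (int i = n1; i <= n2; ++i) {
            printf("%d\n", i);
        }
    }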

What will be printed if we run the program with ./ubexplore2 0x7ffffffe 0x7fffffff ?

0x7fffffff is the largest positive value that can be represented by type int. Adding one to this value yields 0x80000000. In two's complement representation this is the smallest negative number that can be represented by type int.

Assuming that the arithmetic wraps around in this way, the loop exit condition i > n2 can never be met, and the program should run (and print out numbers) forever.

However, if we run the optimized version of the program, it prints only two numbers and exits:
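(With the arguments above, the two numbers printed would be:)

    2147483646
    2147483647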

The unoptimized program does print forever and never exits.

What’s going on here? We need to look at the compiled assembly of the program with and without optimization (via objdump -S ).

The unoptimized version basically looks like this:
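(A rough sketch of the control flow, numbered so the discussion below can refer to its steps; the actual instructions depend on the compiler.)

    1. compare i with n2
    2. if i > n2, jump past the loop (exit)
    3. load i as printf's argument
    4. call printf
    5. add 1 to i
    6. jump back to step 1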

This is a pretty direct translation of the loop.

The optimized version, though, does it differently. As always, the optimizer has its own ideas. (Your compiler may produce different results!)
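(Again a rough sketch, numbered to match the discussion below.)

    1. set i to n1
    2. compute n2 + 1 as the loop bound
    3. fall through into the loop body
    4. load i and call printf
    5. add 1 to i
    6. compare i with n2 + 1
    7. if i != n2 + 1, jump back to step 4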

The compiler changed the source's less-than-or-equal-to comparison, i <= n2, into a not-equal-to comparison in the executable, i != n2 + 1 (in both cases using signed computer arithmetic, i.e., modulo 2^32)! The comparison i <= n2 will always return true when n2 == 0x7FFFFFFF, the maximum signed integer, so the loop goes on forever. But the i != n2 + 1 comparison does not always return true when n2 == 0x7FFFFFFF: when i wraps around to 0x80000000 (the smallest negative integer), then i equals n2 + 1 (which also wrapped), and the loop stops.

Why did the compiler make this transformation? In the original loop, the step-6 jump is immediately followed by another comparison and jump in steps 1 and 2. The processor jumps all over the place, which can confuse its prediction circuitry and slow down performance. In the transformed loop, the step-7 jump is never followed by a comparison and jump; instead, step 7 goes back to step 4, which always prints the current number. This more streamlined control flow is easier for the processor to make fast.

But the streamlined control flow is only a valid substitution under the assumption that the addition n2 + 1 never overflows . Luckily (sort of), signed arithmetic overflow causes undefined behavior, so the compiler is totally justified in making that assumption!

Programs based on ubexplore2 have demonstrated undefined behavior differences for years, even as the precise reasons why have changed. In some earlier compilers, we found that the optimizer just upgraded the int s to long s—arithmetic on long s is just as fast on x86-64 as arithmetic on int s, since x86-64 is a 64-bit architecture, and sometimes using long s for everything lets the compiler avoid conversions back and forth. The ubexplore2l program demonstrates this form of transformation: since the loop variable is added to a long counter, the compiler opportunistically upgrades i to long as well. This transformation is also only valid under the assumption that i + 1 will not overflow—which it can’t, because of undefined behavior.

Using an unsigned type prevents all this undefined behavior, because arithmetic overflow on unsigned integers is well defined in C/C++. The ubexplore2u.cc file uses an unsigned loop index and comparison, and ./ubexplore2u and ./ubexplore2u.noopt behave exactly the same (though you have to give arguments like ./ubexplore2u 0xfffffffe 0xffffffff to see the overflow).

Computer arithmetic and bitwise operations

Basic bitwise operators.

Computers offer not only the usual arithmetic operators like + and - , but also a set of bitwise operators. The basic ones are & (and), | (or), ^ (xor/exclusive or), and the unary operator ~ (complement). In truth table form:

& (and): 0 & 0 == 0; 0 & 1 == 0; 1 & 0 == 0; 1 & 1 == 1
| (or): 0 | 0 == 0; 0 | 1 == 1; 1 | 0 == 1; 1 | 1 == 1
^ (xor): 0 ^ 0 == 0; 0 ^ 1 == 1; 1 ^ 0 == 1; 1 ^ 1 == 0
~ (complement): ~0 == 1; ~1 == 0

In C or C++, these operators work on integers. But they work bitwise: the result of an operation is determined by applying the operation independently at each bit position. Here’s how to compute 12 & 4 in 4-bit unsigned arithmetic:
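(Applying & at each bit position:)

      0b1100    (12)
    & 0b0100    (4)
    --------
      0b0100    (4)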

These basic bitwise operators simplify certain important arithmetics. For example, (x & (x - 1)) == 0 tests whether x is zero or a power of 2.

Negation of signed integers can also be expressed using a bitwise operator: -x == ~x + 1. This is in fact how two's complement representation is defined. We can verify that x and -x add up to zero under this representation:
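(A 4-bit worked example.)

     x == 0b0101            (5)
    ~x == 0b1010
    -x == ~x + 1 == 0b1011  (-5)
    x + -x == 0b0101 + 0b1011 == 0b10000 == 0b0000 (mod 2^4)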

Bitwise "and" ( & ) can help with modular arithmetic. For example, x % 32 == (x & 31) . We essentially "mask off", or clear, higher order bits to do modulo-powers-of-2 arithmetics. This works in any base. For example, in decimal, the fastest way to compute x % 100 is to take just the two least significant digits of x .

Bitwise shift of unsigned integer

x << i shifts x left by i positions, appending i zero bits at the least-significant end of x. High-order bits that don't fit in the integer are thrown out. For example, assuming 4-bit unsigned integers:
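(Illustrative 4-bit examples.)

    0b0010 << 1 == 0b0100    (2 << 1 == 4)
    0b0110 << 2 == 0b1000    (the high 1 bit is thrown out)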

Similarly, x >> i shifts x right by i positions, inserting i zero bits at the most-significant end of x. The low-order bits that are shifted out are thrown out.

Bitwise shift helps with division and multiplication. For example:
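(For unsigned x:)

    x << 3 == x * 8
    x >> 3 == x / 8    (integer division, rounding down)
    x & 7  == x % 8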

A modern compiler can optimize y = x * 66 into y = (x << 6) + (x << 1) .

Bitwise operations also allow us to treat bits within an integer separately. This can be useful for "options".

For example, when we call a function to open a file, we have a lot of options:

  • Open for reading?
  • Open for writing?
  • Read from the end?
  • Optimize for writing?

We have a lot of true/false options.

One bad way to implement this is to have this function take a bunch of arguments -- one argument for each option. This makes the function call look like this:
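(A hypothetical call; the function name and option order are invented for illustration.)

    File* f = file_open("data.txt",
                        true,    // open for reading?
                        false,   // open for writing?
                        false,   // read from the end?
                        false);  // optimize for writing?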

The long list of arguments slows down the function call, and one can also easily lose track of the meaning of the individual true/false values passed in.

A cheaper way to achieve this is to use a single integer to represent all the options. Have each option defined as a power of 2, and simply | (or) them together and pass them as a single integer.
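(A hypothetical set of flag constants; the names are invented for illustration.)

    constexpr int OPEN_READ      = 1 << 0;   // 0b0001
    constexpr int OPEN_WRITE     = 1 << 1;   // 0b0010
    constexpr int OPEN_AT_END    = 1 << 2;   // 0b0100
    constexpr int OPEN_OPTIMIZED = 1 << 3;   // 0b1000

    File* f = file_open("data.txt", OPEN_READ | OPEN_AT_END);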

Flags are usually defined as powers of 2 so we set one bit at a time for each flag. It is less common but still possible to define a combination flag that is not a power of 2, so that it sets multiple bits in one go.

File cs61-lectures/datarep5/mb-driver.cc contains a memory allocation benchmark. The core of the benchmark looks like this:
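(A sketch of the core loop, reconstructed from the description below; arena, noperations, and random_position are assumed names.)

    memnode* nodes[4096];
    for (int i = 0; i != 4096; ++i) {
        nodes[i] = arena.allocate();
    }
    for (long op = 0; op != noperations; ++op) {
        unsigned pos = random_position();    // pick a slot to recycle
        arena.deallocate(nodes[pos]);
        nodes[pos] = arena.allocate();
    }
    for (int i = 0; i != 4096; ++i) {
        arena.deallocate(nodes[i]);
    }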

The benchmark tests the performance of the memnode_arena::allocate() and memnode_arena::deallocate() functions. In the handout code, these functions do the same thing as new memnode and delete memnode—they are wrappers for malloc and free. The benchmark allocates 4096 memnode objects, then frees and reallocates them noperations times, and then frees all of them.

We only allocate memnodes, and all memnodes are of the same size, so we don't need metadata that keeps track of the size of each allocation. Furthermore, since all dynamically allocated data is freed at the end of the function, for each individual memnode_free() call we don't really need to return memory to the system allocator. We can simply reuse that memory during the function and return all memory to the system at once when the function exits.

If we run the benchmark with 100,000,000 allocations, and use the system malloc() and free() functions to implement the memnode allocator, the benchmark finishes in 0.908 seconds.

Our alternative implementation of the allocator can finish in 0.355 seconds, beating the heavily optimized system allocator by a factor of about 2.5. We will reveal how we achieved this in the next lecture.

We continue our exploration with the memnode allocation benchmark introduced in the last lecture.

File cs61-lectures/datarep6/mb-malloc.cc contains a version of the benchmark using the system new and delete operators.

In this function we allocate an array of 4096 pointers to memnodes, which occupies 2^3 × 2^12 = 2^15 bytes on the stack. We then allocate 4096 memnodes. Our memnode is defined like this:
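(Reconstructed from the description below; the member names are assumed.)

    struct memnode {
        std::string file;
        unsigned line;
    };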

Each memnode contains a std::string object and an unsigned integer. Each std::string object internally contains a pointer to a character array in the heap. Therefore, every time we create a new memnode, we need 2 allocations: one to allocate the memnode itself, and another one performed internally by the std::string object when we initialize/assign a string value to it.

Every time we deallocate a memnode by calling delete, we also delete the std::string object, and the string object knows that it should deallocate the heap character array it internally maintains. So there are also 2 deallocations occurring each time we free a memnode.

We make the benchmark return a seemingly meaningless result to prevent an aggressive compiler from optimizing everything away. We also use this result to make sure our subsequent optimizations to the allocator are correct, by checking that they generate the same result.

This version of the benchmark, using the system allocator, finishes in 0.335 seconds. Not bad at all.

Spoiler alert: We can do 15x better than this.

1st optimization: std::string

We only deal with one file name, namely "datarep/mb-filename.cc", which is constant throughout the program for all memnodes. It's also a string literal, which means that as a constant string it has static lifetime. Why can't we simply use a const char* in place of the std::string and let the pointer point to the static constant string? This saves us the internal allocation/deallocation performed by std::string every time we initialize/delete a string.

The fix is easy: we simply change the memnode definition:
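(The same struct with the std::string replaced.)

    struct memnode {
        const char* file;
        unsigned line;
    };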

This version of the benchmark now finishes in 0.143 seconds, a 2x improvement over the original benchmark. This 2x improvement is consistent with the 2x reduction in the number of allocations/deallocations mentioned earlier.

You may ask why people still use std::string if it involves an additional allocation and is slower than const char*, as shown in this benchmark. std::string is much more flexible in that it also handles data that doesn't have static lifetime, such as input from a user or data the program receives over the network. In short, when the program deals with strings that are not constant, heap data is likely to be very useful, and std::string provides facilities to conveniently handle on-heap data.

2nd optimization: the system allocator

We still use the system allocator to allocate/deallocate memnodes. The system allocator is a general-purpose allocator, which means it must handle allocation requests of all sizes. Such general-purpose designs usually come with a compromise in performance. Since we are only allocating memnodes, which are fairly small objects (and all have the same size), we can build a special-purpose allocator just for them.

In cs61-lectures/datarep5/mb2.cc , we actually implement a special-purpose allocator for memnode s:
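(A sketch of the allocator, consistent with the description below; the handout's code may differ.)

    #include <vector>

    struct memnode_arena {
        std::vector<memnode*> free_list;

        memnode* allocate() {
            if (free_list.empty()) {
                return new memnode;
            }
            memnode* n = free_list.back();   // pop a freed node off the list
            free_list.pop_back();
            return n;
        }

        void deallocate(memnode* n) {
            free_list.push_back(n);          // recycle rather than free
        }
    };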

This allocator maintains a free list (a C++ vector) of freed memnodes. allocate() simply pops a memnode off the free list if there is one, and deallocate() simply puts the memnode on the free list. This free list serves as a buffer between the system allocator and the benchmark function, so that the system allocator is invoked less frequently. In fact, in the benchmark, the system allocator is invoked only 4096 times, when it initializes the pointer array. That's a huge reduction, because the 10 million "recycle" operations in the middle no longer involve the system allocator at all.

With this special-purpose allocator we can finish the benchmark in 0.057 seconds, another 2.5x improvement.

However, this allocator now leaks memory: it never actually calls delete! Let's fix this by having it also keep track of all allocated memnodes. The modified definition of memnode_arena now looks like this:
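(A sketch of the modified arena; details assumed.)

    struct memnode_arena {
        std::vector<memnode*> free_list;
        std::vector<memnode*> allocated;

        memnode* allocate() {
            if (free_list.empty()) {
                memnode* n = new memnode;
                allocated.push_back(n);      // remember every node we create
                return n;
            }
            memnode* n = free_list.back();
            free_list.pop_back();
            return n;
        }

        void deallocate(memnode* n) {
            free_list.push_back(n);
        }

        void destroy_all() {
            for (memnode* n : allocated) {
                delete n;
            }
            allocated.clear();
        }

        ~memnode_arena() {
            destroy_all();                   // runs when the arena goes out of scope
        }
    };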

With the updated allocator we simply need to invoke arena.destroy_all() at the end of the function to fix the memory leak. And we don't even need to invoke this method manually! We can use the C++ destructor for the memnode_arena struct, defined as ~memnode_arena() in the code above, which is automatically called when our arena object goes out of scope. We simply make the destructor invoke the destroy_all() method, and we are all set.

Fixing the leak doesn't appear to affect performance at all. This is because the overhead added by tracking the allocated list and calling delete only affects our initial allocation of the 4096 memnode* pointers in the array, plus the cleanup at the very end. These 8192 additional operations are a relatively small number compared to the 10 million recycle operations, so the added overhead is hardly noticeable.

Spoiler alert: We can improve this by another factor of 2.

3rd optimization: std::vector

In our special-purpose allocator memnode_arena , we maintain an allocated list and a free list, both using C++ std::vector s. std::vector s are dynamic arrays; like std::string they involve an additional level of indirection and store the actual array in the heap. We don't access the allocated list during the "recycling" part of the benchmark (which takes the bulk of the benchmark time, as we showed earlier), so the allocated list is probably not our bottleneck. We do, however, add and remove elements from the free list for each recycle operation, and the indirection introduced by the std::vector here may actually be our bottleneck. Let's find out.

Instead of using a std::vector , we could use a linked list of all free memnodes for the actual free list. We will need some extra metadata in the memnode to store pointers for this linked list. However, unlike in the debugging allocator pset, in a free list we don't need to store this metadata in addition to the actual memnode data: the memnode is free, and not in use, so we can reuse its memory, using a union:
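
A sketch of such a union (the name freeable_memnode is illustrative; this works because memnode no longer contains a std::string, so its storage can be safely repurposed while the node is free):

    union freeable_memnode {
        memnode n;                    // the node's contents, valid while allocated
        freeable_memnode* next_free;  // link to the next free node, valid only while free
    };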

We then maintain the free list like this:
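
A sketch of what the linked-list version might look like inside memnode_arena (again illustrative, not the verbatim lecture code; the allocated-list bookkeeping from the previous version is omitted here for brevity):

    freeable_memnode* free_list = nullptr;    // head of the list; no heap-allocated array needed

    memnode* allocate() {
        if (free_list) {
            freeable_memnode* f = free_list;
            free_list = f->next_free;          // pop the head of the free list
            return &f->n;
        }
        return &(new freeable_memnode)->n;     // fall back to the system allocator
    }

    void deallocate(memnode* n) {
        freeable_memnode* f = reinterpret_cast<freeable_memnode*>(n);
        f->next_free = free_list;              // push onto the head of the free list
        free_list = f;
    }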

Compared to the std::vector free list, this free list always points directly to an available memnode when it is not empty ( free_list != nullptr ), without going through any indirection. With the std::vector free list, one would first have to go into the heap to access the actual array containing pointers to free memnodes, and then access the memnode itself.

With this change we can now finish the benchmark in under 0.03 seconds. Another 2x improvement over the previous one!

Compared to the benchmark with the system allocator (which finished in 0.335 seconds), we managed to achieve a speedup of nearly 15x with arena allocation.

Course: Computer Architecture

Digital computers store and process information in binary form, as digital logic has only two values, "1" and "0", in other words "True or False" or "ON or OFF". This system is called radix 2. We humans generally deal with radix 10, i.e. decimal. As a matter of convenience there are many other representations, like Octal (radix 8), Hexadecimal (radix 16), Binary Coded Decimal (BCD), Decimal, etc.

Every computer's CPU has a width measured in bits, such as an 8-bit CPU, a 16-bit CPU, a 32-bit CPU, etc. Similarly, each memory location can store a fixed number of bits, called the memory width. Given the size of the CPU and memory, it is up to the programmer to handle his data representation. Most readers will know that 4 bits form a nibble and 8 bits form a byte. The word length is defined by the Instruction Set Architecture of the CPU; the word length may be equal to the width of the CPU.

The memory simply stores information as binary patterns of 1's and 0's. What the content of a memory location means is a matter of interpretation. If the CPU is in the Fetch cycle, it interprets the fetched memory content as an instruction and decodes it based on the instruction format. In the Execute cycle, the information from memory is considered data. As everyday computer users, we think computers handle English and other alphabets, special characters, or numbers. A programmer considers memory content to be the data types of the programming language he uses. Now recall figures 1.2 and 1.3 of chapter 1 to reinforce the idea that conversion happens from the computer's user interface to internal representation and storage.

  • Data Representation in Computers

Information handled by a computer is classified as instructions and data. A broad overview of the internal representation of information is illustrated in figure 3.1. Whether the data is in numeric or non-numeric form, integer or otherwise, everything is internally represented in binary. It is up to the programmer to handle the interpretation of the binary pattern, and this interpretation is called Data Representation . These data representation schemes are all standardized by international organizations.

The choice of data representation to be used in a computer is decided by:

  • The number types to be represented (integer, real, signed, unsigned, etc.)
  • Range of values likely to be represented (maximum and minimum to be represented)
  • The Precision of the numbers i.e. maximum accuracy of representation (floating point single precision, double precision etc)
  • If non-numeric i.e. character, character representation standard to be chosen. ASCII, EBCDIC, UTF are examples of character representation standards.
  • The hardware support in terms of word width and instruction set.

Before we go into the details, let us take an example of interpretation. Say a byte in memory has the value "0011 0001". Although many interpretations are possible, as in figure 3.2, the program has only one interpretation, as decided by the programmer and declared in the program.
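
A quick C++ illustration of this point, printing the same byte under two different interpretations (a small self-contained example, not part of the course material):

    #include <cstdio>

    int main() {
        unsigned char b = 0x31;         // the bit pattern 0011 0001
        std::printf("%d\n", b);         // interpreted as an unsigned integer: 49
        std::printf("%c\n", b);         // interpreted as an ASCII character: '1'
        return 0;
    }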

  • Fixed point Number Representation

Fixed point numbers are also known as whole numbers or integers. The number of bits used in representing the integer also implies the maximum number that can be represented in the system hardware. However, for efficiency of storage and operations, one may choose to represent the integer with one byte, two bytes, four bytes or more. This space allocation follows from the definition used by the programmer while declaring a variable as integer, short or long, and from the Instruction Set Architecture.

In addition to the bit length definition for integers, we also have a choice to represent them as below:

  • Unsigned Integer : A positive number including zero can be represented in this format. All the allotted bits are utilised in defining the number. So if one is using 8 bits to represent an unsigned integer, the number of values that can be represented is 2^8, i.e. "0" to "255". If 16 bits are used, the range is 2^16, i.e. "0 to 65535".
  • Signed Integer : In this format negative numbers, zero, and positive numbers can be represented. A sign bit indicates the magnitude direction as positive or negative. There are three possible representations for signed integers: Sign Magnitude format, 1's Complement format and 2's Complement format .

Signed Integer – Sign Magnitude format: The Most Significant Bit (MSB) is reserved for indicating the direction of the magnitude (value). A "0" in the MSB means a positive number and a "1" in the MSB means a negative number. If n bits are used for representation, n-1 bits indicate the absolute value of the number. Examples for n=8:

0010 1111 = + 47 Decimal (Positive number)

1010 1111 = - 47 Decimal (Negative Number)

0111 1110 = +126 (Positive number)

1111 1110 = -126 (Negative Number)

0000 0000 = + 0 (Positive Number)

1000 0000 = - 0 (Negative Number)

Although this method is easy to understand, Sign Magnitude representation has several shortcomings like

  • Zero can be represented in two ways causing redundancy and confusion.
  • The magnitude uses only n-1 bits, so the total range of magnitudes is limited to 2^(n-1) values (0 to 2^(n-1) - 1), although n bits are allotted.
  • The separate sign bit makes the addition and subtraction more complicated. Also, comparing two numbers is not straightforward.

Signed Integer – 1's Complement format: In this format too, the MSB is reserved as the sign bit. The difference lies in representing the magnitude part of negative numbers: the magnitude bits are inverted, hence the name 1's Complement form. Positive numbers are represented as in plain binary. Let us see some examples to better our understanding.

1101 0000 = - 47 Decimal (Negative Number)

1000 0001 = -126 (Negative Number)

1111 1111 = - 0 (Negative Number)

  • Converting a given binary number to its 2's complement form

Step 1: -x = x' + 1, where x' is the 1's complement of x.

Step 2: To extend the data width of the number, fill up the new bits with sign extension, i.e., the MSB is copied into the added bit positions.

Example: -47 decimal in 8-bit representation:
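
The worked steps (reconstructed here, since the original example figure is not reproduced):

+47 decimal    = 0010 1111 (binary)
1's complement = 1101 0000 (invert every bit)
add 1          = 1101 0001, which is -47 in 8-bit 2's complement form

Sign-extending to 16 bits fills the new high-order bits with the MSB: 1111 1111 1101 0001.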

As you can see, zero is not represented with redundancy: there is only one way of representing zero. The other problem, the complexity of arithmetic operations, is also eliminated in 2's complement representation; subtraction is done as addition.

More exercises on number conversion are left to the self-interest of readers.

  • Floating Point Number system

The maximum number that can be represented as a whole number is 2^n. In the scientific world, we come across numbers like the mass of an electron, 9.10939 × 10^-31 kg, or the velocity of light, 2.99792458 × 10^8 m/s. Imagine writing such a number on a piece of paper without an exponent and converting it into binary for computer representation; it makes no sense to write a number in a non-readable or non-processible form. Hence we write such large or small numbers using an exponent and a mantissa. This is said to be Floating Point representation or real number representation. The real number system has infinitely many values between 0 and 1.

Representation in computer

Unlike the two's complement representation for integer numbers, floating point numbers use a sign-and-magnitude representation for the mantissa, together with an exponent. In the number 9.10939 × 10^31 in decimal form, +31 is the Exponent and 9.10939 is the Fraction . Mantissa, significand and fraction are synonymously used terms. In the computer, the representation is binary and the binary point is not fixed. For example, the number 23.345 can be written as 2.3345 × 10^1, 0.23345 × 10^2, or 2334.5 × 10^-2. The representation 2.3345 × 10^1 is said to be in normalised form.

Floating-point numbers usually use multiple words in memory, as we need to allot a sign bit, a few bits for the exponent and many bits for the mantissa. There are standards for such allocation, which we will see shortly.

  • IEEE 754 Floating Point Representation

We have two standards, known as Single Precision and Double Precision, from IEEE. These standards enable portability among different computers. Figure 3.3 pictures single precision while figure 3.4 pictures double precision. Single precision uses a 32-bit format while double precision uses a 64-bit word length. As the name suggests, double precision can represent fractions with greater accuracy. In both cases, the MSB is the sign bit for the mantissa part, followed by the exponent and the mantissa. The exponent is stored in a biased form so that both positive and negative exponents can be represented.

It is to be noted that in single precision we can represent exponents in the range -126 to +127. It is possible that as a result of arithmetic operations the resulting exponent does not fit in this range. This situation is called overflow in the case of a positive exponent and underflow in the case of a negative exponent. The double precision format has 11 bits for the exponent, allowing exponents from -1022 to +1023. The programmer has to make a choice between single precision and double precision declarations using his knowledge of the data being handled.

Floating point operations on a regular CPU are very slow. Generally, a special-purpose CPU known as a co-processor is used. This co-processor works in tandem with the main CPU. The programmer should use the float declaration only if his data is in real number form; float declarations are not to be used generously.

  • Decimal Numbers Representation

Decimal numbers (radix 10) are represented and processed in the system with the support of additional hardware. We deal with numbers in decimal format in everyday life. Some machines implement decimal arithmetic too, much like floating-point arithmetic hardware. In such a case, the CPU uses decimal numbers in BCD (binary coded decimal) form and does BCD arithmetic operations. BCD operates on radix 10. This hardware operates without conversion to pure binary; it uses a nibble to represent each decimal digit in packed BCD form. BCD operations require not only special hardware but also a decimal instruction set.

  • Exceptions and Error Detection

All of us know that when we do arithmetic operations, we can get answers which have more digits than the operands (e.g., 8 × 2 = 16). This happens in computer arithmetic operations too. When the result size exceeds the allotted size of the variable or the register, it becomes an error or exception. The exception conditions associated with numbers and number operations are Overflow, Underflow, Truncation, Rounding and Multiple Precision . These are detected by the associated hardware in the Arithmetic Unit. These exceptions apply to both Fixed Point and Floating Point operations. Each of these exceptional conditions has a flag bit assigned in the Processor Status Word (PSW). We may discuss these in more detail in later chapters.

  • Character Representation

Another data type is non-numeric, largely character sets. We use a human-understandable character set to communicate with the computer, i.e., for both input and output. Standard character sets like EBCDIC and ASCII are chosen to represent alphabets, numbers and special characters. Nowadays the Unicode standard is also in use for non-English languages like Chinese, Hindi, Spanish, etc. These codes are accessible and available on the internet. Interested readers may access and learn more.



Different forms of data representation in today’s world

Overview : Data can be anything which represents a specific result: a number, text, an image, audio, video, etc. For example, for a human being, data such as name, personal id, country, profession, and bank account details are important. Data can be divided into three categories: personal, public and private.

Forms of data representation : At present, information comes in different forms, such as the following.

Let's discuss them one by one.

  • Audio – An audio signal is a representation of sound or music. Audio differs from text, numbers and images: the underlying signal is continuous rather than discrete, and for digital storage it is sampled and encoded as a series of binary numbers. Audio file formats include MP3, M4A, FLAC, WAV, WMA, AAC, etc.
  • Video – Video refers to the recording, broadcasting, copying or playback of moving visual media, typically a sequence of images (frames) presented in motion, sometimes combined with audio. Video file formats include MP4, MOV, AVI, FLV, etc.
  • Images – Images are also represented as bit patterns. An image is composed of a matrix of pixels, each pixel holding a value and corresponding to one dot of the picture. The size of the picture depends on its resolution. Consider a simple black and white image: if 1 is black (or on) and 0 is white (or off), then a simple black and white picture can be created using binary. Image file formats include JPEG, PNG, TIFF, GIF, etc.


Most of us write numbers in Arabic form, i.e., 1, 2, 3, ..., 9. Some people write them differently, such as I, II, III, IV, ..., IX. No matter what type of representation, most human beings can understand at least the two types I mentioned. Unfortunately the computer doesn't. The computer is the most stupid thing you can ever encounter in your life.

Modern computers are built up with transistors. Whenever an electric current passes into a transistor, either an ON or an OFF status will be established. Therefore the computer can only recognize two numbers: 0 for OFF, and 1 for ON, which can be referred to as Bit 0 and Bit 1. There is nothing in between Bit 0 and Bit 1 (e.g. Bit 0.5 doesn't exist). Hence computers can be said to be discrete machines. The number system consisting of only two numbers is called the Binary number system, and to distinguish the different numbering systems, the numbers humans use, i.e. 1, 2, 3, 4, ..., will be called Decimal (since they are base-10 numbers) from now on.

How, therefore, can the computer understand numbers larger than 1? The answer is simple: 2 is simply 1+1 (like 10 = 9+1 for humans); the numbers are added and the overflow digit is carried over to the left position. So (decimal) 2 is represented in Binary as 10. To further illustrate the relationship, I have listed the numbers 0 to 9 in both systems for comparison:

0 0000 0000
1 0000 0001
2 0000 0010
3 0000 0011
4 0000 0100
5 0000 0101
6 0000 0110
7 0000 0111
8 0000 1000
9 0000 1001

You may ask why I always put 8 binary digits there. Well, the smallest unit in the computer's memory to store data is called a BYTE , which consists of 8 BITS. One byte allows up to 256 different combinations of data representation (2^8 = 256). What happens when we have numbers greater than 256? The computer simply uses more bytes to hold the value; 2 bytes can hold values up to 65,536 (2^16), and so forth.

Not only does the computer not understand the (decimal) numbers you use, it doesn't even understand letters like "ABCDEFG...". The fact is, it doesn't care. Whatever letters you input into the computer, the computer just saves them there and delivers them to you when you instruct it to. It saves these letters in the same Binary format as digits, in accordance with a pattern. On the PC (including DOS, Windows 95/98/NT, and UNIX), the pattern is called ASCII (pronounced ask-ee ), which stands for American Standard Code for Information Interchange.

In this format, the letter "A" is represented by "0100 0001" , or, more often, referred to as decimal 65 in the ASCII table. When performing comparisons of characters, the computer actually looks up the associated ASCII codes and compares the ASCII values instead of the characters. Therefore the letter "B", which has an ASCII value of 66, is greater than the letter "A", with an ASCII value of 65.

The computer stores data in different formats or types . The number 10 can be stored as a numeric value, as in "10 dollars", or as characters, as in the address "10 Main Street". So how can the computer tell? Once again the computer doesn't care; it is your responsibility to ensure that you get the correct data out of it. (For illustration, the characters "10" and the numeric 10 are represented by 0011-0001-0011-0000 and 0000-1010 respectively; you can see how different they are.) Different programming languages have different data types , although the fundamental ones are usually very similar.

C++ has many data types. The following are some basic data types you will be facing in these chapters. Note that there are more complicated data types; you can even create your own data types. Some of these will be discussed later in the tutorial.

Data Type                Size (bytes)  Range              Remarks
char                     1             ASCII -128 to 127
unsigned char            1             ASCII 0 to 255     including high ASCII chars
int                      2             -32768 to 32767    integer
unsigned (unsigned int)  2             0 to 65535         non-negative integer
long int                 4             ±2 billion         double-sized integer
unsigned long int        4             0 to 4 billion     non-negative long integer
float                    4             ±3.4e38            6 significant digits
double                   8             ±1.7e308           15 significant digits

char is basically used to store alphanumerics (numbers are stored in character form). Recall that characters are stored as ASCII representations on the PC. ASCII -128 to -1 do not exist, so char accommodates data from ASCII 0 (null zero) to ASCII 127 (DEL key). The original C++ does not have a String data type (but string is available through the inclusion of a library, to be discussed later). A string can be stored as a one-dimensional array (list) with a "null zero" (ASCII 0) stored in the last "cell" of the array. unsigned char effectively accommodates the use of Extended ASCII characters, which represent most special characters like the copyright sign ©, the registered trademark sign ®, plus some European letters. Both char and unsigned char are stored internally as integers so they can effectively be compared (to be greater or less than).

Whenever you write a char (letter) in your program, you must include it in single quotes. When you write strings (words or sentences), you must include them in double quotes. Otherwise C++ will treat these letters/words/sentences as tokens (to be discussed in Chapter 4). Remember that in C/C++, A, 'A', and "A" are all different. The first, A (without quotes), means a variable or constant (discussed in Chapter 4); the second, 'A' (in single quotes), means the character A, which occupies one byte of memory; the third, "A" (in double quotes), means a string containing the letter A followed by a null character, which occupies 2 bytes of memory (more if stored in a variable/constant of bigger size). See these examples: letter = 'A'; cout << 'A'; cout << "10 Main Street";

int (integer) represents all non-fractional real numbers. Since int has a relatively small range (up to 32767), whenever you need to store a value that may go beyond this limit, long int should be used instead. The beauty of using int is that since it has no fractional part, its value is absolute and calculations with int are extremely accurate. Note, however, that dividing an int by another may result in truncation; e.g. int 10 / int 3 will result in 3, not 3.3333 (more on this will be discussed later).

float , on the other hand, contains fractions. However, real fractional numbers are not possible in computers since they are discrete machines (they can only handle the numbers 0 and 1, not 1.5 nor 1.75 nor anything in between 0 and 1). No matter how many digits your calculator can show, you cannot produce a result of 2/3 without rounding, truncating, or approximation. Mathematicians always write 2/3 instead of 0.66666... when they need the EXACT value. Since the computer cannot produce real fractions, the issue of significant digits comes into sight. For most applications a certain number of significant digits is all you need. For example, when you talk about money, $99.99 is no different from $99.988888888888 (rounded to the nearest cent); when you talk about the wealth of Bill Gates, it makes little sense to say $56,123,456,789.95 instead of just saying approximately $56 billion (these figures are not real; I have no idea how much money Bill has, although I wish he would give me the roundings). As you may see from the above table, float has only 6 significant digits, so for some applications it may not be sufficient, especially in scientific calculations, in which case you may want to use double or even long double to handle the numbers. There is also another problem in using float/double . Since numbers are represented internally as binary values, whenever a fractional number is calculated or translated to/from binary there will be a rounding/truncation error. So if you have a float 0, add 0.01 to it 100 times, then subtract 1.00 from it, you will not get 0 as you should; rather you will get a value close to zero, but not exactly zero. Using double or long double will reduce the error but will not eliminate it. However, as I mentioned earlier, this may not affect real life; it just means you need to exercise caution when programming with floating point numbers.
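
The demo code originally linked here is no longer available; the experiment it described can be reconstructed along these lines (a sketch, not the original code):

    #include <cstdio>

    int main() {
        float f = 0.0f;
        for (int i = 0; i < 100; ++i)
            f += 0.01f;                // 0.01 has no exact binary representation
        f -= 1.0f;                     // mathematically this should give exactly 0
        std::printf("%.10g\n", f);     // prints a tiny non-zero value instead
        return 0;
    }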

There is another C++ data type I haven't included here: bool (boolean), which can only store a value of either 0 (false) or 1 (true). I will be using int (integer) to handle logical comparisons, which poses more challenge and variety of use.

Escape sequences are not data types, but I feel I had better discuss them here. I mentioned earlier that you have to include a null zero at the end of a "string" when using an array of char to represent a string. The easiest way to do this is to write the escape sequence '\0' , which is understood by C++ as the null zero. The following are the escape sequences in C++:

\a  Alarm             \t  Tab             \"    Double Quote
\b  Backspace         \v  Vertical Tab    \000  Octal Number
\f  Form Feed         \\  Backslash       \xhh  Hex Number
\n  New Line          \?  Question Mark   \0    Null Zero
\r  Carriage Return   \'  Single Quote

Earlier I said you can create your own data types. Here I will show you how. In fact, you can not only create new data types but also create an alias of an existing data type. For example, suppose you are writing a program which deals with dollar values. Since dollar values have fractional parts, you have to use either the float or the double data type (e.g. assign the float data type to salary by writing float salary ). Alternatively, you can create an alias MONEY of the same data type and write MONEY salary . You do this by adding the following type definition into your program:

typedef double MONEY;

You can also create new data types. I will discuss more on this when we come to Arrays in Chapter 10. But the following illustrates how you create a new data type of array from a base data type:
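
The original example is missing here; it presumably looked something like the following (the names MONEY and PAYROLL are illustrative):

    typedef double MONEY;        // alias of an existing data type
    typedef MONEY PAYROLL[12];   // a new array data type built from MONEY

    PAYROLL salaries;            // equivalent to: double salaries[12];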


A tutorial on data representation: number systems, integers, floating-point numbers, and characters.

Human beings use decimal (base 10) and duodecimal (base 12) number systems for counting and measurements (probably because we have 10 fingers and two big toes). Computers use binary (base 2) number system, as they are made from binary digital components (known as transistors) operating in two states - on and off. In computing, we also use hexadecimal (base 16) or octal (base 8) number systems, as a compact form for representing binary numbers.

Decimal (Base 10) Number System

Decimal number system has ten symbols: 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , and 9 , called digits . It uses positional notation . That is, the least-significant digit (right-most digit) is of the order of 10^0 (units or ones), the second right-most digit is of the order of 10^1 (tens), the third right-most digit is of the order of 10^2 (hundreds), and so on, where ^ denotes exponent. For example, 735D = 7×10^2 + 3×10^1 + 5×10^0.

We shall denote a decimal number with an optional suffix D if ambiguity arises.

Binary (Base 2) Number System

Binary number system has two symbols: 0 and 1 , called bits . It is also a positional notation , for example, 1011B = 1×2^3 + 0×2^2 + 1×2^1 + 1×2^0 = 11D.

We shall denote a binary number with a suffix B . Some programming languages denote binary numbers with prefix 0b or 0B (e.g., 0b1001000 ), or prefix b with the bits quoted (e.g., b'10001111' ).

A binary digit is called a bit . Eight bits is called a byte (why an 8-bit unit? Probably because 8 = 2^3).

Hexadecimal (Base 16) Number System

Hexadecimal number system uses 16 symbols: 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , A , B , C , D , E , and F , called hex digits . It is a positional notation , for example, 1A3H = 1×16^2 + 10×16^1 + 3×16^0 = 419D.

We shall denote a hexadecimal number (in short, hex) with a suffix H . Some programming languages denote hex numbers with prefix 0x or 0X (e.g., 0x1A3C5F ), or prefix x with hex digits quoted (e.g., x'C3A4D98B' ).

Each hexadecimal digit is also called a hex digit . Most programming languages accept lowercase 'a' to 'f' as well as uppercase 'A' to 'F' .

Computers use the binary system in their internal operations, as they are built from binary digital electronic components with 2 states - on and off. However, writing or reading a long sequence of binary bits is cumbersome and error-prone (try to read this binary string: 1011 0011 0100 0011 0001 1101 0001 1000B , which is the same as hexadecimal B343 1D18H ). The hexadecimal system is used as a compact form or shorthand for binary bits. Each hex digit is equivalent to 4 binary bits, i.e., shorthand for 4 bits, as follows:

Hexadecimal Binary Decimal
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
A 1010 10
B 1011 11
C 1100 12
D 1101 13
E 1110 14
F 1111 15

Conversion from Hexadecimal to Binary

Replace each hex digit by the 4 equivalent bits (as listed in the above table). For example, A3CH = 1010 0011 1100B.

Conversion from Binary to Hexadecimal

Starting from the right-most bit (least-significant bit), replace each group of 4 bits by the equivalent hex digit (pad the left-most bits with zeros if necessary). For example, 1001011100B = 0010 0101 1100B = 25CH.

It is important to note that hexadecimal numbers provide a compact form or shorthand for representing binary bits.

Conversion from Base r to Decimal (Base 10)

Given an n-digit base-r number d(n-1) d(n-2) d(n-3) ... d2 d1 d0 (base r), the decimal equivalent is given by:

d(n-1)×r^(n-1) + d(n-2)×r^(n-2) + ... + d2×r^2 + d1×r^1 + d0×r^0

For example, A1CH = 10×16^2 + 1×16^1 + 12×16^0 = 2588D, and 10110B = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0 = 22D.

Conversion from Decimal (Base 10) to Base r

Use repeated division/remainder. For example, to convert 26D to binary: 26÷2 = 13 remainder 0; 13÷2 = 6 remainder 1; 6÷2 = 3 remainder 0; 3÷2 = 1 remainder 1; 1÷2 = 0 remainder 1. Collecting the remainders in reverse order gives 26D = 11010B.

The above procedure is actually applicable to conversion between any two base systems. For example, to convert 1023D to hexadecimal: 1023÷16 = 63 remainder 15 (F); 63÷16 = 3 remainder 15 (F); 3÷16 = 0 remainder 3. Hence, 1023D = 3FFH.
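
The repeated division/remainder procedure is easy to express in code. Here is a small C++ sketch (to_base is a name chosen for this example):

    #include <iostream>
    #include <string>

    // Convert a non-negative integer n to its representation in base r (2 to 16).
    std::string to_base(unsigned n, unsigned r) {
        const char* digits = "0123456789ABCDEF";
        if (n == 0) return "0";
        std::string s;
        while (n > 0) {
            s.insert(s.begin(), digits[n % r]);  // collect remainders in reverse order
            n /= r;
        }
        return s;
    }

    int main() {
        std::cout << to_base(1023, 2) << "\n";   // 1111111111
        std::cout << to_base(1023, 16) << "\n";  // 3FF
        return 0;
    }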

Conversion between Two Number Systems with Fractional Part

  • Separate the integral and the fractional parts.
  • For the integral part, divide by the target radix repeatably, and collect the remainder in reverse order.
  • For the fractional part, multiply the fractional part by the target radix repeatably, and collect the integral part in the same order.

Example 1 (Decimal to Binary): Convert 18.6875D to binary. Integral part: 18D = 10010B. Fractional part: 0.6875×2 = 1.375 (collect 1); 0.375×2 = 0.75 (collect 0); 0.75×2 = 1.5 (collect 1); 0.5×2 = 1.0 (collect 1). Hence, 0.6875D = 0.1011B and 18.6875D = 10010.1011B.

Example 2 (Decimal to Hexadecimal): Convert 18.6875D to hexadecimal. Integral part: 18D = 12H. Fractional part: 0.6875×16 = 11.0 (collect B). Hence, 18.6875D = 12.BH.

Exercises (Number Systems Conversion)

  • 101010101010

Answers: You could use the Windows Calculator ( calc.exe ) to carry out number system conversions, by setting it to the Programmer or Scientific mode. (Run "calc" ⇒ Select the "Settings" menu ⇒ Choose "Programmer" or "Scientific" mode.)

  • 1101100B , 1001011110000B , 10001100101000B , 6CH , 12F0H , 2328H .
  • 218H , 80H , AAAH , 536D , 128D , 2730D .
  • 10101011110011011110B , 1001000110100B , 100000001111B , 703710D , 4660D , 2063D .
  • ?? (You work it out!)

Computer Memory & Data Representation

A computer uses a fixed number of bits to represent a piece of data, which could be a number, a character, or other information. An n-bit storage location can represent up to 2^n distinct entities. For example, a 3-bit memory location can hold one of these eight binary patterns: 000 , 001 , 010 , 011 , 100 , 101 , 110 , or 111 . Hence, it can represent at most 8 distinct entities. You could use them to represent numbers 0 to 7, numbers 8881 to 8888, characters 'A' to 'H', up to 8 kinds of fruits like apple, orange, banana, or up to 8 kinds of animals like lion, tiger, etc.

Integers, for example, can be represented in 8-bit, 16-bit, 32-bit or 64-bit. You, as the programmer, choose an appropriate bit-length for your integers. Your choice will impose constraint on the range of integers that can be represented. Besides the bit-length, an integer can be represented in various representation schemes, e.g., unsigned vs. signed integers. An 8-bit unsigned integer has a range of 0 to 255, while an 8-bit signed integer has a range of -128 to 127 - both representing 256 distinct numbers.

It is important to note that a computer memory location merely stores a binary pattern . It is entirely up to you, as the programmer, to decide on how these patterns are to be interpreted . For example, the 8-bit binary pattern "0100 0001B" can be interpreted as an unsigned integer 65 , or an ASCII character 'A' , or some secret information known only to you. In other words, you have to first decide how to represent a piece of data in a binary pattern before the binary patterns make sense. The interpretation of binary patterns is called data representation or encoding . Furthermore, it is important that the data representation schemes are agreed upon by all the parties, i.e., industrial standards need to be formulated and strictly followed.

Once you have decided on the data representation scheme, certain constraints, in particular precision and range, will be imposed. Hence, it is important to understand data representation to write correct and high-performance programs.

Rosetta Stone and the Decipherment of Egyptian Hieroglyphs

[Figure: the Rosetta Stone and Egyptian hieroglyphs]

Egyptian hieroglyphs were used by the ancient Egyptians since 4000BC. Unfortunately, since 500AD, no one could any longer read them, until the re-discovery of the Rosetta Stone in 1799 by Napoleon's troops (during Napoleon's Egyptian invasion) near the town of Rashid (Rosetta) in the Nile Delta.

The Rosetta Stone is inscribed with a decree issued in 196BC on behalf of King Ptolemy V. The decree appears in three scripts: the upper text is Ancient Egyptian hieroglyphs , the middle portion Demotic script, and the lowest Ancient Greek . Because it presents essentially the same text in all three scripts, and Ancient Greek could still be understood, it provided the key to the decipherment of the Egyptian hieroglyphs.

The moral of the story is unless you know the encoding scheme, there is no way that you can decode the data.

Reference and images: Wikipedia.

Integer Representation

Integers are whole numbers or fixed-point numbers with the radix point fixed after the least-significant bit. They are in contrast to real numbers or floating-point numbers , where the position of the radix point varies. It is important to take note that integers and floating-point numbers are treated differently in computers. They have different representations and are processed differently (e.g., floating-point numbers are processed in a so-called floating-point processor). Floating-point numbers will be discussed later.

Computers use a fixed number of bits to represent an integer. The commonly-used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation schemes for integers:

  • Unsigned Integers : can represent zero and positive integers.
  • Signed Integers : can represent zero, positive and negative integers, in one of three schemes:
  • Sign-Magnitude representation
  • 1's Complement representation
  • 2's Complement representation

You, as the programmer, need to decide on the bit-length and representation scheme for your integers, depending on your application's requirements. Suppose that you need a counter for counting a small quantity from 0 up to 200; you might choose the 8-bit unsigned integer scheme as there are no negative numbers involved.

n-bit Unsigned Integers

Unsigned integers can represent zero and positive integers, but not negative integers. The value of an unsigned integer is interpreted as " the magnitude of its underlying binary pattern ".

Example 1: Suppose that n =8 and the binary pattern is 0100 0001B , the value of this unsigned integer is 1×2^0 + 1×2^6 = 65D .

Example 2: Suppose that n =16 and the binary pattern is 0001 0000 0000 1000B , the value of this unsigned integer is 1×2^3 + 1×2^12 = 4104D .

Example 3: Suppose that n =16 and the binary pattern is 0000 0000 0000 0000B , the value of this unsigned integer is 0 .

An n-bit pattern can represent 2^n distinct integers. An n-bit unsigned integer can represent integers from 0 to (2^n)-1 , as tabulated below:

n Minimum Maximum
8 0 (2^8)-1  (=255)
16 0 (2^16)-1 (=65,535)
32 0 (2^32)-1 (=4,294,967,295) (9+ digits)
64 0 (2^64)-1 (=18,446,744,073,709,551,615) (19+ digits)

Signed Integers

Signed integers can represent zero, positive integers, as well as negative integers. Three representation schemes are available for signed integers: Sign-Magnitude representation, 1's Complement representation, and 2's Complement representation.

In all the above three schemes, the most-significant bit (msb) is called the sign bit . The sign bit is used to represent the sign of the integer - with 0 for positive integers and 1 for negative integers. The magnitude of the integer, however, is interpreted differently in different schemes.

n-bit Signed Integers in Sign-Magnitude Representation

In sign-magnitude representation:

  • The most-significant bit (msb) is the sign bit , with value of 0 representing positive integer and 1 representing negative integer.
  • The remaining n-1 bits represent the magnitude (absolute value) of the integer. The absolute value of the integer is interpreted as "the magnitude of the (n-1)-bit binary pattern".

Example 1 : Suppose that n =8 and the binary representation is 0 100 0001B .    Sign bit is 0 ⇒ positive    Absolute value is 100 0001B = 65D    Hence, the integer is +65D

Example 2 : Suppose that n =8 and the binary representation is 1 000 0001B .    Sign bit is 1 ⇒ negative    Absolute value is 000 0001B = 1D    Hence, the integer is -1D

Example 3 : Suppose that n =8 and the binary representation is 0 000 0000B .    Sign bit is 0 ⇒ positive    Absolute value is 000 0000B = 0D    Hence, the integer is +0D

Example 4 : Suppose that n =8 and the binary representation is 1 000 0000B .    Sign bit is 1 ⇒ negative    Absolute value is 000 0000B = 0D    Hence, the integer is -0D

[Figure: sign-magnitude representation]

The drawbacks of sign-magnitude representation are:

  • There are two representations ( 0000 0000B and 1000 0000B ) for the number zero, which could lead to inefficiency and confusion.
  • Positive and negative integers need to be processed separately.

n-bit Signed Integers in 1's Complement Representation

In 1's complement representation:

  • Again, the most significant bit (msb) is the sign bit , with value of 0 representing positive integers and 1 representing negative integers.
  • for positive integers, the absolute value of the integer is equal to "the magnitude of the ( n -1)-bit binary pattern".
  • for negative integers, the absolute value of the integer is equal to "the magnitude of the complement ( inverse ) of the ( n -1)-bit binary pattern" (hence called 1's complement).

Example 1 : Suppose that n =8 and the binary representation 0 100 0001B .    Sign bit is 0 ⇒ positive    Absolute value is 100 0001B = 65D    Hence, the integer is +65D

Example 2 : Suppose that n =8 and the binary representation 1 000 0001B .    Sign bit is 1 ⇒ negative    Absolute value is the complement of 000 0001B , i.e., 111 1110B = 126D    Hence, the integer is -126D

Example 3 : Suppose that n =8 and the binary representation 0 000 0000B .    Sign bit is 0 ⇒ positive    Absolute value is 000 0000B = 0D    Hence, the integer is +0D

Example 4 : Suppose that n =8 and the binary representation 1 111 1111B .    Sign bit is 1 ⇒ negative    Absolute value is the complement of 111 1111B , i.e., 000 0000B = 0D    Hence, the integer is -0D

[Figure: 1's complement representation]

Again, the drawbacks are:

  • There are two representations ( 0000 0000B and 1111 1111B ) for zero.
  • The positive integers and negative integers need to be processed separately.

n-bit Signed Integers in 2's Complement Representation

In 2's complement representation:

  • Again, the most significant bit (msb) is the sign bit , with value of 0 representing positive integers and 1 representing negative integers.
  • for positive integers, the absolute value of the integer is equal to "the magnitude of the ( n -1)-bit binary pattern", as in sign-magnitude and 1's complement.
  • for negative integers, the absolute value of the integer is equal to "the magnitude of the complement of the ( n -1)-bit binary pattern plus one " (hence called 2's complement).

Example 1 : Suppose that n=8 and the binary representation is 0 100 0001B .    Sign bit is 0 ⇒ positive    Absolute value is 100 0001B = 65D    Hence, the integer is +65D

Example 2 : Suppose that n=8 and the binary representation is 1 000 0001B .    Sign bit is 1 ⇒ negative    Absolute value is the complement of 000 0001B plus 1 , i.e., 111 1110B + 1B = 127D    Hence, the integer is -127D

Example 3 : Suppose that n=8 and the binary representation is 0 000 0000B .    Sign bit is 0 ⇒ positive    Absolute value is 000 0000B = 0D    Hence, the integer is 0D

Example 4 : Suppose that n=8 and the binary representation is 1 111 1111B .    Sign bit is 1 ⇒ negative    Absolute value is the complement of 111 1111B plus 1 , i.e., 000 0000B + 1B = 1D    Hence, the integer is -1D

[Figure: 2's complement representation]

Computers use 2's Complement Representation for Signed Integers

We have discussed three representations for signed integers: signed-magnitude, 1's complement and 2's complement. Computers use 2's complement in representing signed integers. This is because:

  • There is only one representation for the number zero in 2's complement, instead of two representations in sign-magnitude and 1's complement.
  • Positive and negative integers can be treated together in addition and subtraction. Subtraction can be carried out using the "addition logic".

Example 1: Addition of Two Positive Integers: Suppose that n=8, 65D + 5D = 70D

Example 2: Subtraction is treated as Addition of a Positive and a Negative Integer: Suppose that n=8, 65D - 5D = 65D + (-5D) = 60D

Example 3: Addition of Two Negative Integers: Suppose that n=8, -65D - 5D = (-65D) + (-5D) = -70D

Because of the fixed precision (i.e., fixed number of bits ), an n -bit 2's complement signed integer has a certain range. For example, for n =8 , the range of 2's complement signed integers is -128 to +127 . During addition (and subtraction), it is important to check whether the result exceeds this range, in other words, whether overflow or underflow has occurred.

Example 4: Overflow: Suppose that n=8, 127D + 2D = 129D (overflow - beyond the range)

Example 5: Underflow: Suppose that n=8, -125D - 5D = -130D (underflow - below the range)
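
Examples 1 to 5 are easy to reproduce in C++ using the fixed-width 8-bit integer type (a small demonstration, not part of the original text; the cast back to int8_t wraps modulo 256 on 2's complement machines):

    #include <cstdint>
    #include <cstdio>

    int main() {
        std::int8_t a = 65, b = 5;
        std::printf("%d\n", (std::int8_t)(a + b));    // 70
        std::printf("%d\n", (std::int8_t)(a - b));    // 60
        std::printf("%d\n", (std::int8_t)(-a - b));   // -70
        std::printf("%d\n", (std::int8_t)(127 + 2));  // -127: the overflow wraps around
        return 0;
    }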

The following diagram explains how the 2's complement works. By re-arranging the number line, values from -128 to +127 are represented contiguously by ignoring the carry bit.

[Figure: 2's complement signed integers on the re-arranged number line]

Range of n-bit 2's Complement Signed Integers

An n-bit 2's complement signed integer can represent integers from -2^(n-1) to +2^(n-1)-1 , as tabulated. Take note that the scheme can represent all the integers within the range, without any gap. In other words, there are no missing integers within the supported range.

n minimum maximum
8 -(2^7)  (=-128) +(2^7)-1  (=+127)
16 -(2^15) (=-32,768) +(2^15)-1 (=+32,767)
32 -(2^31) (=-2,147,483,648) +(2^31)-1 (=+2,147,483,647)(9+ digits)
64 -(2^63) (=-9,223,372,036,854,775,808) +(2^63)-1 (=+9,223,372,036,854,775,807)(18+ digits)

Decoding 2's Complement Numbers

  • Check the sign bit (denoted as S ).
  • If S=0 , the number is positive and its absolute value is the binary value of the remaining n -1 bits.
  • If S=1 , the number is negative. You could "invert the n-1 bits and plus 1" to get the absolute value of the negative number. Alternatively, you could scan the remaining n-1 bits from the right (least-significant bit), look for the first occurrence of 1, and flip all the bits to the left of that first occurrence of 1. The flipped pattern gives the absolute value. For example, n = 8, bit pattern = 1 100 0100B: S = 1 → negative. Scanning from the right and flipping all the bits to the left of the first occurrence of 1 gives 011 1100B = 60D. Hence, the value is -60D.

Big Endian vs. Little Endian

Modern computers store one byte of data in each memory address or location, i.e., memory is byte-addressable. A 32-bit integer is, therefore, stored in 4 memory addresses.

The term "Endian" refers to the order of storing bytes in computer memory. In the "Big Endian" scheme, the most significant byte is stored first, in the lowest memory address ("big end first"), while "Little Endian" stores the least significant byte in the lowest memory address.

For example, the 32-bit integer 12345678H (305419896D) is stored as 12H 34H 56H 78H in big endian, and as 78H 56H 34H 12H in little endian. A 16-bit pattern 00H 01H is interpreted as 0001H in big endian, and as 0100H in little endian.
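
You can observe your machine's byte order directly (a small C++ check, not part of the original text):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        std::uint32_t x = 0x12345678;
        unsigned char bytes[4];
        std::memcpy(bytes, &x, 4);   // copy the integer's bytes in memory order
        // Little endian prints: 78 56 34 12; big endian prints: 12 34 56 78
        std::printf("%02X %02X %02X %02X\n", bytes[0], bytes[1], bytes[2], bytes[3]);
        return 0;
    }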

Exercise (Integer Representation)

  • What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integer, in "unsigned" and "signed" representation?
  • Give the value of 88 , 0 , 1 , 127 , and 255 in 8-bit unsigned representation.
  • Give the value of +88 , -88 , -1 , 0 , +1 , -128 , and +127 in 8-bit 2's complement signed representation.
  • Give the value of +88 , -88 , -1 , 0 , +1 , -127 , and +127 in 8-bit sign-magnitude representation.
  • Give the value of +88 , -88 , -1 , 0 , +1 , -127 and +127 in 8-bit 1's complement representation.
  • [TODO] more.
  • The range of unsigned n -bit integers is [0, 2^n - 1] . The range of n -bit 2's complement signed integer is [-2^(n-1), +2^(n-1)-1] ;
  • 88 (0101 1000) , 0 (0000 0000) , 1 (0000 0001) , 127 (0111 1111) , 255 (1111 1111) .
  • +88 (0101 1000) , -88 (1010 1000) , -1 (1111 1111) , 0 (0000 0000) , +1 (0000 0001) , -128 (1000 0000) , +127 (0111 1111) .
  • +88 (0101 1000) , -88 (1101 1000) , -1 (1000 0001) , 0 (0000 0000 or 1000 0000) , +1 (0000 0001) , -127 (1111 1111) , +127 (0111 1111) .
  • +88 (0101 1000) , -88 (1010 0111) , -1 (1111 1110) , 0 (0000 0000 or 1111 1111) , +1 (0000 0001) , -127 (1000 0000) , +127 (0111 1111) .

Floating-Point Number Representation

A floating-point number (or real number) can represent a very large value (e.g., 1.23×10^88 ) or a very small value (e.g., 1.23×10^-88 ). It could also represent a very large negative number (e.g., -1.23×10^88 ) and a very small negative number (e.g., -1.23×10^-88 ), as well as zero, as illustrated:

[Figure: representation of floating-point numbers]

A floating-point number is typically expressed in the scientific notation, with a fraction ( F ), and an exponent ( E ) of a certain radix ( r ), in the form of F×r^E . Decimal numbers use radix of 10 ( F×10^E ); while binary numbers use radix of 2 ( F×2^E ).

The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1 , 0.5566×10^2 , 0.05566×10^3 , and so on. The fractional part can be normalized . In the normalized form, there is only a single non-zero digit before the radix point. For example, the decimal number 123.4567 can be normalized as 1.234567×10^2 ; the binary number 1010.1011B can be normalized as 1.0101011B×2^3 .

It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinitely many real numbers (even within a small range, say 0.0 to 0.1). On the other hand, an n-bit binary pattern can represent only a finite number (2^n) of distinct values. Hence, not all real numbers can be represented. The nearest approximation will be used instead, resulting in loss of accuracy.

It is also important to note that floating-point arithmetic is very much less efficient than integer arithmetic. It can be sped up with a so-called dedicated floating-point co-processor . Hence, use integers if your application does not require floating-point numbers.

In computers, floating-point numbers are represented in scientific notation of fraction ( F ) and exponent ( E ) with a radix of 2, in the form of F×2^E . Both E and F can be positive as well as negative. Modern computers adopt IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.

IEEE-754 32-bit Single-Precision Floating-Point Numbers

In 32-bit single-precision floating-point representation:

  • The most significant bit is the sign bit ( S ), with 0 for positive numbers and 1 for negative numbers.
  • The following 8 bits represent exponent ( E ).
  • The remaining 23 bits represent the fraction ( F ).

[Figure: IEEE-754 32-bit single-precision layout]

Normalized Form

Let's illustrate with an example, suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000 , with:

  • E = 1000 0001
  • F = 011 0000 0000 0000 0000 0000

In the normalized form , the actual fraction is normalized with an implicit leading 1 in the form of 1.F . In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D .

The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative number. In this example with S=1 , this is a negative number, i.e., -1.375D .

In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because we need to represent both positive and negative exponent. With an 8-bit E, ranging from 0 to 255, the excess-127 scheme could provide actual exponent of -127 to 128. In this example, E-127=129-127=2D .

Hence, the number represented is -1.375×2^2=-5.5D .
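
You can verify this result by reinterpreting the bit pattern directly. The 32-bit pattern above is C0B0 0000H, and in C++ (a small demonstration, not part of the original text; the analogous Java methods are mentioned later in this article):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        std::uint32_t bits = 0xC0B00000;  // 1 10000001 01100000000000000000000
        float f;
        std::memcpy(&f, &bits, sizeof f); // reinterpret the pattern as IEEE-754
        std::printf("%g\n", f);           // prints -5.5
        return 0;
    }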

De-Normalized Form

Normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot represent the number zero! Convince yourself of this!

De-normalized form was devised to represent zero and other numbers.

For E=0 , the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for the fraction; and the actual exponent is always -126 . Hence, the number zero can be represented with E=0 and F=0 (because 0.0×2^-126=0 ).

We can also represent very small positive and negative numbers in de-normalized form with E=0 . For example, if S=1 , E=0 , and F=011 0000 0000 0000 0000 0000 . The actual fraction is 0.011=1×2^-2+1×2^-3=0.375D . Since S=1 , it is a negative number. With E=0 , the actual exponent is -126 . Hence the number is -0.375×2^-126 = -4.4×10^-39 , which is an extremely small negative number (close to zero).

In summary, the value ( N ) is calculated as follows:

  • For 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127) . These numbers are in the so-called normalized form. The sign-bit represents the sign of the number. Fractional part ( 1.F ) are normalized with an implicit leading 1. The exponent is bias (or in excess) of 127 , so as to represent both positive and negative exponent. The range of exponent is -126 to +127 .
  • For E = 0, N = (-1)^S × 0.F × 2^(-126) . These numbers are in the so-called denormalized form. The exponent of 2^-126 evaluates to a very small number. The denormalized form is needed to represent zero (with F=0 and E=0 ). It can also represent very small positive and negative numbers close to zero.
  • For E = 255 , it represents special values, such as ±INF (positive and negative infinity) and NaN (not a number). This is beyond the scope of this article.

Example 1: Suppose that the IEEE-754 32-bit floating-point representation pattern is 0 10000000 110 0000 0000 0000 0000 0000 . With S=0, E=1000 0000B=128D and 1.F=1.11B=1.75D, the number is +1.75×2^(128-127) = +3.5D.

Example 2: Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 01111110 100 0000 0000 0000 0000 0000 . With S=1, E=0111 1110B=126D and 1.F=1.1B=1.5D, the number is -1.5×2^(126-127) = -0.75D.

Example 3: Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 01111110 000 0000 0000 0000 0000 0001 . With S=1, E=126D and 1.F=1+2^-23, the number is -(1+2^-23)×2^-1 ≈ -0.50000006D.

Example 4 (De-Normalized Form): Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 00000000 000 0000 0000 0000 0000 0001 . With S=1 and E=0, this is in de-normalized form: 0.F=2^-23 and the actual exponent is -126, so the number is -2^-23×2^-126 = -2^-149 ≈ -1.4×10^-45.

Exercises (Floating-point Numbers)

  • Compute the largest and smallest positive numbers that can be represented in the 32-bit normalized form.
  • Compute the largest and smallest negative numbers can be represented in the 32-bit normalized form.
  • Repeat (1) for the 32-bit denormalized form.
  • Repeat (2) for the 32-bit denormalized form.
  • Largest positive number: S=0 , E=1111 1110 (254) , F=111 1111 1111 1111 1111 1111 . Smallest positive number: S=0 , E=0000 0001 (1) , F=000 0000 0000 0000 0000 0000 .
  • Same as above, but S=1 .
  • Largest positive number: S=0 , E=0 , F=111 1111 1111 1111 1111 1111 . Smallest positive number: S=0 , E=0 , F=000 0000 0000 0000 0000 0001 .

Notes For Java Users

You can use the JDK methods Float.intBitsToFloat(int bits) or Double.longBitsToDouble(long bits) to create a single-precision 32-bit float or a double-precision 64-bit double with a specific bit pattern, and print their values.

IEEE-754 64-bit Double-Precision Floating-Point Numbers

The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:

  • The following 11 bits represent exponent ( E ).
  • The remaining 52 bits represent the fraction ( F ).

[Figure: IEEE-754 64-bit double-precision layout]

The value ( N ) is calculated as follows:

  • Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023) .
  • Denormalized form: For E = 0, N = (-1)^S × 0.F × 2^(-1022) . These are in the denormalized form.
  • For E = 2047 , N represents special values, such as ±INF (infinity), NaN (not a number).

More on Floating-Point Representation

There are three parts in the floating-point representation:

  • The sign bit ( S ) is self-explanatory (0 for positive numbers and 1 for negative numbers).
  • For the exponent ( E ), a so-called bias (or excess ) is applied so as to represent both positive and negative exponent. The bias is set at half of the range. For single precision with an 8-bit exponent, the bias is 127 (or excess-127). For double precision with a 11-bit exponent, the bias is 1023 (or excess-1023).
  • The fraction ( F ) (also called the mantissa or significand ) is composed of an implicit leading bit (before the radix point) and the fractional bits (after the radix point). The leading bit for normalized numbers is 1; while the leading bit for denormalized numbers is 0.

Normalized Floating-Point Numbers

In normalized form, the radix point is placed after the first non-zero digit, e.g., 9.8765D×10^-23 , 1.001011B×2^11 . For binary numbers, the leading bit is always 1, and need not be represented explicitly - this saves 1 bit of storage.

In IEEE 754's normalized form:

  • For single-precision, 1 ≤ E ≤ 254 with excess of 127. Hence, the actual exponent is from -126 to +127 . Negative exponents are used to represent small numbers (< 1.0); while positive exponents are used to represent large numbers (> 1.0).     N = (-1)^S × 1.F × 2^(E-127)
  • For double-precision, 1 ≤ E ≤ 2046 with excess of 1023. The actual exponent is from -1022 to +1023 , and     N = (-1)^S × 1.F × 2^(E-1023)

Take note that an n-bit pattern has a finite number of combinations (=2^n), which can represent only finitely many distinct numbers. It is not possible to represent the infinitely many numbers on the real axis (even a small range, say 0.0 to 1.0, contains infinitely many numbers). That is, not all floating-point numbers can be accurately represented. Instead, the closest approximation is used, which leads to loss of accuracy .

The minimum and maximum normalized floating-point numbers are:

Single precision:
  N(min) = 0080 0000H = 0 00000001 00000000000000000000000B (E = 1, F = 0)
         = 1.0B × 2^-126 (≈1.17549435 × 10^-38)
  N(max) = 7F7F FFFFH = 0 11111110 11111111111111111111111B (E = 254, F = all 1's)
         = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127 (≈3.4028235 × 10^38)

Double precision:
  N(min) = 0010 0000 0000 0000H
         = 1.0B × 2^-1022 (≈2.2250738585072014 × 10^-308)
  N(max) = 7FEF FFFF FFFF FFFFH
         = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023 (≈1.7976931348623157 × 10^308)

[Figure: normalized and denormalized floating-point numbers on the real number line]

Denormalized Floating-Point Numbers

If E = 0 , but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:

  • For single-precision, E = 0 ,     N = (-1)^S × 0.F × 2^(-126)
  • For double-precision, E = 0 ,     N = (-1)^S × 0.F × 2^(-1022)

Denormalized form can represent very small numbers close to zero, and zero itself, which cannot be represented in normalized form, as shown in the above figure.

The minimum and maximum of denormalized floating-point numbers are:

Single precision:
  D(min) = 0000 0001H = 0 00000000 00000000000000000000001B (E = 0, F = 0...001)
         = 0.0...1B × 2^-126 = 1 × 2^-23 × 2^-126 = 2^-149 (≈1.4 × 10^-45)
  D(max) = 007F FFFFH = 0 00000000 11111111111111111111111B (E = 0, F = all 1's)
         = 0.1...1B × 2^-126 = (1 - 2^-23) × 2^-126 (≈1.1754942 × 10^-38)

Double precision:
  D(min) = 0000 0000 0000 0001H
         = 0.0...1B × 2^-1022 = 1 × 2^-52 × 2^-1022 = 2^-1074 (≈4.9 × 10^-324)
  D(max) = 000F FFFF FFFF FFFFH
         = 0.1...1B × 2^-1022 = (1 - 2^-52) × 2^-1022 (≈2.2250738585072009 × 10^-308)

Special Values

Zero : Zero cannot be represented in the normalized form, and must be represented in denormalized form with E=0 and F=0 . There are two representations for zero: +0 with S=0 and -0 with S=1 .

Infinity : The values +infinity (e.g., 1/0 ) and -infinity (e.g., -1/0 ) are represented with an exponent of all 1's ( E = 255 for single-precision and E = 2047 for double-precision), F=0 , and S=0 (for +INF ) or S=1 (for -INF ).

Not a Number (NaN) : NaN denotes a value that cannot be represented as a real number (e.g., 0/0 ). NaN is represented with an exponent of all 1's ( E = 255 for single-precision and E = 2047 for double-precision) and any non-zero fraction.

Character Encoding

In computer memory, characters are "encoded" (or "represented") using a chosen "character encoding scheme" (aka "character set", "charset", "character map", or "code page").

For example, in ASCII (as well as Latin1, Unicode, and many other character sets):

  • code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z' , respectively.
  • code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z' , respectively.
  • code numbers 48D (30H) to 57D (39H) represent '0' to '9' , respectively.

It is important to note that the representation scheme must be known before a binary pattern can be interpreted. E.g., the 8-bit pattern " 0100 0010B " could represent anything under the sun, known only to the person who encoded it.

The most commonly-used character encoding schemes are: 7-bit ASCII (ISO/IEC 646) and 8-bit Latin-x (ISO/IEC 8859-x) for Western European characters, and Unicode (ISO/IEC 10646) for internationalization (i18n).

A 7-bit encoding scheme (such as ASCII) can represent 128 characters and symbols. An 8-bit character encoding scheme (such as Latin-x) can represent 256 characters and symbols; whereas a 16-bit encoding scheme (such as Unicode UCS-2) can represent 65,536 characters and symbols.

7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)

  • ASCII (American Standard Code for Information Interchange) is one of the earlier character coding schemes.
  • ASCII was originally a 7-bit code. It has been extended to 8 bits to better utilize the 8-bit computer memory organization. (The 8th bit was originally used for parity checking in early computers.)
Hex  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
2    SP !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3    0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4    @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5    P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6    `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7    p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~

Dec  0  1  2  3  4  5  6  7  8  9
3          SP !  "  #  $  %  &  '
4    (  )  *  +  ,  -  .  /  0  1
5    2  3  4  5  6  7  8  9  :  ;
6    <  =  >  ?  @  A  B  C  D  E
7    F  G  H  I  J  K  L  M  N  O
8    P  Q  R  S  T  U  V  W  X  Y
9    Z  [  \  ]  ^  _  `  a  b  c
10   d  e  f  g  h  i  j  k  l  m
11   n  o  p  q  r  s  t  u  v  w
12   x  y  z  {  |  }  ~
  • Code number 32D (20H) is the blank or space character.
  • '0' to '9' : 30H-39H (0011 0000B to 0011 1001B) or (0011 xxxxB where xxxx is the equivalent integer value ).
  • 'A' to 'Z' : 41H-5AH (0100 0001B to 0101 1010B) or (010x xxxxB) . 'A' to 'Z' are continuous without gap.
  • 'a' to 'z' : 61H-7AH (0110 0001B to 0111 1010B) or (011x xxxxB) . 'a' to 'z' are also continuous without gap. However, there is a gap between uppercase and lowercase letters. To convert between upper and lowercase, flip the value of bit-5 (see the sketch after this list).
  • 09H for Tab ( '\t' ).
  • 0AH for Line-Feed or newline (LF or '\n' ) and 0DH for Carriage-Return (CR or '\r' ), which are used as line delimiters (aka line separator , end-of-line ) for text files. There is unfortunately no standard for the line delimiter: Unixes and Mac use 0AH (LF or " \n "), while Windows uses 0D0AH (CR+LF or " \r\n "). Programming languages such as C/C++/Java (which were created on Unix) use 0AH (LF or " \n ").
  • In programming languages such as C/C++/Java, line-feed ( 0AH ) is denoted as '\n' , carriage-return ( 0DH ) as '\r' , and tab ( 09H ) as '\t' .
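
The contiguity of the digit and letter blocks, and the bit-5 case trick, can be demonstrated with a minimal Java sketch (the sample characters are arbitrary):

public class AsciiTricks {
   public static void main(String[] args) {
      char c = 'g';
      char upper = (char) (c ^ 0x20);  // flip bit-5 (20H) to toggle the case
      System.out.println(upper);       // G
      char digit = '7';
      int value = digit - '0';         // works because '0' to '9' are contiguous (30H-39H)
      System.out.println(value);       // 7
      System.out.println((int) '\n');  // 10 (0AH, line-feed)
   }
}
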
DEC HEX Meaning                      DEC HEX Meaning
0   00  NUL Null                     17  11  DC1 Device Control 1
1   01  SOH Start of Heading         18  12  DC2 Device Control 2
2   02  STX Start of Text            19  13  DC3 Device Control 3
3   03  ETX End of Text              20  14  DC4 Device Control 4
4   04  EOT End of Transmission      21  15  NAK Negative Ack.
5   05  ENQ Enquiry                  22  16  SYN Sync. Idle
6   06  ACK Acknowledgment           23  17  ETB End of Transmission Block
7   07  BEL Bell                     24  18  CAN Cancel
8   08  BS  Back Space               25  19  EM  End of Medium
9   09  HT  Horizontal Tab           26  1A  SUB Substitute
10  0A  LF  Line Feed                27  1B  ESC Escape
11  0B  VT  Vertical Feed            28  1C  IS4 File Separator
12  0C  FF  Form Feed                29  1D  IS3 Group Separator
13  0D  CR  Carriage Return          30  1E  IS2 Record Separator
14  0E  SO  Shift Out                31  1F  IS1 Unit Separator
15  0F  SI  Shift In                 127 7F  DEL Delete
16  10  DLE Datalink Escape

8-bit Latin-1 (aka ISO/IEC 8859-1)

ISO/IEC-8859 is a collection of 8-bit character encoding standards for the western languages.

ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used encoding scheme for Western European languages. It has 191 printable characters from the Latin script, which covers languages like English, German, Italian, Portuguese and Spanish. Latin-1 is backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)) are the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned. Code numbers 160 (A0H) to 255 (FFH) are given as follows:

Hex  0    1  2  3  4  5  6  7  8  9  A  B  C  D    E  F
A    NBSP ¡  ¢  £  ¤  ¥  ¦  §  ¨  ©  ª  «  ¬  SHY  ®  ¯
B    °    ±  ²  ³  ´  µ  ¶  ·  ¸  ¹  º  »  ¼  ½    ¾  ¿
C    À    Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í    Î  Ï
D    Ð    Ñ  Ò  Ó  Ô  Õ  Ö  ×  Ø  Ù  Ú  Û  Ü  Ý    Þ  ß
E    à    á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í    î  ï
F    ð    ñ  ò  ó  ô  õ  ö  ÷  ø  ù  ú  û  ü  ý    þ  ÿ

ISO/IEC-8859 has 16 parts. Besides the most commonly-used Part 1, Part 2 is meant for Central European (Polish, Czech, Hungarian, etc.), Part 3 for South European (Turkish, etc.), Part 4 for North European (Estonian, Latvian, etc.), Part 5 for Cyrillic, Part 6 for Arabic, Part 7 for Greek, Part 8 for Hebrew, Part 9 for Turkish, Part 10 for Nordic, Part 11 for Thai, Part 12 was abandoned, Part 13 for Baltic Rim, Part 14 for Celtic, Part 15 for French, Finnish, etc., and Part 16 for South-Eastern European.

Other 8-bit Extension of US-ASCII (ASCII Extensions)

Besides the standardized ISO-8859-x, there are many 8-bit ASCII extensions, which are not compatible with each other.

ANSI (American National Standards Institute) (aka Windows-1252 , or Windows Codepage 1252): for Latin alphabets used in the legacy DOS/Windows systems. It is a superset of ISO-8859-1, with code numbers 128 (80H) to 159 (9FH) assigned to displayable characters, such as "smart" single-quotes and double-quotes. A common problem in web browsers is that all the quotes and apostrophes (produced by "smart quotes" in some Microsoft software) are replaced with question marks or some strange symbols. This is because the document is labeled as ISO-8859-1 (instead of Windows-1252), where these code numbers are undefined. Most modern browsers and e-mail clients treat charset ISO-8859-1 as Windows-1252 in order to accommodate such mis-labeling.

Hex  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
8    €     ‚  ƒ  „  …  †  ‡  ˆ  ‰  Š  ‹  Œ     Ž
9       ‘  ’  “  ”  •  –  —  ˜  ™  š  ›  œ     ž  Ÿ

EBCDIC (Extended Binary Coded Decimal Interchange Code): Used in the early IBM computers.

Unicode (aka ISO/IEC 10646 Universal Character Set)

Before Unicode, no single character encoding scheme could represent characters in all languages. For example, Western European languages use several encoding schemes (in the ISO-8859-x family). Even a single language like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many encoding schemes are in conflict with each other, i.e., the same code number is assigned to different characters.

Unicode aims to provide a standard character encoding scheme, which is universal, efficient, uniform and unambiguous. Unicode standard is maintained by a non-profit organization called the Unicode Consortium (@ www.unicode.org ). Unicode is an ISO/IEC standard 10646.

Unicode is backward compatible with the 7-bit US-ASCII and 8-bit Latin-1 (ISO-8859-1). That is, the first 128 characters are the same as US-ASCII; and the first 256 characters are the same as Latin-1.

Unicode originally used 16 bits (called UCS-2 or Unicode Character Set - 2 byte), which can represent up to 65,536 characters. It has since been expanded beyond 16 bits, and currently stands at 21 bits. The range of the legal codes in ISO/IEC 10646 is now from U+0000H to U+10FFFFH (21 bits, or 1,114,112 code points), covering all current and ancient historical scripts. The original 16-bit range of U+0000H to U+FFFFH (65,536 characters) is known as the Basic Multilingual Plane (BMP), covering all the major languages in current use. The characters outside the BMP are called Supplementary Characters , which are less frequently used.

Unicode has two encoding schemes:

  • UCS-2 (Universal Character Set - 2 Byte): Uses 2 bytes (16 bits), covering 65,536 characters in the BMP. BMP is sufficient for most of the applications. UCS-2 is now obsolete.
  • UCS-4 (Universal Character Set - 4 Byte): Uses 4 bytes (32 bits), covering BMP and the supplementary characters.


UTF-8 (Unicode Transformation Format - 8-bit)

The 16/32-bit Unicode (UCS-2/4) is grossly inefficient if the document contains mainly ASCII characters, because each character occupies two bytes of storage. Variable-length encoding schemes, such as UTF-8, which uses 1-4 bytes to represent a character, were devised to improve the efficiency. In UTF-8, the 128 commonly-used US-ASCII characters use only 1 byte, but some less common characters may require up to 4 bytes. Overall, the efficiency is improved for documents containing mainly US-ASCII text.

The transformation between Unicode and UTF-8 is as follows:

Bits Unicode UTF-8 Code Bytes
7 00000000 0xxxxxxx 0xxxxxxx 1 (ASCII)
11 00000yyy yyxxxxxx 110yyyyy 10xxxxxx 2
16 zzzzyyyy yyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx 3
21 000uuuuu zzzzyyyy yyxxxxxx 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx 4

In UTF-8, Unicode numbers corresponding to the 7-bit ASCII characters are padded with a leading zero, and thus have the same values as in ASCII. Hence, UTF-8 can be used with all software using ASCII. Unicode numbers of 128 and above, which are less frequently used, are encoded using more bytes (2-4 bytes). UTF-8 generally requires less storage and is compatible with ASCII. The drawback of UTF-8 is that more processing power is needed to unpack the code due to its variable length. UTF-8 is the most popular format for Unicode.

  • UTF-8 uses 1-3 bytes for the characters in BMP (16-bit), and 4 bytes for supplementary characters outside BMP (21-bit).
  • The 128 ASCII characters (basic Latin letters, digits, and punctuation signs) use one byte. Most European and Middle East characters use a 2-byte sequence, which includes extended Latin letters (with tilde, macron, acute, grave and other accents), Greek, Armenian, Hebrew, Arabic, and others. Chinese, Japanese and Korean (CJK) use three-byte sequences.
  • All the bytes, except the 128 ASCII characters, have a leading '1' bit. In other words, the ASCII bytes, with a leading '0' bit, can be identified and decoded easily.
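
As an illustration of the 3-byte row of the transformation table, here is a minimal Java sketch that hand-encodes a BMP code point in the range 0800H-FFFFH (the code point 60A8H is chosen to match the example below):

public class Utf8Encode3 {
   // manual 3-byte UTF-8 encoding of a BMP code point (0800H-FFFFH), per the table above
   static byte[] encode3(int cp) {
      return new byte[] {
         (byte) (0xE0 | (cp >>> 12)),          // 1110zzzz
         (byte) (0x80 | ((cp >>> 6) & 0x3F)),  // 10yyyyyy
         (byte) (0x80 | (cp & 0x3F))           // 10xxxxxx
      };
   }
   public static void main(String[] args) {
      for (byte b : encode3(0x60A8)) {         // U+60A8 = 您
         System.out.printf("%02X ", b);        // E6 82 A8
      }
   }
}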

Example : 您好 (Unicode: 60A8H 597DH) is encoded in UTF-8 as E6 82 A8H and E5 A5 BDH , respectively.

UTF-16 (Unicode Transformation Format - 16-bit)

UTF-16 is a variable-length Unicode character encoding scheme, which uses 2 to 4 bytes. UTF-16 is not commonly used. The transformation table is as follows:

Unicode                                 UTF-16 Code                            Bytes
xxxxxxxx xxxxxxxx                       Same as UCS-2 - no encoding            2
000uuuuu zzzzyyyy yyxxxxxx (uuuuu≠0)    110110ww wwzzzzyy 110111yy yyxxxxxx    4
                                        (where wwww = uuuuu - 1)

Take note that for the 65,536 characters in the BMP, UTF-16 is the same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary characters outside the BMP: each such character requires a pair of 16-bit values, the first from the high-surrogates range ( \uD800-\uDBFF ), the second from the low-surrogates range ( \uDC00-\uDFFF ).
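
A minimal Java sketch illustrating the surrogate-pair computation (the code point 1F600H is an arbitrary supplementary character chosen for illustration):

public class SurrogateDemo {
   public static void main(String[] args) {
      int cp = 0x1F600;                     // a supplementary character outside the BMP
      char[] pair = Character.toChars(cp);  // the JDK computes the surrogate pair
      System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D83D DE00
      // manual computation per the table: subtract 10000H, split into high/low 10 bits
      int v = cp - 0x10000;
      int high = 0xD800 + (v >>> 10);
      int low  = 0xDC00 + (v & 0x3FF);
      System.out.printf("%04X %04X%n", high, low);                     // D83D DE00
   }
}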

UTF-32 (Unicode Transformation Format - 32-bit)

Same as UCS-4, which uses 4 bytes for each character - unencoded.

Formats of Multi-Byte (e.g., Unicode) Text Files

Endianness (or byte-order) : For a multi-byte character, you need to take care of the order of the bytes in storage. In big endian , the most significant byte is stored at the memory location with the lowest address (big byte first). In little endian , the most significant byte is stored at the memory location with the highest address (little byte first). For example, 您 (with Unicode number 60A8H ) is stored as 60 A8 in big endian, and stored as A8 60 in little endian. Big endian, which produces a more readable hex dump, is more commonly-used, and is often the default.
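
The two byte orders can be demonstrated with java.nio.ByteBuffer; a minimal sketch:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
   public static void main(String[] args) {
      char c = '\u60A8';  // 您
      ByteBuffer be = ByteBuffer.allocate(2).order(ByteOrder.BIG_ENDIAN);
      be.putChar(c);
      System.out.printf("%02X %02X%n", be.get(0) & 0xFF, be.get(1) & 0xFF);  // 60 A8
      ByteBuffer le = ByteBuffer.allocate(2).order(ByteOrder.LITTLE_ENDIAN);
      le.putChar(c);
      System.out.printf("%02X %02X%n", le.get(0) & 0xFF, le.get(1) & 0xFF);  // A8 60
   }
}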

BOM (Byte Order Mark) : BOM is a special Unicode character with code number FEFFH , which is used to differentiate big-endian from little-endian. In big-endian storage, the BOM appears as FE FFH ; in little-endian storage, as FF FEH . Unicode reserves these two code numbers to prevent them from clashing with any other character.

Unicode text files could take on these formats:

  • Big Endian: UCS-2BE, UTF-16BE, UTF-32BE.
  • Little Endian: UCS-2LE, UTF-16LE, UTF-32LE.
  • UTF-16 with BOM. The first character of the file is a BOM character, which specifies the endianess. For big-endian, BOM appears as FE FFH in the storage. For little-endian, BOM appears as FF FEH .

A UTF-8 file needs no byte-order distinction, as it is a sequence of individual bytes; BOM plays no part. However, in some systems (in particular Windows), a BOM is added as the first character in the UTF-8 file as a signature to identify the file as UTF-8 encoded. The BOM character ( FEFFH ) is encoded in UTF-8 as EF BB BF . Adding a BOM as the first character of the file is not recommended, as it may be incorrectly interpreted on other systems. You can have a UTF-8 file without a BOM.

Formats of Text Files

Line Delimiter or End-Of-Line (EOL) : Sometimes, when you use the Windows NotePad to open a text file (created in Unix or Mac), all the lines are joined together. This is because different operating platforms use different characters as the so-called line delimiter (or end-of-line or EOL). Two non-printable control characters are involved: 0AH (Line-Feed or LF) and 0DH (Carriage-Return or CR).

  • Windows/DOS uses 0D0AH (CR+LF or " \r\n ") as EOL.
  • Unix and Mac use 0AH (LF or " \n ") only.

End-of-File (EOF) : [TODO]

Windows' CMD Codepage

Character encoding scheme (charset) in Windows is called codepage . In CMD shell, you can issue command "chcp" to display the current codepage, or "chcp codepage-number" to change the codepage.

Take note that:

  • The default codepage 437 (used in the original DOS) is an 8-bit character set called Extended ASCII , which is different from Latin-1 for code numbers above 127.
  • Codepage 1252 (Windows-1252) is not exactly the same as Latin-1. It assigns code numbers 80H to 9FH to letters and punctuation, such as smart single-quotes and double-quotes. A common problem of browsers displaying quotes and apostrophes as question marks or boxes arises when the page is actually Windows-1252 but mislabelled as ISO-8859-1.
  • For internationalization and Chinese character sets: codepage 65001 for UTF-8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE, codepage 936 for Chinese characters in GB2312, codepage 950 for Chinese characters in Big5.

Chinese Character Sets

Unicode supports all languages, including Asian languages like Chinese (both simplified and traditional characters), Japanese and Korean (collectively called CJK). There are more than 20,000 CJK characters in Unicode. Unicode characters are often encoded in the UTF-8 scheme, which, unfortunately, requires 3 bytes for each CJK character, instead of 2 bytes in the unencoded UCS-2 (UTF-16).

Worse still, there are also various Chinese character sets, which are not compatible with Unicode:

  • GB2312/GBK: for simplified Chinese characters. GB2312 uses 2 bytes for each Chinese character. The most significant bit (MSB) of both bytes is set to 1 to co-exist with 7-bit ASCII, whose MSB is 0. There are about 6700 characters. GBK is an extension of GB2312, which includes more characters as well as traditional Chinese characters.
  • BIG5: for traditional Chinese characters. BIG5 also uses 2 bytes for each Chinese character, and the most significant bit of both bytes is also set to 1. BIG5 is not compatible with GBK, i.e., the same code number is assigned to different characters.

For example, the world is made more interesting with these many standards:

             Standard  Characters  Codes
Simplified   GB2312    和谐        BACD D0B3
             UCS-2     和谐        548C 8C10
             UTF-8     和谐        E5928C E8B090
Traditional  BIG5      和諧        A94D BFD3
             UCS-2     和諧        548C 8AE7
             UTF-8     和諧        E5928C E8ABA7

Notes for Windows' CMD Users : To display Chinese characters correctly in the CMD shell, you need to choose the correct codepage, e.g., 65001 for UTF-8, 936 for GB2312/GBK, 950 for Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, 437 for the original DOS. You can use the command " chcp " to display the current codepage and " chcp codepage_number " to change it. You also have to choose a font that can display the characters (e.g., Courier New, Consolas or Lucida Console, NOT Raster font).

Collating Sequences (for Ranking Characters)

A string consists of a sequence of characters in upper or lower cases, e.g., "apple" , "BOY" , "Cat" . In sorting or comparing strings, if we order the characters according to the underlying code numbers (e.g., US-ASCII) character-by-character, the order for the example would be "BOY" , "Cat" , "apple" , because uppercase letters have smaller code numbers than lowercase letters. This does not agree with the so-called dictionary order , where the same uppercase and lowercase letters have the same rank. Another common problem in ordering strings is that "10" (ten) is ordered in front of "2" to "9" , because strings are compared character-by-character.

Hence, in sorting or comparison of strings, a so-called collating sequence (or collation ) is often defined, which specifies the ranks for letters (uppercase, lowercase), numbers, and special symbols. There are many collating sequences available. It is entirely up to you to choose a collating sequence to meet your application's specific requirements. Some case-insensitive dictionary-order collating sequences have the same rank for same uppercase and lowercase letters, i.e., 'A' , 'a' ⇒ 'B' , 'b' ⇒ ... ⇒ 'Z' , 'z' . Some case-sensitive dictionary-order collating sequences put the uppercase letter before its lowercase counterpart, i.e., 'A' ⇒ 'B' ⇒ 'C' ... ⇒ 'a' ⇒ 'b' ⇒ 'c' ... . Typically, space is ranked before digits '0' to '9' , followed by the alphabets.
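
For illustration, Java ships a collating facility in java.text.Collator; a minimal sketch (the locale choice is an assumption for the example):

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
   public static void main(String[] args) {
      String[] words = { "BOY", "apple", "Cat" };
      Arrays.sort(words);  // code-number (UTF-16) order: [BOY, Cat, apple]
      System.out.println(Arrays.toString(words));
      Arrays.sort(words, Collator.getInstance(Locale.ENGLISH));  // dictionary order: [apple, BOY, Cat]
      System.out.println(Arrays.toString(words));
   }
}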

Collating sequence is often language dependent, as different languages use different sets of characters (e.g., á, é, a, α) with their own orders.

For Java Programmers - java.nio.Charset

JDK 1.4 introduced a new java.nio.charset package to support encoding/decoding of characters between the UCS-2 representation used internally in Java programs and any supported charset used by external devices.

Example : The following program encodes some Unicode text in various encoding schemes, and displays the hex codes of the encoded byte sequences.
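
The original listing is not reproduced here; the following is a minimal reconstruction of such a program (the class name and the sample text 您好 are my own assumptions):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TestCharsetEncode {
   public static void main(String[] args) {
      String msg = "您好";  // U+60A8 U+597D
      Charset[] charsets = { StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE };
      for (Charset cs : charsets) {
         StringBuilder hex = new StringBuilder();
         for (byte b : msg.getBytes(cs)) {
            hex.append(String.format("%02X ", b));
         }
         System.out.println(cs + ": " + hex);  // e.g., UTF-8: E6 82 A8 E5 A5 BD
      }
   }
}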

For Java Programmers - char and String

The char data type is based on the original 16-bit Unicode standard called UCS-2. Unicode has since evolved to 21 bits, with a code range of U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is known as the Basic Multilingual Plane ( BMP ). Characters above U+FFFF are called supplementary characters. A 16-bit Java char cannot hold a supplementary character.

Recall that in the UTF-16 encoding scheme, a BMP character uses 2 bytes; it is the same as UCS-2. A supplementary character uses 4 bytes, and requires a pair of 16-bit values, the first from the high-surrogates range ( \uD800-\uDBFF ), the second from the low-surrogates range ( \uDC00-\uDFFF ).

In Java, a String is a sequence of Unicode characters. Java, in fact, uses UTF-16 for String and StringBuffer . For BMP characters, they are the same as UCS-2. For supplementary characters, each character requires a pair of char values.

Java methods that accept a 16-bit char value do not support supplementary characters. Methods that accept a 32-bit int value support all Unicode characters (in the lower 21 bits), including supplementary characters.

This is meant to be an academic discussion. I have yet to encounter the use of supplementary characters!

Displaying Hex Values & Hex Editors

At times, you may need to display the hex values of a file, especially when dealing with Unicode characters. A hex editor is a handy tool that a good programmer should possess in his/her toolbox. There are many freeware/shareware hex editors available. Try googling "Hex Editor".

I use the following:

  • NotePad++ with Hex Editor Plug-in: Open-source and free. You can toggle between Hex view and Normal view by pushing the "H" button.
  • PSPad: Freeware. You can toggle to Hex view by choosing "View" menu and select "Hex Edit Mode".
  • TextPad: Shareware without expiration period. To view the Hex value, you need to "open" the file by choosing the file format of "binary" (??).
  • UltraEdit: Shareware, not free, 30-day trial only.

Let me know if you have a better choice, which is fast to launch, easy to use, can toggle between Hex and normal view, free, ....

The following Java program can be used to display hex code for Java Primitives (integer, character and floating-point):

System.out.println("Hex is " + Integer.toHexString(i)); // 3039 System.out.println("Binary is " + Integer.toBinaryString(i)); // 11000000111001 System.out.println("Octal is " + Integer.toOctalString(i)); // 30071 System.out.printf("Hex is %x\n", i); // 3039 System.out.printf("Octal is %o\n", i); // 30071 char c = 'a'; System.out.println("Character is " + c); // a System.out.printf("Character is %c\n", c); // a System.out.printf("Hex is %x\n", (short)c); // 61 System.out.printf("Decimal is %d\n", (short)c); // 97 float f = 3.5f; System.out.println("Decimal is " + f); // 3.5 System.out.println(Float.toHexString(f)); // 0x1.cp1 (Fraction=1.c, Exponent=1) f = -0.75f; System.out.println("Decimal is " + f); // -0.75 System.out.println(Float.toHexString(f)); // -0x1.8p-1 (F=-1.8, E=-1) double d = 11.22; System.out.println("Decimal is " + d); // 11.22 System.out.println(Double.toHexString(d)); // 0x1.670a3d70a3d71p3 (F=1.670a3d70a3d71 E=3) } }

In Eclipse, you can view the hex code for integer primitive Java variables in debug mode as follows: In debug perspective, "Variable" panel ⇒ Select the "menu" (inverted triangle) ⇒ Java ⇒ Java Preferences... ⇒ Primitive Display Options ⇒ Check "Display hexadecimal values (byte, short, char, int, long)".

Summary - Why Bother about Data Representation?

Integer number 1 , floating-point number 1.0 , character symbol '1' , and string "1" are totally different inside the computer memory. You need to know the difference to write good and high-performance programs.

  • In 8-bit signed integer , integer number 1 is represented as 00000001B .
  • In 8-bit unsigned integer , integer number 1 is represented as 00000001B .
  • In 16-bit signed integer , integer number 1 is represented as 00000000 00000001B .
  • In 32-bit signed integer , integer number 1 is represented as 00000000 00000000 00000000 00000001B .
  • In 32-bit floating-point representation , number 1.0 is represented as 0 01111111 0000000 00000000 00000000B , i.e., S=0 , E=127 , F=0 .
  • In 64-bit floating-point representation , number 1.0 is represented as 0 01111111111 0000 00000000 00000000 00000000 00000000 00000000 00000000B , i.e., S=0 , E=1023 , F=0 .
  • In 8-bit Latin-1, the character symbol '1' is represented as 00110001B (or 31H ).
  • In 16-bit UCS-2, the character symbol '1' is represented as 00000000 00110001B .
  • In UTF-8, the character symbol '1' is represented as 00110001B .

If you "add" a 16-bit signed integer 1 and Latin-1 character '1' or a string "1", you could get a surprise.

Exercises (Data Representation)

For the following 16-bit codes: 0000 0000 0010 1010B ( 002AH ) and 1000 0000 0010 1010B ( 802AH ).

Give their values, if they are representing:

  • a 16-bit unsigned integer;
  • a 16-bit signed integer;
  • two 8-bit unsigned integers;
  • two 8-bit signed integers;
  • a 16-bit Unicode character;
  • two 8-bit ISO-8859-1 characters.

Ans: (1) 42 , 32810 ; (2) 42 , -32726 ; (3) 0 , 42 ; 128 , 42 ; (4) 0 , 42 ; -128 , 42 ; (5) '*' ; '耪' ; (6) NUL , '*' ; PAD , '*' .

REFERENCES & RESOURCES

  • (Floating-Point Number Specification) IEEE 754 (1985), "IEEE Standard for Binary Floating-Point Arithmetic".
  • (ASCII Specification) ISO/IEC 646 (1991) (or ITU-T T.50-1992), "Information technology - 7-bit coded character set for information interchange".
  • (Latin-I Specification) ISO/IEC 8859-1, "Information technology - 8-bit single-byte coded graphic character sets - Part 1: Latin alphabet No. 1".
  • (Unicode Specification) ISO/IEC 10646, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)".
  • Unicode Consortium @ http://www.unicode.org .

Last modified: January, 2014


In computer organization, data refers to the symbols that are used to represent events, people, things and ideas.

The data can be represented in the following ways:

Data can be anything like a number, a name, notes in a musical composition, or the colour in a photograph. Data representation can be referred to as the form in which we store, process and transmit data. In order to store the data in digital format, we can use any device like computers, smartphones, and iPads. Electronic circuitry is used to handle the stored data.

Digitization is a type of process in which we convert information like photos, music, numbers and text into digital data. Electronic devices are used to manipulate these types of data. The digital revolution has evolved through four phases, starting with big, expensive standalone computers and progressing to today's digital world. All around the world, small and inexpensive devices are spreading everywhere.

Binary digits or bits are used to show the digital data, which is represented by 0 and 1. The binary digit can be called the smallest unit of information in a computer. The main use of a binary digit is that it can store information or data in the form of 0s and 1s. It contains a value that can be on/off or true/false: on or true is represented by 1, and off or false is represented by 0. A digital file is a simple file, which is used to collect data contained by a storage medium like a flash drive, CD, hard disk, or DVD.

The number can be represented in the following way:

Numeric data is used to contain numbers, which helps us to perform arithmetic operations. Digital devices use the binary number system to represent numeric data. The binary number system can only be represented by the two digits 0 and 1. There can't be any other digit, like 2, in the system. If we want to represent the number 2 in binary, then we will write it as 10.

The text can be represented in the following ways:

Character data can be formed with the help of symbols, letters, and numerals, but they can't be used in calculations. Using character data, we can form our address, hair colour, name, etc. Character data normally takes the form of text. With the help of text, we can describe many things, like our father's name, mother's name, etc.

Several types of codes are employed by computers to represent character data, including Unicode, ASCII, and other variants. The full form of ASCII is American Standard Code for Information Interchange. It is a type of character encoding standard, which is used for electronic communication. With the help of telecommunication equipment, computers and many other devices, ASCII code can represent text. The ASCII code needs 7 bits for each character, where every character is represented by a unique bit pattern. For the uppercase letter A, the ASCII code is represented as 1000001.

Extended ASCII can be described as a superset of ASCII. The ASCII set uses 7 bits to represent every character, but Extended ASCII uses 8 bits to represent each character: it keeps the 7-bit ASCII characters and uses the eighth bit for additional characters. Using 7 bits, the ASCII code provides codes for 128 unique symbols or characters, but Extended ASCII provides codes for 256 unique symbols or characters. For the uppercase letter A, the Extended ASCII code is represented as 01000001.

Unicode is also known as the universal character encoding standard. Unicode provides a way through which an individual character can be represented in web pages, text files, and other documents. Using ASCII, we can only represent the basic English characters, but with the help of Unicode, we can represent characters from all languages around the world.

ASCII code provides codes for 128 characters, while Unicode provides codes for roughly 65,000 characters with the help of 16 bits. In order to represent each character, ASCII code only uses 1 byte, while Unicode supports up to 4 bytes per character. Unicode encoding has several different types, but UTF-8 and UTF-16 are the most commonly used. UTF-8 is a type of variable-length coding scheme. It has also become the standard character encoding used on the web. Many software programs also set UTF-8 as their default encoding.

ASCII text can be used for numerals like phone numbers and social security numbers. ASCII text contains plain and unformatted text. This type of file is saved in a text file format with a name ending in .txt. These files are labelled differently on different systems: the Windows operating system labels these files as "Text document" and Apple devices label them as "Plain Text". There is no formatting in ASCII text files. If we want to make documents with styles and formats, then we have to embed formatting codes in the text.

Microsoft Word is used to create formatted text and documents, and uses the DOCX format to do this: if we create a new document using Microsoft Word 2007 or a later version, then it always uses DOCX as the default file format. Apple Pages uses the PAGES format to produce documents; as compared to Microsoft Word, it is simpler to create and edit documents using the Pages format. Adobe Acrobat uses the PDF (Portable Document Format) to create documents. Files saved in the PDF format cannot easily be modified, but we can easily print and share these files. If we save our document in PDF format, then we cannot change that file into a Microsoft Office file or any other file without specialised software.

HTML is the hypertext markup language. It is used for document designing, which will be displayed in a web browser. It uses markup tags to design documents. In HTML, hypertext is a type of text in a document containing links through which we can go to other places in the document, or to other documents. The markup language can be called a computer language; in order to define the elements within a document, this language uses tags.

The bits and bytes can be represented in the following ways:

In the field of digital communication or computers, bits are the most basic unit of information, or the smallest unit of data. Bit is short for binary digit, which means it can contain only one value, either 0 or 1. So bits can be represented by 0 or 1, - or +, false or true, off or on, or no or yes. Many technologies are based on bits and bytes, which are extensively useful to describe network access speed and storage capacity. The bit is usually abbreviated as a lowercase b.

In order to execute instructions and store data, bits are grouped together. A group of eight bits is called a byte, and it is usually abbreviated as an uppercase B. If we have four bytes, it will equal 32 bits (4*8 = 32), and 10 bytes will equal 80 bits (8*10 = 80).

Bits are used for data rates, like the speed of a movie download or of an internet connection. Bytes are used for storage capacity and file sizes. When we are reading something related to digital devices, we will frequently encounter references like 90 kilobits per second, 1.44 megabytes, 2.8 gigahertz, and 2 terabytes. To quantify digital data, we have many prefixes such as Kilo, Mega, Giga, Tera and many more similar terms, which are described as follows:

KB is also called kilobyte or KByte. It is mostly used while referring to the size of small computer files.

Kbps is short for kilobits per second (a kilobit is also called Kbit or Kb). 56 Kbps means 56 kilobits per second, which is used to describe slow data rates. If our internet speed is 56 Kbps, we face difficulty while connecting more than one device, buffering while streaming videos, slow downloading, and many other internet connectivity problems.

Mbps is short for megabits per second (a megabit is also called Mbit or Mb). 50 Mbps means 50 megabits per second, which is used to describe faster data rates. If our internet speed is 50 Mbps, we will experience online activity without any buffering, such as online gaming, downloading music, streaming HD, web browsing, etc. 50 Mbps or more is known as fast internet speed. With fast speed, we can easily handle more than one online activity for more than one user at a time without major interruptions in service.

MB is also called megabyte or MByte. It is used when we are referring to the size of files containing photos and videos (e.g., 3.2 MB).

Gbit is also called gigabit or Gb. It is used to describe really fast network speeds (e.g., 100 Gbit).

GB is also called gigabyte or GByte. It is used to describe storage capacity (e.g., 16 GB).

Digital data is compressed to reduce transmission times and file size. Data compression is the process of reducing the number of bits used to represent data, and typically uses encoding techniques to do so. Compressed data helps us save storage capacity, reduce costs for storage hardware, and increase file transfer speeds.

Compression uses programs, which in turn use algorithms and functions, to find a way to reduce the data size. Compression is often referred to as "zipping"; the process of reconstructing files is known as unzipping or extracting. Compressed files typically end with .gz, .tar.gz, .pkg, or .zip. Compression can be divided into two techniques: lossless compression and lossy compression.

As the name implies, lossless compression is the process of compressing data without any loss of information. If we compress data with lossless compression, then we can exactly recover the original data from the compressed data. That means all the information can be completely restored.

Many applications rely on lossless compression. For example, lossless compression is used in the ZIP file format and in the GNU tool gzip. Lossless data compression can also be used as a component within lossy data compression technologies. It is generally used for discrete data like word processing files, database records, some images, and video information.
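
A lossless round trip can be demonstrated with Java's built-in DEFLATE support in java.util.zip; a minimal sketch with an arbitrary, highly repetitive input:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class LosslessRoundTrip {
   public static void main(String[] args) throws Exception {
      byte[] original = "aaaaabbbbbcccccaaaaabbbbbccccc".getBytes(StandardCharsets.US_ASCII);
      Deflater deflater = new Deflater();
      deflater.setInput(original);
      deflater.finish();
      byte[] buffer = new byte[100];
      int compressedLength = deflater.deflate(buffer);  // DEFLATE, the algorithm behind ZIP/gzip
      Inflater inflater = new Inflater();
      inflater.setInput(buffer, 0, compressedLength);
      byte[] restored = new byte[original.length];
      inflater.inflate(restored);
      System.out.println(original.length + " -> " + compressedLength + " bytes");
      System.out.println(Arrays.equals(original, restored));  // true: nothing was lost
   }
}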

Lossy compression is the process of compressing data such that the original data cannot be recovered exactly. This compression is able to provide a high degree of compression, and the result is smaller compressed files; but in the process, some video frames, sound waves and original pixels are removed forever.

The greater the compression, the smaller the files. Business data and text, which need full restoration, never use lossy compression. Nobody likes to lose information, but there are a lot of files that are very large, and we don't have enough space to maintain all of the original data, or many times we don't require all the original data in the first place; for example, videos, photos and audio recordings that capture the beauty of our world. In these cases, we use lossy compression.







About Data Representation

Data can be anything, including a number, a name, musical notes, or the colour of an image. The way we store, process, and transmit data is referred to as data representation. We can use any device, including computers, smartphones, and iPads, to store data in digital format. The stored data is handled by electronic circuitry. A bit is a 0 or 1 used in digital data representation.

[Figure: Data Representation Techniques]

Classification of Computers

Computers are classified broadly based on their speed and computing power.

1. Microcomputers or PCs (Personal Computers): It is a single-user computer system with a medium-power microprocessor. It is referred to as a computer with a microprocessor as its central processing unit.

[Figure: Microcomputer]

2. Mini-Computer: It is a multi-user computer system that can support hundreds of users at the same time.

[Figure: Types of Mini Computers]

3. Mainframe Computer: It is a multi-user computer system that can support hundreds of users at the same time. Software technology is distinct from minicomputer technology.

[Figure: Mainframe Computer]

4. Super-Computer: With the ability to process hundreds of millions of instructions per second, it is a very fast computer. Supercomputers are used for specialised applications requiring enormous amounts of mathematical computation, but they are very expensive.

[Figure: Supercomputer]

Types of Computer Number System

Every value saved to or obtained from computer memory uses a specific number system, which is the method used to represent numbers in the computer system architecture. One needs to be familiar with number systems in order to read computer language or interact with the system. 

[Figure: Types of Number System]

1. Binary Number System 

There are only two digits in a binary number system: 0 and 1. In this number system, 0 and 1 stand in for every number (value). Because the binary number system only has two digits, its base is 2.

A bit is another name for each binary digit. The binary number system is also a positional value system, where each digit's value is expressed in powers of 2.

Characteristics of Binary Number System

The following are the primary characteristics of the binary system:

It only has two digits, zero and one.

Depending on its position, each digit has a different value.

Each position corresponds to a power of the base, 2.

Because computer hardware works with two internal voltage levels, the binary system is used in all types of computers.

[Figure: Binary Number System]

2. Decimal Number System

The decimal number system is a base ten number system with ten digits ranging from 0 to 9. This means that these ten digits can represent any numerical quantity. A positional value system is also a decimal number system. This means that the value of digits will be determined by their position. 

Characteristics of Decimal Number System

Ten units of a given order equal one unit of the higher order, making it a decimal system.

The number 10 serves as the foundation for the decimal number system.

The value of each digit or number will depend on where it is located within the numeric figure because it is a positional system.

The value of a number is obtained by multiplying each digit by the power of ten corresponding to its position, and summing the results.

[Figure: Decimal Number System]

Decimal Binary Conversion Table

Decimal   Binary
0         0000
1         0001
2         0010
3         0011
4         0100
5         0101
6         0110
7         0111
8         1000
9         1001
10        1010
11        1011
12        1100
13        1101
14        1110
15        1111
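
These conversions can be checked programmatically; here is a minimal Java sketch (used purely as a convenient illustration) based on the standard library's conversion methods:

public class BaseConversion {
   public static void main(String[] args) {
      System.out.println(Integer.toBinaryString(13));   // 1101
      System.out.println(Integer.parseInt("1101", 2));  // 13
      System.out.println(Integer.toOctalString(13));    // 15
      System.out.println(Integer.toHexString(13));      // d
   }
}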

3. Octal Number System

There are only eight (8) digits in the octal number system, from 0 to 7. In this number system, each number (value) is represented by the digits 0, 1, 2, 3,4,5,6, and 7. Since the octal number system only has 8 digits, its base is 8.

Characteristics of Octal Number System:

Contains eight digits: 0,1,2,3,4,5,6,7.

Also known as the base 8 number system.

Each position in an octal number represents a power of the base (8); the rightmost position corresponds to the 0th power.

Positions further to the left correspond to successively higher powers of the base (8).

[Figure: Octal Number System]

4. Hexadecimal Number System

There are sixteen (16) alphanumeric values in the hexadecimal number system, ranging from 0 to 9 and A to F. In this number system, each number (value) is represented by 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F. Because the hexadecimal number system has 16 alphanumeric values, its base is 16. Here, the letters stand for A = 10, B = 11, C = 12, D = 13, E = 14, and F = 15.

Characteristics of Hexadecimal Number System:

A system of positional numbers.

Has 16 symbols or digits overall (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F). Its base is, therefore, 16.

Decimal values 10, 11, 12, 13, 14, and 15 are represented by the letters A, B, C, D, E, and F, respectively.

A single digit may have a maximum value of 15. 

Each digit position corresponds to a different base power (16).

Since there are only 16 digits, each hexadecimal digit can be represented by 4 binary bits.

[Figure: Hexadecimal Number System]

So, we've seen how to convert decimals and use the Number System to communicate with a computer. The full character set of the English language, which includes all alphabets, punctuation marks, mathematical operators, special symbols, etc., must be supported by the computer in addition to numerical data. 

Learning By Doing

Choose the correct answer:

1. Which computer is the largest in terms of size?

Minicomputer

Micro Computer

2. The binary number 11011001 is converted to what decimal value?

Solved Questions

1. Give some examples where Supercomputers are used.

Ans: Weather Prediction, Scientific simulations, graphics, fluid dynamic calculations, Nuclear energy research, electronic engineering and analysis of geological data.

2. Which of these is the most costly?

Mainframe computer

Ans: C) Supercomputer


FAQs on Introduction to Data Representation

1. What is the distinction between the Hexadecimal and Octal Number System?

The octal number system is a base-8 number system in which the digits 0 through 7 are used to represent numbers. The hexadecimal number system is a base-16 number system that employs the digits 0 through 9 as well as the letters A through F to represent numbers.

2. What is the smallest data representation?

The smallest unit of data is the bit; the smallest addressable unit of storage in a computer's memory is the byte, which comprises 8 bits.

3. What is the largest data unit?

The largest commonly available data storage unit is a terabyte or TB. A terabyte equals 1,000 gigabytes, while a tebibyte equals 1,024 gibibytes.


Representation of Data/Information

Computers do not understand human language; they understand data in a prescribed form. Data representation is a method to represent data and encode it in a computer system. Generally, a user inputs numbers, text, images, audio, video and other types of data to process, but the computer converts this data to machine language first and then processes it.

Some Common Data Representation Methods Include


Data representation plays a vital role in storing, processing, and communicating data. A correct and effective data representation method impacts data processing performance and system compatibility.

Computers represent data in the following forms

Number system.

A computer system considers numbers as data; it includes integers, decimals, and complex numbers. All the inputted numbers are represented in binary formats like 0 and 1. A number system is categorized into four types −

  • Binary − A binary number system is the base of all the numbers considered for data representation in the digital system. A binary number system consists of only two values, either 0 or 1; so its base is 2. It can be represented to the external world as (10110010) 2 . A computer system uses binary digits (0s and 1s) to represent data internally.
  • Octal − The octal number system represents values in 8 digits. It consists of the digits 0, 1, 2, 3, 4, 5, 6, and 7; so its base is 8. It can be represented to the external world as (324017) 8 .
  • Decimal − The decimal number system represents values in 10 digits. It consists of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9; so its base is 10. It can be represented to the external world as (875629) 10 .
  • Hexadecimal − The hexadecimal number system represents values in 16 digits. It consists of the digits 0 to 9 and the letters A to F; so its base is 16.

The table below summarises the data representation of the number systems along with their base and digits.

Number System
System Base Digits
Binary 2 0 1
Octal 8 0 1 2 3 4 5 6 7
Decimal 10 0 1 2 3 4 5 6 7 8 9
Hexadecimal 16 0 1 2 3 4 5 6 7 8 9 A B C D E F

Bits and Bytes

A bit is the smallest data unit that a computer uses in computation; all the computation tasks done by the computer systems are based on bits. A bit represents a binary digit in terms of 0 or 1. The computer usually uses bits in groups. It's the basic unit of information storage and communication in digital computing.

A group of eight bits is called a byte. Half of a byte is called a nibble; it means a group of four bits is called a nibble. A byte is a fundamental addressable unit of computer memory and storage. It can represent a single character, such as a letter, number, or symbol using encoding methods such as ASCII and Unicode.

Bytes are used to determine file sizes, storage capacity, and available memory space. A kilobyte (KB) is equal to 1,024 bytes, a megabyte (MB) is equal to 1,024 KB, and a gigabyte (GB) is equal to 1,024 MB. File size is roughly measured in KBs and availability of memory space in MBs and GBs.


The following table shows the conversion of Bits and Bytes −

Unit         Equivalent
1 Byte       8 Bits
1 Kilobyte   1024 Bytes
1 Megabyte   1024 Kilobytes
1 Gigabyte   1024 Megabytes
1 Terabyte   1024 Gigabytes
1 Petabyte   1024 Terabytes
1 Exabyte    1024 Petabytes
1 Zettabyte  1024 Exabytes
1 Yottabyte  1024 Zettabytes
1 Brontobyte 1024 Yottabytes
1 Geopbyte   1024 Brontobytes

A text code is a scheme that assigns a numeric code to each text character; it includes alphabets, punctuation marks and other symbols. Some of the most commonly used text code systems are −

  • EBCDIC
  • ASCII
  • Extended ASCII
  • Unicode

EBCDIC stands for Extended Binary Coded Decimal Interchange Code. IBM developed EBCDIC in the early 1960s and used it in their mainframe systems like System/360 and its successors. To meet commercial and data processing demands, it supports letters, numbers, punctuation marks, and special symbols. Its character codes distinguish EBCDIC from other character encoding methods like ASCII; data encoded in EBCDIC is not directly compatible with systems that use ASCII, and must be converted when moving between them. EBCDIC encodes each character as an 8-bit binary code and defines 256 symbols. The below-mentioned table depicts different characters along with their EBCDIC codes.

[Table: EBCDIC character codes]

ASCII stands for American Standard Code for Information Interchange. It is a 7-bit code that specifies character values from 0 to 127. ASCII is a standard for character encoding that assigns numerical values to represent characters, such as letters, numbers, punctuation marks and control characters, used in computers and communication equipment.

ASCII originally defined 128 characters, encoded with 7 bits, allowing for 2^7 (128) potential characters. The ASCII standard specifies characters for the English alphabet (uppercase and lowercase), numerals from 0 to 9, punctuation marks, and control characters for formatting and control tasks such as line feed, carriage return, and tab.

ASCII Tabular column
ASCII Code Decimal Value Character
0000 0000 0 Null prompt
0000 0001 1 Start of heading
0000 0010 2 Start of text
0000 0011 3 End of text
0000 0100 4 End of transmit
0000 0101 5 Enquiry
0000 0110 6 Acknowledge
0000 0111 7 Audible bell
0000 1000 8 Backspace
0000 1001 9 Horizontal tab
0000 1010 10 Line Feed

Extended American Standard Code for Information Interchange is an 8-bit code that specifies character values from 128 to 255. Extended ASCII encompasses the normal ASCII character set, consisting of 128 characters encoded in 7 bits, plus additional characters that utilise the full 8 bits of a byte; there are a total of 256 potential characters.

Different extended ASCII variants exist, each introducing more characters beyond the conventional ASCII set. These additional characters may encompass symbols, letters, and special characters specific to a language or locale.

[Table: Extended ASCII character codes]

Unicode is a worldwide character standard that uses 8 to 32 bits per character to represent letters, numbers and symbols. Unicode is a standard character encoding which is specifically designed to provide a consistent way to represent text in nearly all of the world's writing systems. Every character is assigned a unique numeric code, regardless of platform, program, or language. Unicode offers a wide variety of characters, including alphabets, ideographs, symbols, and emojis.

[Table: Unicode character examples]


How exactly are data types represented in a computer?

I'm a beginning programmer reading K&R, and I feel as if the book assumes a lot of previous knowledge. One aspect that confuses me is the actual representation, or should I say existence, of variables in memory. What exactly does a data type specify for a variable? I'm not too sure of how to word this question... but I'll ask a few questions and perhaps someone can come up with a coherent answer for me.

When using getchar(), I was told that it is better to use type "int" than type "char" due to the fact that "int" can hold more values while "char" can hold only 256 values. Since we may need the variable to hold the EOF value, we will need more than 256 or the EOF value will overlap with one of the 256 characters. In my mind, I view this as a bunch of boxes with empty holes. Could someone give me a better representation? Do these "boxes" have index numbers? When EOF overlaps with a value in the 256 available values, can we predict which value it will overlap with?

Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?

Also, what is the actual important difference between "char" and "int"? If we can use "int" type instead of "char" type, why do we decide to use one over the other at certain times? Is it to save "memory" (I use quotes as I do not actually know how "memory" exactly works)?

Lastly, how exactly are the 256 available values of type char obtained? I read something about modulo 2^n, where n = 8, but why does that work (something to do with binary?)? What does the modulo portion of "modulo 2^n" mean (if it has any relevance to modular arithmetic, I can't see the relation...)?


  • "when we definitely know that we will only have 256 possible ASCII characters?" nit-pick: There's only 128 characters in ASCII. –  kusma Commented Jan 9, 2010 at 17:29
  • 1 there is more than just "int".. there is unsigned int (0-65535) and signed int (-32767 to 32767)... plain char in most implementations is 0 to 255 in unsigned. You also have short and long. short is two bytes, int is 4 bytes, and long is 8 bytes. See here: home.att.net/~jackklein/c/inttypes.html –  user195488 Commented Jan 9, 2010 at 17:29
  • Sorry :s Then could I ask why we cannot use the 256 available values of type "char" if we are also using the function getchar() and expecting an EOF at some point? –  withchemicals Commented Jan 9, 2010 at 17:30
  • 1 withchemicals: Because "char" doesn't really imply "ASCII". In C, "char" implies "smallest addressable piece of memory", otherwise known as "byte". getchar() simply gives you a byte from a stream (or EOF). –  kusma Commented Jan 9, 2010 at 17:41
  • 2 Many of the answers below assume two's complement representation of numbers. C makes no such guarantees. It can run (and does run) on ones' complement machines, and sign-magnitude machines (or any other weird encodings there are!). –  Alok Singhal Commented Jan 10, 2010 at 2:51

11 Answers

Great questions. K&R was written back in the days when there was a lot less to know about computers, and so programmers knew a lot more about the hardware. Every programmer ought to be familiar with this stuff, but (understandably) many beginning programmers aren't.

At Carnegie Mellon University they developed an entire course to fill in this gap in knowledge, which I was a TA for. I recommend the textbook for that class: "Computer Systems: A Programmer's Perspective" http://amzn.com/013034074X/

The answers to your questions are longer than can really be covered here, but I'll give you some brief pointers for your own research.

Basically, computers store all information--whether in memory (RAM) or on disk--in binary, a base-2 number system (as opposed to decimal, which is base 10). One binary digit is called a bit. Computers tend to work with memory in 8-bit chunks called bytes.

A char in C is one byte. An int is typically four bytes (although it can be different on different machines). So a char can hold only 256 possible values, 2^8. An int can hold 2^32 different values.

For more, definitely read the book, or read a few Wikipedia pages:

  • http://en.wikipedia.org/wiki/Binary_numeral_system
  • http://en.wikipedia.org/wiki/Twos_complement

Best of luck!

UPDATE with info on modular arithmetic as requested:

First, read up on modular arithmetic: http://en.wikipedia.org/wiki/Modular_arithmetic

Basically, in a two's complement system, an n-bit number really represents an equivalence class of integers modulo 2^n.

If that seems to make it more complicated instead of less, then the key things to know are simply:

  • An unsigned n-bit number holds values from 0 to 2^n-1. The values "wrap around", so e.g., when you add two numbers and get 2^n, you really get zero. (This is called "overflow".)
  • A signed n-bit number holds values from -2^(n-1) to 2^(n-1)-1. Numbers still wrap around, but the highest number wraps around to the most negative, and it starts counting up towards zero from there.

So, an unsigned byte (8-bit number) can be 0 to 255. 255 + 1 wraps around to 0. 255 + 2 ends up as 1, and so forth. A signed byte can be -128 to 127. 127 + 1 ends up as -128. (!) 127 + 2 ends up as -127, etc.
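
The same wrap-around is easy to demonstrate; here is a minimal Java sketch (Java's byte is a signed 8-bit type, so it mirrors the signed case described above, while the unsigned case is emulated with masking):

public class WrapAround {
   public static void main(String[] args) {
      byte b = 127;                     // maximum value of a signed 8-bit number
      b++;                              // wraps around past the top of the range
      System.out.println(b);            // -128
      int unsigned = (255 + 2) & 0xFF;  // emulate unsigned 8-bit arithmetic
      System.out.println(unsigned);     // 1
   }
}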


  • Thanks! Could you explain the "modulo" portion of 2^n? –  withchemicals Commented Jan 9, 2010 at 17:51
  • I would rather have said "back in the days when programming was a lot lower level, closer to the hardware, so learning programming quickly required (and resulted in) a good basic understanding of the underlying hardware". –  L. Cornelius Dol Commented Jan 9, 2010 at 18:55
  • Software Monkey: well-said, I think that's more exact than what I wrote. –  jasoncrawford Commented Jan 10, 2010 at 2:25
  • withchemicals: Updated answer with some info on modular arithmetic; hope this helps. –  jasoncrawford Commented Jan 10, 2010 at 2:38
  • BTW, on some platforms there is at least one character in the C character set whose value exceeds the maximum value for a signed char. For example, in EBCDIC, '0' is 0xF0. On such machines, 'char' must be unsigned. On some other platforms (e.g. some DSP's), sizeof(char)==sizeof(int), and both are able to hold values -32767..32767 (and perhaps other values as well). On such machines, 'char' must be signed. Note further that on such machines it would be possible for -1 to be a valid character value, and for EOF to be some value other than -1. –  supercat Commented Feb 15, 2011 at 20:41
One aspect that confuses me is the actual representation, or should I say existence, of variables in memory. What exactly does a data type specify for a variable?

At the machine level, the difference between int and char is only the size, or number of bytes, of the memory allocated for it by the programming language. In C, a char is one byte while an int is typically 4 bytes. If you were to "look" at these inside the machine itself, you would see a sequence of bits for each. Being able to treat them as int or char depends on how the language decides to interpret them (this is also why it's possible to convert back and forth between the two types).

When using getchar(), I was told that it is better to use type "int" than type "char" due to the fact that "int" can hold more values while "char" can hold only 256 values.

This is because there are 2^8, or 256, combinations of 8 bits (because a bit can have two possible values), whereas there are 2^32 combinations of 32 bits. The EOF constant (as defined by C) is a negative value, not falling within the range of 0 and 255. If you try to assign this negative value to a char (thus squeezing its 4 bytes into 1), the higher-order bits will be lost and you will end up with a valid char value that is NOT the same as EOF. This is why you need to store it into an int and check before casting to a char.
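This is essentially why the canonical K&R copy loop stores getchar()'s result in an int before comparing it against EOF:

    #include <stdio.h>

    int main(void)
    {
        int c;  /* int, not char: must hold all 256 byte values plus EOF */

        while ((c = getchar()) != EOF)
            putchar(c);

        return 0;
    }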

Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?

Yes, especially since in that case you are assigning a character literal.

Also, what is the actual important difference between "char" and "int"? If we can use "int" type instead of "char" type, why do we decide to use one over the other at certain times?

Most importantly, you would pick int or char at the language level depending on whether you wanted to treat the variable as a number or a letter (to switch, you would need to cast to the other type). If you wanted an integer value that took up less space, you could use a short int (which I believe is 2 bytes), or if you were REALLY concerned with memory usage you could use a char , though mostly this is not necessary.

Edit: here's a link describing the different data types in C and modifiers that can be applied to them. See the table at the end for sizes and value ranges.

– danben

  • Nitpick: to handle characters you'd stay the hell away from char and use a higher-level abstraction from a library like GLib. –  Tobu Commented Jan 9, 2010 at 17:34
  • 2 Sure, but I still think its important to understand what's actually going on at the lower levels. –  danben Commented Jan 9, 2010 at 17:36
  • 2 In C, an int can be 4 bytes, or more, or less. int must be able to represent values between -32767 and +32767 . –  Alok Singhal Commented Jan 9, 2010 at 17:40
  • 3 int is not necessarily 4 bytes. All C says is: short <= int <= long and short >= 2 bytes and long >= 4 bytes. See "The C Programming Language", ANSI C Version, by K&R, page 36. –  Dave O. Commented Jan 9, 2010 at 17:49
  • 2 Also, in C, char may be signed, in which case it can store EOF , but of course char may be unsigned as well, and that's why we use int in this case. –  Alok Singhal Commented Jan 9, 2010 at 17:50

Basically, system memory is one huge series of bits, each of which can be either "on" or "off". The rest is conventions and interpretation.

First of all, there is no way to access individual bits directly; instead they are grouped into bytes, usually in groups of 8 (there are a few exotic systems where this is not the case, but you can ignore that for now), and each byte gets a memory address. So the first byte in memory has address 0, the second has address 1, etc.

A byte of 8 bits has 2^8 possible different values, which can be interpreted as a number between 0 and 255 (unsigned byte), or as a number between -128 and +127 (signed byte), or as an ASCII character. A variable of type char per C standard has a size of 1 byte.

But bytes are too small for a lot of things, so other types have been defined that are larger (i.e. they consist of multiple bytes), and CPUs support these different types through special hardware constructs. An int is typically 4 bytes nowadays (though the C standard does not specify it and ints can be smaller or bigger on different systems) because 4 bytes are 32 bits, and until recently that was what mainstream CPUs supported as their "word size".

So a variable of type int is 4 bytes large. That means when its memory address is e.g. 1000, then it actually covers the bytes at addresses 1000, 1001, 1002, and 1003. In C, it is also possible to address those individual bytes separately, and that is how variables can overlap.
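A short sketch of peeking at an int's individual bytes (the byte order you see depends on your machine's endianness):

    #include <stdio.h>

    int main(void)
    {
        int n = 0x01020304;
        unsigned char *p = (unsigned char *)&n;   /* view the same memory byte by byte */

        for (size_t i = 0; i < sizeof n; i++)
            printf("byte at %p: 0x%02X\n", (void *)(p + i), (unsigned)p[i]);

        return 0;
    }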

As a sidenote, most systems require larger types to be "word-aligned", i.e. their addresses have to be multiples of the word size, because that makes things easier for the hardware. So it is not possible to have an int variable start at address 999, or address 17 (but 1000 and 16 are OK).

– Michael Borgwardt

  • 1 Again, int may be 4 bytes or 2 or even 1, or anything. It must be able to represent the range +-32767. –  Alok Singhal Commented Jan 9, 2010 at 17:52
  • wouldn't it be 2^7 and not 2^8? –  user195488 Commented Jan 9, 2010 at 17:57
  • @Alok, yes that's what I say one paragraph higher. @Roboto: Nope. 8 bits means 2^8 different values. One bit has 2 values (2^1), each additional bit doubles this. –  Michael Borgwardt Commented Jan 9, 2010 at 18:06

I'm not going to completely answer your question, but I would like to help you understand variables, as I had the same problems understanding them when I began to program by myself.

For the moment, don't bother with the electronic representation of variables in memory. Think of memory as a continuous block of 1-byte cells, each storing a bit pattern (consisting of 0s and 1s).

By looking at the memory alone, you can't determine what the bits in it represent! They are just arbitrary sequences of 0s and 1s. It is YOU who specifies HOW to interpret those bit patterns! Take a look at this example:
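A minimal sketch of what that example might look like (a fragment, assuming it sits inside a function), using the variables a, b, and c referenced below:

    int a = 1;
    int b = 2;
    int c = a + b;   /* "+" here compiles to an integer addition */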

You could have written the following as well:
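And a sketch of the floating-point variant of the same fragment:

    float a = 1.0f;
    float b = 2.0f;
    float c = a + b; /* same "+", but now a floating-point addition */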

In both cases, the variables a, b and c are stored somewhere in the memory (and you can't tell their type). Now, when the compiler compiles your code (that is, translates your program into machine instructions), it makes sure to translate the "+" into integer_add in the first case and float_add in the second case, so the CPU will interpret the bit patterns correctly and perform what you desired.

Variable types are like glasses that let the CPU look at bit patterns from different perspectives.

– Dave O.

To go deeper, I'd highly recommend Charles Petzold's excellent book "Code".

It covers more than what you ask, all of which leads to a better understanding of what's actually happening under the covers.

– Rob Wells

Really, datatypes are an abstraction that allows your programming language to treat a few bytes at some address as some kind of numeric type. Consider the data type as a lens that lets you see a piece of memory as an int, or a float. In reality, it's all just bits to the computer.

– dicroce

  • The OP framed it in terms of hardware, but I agree with you; the individual questions were all better answered with a short introduction to type theory. –  Tobu Commented Feb 27, 2011 at 17:04
  • In C, EOF is a "small negative number".
  • In C, char type may be unsigned, meaning that it cannot represent negative values.
  • For unsigned types, when you try to assign a negative value to them, it is converted to an unsigned value. If MAX is the maximum value an unsigned type can hold, then assigning -n to such a type stores the value in the range 0..MAX that is congruent to -n modulo MAX + 1, i.e., (MAX + 1) - (n % (MAX + 1)) whenever that remainder is nonzero. So, to answer your specific question about predicting, "yes you can". For example, let's say char is unsigned, and can hold values 0 to 255 inclusive. Then assigning -1 to a char is equivalent to assigning 256 - 1 = 255 to it.

Given the above, to be able to store EOF in c, c can't be of char type. Thus, we use int, because it can store "small negative values". In particular, in C, int is guaranteed to be able to store values in the range -32767 to +32767. That is why getchar() returns int.

If you are assigning values directly, then the C standard guarantees that expressions like 'a' will fit in a char . Note that in C, 'a' is of type int , not char, but it's okay to do char c = 'a' , because 'a' is able to fit in a char type.
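A quick sketch to confirm this on your machine (the value printed for sizeof 'a' equals sizeof(int), typically 4):

    #include <stdio.h>

    int main(void)
    {
        /* In C, a character literal such as 'a' has type int, not char. */
        printf("sizeof 'a' = %zu, sizeof(char) = %zu\n", sizeof 'a', sizeof(char));
        return 0;
    }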

About your question as to what type a variable should hold, the answer is: use whatever type that makes sense. For example, if you're counting, or looking at string lengths, the numbers can only be greater than or equal to zero. In such cases, you should use an unsigned type. size_t is such a type.

Note that it is sometimes hard to figure out the right type for data, and even the "pros" may make mistakes. The gzip format, for example, stores the size of the uncompressed data in the last 4 bytes of a file. This breaks for huge files > 4GB in size, which are fairly common these days.

You should be careful about your terminology. In C, a char c = 'a' assigns an integer value corresponding to 'a' to c , but it need not be ASCII. It depends upon whatever encoding you happen to use.

About the "modulo" portion, and 256 values of type char : if you have n binary bits in a data type, each bit can encode 2 values: 0 and 1. So, you have 2*2*2...*2 ( n times) available values, or 2 n . For unsigned types, any overflow is well-defined, it is as if you divided the number by (the maximum possible value+1), and took the remainder. For example, let's say unsigned char can store values 0..255 (256 total values). Then, assigning 257 to an unsigned char will basically divide it by 256, take the remainder (1), and assign that value to the variable. This relation holds true for unsigned types only though. See my answer to another question for more.

Finally, you can use char arrays to read data from a file in C, even though you might end up hitting EOF , because C provides other ways of detecting EOF without having to read it in a variable explicitly, but you will learn about it later when you have read about arrays and pointers (see fgets() if you're curious for one example).

– Community

According to "stdio.h" getchars() return value is int and EOF is defined as -1. Depending on the actual encoding all values between 0..255 can occur, there for unsigned char is not enough to represent the -1 and int is used. Here is a nice table with detailed information http://en.wikipedia.org/wiki/ISO/IEC_8859

– stacker

The beauty of K&R is its conciseness and readability; writers always have to make concessions for their goals. Rather than being a 2000-page reference manual, it serves as a basic reference and an excellent way to learn the language in general. I recommend Harbison and Steele's "C: A Reference Manual" for an excellent C reference book for details, and the C standard of course.

You need to be willing to google this stuff. Variables are represented in memory at specific locations and are known to the program of which they are a part, within a given scope. A char will typically be stored in 8 bits of memory (on some rare platforms this isn't necessarily true). 2^8 represents 256 distinct possibilities for such a variable. Different CPUs/compilers/etc. represent the basic types int and long with varying sizes. The C standard specifies minimum sizes for these, but not maximum sizes; for double it requires a certain minimum precision, but this doesn't preclude Intel from using 80 bits in a floating point unit. Anyway, typical sizes in memory on 32-bit Intel platforms would be 32 bits (4 bytes) for unsigned/signed int and float, 64 bits (8 bytes) for double, and 8 bits for signed/unsigned char. You should also look up memory alignment if you are really interested in the topic. You can also look at the exact layout in your debugger by getting the address of your variable with the "&" operator and then peeking at that address. Intel platforms may confuse you a little when looking at values in memory, so please look up little endian/big endian as well. I am sure Stack Overflow has some good summaries of this as well.

– dudez

Historically, the characters needed for English text were represented by ASCII and Extended ASCII, though modern character sets (e.g., Unicode) go well beyond Extended ASCII.

When using char to hold getchar()'s result, there is a chance of misreading a value (such as EOF) as an ordinary character; when using int, there is no such ambiguity, because int can hold every character value plus the distinct EOF value.

– Fahad Naeem

For your last question about modulo:

Lastly, how exactly is the 256 available values of type char obtained? I read something about modulo 2^n, where n = 8, but why does that work (something to do with binary?). What is the modulo portion of "modulo 2^n" mean (if it has any relevance to modular arithmetic, I can't see the relation...)?

Think about modulo as a clock, where adding hours eventually results in you starting back at 0. Adding an hour for each step, you go from 00:00 to 01:00 to 02:00 to 03:00 to ... to 23:00 and then add one more hour to get back to 00:00. The "wrap-around" or "roll-over" is called modulo, and in this case is modulo 24.

With modulo, the modulus itself is never reached; as soon as you "reach" that number, it wraps around to the beginning (24:00 is really 00:00 in the time example).

As another example, modern humanity's numbering system is Base 10 (i.e., Decimal), where we have digits 0 through 9. We don't have a singular digit that represents value 10. We need two digits to store 10.

Let's say we only have a one-digit adder, where the output can only store a single digit. We can add any two single-digit numbers together, like 1+2 or 5+4. 1+2=3, as expected. 5+4=9, as expected. But what happens if we add 5+5, 9+1, or 9+9? To calculate 5+5, our machine computes 10, but it can't store the 1 due to its lack of memory capabilities, so the computer treats the 1 as an "overflow digit" and throws it away, only storing the 0 as the result. So, looking at your output for the computation 5+5, you see the result is 0, which probably isn't what you were expecting. To calculate 9+9, your single-digit adding machine would correctly calculate 18, but again, due to the hardware limitation of storing a maximum of one digit, it doesn't have the ability to store the 1, so it throws it away. The adder CAN however store the 8, so your result of 9+9 produces 8. Your single-digit adder is modulo'ing by 10. Notice how you can never reach the number 10 in your output, even when your result should be 10 or bigger. The same issue occurs in binary, but with different modulo values.

As an aside, this "overflow" issue is especially bad with multiplication since you need twice the length of your biggest input to multiply two numbers (whether the numbers are binary or decimal or some other standard base) with all the result's digits intact. I.e., if you're multiplying a 32-bit number by another 32-bit number, your result might take 64 bits of memory instead of a convenient 32! E.g., 999 (3 digit input) times 999 (3 digit input) = 998,001 (6 digit output). Notice how the output requires double the number of digits of storage compared to one of the inputs' lengths (number of digits).

Back to binary modulo: a char in the programming language C is defined as the smallest addressable unit (source: I made it up based off what I've been told a hundred times). AFAIK, a single char is always the length of a single byte. A byte is 8 bits, which is to say, a byte is an ordered group of eight 1s and/or 0s. E.g., 11001010 is a byte. Again, the order matters, meaning that 01 is not the same as 10, much like how 312 is not the same as 321 in Base 10.

Each bit you add gives you twice as many possible states. With 1 bit, you have 2^1 = 2 possible states (0,1). With 2 bits, you have 2^2 = 4 states (00,01,10,11). With 3 bits, you have 2^3 = 8 states (000,001,010,011,100,101,110,111). With 4 bits, you have 2^4 = 16 states, etc. With 8 bits (the length of a byte, and also the length of a char), you have 2^8 = 256 possible states.

The largest value you can store in a char is 255 because a char has only 8 bits, meaning you can store all 1s to get the maximum value, which will be 11111111(bin) = 255(dec). As soon as you try to store a larger number, like by adding 1, we get the same overflow issue mentioned in the 1-digit adder example. 255+1 = 256 = 1 0000 0000 (spaces added for readability). 256 takes 9 bits to represent, but only the lower 8 bits can be stored since we're dealing with chars, so the most significant bit (the only 1 in the sequence of bits) gets truncated and we're left with 0000 0000 = 0. We could've added any number to the char, but the resulting char will always be between values 0 (all bits are 0s) and 255 (all bits are 1s).

Since a maximum value of 255 can be stored, we can say that operations that output/result in a char are mod 256 (256 can never be reached, but everything below that value can; see the clock example). Even if we add a million to a char, the final result will be between 0 and 255 (after a lot of truncation happens). Your compiler may give you a warning if you do a basic operation that causes overflow, but don't depend on it.

I said earlier that char can store up to 256 values, 0 through 255 - this is only partially true. You might notice that you get strange numbers when you try to do operations like char a = 128;. Printing out the number as an integer (char a = 128; printf("%d", a);) should tell you that the result is -128. Did the computer think I added a negative by accident? No. This happened because a char is signed on most platforms (strictly speaking, whether plain char is signed or unsigned is implementation-defined), meaning it's able to be negative. 128 is actually overflow because the range of 0 to 255 is split roughly in half, from -128 to +127. The maximum value being +127, and then adding 1 to reach 128 (which makes it overflow by exactly 1), tells us that the resulting number of char a = 128; will be the minimum value a char can store, which is -128. If we had added 2 instead of 1 (like if we tried to do char a = 129;), then it would overflow by 2, meaning the resulting char would have stored -127. The maximum value will always wrap around to the minimum value in non-floating-point numbers.

  • Floating-point numbers still work based on place values (but it's called a mantissa and functions differently) like ints, shorts, chars, and longs, but floating-point numbers also have a dedicated sign bit (which has no additive value) and an exponent.

If you choose to look at the raw binary when setting variables equal to literal values like 128 or -5000, here is what you'll find:

For signed non-floating-point numbers, the largest place value gets assigned a 1 when the overall number is negative, and that place value gets treated as a negative version of the usual place value. E.g., -5 (decimal) would be 1xx...x in binary (where each x is a placeholder for either a 0 or 1). As another example, instead of place values being 8,4,2,1 for an unsigned number, they become -8,4,2,1 for a signed number, meaning you now have a "negative 8's place".

2's complement to switch between + and - values: Flip (i.e., "complement") all bits (i.e., each 1 gets flipped to a 0, and, simultaneously, each 0 gets flipped to a 1) (e.g., -12 = -16 + 4 = 10100 -> 01011). After flipping, add value 1 (place value of 1) (e.g., 01011 + 1 = 01100 = 0+8+4+0+0 = +12). Summary: Flip bits, then add 1.
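A small C sketch of the flip-then-add-1 rule (note that the final cast back to signed char is implementation-defined, though it yields -12 on the two's complement machines this answer assumes):

    #include <stdio.h>

    int main(void)
    {
        unsigned char x = 12;
        unsigned char neg = (unsigned char)(~x + 1);   /* flip all bits, then add 1 */

        /* 244, i.e., -12 modulo 256 */
        printf("%u\n", (unsigned)neg);

        /* -12 on a two's complement machine (the cast is implementation-defined) */
        printf("%d\n", (signed char)neg);

        return 0;
    }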

Examples of using 2's complement to convert binary numbers into EQUIVALENT signed decimal numbers:

  • 11 = (-2)+1 = -1
  • 111 = (-4)+2+1 = -1
  • 1111 = (-8)+4+2+1 = -1
  • 1000 = (-8)+0+0+0 = -8
  • 0111 = 0+4+2+1 = 7
  • Notice how 0111 and 111 are treated differently when they're designated as "signed" numbers (instead of "unsigned" (always positive) numbers). Leading 0s are important.

If you see the binary number 1111, you might think, "Oh, that's 8+4+2+1 = 15". However, you don't have enough information to assume that. It could be a negative number. If you see "(signed) 1111", then you still don't know the number for certain due to one's complement existing, but you can assume it means "(signed 2's complement) 1111", which would be (-8)+4+2+1 = -1. The same sequence of bits, 1111, can be interpreted as either -1 or 15, depending on its signedness. This is why the unsigned keyword in unsigned char is important. When you write char, on most platforms you are implicitly telling the computer that you want a signed char (strictly speaking, whether plain char is signed or unsigned is implementation-defined).

  • unsigned char - Can store numbers between 0 and 255
  • (signed) char - Can store numbers between -128 and +127 (same span, but shifted to allow negatives)
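To see the same bit pattern read both ways, a minimal sketch (again, converting the out-of-range value to signed char is implementation-defined, but gives -1 on typical two's complement machines):

    #include <stdio.h>

    int main(void)
    {
        unsigned char bits = 0xFF;                       /* the bit pattern 1111 1111 */

        printf("as unsigned: %d\n", bits);               /* 255 */
        printf("as signed:   %d\n", (signed char)bits);  /* typically -1 */
        return 0;
    }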

– Stev




Open access | Published: 29 August 2024

Graph Fourier transform for spatial omics representation and analyses of complex organs

Yuzhou Chang, Jixin Liu, Yi Jiang, Anjun Ma, Yao Yu Yeo, Megan McNutt, Jordan E. Krull, Scott J. Rodig, Dan H. Barouch, Garry P. Nolan, Dong Xu, Sizun Jiang, Zihai Li, Bingqiang Liu & Qin Ma

Nature Communications volume 15, Article number: 7467 (2024)


Subjects: Bioinformatics, Computational models, Machine learning, Transcriptomics

Spatial omics technologies decipher functional components of complex organs at cellular and subcellular resolutions. We introduce Spatial Graph Fourier Transform (SpaGFT) and apply graph signal processing to a wide range of spatial omics profiling platforms to generate their interpretable representations. This representation supports spatially variable gene identification and improves gene expression imputation, outperforming existing tools in analyzing human and mouse spatial transcriptomics data. SpaGFT can identify immunological regions for B cell maturation in human lymph nodes Visium data and characterize variations in secondary follicles using in-house human tonsil CODEX data. Furthermore, it can be integrated seamlessly into other machine learning frameworks, enhancing accuracy in spatial domain identification, cell type annotation, and subcellular feature inference by up to 40%. Notably, SpaGFT detects rare subcellular organelles, such as Cajal bodies and Set1/COMPASS complexes, in high-resolution spatial proteomics data. This approach provides an explainable graph representation method for exploring tissue biology and function.

Introduction

Advancements in spatial omics offer a comprehensive view of the molecular landscape within the native tissue microenvironment, including genome, transcriptome, microbiome, T cell receptor (TCR) 1 , epigenome, proteome, transcriptome–protein marker co-profiling, and epigenome–transcriptome co-profiling 2 (Fig.  1a and Supplementary Fig.  1 ). These approaches enable the investigation and elucidation of functional tissue units (FTUs) 3 , which are defined as over-represented multicellular functional regions with a unique physiologic function, with both cell-centric and gene-centric approaches. Specifically, cell-centric approaches involve the identification of spatial domains with coherent gene expression and histology 4 , studying cell composition and neighborhoods within specific domains 5 , 6 , 7 , and understanding inter-cellular mechanisms. In parallel, gene-centric approaches characterize FTUs by imputing gene expression 8 and identifying spatially variable genes (SVGs) 9 , 10 , 11 in a highly complementary manner to cell-centric approaches.

Figure 1

a The panel showcases spatial omics technologies, including single and multi-modality methods. b–d The panels display the calculation of Fourier modes (FMs) and the transformation of original graph signals into Fourier coefficients (FCs) at different technology resolutions. b The figure presents pixel graphs with nodes at the subcellular level and edges denoting short Euclidean distances between connected pixels. This graph represents technologies like Stereo-seq and most spatial proteomics data, e.g., 4i. The two figures following panel b illustrate a k -bandlimited signal (e.g., Afp ) and a non- k -bandlimited signal (e.g., Xbp1 ). c and d Cell graphs and spot graphs are composed of nodes at the cellular level and multicellular level resolution, respectively, with edges representing short Euclidean distances between nodes in the two panels. e The figure exhibits multi-modal data from a technology called SPOTS, which can measure both proteins and genes simultaneously. The k -bandlimited signal shown is for the Ly6a gene and its corresponding protein, while the non- k -bandlimited signal is for the Klrb1c gene. f – h The panels show examples of signals from Slide-DNA-seq, Slide-TCR-seq, and spatial epigenome–transcriptome co-profiling of mouse embryo-13. i The panel shows that subcellular spatial proteome signals (i.e., 4i) are k -bandlimited signals. j This panel demonstrates data augmentation for sequencing-based spatial transcriptomics (e.g., Visium). The first step of augmentation involves using H&E images and Cellpose for cell segmentation and counting the number of nuclei in each spot. The next step involves mapping reads to the microbiome genome, which then allows for the determination of microbiome abundance. Finally, gene lists (e.g., MSigDB) can be used to calculate the pathway activity score for each spot. k This panel displays the signals mentioned in panel j , including cell density, microbiome abundance, and pathway activity. Panel a is created with BioRender.com, released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.

Classic statistical methods, such as SPARK 9 , SPARK-X 12 , and SpatialDE 11 , have effectively modeled molecular variations and spatial relationships within a tissue. However, they did not fully explore the capacity to translate these relationships into understandable and analyzable features. In contrast, graph-based methods present a powerful alternative that efficiently encodes and leverages spatial relationships within tissue in spatial omics data representation 13 . We postulate that an FTU can be intuitively considered a graph; its nodes represent spots or cells, and edges connect spatially adjacent or functionally related nodes. Within this representation of FTUs, a binary graph signal (e.g., 0/1) represents discrete two-state information at each node, while cellular or subcellular compositions or omics features (e.g., genes) constitute continuous graph signals, encoding a range of values across the graph's nodes. These graph signals define the FTU's characterization, connect cell-centric and gene-centric analyses, and offer mutual interpretability 14 through the generation of a graph embedding that harmonizes the graph structure and signal magnitude. Furthermore, while graph-based machine learning methods are available to learn graph embeddings and carry out downstream tasks (e.g., graph classification), their learning process is usually a "black box" and relies on an inductive bias (i.e., a hypothesis for a particular question) to train the model 15 . The characteristics of the produced graph embeddings are specifically tailored to perform optimally in certain targeted downstream tasks. Therefore, there is a need for a generic graph signal representation framework with a solid mathematical foundation to reveal intricate relations between molecular signatures and FTUs across multiple resolutions of spatial omics data.

To achieve this, we present the Spatial Graph Fourier Transform (SpaGFT), an analytical feature representation approach to encode smooth graph signals for representing biological processes within tissues and cells. It bridges graph signal processing techniques and spatial omics data, enabling various downstream analyses and facilitating insightful biological findings. Computationally, SpaGFT outperformed other tools in identifying SVGs with hundred-fold efficiency and in gene expression imputation across human/mouse Visium data. Biologically, SpaGFT identified key immunological areas for B cell maturation processes from human lymph node Visium data and further illustrated secondary follicle cellular, morphological, and molecular diversity from exclusively in-house human tonsil CODEX data. Moreover, SpaGFT can be seamlessly integrated into other machine learning frameworks for domain identification (e.g., SpaGCN 4 ), annotation transfer from cell types to spots (e.g., TACCO 6 ), cell-to-spot alignment (e.g., Tangram 7 ), and subcellular hallmark inference (e.g., CAMPA 16 ). Notably, the enhanced CAMPA has enabled the discovery of rare subcellular structures like the Cajal body and Set1/COMPASS complex based on iterative indirect immunofluorescence imaging (4i) data 17 , enhancing our understanding of cellular function using spatial omics technologies.

SpaGFT reliably represents the smooth signal of spatial omics data

We summarize current spatially resolved omics as three types of spatial graphs according to the granularity of their nodes, ranging from the subcellular level (i.e., pixel-level) to broader cellular (i.e., cell-level) and multicellular scales (i.e., spot-level), based on spatial resolution (Fig.  1b–k ). For example, based on the spatial graph of a spatially resolved transcriptomics (SRT) dataset, the transcriptomic profile of a specific gene is a graph signal and can be represented by the linear combination of its Fourier modes (FMs, Terminology Box). To elaborate, a low-frequency FM contributes to a low and smooth graph signal variation, representing a spatially organized pattern, while a high-frequency FM contributes to rapid graph signal variation and usually refers to noise in spatial omics data 18 . For example, if a gene exhibits a spatially organized pattern in SRT data, the Fourier coefficients (FCs) of the corresponding low-frequency FMs are more dominant than the FCs of high-frequency FMs in the graph Fourier representation. Notably, FMs are associated with the graph structure and do not assume any predefined patterns 18 , ensuring flexibility in representing both well-defined and irregular spatial signal patterns. Thus, regardless of single-modality (Fig.  1 b–d, f, g, and i ), multi-modality (Fig.  1 e and h ), or augmented features (Fig.  1 j and k ), spatial omics data can be analytically transformed into FCs to quantify the contribution of FMs in the frequency domain 19 , a feature space that enhances interpretability and generalizability in downstream analyses.
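For orientation, the graph Fourier machinery referenced here can be written in its standard textbook form (a sketch; SpaGFT's exact normalization and weighting are specified in the Methods and Supplementary Note 1). Given a spatial graph with adjacency matrix \(A\) and degree matrix \(D\), the graph Laplacian and its eigendecomposition are

$$L = D - A = U \Lambda U^{\top}, \qquad \Lambda = \mathrm{diag}(\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n),$$

where the columns of \(U\) are the FMs and the eigenvalues \(\lambda\) act as graph frequencies. A graph signal \(x\) (e.g., one gene's expression across spots) is transformed into FCs by \(\hat{x} = U^{\top} x\) and recovered by \(x = U \hat{x}\); small eigenvalues correspond to smooth, spatially organized FMs, and large eigenvalues to rapidly varying, noise-like FMs.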

SpaGFT identifies spatially variable genes and enhances gene and protein signals

Using the representation framework of SpaGFT (Fig.  2a ), the mathematical formulation of SVG identification can be derived as a k -bandlimited signal recognition problem, which determines the first k low-frequency FMs that best approximate the original graph signal (Fig.  2b , Supplementary Fig.  2 , and S1 of Supplementary Note  1 ). This formulation overcomes three main limitations of SVG identification methods: (i) no pre-assumption of regular patterns in model design (e.g., radial hotspot, curve belt, or gradient streak) 9 ; (ii) interpretable representation of SVG patterns 20 with spatial context; and (iii) high computational efficiency 12 when processing large-scale datasets. Essentially, we defined and implemented a GFTscore for each gene to quantify the contribution of low-frequency FMs by determining the first k low-frequency FMs, weighting, and summing the corresponding FCs (S 2 of Supplementary Note  1 ). Based on this definition, a gene is identified as an SVG if (i) its GFTscore is greater than the inflexion point of the distribution of all genes' GFTscores and (ii) the FCs of its first k low-frequency FMs are significantly higher than the FCs of its high-frequency FMs (S 3 of Supplementary Note  1 ). Consequently, we evaluated the performance of SVG identification using 31 public SRT datasets from human and mouse brains (Supplementary Data  1 ) 21 , 22 , 23 , 24 . As no gold-standard SVG database was available, we collected 849 SVG candidates from five existing studies 24 , 25 , 26 , 27 , 28 , and 458 of them were used as curated benchmarking SVGs based on cross-validation with the in situ hybridization (ISH) database of the Allen Brain Atlas 29 (Supplementary Data  2 and 3 , see the "Methods" section). The SVG prediction performance of SpaGFT was compared with SPARK 9 , SPARK-X 12 , MERINGUE 30 , SpatialDE 11 , SpaGCN 4 , and scGCO 31 in terms of six reference-based and two reference-free metrics (Supplementary Note  2 ). A grid search of parameter combinations was conducted on three high-quality brain datasets to evaluate each tool's performance, in which SpaGFT showed the highest median and peak scores (Fig.  2c and Supplementary Data  4 ). In addition, the computational speed of SpaGFT was two-fold faster than that of SPARK-X and hundreds-fold faster than those of the other four tools on the two Visium datasets (Supplementary Data  5 ). Although SpaGFT was slower than SPARK-X on the Slide-seqV2 dataset, it showed remarkably enhanced SVG prediction performance compared to SPARK-X. We then performed an independent test on 28 independent datasets using the parameter combination with the highest median Jaccard index among the three datasets from the above grid-search test. The results revealed that SpaGFT showed superior performance among the investigated tools based on the evaluation metrics (Fig.  2d , Supplementary Fig.  3a–d , and Supplementary Data  6 ). Within the top 500 SVGs from each of the above six tools, SpaGFT identified SVGs shared with other tools as well as unique SVGs that were validated against the ground truth (Supplementary Fig.  3e and Supplementary Data  7 ). For example, Nsmf and Tbr1 were identified by all six tools and showed clear structures of the hippocampus, cortical region, and cerebral cortex. On the other hand, Cartpt, Cbln2, Ttr , and Pmch were uniquely identified by SpaGFT and have key functions in the brain, such as Cartpt participating in dopamine metabolism 29 (Fig.  2e , Supplementary Fig.  4 , and Annotation 1 of Supplementary Note  3 ). These benchmarking results suggest that SpaGFT leverages the FM representation of gene expression for robust and accurate identification of SVGs from SRT data. Notably, the SVGs identified by SpaGFT were distinctly separated from non-SVGs on the FM-based gene UMAP with a clear boundary, whereas SVGs were irregularly distributed on the principal component-based gene UMAP (Fig.  2f ).
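Schematically, the GFTscore described above is a weighted sum over the first \(k\) low-frequency FCs; its general shape (with the specific weights \(w_l\) as defined in S2 of Supplementary Note 1) is

$$\mathrm{GFTscore}(g) = \sum_{l=1}^{k} w_l \left|\hat{x}_g(\lambda_l)\right|,$$

where \(\hat{x}_g(\lambda_l)\) is the FC of gene \(g\) at the \(l\)-th lowest graph frequency.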

Figure 2

a SpaGFT considers a gene–spot expression count matrix ( \(m\times n\) ) and spatial locations as input data, with ENC1 , MOBP , and GPS1 listed as examples. b Two known SVGs ( MOBP and ENC1 ) and one non-SVG ( GPS1 ) are shown as examples. The FMs can be separated into low-frequency (red) and high-frequency (blue) domains. c The SVG prediction evaluation was compared to five benchmarking tools. The running time is represented as red lines. In addition, the other evaluation scores of all parameter combinations for each tool are shown as heatmaps. The two-sided Wilcoxon rank-sum test was used to calculate the p -value for the two highest-scoring tools (i.e., N  = 16 and N  = 53 for SpaGFT and MERINGUE in HE_coronal data; N = 16 and N = 54 for SpaGFT and MERINGUE in 151673 data; N  = 16 and N  = 18 for SpaGFT and SPARK-X in Puck-200115-08 data). Methods that were not able to identify SVGs in a reasonable time are shown as NA in this panel. d The box plot shows independent test results. The two-sided Wilcoxon rank-sum test is used to calculate the p -values for the two highest-scoring tools ( N  = 28). Each box showcases the minimum, first quartile, median, third quartile, and maximum Jaccard scores in panels c and d . e SVG examples that all tools can identify (left panel) and that are uniquely identified by SpaGFT (middle and right panels). Green genes are reported in the literature, while orange ones are not. Expression of Nsmf , Tbr1, Cartpt, Cbln2, Ttr , and Pmch in the adult mouse brain. Allen Mouse Brain Atlas, mouse.brain-map.org/experiment/show/74821712, mouse.brain-map.org/experiment/show/79591351, mouse.brain-map.org/experiment/show/72077479, mouse.brain-map.org/experiment/show/68632172, and mouse.brain-map.org/experiment/show/55. f Comparison of the UMAPs of the HE_coronal data in principal component feature space and Fourier space. g Boxplot showcasing the performance of SVG signal enhancement for the grid search (top) and the independent test using 151509 (bottom), where the y-axis is the ARI value. The two-sided Wilcoxon rank-sum test is used to calculate the p -values for the two highest-scoring tools ( N  = 27). Each box showcases the minimum, first quartile, median, third quartile, and maximum ARI scores. h and i The spatial maps show the signals before and after enhancement and noise removal for spatial omics features. Source data are provided as a Source Data file.

In addition, distorted graph signal correction can be used as the mathematical formulation for imputing a low-expressed gene or denoising a high-intensity but noisy protein in SpaGFT. Essentially, FCs are shifted towards a specific bandwidth by implementing a low-pass filter and are inversely transformed to an enhanced graph signal using an inverse graph Fourier transform (iGFT) 32 . To enhance the main signal and mitigate noise, a low-pass filter is employed to weight and shift all FCs toward the low-frequency bandwidth (see the "Methods" section). In the end, these weighted FCs are transformed back to a corrected graph signal via iGFT (Supplementary Fig.  5a ). In assessing the performance of gene expression correction, we used 16 human brain SRT datasets with well-annotated spatial domains 23 , 24 and utilized the adjusted Rand index (ARI) to measure the accuracy of predicting spatial domains using corrected gene expression. As a result, SpaGFT outperformed other gene enhancement tools in terms of ARI, including Sprod 8 , SAVER-X, scVI, netNMF-sc, MAGIC, and DCA 33 , 34 (Fig.  2g , Supplementary Fig.  5b , and Supplementary Data  8 ). For example, SpaGFT enhanced low-intensity spatial omics signals broadly across different technologies and species, such as gene TNFRSF13C for human lymph node, gene Ano2 for mouse brain 29 (Supplementary Fig.  5c ), cell density for human prostate tumor (from the data of Fig.  1k ), and protein I-A with its corresponding gene H2ab1 for mouse breast tumor. Similarly, noisy background can also be removed, as for protein LY6A/E with its corresponding gene Ly6a , and protein CD19 (Fig.  2h, i and Annotation 2 of Supplementary Note  3 ).
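In the same notation as above, this enhancement step applies a low-pass filter \(h\) to the FCs and inverts the transform (a generic form; the specific filter used by SpaGFT is given in the Methods):

$$x' = U\, h(\Lambda)\, U^{\top} x, \qquad h(\lambda) \approx 1 \ \text{for low } \lambda, \quad h(\lambda) \approx 0 \ \text{for high } \lambda.$$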

SpaGFT identifies the germinal center, T cell zone, B cell zone, and crosstalking regions in the human lymph node

As low-frequency FCs can represent smooth spatially variable patterns, they can be used for SVG clustering, and gene clusters can correlate with distinct FTUs from a gene perspective (Supplementary Fig.  6a ). To demonstrate this application, we applied SpaGFT to publicly available Visium data of the human lymph node, which, as a secondary lymphoid organ, contains well-known recurrent functional regions, such as T cell zones, B cell zones, and germinal centers (GCs) 20 . First, SpaGFT identified 1,346 SVGs and characterized nine SVG clusters (Fig.  3a and Supplementary Data  9 ). To recognize the FTUs of the T cell zone, B cell zone, and GC, we first used cell2location 35 , 36 to determine the cell proportions (Supplementary Fig.  6b and Supplementary Data  10 ) for the nine SVG clusters and investigated function enrichment (Supplementary Fig.  6c–e ) for three selected FTUs. Based on the molecular, cellular, and functional signatures of the three regions 35 , we found that SVG clusters 3, 5, and 7 (Fig.  3b ) were associated with the T cell zone, GC, and B cell zone, respectively (Annotation 3 of Supplementary Note  3 ).

Figure 3

a UMAP visualization of nine SVG clusters from the human lymph node. Each dot represents an SVG. The upper-right UMAP shows SVGs in red and non-SVGs in gray. b Clusters 3, 5, and 7 were highly associated with the T cell zone, GC, and B follicle cell components based on molecular and functional signatures. The heatmap visualizes the FTU–cell type correlation matrix. c The spatial map overlays three FTUs and displays the overlapped spots and unique spots. As different colors correspond to spots, we selected four areas to showcase region-to-region interaction. A1 showcases the GC, GC–B interaction region, and B follicle. A2 showcases the B follicle, B–T interaction region, and T cell zone. A3 showcases the GC, GC–T interaction zone, and T cell zone. A4 displays a B–GC–T interaction zone. d The barycentric coordinate plot shows cell-type components and the abundance of spots in interactive and functional regions. If a spot is closer to a vertex of the equilateral triangle, the cell type composition of the spot tends toward the signature cell types of that functional region. The spots are colored by functional region and interactive region categories. e and f The three plots display changes in enriched functions and cell type components across seven regions (GC, GC–B, B, B–T, T, T–GC, T–GC–B). The P -value was calculated using one-way ANOVA to test the differences among the means of the seven regions. The numbers of spots in the GC zone, B cell zone, T cell zone, GC and T zone, GC and B zone, T&B zone, and GC, T, and B zone are 116, 1367, 667, 158, 614, 93, and 26. The error bars show the standard deviation of enrichment scores. Source data are provided as a Source Data file.

In contrast to spatial domain detection tools, SpaGFT is not restricted to a rigid boundary for tissue-level identification of microenvironments 5 . Instead, SpaGFT allows overlapping regions to infer the functional coherence and collaboration among different FTUs. We therefore projected the three FTUs represented by SVG clusters 3, 5, and 7 on the spatial map for visual inspection, and identified their close spatial proximity (Fig.  3c ). These results are highly indicative of tissue regions of polyfunctionality amongst these three FTUs (four representative subregions are shown in Fig.  3c ). To further investigate the crosstalk amongst these three FTUs, we projected the spots (assigned to all three regions) to barycentric coordinates (the equilateral triangle in Fig.  3d ), which displayed the relations and abundance of the unique and overlapped regions in terms of cell type components 37 . We identified 614 spots overlapping the B cell zone and GC, 158 spots overlapping the GC and T zone, 93 spots overlapping the T zone and B cell zone, and 26 spots overlapping all three FTUs (Supplementary Data  11 ), in support of the complex interactions within these three FTUs. We next hypothesized that the spots from the overlapped regions would vary in functions and cell components to support the polyfunctionality of these regions. We thus investigated the changes in enriched functions (Supplementary Data  12 ) and cell types (Supplementary Fig.  7 ) across the seven regions (i.e., GC, GC–B, B, B–T, T, T–GC, and T–B–GC). Our results identified lymph node-relevant pathways and cell types, such as B and T cell activity and functions, as significantly varied across those regions (Fig.  3e, f , Annotation 4 of Supplementary Note  3 ), in support of our hypothesis.

SpaGFT reveals secondary follicle variability based on CODEX data

The Visium results in Fig.  3 showcased the ability of SpaGFT to identify FTUs via SVG clustering. Given that the current resolution of Visium (~50 μm per pixel) limited our ability to interpret the variability of finer follicle structures and their corresponding functions at the cellular level, we next performed single-cell level spatial proteomics on a human tonsil using a 49-plex CODEX panel at a ~0.37 μm per pixel resolution (Fig.  4a ) to better characterize and interpret the follicle variability we observed and inferred using SpaGFT on the Visium data. Based on the anatomical patterns highlighted by B (e.g., CD20) and T cell (e.g., CD4) lineage markers, we selected fields of view (FOVs) that would allow for a good representation of the complex tissue structures present in the tonsil (i.e., GC and interfollicular space 38 ) while still highlighting the variability in follicle structure 39 . We first performed cell segmentation with DeepCell 40 , followed by clustering with FlowSOM 41 and Marker Enrichment Modeling 42 to identify the diverse cell phenotypes present in the data (Fig.  4b ). Interestingly, we observed that the clear arrangement of T and B cell patterns (e.g., A3, A5, and A6) informed identifiable GC regions within the follicular structure, compared to others (e.g., A4) without clear T and B cell spatial organization (Fig.  4b ). We therefore postulated that A4 is composed of multiple follicles, unlike A5 and A6, representing a more spatially complex FOV.

Figure 4

a A 49-plex CODEX dataset was generated from human tonsil tissue at a 0.37 μm/pixel resolution. Six FOVs were selected based on their varying tissue microenvironment and cellular organization. b Cell phenotype maps for each of the six FOVs, depicting the cellular composition and organization. c The results show the characterization of FTUs based on the gradient pixel-level images for A6. The heatmap depicts the SSIM score, where a higher score corresponds to a lighter color and greater structural similarity. d A heatmap showcasing the protein expression of each FTU represented by the six SVP clusters, which were identified as FTUs resembling secondary follicles. The values in the heatmap are scaled by z -scores of protein expression. e–h Overlays of CODEX images for SVPs for FOVs 1, 3, 4, and 6, respectively. i Spatial maps depicting the patterns of secondary follicle FTUs from the six FOVs. Dashed rectangles indicate the identified follicle regions. Note that panels d to h are ordered by FOV 1, 2, 3, 4, 5, and 6. j Cell phenotype maps of the FTUs identified in ( i ). k Barplots depicting the cell components of the identified FTUs in ( i ). The cell type colors are depicted in ( b ). l The graph network depicting the spatial proximity of the top 5 abundant cell types in the FTUs identified in i , as calculated by \(\frac{1}{1+d}\) , where d represents the average distance between any two cell types. m Dumbbell plots indicate significant cell–cell interactions among B cells and others. If the observed distance is significantly smaller than the expected distance, the two cell types tend to be in contact and interact. Line length represents relative distances, subtracting the expected distance from the observed distance. An empirical permutation test was used to calculate the p -value, and the point size was scaled using an adjusted p -value.

We investigated this further by directly using the raw CODEX images as inputs to identify FTUs formed from spatially variable protein (SVP) clusters within the tissue environment 43 . To verify whether downsampling the CODEX image (Supplementary Fig.  8a ) would result in a loss of power in characterizing FTUs, we first used FOV 6 to generate three images across different resolutions (with downsampling), resulting in (1) a 1000-by-1000 pixel image (~0.8 μm per pixel), (2) a 500-by-500 pixel image (~1.6 μm per pixel), and (3) a 200-by-200 pixel image. Our results show that despite the generation of diverse low- and high-frequency FMs from the three pixel-level images (as illustrated in Supplementary Fig.  8b ), SpaGFT was robust to resolution changes, characterizing FTUs across different resolutions with consistent patterns (Supplementary Fig.  8c ). We subsequently calculated the structural similarity score (SSIM) to quantitatively evaluate pattern similarity among the identified FTUs. Each gradient pixel size image identified six FTUs, and the patterns of those FTUs showed pairwise consistency (Fig.  4c and Supplementary Fig.  8d ), suggesting that 200-by-200 pixel downsampled images (an approximate factor of 105-fold from the original pixel size) were sufficient for characterizing FTUs, balancing computational efficiency and biological insight.
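For reference, the SSIM between two image patches \(x\) and \(y\) takes its standard form, with means \(\mu\), variances \(\sigma^2\), covariance \(\sigma_{xy}\), and small stabilizing constants \(c_1, c_2\):

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}.$$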

We next implemented SpaGFT to characterize FTUs for the six FOVs with 200-by-200 pixel images and annotated follicles for each FOV based on cell components (Supplementary Fig.  9 ) and protein signatures (Supplementary Data  13 ; Supplementary Figs.  10a, b ). Specifically, the FTUs represented by SVP cluster 1 of A1 and SVP cluster 1 of A2 displayed morphological features akin to those of a mantle zone (MZ). Molecularly, we uncovered that the B cell-specific marker 44 (CD20) and the anti-apoptotic factor (BCL-2) 45 were SVPs for these two FTUs of A1 and A2 (Fig.  4d, e and Supplementary Fig.  10c ). Our results confirmed the presence of CD20 in delineating the MZ structure, and further suggest BCL-2 as an additional feature of MZ structures 46 . In another case, the FTUs represented by SVP cluster 4 of A3, SVP cluster 9 of A4, and SVP cluster 4 of A5 displayed GC-specific T cell signatures (Fig.  4 f, g and Supplementary Fig.  10d ) and corresponding molecular features, including PD-1 47 and CD57 48 , indicating the presence of well-characterized GC-specific T follicular helper cells 49 . For the FTU represented by SVP cluster 2 of A6, we observed a complex molecular environment, where Podoplanin, CD11c, and CD11b were SVPs, showcasing the existence of follicular dendritic cell (FDC) 50 and GC-centric macrophage 51 networks (Fig.  4h ). In addition to molecular heterogeneity, we further captured their variability in terms of length-scale and morphology (Fig.  4i ), cell type (Fig.  4j, k ), cell–cell distance (Fig.  4l ), and cell–cell interactions (Fig.  4m ). For example, from the tissue morphology perspective, A3–A6 captured clear oval-shaped patterns with different length-scales, whereas A1 and A2 captured multiple partial MZ patterns (Fig.  4i ). Although visual inspection was unable to distinguish between the morphological patterns of GCs in A4 (Fig.  4b ), SpaGFT was able to determine three small length-scale GC patterns at the molecular level (Fig.  4i ).

Regarding cellular characteristics, six FTUs (i.e., two MZs from A1 and A2; four GCs from A3 to A6) were dominated by B and CD4 T cells in varying proportions (Fig.  4j, k ; Supplementary Data  14 ). Specifically, the MZs from A1 and A2 showed an average composition of 58% B and 10% CD4 T cells. GCs from A3 and A5, with similar length-scales, showed an average of 54% B and 32% CD4 T cells. A4 captured three small length-scale GCs and showcased 43% B and 46% CD4 T cells, while the large-scale GC from A6 contained 70% B and 12% T cells, indicating that B and T cell proportions vary across GCs of different length-scales. We could also infer cell–cell interaction based on distance (Fig.  4l, m ). In general, the MZs from A1 and A2 show that the observed B–B distance was smaller than the expected distance, which suggests a homogeneous biological process of significant B–B interaction in this region. In addition, cell–cell interaction also shows heterogeneity between the two MZs. Interactions between CD4 T cells and B cells were observed in the two MZs from A1 and A2, showcasing the infiltration of CD4 T cells into the B cell-rich mantle zone 52 . DC–B and CD4 T–B cell interactions in A3 and A4 suggest light zone functions for B cell selection 53 , 54 . Macrophage–B cell interactions in the GC in A6 potentially indicate macrophage regulation of B cells (e.g., B cells that failed to trigger cell proliferation signals during the B cell selection process underwent apoptosis and were subsequently engulfed by macrophages 55 ). Our results demonstrate the applicability of SpaGFT at an initial subsampled lower resolution from high-plex spatial proteomics, efficiently identifying and characterizing high-attention tissue regions, including secondary follicles, to uncover cellular and molecular variability that can be further confirmed at the original single-cell resolution. We also affirmed that FTUs identified by SpaGFT were not simply regions of cell aggregation but reflected both cellular and regional activity and cell–cell interactions based on spatially orchestrated molecular signatures.

SpaGFT can generate new features and be implemented as an explainable regularizer for machine-learning algorithms

SpaGFT can also enhance the performance of existing methods, serving as an explainable regularizer through feature or objective engineering. To illustrate its applicability, we present three representative SRT analyses as follows (Supplementary Fig. 11 and see the “Methods” section).

First, we showcase how spot clustering can identify spatial domains that are coherent in both gene expression and histology. Here, we selected SpaGCN 4 as the demonstration to showcase the implementation of FCs from the feature engineering perspective (Fig. 5a). To illustrate FCs as features, we extended the spatial expression matrix by concatenating a spot-by-FC matrix derived from spot-spot similarity. Subsequently, the new feature matrix was input into the original SpaGCN model to predict spatial domains. As in the SpaGCN study, we utilized 12 human brain SRT datasets 24: two datasets from the same tissue section were used for tuning the number of new features, and the remaining 10 datasets were used to test the improvement over SpaGCN. The results indicated improvements on eight out of ten datasets (Supplementary Data 15) in identifying the spatial domains of the dorsolateral prefrontal cortex. Notably, the top five datasets exhibited enhancements between 7.8% and 42.6%.
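As a minimal sketch of this feature-engineering step (assuming a precomputed spot-by-FC matrix; the standardization and weight are our illustrative choices, not part of SpaGCN):

```python
import numpy as np

def augment_with_fcs(expr, spot_fc, weight=1.0):
    """Concatenate FC-derived features onto the spot-by-gene matrix.

    expr    : (n_spots, n_genes) normalized expression matrix.
    spot_fc : (n_spots, n_fcs) spot-by-FC matrix derived from
              spot-spot similarity (assumed precomputed by SpaGFT).
    """
    # Standardize FC features so they do not dominate expression features.
    fc_std = (spot_fc - spot_fc.mean(0)) / (spot_fc.std(0) + 1e-8)
    return np.hstack([expr, weight * fc_std])

# The augmented matrix then replaces the original input of the
# (otherwise unchanged) SpaGCN model.
```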

Figure 5

a Spot clustering can be formulated as a many-to-one mapping problem. Regarding the modified workflow of SpaGCN, we changed the original input of SpaGCN. A newly formed matrix was then placed into the frozen SpaGCN model for computation. The five samples with the largest performance increases are showcased, where the y-axis is the ARI value and the x-axis is the sample number. b Annotation transfer is formulated as a many-to-many mapping problem. Regarding the modified workflow of TACCO, we modified the cost matrix for optimal transport. In the new cost matrix calculation, we use weighted FCs as the feature to calculate the distance between cell types and spots and then optimize the baseline mapping matrix (e.g., the TACCO output). In the evaluation, we follow the TACCO methods to simulate spots with different bead sizes using scRNA-seq data and use the L2 error to measure differences between predicted and known cell compositions in each simulated spot. The y-axis is the bead size of the simulated data, and the x-axis is the L2 error. Lower L2 error scores indicate better performance. c Cell-spot alignment can be formulated as a many-to-many mapping problem. Regarding the modified workflow of Tangram, we added two additional constraint terms to its original objective function. The first constraint is designed from a gene-centric perspective, calculating the cosine similarity of the gene-by-FC matrix between the reconstructed and the original matrix. The second constraint is designed from a cell-centric perspective, calculating the cosine similarity of the spot-by-FC matrix between the reconstructed and the original matrix. In the evaluation, we first simulate spatial gene expression data using different window sizes based on STARmap data. Subsequently, we measure the similarity between predicted and known cell proportions in each simulated spot using the Pearson correlation coefficient (PCC). A higher PCC indicates better performance (Source data are provided as a Source Data file).

Second, annotation transfer addresses the challenge of insufficient data labeling as SRT datasets continue to emerge. We used TACCO 6 as an example annotation transfer tool to showcase the application of FCs as a regularizer for the optimal transport (OT) method, a machine learning approach that aims to find the most efficient way (i.e., minimizing the overall cost associated with the movement) to move a probability distribution from one configuration to another. Specifically, TACCO allows the transfer of phenotype-level annotation labels (e.g., cell type) from scRNA-seq to SRT using such an OT framework. Although TACCO has demonstrated its effectiveness by considering cell similarity over all genes, we hypothesized that projecting cell similarity into the frequency domain and strengthening a topological regularization in OT's objective function would be a potential avenue for performance enhancement. In our modification, we integrated a topological regularization term into the original cost matrix to derive a new cost matrix (Fig. 5b and see the “Methods” section). Leveraging the evaluation metrics of the original TACCO study, our tests underscored an 8.7–14.9% L2-error decrease across five simulated bead sizes in transferring annotated labels from scRNA-seq to unannotated SRT mouse brain data (Supplementary Data 16).
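A minimal sketch of the cost-matrix modification (function and variable names are illustrative; lam is an assumed trade-off coefficient, not a value from this study):

```python
import numpy as np
from scipy.spatial.distance import cdist

def regularized_cost(base_cost, celltype_fc, spot_fc, lam=0.5):
    """Add a frequency-domain (topological) term to an OT cost matrix.

    base_cost   : (n_cell_types, n_spots) original TACCO cost matrix.
    celltype_fc : (n_cell_types, n_fcs) weighted FC profiles per cell type.
    spot_fc     : (n_spots, n_fcs) weighted FC profiles per spot.
    """
    topo = cdist(celltype_fc, spot_fc, metric="cosine")  # FC-space distance
    return base_cost + lam * topo  # new cost matrix fed to the OT solver
```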

Third, aligning single-cell data (e.g., scRNA-seq) to low-/high-resolution SRT data is important because the two modalities complement each other in spatial resolution and molecular diversity. We selected Tangram 7 as an alignment tool to demonstrate the topological regularization of genes and spots in the frequency domain. Tangram optimizes the cell-to-spot mapping matrix through a gradient-based method, aiming to ensure similarity between the SRT reconstructed from scRNA-seq and the original SRT. The objective function of Tangram measures cell density, gene-level similarity, and cell-level similarity in the vertex domain. In alignment with the hypothesis proposed in Fig. 5b, we constrained the similarity at both the gene and cell level in the frequency domain (Fig. 5c). As a result, our tests illustrated a 7.4–15.9% increase in Pearson correlation coefficient when aligning scRNA-seq to simulated STARmap 56 mouse brain SRT data (Supplementary Data 17).
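The gene-centric constraint can be sketched as below (the cell-centric term is analogous; names are illustrative, and the term would be added to Tangram's objective with some weight):

```python
import numpy as np

def mean_row_cosine(a, b):
    """Mean row-wise cosine similarity of two equally shaped matrices."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return (num / den).mean()

def gene_frequency_penalty(U_k, X_rec, X_orig):
    """1 - cosine similarity of gene-by-FC matrices (gene-centric term).

    U_k    : (n_spots, k) low-frequency Fourier modes of the spot graph.
    X_rec  : (n_spots, n_genes) SRT reconstructed from mapped cells.
    X_orig : (n_spots, n_genes) measured SRT expression.
    """
    return 1.0 - mean_row_cosine(X_rec.T @ U_k, X_orig.T @ U_k)
```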

SpaGFT introduces an inductive bias to regularize the deep learning method and identify rare subcellular organelles

We applied SpaGFT to obtain an interpretable spreading-entropy regularization for a conditional variational autoencoder framework, CAMPA, to identify conserved subcellular organelles across multiple perturbed conditions on pixel-level 4i data (165 nm/pixel) 16, 17. To modify the model, we introduced an entropy term into the original reconstruction loss of CAMPA to regularize the spreading of graph signals 19. Specifically, we constrained the entropy within the first k frequency bands, providing an inductive assumption for CAMPA to learn embeddings that represent k-bandlimited signals (Supplementary Fig. 12a). Consequently, on the validation loss calculated from the validation datasets (see the “Methods” section), the loss curve of the modified model was lower and entered a stable state earlier (Fig. 6a). We observed that, by introducing the entropy term as a regularizer, the model enhanced its training efficacy in capturing and minimizing the reconstruction error and converged faster.
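Conceptually, the regularizer can be sketched as follows (a minimal illustration with assumed names; beta and the epsilon guards are ours):

```python
import torch

def spreading_entropy(x_hat, U_k):
    """Entropy of the energy distribution over the first k Fourier modes.

    x_hat : (n_pixels, n_channels) reconstructed signals on the pixel graph.
    U_k   : (n_pixels, k) low-frequency Fourier modes.
    A spread signal distributes energy over many modes (high entropy);
    a k-bandlimited signal concentrates it in few modes (low entropy).
    """
    fc = U_k.T @ x_hat                                    # Fourier coefficients
    p = fc.pow(2) / (fc.pow(2).sum(0, keepdim=True) + 1e-12)
    return -(p * torch.log(p + 1e-12)).sum(0).mean()

# Illustrative modified loss:
# loss = reconstruction_loss + beta * spreading_entropy(x_hat, U_k)
```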

Figure 6

a The first column shows pixel clustering concepts. In this bipartite graph (second column), pixel clustering can be formulated as a many-to-one mapping problem, where the source node represents the pixel, the target node represents the subcellular organelle, and the edges denote the corresponding mapping relationships. Regarding the modified workflow of CAMPA (third column), we made a modification to the original loss function. The modified term measures the spreading of graph signals in the reconstructed image. In the frequency domain, this spreading can be quantified using spreading entropy (see the “Methods” section). A spreading graph signal corresponds to high entropy, while a non-spread graph signal corresponds to low entropy. Therefore, the new regularizer term aims to minimize the spreading entropy. In the evaluation (fourth column), we used the validation loss, calculated using the same loss function and validation dataset, to examine the contribution of the spreading entropy to model training. The y-axis is the validation loss value, and the x-axis is the number of epochs for training the CAMPA model. b UMAP shows five pixel clusters predicted by the baseline model using the Leiden clustering algorithm at 0.2 resolution. c UMAP shows seven pixel clusters predicted by the modified model using the Leiden clustering algorithm at 0.2 resolution. Two rare clusters are circled in this panel. d The Sankey plot shows the cluster changes from the baseline model prediction to the modified model prediction. e The heatmap shows the annotation of each cluster (modified model at resolution 0.2) using the Human Protein Atlas. The columns of the heatmap are the protein intensities in the cell nucleus, and the rows correspond to clusters. f–h The three panels showcase the overview of predicted pixel clusters, cluster 6, and the marker protein for cell 224081. i–k The three panels showcase the overview of predicted pixel clusters, cluster 5, and the marker protein for cell 367420.

Furthermore, we validated that the modified model significantly (p-value = 0.035) improved on the baseline model regarding batch effect removal (Supplementary Fig. 12b–e) using kBET testing 57, indicating that the learned embeddings retained conserved structures of subcellular organelles across multiple perturbations. Next, compared with the baseline (Fig. 6b–d), the modified model additionally identified two rare clusters (Supplementary Data 18), including cluster 5 (with an average of 0.16% pixels per cell) and cluster 6 (with an average of 0.10% pixels per cell). Notably, the pixels assigned to these two clusters are very stable (i.e., not computationally random signals) regardless of the resolution parameter of the Leiden clustering algorithm (Supplementary Data 19 and Supplementary Fig. 12f). Subsequently, clusters 5 and 6 were annotated as Cajal bodies 58 and Set1/COMPASS 59, respectively (Fig. 6e). Cluster 6 and its corresponding protein signature, SETD1A (Fig. 6f–h), displayed a highly concentrated pattern and presented a strongly k-bandlimited signal in the frequency domain. Furthermore, we observed similar characteristics for cluster 5 and its corresponding marker protein, COIL (Fig. 6i–k). Therefore, by integrating the regularization of low-frequency signals from SpaGFT, the CAMPA model's stability was enhanced in learning embeddings that represent subcellular organelles with k-bandlimited characteristics. This approach, which we term “explainable regularization,” refines the detection and characterization of finer structures exhibiting spatially organized patterns.

SpaGFT provides a reliable feature representation through the graph Fourier transform that enhances our biological understanding of complex tissues. This method aligns with the advanced analytical capabilities required to dissect the intricate spatial components of tissue biology, from subcellular to multicellular scales. It eliminates the need for pre-defined expression patterns and significantly improves computational efficiency, as demonstrated in the benchmarking across 31 human/mouse Visium and Slide-seq V2 datasets. In addition, we highlight our manually curated 458 mouse and human brain genes as a close-to-gold-standard set of SVGs. This brings an alternative evaluation metric based on realistic human/mouse data, complementary to simulation-based evaluation methods, such as BSP 60, SPARK-X, SpatialDE, SPARK, scGCO, and other benchmarking work 61. Furthermore, implementing a low-pass filter and inverse GFT effectively imputes lowly expressed genes and denoises noisy protein intensities, leading to more precise spatial domain predictions, as showcased in the human dorsolateral prefrontal cortex. Notably, SpaGFT advances the interpretation of spatial omics data by enabling more accurate machine learning predictions. It improved the performance of existing frameworks by 8–40% across tasks: higher accuracy of spatial domain identification, lower error of annotation transfer from cell types to spots, more correct cell-to-spot alignments, and lower validation loss for subcellular hallmark inference.

From a computational standpoint, SpaGFT and scGCO are two graph representation methods, among others, for spatial omics data analysis, with the former focusing on omic feature representation and the latter on SVG detection. scGCO employs a graph-cut method to segment the tissue and compares the consistency between segmentations and gene expression in support of SVG detection. SpaGFT uses the graph Fourier transform to find a novel latent space to represent gene expression and achieve various downstream tasks, including, but not limited to, SVG identification, gene expression enhancement, and functional tissue unit inference.

In addition, there is good potential for implementing SpaGFT into existing explainable spatial multi-modality frameworks 2, such as UnitedNet 62, MUSE 63, and modality autoencoders 64. Considering UnitedNet 62 as an example, it incorporates explainable machine learning techniques to dissect the trained network and quantify the relevance of features across different modalities, specifically looking at cell-type-specific relationships. To bring more spatial insight into UnitedNet, SpaGFT can provide (1) augmented features (e.g., modified SpaGCN in Fig. 5a) and (2) an explainable regularizer (e.g., modified CAMPA in Fig. 6). To generate the augmented spatial omics features, SpaGFT can first calculate cell–cell relations (e.g., from H&E features, gene expression, or protein intensity) in the vertex domain and transform the relations to FCs. The FCs encode and quantify cell–cell variation patterns, which can be regarded as one of the inputs for UnitedNet. Regarding implementing SpaGFT as an explainable regularizer, the spreading entropy can be introduced into UnitedNet's reconstruction loss function, as UnitedNet has an encoder-decoder structure. By regularizing the entropy of encoded and decoded spatial omics features in the Fourier domain, UnitedNet may be guided to learn spatially organized regions that present a low-frequency signal (e.g., one functional tissue unit with a specific pattern and function). These enhancements are pivotal in characterizing complex biological structures using explainable regularization for deep learning frameworks, including identifying rare subcellular organelles, thus providing deeper insights into the cellular machinery.

Regarding the biological implications, SpaGFT offers alternative perspectives on spatial biology questions. Specifically, by grouping SVGs identified by SpaGFT, we can uncover distinct FTUs within organs. This has led to the identification of critical immunological regions in the human lymph node Visium data, enhancing our knowledge of B cell maturation and the polyfunctional areas it encompasses, such as the B cell zone, T cell zone, GC, B–T zone, GC–B zone, T–GC zone, and tri-zone. Additionally, using exclusive in-house CODEX data, SpaGFT has revealed differences among secondary follicles in morphology, molecular signatures, and cellular interactions in the human tonsil, offering a more nuanced understanding of B cell maturation. Moreover, SpaGFT introduces k-bandlimited signal entropy within the CAMPA framework. This has led to the groundbreaking identification of rare subcellular organelles, namely the Cajal body and the Set1/COMPASS complex. The former is integral to the regulation of gene expression, while the latter plays a critical role in epigenetic modifications. By enabling the investigation of these organelles with unprecedented detail, SpaGFT propels us closer to a comprehensive understanding of the spatial dynamics of gene expression and the epigenetic landscape within cells.

However, there is still room for improving prediction performance and understanding the FTU mechanism. First, SpaGFT focuses on low-frequency signals in the frequency domain, but there is a lack of discussion on medium- and high-frequency signals. Although a previous study 65 described that most functionally related biological signals are presented in the low-frequency region, certain special signals are also found in the medium- and high-frequency regions. For instance, in human brain fMRI (functional magnetic resonance imaging, a technique that measures brain activity by detecting changes associated with blood flow), low-frequency FMs capture global variation signals (e.g., daydreaming and retrieving memories). Medium-frequency FMs capture brain networks with less global variation but more rapid processing (e.g., working memory or executive functions). High-frequency FMs capture responses to new or complex stimuli that involve local connections between close brain regions (e.g., acute, localized brain activities). Analogous to spatial omics data, we assume that medium- and high-frequency signals may also carry corresponding special biological signals with more local and less global variation (e.g., regional stimulation from the environment), complementing the current k-bandlimited signal approach of representing smooth global variation. Therefore, future studies might focus more on multi-frequency signal interpretation. Second, although the SpaGFT computation speed is very competitive, it can be further enhanced by reducing the computational complexity from \(O({n}^{2})\) to \(O(n\times \log (n))\) using fast Fourier transform algorithms 66. Third, the alteration of the spot graph and FTU topology represents a potential challenge in identifying FTUs across spatial samples from different tissues or experiments, as it results in diverse FM spaces and renders the FCs incomparable. This is similar to the “batch effect” issue in multiple single-cell RNA sequencing (scRNA-seq) integration analyses. One possible solution to this challenge is to embed and align spatial data points to a fixed topological space using machine learning frameworks, such as optimal transport. Another possibility is to use H&E images as a common reference for all samples to make the embedding tissue-aware. Fourth, the SpaGFT implementation on CODEX images relies on expert knowledge to pre-select functional regions. A future direction for analyzing multiplexed images is to develop a topological learning framework to automatically detect and segment functional objects based on SpaGFT feature representation. Overall, we believe the value of our study is to bring an alternative view for explainable artificial intelligence in spatial omics modeling, including multi-resolution spatial omics data integration and pattern analysis across spatiotemporal data 13.

We introduce the Spatial Graph Fourier Transform (SpaGFT) to represent spatial omics features. The core concept of SpaGFT is to transform spatial omics features into Fourier coefficients (FCs) for downstream analyses, such as SVG identification, expression signal enhancement, and topological regularization for other machine-learning algorithms. The SpaGFT framework provides the graph signal transform and seven downstream tasks: SVG identification, gene expression imputation, protein signal denoising, spatial domain characterization, cell type annotation, cell-spot alignment, and subcellular landmark inference. The detailed theoretical foundation of k-bandlimited signal recognition can be found in Supplementary Note 1.

Graph signal transform

K-nearest neighbor (KNN) graph construction

Given a gene expression matrix containing n spots, including their spatial coordinates, and m genes, SpaGFT first calculates the Euclidean distance between each pair of spots based on their spatial coordinates. Next, an undirected graph \(G=\left(V,\,E\right)\) is constructed, where \(V=\{{v}_{1},\,{v}_{2},\ldots,\,{v}_{n}\}\) is the node set corresponding to the n spots, and E is the edge set, in which an edge \({e}_{{ij}}\) between \({v}_{i}\) and \({v}_{j}\) exists in \(E\) if and only if \({v}_{i}\) is among the K nearest neighbors of \({v}_{j}\) or \({v}_{j}\) is among the K nearest neighbors of \({v}_{i}\) based on Euclidean distance, where \(i,\,{j}=1,\,2,\,\ldots,{n}\) and \(i\,\ne\, j\). Based on the benchmarking results in Supplementary Data 4, the default K is defined as \(1\times \sqrt{n}\) among \(0.5\times \sqrt{n}\), \(1\times \sqrt{n}\), \(1.5\times \sqrt{n}\), and \(2\times \sqrt{n}\). Note that all the notations of matrices and vectors are bolded, and all vectors are treated as column vectors in the following description. An adjacency binary matrix \({{{\bf{A}}}}=({a}_{{ij}})\) with rows and columns corresponding to the n spots is defined as:

\({a}_{ij}=\left\{\begin{array}{ll}1, & {{{\rm{if}}}}\;{e}_{ij}\in E\\ 0, & {{{\rm{otherwise}}}}\end{array}\right.\)

A diagonal degree matrix \({{{\bf{D}}}}={{{\rm{diag}}}}(d_{1},\,{d}_{2},\,\ldots,\,{d}_{n})\) is then defined, where \({d}_{i}={\sum}_{j=1}^{n}{a}_{ij}\) represents the degree of \({v}_{i}\).
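A minimal sketch of this construction (using scikit-learn's kneighbors_graph; names are illustrative):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_spot_graph(coords):
    """Symmetric KNN graph with adjacency A and degree matrix D.

    coords : (n_spots, 2) spatial coordinates; default K = sqrt(n).
    """
    n = coords.shape[0]
    k = int(np.sqrt(n))
    knn = kneighbors_graph(coords, n_neighbors=k, mode="connectivity")
    # e_ij exists if v_i is a KNN of v_j OR v_j is a KNN of v_i.
    A = np.asarray(((knn + knn.T) > 0).todense(), dtype=float)
    D = np.diag(A.sum(axis=1))
    return A, D
```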

Fourier mode calculation

Using matrices \({{{\bf{A}}}}\) and \({{{\bf{D}}}}\), a Laplacian matrix \({{{\bf{L}}}}\) can be obtained by

\({{{\bf{L}}}}={{{\bf{D}}}}-{{{\bf{A}}}}\)

\({{{\bf{L}}}}\) can be decomposed using spectral decomposition:

\({{{\bf{L}}}}={{{\bf{U}}}}{{{\mathbf{\Lambda }}}}{{{{\bf{U}}}}}^{{{{\rm{T}}}}}\)

where the diagonal elements of \({{{\bf{\Lambda }}}}\) are the eigenvalues of \({{{\bf{L}}}}\) with \({\lambda }_{1}\le {\lambda }_{2}\le \ldots \le {\lambda }_{n}\), where \({\lambda }_{1}\) is always equal to 0 regardless of graph topology. Thus, \({\lambda }_{1}\) is excluded from the following analysis. The columns of \({{{\bf{U}}}}\) are the unit eigenvectors of \({{{\bf{L}}}}\). \({{{{\boldsymbol{\mu }}}}}_{{k}}\) is the k-th Fourier mode (FM), \({{{{\boldsymbol{\mu }}}}}_{{k}}\in {{\mathbb{R}}}^{n}\), \(k=1,\,2,\,\ldots,{n}\), and the set \(\{{{{{\boldsymbol{\mu }}}}}_{1},\,{{{{\boldsymbol{\mu }}}}}_{2},\ldots,\,{{{{\boldsymbol{\mu }}}}}_{n}\}\) is an orthogonal basis of the linear space. For \({{{{\boldsymbol{\mu }}}}}_{{k}}=\left({\mu }_{k}^{1},\,{\mu }_{k}^{2},\,\ldots,\,{\mu }_{k}^{n}\right)\), where \({\mu }_{k}^{i}\) indicates the value of the k-th FM on node \({v}_{i}\), the smoothness of \({{{{\boldsymbol{\mu }}}}}_{{k}}\) reflects the total variation of the k-th FM across all mutually adjacent spots, which can be formulated as

\({{{\rm{smoothness}}}}({{{{\boldsymbol{\mu }}}}}_{{k}})=\frac{1}{2}{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}{a}_{ij}{\left({\mu }_{k}^{i}-{\mu }_{k}^{j}\right)}^{2}\)

This form can be derived by matrix multiplication as

\({{{\rm{smoothness}}}}({{{{\boldsymbol{\mu }}}}}_{{k}})={{{{\boldsymbol{\mu }}}}}_{{{{\bf{k}}}}}^{{{{\rm{T}}}}}{{{\bf{L}}}}{{{{\boldsymbol{\mu }}}}}_{{{{\bf{k}}}}}={\lambda }_{k}\)

where \({{{{\mathbf{\mu }}}}}_{{{{\bf{k}}}}}^{{{{\rm{T}}}}}\) is the transpose of μ k . According to the definition of smoothness, if an eigenvector corresponds to a small eigenvalue, it indicates the variation of FM values on adjacent nodes is low. The increasing trend of eigenvalues corresponds to an increasing trend of oscillations of eigenvectors; hence, the eigenvalues and eigenvectors of L are used as frequencies and FMs in our SpaGFT, respectively. Intuitively, a small eigenvalue corresponds to a low-frequency FM, while a large eigenvalue corresponds to a high-frequency FM.

Graph Fourier transform

The graph signal of a gene g is defined as \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}=\left({f}_{{g}}^{1},\,{f}_{{g}}^{2},\,\ldots,\,{f}_{{g}}^{n}\right)\in {{\mathbb{R}}}^{n}\), an n-dimensional vector representing the gene expression values across the n spots. The graph signal \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}\) is transformed into Fourier coefficients \({\hat{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}\) by

\({\hat{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}={{{{\bf{U}}}}}^{{{{\rm{T}}}}}{{{{\bf{f}}}}}_{{{{\bf{g}}}}}=\left({\hat{f}}_{g}^{1},\,{\hat{f}}_{g}^{2},\,\ldots,\,{\hat{f}}_{g}^{n}\right)\)

In such a way, \({{\hat{f}}_{{g}}^{{k}}}\) is the projection of \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}\) on FM \({{{{\boldsymbol{\mu }}}}}_{{k}}\), representing the contribution of FM \({{{{\boldsymbol{\mu }}}}}_{{k}}\) to the graph signal \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}\), where \(k=1,\,2,\,\ldots,{n}\) indexes the FCs. This Fourier transform harmonizes gene expression and its spatial distribution to represent gene g in the frequency domain. The details of SVG identification using \({\hat{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}\) can be found below.
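Putting the last three subsections together, a minimal sketch of the transform is:

```python
import numpy as np

def graph_fourier_transform(A, D, f_g):
    """Eigendecompose L = D - A and project one gene signal onto the FMs.

    f_g : (n_spots,) expression of gene g across spots.
    Returns frequencies (eigenvalues), FMs (eigenvectors), and FCs.
    """
    L = D - A
    lam, U = np.linalg.eigh(L)  # ascending eigenvalues; lam[0] ~ 0
    fc = U.T @ f_g              # hat{f}_g = U^T f_g
    return lam, U, fc
```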

SVG identification

GFTscore definition

We designed the GFTscore to quantitatively measure the randomness of a gene's expression distribution in the spatial domain, defined as

\({{{\rm{GFTscore}}}}\left({{{{\bf{f}}}}}_{{{{\bf{g}}}}}\right)={\sum }_{k=2}^{n}{{\rm{e}}}^{-{\lambda }_{k}}\,{\widetilde{f}}_{g}^{k}\)

where \({\lambda }_{k}\) is the pre-calculated eigenvalue of L, and \({{\rm {e}}}^{-{\lambda }_{k}}\) is used to weigh \({\widetilde{f}}_{g}^{k}\) to further enhance the smoothness of the spatial omics variation signal and reduce its noisy components (Supplementary Note 1, S2.3) 18, 67. The normalized Fourier coefficient \({\widetilde{f}}_{g}^{k}\) is defined as

\({\widetilde{f}}_{g}^{k}=\frac{\left|{\hat{f}}_{g}^{k}\right|}{{\sum }_{k^{\prime}=2}^{n}\left|{\hat{f}}_{g}^{k^{\prime}}\right|}\)

A gene with a high GFTscore tends to be an SVG, and vice versa. Therefore, all m genes are ranked in decreasing order of GFTscore, and these GFTscore values are denoted as \({y}_{1}\ge {y}_{2}\ge \ldots \ge {y}_{m}\). To determine the cutoff \(y_z\) that distinguishes SVGs from non-SVGs based on GFTscore, we applied the Kneedle algorithm 68 to search for the inflection point of the GFTscore curve described in Supplementary Note 1.
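A minimal sketch of the score and cutoff (the L1 normalization of absolute FCs is our assumption for the "normalized" coefficients, and the helper names are illustrative):

```python
import numpy as np
from kneed import KneeLocator

def gft_score(lam, fc):
    """GFTscore of one gene from eigenvalues lam and its FCs.

    lam, fc : length-n arrays; index 0 corresponds to lambda_1 = 0,
    which is excluded as described above.
    """
    fc_abs = np.abs(fc[1:])
    fc_norm = fc_abs / (fc_abs.sum() + 1e-12)   # assumed L1 normalization
    return float(np.sum(np.exp(-lam[1:]) * fc_norm))

def svg_cutoff(scores, S=6):
    """Kneedle inflection point y_z on the decreasingly sorted scores."""
    y = np.sort(scores)[::-1]
    x = np.arange(1, len(y) + 1)
    knee = KneeLocator(x, y, S=S, curve="convex", direction="decreasing")
    return y[knee.knee - 1]  # assumes a knee is found
```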

Wilcoxon rank-sum test implementation for determining SVGs

Although the above GFTscore is an indicator to rank and evaluate potential SVGs, a rigorous statistical test is needed to calculate the p-value for SVGs and control the type I error. First, SpaGFT determines the low-frequency FMs and high-frequency FMs (and their corresponding FCs) by applying the Kneedle algorithm to the eigenvalues of L. The inflection points are used for determining the low-frequency FMs and high-frequency FMs when the direction parameters are ‘increasing’ and ‘decreasing’, respectively. Second, the Wilcoxon rank-sum test is utilized to test the difference between low-frequency FCs and high-frequency FCs to obtain statistical significance. If a gene has a high GFTscore and a significant adjusted p-value, the gene can be regarded as an SVG. We use \(f=({f}_{1},\,{f}_{2},\ldots,{f}_{n})\) to represent the expression of a random signal on the n spots. If the gene corresponding to the graph signal is a non-SVG, the gene expressions on neighboring spots are independent; otherwise, they will exhibit spatial dependence. Hence, we can assume that \(({f}_{1},\ldots,\,{f}_{n}) \sim N({\mu }_{f},\,{\sigma }_{f}^{2}I)\), similar to SpatialDE 11, where \({\mu }_{f}\), \({\sigma }_{f}^{2}\), and I are the mean, variance, and identity matrix, respectively. In this case, each \({f}_{i}\) follows a Gaussian distribution, independent and identically distributed. By implementing GFT on \(({f}_{1},\,{f}_{2},\ldots,\,{f}_{n})\), we obtain the Fourier coefficients \({{F{C}}}_{1},\,{{{F}{C}}}_{2},\,\cdots,{{{F}{C}}}_{p}\), where \(p\) is the number of low-frequency FCs, reflecting the contributions from low-frequency FMs, and \({{F{C}}}_{p+1},{{{F}{C}}}_{p+2},\,\cdots,{{{F}{C}}}_{p+q}\), where \(q\) is the number of high-frequency FCs, reflecting the contributions from noise. Hence, we form the null hypothesis that no difference exists between low-frequency FCs and high-frequency FCs (the proof can be found in Section S3 of Supplementary Note 1). Accordingly, a non-parametric test (i.e., the Wilcoxon rank-sum test) is used for testing the difference between the median values of low-frequency FCs and high-frequency FCs. Specifically, the null hypothesis is that the median of the low-frequency FCs of an SVG is equal to or lower than the median of its high-frequency FCs; the alternative hypothesis is that the median of the low-frequency FCs is higher than the median of the high-frequency FCs. The p-value of each gene is calculated based on the one-sided Wilcoxon rank-sum test and then adjusted using the false discovery rate (FDR) method. Eventually, a gene with a GFTscore higher than \({y}_{z}\) and an adjusted p-value less than 0.05 is considered an SVG.
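A minimal sketch of the per-gene test and FDR adjustment (names are illustrative):

```python
import numpy as np
from scipy.stats import ranksums
from statsmodels.stats.multitest import multipletests

def svg_pvalues(fc_matrix, n_low, n_high):
    """One-sided Wilcoxon rank-sum tests per gene.

    fc_matrix : (n_genes, n_fcs) absolute FCs ordered by frequency.
    n_low     : number of low-frequency FCs (p above).
    n_high    : number of high-frequency FCs (q above).
    H0: median(low-frequency FCs) <= median(high-frequency FCs).
    """
    pvals = [
        ranksums(row[:n_low], row[-n_high:], alternative="greater").pvalue
        for row in fc_matrix
    ]
    # Benjamini-Hochberg FDR adjustment across genes; SVGs also
    # require a GFTscore above the Kneedle cutoff y_z.
    return multipletests(pvals, method="fdr_bh")[1]
```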

Benchmarking data setup

Dataset description

Thirty-two spatial transcriptome datasets were collected from the public domain, including 30 10X Visium datasets (18 human brain datasets, 11 mouse brain datasets, and one human lymph node dataset) and two Slide-seqV2 datasets (mouse brain). More details can be found in Supplementary Data 1. These samples were sequenced by two different SRT technologies: 10X Visium measures ~55 μm diameter per spot, and Slide-seqV2 measures ~10 μm diameter per spot. Three datasets were selected as the training sets for grid-search parameter optimization in SpaGFT: the two highest read-depth datasets in Visium (HE-coronal) and Slide-seqV2 (Puck-200115-08), and one signature dataset from Maynard's study 24. The remaining 28 datasets (excluding the lymph node data) were used as independent test datasets.

Data preprocessing

For all 32 datasets, we adopted the same preprocessing steps based on squidpy (version 1.2.1), including filtering out genes expressed in fewer than 10 spots, normalizing the raw count matrix by the counts-per-million-reads method, and applying a log-transformation to the normalized count matrix. No specific preprocessing step was performed on the spatial location data.
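In scanpy terms (which squidpy builds on), these steps correspond to the following sketch:

```python
import scanpy as sc

def preprocess(adata):
    """Shared preprocessing for all 32 datasets.

    Filters genes expressed in fewer than 10 spots, normalizes to
    counts per million (CPM), and log-transforms the result.
    """
    sc.pp.filter_genes(adata, min_cells=10)
    sc.pp.normalize_total(adata, target_sum=1e6)  # CPM
    sc.pp.log1p(adata)
    return adata
```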

Benchmarking SVG collection

We collected SVG candidates from five publications 24, 25, 26, 27, 28, with data from either human or mouse brain subregions. (i) A total of 130 layer signature genes were collected from Maynard's study 24. These genes are potential multiple-layer markers validated in the human dorsolateral prefrontal cortex region. (ii) A total of 397 cell-type-specific (CTS) genes in the adult mouse cortex were collected from Tasic's study (2016 version) 28. The authors performed scRNA-seq on the dissected target region, identified 49 cell types, and constructed a cellular taxonomy of the primary visual cortex in the adult mouse. (iii) A total of 182 CTS genes in the mouse neocortex were collected from Tasic's study 27. Altogether, 133 cell types were identified from multiple cortical areas at single-cell resolution. (iv) A total of 260 signature genes across different major regions of the adult mouse brain were collected from Ortiz's study 25. The authors utilized spatial transcriptomics data to systematically profile subregions and deliver the subregional genes using consecutive coronal tissue sections. (v) A total of 86 signature genes in the cortical region shared by humans and mice were collected from Hodge's study 26. Collectively, a total of 849 genes were obtained, among which 153 genes were documented by multiple papers. More details, such as gene names, targeted regions, and sources, can be found in Supplementary Data 2.

Next, the above 849 genes were manually validated against the in-situ hybridization (ISH) database deployed on the Allen Brain Atlas (https://mouse.brain-map.org/). The ISH database provides ISH mouse brain data across 12 anatomical structures (i.e., Isocortex, Olfactory area, Hippocampal formation, Cortical subplate, Striatum, Pallidum, Thalamus, Hypothalamus, Midbrain, Pons, Medulla, and Cerebellum). We filtered the 849 genes as follows: (i) If a gene was showcased in multiple anatomical plane experiments (i.e., coronal plane and sagittal plane), it was counted multiple times with different expressions in the corresponding experiments, such that 1327 gene entries were archived (Supplementary Data 3). (ii) All 1327 genes were first filtered by low gene expression (cutoff of 1.0), and the FindVariableFeatures function (“vst” method) in Seurat (v4.0.5) was used for identifying highly variable genes across the twelve anatomical structures. Eventually, 458 genes were kept and considered as curated benchmarking SVGs. The evaluation criteria can be found in Supplementary Note 2.

Statistics and reproducibility

In our benchmarking experiment, we implemented a two-sided Wilcoxon rank-sum test to conduct the significance test. No data were excluded from the analyses. The experiments were not randomized; randomization is not relevant to this study since each dataset was analyzed separately. We then computed the key evaluation metrics, including the Jaccard index, odds ratio, precision, recall, F1 score, Tversky index, Moran's I, and Geary's C.

SpaGFT implementation and grid search of parameter optimization

A grid search was set up to test two parameters: ratio_neighbors (0.5, 1, 1.5, 2) for KNN selection and S (4, 5, 6, 8) for the inflection point coefficient, resulting in 16 parameter combinations. We set \(K=\sqrt{n\,}\) as the default parameter for constructing the KNN graphs in SpaGFT. SVGs were determined as genes with a high GFTscore via the KneeLocator function (curve=’convex’, direction=’decreasing’, and S = 6) in the kneed package (version 0.7.0) and an FDR cutoff of less than 0.05.

Parameter setting of other tools

(i) SpatialDE (version 1.1.3) is a method for identifying and describing SVGs based on Gaussian process regression used in geostatistics. SpatialDE consists of four steps: establishing the SpatialDE model, predicting statistical significance, selecting the model, and performing automatic expression histology. We selected two key parameters for tuning: design_formula (‘0’ and ‘1’) in the NaiveDE.regress_out function and kernel_space (“{‘SE’:[5.,25.,50.],‘const’:0}”, “{‘SE’:[6.,16.,36.],‘const’:0}”, “{‘SE’:[7.,47.,57.],‘const’:0}”, “{‘SE’:[4.,34.,64.],‘const’:0}”, “{‘PER’:[5.,25.,50.],‘const’:0}”, “{‘PER’:[6.,16.,36.],‘const’:0}”, “{‘PER’:[7.,47.,57.],‘const’:0}”, “{‘PER’:[4.,34.,64.],‘const’:0}”, and “{‘linear’:0,‘const’:0}”) in the SpatialDE.run function, resulting in 18 parameter combinations.

(ii) SPARK (version 1.1.1) is a statistical method for spatial count data analysis through generalized linear spatial models. Relying on statistical hypothesis testing, SPARK identifies SVGs via predefined kernels. First, the raw counts and spatial coordinates of spots were used to create the SPARK object, filtering low-quality spots (controlled by min_total_counts) or genes (controlled by percentage). The object was then used to fit the count-based spatial model and estimate the parameters via the spark.vc function, which is affected by the number of iterations (fit.maxiter) and models (fit.model). Lastly, the spark.test function was run to test multiple kernel matrices and obtain the results. We selected four key parameters for tuning: percentage (0.05, 0.1, 0.15) and min_total_counts (10, 100, 500) in the CreateSPARKObject function, and fit.maxiter (300, 500, 700) and fit.model (“poisson”, “gaussian”) in the spark.vc function, resulting in 54 parameter combinations.

(iii) SPARK-X (version 1.1.1) is a non-parametric method that tests whether the expression level of the gene displays any spatial expression pattern via a general class of covariance tests. We selected three key parameters, percentage (0.05, 0.1, 0.15), min_total_counts (10, 100, 500) in the CreateSPARKObject function, and option (“single”, “mixture”) in the sparkx function for parameter tuning, resulting in 18 parameter combinations.

(iv) SpaGCN (version 1.2.0) is a graph convolutional network approach that integrates gene expression, spatial location, and histology in spatial transcriptomics data analysis. SpaGCN consists of four steps: integrating data into a graph, setting the graph convolutional layer, detecting spatial domains by clustering, and identifying SVGs in spatial domains. We selected two parameters for tuning: the ratio (1/3, 1/2, 2/3, and 5/6) in the find_neighbor_cluster function and res (0.8, 0.9, 1.0, 1.1, and 1.2) in the SpaGCN.train function, resulting in 20 parameter combinations.

(v) MERINGUE (version 1.0) is a computational framework based on spatial autocorrelation and cross-correlation analysis. It is composed of three major steps to identify SVGs. Firstly, Voronoi tessellation was utilized to partition the graph to reflect the length scale of cellular density. Secondly, the adjacency matrix is defined using geodesic distance and the partitioned graph. Finally, gene-wise autocorrelation (e.g., Moran’s I) is conducted, and a permutation test is performed for significance calculation. We selected min.read (100, 500, 1000), min.lib.size (100, 500, 1000) in the cleanCounts function and filterDist (1.5, 2.5, 3.5, 7.5, 12.5, 15.5) in the getSpatialNeighbors function for parameter tuning, resulting in 54 parameter combinations.

(vi) scGCO (version 1.1.2) is a graph-cut approach that integrates gene expression and spatial location in spatial transcriptomics data analysis. scGCO consists of four steps: representing a gene's spatial expression with a hidden Markov random field (HMRF), optimizing the HMRF with graph cuts under varying hyperparameters, identifying the best graph cuts, and calculating the significance of putative SVGs. We selected three parameters for tuning: unary_scale_factor (50, 100, and 150) and smooth_factor (5, 10, and 15) in the identify_spatial_genes function, and fdr_cutoff (0.025, 0.05, and 0.075) in the final SVG identification pipeline, resulting in 27 parameter combinations.

Visualization of frequency signal of SVGs in PCA and UMAP

The mouse brain sample (i.e., the HE-coronal sample) with 2702 spots was used to demonstrate how FCs distinguish SVGs from non-SVGs in the 2D UMAP space. SpaGFT determined 207 low-frequency FMs using the Kneedle algorithm and computed the corresponding FCs. PCA was also used to produce a low-dimensional representation. The transposed and normalized expression matrix was decomposed using the sc.tl.pca function from the scanpy package (version 1.9.1). The first 207 principal components (PCs) were selected for UMAP dimension reduction and visualization. The function sc.tl.umap was applied to conduct UMAP dimension reduction for the FCs and PCs.
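A minimal sketch of the shared embedding step (the neighbor settings are scanpy defaults, an illustrative choice):

```python
import anndata as ad
import scanpy as sc

def umap_embedding(features):
    """UMAP of a gene-by-feature matrix (e.g., 207 FCs or 207 PCs)."""
    adata = ad.AnnData(features)
    sc.pp.neighbors(adata, use_rep="X")  # neighbors on the raw features
    sc.tl.umap(adata)
    return adata.obsm["X_umap"]          # (n_genes, 2) coordinates
```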

SVG signal enhancement

An SVG may suffer from low expression or dropout issues due to technical bias 8. To solve this problem, SpaGFT implements a low-pass filter to enhance SVG expression. For an SVG with an observed expression value \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}\in {{\mathbb{R}}}^{n}\), we define \({\bar{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}\in {{\mathbb{R}}}^{n}\) as the expected gene expression value of this SVG, with \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}{{{\boldsymbol{=}}}}{\bar{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}{{{\boldsymbol{+}}}}{{{{\boldsymbol{\epsilon }}}}}_{{{{\bf{g}}}}}\), where \({{{{\boldsymbol{\epsilon }}}}}_{{{{\boldsymbol{g}}}}}\in {{\mathbb{R}}}^{n}\) represents noise. SpaGFT estimates an approximation \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}^{{{{\boldsymbol{\star }}}}}\) to the expected gene expression \({\bar{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}\), resisting the noise \({{{{\boldsymbol{\epsilon }}}}}_{{{{\boldsymbol{g}}}}}\). The approximation has two requirements: (i) the estimated gene expression after enhancement should be similar to the originally measured gene expression, and (ii) the estimated gene expression should have low variation across adjacent spots to avoid introducing noise. Therefore, the following optimization problem is proposed to find an optimal solution \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}^{{{{\boldsymbol{\star }}}}}\) for \({\bar{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}\):

\({{{{\bf{f}}}}}_{{{{\bf{g}}}}}^{{{{\boldsymbol{\star }}}}}={{{{\rm{argmin}}}}}_{{{{\bf{f}}}}}\left({\left\Vert {{{\bf{f}}}}-{{{{\bf{f}}}}}_{{{{\bf{g}}}}}\right\Vert }^{2}+c\,{{{{\bf{f}}}}}^{{{{\rm{T}}}}}{{{\bf{L}}}}{{{\bf{f}}}}\right),\qquad {{{{\bf{f}}}}}^{{{{\rm{T}}}}}{{{\bf{L}}}}{{{\bf{f}}}}=\frac{1}{2}{\sum }_{i,j}{a}_{ij}{\left({f}^{i}-{f}^{j}\right)}^{2}\)

where || ∙ || is the \(L2\)-norm, \({{{\bf{f}}}}=\left({f}^{1},\,{f}^{2},\,\ldots,\,{f}^{n}\right)\in {{\mathbb{R}}}^{n}\) is the variable in the solution space, \(i,\,{j}=1,\,2,\,\ldots,{n}\), and \(c\) is a coefficient determining the importance of the variation of the estimated signals, with \(c\, > \,0\). According to convex optimization, the optimal solution \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}^{\star}\) can be formulated as

\({{{{\bf{f}}}}}_{{{{\bf{g}}}}}^{\star }={\left({{{\bf{I}}}}+c{{{\bf{L}}}}\right)}^{-1}{{{{\bf{f}}}}}_{{{{\bf{g}}}}}={{{\bf{U}}}}{\left({{{\bf{I}}}}+c{{{\mathbf{\Lambda }}}}\right)}^{-1}{{{{\bf{U}}}}}^{{{{\rm{T}}}}}{{{{\bf{f}}}}}_{{{{\bf{g}}}}}={{{\bf{U}}}}{\left({{{\bf{I}}}}+c{{{\mathbf{\Lambda }}}}\right)}^{-1}{\hat{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}\)

where \({{{\mathbf{\Lambda }}}}={{{\rm{diag}}}}\left({\lambda }_{1},\,{\lambda }_{2},\,\ldots,\,{\lambda }_{n}\right)\), and I is an identity matrix. \({\left({{{\bf{I}}}}+c{{{\mathbf{\Lambda }}}}\right)}^{-1}\) is the low-pass filter, and \({\left({{{\bf{I}}}}+c{{{\mathbf{\Lambda }}}}\right)}^{-1}{\hat{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}\) are the enhanced FCs. \({{{{\bf{f}}}}}_{{{{\bf{g}}}}}^{\star}={{{{\bf{U}}}}\left({{{\bf{I}}}}+c{{{\mathbf{\Lambda }}}}\right)}^{-1}{\hat{{{{\bf{f}}}}}}_{{{{\bf{g}}}}}\) represents the enhanced SVG expression obtained via the inverse graph Fourier transform (iGFT). Specifically, in the HE-coronal mouse brain data analysis, we selected 1300 (\(=25\sqrt{n}\), \({n}=2702\)) low-frequency FCs for enhancing the signal and recovering the spatial pattern using iGFT with \(c=0.0001\).
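A minimal sketch of the enhancement (the optional truncation to the first n_low FCs mirrors the 1300-FC choice above):

```python
import numpy as np

def low_pass_enhance(U, lam, f_g, c=1e-4, n_low=None):
    """Enhanced signal f* = U (I + c*Lambda)^{-1} U^T f_g."""
    fc = U.T @ f_g
    fc_filtered = fc / (1.0 + c * lam)  # diagonal low-pass filter
    if n_low is not None:
        fc_filtered[n_low:] = 0.0       # keep only low-frequency FCs
    return U @ fc_filtered              # inverse GFT
```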

Data preprocessing on the human prostate cancer Visium data

Cell segmentation

The Visium image of human prostate cancer (adenocarcinoma with invasive carcinoma) from the 10X official website was cropped into patches according to the spot center coordinates and diameter length. Each patch was processed by Cellpose for nuclei segmentation using the default parameters. Cell density in each patch was determined using the number of segmented cells.

Microbial alignment

Following the tutorial 69, the corresponding BAM files were processed via the Kraken package by (1) removing host sequences and retaining microbial reads, (2) assigning microbial reads to taxonomic categories (e.g., species and genus), and (3) computing the relative abundance of different species in each spot.

SVG signal enhancement benchmarking

Sixteen human brain datasets with well-annotated labels were used for enhancement benchmarking 23, 24. Samples 151510, 151672, and 151673 were used for the grid search; the other 13 datasets were used for independent tests. SpaGFT transforms graph signals to FCs and applies the corresponding processing in the frequency domain to realize signal enhancement of genes. Briefly, it is composed of three major steps. Firstly, SpaGFT is implemented to obtain FCs. Secondly, a low-pass filter is applied to weigh and recalculate the FCs. Lastly, SpaGFT implements iGFT to recover the enhanced FCs to graph signals. We selected c (0.003, 0.005, 0.007) and ratio_fms (13, 15, 17) in the low_pass_enhancement function, resulting in 9 parameter combinations; c = 0.005 and ratio_fms = 15 were selected for the independent test. The parameters used for the other computational tools are detailed as follows.

SAVER-X (version 1.0.2) is designed to improve data quality by extracting gene–gene relationships with a deep autoencoder and a Bayesian model simultaneously. SAVER-X is roughly composed of three major steps. Firstly, the target data are trained with an autoencoder without a chosen pretraining model. Secondly, unpredictable genes are filtered using cross-validation. Lastly, the final denoised values are estimated with empirical Bayesian shrinkage. Two parameters were considered to explore the performance and robustness of SAVER-X: batch_size (32, 64, 128) in the saverx function and fold (4, 6, 8) in the autoFilterCV function, resulting in 9 parameter combinations.

Sprod (version 1.0) is a computational framework based on latent graph learning of matched location and imaging data, leveraging information from the physical locations of sequencing to impute accurate SRT gene expression. The framework of Sprod can be roughly divided into two major steps: building a graph and optimizing an objective function for that graph to obtain the denoised gene expression matrix. To validate its robustness, two parameters were adjusted, sprod_R (0.1, 0.5) and sprod_latent_dim (8, 10, 12), generating six parameter combinations.

DCA (version 0.3.1) is a deep count autoencoder network with specialized loss functions targeted to denoise scRNA-seq datasets. It uses the autoencoder framework to estimate three parameters \(({{{\rm{\mu }}}},{{{\rm{\theta }}}},{{{\rm{\pi }}}})\) of zero-inflated negative binomial distribution conditioned on the input data for each gene. In particular, the autoencoder gives three output layers, representing for each gene the three parameters that make up the gene-specific loss function to compare to the original input of this gene. Finally, the mean \(({{{\rm{\mu }}}})\) of the negative binomial distribution represents denoised data as the main output. We set neurons of all hidden layers except for the bottleneck to (48, 64, 80) and neurons of bottleneck to (24, 32, 40) for parameter tuning, resulting in 9 parameter combinations.

MAGIC (version 3.0.0) is a method that shares information across similar cells via data diffusion to denoise the cell count matrix and fill in missing transcripts. It is composed of two major steps. Firstly, it builds its affinity matrix in four steps: a data preprocessing step, converting distances to affinities using an adaptive Gaussian kernel, converting the affinity matrix A into a Markov transition matrix M, and data diffusion through exponentiation of M. Once the affinity matrix is constructed, the imputation step of MAGIC involves sharing information between cells in the resulting neighborhoods through matrix multiplication. We applied the knn settings (3, 5, 7) and the level of diffusion (2, 3, 4) in the MAGIC initialization function for parameter tuning, resulting in 9 parameter combinations.

scVI (version 0.17.3) is a hierarchical Bayesian model based on a deep neural network, which is used for probabilistic representation and analysis of single-cell gene expression. It consists of two major steps. Firstly, the gene expression is compressed into a low-dimensional hidden space by the encoder, and then the hidden space is mapped to the posterior estimation of the gene expression distribution parameters by the neural network of the decoder. It uses random optimization and deep neural networks to gather information on similar cells and genes, approximates the distribution of observed expression values, and considers the batch effect and limited sensitivity for batch correction, visualization, clustering, and differential expression. We selected n_hidden (64, 128, 256) and gene_likelihood (‘zinb’, ‘nb’, ‘poisson’) in the model.SCVI function for parameter tuning, resulting in 9 parameter combinations.

netNMF-sc (version 0.0.1) is a network-regularized non-negative matrix factorization method designed for the imputation and dimensionality reduction of scRNA-seq analysis. It uses a prior gene network to obtain a more meaningful low-dimensional representation of genes; the network regularization uses prior knowledge of gene–gene interactions to encourage gene pairs with known interactions to lie close to each other in the low-dimensional representation. We selected d (8, 10, 12) and alpha (80, 100, 120) in the netNMFGD function for parameter tuning, resulting in 9 parameter combinations.

SVG clustering and FTU identification

The pipeline is visualized in Supplementary Fig. 6a. As the pattern of one SVG cluster can demonstrate specific functions of one FTU, an FTU does not necessarily display a clear boundary with its neighboring FTUs; on the contrary, overlapping regions indicating polyfunctionality are allowed. Computationally, the process of FTU identification optimizes the resolution parameter of the Louvain algorithm to obtain a certain number of biology-informed FTUs that minimize the overlapped area. Denote G' as the set of SVGs identified by SpaGFT. For each resolution parameter \({{{\rm{res}}}}\, > \,0\), G' can be partitioned into { \({G}_{1}^{{\prime} },\,{G}_{2}^{{\prime} },\,\ldots,\,{G}_{{n}_{{{\rm {res}}}}}^{{\prime} }\left.\right\}\) (i.e., \({\bigcup }_{k}{G}_{k}^{{\prime} }={G}^{{\prime} }\) and \({G}_{k}^{{\prime} }\bigcap {G}_{l}^{{\prime} }\,=\, \varnothing,\,\forall k\,\ne\,l\)) by applying the Louvain algorithm on the FCs, and the resolution is optimized by the loss function below. Denote \(X=({x}_{s,g})\in {{\mathbb{R}}}^{\left|S\right|\times \left|{G}^{{\prime} }\right|}\) as the gene expression matrix, where \(S\) is the set of all spots. In the following, for each SVG group \({G}_{k}^{{\prime} }\), \({{{\rm{pseudo}}}}({s}_{s,{G}_{k}^{{\prime} }})={\sum }_{g\in {G}_{k}^{{\prime} }}\log ({x}_{s,g})\) represents the pseudo-expression value 4 for spot \(s\). We apply the k-means algorithm with k = 2 on \(\{{{{\rm{pseudo}}}}\left({s}_{1,{G}_{k}^{{\prime} }}\right),{{{\rm{pseudo}}}}\left({s}_{2,{G}_{k}^{{\prime} }}\right),\,\ldots,\,{{{\rm{pseudo}}}}\left({s}_{\left|S\right|,{G}_{k}^{{\prime} }}\right)\}\) to pick out the spot cluster whose spots highly express genes in SVG group \({G}_{k}^{{\prime} }\); such a spot cluster is identified as an FTU, denoted as \({S}_{k}\subseteq S\). Our objective function aims to find the best partition of \({G}^{{\prime} }\) such that the average overlap between any two \({S}_{k},\,{S}_{l}\) is minimized:

\({{{{\rm{argmin}}}}}_{{{{res}}} > 0}\frac{2\times {\sum}_{k\ne l}\left|{S}_{k}\cap {S}_{l}\right|}{{n}_{{{{res}}}}\times ({n}_{{{{res}}}}-1)}\)
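A minimal sketch of evaluating this loss for one partition (the epsilon guard inside the log and the helper names are our illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def ftu_overlap_loss(X, svg_groups):
    """Average pairwise FTU overlap for one Louvain partition of G'.

    X          : (n_spots, n_svgs) expression matrix over SVGs.
    svg_groups : list of column-index arrays, one per SVG group G'_k.
    """
    ftus = []
    for idx in svg_groups:
        pseudo = np.log(X[:, idx] + 1e-9).sum(axis=1)  # pseudo-expression
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(pseudo[:, None])
        means = [pseudo[labels == c].mean() for c in (0, 1)]
        ftus.append(labels == int(np.argmax(means)))   # high-expression spots
    n = len(ftus)
    overlap = sum(
        np.sum(ftus[k] & ftus[l]) for k in range(n) for l in range(k + 1, n)
    )
    return 2.0 * overlap / (n * (n - 1))  # assumes n >= 2 groups
```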

SpaGFT implementation on the lymph node Visium data and interpretation

Lymph node SVG cluster identification and FTU interpretation

SVGs were identified on the human lymph node data (Visium) with the default settings of SpaGFT. To demonstrate the relations between cell composition and annotated FTUs, cell2location 35 was implemented to deconvolute spots and resolve fine-grained cell types in the spatial transcriptomic data. Cell2location was first used to generate the spot-cell type proportion matrix as described above, resulting in proportions of 34 cell types. Then, pseudo-expression values across all spots for each FTU were computed using the method from the FTU identification section. Each element of the FTU-cell type correlation matrix was calculated as the Pearson correlation coefficient between the proportion of a cell type and the pseudo-expression of an FTU across all spots. The FTU-cell type correlation matrix was then obtained by calculating all elements as described above, with rows representing FTUs and columns representing cell types. Lastly, the FTU-cell type matrix was visualized on a heatmap, and three major FTUs in the lymph node were annotated, i.e., the T cell zone, GC, and B follicle.
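The correlation matrix itself reduces to a few lines (a sketch with illustrative names):

```python
import numpy as np

def ftu_celltype_correlation(pseudo_expr, cell_prop):
    """Pearson correlations between FTUs and cell types across spots.

    pseudo_expr : (n_ftus, n_spots) FTU pseudo-expression values.
    cell_prop   : (n_cell_types, n_spots) cell2location proportions.
    Returns an (n_ftus, n_cell_types) correlation matrix.
    """
    pe = pseudo_expr - pseudo_expr.mean(axis=1, keepdims=True)
    cp = cell_prop - cell_prop.mean(axis=1, keepdims=True)
    pe /= np.linalg.norm(pe, axis=1, keepdims=True) + 1e-12
    cp /= np.linalg.norm(cp, axis=1, keepdims=True) + 1e-12
    return pe @ cp.T
```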

Visualization of GC, T cell zone, and B follicles in the Barycentric coordinate system

The spot-cell type proportion matrix was used to select and merge the signature cell types of the GC, T cell zone, and B follicles to generate a merged spot-cell type proportion matrix (an N-by-3 matrix, where N is the number of spots). For the GC, B_Cycling, B_GC_DZ, B_GC_LZ, B_GC_prePB, FDC, and T_CD4_TfH_GC were selected as signature cell types. For the T cell zone, T_CD4, T_CD4_TfH, T_TfR, T_Treg, T_CD4_naive, and T_CD8_naive were selected as signature cell types. For the B follicle, B_mem, B_naive, and B_preGC were regarded as signature cell types. The merged spot-cell type proportion matrix was calculated by summing the proportions of the signature cell types for the GC, T cell zone, and B follicle, respectively. Finally, annotated spots (spot assignment in Supplementary Data 11) were selected from the merged spot-cell type proportion matrix for visualization. The subset of spots from the merged matrix was projected onto an equilateral triangle via the barycentric coordinate projection method 37. The projected spots were colored by the FTU assignment results. Unique and overlapping spots across seven regions (i.e., GC, GC–B, B, B–T, T, T–GC, and T–GC–B) from the three FTUs were assigned and visualized on the spatial map. Gene module scores were calculated using the AddModuleScore function from the Seurat (v4.0.5) package. The calculated gene module scores and cell type proportions were then grouped by the seven regions and visualized on the line plot (Fig. 3e, f). One-way ANOVA using the aov function in the R environment was conducted to test the difference among the means of the seven regions regarding gene module scores and cell type proportions, respectively.

CODEX tonsil tissue staining

An FFPE human tonsil tissue (provided by Dr. Scott Rodig, Brigham and Women's Hospital Department of Pathology) was sectioned onto a No. 1 glass coverslip (22 × 22 mm) pre-treated with Vectabond (SP-1800-7, Vector Labs). The tissue was deparaffinized by heating at 70 °C for 1 h and soaking in xylene 2× for 15 min each. The tissue was then rehydrated by incubating in the following sequence for 3 min each with gentle rocking: 100% EtOH twice, 95% EtOH twice, 80% EtOH once, 70% EtOH once, ddH2O thrice. To prepare for heat-induced antigen retrieval (HIER), a PT module (A80400012, Thermo Fisher) was filled with 1X PBS, with a coverslip jar containing 1X Dako pH 9 antigen retrieval buffer (S2375, Agilent) within. The PT module was then pre-warmed to 75 °C. After rehydration, the tissue was placed in the pre-warmed coverslip jar, then the PT module was heated to 97 °C for 20 min and cooled to 65 °C. The coverslip jar was then removed from the PT module and cooled for ~15–20 min at room temperature. The tissue was then washed in rehydration buffer (232105, Akoya Biosciences) twice for 2 min each, then incubated in CODEX staining buffer (232106, Akoya Biosciences) for 20 min while gently rocking. A hydrophobic barrier was then drawn on the perimeter of the coverslip with an ImmEdge Hydrophobic Barrier pen (310018, Vector Labs). The tissue was then transferred to a humidity chamber. The humidity chamber was made by filling an empty pipette tip box with paper towels and ddH2O, stacking the tip box on a cool box (432021, Corning) containing a −20 °C ice block, then replacing the tip box lid with a six-well plate lid. The tissue was then blocked with 200 μL of blocking buffer.

The blocking buffer was made with 180 μL BBDG block, 10 μL oligo block, and 10 μL sheared salmon sperm DNA; the BBDG block was prepared with 5% donkey serum, 0.1% Triton X-100, and 0.05% sodium azide prepared with 1X TBS IHC Wash buffer with Tween 20 (935B-09, Cell Marque); the oligo block was prepared by mixing 57 different custom oligos (IDT) in ddH2O with a final concentration of 0.5 μM per oligo; the sheared salmon sperm DNA was added from its 10 mg/ml stock (AM9680, Thermo Fisher). The tissue was blocked while photobleaching with a custom LED array for 2 h. The LED array was set up by inclining two Happy Lights (6460231, Best Buy) against both sides of the cool box and positioning an LED Grow Light (B07C68N7PC, Amazon) above. The temperature was monitored to ensure that it remained under 35 °C. The staining antibodies were then prepared during the 2-h block.

DNA-conjugated antibodies at appropriate concentrations were added to 100 μL of CODEX staining buffer, loaded into a 50-kDa centrifugal filter (UFC505096, Millipore) pre-wetted with CODEX staining buffer, and centrifuged at 12,500 × g for 8 min. Concentrated antibodies were then transferred to a 0.1 μm centrifugal filter (UFC30VV00, Millipore) pre-wetted with CODEX staining buffer, topped up with extra CODEX staining buffer to a total volume of 181 μL, supplemented with 4.75 μL each of Akoya blockers N (232108, Akoya), G (232109, Akoya), J (232110, Akoya), and S (232111, Akoya) to a total volume of 200 μL, then centrifuged for 2 min at 12,500 × g to remove antibody aggregates. The antibody flow-through (99 μL) was used to stain the tissue overnight at 4 °C in a humidity chamber covered with a foil-wrapped lid.

After the overnight antibody stain, the tissue was washed in CODEX staining buffer twice for 2 min each before fixing in 1.6% paraformaldehyde (PFA) for 10 min while gently rocking. The 1.6% PFA was prepared by diluting 16% PFA in CODEX storage buffer (232107, Akoya). After the 1.6% PFA fixation, the tissue was rinsed in 1X PBS twice and washed in 1X PBS for 2 min while gently rocking. The tissue was then incubated with cold (−20 °C) 100% methanol on ice for 5 min without rocking for further fixation and then washed thrice in 1X PBS as before while gently rocking. The final fixative solution was then prepared by mixing 20 μL of CODEX final fixative (232112, Akoya) in 1000 μL of 1X PBS. The tissue was then fixed with 200 μL of the final fixative solution at room temperature for 20 min in a humidity chamber. The tissue was then rinsed in 1X PBS and stored in 1X PBS at 4 °C prior to CODEX imaging.

A black flat bottom 96-well plate (07-200-762, Corning) was used to store the reporter oligonucleotides, with each well corresponding to an imaging cycle. Each well contained two fluorescent oligonucleotides (Cy3 and Cy5, 5 μL each) added to 240 μL of plate master mix containing DAPI nuclear stain (1:600) (7000003, Akoya) and CODEX assay reagent (0.5 mg/mL) (7000002, Akoya). For the first and last blank cycles, an additional plate buffer was used to substitute for each fluorescent oligonucleotide. The 96-well plate was securely sealed with aluminum film (14-222-342, Thermo Fisher) and kept at 4 °C prior to CODEX imaging.

CODEX antibody panel

The following antibodies, clones, and suppliers were used in this study:

BCL-2 (124, Novus Biologicals, 1:50), CCR6 (polyclonal, Novus Biologicals, 1:25), CD11b (EPR1344, Abcam, 1:50), CD11c (EP1347Y, Abcam, 1:50), CD15 (MMA, BD Biosciences, 1:200), CD16 (D1N9L, Cell Signaling Technology, 1:100), CD162 (HECA-452, Novus Biologicals, 1:200), CD163 (EDHu-1, Novus Biologicals, 1:200), CD2 (RPA-2.10, Biolegend, 1:25), CD20 (rIGEL/773, Novus Biologicals, 1:200), CD206 (polyclonal, R&D Systems, 1:100), CD25 (4C9, Cell Marque, 1:100), CD30 (BerH2, Cell Marque, 1:25), CD31 (C31.3 + C31.7 + C31.10, Novus Biologicals, 1:200), CD4 (EPR6855, Abcam, 1:100), CD44 (IM-7, Biolegend, 1:100), CD45 (B11 + PD7/26, Novus Biologicals, 1:400), CD45RA (HI100, Biolegend, 1:50), CD45RO (UCH-L1, Biolegend, 1:100), CD5 (UCHT2, Biolegend, 1:50), CD56 (MRQ-42, Cell Marque, 1:50), CD57 (HCD57, Biolegend, 1:200), CD68 (KP-1, Biolegend, 1:100), CD69 (polyclonal, R&D Systems, 1:200), CD7 (MRQ-56, Cell Marque, 1:100), CD8 (C8/144B, Novus Biologicals, 1:50), collagen IV (polyclonal, Abcam, 1:200), cytokeratin (C11, Biolegend, 1:200), EGFR (D38B1, Cell Signaling Technology, 1:25), FoxP3 (236A/E7, Abcam, 1:100), granzyme B (EPR20129-217, Abcam, 1:200), HLA-DR (EPR3692, Abcam, 1:200), IDO-1 (D5J4E, Cell Signaling Technology, 1:25), LAG-3 (D2G4O, Cell Signaling Technology, 1:25), mast cell tryptase (AA1, Abcam, 1:200), MMP-9 (L51/82, Biolegend, 1:200), MUC-1 (955, Novus Biologicals, 1:100), PD-1 (D4W2J, Cell Signaling Technology, 1:50), PD-L1 (E1L3N, Cell Signaling Technology, 1:50), podoplanin (D2-40, Biolegend, 1:200), T-bet (D6N8B, Cell Signaling Technology, 1:100), TCR β (G11, Santa Cruz Biotechnology, 1:100), TCR-γ/δ (H-41, Santa Cruz Biotechnology, 1:100), Tim-3 (polyclonal, Novus Biologicals, 1:50), Vimentin (RV202, BD Biosciences, 1:200), VISTA (D1L2G, Cell Signaling Technology, 1:50), α-SMA (polyclonal, Abcam, 1:200), and β-catenin (14, BD Biosciences, 1:50). Readers of interest are referred to publication 70 for more details on the antibody clones, conjugated fluorophores, exposure, and titers.

CODEX tonsil tissue imaging

The tonsil tissue coverslip and reporter plate were equilibrated to room temperature and placed on the CODEX microfluidics instrument. All buffer bottles were refilled (ddH2O, DMSO, 1X CODEX buffer (7000001, Akoya)), and the waste bottle was emptied before the run. To facilitate setting up the imaging areas and z planes, the tissue was stained with 750 μL of nuclear stain solution (1 μL of DAPI nuclear stain in 1500 μL of 1X CODEX buffer) for 3 min and then washed with the CODEX fluidics device. For each imaging cycle, three images corresponding to the DAPI, Cy3, and Cy5 channels were captured. The first and last blank imaging cycles did not contain any Cy3 or Cy5 oligonucleotides and were thus used for background correction.

CODEX imaging was performed using a ×20/0.75 objective (CFI Plan Apo λ, Nikon) mounted on an inverted fluorescence microscope (BZ-X810, Keyence) connected to a CODEX microfluidics instrument running the CODEX driver software (Akoya Biosciences). The acquired multiplexed images were stitched and background-corrected using the SINGER CODEX Processing Software (Akoya Biosciences). For this study, six independent 2048 × 2048 fields of view (FOVs) were cropped from the original 20,744 × 20,592 image. The FOVs were selected to include key cell types and tissue structures in tonsils, such as tonsillar crypts or lymphoid nodules.

Custom ImageJ macros were used to normalize and cap nuclear and surface image signals at the 99.7th percentile to facilitate cell segmentation. Cell segmentation was performed using a local implementation of Mesmer from the DeepCell library (deepcell-tf 0.11.0) 40 , where the multiplex_segmentation.py script was modified to adjust the segmentation resolution (microns per pixel, mpp). model_mpp = 0.5 generated satisfactory segmentation results for this study. Single-cell features based on the cell segmentation mask were then scaled to cell size and extracted as FCS files.
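
For illustration, a minimal sketch of resolution-adjusted Mesmer segmentation using the public deepcell-tf application interface is shown below; the placeholder images and the direct use of the Mesmer application object (rather than the modified multiplex_segmentation.py script used in this study) are assumptions for demonstration.

import numpy as np
from deepcell.applications import Mesmer

# Placeholder nuclear (DAPI) and surface-marker images; in practice these are
# the percentile-capped channels produced by the ImageJ macros described above.
nuclear = np.random.rand(512, 512)
membrane = np.random.rand(512, 512)

# Mesmer expects a 4D tensor (batch, y, x, channels) with the nuclear signal
# in channel 0 and the membrane/surface signal in channel 1.
img = np.stack([nuclear, membrane], axis=-1)[np.newaxis, ...].astype(np.float32)

app = Mesmer()
# image_mpp sets the segmentation resolution in microns per pixel,
# corresponding to model_mpp = 0.5 in this study.
mask = app.predict(img, image_mpp=0.5, compartment='whole-cell')
print(mask.shape)  # (1, 512, 512, 1): labeled whole-cell segmentation mask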

Cell clustering and annotation

Single-cell features were normalized to each FOV’s median DAPI signal to account for FOV signal variation, arcsinh transformed with cofactor = 150, capped between the 1st and 99th percentiles, and rescaled to 0–1. Sixteen markers (cytokeratin, podoplanin, CD31, αSMA, collagen IV, CD11b, CD11c, CD68, CD163, CD206, CD7, CD4, CD8, FoxP3, CD20, CD15) were used for unsupervised clustering with FlowSOM 41 (66 output clusters). The cell type of each cluster was annotated based on its relative feature expression, as determined via Marker Enrichment Modeling 42, and the annotated clusters were visually compared to the original images to ensure accuracy and specificity. Cells belonging to indeterminable clusters were further clustered (20 output clusters) and annotated as above.
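
A minimal sketch of the per-FOV feature normalization described above (DAPI-median scaling, arcsinh transform with cofactor 150, percentile capping, and 0–1 rescaling) might look as follows; the array names are hypothetical, and the FlowSOM clustering step itself is not shown.

import numpy as np

def normalize_features(features: np.ndarray, dapi: np.ndarray,
                       cofactor: float = 150.0) -> np.ndarray:
    """features: cells-by-markers array for one FOV; dapi: per-cell DAPI signal."""
    x = features / np.median(dapi)            # account for FOV signal variation
    x = np.arcsinh(x / cofactor)              # variance-stabilizing transform
    lo, hi = np.percentile(x, [1, 99], axis=0)
    x = np.clip(x, lo, hi)                    # cap extreme values per marker
    return (x - lo) / (hi - lo + 1e-12)       # rescale each marker to [0, 1]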

SpaGFT implementation on tonsil CODEX data and interpretation

Resizing CODEX images and SpaGFT implementation

As each FOV consisted of 2048 by 2048 pixels (~0.4 μm per pixel), the CODEX images were scaled down to 200 by 200 pixels (~3.2 μm per pixel) to reduce the computational burden (Supplementary Fig.  8a ). The original CODEX images (2048 by 2048 pixels) were resized to 200 by 200 pixels using the “resize” function with cubic interpolation from the imager package (v.42) in the R environment. SpaGFT was then applied to the resized images with default parameters.
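
An equivalent downsampling step in Python is sketched below; the study itself used imager::resize in R, so this scikit-image version is only an illustrative stand-in.

import numpy as np
from skimage.transform import resize

channel = np.random.rand(2048, 2048)   # placeholder CODEX channel (~0.4 um/pixel)
# order=3 selects cubic interpolation; the output is ~3.2 um/pixel.
small = resize(channel, (200, 200), order=3,
               anti_aliasing=True, preserve_range=True)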

Structural similarity (SSIM) calculation

The structural similarity (SSIM) score was a measure for locally evaluating the similarity between two images regardless of image size 71 . The SSIM score ranged from 0 to 1; a higher score means greater similarity between the two images. It was defined as follows:

\({{\rm{SSIM}}}\left(x,\,y\right)=l{\left(x,\,y\right)}^{\alpha }\cdot c{\left(x,\,y\right)}^{\beta }\cdot s{\left(x,\,y\right)}^{\gamma }\)

where x and y were windows of 8 by 8 pixels; \(l\left(x,\,y\right)=\frac{2{\mu }_{x}{\mu }_{y}+{C}_{1}}{{\mu }_{x}^{2}+{\mu }_{y}^{2}+{C}_{1}}\) was the luminance comparison function, comparing the average brightness of the two windows \(x\) and \(y\) . \({C}_{1}\) is a constant, and \(\alpha\) is the weight factor of the luminance comparison. \(c\left(x,\,y\right)=\frac{2{\sigma }_{x}{\sigma }_{y}+{C}_{2}}{{\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2}}\) was the contrast comparison function, comparing the standard deviations of the two windows. \({C}_{2}\) is a constant, and \(\beta\) is the weight factor of the contrast comparison. \(s\left(x,\,y\right)=\frac{{\sigma }_{{xy}}+{C}_{3}}{{\sigma }_{x}{\sigma }_{y}+{C}_{3}}\) was the structure comparison, calculated from the covariance between the two windows. \({C}_{3}\) is a constant, and \(\gamma\) is the weight factor of the structure comparison.
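
In practice, SSIM can be computed with scikit-image; this sketch uses win_size=7 because the library requires an odd window size, a slight departure from the 8-by-8 windows described above.

import numpy as np
from skimage.metrics import structural_similarity

img_a = np.random.rand(200, 200)   # e.g., an original FTU intensity map
img_b = np.random.rand(200, 200)   # e.g., its reconstruction
score = structural_similarity(img_a, img_b, win_size=7, data_range=1.0)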

Cell–cell distance and interaction analysis

To compute cell–cell distances within one FTU, we first selected the cells assigned to that FTU. An undirected cell graph was then constructed, in which each cell was a node and edges connected pairs of cells as defined by Delaunay triangulation, using the deldir function from the deldir package (v.1.0-6). Each edge represented the observed distance between the two connected cells, computed as the Euclidean distance 72 . The average distance between cell types was then computed by averaging the observed cell–cell distances to generate the network plot. To determine cell–cell interactions, the spatial locations of the cells assigned to each FTU were permuted, and the cell–cell distances were re-calculated as the expected distance. If the cell–cell distance was lower than 15 μm 73 (~5 pixels in the 200 by 200-pixel image), the cells were considered to be in contact and interacting with each other. The Wilcoxon rank-sum test was used to compute a p-value comparing the expected and observed distances. If the observed distance was significantly smaller than the expected distance, this suggested that the cells interact with each other.
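
The following Python sketch mirrors this analysis (the original used the deldir R package): build a Delaunay graph over the cells of one FTU, collect observed edge lengths, permute cell locations to obtain expected distances, and compare the two with a rank-sum test. The coordinates are placeholders, and the simplified location permutation is an assumption for illustration.

import numpy as np
from scipy.spatial import Delaunay
from scipy.stats import ranksums

coords = np.random.rand(300, 2) * 200          # placeholder cell coordinates (pixels)

tri = Delaunay(coords)
edges = set()
for simplex in tri.simplices:                  # collect unique triangulation edges
    for i in range(3):
        a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
        edges.add((a, b))
observed = np.array([np.linalg.norm(coords[a] - coords[b]) for a, b in edges])

rng = np.random.default_rng(0)
perm = rng.permutation(len(coords))            # permute spatial locations
expected = np.array([np.linalg.norm(coords[perm[a]] - coords[perm[b]])
                     for a, b in edges])

stat, pval = ranksums(observed, expected)      # Wilcoxon rank-sum test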

SpaGFT implementation in SpaGCN

Let \({{{{\bf{X}}}}}_{{{{\bf{spa}}}}}\) be the SRT gene expression matrix with dimension \({n}_{{{\rm {spot}}}}\times {n}_{{{\rm {gene}}}}\) , in which \({n}_{{{\rm {spot}}}}\) and \({n}_{{{\rm {gene}}}}\) represent the numbers of spots and genes, respectively. Upon normalization, the spot cosine similarity matrix \({{{{\bf{X}}}}}_{{{{\bf{s}}}}}\) is computed by the formula \({{{{\bf{X}}}}}_{{{{\bf{s}}}}}={{{{\bf{X}}}}}_{{{{\bf{spa}}}}}{{{{\bf{X}}}}}_{{{{\bf{spa}}}}}^{{{{\rm{T}}}}}\) , yielding a matrix with dimension \({n}_{{{\rm {spot}}}}\times {n}_{{{\rm {spot}}}}\) . Denote \({{{\bf{U}}}}=({{{{\bf{\mu }}}}}_{{{{\bf{1}}}}},\,{{{{\bf{\mu }}}}}_{{{{\bf{2}}}}},\,\ldots,\,{{{{\bf{\mu }}}}}_{{{{{\bf{n}}}}}_{{{{\bf{FC}}}}}})\) , where each \({{{{\bf{\mu }}}}}_{{{{\bf{l}}}}}\) is the l th eigenvector of the Laplacian matrix of the spatial graph and \({n}_{{{\rm {FC}}}}\) is the number of Fourier coefficients. Hence, the graph Fourier transform is implemented to transform \({{{{\bf{X}}}}}_{{{{\bf{s}}}}}\) into the frequency domain by:

\({\hat{{{{\bf{X}}}}}}_{{{{\bf{s}}}}}={{{{\bf{U}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{s}}}}}\)

Subsequently, the newly augmented spot-by-feature matrix \({{{{\bf{X}}}}}_{{{{\bf{new}}}}}\) is obtained by concatenating the SRT gene expression matrix \({{{{\bf{X}}}}}_{{{{\bf{spa}}}}}\) and the transposed transformed signal matrix \({\hat{{{{\bf{X}}}}}}_{{{{\bf{s}}}}}^{{{{\rm{T}}}}}\) :

\({{{{\bf{X}}}}}_{{{{\bf{new}}}}}=\left({{{{\bf{X}}}}}_{{{{\bf{spa}}}}},\,{\hat{{{{\bf{X}}}}}}_{{{{\bf{s}}}}}^{{{{\rm{T}}}}}\right)\) , with dimension \({n}_{{{\rm {spot}}}}\times ({n}_{{{\rm {gene}}}}+{n}_{{{\rm {FC}}}})\)

Finally, the matrix \({{{{\bf{X}}}}}_{{{{\bf{new}}}}}\) is input into SpaGCN as a replacement for the original gene expression matrix to predict the spatial domain cluster labels across all spots.
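
A compact sketch of this augmentation follows; the k-nearest-neighbor graph construction and the eigensolver settings are illustrative assumptions rather than SpaGFT's exact internals.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

n_spot, n_gene, n_fc = 500, 2000, 100
X_spa = np.random.rand(n_spot, n_gene)             # placeholder expression matrix
coords = np.random.rand(n_spot, 2)                 # placeholder spot coordinates

A = kneighbors_graph(coords, n_neighbors=6, mode='connectivity')
A = ((A + A.T) > 0).astype(float)                  # symmetrized spot graph
L = laplacian(sp.csr_matrix(A))
vals, U = eigsh(L, k=n_fc, which='SM')             # low-frequency eigenvectors

X_norm = X_spa / np.linalg.norm(X_spa, axis=1, keepdims=True)
X_s = X_norm @ X_norm.T                            # spot cosine similarity
X_s_hat = U.T @ X_s                                # graph Fourier transform
X_new = np.hstack([X_spa, X_s_hat.T])              # augmented spot-by-feature matrix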

To evaluate the performance of this modification, 12 human dorsolateral prefrontal cortex 10x Visium datasets were used for benchmarking, based on the annotations from the initial SpaGCN study 4 . The adjusted Rand index (ARI) was selected as the evaluation metric to measure the consistency between the predicted spot clusters and the manually annotated spatial domains. The parameter num_fcs, which controls the number of FCs, was determined by a grid search on datasets 151508 and 151670, spanning values from 600 to 1400 in steps of 100. Upon analysis, the optimal parameter value was established at 1000 (Supplementary Data  15 ), while the other parameters were kept at the SpaGCN defaults. The performance was then compared on the 10 remaining datasets as an independent test.
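
The grid search itself reduces to a small loop; run_spagcn_with_fcs below is a hypothetical wrapper around the modified SpaGCN pipeline, not an actual SpaGCN API.

from sklearn.metrics import adjusted_rand_score

def grid_search_num_fcs(run_spagcn_with_fcs, manual_labels,
                        candidates=range(600, 1500, 100)):
    """Return the num_fcs value (600, 700, ..., 1400) whose predicted spot
    clusters best match the manual annotation, measured by ARI."""
    scores = {k: adjusted_rand_score(manual_labels, run_spagcn_with_fcs(k))
              for k in candidates}
    return max(scores, key=scores.get), scores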

SpaGFT implementation in TACCO

SpaGFT was implemented to improve the performance of TACCO, which leverages optimal transport (OT) to transfer annotation labels from scRNA-seq to spatial transcriptomics data. The core objective function of TACCO is defined over a cost matrix \({{{\bf{C}}}}=({c}_{{tb}})\) and a proportion matrix \({{{\bf{\Gamma }}}}=({\gamma }_{{tb}})\) :

\({\min }_{{{{\bf{\Gamma }}}}}\,{\sum}_{t,b}\,{c}_{{tb}}\,{\gamma }_{{tb}}\) , subject to the marginal constraints on \({{{\bf{\Gamma }}}}\)

Specifically, \({c}_{{tb}}\) quantifies the cost of transporting an object \(b\) to an annotation \(t\) . In TACCO, principal component analysis (PCA) is used to reduce the dimensions of the scRNA-seq and spatial transcriptomics gene expression matrices, keeping the first 100 PCs of each. Subsequently, \({{{\bf{C}}}}\) is computed by calculating the Bhattacharyya coefficients between the cell type-averaged scRNA-seq and spatial transcriptomics PC matrices. Finally, the OT optimization is solved using the Sinkhorn–Knopp matrix scaling algorithm to yield a ‘good’ proportion matrix \({{{\bf{\Gamma }}}}\) .

For finding \({{{\bf{\Gamma }}}}\) , the cost matrix \({{{\bf{C}}}}\) plays the most important role in the OT optimization. Based on the originally calculated \({{{\bf{C}}}}\) , an updated cost matrix \({{{{\bf{C}}}}}^{{{{\bf{update}}}}}\) that considers spatial topology information is fused in. To incorporate this topology information, the coordinates of the spatial spots are used to construct a spatial graph, which is input into SpaGFT together with the gene expression and the initial TACCO-calculated mapping \({{{\bf{\Gamma }}}}\) (representing cell-type proportions) to calculate FCs for genes and cell types (CTs). The gene FC matrices are then weighted and averaged by spot expression values to obtain spot-level FCs. The cosine distance between the FCs of spatial spots and the FCs of cell types yields the updated CT-by-spot cost matrix \({{{{\bf{C}}}}}^{{{{\bf{update}}}}}\) . The united cost matrix \({{{{\bf{C}}}}}^{{\prime} }\) fuses \({{{\bf{C}}}}\) and \({{{{\bf{C}}}}}^{{{{\bf{update}}}}}\) with a balancing parameter \(\beta\) as

\({{{{\bf{C}}}}}^{{\prime} }=(1-\beta )\,{{{\bf{C}}}}+\beta \,{{{{\bf{C}}}}}^{{{{\bf{update}}}}}\)

This updated \({{{{\bf{C}}}}}^{{\prime} }\) is then fed back into TACCO’s OT algorithm to predict revised cell-type proportions for the spatial data. In addition, we used a simulated validation dataset with \({{{\rm{bead\; size}}}}=5\) to conduct a grid search over the input parameters \(S\) (the sensitivity of the Kneedle algorithm in SpaGFT) and \(\beta\) to determine these hyperparameters. While maintaining computational efficiency, we ascertained that the updated TACCO with \(\beta=0.8\) and \(S=24\) achieved the best performance. Our experiments reveal that the updated TACCO, enriched with SpaGFT features, outperforms the baseline TACCO model on the simulated independent test datasets with \({{{\rm{bead\; size}}}}\in [10,\,20,\,30,\,40,\,50]\) .
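
A sketch of the cost fusion step is given below; the convex-combination form mirrors the equation above, and the input matrices are assumed to be precomputed.

import numpy as np
from scipy.spatial.distance import cdist

def fuse_cost(C, celltype_fcs, spot_fcs, beta=0.8):
    """C: (n_types, n_spots) original TACCO cost matrix;
    celltype_fcs: (n_types, n_fc) and spot_fcs: (n_spots, n_fc) Fourier
    coefficients from SpaGFT. Returns the united cost matrix C'."""
    C_update = cdist(celltype_fcs, spot_fcs, metric='cosine')  # cosine distance
    return (1.0 - beta) * C + beta * C_update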

SpaGFT implementation in Tangram

Denote \({{{{\bf{X}}}}}_{{{{\bf{sc}}}}}\) as the gene expression matrix of scRNA-seq with dimension \({n}_{{{\rm {cell}}}}\times {n}_{{{\rm {gene}}}}\) , in which \({n}_{{{\rm {cell}}}}\) and \({n}_{{{\rm {gene}}}}\) represent the numbers of cells and genes, respectively. \({{{{\bf{X}}}}}_{{{{\bf{spa}}}}}\) is the SRT gene expression matrix with dimension \({n}_{{{\rm {spot}}}}\times \,{n}_{{{\rm {gene}}}}\) , and \({n}_{{{\rm {spot}}}}\) represents the number of spots. Tangram aims to find a mapping matrix \({{{\bf{M}}}}={\left({m}_{{ij}}\right)}_{{n}_{{{\rm {cell}}}}\times {n}_{{{\rm {spot}}}}}\) , where \(0\le {m}_{{ij}}\le 1\) , \({\sum }_{j=1}^{{n}_{{{\rm {spot}}}}}{m}_{{ij}}=1\) , and \({m}_{{ij}}\) reflects the probability of cell \(i\) mapping to spot \(j\) . Hence, \({{{{\bf{M}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{sc}}}}}\) can be treated as the reconstructed SRT gene expression matrix using scRNA-seq. Let \({{{{\bf{X}}}}}_{{{{\bf{re}}}}}={{{{\bf{M}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{sc}}}}}\) . The regularization part of the original objective function of Tangram is as follows:

\(\Phi \left({{{\bf{M}}}}\right)={w}_{1}{\sum}_{k}{{\cos }}_{{{\rm{sim}}}}\left({{{{\bf{X}}}}}_{{{{\bf{re}}}}}^{:,k},\,{{{{\bf{X}}}}}_{{{{\bf{spa}}}}}^{:,k}\right)+{w}_{2}{\sum}_{j}{{\cos }}_{{{\rm{sim}}}}\left({{{{\bf{X}}}}}_{{{{\bf{re}}}}}^{j,:},\,{{{{\bf{X}}}}}_{{{{\bf{spa}}}}}^{j,:}\right)\)

where the first term describes the cosine similarity of gene \(k\) across all spots in the reconstructed and real SRT gene expression matrices, weighted by \({w}_{1}\) ; and the second term describes the cosine similarity of spot \(j\) across all genes in the reconstructed and real SRT gene expression matrices, weighted by \({w}_{2}\) . By maximizing the objective function, the optimal mapping matrix \({{{{\bf{M}}}}}^{{{{\boldsymbol{*}}}}}\) can be obtained.

Denote \({{{\bf{U}}}}=\left({{{{\bf{\mu }}}}}_{{{{\bf{1}}}}},\,{{{{\bf{\mu }}}}}_{{{{\bf{2}}}}},\,\ldots,\,{{{{\bf{\mu }}}}}_{{{{{\bf{n}}}}}_{{{{\bf{FC}}}}}}\right)\) , where each \({{{{\bf{\mu }}}}}_{{{{\bf{l}}}}}\) is the l th eigenvector of the Laplacian matrix of the spatial graph and \({n}_{{{\rm {FC}}}}\) is the number of Fourier coefficients. Hence, we can implement the graph Fourier transform for genes by

\({\hat{{{{\bf{X}}}}}}_{{{{\bf{spa}}}}}={{{{\bf{U}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{spa}}}}},\quad {\hat{{{{\bf{X}}}}}}_{{{{\bf{re}}}}}={{{{\bf{U}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{re}}}}}\)

Therefore, both \({\hat{{{{\bf{X}}}}}}_{{{{\bf{spa}}}}}\) and \({\hat{{{{\bf{X}}}}}}_{{{{\bf{re}}}}}\) are representations of genes in the frequency domain with dimension \({n}_{{{\rm {FC}}}}\times {n}_{{{\rm {gene}}}}\) . In addition, \({{{{\bf{X}}}}}_{{{{\bf{spa}}}}}^{{\prime} }={{{{\bf{X}}}}}_{{{{\bf{spa}}}}}{{{{\bf{X}}}}}_{{{{\bf{spa}}}}}^{{{{\rm{T}}}}}\) can be considered the spot similarity matrix calculated from the gene expression of the real SRT data, with dimension \({n}_{{{\rm {spot}}}}\times {n}_{{{\rm {spot}}}}\) . Similarly, \({{{{\bf{X}}}}}_{{{{\bf{re}}}}}^{{\prime} }=({{{{\bf{M}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{sc}}}}}){({{{{\bf{M}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{sc}}}}})}^{{{{\rm{T}}}}}\) represents the spot similarity matrix calculated from the gene expression of the reconstructed SRT data. In this way, we can implement the graph Fourier transform for spots by:

\({\widetilde{{{{\bf{X}}}}}}_{{{{\bf{spa}}}}}={{{{\bf{U}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{spa}}}}}^{{\prime} },\quad {\widetilde{{{{\bf{X}}}}}}_{{{{\bf{re}}}}}={{{{\bf{U}}}}}^{{{{\rm{T}}}}}{{{{\bf{X}}}}}_{{{{\bf{re}}}}}^{{\prime} }\)

Therefore, both \({\widetilde{{{{\bf{X}}}}}}_{{{{\bf{spa}}}}}\) and \({\widetilde{{{{\bf{X}}}}}}_{{{{\bf{re}}}}}\) are new representations of spots in the frequency domain with dimension \({n}_{{{\rm {FC}}}}\times {n}_{{{\rm {spot}}}}\) . We therefore improved the objective function of Tangram by adding similarity measurements of genes and spots in the frequency domain. The new objective function is

\({\Phi }_{{{\rm{new}}}}\left({{{\bf{M}}}}\right)={w}_{1}{\sum}_{k}{{\cos }}_{{{\rm{sim}}}}\left({{{{\bf{X}}}}}_{{{{\bf{re}}}}}^{:,k},\,{{{{\bf{X}}}}}_{{{{\bf{spa}}}}}^{:,k}\right)+{w}_{2}{\sum}_{j}{{\cos }}_{{{\rm{sim}}}}\left({{{{\bf{X}}}}}_{{{{\bf{re}}}}}^{j,:},\,{{{{\bf{X}}}}}_{{{{\bf{spa}}}}}^{j,:}\right)+{w}_{3}{\sum}_{k}{{\cos }}_{{{\rm{sim}}}}\left({\hat{{{{\bf{X}}}}}}_{{{{\bf{re}}}}}^{:,k},\,{\hat{{{{\bf{X}}}}}}_{{{{\bf{spa}}}}}^{:,k}\right)+{w}_{4}{\sum}_{j}{{\cos }}_{{{\rm{sim}}}}\left({\widetilde{{{{\bf{X}}}}}}_{{{{\bf{re}}}}}^{:,j},\,{\widetilde{{{{\bf{X}}}}}}_{{{{\bf{spa}}}}}^{:,j}\right)\)

where \({w}_{1}\) weights the similarities of genes in the vertex domain; \({w}_{2}\) weights the similarities of spots in the vertex domain; \({w}_{3}\) weights the similarities of genes in the frequency domain; and \({w}_{4}\) weights the similarities of spots in the frequency domain.
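
A sketch of the two added frequency-domain terms is shown below; U holds the Laplacian eigenvectors as columns, and the mean over columns is an illustrative choice for aggregating the per-gene and per-spot cosine similarities.

import numpy as np

def cosine_cols(A, B, eps=1e-12):
    """Mean column-wise cosine similarity between two equally shaped matrices."""
    num = (A * B).sum(axis=0)
    den = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0) + eps
    return (num / den).mean()

def frequency_terms(U, X_spa, X_re, w3=11.0, w4=1.0):
    X_spa_hat, X_re_hat = U.T @ X_spa, U.T @ X_re     # gene FCs (n_fc x n_gene)
    S_spa, S_re = X_spa @ X_spa.T, X_re @ X_re.T      # spot similarity matrices
    X_spa_til, X_re_til = U.T @ S_spa, U.T @ S_re     # spot FCs (n_fc x n_spot)
    return (w3 * cosine_cols(X_re_hat, X_spa_hat)
            + w4 * cosine_cols(X_re_til, X_spa_til))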

To evaluate the performance of this modification, we adopted the evaluation scheme from the study by Li et al. We simulated SRT datasets by ‘gridding’ a single-cell resolution dataset (STARmap) using various window sizes (400, 450, …, 1200). The simulated datasets with window sizes 400 and 1200 were used for a grid search to determine the hyperparameters. In this way, \({w}_{3}\) and \({w}_{4}\) were set to 11 and 1, respectively, and the other parameters (including \({w}_{1}\) and \({w}_{2}\) ) were kept at the Tangram defaults. Our experiments reveal that the updated Tangram, enriched with SpaGFT features, outperforms the baseline Tangram model.

SpaGFT implementation in CAMPA

Overall, the CAMPA framework, a conditional variational autoencoder for identifying conserved subcellular organelles in pixel-level iterative indirect immunofluorescence imaging data, was modified by adding an entropy term to its loss function to regularize the spreading or concentration of a graph signal (e.g., protein intensity). Specifically, compared with the baseline CAMPA loss function, which computed the mean squared error (MSE) loss for each pixel, the modified loss function additionally considered global protein spreading at the cell level.

Data preparation for model training, testing, and validation

Following the baseline CAMPA paper and guidelines 16 , a dataset of 292,548 pixels (0.05% of the full data) was down-sampled from processed cell nuclei of the I09 (normal), I10 (Triptolide treatment), I11 (normal), and I16 (TSA treatment) wells based on the 184A1 cell line. The training, testing, and validation splits were set to 70%, 10%, and 20%, respectively.

Entropy regularization

For cell \(i\in I\) , where \(I\) was the complete set of all cells in the down-sampled data, the corresponding original protein signatures in each cell were denoted as \({{{{\bf{X}}}}}^{{{{\rm{i}}}}}\) with dimension \({n}_{{{{pixel}}}}\times {n}_{{{{channel}}}}\) , where \({n}_{{{{pixel}}}}\) and \({n}_{{{{channel}}}}\) represented the number of pixels in one cell and the number of proteins, respectively. Similarly, \({\hat{{{{\bf{X}}}}}}^{{{{\rm{i}}}}}\) denoted the reconstructed protein signatures for cell \(i\) . To measure the spreading of the reconstructed protein signatures in the frequency domain, \({\hat{{{{\bf{X}}}}}}^{{{{\rm{i}}}}}\) and the coordinates of the pixels were input into SpaGFT to compute the FC matrix \({\hat{{{{\bf{F}}}}}}^{{{{\rm{i}}}}}\) with dimension \({n}_{{{{FC}}}}\times {n}_{{{{channel}}}}\) , in which \({n}_{{{{FC}}}}\) was the number of FCs. Denote \({{{\bf{U}}}}=({{{{\bf{\mu }}}}}_{{{{\bf{1}}}}},\,{{{{\bf{\mu }}}}}_{{{{\bf{2}}}}},\,\ldots,\,{{{{\bf{\mu }}}}}_{{{{{\bf{n}}}}}_{{{{\bf{FC}}}}}})\) , where each \({{{{\bf{\mu }}}}}_{{{{\bf{k}}}}}\) was the k th eigenvector of the Laplacian matrix of the spatial neighboring graph for cell \(i\) . Hence, the FCs of the reconstructed protein signatures for cell \(i\) were calculated by

\({\hat{{{{\bf{F}}}}}}^{{{{\rm{i}}}}}={{{{\bf{U}}}}}^{{{{\rm{T}}}}}{\hat{{{{\bf{X}}}}}}^{{{{\rm{i}}}}}\)

Subsequently, \({\hat{{{{\bf{F}}}}}}^{{{{\rm{i}}}}}=({\hat{{{{\bf{f}}}}}}_{{{{\bf{1}}}}}^{{{{\bf{i}}}}},\,{\hat{{{{\bf{f}}}}}}_{{{{\bf{2}}}}}^{{{{\bf{i}}}}},\,\ldots,\,{\hat{{{{\bf{f}}}}}}_{{{{{\bf{n}}}}}_{{{{\bf{FC}}}}}}^{{{{\bf{i}}}}})\) was used to calculate an entropy that regularizes toward a concentrated graph signal 19 , 74 :

\({{\rm{entropy}}}\left({\hat{{{{\bf{F}}}}}}^{{{{\rm{i}}}}}\right)=-{\sum }_{k=1}^{{n}_{{{\rm {FC}}}}}{p}_{k}\log {p}_{k},\quad {p}_{k}=\frac{{\parallel {\hat{{{{\bf{f}}}}}}_{{{{\bf{k}}}}}^{{{{\bf{i}}}}}\parallel }_{2}^{2}}{{\sum }_{l=1}^{{n}_{{{\rm {FC}}}}}{\parallel {\hat{{{{\bf{f}}}}}}_{{{{\bf{l}}}}}^{{{{\bf{i}}}}}\parallel }_{2}^{2}}\)

where \({\parallel \cdot \parallel }_{2}\) denotes the \({L}^{2}\) -norm.
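
A sketch of this entropy term, using the L2-normalized spectral energy as the probability vector (consistent with the definition above), is:

import numpy as np

def fc_entropy(F, eps=1e-12):
    """F: (n_fc, n_channel) Fourier coefficients of one cell's reconstructed
    protein signatures. Lower entropy means a more concentrated graph signal."""
    energy = (F ** 2).sum(axis=1)        # squared L2-norm of each coefficient row
    p = energy / (energy.sum() + eps)    # normalized spectral energy
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())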

In addition, the \(\eta\) parameter was used as a weighting term to balance the initial loss function and the entropy-decreasing loss function, with a default value of 0.3. The formula of the modified loss function \({{{{\rm{L}}}}}_{{{{\rm{modified}}}}}\) was as follows:

\({{{{\rm{L}}}}}_{{{{\rm{modified}}}}}=\left(1-\eta \right){{{{\rm{L}}}}}_{{{{\rm{initial}}}}}+\eta \,D{\sum}_{i\in I}{{\rm{entropy}}}\left({\hat{{{{\bf{F}}}}}}^{{{{\rm{i}}}}}\right)\)

where \(D\) is a constant, used the same as in the baseline model ( \(D\) = 0.5). The initial decoder loss function was part of the objective function in CAMPA, which used an analytical solution from \(\sigma\) -VAE 75 to learn the variance of the decoder. The MSE and the logarithm of the variance were minimized through \(\sigma\) , a weighting parameter between the MSE reconstruction term and the KL-divergence term in the CAMPA objective function. There is an analytic solution for the value of \(\sigma\) :

\({\sigma }^{*2}=\frac{1}{{n}_{{{{pixel}}}}\,{n}_{{{{channel}}}}}{\parallel {{{{\bf{X}}}}}^{{{{\bf{i}}}}}-{{{{\boldsymbol{\nu }}}}}^{{{{\bf{i}}}}}\parallel }_{2}^{2}\)

where \({\sigma }^{*2}\) is the estimated value of \({\sigma }^{2}\) and \({{{{\boldsymbol{\nu }}}}}^{{{{\bf{i}}}}}\) is the estimated latent mean for \({{{{\bf{X}}}}}^{{{{\bf{i}}}}}\) .

Regarding the implementation, the training and testing datasets were used to build both the modified and baseline models. Subsequently, to fairly compare the training efficiency of the two models, the same validation dataset and initial loss were used to evaluate the convergence of the validation loss.

To interpret the training-efficiency improvement of the modified CAMPA from a biological perspective, batch effect removal and prediction accuracy were evaluated. Regarding batch effect removal, 1% of pixels were subsampled from the prepared data. First, UMAP embeddings calculated from the CAMPA latent representations were generated to visualize the mixture of the three perturbation conditions. To quantitatively compare batch effect removal between the baseline and modified models, the kBET 57 score was computed using the CAMPA latent representations across perturbation conditions. Following the kBET suggestion, 0.5% of pixels (~1500 pixels) were iteratively selected 10,000 times for calculating the kBET score (a higher rejection rate suggested a better batch effect removal result), using 1–100 neighbors.

Subsequently, the CAMPA latent representations were clustered using the Leiden algorithm 16 at resolutions of 0.2, 0.4, 0.6, 0.8, 1.2, 1.6, and 2.0. To understand the identity of each cluster predicted by the modified CAMPA at a resolution of 0.2, the protein intensities in each pixel cluster were visualized in a heatmap. Each pixel’s channel values were averaged at the cluster level and scaled by channel (column-level) z-score. Clusters were annotated based on the highest expressed markers and the Human Protein Atlas.
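
Multi-resolution Leiden clustering of the latent space can be sketched with scanpy as follows; the latent array is a placeholder, and this scanpy/leidenalg usage is illustrative rather than the exact CAMPA code.

import numpy as np
import scanpy as sc
import anndata as ad

latent = np.random.rand(5000, 16).astype(np.float32)   # placeholder latent codes
adata = ad.AnnData(latent)
sc.pp.neighbors(adata, use_rep='X')                    # KNN graph on latent space
for res in [0.2, 0.4, 0.6, 0.8, 1.2, 1.6, 2.0]:
    sc.tl.leiden(adata, resolution=res, key_added=f'leiden_{res}')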

To evaluate the conservation and homogeneity of the predicted clusters across different resolutions, we implemented high-label entropy to quantify the tendency of one cluster to diverge into two clusters 76 . For example, at a resolution of 0.2, all pixels of cluster 6 predicted by the modified model were used to calculate entropy via a probability vector of length two. The first element of this vector was the percentage of pixels at the current resolution (i.e., 0.2) that fell into the largest cluster at the next resolution (e.g., 0.4). The second element was the percentage of the remaining pixels at the current resolution that fell into other clusters at the next resolution. The high-label entropy was repeatedly calculated on the same pixels of one cluster within and across the baseline and modified models over the gradient of resolutions (i.e., 0.2, 0.4, 0.6, 0.8, 1.2, 1.6, and 2.0). To visualize intact cells and summarize the relation between pixels and cells in Supplementary Data  19 , the seven clusters predicted by the modified model at resolution 0.2 were transferred to all pixels of the full-size data via the function project_cluster_data in the CAMPA package. The illustrated examples (id: 367420 and 224081) were extracted to calculate and visualize the FCs of COIL and SETD1A.
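
The high-label entropy for one cluster reduces to a two-element probability vector; a minimal sketch:

import numpy as np
from collections import Counter

def high_label_entropy(next_labels):
    """next_labels: next-resolution cluster labels of the pixels forming one
    cluster at the current resolution. Returns 0 when the cluster is fully
    conserved and 1 when it splits evenly in two."""
    counts = Counter(next_labels)
    p1 = max(counts.values()) / len(next_labels)   # fraction in the largest cluster
    p = np.array([p1, 1.0 - p1])
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())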

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

All datasets from 10x Visium can be accessed from https://www.10xgenomics.com/products/spatial-gene-expression . Slide-DNA-seq data are available under accession code SCP1278 in the Single Cell Portal. Slide-TCR-seq data are available under accession code SCP1348 in the Single Cell Portal. The GSM5519054_Visium_MouseBrain data can be accessed via the GEO database under accession code GSM5519054 . Regarding the human brain dataset, twelve samples can be accessed via the endpoint “jhpce#HumanPilot10x” on the Globus data transfer platform at http://research.libd.org/globus/ . The other six human brain datasets (2-3-AD_Visium_HumanBrain, 2-8-AD_Visium_HumanBrain, T4857-AD_Visium_HumanBrain, 2-5_Visium_HumanBrain, 18-64_Visium_HumanBrain, and 1-1_Visium_HumanBrain) can be accessed via the GEO database under accession code GSE220442 and at https://bmbls.bmi.osumc.edu/scread/stofad-2 . The two Slide-seqV2 datasets are available under accession code SCP815 in the Single Cell Portal. MERFISH data (Slice1_Replicate1-Vizgen_MouseBrainReceptor) can be accessed from https://console.cloud.google.com/marketplace/product/gcp-public-data-vizgen/vizgen-mouse-brain-map?pli=1&project=vizgen-gcp-share . Xenium data (Rep1-Cancer_Xenium_HumanBreast) were downloaded from https://www.10xgenomics.com/products/xenium-in-situ/human-breast-dataset-explorer . Spatial-CITE-seq data can be accessed via the GEO database under accession number GSE213264 . Spatial epigenome–transcriptome co-profiling data (spatial_ATAC_RNA_MouseE13) can be accessed via the GEO database under accession code GSE205055 . The 184A1 datasets used to train the modified CAMPA reported in this manuscript can be found at https://doi.org/10.5281/zenodo.7299516 . SPOT data can be accessed via the GEO database under accession number GSE198353 . The CODEX tonsil data generated in this study have been deposited in the Zenodo database under accession code 10433896 . Source data are provided with this paper.

Code availability

SpaGFT is a Python package for modeling and analyzing spatial transcriptomics data. The SpaGFT source code and the analysis scripts for generating the results and figures in this paper are available at https://github.com/OSU-BMBL/SpaGFT . The source code is also archived on Zenodo 77 at https://doi.org/10.5281/zenodo.12595086 .

References

Liu, S. et al. Spatial maps of T cell receptors and transcriptomes reveal distinct immune niches and interactions in the adaptive immune response. Immunity 55, 1940–1952.e1945 (2022).

Vandereyken, K., Sifrim, A., Thienpont, B. & Voet, T. Methods and applications for single-cell and spatial multi-omics. Nat. Rev. Genet. 24 , 494–515 (2023).

Jain, S. et al. Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP). Nat. Cell Biol. 25 , 1089–1100 (2023).

Hu, J. et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods 18 , 1342–1351 (2021).

Schürch, C. M. et al. Coordinated cellular neighborhoods orchestrate antitumoral immunity at the colorectal cancer invasive front. Cell 182 , 1341–1359.e1319 (2020).

Mages, S. et al. TACCO unifies annotation transfer and decomposition of cell identities for single-cell and spatial omics. Nat. Biotechnol. 41 , 1465–1473 (2023).

Biancalani, T. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat. Methods 18 , 1352–1362 (2021).

Wang, Y. et al. Sprod for de-noising spatially resolved transcriptomics data based on position and image information. Nat. Methods 19 , 950–958 (2022).

Sun, S., Zhu, J. & Zhou, X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat. Methods 17 , 193–200 (2020).

Liu, Y. et al. High-spatial-resolution multi-omics sequencing via deterministic barcoding in tissue. Cell 183 , 1665–1681.e1618 (2020).

Svensson, V., Teichmann, S. A. & Stegle, O. SpatialDE: identification of spatially variable genes. Nat. Methods 15 , 343–346 (2018).

Zhu, J., Sun, S. & Zhou, X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biol. 22 , 184 (2021).

Velten, B. & Stegle, O. Principles and challenges of modeling temporal and spatial omics data. Nat. Methods 20 , 1462–1474 (2023).

Lake, B. B. et al. An atlas of healthy and injured cell states and niches in the human kidney. Nature 619 , 585–594 (2023).

Chen, F., Wang, Y.-C., Wang, B. & Kuo, C.-C. J. Graph representation learning: a survey. APSIPA Trans. Signal Inf. Process. 9 , e15 (2020).

Spitzer, H., Berry, S., Donoghoe, M., Pelkmans, L. & Theis, F. J. Learning consistent subcellular landmarks to quantify changes in multiplexed protein maps. Nat. Methods 20 , 1058–1069 (2023).

Gut, G., Herrmann, M. D. & Pelkmans, L. Multiplexed protein maps link subcellular organization to cellular states. Science 361 , eaar7042 (2018).

Ricaud, B., Borgnat, P., Tremblay, N., Gonçalves, P. & Vandergheynst, P. Fourier could be a data scientist: from graph Fourier transform to signal processing on graphs. C. R. Phys. 20 , 474–488 (2019).

Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A. & Vandergheynst, P. The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30 , 83–98 (2013).

Palla, G., Fischer, D. S., Regev, A. & Theis, F. J. Spatial components of molecular tissue biology. Nat. Biotechnol. 40 , 308–318 (2022).

Buzzi, R. M. et al. Spatial transcriptome analysis defines heme as a hemopexin-targetable inflammatoxin in the brain. Free Radic. Biol. Med. 179 , 277–287 (2022).

Stickels, R. R. et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol. 39 , 313–319 (2021).

Chen, S. et al. Spatially resolved transcriptomics reveals genes associated with the vulnerability of middle temporal gyrus in Alzheimer’s disease. Acta Neuropathol. Commun. 10 , 188 (2022).

Maynard, K. R. et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat. Neurosci. 24 , 425–436 (2021).

Ortiz, C. et al. Molecular atlas of the adult mouse brain. Sci. Adv. 6 , eabb3446 (2020).

Hodge, R. D. et al. Conserved cell types with divergent features in human versus mouse cortex. Nature 573 , 61–68 (2019).

Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563 , 72–78 (2018).

Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19 , 335–346 (2016).

Allen Institute for Brain Science. Allen Mouse Brain Atlas [dataset] (Allen Institute for Brain Science, 2011).

Miller, B. F., Bambah-Mukku, D., Dulac, C., Zhuang, X. & Fan, J. Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomics data with nonuniform cellular densities. Genome Res. gr.271288.120 (2021).

Zhang, K., Feng, W. & Wang, P. Identification of spatially variable genes with graph cuts. Nat. Commun. 13 , 5488 (2022).

Ortega, A., Frossard, P., Kovačević, J., Moura, J. M. F. & Vandergheynst, P. Graph signal processing: overview, challenges, and applications. Proc. IEEE 106 , 808–828 (2018).

Hou, W., Ji, Z., Ji, H. & Hicks, S. C. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 21 , 218 (2020).

Elyanow, R., Dumitrascu, B., Engelhardt, B. E. & Raphael, B. J. netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Res. 30 , 195–204 (2020).

Kleshchevnikov, V. et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nat. Biotechnol. 40 , 661–671 (2022).

Li, B. et al. Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nat. Methods 19 , 662–670 (2022).

Bhate, S. S., Barlow, G. L., Schürch, C. M. & Nolan, G. P. Tissue schematics map the specialization of immune tissue motifs and their appropriation by tumors. Cell Syst. 13 , 109–130.e106 (2022).

Kerfoot, S. M. et al. Germinal center B cell and T follicular helper cell development initiates in the interfollicular zone. Immunity 34 , 947–960 (2011).

Natkunam, Y. The biology of the germinal center. Hematology 2007 , 210–215 (2007).

Greenwald, N. F. et al. Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nat. Biotechnol. 40 , 555–565 (2022).

Van Gassen, S. et al. FlowSOM: using self-organizing maps for visualization and interpretation of cytometry data. Cytometry Part A 87 , 636–645 (2015).

Diggins, K. E., Greenplate, A. R., Leelatian, N., Wogsland, C. E. & Irish, J. M. Characterizing cell subsets using marker enrichment modeling. Nat. Methods 14 , 275–278 (2017).

Liu, C.C. et al. Robust phenotyping of highly multiplexed tissue imaging data using pixel-level clustering. Nat Commun. 14 , 4618 (2023).

Pavlasova, G. & Mraz, M. The regulation and function of CD20: an “enigma” of B-cell biology and targeted therapy. Haematologica 105 , 1494–1506 (2020).

Meda, B. A. et al. BCL-2 Is consistently expressed in hyperplastic marginal zones of the spleen, abdominal lymph nodes, and ileal lymphoid tissue. Am. J. Surg. Pathol. 27 , 888–894 (2003).

Hockenbery, D. M., Zutter, M., Hickey, W., Nahm, M. & Korsmeyer, S. J. BCL2 protein is topographically restricted in tissues characterized by apoptotic cell death. Proc. Natl Acad. Sci. USA 88 , 6961–6965 (1991).

Heit, A. et al. Vaccination establishes clonal relatives of germinal center T cells in the blood of humans. J. Exp. Med. 214 , 2139–2152 (2017).

Chtanova, T. et al. T follicular helper cells express a distinctive transcriptional profile, reflecting their role as non-Th1/Th2 effector cells that provide help for B cells1. J. Immunol. 173 , 68–78 (2004).

Dorfman, D. M., Brown, J. A., Shahsafaei, A. & Freeman, G. J. Programmed death-1 (PD-1) is a marker of germinal center-associated T cells and angioimmunoblastic T-cell lymphoma. Am. J. Surg. Pathol. 30 , 802–810 (2006).

Marsee, D. K., Pinkus, G. S. & Hornick, J. L. Podoplanin (D2-40) is a highly effective marker of follicular dendritic cells. Appl. Immunohistochem. Mol. Morphol. 17 , 102–107 (2009).

Gray, E. E. & Cyster, J. G. Lymph node macrophages. J. Innate Immun. 4 , 424–436 (2012).

Johansson-Lindbom, B., Ingvarsson, S. & Borrebaeck, C. A. Germinal centers regulate human Th2 development. J. Immunol. 171 , 1657–1666 (2003).

Nakagawa, R. & Calado, D. P. Positive selection in the light zone of germinal centers. Front. Immunol. 12 , 661678 (2021).

Allen, C. D. C. et al. Germinal center dark and light zone organization is mediated by CXCR4 and CXCR5. Nat. Immunol. 5 , 943–952 (2004).

Allen, C. D., Okada, T. & Cyster, J. G. Germinal-center organization and cellular dynamics. Immunity 27 , 190–202 (2007).

Wang, X. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361 , eaat5691 (2018).

Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. methods 16 , 43–49 (2019).

Morris, G. E. The Cajal body. Biochim. Biophys. Acta (BBA) - Mol. Cell Res. 1783 , 2108–2115 (2008).

Tajima, K. et al. SETD1A protects from senescence through regulation of the mitotic gene expression program. Nat. Commun. 10 , 2854 (2019).

Wang, J. et al. Dimension-agnostic and granularity-based spatially variable gene identification using BSP. Nat. Commun. 14 , 7367 (2023).

Li, Z. et al. Benchmarking computational methods to identify spatially variable genes and peaks. Preprint at bioRxiv https://doi.org/10.1101/2023.12.02.569717 (2023).

Tang, X. et al. Explainable multi-task learning for multi-modality biological data analysis. Nat. Commun. 14 , 2546 (2023).

Bao, F. et al. Integrative spatial analysis of cell morphologies and transcriptional states with MUSE. Nat. Biotechnol. 40 , 1200–1209 (2022).

Chang, Y. et al. Define and visualize pathological architectures of human tissues from spatially resolved transcriptomics using deep learning. Comput Struct Biotechnol J. 20 , 4600–4617 (2022).

Huang, W. et al. Graph frequency analysis of brain signals. IEEE J. Sel. Top. Signal Process. 10 , 1189–1203 (2016).

Lu, K.-S. & Ortega, A. Fast graph Fourier transforms based on graph symmetry and bipartition. IEEE Trans. Signal Process. 67 , 4855–4869 (2019).

Magoarou, L. L., Gribonval, R. & Tremblay, N. Approximate fast graph Fourier transforms via multilayer sparse approximations. IEEE Trans. Signal Inf. Process. Netw. 4 , 407–420 (2018).

Satopaa, V., Albrecht, J., Irwin, D. & Raghavan, B. Finding a ‘kneedle’ in a haystack: detecting knee points in system behavior. In Proc. 2011 31st International Conference on Distributed Computing Systems Workshops (ed. Du) 166–171 (IEEE, 2011).

Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17 , 2815–2839 (2022).

Phillips, D. et al. Highly multiplexed phenotyping of immunoregulatory proteins in the tumor microenvironment by CODEX tissue imaging. Front. Immunol. 12 , 687673 (2021).

Zhou, W., Bovik, A. C., Sheikh, H. R. & Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 , 600–612 (2004).

Jiang, S. et al. Combined protein and nucleic acid imaging reveals virus-dependent B cell and macrophage immunosuppression of tissue microenvironments. Immunity 55 , 1118–1134.e1118 (2022).

Fang, R. et al. Conservation and divergence of cortical cell organization in human and mouse revealed by MERFISH. Science 377 , 56–62 (2022).

Ricaud, B. & Torrésani, B. A survey of uncertainty principles and some signal processing applications. Adv. Comput. Math. 40 , 629–650 (2014).

Rybkin, O., Daniilidis, K. & Levine, S. Simple and effective VAE training with calibrated decoders. In Proc. International Conference on Machine Learning (ed. Meila) 9179–9189 (PMLR, 2021).

Sikkema, L. et al. An integrated cell atlas of the lung in health and disease. Nat. Med. 29 , 1563–1577 (2023).

Liu, J. et al. SpaGFT GitHub: OSU-BMBL/SpaGFT: 0.1.1. Zenodo https://doi.org/10.5281/zenodo.12595086 (2024).

Acknowledgements

This work was part of the PhD thesis of Y.C., who was co-mentored by Z.L. and Q.M. and was supported by research grants P01CA278732 (A.M. and Z.L.), P01AI177687 (A.M., Y.J., and D.H.B.), R21HG012482 (Ma), U54AG075931 (A.M.), R01DK138504 (A.M.), NIH DP2AI171139 (Y.J.), and R01AI149672 (Y.J.) from the National Institutes of Health. This work was supported by Gilead’s Research Scholars Program in Hematologic Malignancies (Y.J.), Sanofi iAward (Y.J.), the Bill & Melinda Gates Foundation INV-002704 (Y.J.), the Dye Family Foundation (Y.J.), and the Bridge Project, a partnership between the Koch Institute for Integrative Cancer Research at MIT and the Dana-Farber/Harvard Cancer Center (Y.J.). This work was also supported by the Pelotonia Institute of Immuno-Oncology (PIIO). Figure 1a and Supplementary Fig. 1, created with BioRender.com, were released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.

Author information

These authors contributed equally: Yuzhou Chang, Jixin Liu.

Authors and Affiliations

Department of Biomedical Informatics, College of Medicine, Ohio State University, Columbus, OH, 43210, USA

Yuzhou Chang, Yi Jiang, Anjun Ma, Qi Guo, Megan McNutt, Jordan E. Krull & Qin Ma

Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, USA

Yuzhou Chang, Anjun Ma, Jordan E. Krull, Zihai Li & Qin Ma

School of Mathematics, Shandong University, 250100, Jinan, China

Jixin Liu & Bingqiang Liu

Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center, Boston, MA, 02115, USA

Yao Yu Yeo, Dan H. Barouch & Sizun Jiang

Program in Virology, Division of Medical Sciences, Harvard Medical School, Boston, MA, 02115, USA

Yao Yu Yeo & Sizun Jiang

Department of Pathology, Dana Farber Cancer Institute, Boston, MA, 02115, USA

Scott J. Rodig & Sizun Jiang

Department of Pathology, Brigham & Women’s Hospital, Boston, MA, 02115, USA

Scott J. Rodig

Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, 02139, USA

Dan H. Barouch

Department of Pathology, Stanford University School of Medicine, Stanford, CA, 94305, USA

Garry P. Nolan

Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, 65211, USA

Contributions

Conceptualization: Q.M.; methodology: J.L., Y.C., B.L., Q.M.; software coding: J.L., Y.J., and Y.C.; data collection and investigation: Y.C., Q.G., and M.M.; experiment and interpretation: Z.L., D.X., Y.Y.Y., S.J., S.R., G.N., and D.B.; data analysis and visualization: Y.C., Y.J., and J.L.; case study design and interpretation: Y.C., J.L., S.J., J.E.K., and A.M.; software testing and tutorial: J.L., Y.J., and Y.C.; graphic demonstration: Y.C., Y.J., and A.M.; manuscript writing, review, and editing: all the authors.

Corresponding authors

Correspondence to Bingqiang Liu or Qin Ma .

Ethics declarations

Competing interests

S.J. is a co-founder of Elucidate Bio Inc., has received speaking honorariums from Cell Signaling Technology, and has received research support from Roche unrelated to this work. The other authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Jie Ding and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information
  • Peer Review File
  • Description of Additional Supplementary Files
  • Supplementary Data 1–19
  • Reporting Summary
  • Source Data

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

Reprints and permissions

About this article

Cite this article

Chang, Y., Liu, J., Jiang, Y. et al. Graph Fourier transform for spatial omics representation and analyses of complex organs. Nat Commun 15 , 7467 (2024). https://doi.org/10.1038/s41467-024-51590-5

Received: 12 February 2024

Accepted: 08 August 2024

Published: 29 August 2024

DOI: https://doi.org/10.1038/s41467-024-51590-5
