Solving Big Data Challenges with Data Science at Uber

The data involved in serving millions of rides and food deliveries on Uber’s platform doesn’t just facilitate transactions; it also helps teams at Uber continually analyze and improve our services. When we launch new services, we can quickly measure success, and when we see anomalies in the data, we can quickly look for root causes.

Charged with serving this data for everyday operational analysis, our Data Warehouse team maintains a massively parallel database running Vertica, a popular interactive data analytics platform. Every day, our system handles millions of queries, with 95 percent of them taking less than 15 seconds to return a response.

Meeting this challenge was not easy, especially considering the exponential growth of Uber’s ride and delivery volume over the years. Growing storage requirements for this system made our initial strategy of adding fully duplicated Vertica clusters to increase query volume cost-prohibitive.

A solution arose through the combined forces of our Data Warehouse and Data Science teams. Looking at the problem through cost analysis, our data scientists helped our Data Warehouse engineers come up with a means of partially replicating Vertica clusters to better scale our data volume. Optimizing our compute resources in this manner meant that we could scale to our current pace of serving over a billion trips on our platform, leading to improved user experiences worldwide.

Scaling for query volume

During Uber’s initial period of rapid growth, we adopted a fairly common approach of installing multiple isolated Vertica clusters to serve the millions of analytics queries made every day. These clusters were complete mirror images of each other, which provided two key advantages. First, they offered tolerance to cluster failures: if one cluster fails, the business can run as usual because a backup cluster holds a copy of all required data. Second, we could distribute incoming queries across clusters, as depicted in Figure 1, below, increasing the volume of queries that could be processed simultaneously:

Figure 1: Diagram of fully replicated databases

With data stored in multiple isolated clusters, we investigated strategies to balance the query load. Some common strategies we found included:

  • Random assignment: Randomly assign an incoming query to a cluster, with the assumption that randomization will automatically result in a balanced load.
  • User segmentation: Assign users to different clusters so that all queries from a given user are directed only to the assigned cluster.
  • CPU balancing: Keep track of CPU usage across different clusters and assign queries to clusters with the lowest CPU usage.
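
The three strategies listed above can be sketched in a few lines. This is a minimal illustration, not Uber's routing layer: the cluster names, function names, and the idea of receiving CPU usage as a dictionary are all assumptions made for the example.

```python
import hashlib
import random

# Hypothetical cluster names, used only for illustration.
CLUSTERS = ["cluster-a", "cluster-b", "cluster-c"]


def route_random(rng: random.Random) -> str:
    """Random assignment: every cluster is equally likely."""
    return rng.choice(CLUSTERS)


def route_by_user(user_id: str) -> str:
    """User segmentation: a stable hash pins each user to one cluster."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return CLUSTERS[int.from_bytes(digest[:8], "big") % len(CLUSTERS)]


def route_by_cpu(cpu_usage: dict[str, float]) -> str:
    """CPU balancing: pick the cluster reporting the lowest CPU usage."""
    return min(cpu_usage, key=cpu_usage.get)
```

Hash-based segmentation keeps a user's queries on one cluster without storing a lookup table, at the price of ignoring how heavy each user's workload is; CPU balancing reacts to actual load but needs fresh usage metrics.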

Relying on multiple, fully-isolated clusters with a routing layer to enforce user-segmentation at a cluster level came with the challenge of managing these database clusters, along with the storage inefficiency associated with replicating each piece of data across every cluster. For example, if we have 100 petabytes of data replicated six times, the total data storage requirement is 600 petabytes. Other challenges of replication, like the compute cost associated with writing data and creating necessary projections and indexes associated with incremental data updates, also became apparent.

These challenges were further compounded by our rapid global growth and foray into new ventures, such as food delivery, freight, and bike share. As we began ingesting increasing amounts of data into Uber’s Data Warehouse to support the needs of our growing business, the fact that Vertica combines compute and storage on individual machines meant a corresponding increase in the amount of hardware needed to support the business. Essentially, we would be paying hardware costs for increased storage without any gain in query volume. If we chose to add more clusters, the resource wastage implicit in the replication process would mean that the actual query volume did not grow linearly. The sheer lack of efficiency in terms of capital allocation, as well as performance, meant that we needed to think outside of the box to find a solution that scales.

Applying data science to data infrastructure

Given Uber’s expertise in data science, we decided to apply principles from that field to optimize our data infrastructure. Working closely with the Data Science team, we set out to increase query and data volume scalability for our fast analytic engines.

A natural strategy to overcome the storage challenge was to move from fully replicated databases to partially replicated databases. As shown in Figure 2, below, in a fully replicated database system all the data is copied to every isolated database cluster, whereas a partially replicated database system segments data into overlapping sets of data elements, one set per cluster:

Figure 2: Diagram comparing fully replicated and partially replicated databases

Due to the large scale of the problem, involving thousands of queries and hundreds of tables, constructing these different overlapping sets of data elements is non-trivial. Further, partial replication strategies are often short-lived as data elements grow at different rates, and these data elements change as the business evolves. Apart from considering database availability, along with compute and storage scalability, we also had to consider the migration costs of partially replicating our databases.

With this data infrastructure challenge in mind, our Data Warehouse and Data Science teams came up with three basic requirements for our optimal solution:

  • Minimize overall disk space requirement: Our rapid growth meant that our existing strategy of adding fully-replicated clusters wasn’t efficient, as outlined above. Any new solution must allow us to densify our storage and make efficient use of resources.
  • Balance disk usage across clusters: Ideally, we want disk space filled in each cluster to be almost the same. Assuming that data is growing at the same pace across all clusters, this is desirable as it ensures no single cluster runs out of disk space before others.
  • Balance query volume across clusters: While optimizing for disk space, we also want to ensure that we are distributing query volume evenly across clusters. If neglected, we could end up with a situation where all queries are routed to a single cluster.

Our data science team formalized these requirements into a cost function that can be described as:

Cost(PartialConfiguration) = S + L + M

Each of the three terms in the equation above is described briefly below; a more detailed discussion can be found in our paper, Ephemeral Partially Replicated Databases.

  • S is the maximum storage utilization across N Vertica clusters. Storage utilization is the ratio of the total size of the data elements stored on a single cluster to the total size of all data elements. For example, if the data elements stored on a Vertica cluster total 60 petabytes and all the data elements total 100 petabytes for a given partial configuration candidate, then the storage utilization of that cluster is 0.6.
  • L is the maximum compute utilization across N Vertica clusters. Compute utilization, in turn, is the percentage of query volume that can be handled by a given cluster.
  • M is the maximum migration cost across N Vertica clusters. As described above, one of the challenges of using partially replicated databases is that an optimal partial configuration eventually becomes suboptimal, both because data elements grow at different rates and because the services and products offered by a business change. As a result, partially replicated databases often have to be reconfigured. This reconfiguration requires moving data elements from one database to another and thereby consumes compute resources. Ideally, we prefer a configuration that minimizes the cost of migration. The migration cost for a database cluster is the amount of new data that must be copied to move that cluster from its current state to its new state.
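
To make the three terms concrete, here is a minimal sketch of how such a cost could be evaluated for one candidate configuration. The function name, the input shapes, and the choice to express M as a fraction of total data size (so that all three terms are on a comparable scale) are assumptions made for this example; the paper's exact formulation may differ.

```python
def configuration_cost(table_sizes, new_config, old_config, query_counts):
    """Return S + L + M for a candidate partial configuration.

    table_sizes:  {table: size}, e.g. in petabytes
    new_config:   {cluster: set of tables} for the candidate configuration
    old_config:   {cluster: set of tables} currently loaded
    query_counts: {cluster: number of queries routed to that cluster}
    """
    total_size = sum(table_sizes.values())
    total_queries = sum(query_counts.values())

    # S: largest share of the total data held by any single cluster.
    s = max(
        sum(table_sizes[t] for t in tables) / total_size
        for tables in new_config.values()
    )
    # L: largest share of the query volume handled by any single cluster.
    l = max(query_counts.get(c, 0) / total_queries for c in new_config)
    # M: largest amount of newly copied data on any cluster,
    # normalized by total size (an assumption of this sketch).
    m = max(
        sum(table_sizes[t] for t in new_config[c] - old_config.get(c, set()))
        / total_size
        for c in new_config
    )
    return s + l + m


sizes = {"trips": 60, "eats": 30, "freight": 10}
full = {"a": {"trips", "eats", "freight"}, "b": {"trips", "eats", "freight"}}
partial = {"a": {"trips"}, "b": {"eats", "freight"}}
cost = configuration_cost(sizes, partial, full, {"a": 500, "b": 500})
```

In this toy case cluster "a" holds 60 of 100 petabytes, so S is 0.6, matching the storage utilization example above; L is 0.5 because queries split evenly, and M is 0 because moving from full to partial replication only drops tables.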

Minimizing the above cost function for thousands of tables and millions of queries is a difficult task. Based on empirical observations, our Data Science team found that the largest 10 percent of tables account for about 90 percent of disk utilization. Thus, most of the disk space efficiency can be achieved with an optimal configuration of just those tables.

Focusing on these tables significantly reduced the number of decision parameters we required for optimization. Furthermore, our Data Science team developed an algorithm that deliberately accepts slightly sub-optimal solutions by greedily assigning tables and queries to the clusters with the lowest cost. Compared to an optimal solution, this greedy algorithm reduces disk savings by 5 percent, but it is significantly faster and completes within a few minutes. We decided to productionize this algorithm, favoring speed over disk usage.
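
The general shape of a greedy placement can be sketched as follows. This is a simplified stand-in, not the algorithm from the paper: it considers disk usage only (not query load or migration cost), and the `copies` parameter and all names are hypothetical.

```python
def greedy_place(table_sizes, clusters, copies=2):
    """Place each table on the `copies` least-full clusters, largest tables first."""
    used = {c: 0.0 for c in clusters}
    placement = {c: set() for c in clusters}
    # Largest tables first, since they dominate disk usage.
    for table, size in sorted(table_sizes.items(), key=lambda kv: kv[1], reverse=True):
        # Greedy step: choose the clusters with the least disk used so far.
        targets = sorted(clusters, key=lambda c: used[c])[:copies]
        for c in targets:
            placement[c].add(table)
            used[c] += size
    return placement, used


placement, used = greedy_place(
    {"trips": 60, "eats": 30, "freight": 10}, ["a", "b", "c"], copies=2
)
```

With three clusters and two copies of each table, the sketch stores 200 units in total instead of the 300 that full replication would need, while every table still lives on more than one cluster.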

Once we had the data science problems figured out, the next step was to tackle the engineering challenges. To support partial replication, we had to significantly enhance two components, our proxy manager and our data manager, highlighted below:

  • Proxy Manager: With fully replicated Vertica databases, a proxy manager provides a thin abstraction between the client and its corresponding databases, while also acting as a load balancer. All incoming queries are routed through this layer, which has knowledge of query load, data location, and cluster health to ensure each query is routed to a cluster that can handle it.
  • Data Manager: The second component needed was a data manager. In the fully replicated world, all data is copied from an upstream data lake to all the available Vertica databases. However, in our proposed design, each data element is copied to different databases depending on the partial configuration. The data manager holds information about which cluster requires which tables to be loaded on to it, and will share this information with the proxy manager.
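
The interaction between the two components can be illustrated with a small routing function. This is a sketch under assumptions: the table-location map stands in for what the data manager would share, and the function name, the health set, and the load metric are hypothetical.

```python
def route_query(required_tables, table_locations, healthy, load):
    """Pick the least-loaded healthy cluster that holds every required table.

    required_tables: set of tables the query reads
    table_locations: {cluster: set of tables loaded on it} (from the data manager)
    healthy:         set of clusters currently considered healthy
    load:            {cluster: current query load}
    """
    candidates = [
        cluster
        for cluster, tables in table_locations.items()
        if cluster in healthy and required_tables <= tables
    ]
    if not candidates:
        raise LookupError("no healthy cluster holds all required tables")
    return min(candidates, key=lambda cluster: load.get(cluster, 0))


locations = {
    "a": {"trips", "eats"},
    "b": {"eats", "freight"},
    "c": {"trips", "freight"},
}
choice = route_query({"trips"}, locations, {"a", "b", "c"}, {"a": 10, "b": 1, "c": 3})
```

Note that a query joining tables that never share a cluster has nowhere to run, which is why the partial configuration has to be built from the observed query workload and not from table sizes alone.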

Diagram of partially replicated Vertica databases

With all these pieces in place, our solution reduced overall disk consumption by over 30 percent while continuing to provide the same level of compute scalability and database availability. The savings achieved resulted in decreased hardware cost despite query volume growth and also ensured that we were able to balance load evenly across clusters. For Uber’s internal teams using this data, this load balancing meant improved uptime, as queries were always directed to the healthiest clusters, and fewer failures for ETLs.

Building an intelligent infrastructure

Working closely with the Data Science team on this project demonstrated how the power of machine learning and data science can be infused into the data infrastructure world, and be used to create a meaningful impact not only on Uber’s business but also for thousands of users, from AI researchers to city operations managers, within Uber who rely on us to power insight-gathering and decision-making. The success of this project has spurred deeper collaboration between our Infrastructure and Data Science teams and has led to the development of a new Intelligent Infrastructure team to rethink infrastructure design for Big Data applications.

If you are interested in working alongside us as we build a data-driven platform that moves the world, come join our teams!

Atul Gupte

Atul Gupte is a former product manager on Uber's Product Platform team. At Uber, he drives product decisions to ensure our data science teams are able to achieve their full potential, by providing access to foundational infrastructure and advanced software to power Uber’s global business.

Ritesh Agrawal

Ritesh Agrawal is a senior data scientist on Uber's Data Science team, leading the intelligent infrastructure and developer platform teams. His work is focused on finding innovative ways to use data science and AI to make Uber’s infrastructure more adaptive and scalable and enhance developer productivity.

Posted by Atul Gupte, Ritesh Agrawal

The magic behind Uber’s data-driven success

Uber, the ride-hailing giant, is a household name worldwide. We all recognize it as the platform that connects riders with drivers for hassle-free transportation. But what most people don’t realize is that behind the scenes, Uber is not just a transportation service; it’s a data and analytics powerhouse. Every day, millions of riders use the Uber app, unwittingly contributing to a complex web of data-driven decisions. This blog takes you on a journey into the world of Uber’s analytics and the critical role that Presto, the open source SQL query engine, plays in driving their success.

Uber’s DNA as an analytics company

At its core, Uber’s business model is deceptively simple: connect a customer at point A to their destination at point B. With a few taps on a mobile device, riders request a ride; then, Uber’s algorithms work to match them with the nearest available driver and calculate the optimal price. But the simplicity ends there. Every transaction, every cent matters. A ten-cent difference in each transaction translates to a staggering $657 million annually. Uber’s prowess as a transportation, logistics and analytics company hinges on their ability to leverage data effectively.

The pursuit of hyperscale analytics

The scale of Uber’s analytical endeavor requires careful selection of data platforms, with close attention to their capacity for analytical processing at very large scale. Consider the magnitude of Uber’s footprint [1]. The company operates in more than 10,000 cities with more than 18 million trips per day. To maintain analytical superiority, Uber keeps 256 petabytes of data in store and processes 35 petabytes of data every day. They support 12,000 monthly active users of analytics running more than 500,000 queries every single day.
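
The $657 million figure quoted earlier follows directly from the trip volume cited above. A quick check, assuming exactly 18 million trips per day and a 365-day year:

```python
TRIPS_PER_DAY = 18_000_000   # "more than 18 million trips per day"
CENTS_PER_TRIP = 10          # the ten-cent difference per transaction
DAYS_PER_YEAR = 365

# Work in integer cents to avoid floating-point rounding.
annual_cents = TRIPS_PER_DAY * CENTS_PER_TRIP * DAYS_PER_YEAR
annual_dollars = annual_cents // 100

print(f"${annual_dollars:,}")  # prints $657,000,000
```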

To power this mammoth analytical undertaking, Uber chose the open source Presto distributed query engine. Teams at Facebook developed Presto to handle high numbers of concurrent queries on petabytes of data and designed it to scale up to exabytes of data. Presto was able to achieve this level of scalability by completely separating analytical compute from data storage. This allowed them to focus on SQL-based query optimization to the nth degree.

What is Presto?

Presto is an open source distributed SQL query engine for data analytics and the data lakehouse, designed for running interactive analytic queries against datasets of all sizes, from gigabytes to petabytes. It excels in scalability and supports a wide range of analytical use cases. Presto’s cost-based query optimizer, dynamic filtering and extensibility through user-defined functions make it a versatile tool in Uber’s analytics arsenal. To achieve maximum scalability and support a broad range of analytical use cases, Presto separates analytical processing from data storage. When a query is constructed, it passes through a cost-based optimizer, then data is accessed through connectors, cached for performance and analyzed across a series of servers in a cluster. Because of its distributed nature, Presto scales for petabytes and exabytes of data.

The evolution of Presto at Uber

Beginning of a data analytics journey

Uber began their analytical journey with a traditional analytical database platform at the core of their analytics. However, as their business grew, so did the amount of data they needed to process and the number of insight-driven decisions they needed to make. The cost and constraints of traditional analytics soon reached their limit, forcing Uber to look elsewhere for a solution.

Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. While this side-by-side strategy enabled data capture, they quickly discovered that the data lake worked well for long-running queries, but it was not fast enough to support the near-real time engagement necessary to maintain a competitive advantage.

To address their performance needs, Uber chose Presto because of its ability, as a distributed platform, to scale in linear fashion and because of its commitment to ANSI-SQL, the lingua franca of analytical processing. They set up a couple of clusters and began processing queries at a much faster speed than anything they had experienced with Apache Hive, a distributed data warehouse system, on their data lake.

Continued high growth

As the use of Presto continued to grow, Uber joined the Presto Foundation, the neutral governing body behind the Presto open source project, as a founding member alongside Facebook. Their initial contributions were based on their need for growth and scalability. Uber focused on contributing to several key areas within Presto:

Automation: To support growing usage, the Uber team went to work on automating cluster management to make it simple to keep up and running. Automation enabled Uber to grow to their current state with more than 256 petabytes of data, 3,000 nodes and 12 clusters. They also put process automation in place to quickly set up and take down clusters.

Workload Management: Because different kinds of queries have different requirements, Uber made sure that traffic is well-isolated. This enables them to batch queries based on speed or accuracy. They have even created subcategories for a more granular approach to workload management.

Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data. Large, untested workloads run the risk of hogging all the resources. In some cases, the queries run out of memory and do not complete.

To address this challenge, Uber created and maintains sample versions of datasets. If they know a certain user is doing exploratory work, they simply route them to the sampled datasets. This way, the queries run much faster. There may be inaccuracy because of sampling, but it allows users to discover new viewpoints within the data. If the exploratory work needs to move on to testing and production, they can plan appropriately.
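
The routing idea can be sketched as a lookup that swaps in a sampled table for exploratory users. The table names, the 1 percent sampling rate, and both function names are hypothetical; the source does not describe how Uber names or sizes its samples.

```python
# Hypothetical mapping: full table -> (sampled table, sampling rate).
SAMPLED_TABLES = {"trips": ("trips_sample", 0.01)}


def resolve_table(table: str, exploratory: bool) -> tuple[str, float]:
    """Return the table to query and the fraction of rows it contains."""
    if exploratory and table in SAMPLED_TABLES:
        return SAMPLED_TABLES[table]
    return table, 1.0


def scale_count(sample_count: int, rate: float) -> float:
    """Estimate a full-table row count from a count over a uniform sample."""
    return sample_count / rate
```

Scaling a count by the sampling rate gives only an estimate, which is the inaccuracy the text mentions; it is usually acceptable for exploration and should be replaced by a query on the full dataset before results go to production.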

Security: Uber adapted Presto to take users’ credentials and pass them down to the storage layer, specifying the precise data to which each user has access permissions. As Uber has done with many of its additions to Presto, they contributed their security upgrades back to the open source Presto project.

The technical value of Presto at Uber

Analyzing complex data types with Presto

As a digital native company, Uber continues to expand its use cases for Presto. For traditional analytics, they are bringing data discipline to their use of Presto. They ingest data in snapshots from operational systems. It lands as raw data in HDFS. Next, they build model data sets out of the snapshots, cleanse and deduplicate the data, and prepare it for analysis as Parquet files.

For more complex data types, Uber uses Presto’s complex SQL features and functions, especially when dealing with nested or repeated data, time-series data or data types like maps, arrays, structs and JSON. Presto also applies dynamic filtering that can significantly improve the performance of queries with selective joins by avoiding reading data that would be filtered by join conditions. For example, a Parquet file can store data as BLOBs within a column. Uber users can run a Presto query that extracts JSON from such a column and filters it as specified by the query. The caveat is that doing this defeats the purpose of Parquet’s columnar layout. It is a quick way to do the analysis, but it does sacrifice some performance.

Extending the analytical capabilities and use cases of Presto

To extend the analytical capabilities of Presto, Uber uses many out-of-the-box functions provided with the open source software. Presto provides a long list of functions, operators, and expressions as part of its open source offering, including standard functions, maps, arrays, mathematical, and statistical functions. In addition, Presto also makes it easy for Uber to define their own functions. For example, tied closely to their digital business, Uber has created their own geospatial functions.

Uber chose Presto for the flexibility it provides with compute separated from data storage. As a result, they continue to expand their use cases to include ETL, data science, data exploration, online analytical processing (OLAP), data lake analytics and federated queries.

Pushing the real-time boundaries of Presto

Uber also upgraded Presto to support real-time queries and to run a single query across data in motion and data at rest. To support very low latency use cases, Uber runs Presto as a microservice on their infrastructure platform and moves transaction data from Kafka into Apache Pinot, a real-time distributed OLAP data store, used to deliver scalable, real-time analytics.

According to the Apache Pinot website, “Pinot is a distributed and scalable OLAP (Online Analytical Processing) datastore, which is designed to answer OLAP queries with low latency. It can ingest data from offline batch data sources (such as Hadoop and flat files) as well as online data sources (such as Kafka). Pinot is designed to scale horizontally, so that it can handle large amounts of data. It also provides features like indexing and caching.”

This combination supports a high volume of low-latency queries. For example, Uber has created a dashboard called Restaurant Manager in which restaurant owners can look at orders in real time as they are coming into their restaurants. Uber has made the Presto query engine connect to real-time databases.

To summarize, here are some of the key differentiators of Presto that have helped Uber:

Speed and Scalability: Presto’s ability to handle massive amounts of data and process queries at lightning speed has accelerated Uber’s analytics capabilities. This speed is essential in a fast-paced industry where real-time decision-making is paramount.

Self-Service Analytics: Presto has democratized data access at Uber, allowing data scientists, analysts and business users to run their queries without relying heavily on engineering teams. This self-service analytics approach has improved agility and decision-making across the organization.

Data Exploration and Innovation: The flexibility of Presto has encouraged data exploration and experimentation at Uber. Data professionals can easily test hypotheses and gain insights from large and diverse datasets, leading to continuous innovation and service improvement.

Operational Efficiency: Presto has played a crucial role in optimizing Uber’s operations. From route optimization to driver allocation, the ability to analyze data quickly and accurately has led to cost savings and improved user experiences.

Federated Data Access: Presto’s support for federated queries has simplified data access across Uber’s various data sources, making it easier to harness insights from multiple data stores, whether on-premises or in the cloud.

Real-Time Analytics: Uber’s integration of Presto with real-time data stores like Apache Pinot has enabled the company to provide real-time analytics to users, enhancing their ability to monitor and respond to changing conditions rapidly.

Community Contribution: Uber’s active participation in the Presto open source community has not only benefited their own use cases but has also contributed to the broader development of Presto as a powerful analytical tool for organizations worldwide.

The power of Presto in Uber’s data-driven journey

Today, Uber relies on Presto to power some impressive metrics. From their latest Presto presentation in August 2023, here’s what they shared:

Uber’s success as a data-driven company is no accident. It’s the result of a deliberate strategy to leverage cutting-edge technologies like Presto to unlock the insights hidden in vast volumes of data. Presto has become an integral part of Uber’s data ecosystem, enabling the company to process petabytes of data, support diverse analytical use cases, and make informed decisions at an unprecedented scale.

Getting started with Presto

If you’re new to Presto and want to check it out, we recommend this Getting Started page where you can try it out.

Alternatively, if you’re ready to get started with Presto in production you can check out IBM watsonx.data, a Presto-based open data lakehouse. Watsonx.data is a fit-for-purpose data store, built on an open lakehouse architecture, supported by querying, governance and open data formats to access and share data.

1 Uber. EMA Technical Case Study, sponsored by Ahana. Enterprise Management Associates (EMA). 2023.

How Uber uses data science to reinvent transportation?

Understand how the ride sharing service Uber uses big data and data science to reinvent transportation and logistics globally.

With more than 8 million users, 1 billion Uber trips and 160,000+ people driving for Uber across 449 cities in 66 countries, Uber is the fastest-growing startup, standing at the top of its game. Tackling problems like poor transportation infrastructure in some cities, unsatisfactory customer experience, late cars, poor fulfilment, drivers refusing to accept credit cards and more, Uber has “eaten the world” in less than 5 years and is a name to reckon with when it comes to solving transportation problems for people.

If you have ever booked an Uber, you might know how simple the process is: just press a button, set the pickup location, request a car, go for a ride and pay with a click of a button. The process is simple, but there is a lot going on behind the scenes. The key driving the growth of the $51 billion start-up is the big data it collects and leverages for insightful and intelligent decision making. While Uber moves people around the world without owning any cars, data moves Uber. As the foundation for building the most intelligent company on the planet by completely solving problems for riders, Big Data and Data Science are at the heart of everything Uber does: surge pricing, better cars, detecting fake rides, fake cards and fake ratings, and estimating fares and driver ratings. Read on to understand how Uber makes clever use of big data and data science to reinvent transportation and logistics globally.

Table of Contents

  • Big Data at Uber
  • Data Products at Uber - Surge Pricing
  • Matching Algorithms at Uber
  • Fare Estimates
  • Uber Data Science Tools

“Uber lives or dies by data. Their overall mission and their sustainability is completely dependent on how good their data is. The more data they can collect, the more information they can derive from patterns and behaviours. Their ability to increase profits is all dependent on that.” - said Spencer, a former Uber driver.

There is no need to look for a local taxi or to tip a bellman for the ride; you are just a click away from a high-quality customer experience with Uber’s revolutionary data-driven business model. Data is the biggest asset for Uber, and its complete business model is based on the big data principle of crowdsourcing. Anybody with a car who is willing to help someone get to a desired location can offer to take them there.

It is tricky to get sufficient details on Uber’s big data infrastructure, but we have some interesting information about it here. Uber’s data is collected in a Hadoop data lake, and it uses Spark and Hadoop to process the data. Uber’s data comes from a range of data types and databases, such as SOA database tables, schemaless data stores and the event messaging system Apache Kafka.


Uber is greedy about the data it collects, and with relatively cheap storage and processing options like Hadoop and Spark it keeps data about every single GPS point for every trip taken on Uber. Uber stores historical information about its systems and capabilities to make data science easier for its data scientists down the road. Keeping change logs and versioning database schemas helps data scientists answer the questions at hand. With this data, they can answer questions such as what the Uber system looked like at a particular point in time from a customer perspective, from a supply behaviour perspective, from an inter-server communication perspective, or even down to the state of a database.

With a huge database of drivers, as soon as a user requests a car, Uber’s algorithms match the user with the most suitable nearby driver within a 15-second window. Uber stores and analyses data on every single trip its users take, and uses it to predict the demand for cars, set fares and allocate sufficient resources. The data science team at Uber also performs in-depth analysis of public transport networks across different cities so that it can focus on cities with poor transportation and make the best use of the data to enhance the customer experience.


In fact, Uber drivers continue to generate data for Uber even when they are not carrying any passengers, because they transmit data back to Uber’s central platform, where it is used to draw inferences about traffic patterns. The data is stored in the database for supply and demand algorithm analysis. Driver data is used for autonomous car research, surge pricing, tracking the location of drivers, monitoring drivers’ speed, motion and acceleration, and identifying whether a driver is working for a competing ride sharing company.

Big data analysis spans diverse functions at Uber: machine learning, data science, marketing, fraud detection and more. Uber’s data consists of information about trips, billing, the health of the infrastructure and other services behind its app. City operations teams use this data to calculate driver incentive payments and predict many other real-time events. The complete process of data streaming is done through a Hadoop Hive based analytics platform, which gives the right people and services the required data at the right time.


“Whether it’s calculating Uber’s ‘surge pricing’, helping drivers to avoid accidents, or finding the optimal positioning of cars to maximize profits, data is central to what Uber does. All these data problems… are really crystalized on this one math with people all over the world trying to get where they want to go. That’s made data extremely exciting here, it’s made engaging with Spark extremely exciting,” said Uber’s Head of Data Aaron Schildkrout.

Data Science at Uber

Data science is an integral part of Uber’s products and philosophy. Uber does an exceptional job of hiring data-oriented people throughout the company through its exclusive Uber Analytics Test v3.1. Anyone applying for a job at Uber that requires analysing back-end extracts from the application has to take the Uber Analytics Test.


On the product front, Uber’s data team is behind all the predictive models powering the ride sharing service, from predicting that “Your driver will be here in 3 minutes” to estimating fares and showing surge prices and heat maps that tell drivers where to position themselves within the city. The business success of Uber depends on its ability to create a positive user experience through statistical data analysis. What makes Uber unique is that its data science driven insights don’t just stay within dashboards or company reports but are implemented in real time in its services to create a positive user experience for customers and drivers.


To create the most efficient market and maximize the number of rides it can provide, Uber uses surge pricing. Suppose you are running late and too stressed to take public transport: Uber could come to your rescue, but you soon notice that it will charge you 1.5 times the usual rate.

Sometimes when you try to book an Uber, what you thought would be a $10 ride turns out to cost 2, 3 or even 4 times more. This is due to the surge pricing algorithms that Uber runs behind the scenes. Data science is at the heart of Uber’s surge pricing algorithm: given a certain level of demand, what is the right price for a ride under current conditions? Uber maintains the surge pricing algorithm to ensure that passengers can always get a ride when they need one, even if it comes at an inflated price. Uber has even applied for a patent on big data informed pricing, i.e. surge pricing.

Most of the predictive models at Uber follow the business logic of how pricing decisions are made. For instance, Geosurge (Uber’s name for its surge pricing, or dynamic pricing, model) looks at the available data and compares theoretical ideals with what is actually implemented in the real world. The surge pricing model is based on both geo-location and demand for rides, and is used to position drivers efficiently. Data science methodologies are used extensively to analyse the short-term effects of surge pricing on customer demand and its long-term effects on customer retention. Uber relies on regression analysis to find out which neighbourhoods will be the busiest so it can activate surge pricing to get more drivers on the road.
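
To make the idea of demand-based pricing concrete, here is a minimal sketch of a multiplier for a single neighbourhood. It is not Uber’s Geosurge model, whose details are not given here; the function name `surge_multiplier`, the demand-to-supply ratio rule, the 4x cap and the one-decimal rounding are all assumptions chosen for illustration.

```python
def surge_multiplier(open_requests: int, available_drivers: int, cap: float = 4.0) -> float:
    """Hypothetical rule: price scales with the ratio of ride requests to free drivers.

    The ratio rule and the default cap are illustrative assumptions,
    not Uber's actual pricing logic.
    """
    if open_requests <= 0:
        # Nobody is asking for a ride, so there is nothing to surge.
        return 1.0
    if available_drivers <= 0:
        # Demand but no supply at all: charge the maximum allowed multiplier.
        return cap
    ratio = open_requests / available_drivers
    # Never discount below the normal fare, never exceed the cap.
    return round(min(max(1.0, ratio), cap), 1)


if __name__ == "__main__":
    print(surge_multiplier(30, 20))   # demand exceeds supply
    print(surge_multiplier(5, 20))    # plenty of drivers, normal fare
    print(surge_multiplier(100, 10))  # extreme demand hits the cap
```

A real system would presumably smooth these inputs over time and across neighbouring areas; the sketch only shows the basic relationship between demand, supply and price.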

Uber recently announced that it is going to limit the use of surge pricing through machine learning. The machine learning algorithms will take multiple data inputs and predict where demand is going to be highest so that Uber drivers can be redirected there. This should prevent supply from falling short of demand, so that Uber does not actually have to apply surge pricing. Uber has not yet confirmed when this new system will be rolled out.


Timing is everything at Uber. Given a pickup location, drop-off location and time of day, predictive models developed at Uber estimate how long it will take a driver to cover the distance. Uber has sophisticated routing and matching algorithms that direct cars to people and people to places. From the time you open the Uber app until you reach your destination, Uber’s routing engine and matching algorithms are hard at work.
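
As a rough illustration of predicting trip time from pickup, drop-off and time of day, the sketch below uses straight-line (haversine) distance, a fixed detour factor and an assumed average speed per hour. Uber’s real models are learned from trip data; every constant here (the 1.3 detour factor, the rush hours, the 18 and 30 km/h speeds) is a made-up placeholder.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def assumed_speed_kmh(hour: int) -> float:
    """Hypothetical average city speed: slower during assumed rush hours."""
    rush_hours = {7, 8, 9, 16, 17, 18}
    return 18.0 if hour in rush_hours else 30.0


def estimate_trip_minutes(pickup, dropoff, hour: int, detour_factor: float = 1.3) -> float:
    """Estimate driving time; roads are assumed to be 30% longer than a straight line."""
    straight_km = haversine_km(pickup[0], pickup[1], dropoff[0], dropoff[1])
    return straight_km * detour_factor / assumed_speed_kmh(hour) * 60


if __name__ == "__main__":
    pickup, dropoff = (37.7749, -122.4194), (37.8044, -122.2712)
    print(round(estimate_trip_minutes(pickup, dropoff, hour=8), 1))   # rush hour
    print(round(estimate_trip_minutes(pickup, dropoff, hour=13), 1))  # off-peak
```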

Uber follows a supplier pick map matching algorithm, in which the customer selects the variables associated with a service (in this case through the Uber app) and a match is made by sending requests to the most suitable list of service providers. Any Uber ride request is first sent to the nearest available driver, where nearness is determined from the customer’s location and the driver’s expected time of arrival. The driver then accepts or rejects the request. This matching algorithm works well for Uber because the transaction is highly commoditized, i.e. the customer has to decide very few variables before a match is made.
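
The request flow described above (offer the ride to the nearest available driver, who accepts or rejects it, then move on to the next nearest) can be sketched in a few lines. The `Driver` class, the `dispatch` function and the `accepts` callback are hypothetical names for illustration, and ranking purely by expected time of arrival is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Driver:
    driver_id: str
    eta_minutes: float      # expected time of arrival at the rider's location
    available: bool = True


def dispatch(drivers: Iterable[Driver], accepts: Callable[[Driver], bool]) -> Optional[Driver]:
    """Offer the request to available drivers in order of ETA until one accepts.

    `accepts` stands in for the driver's accept/reject decision.
    Returns the matched driver, or None if every candidate rejects.
    """
    candidates = sorted((d for d in drivers if d.available), key=lambda d: d.eta_minutes)
    for driver in candidates:
        if accepts(driver):
            return driver
    return None


if __name__ == "__main__":
    fleet = [
        Driver("a", 7.0),
        Driver("b", 3.0),
        Driver("c", 2.0, available=False),  # closest, but busy
        Driver("d", 4.0),
    ]
    # Driver "b" is nearest among available drivers but rejects; "d" is next in line.
    match = dispatch(fleet, accepts=lambda d: d.driver_id != "b")
    print(match.driver_id if match else "no driver found")
```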


Uber uses a mixture of internal and external data to estimate fares. Uber calculates fares automatically using street traffic data, GPS data and its own algorithms that make alterations based on the time of the journey. It also analyses external data like public transport routes to plan various services.
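
A fare estimate of this kind can be pictured as a metered price built from distance and time, then adjusted by a time-dependent multiplier. The sketch below is an assumption-laden illustration: the base fare, per-kilometre and per-minute rates, the minimum fare and the function name `estimate_fare` are all invented, and Uber’s actual formula is not described in this article.

```python
def estimate_fare(
    distance_km: float,
    duration_min: float,
    surge: float = 1.0,
    base: float = 2.0,      # hypothetical flag-fall charge
    per_km: float = 1.2,    # hypothetical distance rate
    per_min: float = 0.3,   # hypothetical time rate (captures traffic delays)
    minimum: float = 5.0,   # hypothetical minimum fare
) -> float:
    """Estimate a fare from trip distance and duration, scaled by a surge multiplier."""
    metered = base + per_km * distance_km + per_min * duration_min
    return round(max(minimum, metered * surge), 2)


if __name__ == "__main__":
    print(estimate_fare(10, 20))             # normal conditions
    print(estimate_fare(10, 20, surge=1.5))  # same trip during a surge
    print(estimate_fare(1, 2))               # short hop falls back to the minimum fare
```

Charging for both distance and time is one simple way to let street traffic data influence the price: the same route costs more when congestion makes it slower.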


Python is the go-to data science programming language at Uber and is used extensively by the Uber data team. Commonly used third-party modules include NumPy, SciPy, Matplotlib and Pandas. The data team occasionally uses R, Octave or Matlab for prototypes or one-off data science projects, but not in the production stack. D3 is the preferred data visualization tool at Uber, and Postgres the preferred SQL database.

What can you expect in the future from Uber’s data-driven methodologies?

With initiatives like UberFresh for grocery deliveries, UberRush for package courier service and UberChopper offering helicopter rides to the wealthy, Uber is all set to revolutionize private transportation globally. Uber knows the popular nightclubs in the city and the best-in-class restaurants, and has data about traffic patterns across different regions. Uber’s data would soon be combined with customer-specific personal data, shared in exchange for benefits, making Uber the big Big Data company. Soon, citizens would not mind sharing their SSN with Uber if it uses their data to book a restaurant with good live music for a romantic Valentine’s Day dinner and arranges a pickup in a luxury car for them and their spouse.


So the next time you take an Uber, think of the data science going on behind the scenes. The quality of service you enjoy is due to the big data being analysed and the data science being applied to create a better riding experience for you.


About the Author


ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies. Having over 270+ reusable project templates in data science and big data with step-by-step walkthroughs,


© 2024 Iconiq Inc.


Digital Innovation and Transformation

MBA Student Perspectives


Uber knows you: how data optimizes our rides


While Uber transports people and meals around the world without owning a car, they still rely on fuel: Data, data and more data – the magic word for Uber.

Everyone knows Uber. But dude, they know you at least equally well!

While Uber transports people around the world without owning a car, there is only one fuel that powers Uber: data. This is the secret key driving growth of the Silicon Valley start-up revolutionizing the taxi industry. What makes Uber unique is that its data-driven insights don’t just stay within internal dashboards but are implemented in real time in its services to generate an unprecedented user experience for both customers and drivers. 1

Wait, what’s the use of knowing my way to work?

Come on, you can do better! Uber uses data in many different ways with two applications standing out.

Matching Algorithms


From the moment you open the app until you reach your destination, Uber’s routing engine and matching algorithms are working hard. Given the planned route and time of day, prediction models forecast the driving time and allocate the optimal driver through a process called batch matching.

Through a machine learning algorithm, the models become more accurate in their predictive power with each ride filed. This matching algorithm allows Uber to minimize the number of variables a customer has to enter. In addition to that, they offer lower wait times and a more reliable experience for riders. Drivers, in turn, get more time to earn. 1
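
Batch matching means collecting several open requests and assigning drivers to all of them at once, rather than greedily giving each rider the closest car. A minimal sketch, assuming the goal is to minimize the total pickup time and that there are at least as many drivers as riders; the function name `batch_match` and the brute-force search are illustrative choices that only suit tiny batches, not a description of Uber’s implementation.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def batch_match(eta: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Assign one distinct driver to each rider so the total ETA is minimal.

    eta[r][d] is the estimated minutes for driver d to reach rider r.
    Returns (assignment, total), where assignment[r] is the driver index for rider r.
    Tries every possible assignment, so it is only practical for small batches.
    """
    n_riders = len(eta)
    n_drivers = len(eta[0]) if eta else 0
    best_total, best_assignment = float("inf"), ()
    for candidate in permutations(range(n_drivers), n_riders):
        total = sum(eta[rider][driver] for rider, driver in enumerate(candidate))
        if total < best_total:
            best_total, best_assignment = total, candidate
    return list(best_assignment), best_total


if __name__ == "__main__":
    # Two riders, two drivers. Greedy matching gives rider 0 the closest car
    # (driver 0, 2 min) and leaves rider 1 with driver 1 (9 min): 11 min in total.
    eta = [
        [2, 3],
        [3, 9],
    ]
    print(batch_match(eta))  # the batch solution swaps the drivers: 3 + 3 = 6 min
```

The example shows why batching can lower wait times overall even though one rider does not get the nearest car.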

Surge Pricing

The instant use of live data allows Uber to operate a dynamic pricing model effectively. Using geo-location coordinates from drivers, street traffic and ride demand data, the so-called Geosurge algorithm compares theoretical ideals with what is actually implemented in the real world and makes alterations based on the time of the journey. Through this process, fares are updated in real time based on demand. It also allows prices to be adjusted for specific areas within cities, so that some neighborhoods may have surge pricing while others do not. 2


Furthermore, smart machine learning algorithms take multiple data inputs and predict where the highest demand is going to be. During peak times, drivers receive live data in the form of heat maps to compare the demand in different areas. 3


This system allows Uber to position drivers optimally, ensuring that supply keeps up with demand. In doing so, Uber creates the most efficient market and maximizes the number of rides it can provide, which in turn benefits all parties. 1

But that’s billions of data – how do they manage?

That’s right, Uber gives about 15 million rides per day. 4 To manage this data flood, it introduced its own machine learning platform called Michelangelo, which is used to create different models for Uber’s various services.

Michelangelo is an internal ML-as-a-service platform that democratizes and optimizes the scaling of AI, ML and Deep Learning. It enables internal teams to seamlessly build, deploy, and operate machine learning solutions at Uber’s scale. It is designed to cover the end-to-end ML workflow: manage data, train, evaluate, and deploy models, make predictions, and monitor predictions. For the Geeks, visit this page where Michelangelo is presented in detail. 5
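
To picture the end-to-end workflow listed above (manage data, train, evaluate, deploy, predict, monitor), here is a toy sketch in plain Python. It does not use or imitate Michelangelo’s actual API; the function names, the in-memory `registry`, the one-variable linear model and the trip data are all hypothetical stand-ins.

```python
from statistics import mean

registry = {}  # hypothetical in-memory stand-in for a model store


def train(rows):
    """Fit minutes = intercept + slope * km by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum((x - x_bar) ** 2 for x in xs)
    return {"intercept": y_bar - slope * x_bar, "slope": slope}


def predict(model, x):
    return model["intercept"] + model["slope"] * x


def evaluate(model, rows):
    """Mean absolute error of the model on (x, y) rows."""
    return mean(abs(predict(model, x) - y) for x, y in rows)


def deploy(name, model, holdout, max_error):
    """Register the model only if it is accurate enough on held-out data."""
    if evaluate(model, holdout) > max_error:
        return False
    registry[name] = model
    return True


def monitor(name, live_rows, alert_above):
    """Return True when the deployed model's live error exceeds the alert threshold."""
    return evaluate(registry[name], live_rows) > alert_above


if __name__ == "__main__":
    history = [(1, 5), (2, 7), (3, 9), (4, 11)]  # made-up (km, minutes) trips
    holdout = [(5, 13), (6, 16)]
    model = train(history)
    if deploy("trip_minutes", model, holdout, max_error=1.0):
        print(predict(registry["trip_minutes"], 10))
        # Live trips are suddenly much slower than predicted, so monitoring flags drift.
        print(monitor("trip_minutes", [(2, 12), (3, 15)], alert_above=2.0))
```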


Boy, this sounds expensive – was it really necessary?

Hell yes! Before Michelangelo was born, Uber’s ML operations faced big challenges such as bad data quality, high data latency, lack of efficiency and scalability, and poor reliability. With its business growing exponentially, the amount of incoming data increased every day.

Being Uber means being efficient! Travis Kalanick – Co-founder of Uber

To realize Michelangelo, new data scientists, analysts and engineers had to be hired, and computing power and internet bandwidth had to be heavily increased. 6,7 There are no exact spending figures available on this, but Uber’s financials show that R&D spending increased by over 150 million 8 over the year prior to implementation in 2017. Although the entire amount was certainly not invested in this project, we expect that quite some money was spent on Uber’s new best buddy.

So, all their problems are solved now?

You have no idea! Even though Uber has managed to successfully process and use the vast amounts of data, they still face major challenges. The most important to mention here are the status of its drivers, tax issues, constitutional issues and of course the rising competition of companies such as Lyft, Didi or Grab ( details about challenges ). 9 In my view, however, Uber remains a highly competitive company with virtually no limits. Consider the diverse offerings such as packaging and food delivery, the upcoming driverless technologies and of course even air taxis which is by the way my favorite idea!

But Jesus! Think about how much data you need to manage for that!

1 How Uber uses data science to reinvent transportation? (projectpro.io)

2 How Surge Pricing Works | Drive with Uber | Uber

3 When and where are the most riders? | Driving & Delivering – Uber Help

4 Scaling Machine Learning at Uber with Michelangelo | Uber Blog

5 Data Science at Uber. Uber is one of the most successful… | by Jagandeep Singh | Medium

6 Uber’s Big Data Platform: 100+ Petabytes with Minute Latency | Uber Blog

7 Evolving Michelangelo Model Representation for Flexibility at Scale | Uber Blog

8 Uber R&D spending worldwide 2018 | Statista

9 4 Challenges Uber Will Face in the Next Years (investopedia.com)

10 https___blogs-images.forbes.com_amitchowdhry_files_2016_05_Uber-Surge-Pricing.jpg (960×573) (gettagged.us)

Student comments on Uber knows you: how data optimizes our rides

Yannik — thanks for the post, it was both hilariously written AND interesting. It was thought-provoking to read about how Uber is able to adjust its services in real-time, versus using big data as an input to make its product better in the long-term. Even though Uber and Lyft have achieved mass scale, I do wonder if they will continue to be competitive with rising prices and the increased ubiquity of big data as a business asset.

Great post!

Something I have always thought of is if and how algorithms can be trained to show empathy and act ethically. Your point about Uber being able to selectively surge charge brings back memories of Uber surcharging during mass shootings. I wonder if at some point algorithms will be able to cross reference what is going on in the public domain (news, online, etc) with location info and at some point make these ethical decisions without human intervention.

Great blog and an interesting read, Yannik! Uber has definitely done a great job in eliminating customer pain points around commuting by leveraging customer data. But I see increasing challenges, especially in developing economies like India: frequent cancellations by drivers, drivers insisting on cash payment due to a lack of payment transparency (which Uber sorted out 2 months back, after being in India for almost 10 years), poor customer service, and now rising competition from electric vehicle ride-hailing players. Uber has been able to do well in the US and parts of the European market, but it has struggled from the beginning in developing markets due to stiff competition. I’m really curious to know what their next growth strategy will be and what their future holds. And how are they going to use the plethora of customer data to make their next bet?

Yannik, this was an awesome read! I used Uber/Lyft on a daily basis when I worked in consulting and am still a frequent user of it now so I love asking the drivers about how the app works for them. One of the fascinating things I heard was that if a top-rated driver is on their way to pick up a non-top-rated user and a top-rated user subsequently requests a ride, the app will cancel the original ride to the non-top-rated user and redirect the driver to the top-rated user instead. I understood this as the app ensuring that their top-rated users have the best service from their best drivers (not necessarily to incentivize users to be better riders, since most users are unaware of this mechanism) but reading from your post, it strikes me that it may also be a cost saving mechanism to link its best drivers and users to “minimize the number of variables” for both parties and curate highly efficient rides to increase capacity.


Big Data in Practice by Bernard Marr


42 UBER How Big Data Is At The Centre Of Uber’s Transportation Business

Uber is a smartphone app-based taxi booking service which connects users who need to get somewhere with drivers willing to give them a ride. The service has been hugely popular. Since being launched to serve San Francisco in 2009, the service has been expanded to many major cities on every continent except for Antarctica, and the company are now valued at $41 billion. The business are rooted firmly in Big Data, and leveraging this data in a more effective way than traditional taxi firms has played a huge part in their success.

What Problem Is Big Data Helping To Solve?

Uber’s entire business model is based on the very Big Data principle of crowdsourcing: anyone with a car who is willing to help someone get to where they want to go can offer to help get them there. This gives greater choice for those who live in areas where there is little public transport, and helps to cut the number of cars on our busy streets by pooling journeys.

How Is Big Data Used In Practice?

Uber store and monitor data on every journey their users take, and use it to determine demand, allocate resources and set fares. The company also carry out in-depth analysis of public transport networks in the cities they serve, so they can focus coverage in poorly served areas and provide links to buses and trains.

Uber hold a vast database of drivers in all of the cities they cover, so when a passenger asks for a ride, they can ...


Using Big Data to Estimate Consumer Surplus: The Case of Uber

Estimating consumer surplus is challenging because it requires identification of the entire demand curve. We rely on Uber’s “surge” pricing algorithm and the richness of its individual level data to first estimate demand elasticities at several points along the demand curve. We then use these elasticity estimates to estimate consumer surplus. Using almost 50 million individual-level observations and a regression discontinuity design, we estimate that in 2015 the UberX service generated about $2.9 billion in consumer surplus in the four U.S. cities included in our analysis. For each dollar spent by consumers, about $1.60 of consumer surplus is generated. Back-of-the-envelope calculations suggest that the overall consumer surplus generated by the UberX service in the United States in 2015 was $6.8 billion.
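
The logic of the abstract can be illustrated with a small calculation. Consumer surplus is the area under the demand curve above the price actually paid, so once demand is known at several price points the area can be approximated. The sketch below uses the trapezoid rule on invented (surge multiplier, rides demanded) points; the paper itself relies on a regression discontinuity design over real Uber data, so both the numbers and the simplified method are assumptions. The only figures taken from the abstract are the $2.9 billion surplus and the $1.60 of surplus per dollar spent.

```python
def arc_elasticity(p0: float, q0: float, p1: float, q1: float) -> float:
    """Midpoint (arc) price elasticity of demand between two points on the curve."""
    pct_change_q = (q1 - q0) / ((q0 + q1) / 2)
    pct_change_p = (p1 - p0) / ((p0 + p1) / 2)
    return pct_change_q / pct_change_p


def consumer_surplus(prices, quantities) -> float:
    """Trapezoid-rule area under the demand curve above the lowest observed price.

    The result is a lower bound because the curve is cut off at the highest observed price.
    """
    points = sorted(zip(prices, quantities))
    area = 0.0
    for (p0, q0), (p1, q1) in zip(points, points[1:]):
        area += (p1 - p0) * (q0 + q1) / 2
    return area


# Hypothetical demand: rides requested at each surge multiplier (price in base-fare units).
prices = [1.0, 1.5, 2.0, 3.0]
rides = [100, 80, 60, 30]

surplus = consumer_surplus(prices, rides)   # in base-fare units
spending = prices[0] * rides[0]             # what riders pay at the normal price
surplus_per_unit_spent = surplus / spending

# Arithmetic implied by the abstract: $2.9bn surplus at $1.60 per dollar spent
# corresponds to roughly $1.8bn of consumer spending in the four cities.
implied_spending_bn = 2.9 / 1.6

if __name__ == "__main__":
    print(round(arc_elasticity(1.0, 100, 1.5, 80), 3))
    print(surplus, surplus_per_unit_spent)
    print(round(implied_spending_bn, 2))
```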

We are grateful to Josh Angrist, Keith Chen, Joseph Doyle, Hank Farber, Alan Krueger, Greg Lewis, Jonathan Meer, and Glen Weyl for helpful comments and discussions. We are also grateful to Mattie Toma for excellent research assistance. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.

Peter Cohen transitioned from paid independent contractor to full-time employee of Uber during the writing of the paper. As a current employee, he has an equity stake in the company.

Jonathan Hall was an employee and shareholder of Uber Technologies before, during, and after the writing of this paper.



Cold Call podcast series

Uber’s Strategy for Global Success

How can Uber adapt its business model to compete in unique global markets?


As Uber entered unique regional markets around the world, from New York to Shanghai, it adapted its business model to comply with regulations and compete locally. As the transportation landscape evolves, how can Uber adapt its business model to stay competitive in the long term?

Harvard Business School assistant professor Alexander MacKay describes Uber’s global market strategy and responses by regulators and local competitors in his case, “ Uber: Competing Globally .”

HBR Presents is a network of podcasts curated by HBR editors, bringing you the best business ideas from the leading minds in management. The views and opinions expressed are solely those of the authors and do not necessarily reflect the official policy or position of Harvard Business Review or its affiliates.

BRIAN KENNY: The theory of disruptive innovation was first coined by Harvard Business School professor Clayton Christensen in his 1997 book, The Innovator’s Dilemma . The theory explains the phenomenon by which an innovation transforms an existing market or sector by introducing simplicity, convenience, and affordability where complication and high cost are the status quo. Think Netflix disrupting the video rental space. Over the years, the term has been applied liberally and not always correctly to other examples, but every so often, an idea comes along that really fits the bill. Enter Uber, the ridesharing behemoth that turned the car service industry on its head. In a few short years after launching in 2010, Uber became the largest car service in the world, as measured in ride count. Last year, Uber drove 6.2 billion riders. Today’s case takes us to London in 2019, where Uber is facing the latest in a long list of challenges from regulators threatening their ability to continue operating in that important market. In this episode of Cold Call , we welcome Alexander MacKay to discuss the case entitled, “Uber: Competing Globally.” I’m your host, Brian Kenny, and you’re listening to Cold Call on the HBR Presents network.

Alexander MacKay is in the strategy unit at Harvard Business School. His research focuses on matters of competition, including pricing, demand, and market structure. Alex, thanks for joining us on Cold Call today.

ALEX MACKAY: Thank you, Brian. Very happy to be here.

BRIAN KENNY: The idea of Uber seems so simple, but it was revolutionary in so many ways. And Uber has been in the headlines many times for both good and bad reasons in its decade of existence. So we’re going to touch on a lot of those things today. So thanks for sharing the case with us.

ALEX MACKAY: Brian, I’m very happy to. It’s a little funny, we’ve actually started to see the first few students who have never hailed a traditional taxi in our classrooms. So I think increasingly, the contrast between the two is going to be pretty difficult for people to fully understand.

BRIAN KENNY: Let me ask you to start by telling us what your cold call would be when you set up the class here.

ALEX MACKAY: The case starts off with the current legal battle going on in London. And so the first question I just ask to start the classroom is: What’s the end game for Uber in London? What do they look like 10 years from now? In the midst of this ongoing legal battle, there has been back and forth, some give and take from both sides, Transportation for London, and also on the Uber side as well. And there’s actually a recent court case that has allowed Uber to have a little more time to operate. They bought about 18 more months of time, but this has been also brought with additional, stricter scrutiny, and 18 months from now, they’re going to be at it again trying to figure out exactly what rules Uber’s allowed to operate under.

BRIAN KENNY: It seems like 18 months in the lifetime of Uber is like a decade. Everything seems to happen so quickly for this company. That’s a long period of time. What made you decide to write this case? How does it relate to the work that you’re doing in your research?

ALEX MACKAY: A big focus of my research is on competition policy, particularly the realms of antitrust and regulation. And here we have a company, Uber, whose relationship with regulation has been really essential to its strategy from day one. And I think appreciating the effects of regulation and how it has impacted Uber’s performance in different markets is really critical for understanding strategy and global strategy broadly.

BRIAN KENNY:  Let’s just talk a little bit about Uber. I think people are familiar with it, but they may not be familiar with just how large they are in this space. And the space that they’ve sort of created has also blown up and expanded in many ways. So how big is Uber? Like what’s the landscape of ridesharing look like and where does Uber sit in that landscape?

ALEX MACKAY: Uber globally is the biggest ridesharing company. In 2018, they had over $10 billion in revenue for both ridesharing and their Uber Eats platform. And you mentioned in the introduction, that they had over 6 billion rides in 2019. That’s greater than 15 million rides every day that’s happening on their platform. So really, just an enormous company.

BRIAN KENNY: So they started back in 2010. It’s been kind of an amazing decade of growth for them. How do you explain that kind of rapid expansion?

ALEX MACKAY: They were financed early on with some angel investors. I think Kalanick’s background really helped there to get some early funding. But one of the critical things that allowed them to expand early into many markets that helped their growth was they’re a relatively asset light company. On the ground, they certainly need sales teams, they need translation work to move into different markets, but because the main asset they were providing in these different markets was software, and drivers were bringing their own cars and riders were bringing their own phones, the key pieces of hardware that you need to operate this market, they really didn’t have to invest a ton of capital. In fact, when they launched in Paris, they launched as sort of a prototype, just to show, “Hey, we can do this in Paris without too much difficulty,” as their first international market. So being able to really scale it across different markets really allowed them to grow. I think by 2015, their market cap was $60 billion, five years after founding, which is just an incredible rate of growth.

BRIAN KENNY: So they’re the biggest car service in the world, but they don’t own any cars. Like what business are they really in, I guess is the question?

ALEX MACKAY: They’re certainly in the business of matching riders to drivers. They’ve been able to do this in a way that doesn’t require them to own cars, just through the use of technology. And so what they’re doing, and this is I think pretty well understood, is that they’re using existing capital, people who have cars that may be going unused, personal cars, and Uber is able to use that and deploy that to give riding services to different customers. Whereas in the traditional taxi model, you could have taxis that you didn’t necessarily own, but you leased them or you rented them, but they had the express purpose of being driven for taxi services. And so it wasn’t using idle capital. You kind of had to create additional capital in order to provide the services.

BRIAN KENNY: So you mentioned Travis Kalanick a little bit earlier, but he was one of the co-founders of the company, and the case goes a little bit into his philosophy of what expansion into new markets should look like. Can you talk a little bit about that?

ALEX MACKAY: Certainly. Yeah. And I think it might even be helpful to talk a bit about his background, which I think provides a little more context before Uber. He dropped out of UCLA to work on his first company, Scour, and that was a peer-to-peer file sharing service, a lot like Napster, and actually predated Napster. And where he was operating was sort of an evolving legal gray area. Eventually, Scour got sued for $250 billion by a collection of entertainment companies and had to file for bankruptcy.

BRIAN KENNY: Wow.

ALEX MACKAY: He followed that up with his next venture, Red Swoosh, and that was software aimed at allowing users to share network bandwidth. So again, it was a little bit ahead of its time, making use of recent advances in technology. Early on though, they got in trouble with the IRS. They weren’t withholding taxes, and there were some other issues with his co-founder, and there was sort of a bad breakup between the two. Despite this, he persevered and ended up selling the company for $23 million in 2007. And after that, his next big thing was Uber. So one thing I just want to point out is that at all three of these companies, he was looking to do something that leveraged new technology to change the world. And by nature, sometimes businesses like that operate in a legal gray area and you have very difficult decisions to make. Some other decisions you have to make are clearly unethical and there’s really no reason to make some of those decisions, like with the taxes and with some other things that came out later on at Uber, but certainly one of the things that any founder who’s looking to change the world with a big new technology company has to deal with, is that often, the legal framework and the regulatory framework around what you’re trying to do isn’t well established.

BRIAN KENNY: Obviously drama seems to follow Travis where he goes. And his expansion strategy was pretty aggressive. It was almost like a warlike mentality in terms of going into a new market. And you could sort of sum it up as saying ask forgiveness. Is that fair?

ALEX MACKAY: Yeah. Yeah. Ask for forgiveness, not permission. I think they were really focused on winning. I think that was sort of their ultimate goal. We describe in the case there’s this policy of principled confrontation, to ignore existing regulations until you receive pushback. And then when you do receive pushback, either from local regulators or existing sort of taxicab drivers, mobilize a response to sort of confront that. During their beta launch in 2010, they received a cease-and-desist letter from the city of San Francisco. And they essentially just ignored this letter. They rebranded, they used to be UberCab, and they just took “Cab” out of their name, so now they’re Uber. And you can see their perspective in their press release in response to this. They say, “UberCab is a first to market cutting edge transportation technology, and it must be recognized that the regulations from both city and state regulatory bodies have not been written with these innovations in mind. As such, we are happy to help educate the regulatory bodies on this new generation of technology and work closely with both agencies to ensure compliance.”

BRIAN KENNY: It’s a little arrogant.

ALEX MACKAY: Yeah, so you can see right there, they’re saying, what we’re operating in is sort of this new technology-based realm and the regulators don’t really understand what’s going on. And so instead of complying with the existing regulations, we’re going to try to push regulations to fit what we’re trying to do.

BRIAN KENNY: The case is pretty epic in that it sort of cuts a sweeping arc across the world, looking at the challenges that they faced with each market they entered, and none more interesting, I think, than New York City, which is obviously an enormous market. Can you talk a little bit about some of the challenges they faced going into New York with the cab industry being as prevalent as it was and is?

ALEX MACKAY: Yeah, absolutely. I mean, I think it’s pretty well known for people who are familiar with New York that there were restrictions on the number of medallions which allowed taxis to operate. So there was a limited number of taxis that could drive around New York City. This restriction had really driven up the value of these medallions to the taxi owners. And if you had the experience of taking taxis in New York City prior to the advent of Uber, what you’d find is that there were some areas where the service was very, very good. Downtown, Midtown Manhattan, you could almost always find a taxi, but there are other parts of the city where it was very difficult at times to find a cab. And when you got in a cab, you weren’t sure that you were always going to be given a fair ride. And so Uber coming in and providing this technology that allowed you to pick up a ride from anywhere and sort of track the route as you’re going on really disrupted this market. Consumers love them. They had a thousand apps signups before they even launched. Kalanick mentioned this in terms of their launch strategy, we have to go here because the consumers really want us here. But immediately, they started getting pushback from the taxicab owners who were threatened by this new mode of transportation. They argued that they should be under the same regulations that the taxis were. And there were a lot of local government officials that were sort of mobilized against Uber as well. De Blasio, the Mayor of New York, wrote opinion articles against Uber, claiming that they were contributing to congestion. There was a lot of concern that maybe they had some safety issues, and the taxi drivers and the owners brought a lawsuit against Uber for evading these regulations. 
And then later on, and this was the case in many local governments, de Blasio introduced a bill to put additional restrictions on Uber that would make them look a lot more like a traditional taxi operating model, with limited number of licenses and strict requirements for reporting.

BRIAN KENNY: And this is the same scenario that’s going to play out almost with every city that they go into because there is such an established infrastructure for the taxi industry in those places. They have lobbyists. They’re tied into the political networks. In some instances, it was revealed that they’ve been connected with organized crime. So not for the faint of heart, right, trying to expand into some of the biggest cities in the United States.

ALEX MACKAY: Absolutely. Absolutely. And what’s sort of fascinating about the United States is it’s actually a place where a company can engage in this battle over regulation on the ground. And de Blasio writes his opinion article and pushes forward this bill. Uber responds by taking out an ad campaign, over $3 million, opposing these regulations and calling out de Blasio. So again, we sort of have this fascinating example of Uber mobilizing their own lobbyists, their lawyers, but also public advertising to sort of convince the residents of New York City that de Blasio and the regulators that are trying to come down on them are in the wrong.

BRIAN KENNY: Yeah. And at the end of the day, it’s consumers that they’re really making this appeal to, because I guess my question is, are these regulations stifling innovation? And if they are, who pays the ultimate price for that, Uber or the consumer?

ALEX MACKAY: Consumers definitely loved Uber. And I don’t think any of the regulators were trying to stifle innovation. I don’t think they would say that. I think their biggest concern, their primary concern was safety, and a secondary and related concern here was losing regulatory oversight over the transportation sector. So this is a public service that had been fairly tightly regulated for a long time, and there was some concern that what happens when this just becomes almost a free market sector. At the same time, these regulators have the lobbyists from the taxicab industry and other interested parties in their ear trying to convince them that Uber really is like a taxi company and should be regulated, and really emphasizing the safety concerns and other concerns to try to get stricter regulations put on Uber. And part of that may be valid. I think you certainly should be concerned about safety and there are real concerns there, but part of it is simply the strategic game that rivals are going to play between each other. And the taxicab industry sees Uber as a threat. It’s in their best interest to lobby the regulators to come down on Uber.

BRIAN KENNY: And what’s amazing to me is that while all this is playing out, they’re not turning their tails and running. They’re continuing to push forward and expand into other parts of the world. So can you talk a little bit about what it was like trying to go into countries in Latin America, countries in Asia, where the regulations and the regulatory infrastructure is quite different than it is in the US?

ALEX MACKAY: In the case, we have anecdotes, vignettes, one for each continent. And their experience in each continent was actually pretty different. Even within a continent, you’re going to have very different regulatory frameworks for each country. So we sort of pick a few and focus on a few, just to highlight how the experience is very different in different countries. And one thing that’s sort of interesting, in Latin America, we focus on Bogota in Colombia, and what’s sort of interesting there is they launched secretly and they were pretty early on considered to be illegal, but they continue to operate despite the official policy of being illegal in Colombia. And they were able to do that in a way that you may not be able to do it so easily in the United States, just because of the different layers of enforcement and policy considerations that are present in Colombia and not necessarily in the United States. Now, when I talk about the current state of Uber in different countries, this is continually evolving. So they temporarily suspended their operations early in 2020 in Colombia. Now they’re back. This is a continual back and forth game that they’re playing with the regulators in different markets.

BRIAN KENNY: And in a place like Colombia, are they not worried about violence and the potential for violence against their drivers?

ALEX MACKAY: Absolutely. So this is true sort of around the world. I think in certain countries, violence becomes a little bit more of a concern. And what they found in Colombia is they did have more incidents where taxi drivers decided to take things into their own hands and threaten Uber drivers and Uber riders, sometimes with weapons. Another decision Uber had to make that was related to that was whether or not to allow riders to pay in cash. Because in the United States, they’d exclusively used credit cards, but in Latin America and some other countries like India, consumers tended to prefer to use cash to pay, and allowing that sort of opened up this additional risk that Uber didn’t really have a great system in place to protect them from. Because when you go to cash, you’re not able to track every rider quite as easily, and there’s just a bigger chance for fraud or for robbery and that sort of thing popping up.

BRIAN KENNY: Going into Asia was also quite a challenge for them. Can you talk a little bit about some of the challenges they faced, particularly in China?

ALEX MACKAY: They had very different experiences in each country in Asia. China was a unique case that is very fascinating, because when Uber launched there, there were already existing technology-based, you might call them, rideshare companies, that were fairly prominent, Didi and Kuaidi. And these companies later merged to be one company, DiDi, which is huge. It’s on par with Uber in terms of its global presence as a ridesharing company. When Uber launched there, they didn’t fully anticipate all the changes they would have to make in going into a very different environment. In China, besides having established competitors, Google Maps didn’t work, and they sort of relied on that mapping software to do their location services. So they had to completely redo their location services. They also, again, relied on credit cards for payments, and in China, consumers increasingly used apps to do their payments. And this became a little bit of a challenge because the main apps that Chinese customers used, WeChat and Alipay primarily, were actually owned by parent companies of the rival ridesharing company. So Uber had to essentially negotiate with its rivals in order to have consumers pay for their ridesharing services. And so here are a few sort of localization issues that you could argue Uber didn’t fully anticipate when they launched. The other thing about competing in China that’s sort of interesting is that Chinese policy regarding competition is very different from policy in the United States and much of Europe. For the most part, there’s not the traditional antitrust view of protecting the consumers first and foremost. That certainly comes into play, but the Chinese government has other objectives, including promoting domestic firms.
And so if you think about launching into a country where there’s a large established domestic rival, that certainly increases the difficulty of success, because when push comes to shove, the government is likely to come down on the side of your rival, which is the domestic company, and not the foreign entrant.

BRIAN KENNY: Yeah, which is understandable, I guess, to some extent. This sounds exhausting, to be sort of fighting skirmishes on all these fronts in all these different places in the world. How does that affect the morale or tear at the fabric maybe of the culture at a company like Uber, where they’re trying to manage this on a global scale and running into challenges every step of the way?

ALEX MACKAY: It certainly has an effect. I think Uber did a very good job at recruiting teams of people who really wanted to win. And so, if that’s the consistent message you’re sending to your teams, then these challenges may be actually considered somewhat exciting. And so I think by bringing in that sort of person, I think they actually fueled this desire to win in these markets and really kept the momentum going. One of the downsides of this of course is that if you exclusively focus on winning and getting around the existing regulations, there does become this challenge of what’s ethical and what’s not ethical? And in certain business areas, there actually often is a little bit of a gray line. I mean, you can see this outside of ridesharing. It’s a much broader thing to think about, but regulation of pharmaceuticals, regulation of use of new technologies such as drones, often the technology outpaces the regulation by a little bit and there’s this lag in trying to figure out what actually is the right thing to do. I think it’s a fair question whether or not you can disentangle this sort of principled confrontation that’s so pervasive throughout the company culture when it comes to regulation from this principled confrontation of other ethical issues that are not necessarily business driven, and whether or not it’s easy to maintain that separation. And I think that’s a fair question, certainly worthy of debate. But what I think is important is you can set up a company where you are abiding by ethical issues that are very clear, but you’re still going to face challenges on the legal side when you’re developing a new business in an area with new technology.

BRIAN KENNY: That’s a great insight. I mean, I found myself asking myself as I got through the case, I can’t tell if Uber is the victim or the aggressor in all of this. And I guess the answer is they’re a little bit of both.

ALEX MACKAY: Yeah. I think it’s fair to characterize them as an aggressor, and I think you sort of need to be if you want to succeed and if you want to change the world in a new technology area. In some sense, they’re a victim in that we’re all the victim as consumers and as firms of regulations that are sometimes difficult to adapt in real time to changing market conditions. And there’s a good reason why they are sticky over time, but sometimes that can be very costly. Going back to something we talked about earlier, I think there are hardly any consumers that wanted Uber kicked out of New York City. I think everyone realized this was just so much superior to any other option they had, that they were really willing to fight to keep Uber around in the limited ways they could.

BRIAN KENNY: So let’s go back to the central issue in the case then, which is, how important is it to them, in terms of their global strategy, to have a presence in a place like London? They’re still not profitable by the way, we should point that out, that despite the fact that they are the largest in the space, they haven’t turned the corner to profitability yet. I would imagine London’s kind of important.

ALEX MACKAY: Absolutely. London is a key international city, and a presence there is important for Uber’s overall brand. So many people travel through London, and it’s a real benefit for anyone who travels to be able to use the same service at any city you stop in. At the same time, they’re facing these increasing regulatory pressures from London, and so it’s a real question whether or not, 10 years from now, they look substantially different from the established taxi industry that’s there. And you can kind of see this battle playing out across different markets. As another example, in Ghana. When they entered there, they actually entered with a framework for understanding. They helped build the regulations for ridesharing services in Ghana when they entered. But over time, that evolved to additional restrictions as the existing taxi companies pushed back on them. So I think a key lesson here in all of this is that the regulations that you see at any given point in time aren’t absolutely fixed. For anyone starting a technology-based company, there will be regulations that do get created that affect your business. Stepping outside of transportation, we can see that going on now with the big tech firms and sort of the antitrust investigations they’re under. And the policymakers in the US and Europe are really trying to evolve the set of regulations to reflect the different businesses that Apple, Facebook, Microsoft, Google are involved in.

BRIAN KENNY: One thing we haven’t touched on, and it’s not touched on in the case obviously because it just sort of started fairly recently, is the pandemic and the implications of the pandemic for the rideshare industry as fewer people find themselves in need of going anywhere. Have you given any thought to that and whether that’s going to have any effect on the regulations?

ALEX MACKAY: It certainly could. Uber is in a somewhat fortunate position, at least if you judge by their market capitalization, with respect to the pandemic. Initially their stocks took a pretty big hit, but rebounded pretty quickly, and part of this is because the primary part of their business is the transportation through Uber X, but they do also offer the delivery services through Uber Eats, and that business has really picked up during this pandemic. There’s certainly a mix of views about the future, but I think most people do believe that at some point we’ll get back to business as usual, at least for Uber services, when we come up with a vaccine. I think most people anticipate that they’ll be resuming use of Uber once it becomes safe to do so. And I think, to be frank, a lot of people already have resumed using Uber, especially people who don’t have cars or who see it as a valuable alternative or a safer alternative to public transit.

BRIAN KENNY: Yeah, that’s a really good point. And the Uber Eats thing is interesting as another example of how it’s important for businesses to re-imagine the business that they’re in because that, in many ways, may be helping them through a really tough patch here. This has been a really interesting conversation, Alex, I want to ask you one final question, which is, as the students are packing up to leave class, what’s the one thing you want them to take away from the case?

ALEX MACKAY: So I would hope the students take away the importance of regulation in business strategy. And I think the case of Uber really highlights that. And if you look at the conversation around Uber I’d say for the first 10 years of their existence, it was essentially around the superiority of their technology and not so much how they handled regulation. If you think back to the cease-and-desist letter that San Francisco issued in 2010, if Uber had simply stopped operations then, we wouldn’t have the ridesharing world that we have today. So their strategy of principled confrontation with respect to regulation was really essential for their future growth. Again, this does raise important ethical considerations as you’re operating in a legal gray area, but it’s certainly an essential part of strategy.

BRIAN KENNY: Alex, thanks so much for joining us on Cold Call today. It’s been great talking to you.

ALEX MACKAY: Thank you so much, Brian.

BRIAN KENNY: If you enjoy Cold Call, you might like other podcasts on the HBR Presents Network. Whether you’re looking for advice on navigating your career, you want the latest thinking in business and management, or you just want to hear what’s on the minds of Harvard Business School professors, the HBR Presents Network has a podcast for you. Find them on Apple podcasts or wherever you listen. I’m your host, Brian Kenny, and you’ve been listening to Cold Call, an official podcast of Harvard Business School on the HBR Presents Network.

Computers, Materials & Continua
DOI:10.32604/cmc.2021.014922
Article

Computing the User Experience via Big Data Analysis: A Case of Uber Services

Jang Hyun Kim 1,2, Dongyan Nan 1,*, Yerin Kim 2 and Min Hyung Park 2

1 Department of Interaction Science/Department of Human-Artificial Intelligence Interaction, Sungkyunkwan University, Seoul, 03063, Korea
2 Department of Applied Artificial Intelligence/Department of Human-Artificial Intelligence Interaction, Sungkyunkwan University, Seoul, 03063, Korea
* Corresponding Author: Dongyan Nan. Email: [email protected]
Received: 27 October 2020; Accepted: 01 January 2021

Abstract: As of 2020, the issue of user satisfaction has generated a significant amount of interest. Therefore, we employ a big data approach for exploring user satisfaction among Uber users. We develop a research model of user satisfaction by expanding the list of user experience (UX) elements (i.e., pragmatic, expectation confirmation, hedonic, and burden) by including more elements, namely: risk, cost, promotion, anxiety, sadness, and anger. Subsequently, we collect 125,768 comments from online reviews of Uber services and perform a sentiment analysis to extract the UX elements. The results of a regression analysis reveal the following: hedonic, promotion, and pragmatic significantly and positively affect user satisfaction, while burden, cost, and risk have a substantial negative influence. However, the influence of expectation confirmation on user satisfaction is not supported. Moreover, sadness, anxiety, and anger are positively related to the perceived risk of users. Compared with sadness and anxiety, anger has a more important role in increasing the perceived burden of users. Based on these findings, we also provide some theoretical implications for future UX literature and some core suggestions related to establishing strategies for Uber and similar services. The proposed big data approach may be utilized in other UX studies in the future.

Keywords: User satisfaction; user experience; big data; sentiment analysis; Uber

1  Introduction

Founded in 2009, Uber Technologies, Inc. aimed to improve the efficiency of taxi services in major cities in the United States [ 1 ]. Suggesting the “sharing economy” model as a solution, the company has quickly emerged as a major innovative disruptor of the traditional transportation market [ 2 ]. However, Uber is no longer a maverick of the market. Operating in 700 cities across 84 countries [ 3 ], the estimated value of the company has surpassed 70 billion dollars, thus far the highest value for a privately owned technology company in the world [ 1 ]. Furthermore, the sharing economy model has now become a new norm [ 1 ].

Following Uber, similar business models were introduced in different parts of the world, such as DiDi in China, Grab in South East Asia, and Ola in India [ 1 ]. Although these services are also growing, owing to their strength in the local context, Uber continues to reign as the global market leader. Moreover, along with its rapid growth in size and value, Uber is further developing its business model. The company is now expanding to new domains such as delivery services. Additionally, Uber is investing in innovative technologies such as autonomous vehicles [ 1 ].

As a technology company, Uber has made several attempts to advance its service using data science and machine learning approaches. However, these have mostly focused on the technical aspects of the service, such as estimating the consumer surplus [ 4 ], while only limited efforts have been made to understand the user experience (UX) of Uber services. From the perspective of a service provider, analyzing the UX elements and enhancing user satisfaction are vital for providing a successful service [ 5 – 7 ]. Therefore, UX elements can be useful for establishing business strategies and operations.

An effective way of examining UX is the analysis of online reviews made by users. According to Jang et al. [ 8 ], analyzing user-oriented datasets can serve as the basis for improving certain services or products. Among the different kinds of user-oriented datasets, online reviews are easily accessible and reveal various perceptions and feelings toward a service or product [ 7 , 8 ]. Consequently, we attempt to compute user satisfaction from UX elements using large-scale online review datasets of Uber.

2  Literature Review

2.1 Prior Research on Uber and Similar Services

As of 2020, several researchers have attempted to explore the contribution of UX or user perceptions toward the diffusion of Uber and similar services. Min et al. [ 9 ] proposed a model for Uber by integrating the technology acceptance model and innovation diffusion theory. They collected data from 336 Uber users and demonstrated that relative advantage and compatibility notably influenced the intentions of people to use Uber via usefulness and ease of use. Lee et al. [ 10 ] examined survey data from 295 participants by using structural equation modeling and indicated that perceived benefit, risk, trust, and platform quality considerably affected the intentions of individuals to adopt Uber. By computing validated data from 443 responses, Ma et al. [ 11 ] revealed that perceived physical risk and trust in the driver notably affected the intention of an individual to discontinue the use of DiDi. Guo et al. [ 12 ] examined data from 307 samples and concluded that the intention to use DiDi could be determined based on institutional and calculative-based trusts.

Most of the studies based on Uber and similar services have explored UX by analyzing a limited (fewer than 1,000) number of samples. Thus, in the context of Uber services, we attempt to use a big data approach to explore user satisfaction, which may be strongly related to the continuance intention and loyalty of an individual [ 13 – 16 ]. The proposed big data approach aims to address the limitations of the prior studies on Uber and similar services.

2.2 Satisfaction and UX Elements

We interpret the term “satisfaction” as the post-assessment by an individual of the overall experience of using Uber services [ 8 , 13 ]. Jang et al. [ 8 ] reported that individual satisfaction was influenced by several UX elements (i.e., pragmatic, hedonic, burden, and expectation confirmation). Consequently, we attempt to utilize UX elements for predicting the user satisfaction with Uber services.

2.2.1 Pragmatic

We conceptualize “pragmatic” as the evaluation by a user regarding the level of usefulness and ease of use of Uber services [ 8 ]. Some studies have demonstrated that pragmatic is notably related to user satisfaction with specific products and services [ 7 , 8 ]. That is, satisfaction among users will increase if they feel that utilizing certain services is beneficial and does not require considerable effort [ 13 , 17 , 18 ]. Hence, we hypothesize the following relation:

H1: Pragmatic is significantly related to satisfaction with Uber services.

2.2.2 Expectation Confirmation

We interpret “expectation confirmation” as the degree to which the performance of the Uber service meets user expectations [ 8 , 19 ]. Several researchers have noted a connection between expectation confirmation and user satisfaction in various domains [ 7 , 8 ]. As implied by the expectation–confirmation theory, users are satisfied with services when they feel that the experience of using the services meets their expectations [ 13 ]. Thus, we hypothesize the following:

H2: Expectation confirmation is notably related to satisfaction with Uber services.

2.2.3 Hedonic

We define “hedonic” as the degree of pleasantness of using Uber services [ 8 ]. Several scholars have indicated that perceived hedonic is critical for determining the satisfaction of an individual when using certain products and services [ 8 , 20 ]. Therefore, we hypothesize the following:

H3: Hedonic is notably related to satisfaction with Uber services.

2.2.4 Burden

We define “burden” as the degree to which users think that using Uber services can lead to negative emotions [ 8 , 21 ]. Several user-oriented studies have reported that user satisfaction is negatively influenced by the perceived burden [ 7 , 8 ]. Previous studies [ 8 , 22 , 23 ] found that anxiety, anger, and sadness were dimensions of the perceived burden. Thus, negative emotions, such as anxiety, anger, and sadness, among users may notably influence the perceived burden. Accordingly, with respect to Uber services, we hypothesize the following:

H4: Burden is significantly related to satisfaction with Uber services.

H4-1: Anxiety is significantly related to burden.

H4-2: Anger is significantly related to burden.

H4-3: Sadness is significantly related to burden.

2.3 Extending UX Elements to Risk, Cost, and Promotion

To understand user satisfaction more comprehensively, we introduced additional predictors (i.e., risk, cost, and promotion), which were considered important to the research models used in earlier studies on UX [ 24 – 26 ].

2.3.1 Risk

We interpret “risk” as the degree to which users feel apprehensive of the negative consequences of using Uber services [ 27 ]. Several scholars have suggested that the perception of risk be addressed when exploring the UX of services or systems [ 10 , 24 ]. If individuals feel that utilizing specific services will induce negative outcomes, they may develop a negative perspective of using the services. Accordingly, a relationship between risk and user satisfaction has been validated in previous studies [ 28 , 29 ]. Moreover, several researchers have reported that negative emotions (and the consequent feelings) among users toward using a particular service would result in a high perceived risk [ 30 , 31 ]. This is because individuals who experience negative emotions (e.g., anxiety) tend to perceive an ambiguous stimulus as a threat or risk [ 30 , 32 ]. In the same vein, Lin [ 31 ] confirmed a link between negative emotions (e.g., anger and sadness) and risk, using results from 1,000 questionnaires. Consequently, we hypothesize the following:

H5: Risk is significantly related to satisfaction with Uber services.

H5-1: Anxiety is significantly related to risk.

H5-2: Anger is significantly related to risk.

H5-3: Sadness is significantly related to risk.

2.3.2 Cost

Park [ 25 ] suggested that economic concerns be considered when exploring UX, because a user tends to compare products or services in terms of their potential benefits and costs. We define “cost” as the concern associated with the economic costs of utilizing Uber services [ 33 , 34 ].

The notable effect of cost on user satisfaction was verified in several user-oriented studies. Park [ 35 ] analyzed survey data from South Korea and concluded that perceived cost had a negative influence on user satisfaction with airline services. Based on data from 288 Taobao users, Zhu et al. [ 36 ] demonstrated that perceived cost was crucial in influencing user satisfaction. Therefore, we hypothesize the following:

H6: Cost is notably related to satisfaction with Uber services.

2.3.3 Promotion

Buil et al. [ 26 ] identified “promotion” as an important element that could increase brand awareness and accelerate the diffusion of certain products and services. In other words, companies can attract public attention by launching promotional offers such as coupons, discounts, and free gifts. We conceptualize “promotion” as the degree to which individuals think they can obtain rewards by using the Uber service [ 37 ].

Several researchers have demonstrated a significant impact of promotion on the satisfaction of an individual with various services [ 37 , 38 ]. For instance, with respect to mobile catering applications, Wang et al. [ 37 ] reported, based on 196 responses, that perceived promotions have an influence on the level of satisfaction of an individual. Accordingly, we hypothesize the following relation:

H7: Promotion is notably related to satisfaction with Uber services.

2.4 Proposed Model

Based on the abovementioned hypotheses, a model for examining user satisfaction with Uber services was constructed ( Fig. 1 ).

Figure 1: Proposed model

3  Methodologies

3.1 Data Collection

We collected 125,768 textual reviews of the “Uber-Request a ride” application from the Google Play Store, covering the two-year period from September 12, 2018 to September 24, 2020. Reviews that contained only a numerical rating without text were excluded.

3.2 Preprocessing

Before data analysis, we preprocessed all the collected review data. The Linguistic Inquiry and Word Count (LIWC) program employed in this study takes reviews with their stopwords and punctuation intact, so neither was removed. We therefore focused on tokenization and accurate lemmatization of words. First, we split the reviews into sentences so that part-of-speech (POS) tagging would be accurate, which in turn supports accurate lemmatization. Subsequently, all the tokens were lemmatized using the resulting POS tags.
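
The order of these preprocessing steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the tiny POS lookup table, the irregular-verb table, and the suffix rules are hypothetical stand-ins for the trained tagger and lemmatizer that a real pipeline would use, and the tools actually used in the study are not named in the text.

```python
import re

# Hypothetical toy resources (assumptions for illustration only); a real
# pipeline would use a trained POS tagger and a full lemmatization dictionary.
TOY_POS = {"drivers": "NOUN", "rides": "NOUN", "were": "VERB",
           "waiting": "VERB", "cancelled": "VERB"}
IRREGULAR = {("were", "VERB"): "be", ("was", "VERB"): "be"}

def split_sentences(text):
    # Split after ., ! or ? followed by whitespace; punctuation is kept.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Words (optionally with an apostrophe) and single punctuation marks.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", sentence)

def pos_tag(tokens):
    # Look each token up in the toy table; unknown tokens get the tag OTHER.
    return [(tok, TOY_POS.get(tok.lower(), "OTHER")) for tok in tokens]

def lemmatize(token, tag):
    # The POS tag decides which (crude) rule applies to the token.
    word = token.lower()
    if (word, tag) in IRREGULAR:
        return IRREGULAR[(word, tag)]
    if tag == "VERB":
        for suffix in ("ing", "ed"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                stem = word[: -len(suffix)]
                # Undo consonant doubling, e.g. "cancelled" -> "cancel".
                if len(stem) > 2 and stem[-1] == stem[-2]:
                    stem = stem[:-1]
                return stem
    if tag == "NOUN" and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(review):
    # Sentence split -> tokenize -> POS tag -> lemmatize with the tag.
    return [[lemmatize(tok, tag) for tok, tag in pos_tag(tokenize(sent))]
            for sent in split_sentences(review)]

lemmas = preprocess("The drivers were waiting. Two rides cancelled!")
```

Note that stopwords ("the") and punctuation survive the pipeline, in line with the requirement described above.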

3.3 Measurements

Based on earlier studies [ 8 , 39 – 42 ], we propose that employing sentiment analysis can be an effective way to investigate user perspectives on Uber services. This is because the hedonic and pragmatic dimensions of UX on certain services or products can be computed using online user reviews by performing sentiment analysis with the LIWC program [ 7 , 8 , 40 , 43 ].

Thus, to enumerate cognition-related statements and words, we utilized the LIWC program to perform sentiment analysis on the preprocessed online reviews [ 22 , 23 , 44 ]. Subsequently, the results of the sentiment analysis were categorized based on the classification guidelines of UX elements from prior studies [ 7 , 8 ]. Accordingly, we used LIWC categories to measure pragmatic, hedonic, burden, expectation confirmation, risk, cost, promotion, anxiety, anger, and sadness. Assuming that the rating score represents the degree of user satisfaction with specific products or services, the user satisfaction with Uber services was measured by the rating score from the reviews [ 8 , 45 ]. The measurement details are presented in Tab. 1 .
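
LIWC reports, for each category, the percentage of words in a text that belong to that category. The sketch below illustrates this kind of scoring in Python. The mini-dictionary is a hypothetical assumption made for the example: the real LIWC dictionaries are proprietary and far larger, and the categories actually mapped to each UX element are those listed in Tab. 1.

```python
import re

# Hypothetical mini-dictionary (an assumption for illustration); it is not
# the LIWC dictionary and not the mapping used in the study.
TOY_CATEGORIES = {
    "anxiety": {"worried", "nervous", "afraid"},
    "anger": {"angry", "hate", "annoyed"},
    "sadness": {"sad", "disappointed", "cry"},
}

def category_percentages(review, categories=TOY_CATEGORIES):
    """Return, per category, the percentage of the review's words in it."""
    words = re.findall(r"[a-z']+", review.lower())
    if not words:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(word in vocab for word in words) / len(words)
        for name, vocab in categories.items()
    }

scores = category_percentages("I hate waiting and I am worried about the fare")
```

Each review then yields one score per element, which can be placed next to the review's star rating as one row of the regression data.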

Table 1: Outline of the measurements

4  Results

4.1 Descriptive Information

The results of the descriptive analysis are presented in Tab. 2 .

Table 2: Descriptive information of the elements

4.2 Testing of the Hypotheses

We performed a regression analysis to test our hypotheses. To detect multicollinearity in the research model, variance inflation factor (VIF) values were computed. As all the VIF values were considerably less than 10, we concluded that multicollinearity was minimal [ 46 ], which supports the validity of the research model.
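
For readers unfamiliar with the VIF check, the following Python sketch shows how it can be computed: each predictor is regressed on the remaining predictors, and VIF = 1 / (1 - R²) of that auxiliary regression. The predictor names and values are hypothetical and are not the study's data; the solver is a plain Gaussian elimination written out only so that the example needs no third-party library.

```python
from statistics import mean

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def r_squared(y, xs):
    """R^2 of an ordinary least squares fit of y on columns xs plus intercept."""
    n = len(y)
    cols = [[1.0] * n] + xs
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(ci[k] * cj[k] for k in range(n)) for cj in cols] for ci in cols]
    xty = [sum(ci[k] * y[k] for k in range(n)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[k] for b, c in zip(beta, cols)) for k in range(n)]
    y_bar = mean(y)
    ss_res = sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))
    ss_tot = sum((yk - y_bar) ** 2 for yk in y)
    return 1.0 - ss_res / ss_tot

def vif(predictors):
    """Map each predictor name to its VIF (fails if predictors are perfectly collinear)."""
    result = {}
    for name, column in predictors.items():
        others = [c for other, c in predictors.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(column, others))
    return result

# Hypothetical scores: the first pair is uncorrelated, the second nearly collinear.
independent = vif({"hedonic": [1, 2, 3, 4], "burden": [1, -1, -1, 1]})
collinear = vif({"risk": [1, 2, 3, 4, 5], "cost": [2, 4, 6, 8, 11]})
```

Uncorrelated predictors give VIF values of about 1, whereas the nearly collinear pair produces values far above the threshold of 10 mentioned above.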

Figure 2: Testing of the hypotheses. **: p < 0.001, *: p < 0.01

5  Discussion and Conclusions

This study is one of the first attempts to investigate user satisfaction with Uber services via a big data approach. Previous studies [ 7 , 8 ] that used big data approaches to examine UX elements did not consider risk, cost, promotion, anxiety, sadness, or anger; this study contributes by adding these factors to the UX element framework. The extended framework and the methodologies used in this study may be applied to explore user satisfaction with other services or products. Furthermore, the proposed approach addresses the small number of samples or participants that limited most previous studies on UX [ 9 , 10 , 20 ]. We also provide several theoretical and managerial implications.

5.1 Theoretical and Practical Implications

The results of the regression analysis showed hedonic to be the most significant predictor of user satisfaction. This means that users will be satisfied when they enjoy using a service. Burden also strongly affects user satisfaction: if users feel burdened while using a service, their satisfaction will decrease. Overall, these results indicate that the positive/negative emotions of users can strongly influence their levels of satisfaction.

Both pragmatic and promotion significantly and positively influence user satisfaction. Hence, when users realize the pragmatic value of a service and receive coupons or free gifts upon using the service, their satisfaction will be enhanced.

Cost was found to exert a negative influence on user satisfaction. This indicates that if the monetary cost of using a service becomes extremely high or exceeds the expected range, the degree of satisfaction with the service may be reduced.

Risk was revealed to have a negative influence on user satisfaction, meaning that users may be dissatisfied if they perceive that using a service will induce negative outcomes.

Expectation confirmation did not considerably affect user satisfaction. However, some studies [ 13 , 25 ] stated that expectation confirmation tended to indirectly influence user satisfaction. Consequently, had the indirect influence of expectation confirmation been considered, the results of our study might have been different.

Burden is strongly affected by anxiety, sadness, and anger. Interestingly, anger plays a crucial role in increasing the perceived burden of users toward the service. This means that users are more likely to feel burdened and dissatisfied when they feel angry while using a service compared with times when they feel sad or anxious.

Risk can be increased by negative emotions such as anxiety, sadness, and anger. However, the impact of anger on risk is relatively small compared with that of anxiety and sadness. This means that in comparison with times when users feel angry, they are more likely to experience uncertainty about the negative outcomes of using a service when they feel anxious or sad.

In terms of practical implications, our findings can serve as guidelines for the stakeholders involved in services similar to Uber. Companies should focus on methods to increase the hedonic value perceived by users and to decrease the burden they perceive. Other UX elements (e.g., pragmatic, risk, cost, and promotion) have a weak influence on user satisfaction. Thus, when human and financial resources are limited [ 20 ], companies should, from the perspective of UX, prioritize enhancing the hedonic value of their services.

5.2 Limitations and Suggestions

This study has the following limitations, which should be addressed in further research.

First, this study did not consider other significant variables that may be related to UX. For instance, the demographic characteristics of users can notably affect their satisfaction with specific services or systems. Second, this study was conducted using review comments (data) written in the English language, and thus it can only be applied to the English-speaking users of Uber. The same research conducted in different languages may lead to different results. Hence, performing further studies on Uber users speaking different languages is suggested. Moreover, future studies could also include different car sharing services in other countries, such as DiDi, Grab, and Ola.

Funding Statement: This work was supported by a National Research Foundation of Korea (NRF) (http://nrf.re.kr/eng/index) grant funded by the Korean government (NRF-2020R1A2C1014957).

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.

References

1. G. Dudley, D. Banister and T. Schwanen. (2017). “The rise of Uber and regulating the disruptive innovator,” Political Quarterly, vol. 88, no. 3, pp. 492–499.

2. S. Jiang, L. Chen, A. Mislove and C. Wilson. (2018). “On ridesharing competition and accessibility: Evidence from Uber, Lyft, and Taxi,” in Proc. of the 2018 World Wide Web Conf., Republic and Canton of Geneva, CHE, pp. 863–872.

3. K. Thelen. (2018). “Regulating Uber: The politics of the platform economy in Europe and the United States,” Perspectives on Politics, vol. 16, no. 4, pp. 938–953.

4. J. D. Hall, C. Palsson and J. Price. (2018). “Is Uber a substitute or complement for public transit?,” Journal of Urban Economics, vol. 108, pp. 36–50.

5. Z. Deng, Y. Lu, K. K. Wei and J. Zhang. (2010). “Understanding customer satisfaction and loyalty: An empirical study of mobile instant messages in China,” International Journal of Information Management, vol. 30, no. 4, pp. 289–300.

6. E. Goodman, M. Kuniavsky and A. Moed. (2013). “Observing the user experience: A practitioner’s guide to user research,” IEEE Transactions on Professional Communication, vol. 56, no. 3, pp. 260–261.

7. E. Park. (2019). “Motivations for customer revisit behavior in online review comments: Analyzing the role of user experience using big data approaches,” Journal of Retailing and Consumer Services, vol. 51, pp. 14–18.

8. J. Jang and M. Y. Yi. (2017). “Modeling user satisfaction from the extraction of user experience elements in online product reviews,” in Proc. of the 2017 CHI Conf. Extended Abstracts on Human Factors in Computing Systems, New York, NY, USA, pp. 1718–1725.

9. S. Min, K. K. F. So and M. Jeong. (2019). “Consumer adoption of the Uber mobile application: Insights from diffusion of innovation theory and technology acceptance model,” Journal of Travel & Tourism Marketing, vol. 36, no. 7, pp. 770–783.

10. Z. W. Lee, T. K. Chan, M. S. Balaji and A. Y. L. Chong. (2018). “Why people participate in the sharing economy: An empirical investigation of Uber,” Internet Research, vol. 28, no. 3, pp. 829–850.

11. L. Ma, X. Zhang, X. Ding and G. Wang. (2019). “Risk perception and intention to discontinue use of ride-hailing services in China: Taking the example of DiDi Chuxing,” Transportation Research Part F: Traffic Psychology and Behaviour, vol. 66, pp. 459–470.

12. J. Guo, J. Lin and L. Li. (2020). “Building users’ intention to participate in a sharing economy with institutional and calculative mechanisms: An empirical investigation of DiDi in China,” Information Technology for Development, pp. 1–25. https://doi.org/10.1080/02681102.2020.1807894.

13. A. Bhattacherjee. (2001). “Understanding information systems continuance: An expectation-confirmation model,” MIS Quarterly, vol. 25, no. 3, pp. 351–370.

14. S. Y. Lam, V. Shankar, M. K. Erramilli and B. Murthy. (2004). “Customer value, satisfaction, loyalty, and switching costs: An illustration from a business-to-business service context,” Journal of the Academy of Marketing Science, vol. 32, no. 3, pp. 293–311.

15. C. H. Kwon, D. H. Jo and H. G. Mariano. (2020). “Exploring the determinants of relationship quality in retail banking services,” KSII Transactions on Internet and Information Systems, vol. 14, no. 8, pp. 3457–3472.

16. N. On, G. M. Ryu, M. J. Koh, J. R. Lee and N. G. Kim. (2020). “An empirical study on the intention to reuse computational science and engineering platforms: A case study of EDISON,” KSII Transactions on Internet and Information Systems, vol. 14, no. 8, pp. 3437–3456.

17. F. D. Davis. (1989). “Perceived usefulness, perceived ease of use, and user acceptance of information technology,” MIS Quarterly, vol. 13, no. 3, pp. 319–340.

18. Y. J. Joo, H. J. So and N. H. Kim. (2018). “Examination of relationships among students’ self-determination, technology acceptance, satisfaction, and continuance intention to use K-MOOCs,” Computers & Education, vol. 122, pp. 260–272.

19. C. L. Hsu and J. C. C. Lin. (2015). “What drives purchase intention for paid mobile apps? – An expectation confirmation model with perceived value,” Electronic Commerce Research and Applications, vol. 14, no. 1, pp. 46–57.

20. D. Nan, Y. Kim, M. H. Park and J. H. Kim. (2020). “What motivates users to keep using social mobile payments?,” Sustainability, vol. 12, no. 17, pp. 678.

21. H. Suh, N. Shahriaree, E. B. Hekler and J. A. Kientz. (2016). “Developing and validating the user burden scale: A tool for assessing user burden in computing systems,” in Proc. of the 2016 CHI Conf. on Human Factors in Computing Systems, New York, NY, USA, pp. 3988–3999.

22. Y. R. Tausczik and J. W. Pennebaker. (2010). “The psychological meaning of words: LIWC and computerized text analysis methods,” Journal of Language and Social Psychology, vol. 29, no. 1, pp. 24–54.

23. J. W. Pennebaker, R. L. Boyd, K. Jordan and K. Blackburn. (2015). “The development and psychometric properties of LIWC2015,” [Online]. Available: http://hdl.handle.net/2152/31333.

24. J. Yi, G. Yuan and C. Yoo. (2020). “The effect of the perceived risk on the adoption of the sharing economy in the tourism industry: The case of Airbnb,” Information Processing & Management, vol. 57, no. 1, pp. 102108.

25. E. Park. (2020). “User acceptance of smart wearable devices: An expectation-confirmation model approach,” Telematics and Informatics, vol. 47, pp. 101318.

26. I. Buil, L. De Chernatony and E. Martínez. (2013). “Examining the role of advertising and sales promotions in brand equity creation,” Journal of Business Research, vol. 66, no. 1, pp. 115–122.

27. C. Phonthanukitithaworn, C. Sellitto and M. W. L. Fong. (2015). “User intentions to adopt mobile payment services: A study of early adopters in Thailand,” Journal of Internet Banking and Commerce, vol. 20, no. 1, pp. 1–29.

28. A. K. Kar. (2020). “What affects usage satisfaction in mobile payments? Modelling user generated content to develop the ‘digital service usage satisfaction model’,” Information Systems Frontiers, pp. 1–21. https://doi.org/10.1007/s10796-020-10045-0.

29. Y. Chen, X. Yan, W. Fan and M. Gordon. (2015). “The joint moderating role of trust propensity and gender on consumers’ online shopping behavior,” Computers in Human Behavior, vol. 43, pp. 272–283.

30. H. Sang and J. Cheng. (2020). “Effects of perceived risk and patient anxiety on intention to use community healthcare services in a big modern city,” SAGE Open, vol. 10, no. 2, pp. 2158244020933604.

31. W. B. Lin. (2008). “Investigation on the model of consumers’ perceived risk – Integrated viewpoint,” Expert Systems with Applications, vol. 34, no. 2, pp. 977–988.

32. N. Derakshan and M. W. Eysenck. (1997). “Interpretive biases for one’s own behavior and physiology in high-trait-anxious individuals and repressors,” Journal of Personality and Social Psychology, vol. 73, no. 4, pp. 816–825.

33. T. T. T. Pham and J. C. Ho. (2015). “The effects of product-related, personal-related factors and attractiveness of alternatives on consumer adoption of NFC-based mobile payments,” Technology in Society, vol. 43, pp. 159–172.

34. D. H. Shin. (2009). “Determinants of customer acceptance of multi-service network: An implication for IP-based technologies,” Information & Management, vol. 46, no. 1, pp. 16–22.

35. E. Park. (2019). “The role of satisfaction on customer reuse to airline services: An application of big data approaches,” Journal of Retailing and Consumer Services, vol. 47, pp. 370–374.

36. D. H. Zhu, Y. P. Chang and A. Chang. (2015). “Effects of free gifts with purchase on online purchase satisfaction,” Internet Research, vol. 25, no. 5, pp. 690–706.

37. Y. S. Wang, T. H. Tseng, W. T. Wang, Y. W. Shih and P. Y. Chan. (2019). “Developing and validating a mobile catering app success model,” International Journal of Hospitality Management, vol. 77, pp. 19–30.

38. W. H. Kim, J. L. Cho and K. S. Kim. (2019). “The relationships of wine promotion, customer satisfaction, and behavioral intention: The moderating roles of customers’ gender and age,” Journal of Hospitality and Tourism Management, vol. 39, pp. 212–218.

39. J. Jung, P. Petkanic, D. Nan and J. H. Kim. (2020). “When a girl awakened the world: A user and social message analysis of Greta Thunberg,” Sustainability, vol. 12, no. 7, pp. 2707.

40. Y. Wang, F. Subhan, S. Shamshirband, M. Z. Asghar, I. Ullah et al. (2020). “Fuzzy-based sentiment analysis system for analyzing student feedback and satisfaction,” Computers, Materials & Continua, vol. 62, no. 2, pp. 631–655.

41. J. Kim and N. Moon. (2019). “Rating and comments mining using TF-IDF and SO-PMI for improved priority ratings,” KSII Transactions on Internet and Information Systems, vol. 13, no. 11, pp. 5321–5334.

42. Y. Zhang, J. Cheng, Y. Yang, H. Li, X. Zheng et al. (2020). “Covid-19 public opinion and emotion monitoring system based on time series thermal new word mining,” Computers, Materials & Continua, vol. 64, no. 3, pp. 1415–1434.

43. J. Kim, K. Bae, E. Park and A. P. Del Pobil. (2019). “Who will subscribe to my streaming channel? The case of Twitch,” in Conf. Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing, New York, NY, USA, pp. 247–251.

44. A. N. Tuch, R. Trusell and K. Hornbæk. (2013). “Analyzing users’ narratives to understand experience with interactive products,” in Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, New York, NY, USA, pp. 2079–2088.

45. E. O. Park, B. K. Chae, J. Kwon and W. H. Kim. (2020). “The effects of green restaurant attributes on customer satisfaction using the structural topic model on online customer reviews,” Sustainability, vol. 12, no. 7, pp. 2843.

46. R. M. O’Brien. (2007). “A caution regarding rules of thumb for variance inflation factors,” Quality & Quantity, vol. 41, no. 5, pp. 673–690.

This work is licensed under terms that permit unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Case Study: How Uber employees use 20x more data in decision-making

Uber’s analytics team was flooded with requests from Operations Managers on how they could explore important data sources.

Although reports and dashboards were available, Operations Managers at Uber knew that the best and fastest decisions could only be made by exploring the data. Uber tried educating the professionals through meetings, coaching and how-to guides, but this was not enough.

Uber needed a solution to make its global workforce data-driven at scale. Through hands-on upskilling on our platform, thousands of Uber employees now use data in their daily work. Operations, Marketing and Product teams use it for planning and decision-making.

Download the case study and learn more about Uber’s journey to data-driven decision making.

Uber managed to upskill over 24,000 employees in data-driven decision making. Learn more about how we made this happen.

The Big Problem with Uber’s Big Data: Ethics and Regulation of Data Ownership

“Technology is neither good nor bad; nor is it neutral” (Kranzberg 1986, p. 545)

That is why it is key to understand how we, as users and moderators, give additional meaning to technological features and participate in the complex chain of interactions they bring across society. Living at the very beginning of the era of “Big Data”, it is pressing to address the cultural and ethical implications of a phenomenon often idolised and seen as a universal answer by many in the business and scientific spheres (Boyd & Crawford 2012). Who should have access to big datasets? Should the use of Big Data be regulated, and how should the privacy concerns that come with mass information collection be addressed? Using the work of Danah Boyd and Kate Crawford, “Critical Questions for Big Data”, we will analyse how gig economy companies, taking Uber as an example, are handling Big Data and why this has caused a series of ethical controversies and recent legislative action.

Questioning Big Data

Boyd and Crawford define Big Data as a “cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology” (Boyd & Crawford 2012, p.663). This definition breaks away from the understanding that Big Data is just a dataset too large for human comprehension and transforms the term into a more complex phenomenon of social, not scientific, origins. Consequently, this leaves room for theorising and critiquing the role of Big Data in many social shifts. Boyd and Crawford develop six provocative claims about the influence of this phenomenon, three of which have a specific relevance to the case of Uber and will serve as the theoretical backbone of analysis – evaluating how Big Data changed the definition of work but created new ethical issues and failed to deliver on its promise of objectivity. 

Big Data Changes Definitions

Big Data is at the core of Uber’s business model. It collects, analyses, and stores huge amounts of information that is later used to fuel the algorithms of the platform and produce an “optimised” (according to pre-determined criteria in the AI) personalised service. With these abilities, Uber gained an extraordinary market-breaking advantage in the ride hailing industry (Rogers 2015). Most importantly, it redefined how “work” is perceived by introducing the “on-demand digital independent contractor” (Malin & Chandler 2017) model. 

Big Data gave Uber enough power and agency to be able to attract workers with its ease-of-use and escape the classic employee-employer relationship, defining itself as a data-powered platform that serves as a mediator between drivers and consumers (Wilhelm 2018). With this position, Uber solely relies on Big Data and the algorithms that collect and use it to balance the complex relationship between service providers and customers, an approach that seems heavily technologically deterministic. Nevertheless, for good or bad, Uber and the data-powered gig economy have irreversibly changed the way people define work in the service industry, to the point that “app workers” account for the majority of the ride-hailing and delivery labour force (Malin & Chandler 2017).

Just Because It Is Accessible Does Not Make It Ethical

Boyd and Crawford make the important point that Big Data can produce “destabilising amounts of knowledge and information that lack the regulating force” (Boyd & Crawford 2012, p.666). Uber is experiencing this effect more and more recently with a growing amount of legislative action taken against the company’s data collection policies and lack of algorithmic transparency. The ethics of data ownership and availability have become the “next frontier in the fight for gig workers’ rights” (Clarke 2021). 

As Uber drivers are considered independent contractors and not employees, the company has not deemed it necessary to share with its workers the data it collects about their work or how that data influences the algorithm’s assessment of individual workers. Drivers also have no way to retrieve their personal data, to erase it, or to migrate it if they decide to start working for a competing platform (although the GigCV initiative is currently trying to make the latter possible).

The ethics behind data ownership in the gig economy is a heavily disputed topic, but recent court decisions are turning the debate in favour of workers (Reshaping Work 2021). In a landmark case of March 2021, Amsterdam’s District Court ruled that Uber must disclose “data used to deduct earnings, assign work, and suspend drivers” and also shed light on how driver surveillance systems are used in the Netherlands (Ongweso Jr 2021). Similar rulings across Europe suggest that the debate around regulating Big Data is more a question of “when” and “how” than of “if” at this point.

Claims to Objectivity and Accuracy Are Misleading 

The Uber algorithm takes many aspects into account when allocating work to its drivers: work performance, previous interactions with customer service, customer ratings, cancellation rate, completion rate, earnings profile, and fraud probability score, among others (Clarke 2021). However, nobody truly knows the exact extent of data collection or the way algorithms utilise this information. Uber is notoriously reluctant to share such data with researchers, policymakers, or the public. Nevertheless, there are jurisdictions where Uber has been legally forced to provide certain datasets to data scientists, most notably in Chicago. This led to the discovery of bias and racial discrimination in the company’s dynamic pricing algorithms in a study of over 68 million Uber rides in Chicago (Wiggers 2020). Critiquing Big Data with a study based on big datasets is exactly the kind of self-reflexivity that is often lacking in the scientific community (Boyd & Crawford 2012), but this gap can also be explained by the lack of openly accessible datasets, which makes a study covering a larger territory impossible.

We Are Our Tools

There is a “deep industrial drive toward gathering and extracting maximal value from data” (Boyd & Crawford 2012) and that is not inherently negative. However, we should remain mindful and question the ethical implications of this new data-driven society. As the example of Uber showed, Big Data is not a magical universal solution, and its flawed collection and interpretation can cause serious social divides and issues. “We are our tools” (Boyd & Crawford 2012, p.675) and we should be aware of, and responsible for, the consequences they cause.

DataFlair

5 Big Data Case Studies – How big companies use Big Data

Undoubtedly, Big Data has become a game-changer in most modern industries over the last few years. As Big Data continues to permeate our day-to-day lives, the number of companies adopting it continues to increase.

Let us see how Big Data helped them perform in the market with these 5 big data case studies.

Top 5 Big Data Case Studies

Following are the interesting big data case studies –

1. Big Data Case Study – Walmart

Walmart is the largest retailer in the world and the world’s largest company by revenue, with more than 2 million employees and 20,000 stores in 28 countries. It started making use of big data analytics long before the term Big Data came into the picture.

Walmart uses Data Mining to discover patterns that can be used to provide product recommendations to the user, based on which products were bought together.
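
The idea of recommending products that are bought together can be illustrated with a simple co-purchase count. The following Python sketch is a toy example under stated assumptions: the baskets, the function names, and the counting approach are hypothetical and do not describe Walmart's actual system.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "butter") and ("butter", "bread") one key.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(item, baskets, top_n=3):
    """Return the products most often bought together with `item`."""
    scores = Counter()
    for (a, b), n in co_purchase_counts(baskets).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]

# Hypothetical purchase history.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "jam"],
]
best_match = recommend("bread", baskets, top_n=1)
```

Here butter appears with bread in three of the four baskets, so it is the first recommendation for a customer who buys bread.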

By applying effective data mining, Walmart has increased its customer conversion rate. It has been accelerating its big data analysis to provide best-in-class e-commerce technologies, with the aim of delivering a superior customer experience.

The main objective of holding big data at Walmart is to optimize the shopping experience of customers when they are in a Walmart store.

Big data solutions at Walmart are developed with the intent of redesigning global websites and building innovative applications to customize the shopping experience for customers whilst increasing logistics efficiency.

Hadoop and NoSQL technologies are used to provide internal customers with access to real-time data collected from different sources and centralized for effective use.

2. Big Data Case Study – Uber

Uber is the first choice for people around the world when they think of moving people and making deliveries. It uses the personal data of its users to closely monitor which features of the service are used most, to analyze usage patterns, and to determine where the service should focus.

Uber tracks the supply of and demand for its services, and prices change accordingly. One of Uber’s biggest uses of data is therefore surge pricing. For instance, if you are running late for an appointment and book a cab in a crowded place, you must be ready to pay twice the usual amount.

For example, on New Year’s Eve, the price of driving one mile can go from 200 to 1000. In the short term, surge pricing affects the rate of demand, while in the long term it could be the key to retaining or losing customers. Machine learning algorithms are used to determine where demand is strong.
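
To make the supply-and-demand idea concrete, here is a toy Python sketch of a surge multiplier. It is purely illustrative: the formula (ratio of open requests to available drivers, clamped between 1 and a cap), the function names, and the numbers are assumptions for this example and are not Uber's actual pricing algorithm.

```python
def surge_multiplier(open_requests, available_drivers, cap=5.0):
    """Toy multiplier: the demand/supply ratio, never below 1 or above `cap`."""
    if available_drivers <= 0:
        # No supply at all: charge the cap if anyone is asking for a ride.
        return cap if open_requests > 0 else 1.0
    ratio = open_requests / available_drivers
    return max(1.0, min(cap, ratio))

def surge_fare(base_fare, open_requests, available_drivers, cap=5.0):
    """Apply the toy multiplier to a base fare, rounded to two decimals."""
    multiplier = surge_multiplier(open_requests, available_drivers, cap)
    return round(base_fare * multiplier, 2)

# Hypothetical situation: 40 open requests but only 20 available drivers.
busy_fare = surge_fare(12.50, open_requests=40, available_drivers=20)
quiet_fare = surge_fare(12.50, open_requests=10, available_drivers=20)
```

With twice as many requests as drivers the fare doubles, while in the quiet case the multiplier stays at 1 and the base fare is unchanged.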

3. Big Data Case Study – Netflix

Netflix is a popular American entertainment company specializing in online on-demand video streaming for its customers.

Netflix has been determined to be able to predict what exactly its customers will enjoy watching with Big Data. As such, Big Data analytics is the fuel that fires the ‘recommendation engine’ designed to serve this purpose. More recently, Netflix started positioning itself as a content creator, not just a distribution method.

Unsurprisingly, this strategy has been firmly driven by data. Netflix’s recommendation engines and new content decisions are fed by data points such as which titles customers watch, how often playback is stopped, and what ratings are given. The company’s data infrastructure includes Hadoop, Hive, and Pig, along with more traditional business intelligence tools.

Netflix shows us that companies can understand what customers want when they base their decisions on Big Data rather than on assumptions.

4. Big Data Case Study – eBay

A big technical challenge for eBay, as a data-intensive business, is to exploit a system that can rapidly analyze and act on data as it arrives (streaming data). There are many rapidly evolving methods to support streaming data analysis.

eBay is working with several tools, including Apache Spark, Storm, and Kafka. These allow the company’s data analysts to search for information tags that have been associated with the data (metadata) and to make the data consumable by as many people as possible with the right level of security and permissions (data governance).

The company has been at the forefront of using big data solutions and actively contributes its knowledge back to the open-source community.

5. Big Data Case Study – Procter & Gamble

Procter & Gamble, whose products many of us use two or three times a day, is a 179-year-old company. It has recognized the potential of Big Data and put it to use in business units around the globe. P&G has put a strong emphasis on using big data to make better, smarter, real-time business decisions.

The Global Business Services organization has developed tools, systems, and processes to provide managers with direct access to the latest data and advanced analytics. As a result, P&G, despite its age, still holds a large share of the market in the face of many emerging competitors.

Big Data predicting the uncertainties

A groundbreaking study in Bangladesh has found that using data from mobile phone networks to track the movements of people across the country helps predict where outbreaks of diseases such as malaria are likely to occur, enabling health authorities to take preventive measures.

Every year, malaria kills more than 400,000 people globally, and most of them are children.

Different types of data, including information provided by the Bangladesh ministry of health, are used to create risk maps indicating the likely locations of malaria outbreaks. Local health authorities can then be warned to take preventive action, including spraying insecticides and stockpiling bed nets and medicines to protect the population from the disease.

With the range of technologies it encompasses, Big Data can help almost every company or sector that aspires to grow. Analyzing large datasets associated with a company’s activities can yield insights that increase customer satisfaction.


G-Fact 112 | Machine Learning - Uber Use Case


In this video, we will explore how Uber leverages machine learning to optimize its services, improve user experience, and enhance operational efficiency. Uber is a leading ride-sharing company that uses advanced machine learning algorithms to provide accurate ride estimates, optimize routes, and predict demand. This tutorial is perfect for students, professionals, or anyone interested in understanding how machine learning can be applied to real-world use cases in the transportation industry.

Why Use Machine Learning for Uber?

Using machine learning at Uber helps to:

  • Optimize Ride Matching: Efficiently match riders with drivers to minimize wait times.
  • Predict Demand: Anticipate high-demand periods and adjust pricing dynamically.
  • Improve Route Optimization: Provide the most efficient routes to drivers, reducing travel time and fuel consumption.
  • Enhance User Experience: Deliver accurate fare estimates and improve overall service quality.

Key Concepts

Ride Matching

  • The process of pairing available drivers with ride requests in the most efficient manner.

Demand Prediction

  • Using historical data to predict future ride demand, helping to manage supply and adjust pricing.

Route Optimization

  • Determining the most efficient routes for drivers to minimize travel time and costs.

Benefits of Machine Learning at Uber

  • Efficiency: Improved operational efficiency through optimized resource allocation.
  • Accuracy: Enhanced accuracy in ride estimates and route planning.
  • Customer Satisfaction: Increased user satisfaction through reliable and timely service.

Steps to Implement Machine Learning for Uber Use Case

Data Collection:

  • Collect data on ride requests, driver locations, travel times, and other relevant metrics.

Data Preprocessing:

  • Clean and preprocess the data to remove errors, handle missing values, and prepare it for analysis.

Feature Engineering:

  • Create relevant features from the data that will be used for training machine learning models.

Model Selection and Training:

  • Choose appropriate machine learning algorithms and train models using the processed data.

Model Evaluation:

  • Evaluate the performance of the models using metrics such as accuracy, precision, recall, and F1 score.

Deployment and Monitoring:

  • Deploy the models to production and continuously monitor their performance to ensure they are delivering accurate predictions.
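The evaluation step can be sketched in a few lines. The example below assumes a binary framing that is not stated in the source (1 = a zone turns out to be high-demand, 0 = it does not), and the function name `classification_metrics` is hypothetical. It computes accuracy, precision, recall, and F1 score from true and predicted labels.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (0/1)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Demand forecasting is often treated as a regression problem instead, in which case error measures such as mean absolute error would replace these classification metrics.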

Practical Applications

  • Anticipate high-demand periods and adjust driver availability and pricing dynamically.
  • Provide the most efficient routes to drivers, reducing travel time and fuel consumption.
  • Deliver accurate fare estimates to users based on historical data and real-time conditions.


Uber’s Journey to Modernizing Big Data Infrastructure with Google Cloud Platform

Jun 29, 2024 3 min read

Claudio Masolo

In a recent post on its official engineering blog, Uber disclosed its strategy to migrate its batch data analytics and machine learning (ML) training stack to Google Cloud Platform (GCP). Uber runs one of the largest Hadoop installations in the world, managing over an exabyte of data across tens of thousands of servers in each of its two regions. The open-source data ecosystem, particularly Hadoop, has been the cornerstone of the data platform.

The strategic migration plan consists of two steps: an initial migration, followed by the adoption of cloud-native services. Uber's initial strategy involves leveraging GCP’s object store for data lake storage while migrating the rest of its data stack to GCP’s Infrastructure as a Service (IaaS). This approach allows for a swift migration with minimal disruption to existing jobs and pipelines, as the team can replicate the exact versions of its on-premises software stack, engines, and security model on IaaS. Following this phase, the Uber engineering team plans to gradually adopt GCP’s Platform as a Service (PaaS) offerings, such as Dataproc and BigQuery, to fully harness the elasticity and performance benefits of cloud-native services.

Once the initial migration is complete, the team will focus on integrating cloud-native services to maximize the data infrastructure’s performance and scalability. This phased approach ensures that Uber users, from dashboard owners to ML practitioners, experience a seamless transition without altering their existing workflows or services.

To ensure a smooth and efficient migration, the Uber team has established several guiding principles:

  • Minimize user disruption. By moving the majority of the batch data stack onto cloud IaaS as-is, the team aims to shield users from any changes to their artifacts or services. Using well-known abstractions and open standards, it strives to make the migration as transparent as possible.
  • The team will rely on a cloud storage connector that implements the Hadoop FileSystem interface to Google Cloud Storage, ensuring HDFS compatibility. By standardizing its Apache Hadoop HDFS clients, the team will abstract the specifics of the on-premises HDFS implementation, allowing seamless integration with GCP’s storage layer.
  • The Uber team has developed data access proxies for Presto, Spark, and Hive that abstract the underlying physical compute clusters. These proxies will support the selective routing of test traffic to cloud-based clusters during the testing phase and fully route queries and jobs to the cloud stack during the full migration.
  • Utilize Uber’s cloud-agnostic infrastructure. Uber's existing container environment, computing platform, and deployment tools are built to be agnostic between cloud and on-premises. These platforms will make it easy to extend the batch data ecosystem's microservices onto cloud IaaS.
  • The team will build and enhance existing data management services to support selected and approved cloud services, ensuring robust data governance. The company aims to maintain the same levels of authorized access and security as on-premises, while supporting seamless user authentication against the object store data lake and other cloud services.
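The selective routing performed by the data access proxies can be illustrated with a small sketch. This is not Uber's proxy code: the function `choose_cluster`, the cluster labels, and the hash-based split are assumptions. It shows one common way to send a fixed fraction of traffic to a new stack, deterministically per query.

```python
import hashlib


def choose_cluster(query_id: str, cloud_fraction: float) -> str:
    """Route a query to 'cloud' or 'on_prem' (hypothetical labels).

    cloud_fraction is the share of traffic sent to the cloud stack:
    0.0 during normal operation, a small value during testing,
    and 1.0 once the migration is complete.
    """
    if not 0.0 <= cloud_fraction <= 1.0:
        raise ValueError("cloud_fraction must be within [0, 1]")
    # Hash the id so the same query always lands on the same side.
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "cloud" if position < cloud_fraction else "on_prem"
```

Raising `cloud_fraction` step by step shifts traffic gradually, and because the choice depends only on the query id, a given test query can be replayed against the same cluster.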

The Uber team focuses on bucket mapping and cloud resource layout for migration. Mapping HDFS files and directories to cloud objects in one or more buckets is critical. They need to apply IAM policies at varying levels of granularity, considering constraints on buckets and objects such as read/write throughput and IOPS throttling. The team aims to develop a mapping algorithm that satisfies these constraints and organizes data resources in an organization-centric hierarchical manner, improving data administration and management.
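As a rough illustration of what such a mapping might look like, the sketch below turns an HDFS path into a bucket name and an object name. The article does not describe the actual algorithm, so everything here is assumed: the `/<organization>/<rest of path>` layout, the `example-lake` prefix, and the function name `map_hdfs_path`. It also ignores the throughput and IOPS constraints a real mapping would have to satisfy.

```python
from typing import Tuple


def map_hdfs_path(path: str, bucket_prefix: str = "example-lake") -> Tuple[str, str]:
    """Map '/<org>/<rest...>' to (bucket, object_name) under assumed conventions."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("expected a path of the form /<org>/<object path>")
    org = parts[0].lower()
    # One bucket per organization keeps access policies aligned with ownership.
    bucket = f"{bucket_prefix}-{org}"
    object_name = "/".join(parts[1:])
    return bucket, object_name
```

Grouping by organization is one way to get the hierarchical, organization-centric layout mentioned above, since bucket-level policies can then follow team boundaries.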

Security integration is another workstream: adapting Uber's existing Kerberos-based tokens and Hadoop delegation tokens for cloud PaaS, particularly Google Cloud Storage (GCS), is essential. This workstream aims to support seamless user, group, and service account authentication and authorization, maintaining the same access levels as on-premises.

The team also focuses on data replication. HiveSync, the permissions-aware bidirectional data replication service, allows Uber to operate in active-active mode. The team is extending HiveSync’s capabilities to replicate the on-premises data lake’s data to the cloud-based data lake and the corresponding Hive Metastore. This includes an initial bulk migration and ongoing incremental updates until the cloud-based stack becomes the primary.

The last workstream is provisioning new YARN and Presto clusters on GCP IaaS. Uber's data access proxies will route query and job traffic to these cloud-based clusters during the migration, ensuring a smooth transition.

Uber's big data migration to Google Cloud anticipates challenges like performance differences in storage and unforeseen issues due to its legacy system. The team plans to address these by leveraging open-source tools, utilizing cloud elasticity for cost management, migrating non-core uses to dedicated storage, and proactively testing integrations and deprecating outdated practices.


ORIGINAL RESEARCH article

The spatial effect of integrated economy on carbon emissions in the era of big data: a case study of China

Yan Wang

  • 1 School of Economic and Management, Xi’an University of Technology, Xi’an, China
  • 2 School of Business and Circulation, Shaanxi Polytechnic Institute, Xian Yang, China

The digital economy is characterized by resource conservation, which can help address China’s high carbon emissions. The digital economy can quickly integrate with the real economy, forming an integrated economy. However, it is still unclear whether an integrated economy can effectively reduce carbon emissions and achieve China’s ‘dual carbon goals’. Therefore, this study takes 30 provinces in China as the research object, constructs an integrated economy index system from statistical data for 2011-2021, and explores the spatial effect of the integrated economy on carbon emissions using principal component analysis, a coupled coordination model, and a spatial econometric model. The research results are as follows. (1) From 2011 to 2021, the integrated economy increased year by year (from 0.667 to 0.828), and carbon emissions decreased slowly (from 0.026 to 0.017). (2) Because China’s economic development has spread from east to west, the spatial distribution of the integrated economy shows a decreasing trend from east to west. The spatial distribution of carbon emissions, low in the south and high in the north, may be related to China’s industrial layout of heavy industry in the north and light industry in the south. (3) The integrated economy can significantly reduce carbon emissions (coefficient of influence: -0.146), and the reduction effect is more pronounced when spatial spillover effects are taken into account (-0.305). (4) The Eastern Coast, the middle reaches of the Yangtze River, and the middle reaches of the Yellow River economic zones all increase carbon emissions at a certain level of significance (0.065, 0.148, and 3.890). The Northeast, Southern Coastal, and Southwest economic zones significantly reduce carbon emissions (-0.220, -0.092, and -0.308). The results for the Northern Coast and Northwest are not significant (-0.022 and 0.095). (5) China should tailor regional economic development policies, such as strengthening investment in digital infrastructure in the Northwest Economic Zone and fully leveraging the spatial spillover effects of the integrated economy in the Northeast, Southern Coastal, and Southwest Economic Zones to reduce carbon emissions.

1 Introduction

In recent years, climate issues have become increasingly severe ( Yuan et al., 2024 ), with frequent occurrences of extreme weather phenomena such as air pollution, haze pollution, and rising temperatures ( Tian et al., 2022 ). According to the International Energy Agency (IEA), China has had the highest global carbon emissions since 2007 ( Cheng et al., 2018 ). In response to concerns from the international community about China’s willingness to contribute and share obligations towards global climate change goals, China and the United States signed the Sino-US Joint Declaration on Climate Change in 2014 ( Gao et al., 2021 ; Xu et al., 2024 ). In 2021, the Central Committee of the Communist Party of China and the State Council issued the Action Plan for Carbon Peak before 2030 , incorporating ‘carbon peak and carbon neutrality’ into the overall economic and social development, advocating for accelerating the green transformation of production and lifestyle, and ensuring the timely achievement of the ‘carbon peak’ goal before 2030 ( Zhao et al., 2022 ; Feng et al., 2024 ).

In the era of big data, the integrated economy is the focus point for countries to seize the leading position in global strategy and has become an inevitable choice to solve the problem of carbon emissions ( Shi and Sun, 2023 ; Sun et al., 2024 ). Integrated economy refers to the integration of the digital economy and real economy. The digital economy is the leading force in the current world technological revolution and industrial transformation, and many countries regard it as the new driving force for restructuring national core competitiveness ( Wang et al., 2023 ). The real economy is the foundation of a country, the source of wealth, and the soul of industry, and is the strategic core of economic development for all countries ( Cheng et al., 2023 ). With the vigorous development of digital technology, ‘integrated economy’ has become a new development model and concept ( Liu et al., 2024 ).

In 2020, the Global Climate Action Summit released the Index Climate Action Road map, which proposed implementing ‘digital’ solutions in physical industries that can help reduce global carbon emissions by up to 15% ( Feng et al., 2023a ; Feng et al., 2023b ). It can be seen that the integration of ‘digital technology’ and physical industries, namely the integrated economy, plays a sustained and powerful role in the process of carbon reduction ( Lopes de Sousa Jabbour et al., 2022 ; Sun et al., 2024 ). To achieve economic leadership and reduce pollution, countries have issued strategic plans to promote the development of integrated economies ( Granados and Gupta, 2013 ; Xu et al., 2018 ), such as the United States issuing the National Strategic Plan for Advanced Manufacturing ( Fatima et al., 2020 ), Germany issuing The High Technology Strategy 2025 ( Klippert et al., 2020 ), and the United Kingdom implementing the Extraordinary Export Plan. Made in China 2025 ( Xu et al., 2017 ) also proposes carbon reduction measures to promote China’s green and low-carbon development through intelligent manufacturing and an integrated economy ( Wang et al., 2020 ). However, China is a vast country, and the status of the integrated economy and carbon emission is different in different regions. Studying the spatial effect of an integrated economy on carbon emission is of great theoretical and practical significance for realizing the coordinated development of the economy.

Based on this, this paper takes 30 provinces in China (excluding Hong Kong Special Administrative Region, Macao Special Administrative Region, Taiwan, and Tibet Autonomous Region due to difficulties in data acquisition) as the research object, uses panel data from 2011 to 2021 to construct a measurement system for the development level of the digital economy and the real economy, and applies empirical methods to analyze the spatial effect of the integrated economy on carbon emissions. We attempt to explore the following issues: (1) What is the current situation of China’s integrated economy and carbon emissions? (2) What is the impact of an integrated economy on carbon emissions? (3) What is the spatial effect of the impact of an integrated economy on carbon emissions? (4) What policies should be adopted to promote green and coordinated development across China’s regions to jointly achieve the dual-carbon goal? The research content of this article therefore mainly includes the following aspects. Firstly, this article uses Principal Component Analysis (PCA) to separately measure the subsystems of the digital economy and the real economy. Based on these results, a coupled coordination model is used to integrate the two subsystems and calculate the integrated economy. Secondly, based on the integrated economy and carbon emission data, the Natural Breaks Classification method, implemented in software such as QGIS, is used to analyze their time evolution and spatial distribution trends. Thirdly, we use Moran’s index to analyze the spatial autocorrelation of integrated economy and carbon emission levels. Fourthly, we use spatial econometric models to examine the impact of an integrated economy on carbon emissions and decompose its spatial effects. Fifthly, we classify China’s regions into eight major economic zones and once again use spatial econometric models to analyze the heterogeneity of the impact of the integrated economy on carbon emissions in each region. Finally, based on the results, targeted policy recommendations are proposed to lay the foundation for achieving the ‘dual carbon goals’.
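The spatial autocorrelation step can be illustrated with the standard global Moran's I statistic, I = (n / S0) · Σᵢ Σⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², where S0 is the sum of all spatial weights. The sketch below is a plain implementation of that textbook formula. The spatial weight matrix used in the paper is not specified in this section, so the binary adjacency weights in the usage note are an assumption.

```python
from typing import Sequence


def morans_i(values: Sequence[float],
             weights: Sequence[Sequence[float]]) -> float:
    """Global Moran's I for n regions and an n x n spatial weight matrix."""
    n = len(values)
    if n == 0 or len(weights) != n or any(len(row) != n for row in weights):
        raise ValueError("weights must be an n x n matrix matching values")
    mean = sum(values) / n
    dev = [v - mean for v in values]            # deviations from the mean
    s0 = sum(sum(row) for row in weights)       # sum of all weights
    denom = sum(d * d for d in dev)             # total squared deviation
    if s0 == 0 or denom == 0:
        raise ValueError("weights or values have no variation")
    # Cross-products of deviations for every pair of regions, weighted by w_ij.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    return (n / s0) * (num / denom)
```

For four regions arranged in a chain (each adjacent only to its neighbours), steadily rising values give a positive index, indicating spatial clustering, while alternating high and low values give a negative one.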

The main contributions of this article are reflected in the following aspects. Firstly, the existing research gap lies in the fact that few scholars have measured the integrated economy. However, as an important form of economy, the integrated economy is different from the traditional real economy and digital economy. This article constructs a coupled coordination model based on the two subsystems of the integrated economy, the digital economy and the real economy, to accurately measure the level of China’s integrated economy, filling the gap in existing research that lacks measurement of the integrated economy. Secondly, existing studies rarely mention the impact of an integrated economy on carbon emissions, and more tend to discuss the impact of a digital economy on carbon emissions. As a new form of economy, an integrated economy requires the penetration and unification of the digital economy and the real economy. This article incorporates the integration economy and carbon emissions into the same theoretical framework, analyzes the relationship between the two, and fills the gap in research on the relationship between the integration economy and carbon emissions. Finally, few scholars have considered the spatial heterogeneity of the impact of an integrated economy on carbon emissions. China, the subject of the study, is a vast country with a wide range of landmasses, and inter-regional development is bound to have differences. Our study of the spatial heterogeneity of the impact of the integrated economy on carbon emissions from the perspective of the eight economic zones has certain policy implications for the development of the integrated economy in China’s provinces according to local conditions.
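For readers unfamiliar with coupled coordination models, the sketch below shows one widely used two-subsystem form: a coupling degree C = 2·√(U₁U₂) / (U₁ + U₂), a comprehensive index T = αU₁ + βU₂, and a coordination degree D = √(C·T). The exact specification and weights used in this paper are not given in this section, so the formula variant, the equal weights α = β = 0.5, and the function name are assumptions.

```python
import math


def coupling_coordination(u_digital: float, u_real: float,
                          alpha: float = 0.5, beta: float = 0.5) -> float:
    """Coordination degree D for two subsystem scores in [0, 1] (assumed form)."""
    if not (0.0 <= u_digital <= 1.0 and 0.0 <= u_real <= 1.0):
        raise ValueError("subsystem scores must be normalized to [0, 1]")
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    total = u_digital + u_real
    if total == 0.0:
        return 0.0  # both subsystems at zero: no coordination to measure
    # Coupling degree: 1 when the subsystems are equal, 0 when one is zero.
    c = 2.0 * math.sqrt(u_digital * u_real) / total
    # Comprehensive development index: weighted level of both subsystems.
    t = alpha * u_digital + beta * u_real
    return math.sqrt(c * t)
```

Under this form, two balanced subsystems both scoring 0.64 yield a coordination degree of 0.8, whereas a province with a strong digital economy but a real-economy score of zero yields 0, reflecting the requirement that both subsystems develop together.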

2 Literature review and analysis of theoretical mechanisms

2.1 Literature review

This paper divides the previous studies into three parts, integration economy-related studies, carbon emission-related studies, and studies on the relationship between integration economy and carbon emission.

Firstly, there are few studies on the integrated economy, and they mainly focus on exploring the intrinsic coordination mechanism between its subsystems, i.e., the digital economy and the real economy, as well as the current development situation ( Sun et al., 2024 ). The digital economy promotes the development of China’s real economy through industrial digitization and digital industrialization, with industrial structure optimization and upgrading as the intermediary ( Hong and Ren, 2023 ). The impact of the digital economy on the real economy presents an inverted U-shaped feature, with a crowding-out effect on the real economy in the eastern part of China and a promoting effect in the western part ( Jiang and Sun, 2020 ; Xu et al., 2021 ). At present, the integrated economy shows a decreasing trend from the east through the middle to the west, with problems such as insufficient integration depth, lack of key technologies, and lax market supervision ( Zhang et al., 2022b ). It is urgent to strengthen investment in technological innovation and digital infrastructure construction, create high-level manufacturing industries, and improve and strengthen digital governance to promote the deep integration of the digital economy and the real economy ( Liu et al., 2022a ).

Secondly, the research direction of carbon emissions mainly focuses on three aspects: the current status of carbon emissions ( Xu et al., 2019 ), carbon peak prediction ( Wang and Feng, 2024 ), and the influencing factors of carbon emissions ( Tong, 2020 ; Xu, 2023 ). Firstly, the analysis of the current status of carbon emissions focuses on industries with high carbon concentration, regions with high carbon emissions, the carbon emissions of a certain region under China’s 2030 carbon peak target, and the carbon emissions tracking of a specific location or factory ( Li et al., 2016 ; Ahmadi et al., 2019 ). Secondly, regarding the research on carbon peak prediction, most of the previous researchers used big data models and scenario analysis methods to predict the future growth of carbon emissions. And the results show that most of the provinces and cities in China can achieve the goal of a carbon peak by 2030, and only individual regions, such as Hubao, Eyu and Elm, have difficulties in achieving a carbon peak ( Zhang et al., 2022b ; Dai et al., 2022 ). Finally, according to existing research, public policy factors such as carbon emission trading pilot programs and low-carbon city pilot policies ( Zhao et al., 2022 ), industrial structure factors such as energy structure and industrial robots ( Meng et al., 2018 ; Li and Zhou, 2021 ; Jiang et al., 2023 ), and macro technological factors such as outward direct investment, population aggregation, digital economy development ( Zhao and Zhu, 2022 ; Liu et al., 2023 ), and technological innovation will all have an impact on carbon emissions, carbon intensity, or efficiency ( Chen et al., 2023 ; Zha et al., 2023 ).

Thirdly, there is currently limited research on the relationship between integrated economy and carbon emissions. Most of the related research focuses on the impact of the digital economy, a subsystem of the integrated economy, on carbon emissions ( Wu et al., 2022 ). Most studies suggest that the digital economy can improve carbon emission efficiency by reducing energy consumption ( Jiang et al., 2023 ). The rationalization (advanced) of the industrial structure undermines (enhances) to some extent the carbon-emission efficiency-enhancing effect of the digital economy ( Zhang et al., 2022a ; Chang et al., 2023 ). The carbon reduction effect of the digital economy varies in different regions of China ( Zhang et al., 2022a ). The paths for the digital economy to reduce regional carbon emission intensity or enhance carbon emission efficiency mainly include increasing digital infrastructure and formulating policy guidance based on regional characteristics ( Feng et al., 2023a ; Feng et al., 2023b ; Tang and Yang, 2023 ).

In summary, existing studies focus on the role of the digital economy or industrial development in reducing carbon emissions, but few scholars have scientifically measured the level of development of the integrated economy, and fewer still consider its carbon reduction effect from the perspective of the integrated economy. Therefore, the main contributions of this article are reflected in the following aspects. Firstly, using reasonable methods and indicator systems to measure the integrated economy fills the gap in the measurement of the integrated economy in the existing literature. Secondly, incorporating the integrated economy and carbon emissions into the same theoretical framework deepens the theoretical research on the low-carbon economy. Finally, the article analyzes the current situation of, and the inherent relationship between, the integrated economy and carbon emissions from a spatial perspective, deepening relevant research in spatial economics.

2.2 Theoretical mechanisms

The integrated economy is a large economic system constructed by the digital economy subsystem and the real economy subsystem ( Jiang et al., 2023 ). The process of integrating internal subsystems is essentially a process of mutual influence and mutual promotion, in which industrial digitization and digital industrialization are achieved ( Hong and Ren, 2023 ). Therefore, industrial digitization and digital industrialization are external manifestations of an integrated economy. Digital industrialization refers to the continuous expansion of digital technology industries such as the Internet, big data, and cloud computing to form an industrial scale, manifested as the materialization of the digital economy ( Peng et al., 2023 ). Industrial digitization refers to the application of digital technology to achieve intelligent manufacturing in the process of physical industry development, manifested as the digitization of the real economy ( Yi et al., 2023 ). The integrated economy can effectively reduce carbon emissions, mainly through the multiplier effect of the digital economy and the efficiency effect of the real economy.

On the one hand, the digital economy is virtual and networked, which gives it natural green and energy-saving characteristics and allows low-carbon growth ( Sun et al., 2024 ). The development of the digital economy has expanded the industrial foundation of the real economy, changed traditional business models, and injected green and low-carbon elements into the development of the real economy ( Jiang and Sun, 2020 ). Firstly, the development of the digital economy has promoted the growth of digital industries, such as the Internet and cloud platforms, that rely on data elements. These industries are built on new digital infrastructure, driven by innovation, and inherently high-tech. Knowledge spillovers and innovation spillovers together constitute the multiplier effect of the digital economy. As the digital industry develops, digital diffusion enables green innovation and reduces regional industrial carbon emissions. Secondly, in the era of big data, consumers’ product needs have changed fundamentally. Mining and analyzing data elements uncovers green and low-carbon needs, which in turn guide green innovation in enterprises. Modern enterprises have begun to be guided by consumers’ green demands, breaking away from the traditional value creation model centered on product research and development.

On the other hand, using digital technology in the real economy can fully leverage the efficiency effect of innovative technology, accelerating the transformation and upgrading of the real industrial structure towards low-carbon and environmentally friendly green industries ( Liu et al., 2023 ). The real economy provides a source of data elements for the digital economy, increasing the demand for digital technology in the real industry, driving digital technology innovation, improving innovation efficiency, and achieving regional carbon emissions reduction ( Shi and Sun, 2023 ). Firstly, the major industries of the real economy involve various aspects of social life and are the main sources of carbon emissions. User characteristics, individual needs, unknown risks, etc. can be accurately analyzed and predicted through digital technology, reducing unnecessary carbon pollution and waste. Secondly, the integration of the physical industry and the digital economy can improve enterprise productivity, reduce unnecessary carbon emissions in the product manufacturing process, bring more value to the physical industry, and force enterprises to continuously engage in green innovation and achieve low-carbon development.

According to the theory of unbalanced growth, the path of economic development is full of obstacles and bottlenecks, such as shortages of technology, equipment, products, and factor endowments ( Qi et al., 2013 ). The current state of development, the development path, and policy orientations differ across regions, so imbalance appears at the regional level and is the norm ( Liu et al., 2022b ). At China’s current stage of development, some policies, resources, and factors are still unevenly allocated, resulting in spatial differences in the integrated economy and in carbon emission levels. According to the theory of spatial economics, because the integrated economy has both multiplier and efficiency effects, it is expected to generate spatial spillover effects; that is, the integrated economy in one area can affect the development of the integrated economy and other economic variables in surrounding areas. According to the theory of externalities, carbon emissions are an important pollutant affecting the climate, and environmental pollution tends to accumulate within a region, affecting the ecology and economy of both the local and surrounding areas. In summary, the integrated economy and carbon emissions are expected to exhibit spatial agglomeration, and the impact of the integrated economy on carbon emissions is expected to have a spatial spillover effect.

3 Methods and data

3.1 Variable selection and data sources

3.1.1 Variable selection

3.1.1.1 Integrated economy

Referring to relevant research, this paper uses the coupling coordination model to measure the level of the integrated economy ( IE ) (Zhang et al., 2022). We divide IE into digital economy ( DE ) and real economy ( RE ) subsystems, establish an index system for each, and use principal component analysis ( PCA ) to calculate the comprehensive value of each subsystem independently. Considering data availability, we follow Zhao et al. (2020) and measure the development level of the digital economy from two aspects: internet development and digital finance development. We measure the development level of the real economy from three aspects: its scale, its structure, and its future development. The specific indicators and attributes are shown in Table 1 .

Table 1 Index measurement system of the digital economy and the real economy.

3.1.1.2 Carbon emissions

In this paper, carbon intensity (the amount of carbon emissions divided by GDP) is used as the proxy variable for carbon emissions ( CE ). The amount of regional carbon emissions is measured by apparent carbon emissions. Data on carbon emission quantities are from China Emission Accounts and Datasets (CEADs) ( Shan et al., 2016 , 2018 , 2020 ; Guan et al., 2021 ).

3.1.1.3 Other variables

In line with the five requirements of China’s high-quality development (innovation, coordination, green, openness, and sharing), this paper selects eight control variables, as shown in Table 2 . (1) Innovation. Scientific and technological innovation guides industrial innovation and accelerates green transformation, and talent is the fundamental source of green innovation. Therefore, technology innovation intensity ( TI ) and innovative talents ( IT ) are the control variables associated with innovation. (2) Coordination. Regional coordination accelerates the inter-regional flow of capital, technology, and talent, injecting capital into the research and development of industrial carbon reduction technology. Therefore, regional coordination ( RC ) and industry coordination ( IH ) are the control variables associated with coordination. (3) Green. Greater green governance capacity accelerates the research and development of digital green technology and helps address the high pollution and high energy consumption of heavy industry. Therefore, green governance capability ( GG ) is the control variable associated with green. (4) Openness. Diversified capital encourages enterprises to learn and introduce advanced carbon reduction technologies from abroad, and foreign investment injects new momentum into the development of domestic industries. Therefore, foreign investment intensity ( FI ) and traffic-developed degree ( TD ) are the control variables associated with openness. (5) Sharing. Well-developed transportation is the basis for the rapid circulation of goods in the physical industries. The Internet links modern industries and underpins the development of the digital economy, which is of great significance to green manufacturing in enterprises. Social consumption capacity reflects the purchasing power of society and pushes industry to develop greener, lower-carbon products and services. Therefore, internet development level ( ID ) is the control variable associated with sharing. Among these variables, TI , FI , TD , and ID are calculated by the entropy method, and the others are obtained by taking the logarithm of the original data.

Table 2 Control variable description table.

3.1.2 Data sources

This paper uses a sample of 30 provincial administrative units in China (excluding the Hong Kong Special Administrative Region, the Macao Special Administrative Region, Taiwan, and the Tibet Autonomous Region, for which many values are missing) to conduct the empirical analysis for the years 2011-2021. The data come from the Chinese Research Data Services Platform (CNRDS), the Easy Professional Superior (EPS) database, the China Carbon Accounting Database (CEADs), the China Statistical Yearbook (2012-2022), the China Energy Statistics Yearbook (2012-2022), the China Information Industry Yearbook (2012-2022), the Peking University Digital Inclusive Finance Index report (2011-2021), the China E-Commerce Report (2011-2021), and provincial statistical yearbooks and government work reports. Missing values are filled in using linear interpolation.

3.2 Research method

The method proceeds in the following steps. (1) Principal component analysis (PCA) is used to measure the digital economy and real economy subsystems separately. (2) Based on the subsystem results, a coupling coordination model combines the two subsystems to calculate the integrated economy. (3) Based on the integrated economy and carbon emission data, the natural breaks classification method in QGIS is used to analyze their spatial distribution. (4) Moran’s index is used to analyze the spatial autocorrelation of the integrated economy and carbon emission levels. (5) Spatial econometric models are used to examine the impact of the integrated economy on carbon emissions and to decompose its spatial effects. (6) China is divided into eight major economic zones, and spatial econometric models are applied again to analyze the heterogeneity of the impact of the integrated economy on carbon emissions across regions. PCA and the coupling coordination model are described in section 3.2.1, the natural breaks classification in section 3.2.2, Moran’s index in section 3.2.3, the spatial econometric models in section 3.2.4, and the division into eight major economic zones in section 3.2.5.

3.2.1 Measurement models of the core indicator

3.2.1.1 PCA

Using principal component analysis to measure the development level of the DE and RE subsystems avoids the subjectivity of manually assigned weights and gives reasonably reliable results. The specific steps are as follows.

a. Construct the matrix according to the selection of each subsystem index. If there are n samples and p indices, then the original matrix x of size n × p can be formed as shown in Equation 1 .

b. The original matrix is standardized to obtain a standardized matrix X as shown in Equations 2 – 4 .

c. Calculate the covariance matrix of the normalized sample as shown in Equations 5 , 6 .

d. Calculate the eigenvalues $\lambda$ and eigenvectors $a$ of $R$, where $R$ is a positive semidefinite matrix and the eigenvalues satisfy $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, as shown in Equation 7 .

e. Calculate the principal component contribution rate $c$ and the cumulative contribution rate $s$, as shown in Equation 8 , and extract the leading principal components whose cumulative contribution rate exceeds 80%. The resulting index value is $Y_i$, as shown in Equation 9 .
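Steps a to e can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `pca_composite_score` is hypothetical, and combining the retained components into one score by weighting them with their contribution rates is an assumption, since Equation 9 is not reproduced in this text. Eigenvector signs are arbitrary, so in practice the sign of each component may need to be fixed so that a higher score means a higher development level.

```python
import numpy as np

def pca_composite_score(x, threshold=0.80):
    """Composite PCA score for an n x p indicator matrix (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Step b: standardize each indicator (z-scores with population std).
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # Step c: covariance of standardized data, i.e. the correlation matrix R.
    r = np.cov(z, rowvar=False, bias=True)
    # Step d: eigenvalues and eigenvectors of R, sorted in descending order.
    eigvals, eigvecs = np.linalg.eigh(r)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step e: contribution rate c and cumulative contribution rate s.
    contrib = eigvals / eigvals.sum()
    cumulative = np.cumsum(contrib)
    # Keep the leading components until the cumulative rate reaches the threshold.
    m = min(int(np.searchsorted(cumulative, threshold) + 1), len(eigvals))
    scores = z @ eigvecs[:, :m]
    # Assumption: weight the retained components by their contribution rates.
    weights = contrib[:m] / contrib[:m].sum()
    return scores @ weights, contrib, m
```

The function returns one score per sample, the contribution rates, and the number of retained components, so the 80% cut-off can be checked directly.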

3.2.1.2 The coupling coordination model

The coupling coordination degree model can measure the dependence and correlation between multiple subsystems to analyze the coordinated development level between subsystems, not only considering the overall coordination but also paying attention to the development of subsystems ( Shao et al., 2016 ). This paper uses the coupling coordination model to calculate IE . The steps are as follows:

a. Min-max normalization is applied to the principal component results of the DE and RE subsystems (zero values in the result are shifted by 0.1). Both DE and RE are positive indicators, so the formula is as shown in Equation 10 .

where $i$, $t$, and $j$ refer to the region, year, and index, respectively; $j = 1$ refers to DE and $j = 2$ refers to RE ; $Z_{itj}$ is the normalized value of index $j$ in region $i$ in year $t$; $Y_{itj}$ is the value of index $j$ in region $i$ in year $t$; and $\max Y_j$ and $\min Y_j$ are the maximum and minimum values of index $j$, respectively.

b. According to the calculation results, the comprehensive coordination index $T_{ti}$ is calculated. $DE_{ti}$ is $Y_{it1}$ and $RE_{ti}$ is $Y_{it2}$. The calculation formula is as shown in Equation 11 .

$\alpha$ and $\beta$ are coefficients of development and take 0.5 here.

c. Calculate the coupling level of the digital economy and real economy, $C_{ti}$. The calculation formula is as shown in Equation 12 .

d. Calculate the coupling coordination degree, that is, IE . The calculation formula is as shown in Equation 13 .

e. According to its value range, the coupling coordination degree is divided into 10 grades, as shown in Table 3 :

Table 3 IE grade division.
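The coupling coordination steps can be illustrated with a short sketch. Equations 10 to 13 are not reproduced in this text, so the formulas below are the forms commonly used for two subsystems and are assumptions: $T = \alpha \, DE + \beta \, RE$, $C = 2\sqrt{DE \cdot RE}/(DE + RE)$, and $D = \sqrt{C \cdot T}$. The function names are hypothetical, and shifting only the exact zeros by 0.1 is one literal reading of step a.

```python
import numpy as np

def min_max_shift(y, shift=0.1):
    """Min-max normalize y; zeros are shifted by `shift` (assumed reading of step a)."""
    y = np.asarray(y, dtype=float)
    z = (y - y.min()) / (y.max() - y.min())
    return np.where(z == 0, z + shift, z)

def coupling_coordination(digital, real, alpha=0.5, beta=0.5):
    """Return (C, T, D) for normalized subsystem values (assumed standard forms)."""
    digital = np.asarray(digital, dtype=float)
    real = np.asarray(real, dtype=float)
    # Step b: comprehensive coordination index T.
    t = alpha * digital + beta * real
    # Step c: coupling level C for two subsystems.
    c = 2.0 * np.sqrt(digital * real) / (digital + real)
    # Step d: coupling coordination degree D, used as IE.
    d = np.sqrt(c * t)
    return c, t, d
```

With balanced subsystems (both 0.5) the coupling level is 1, while an unbalanced pair such as 0.9 and 0.1 gives the same T but a lower C and therefore a lower coupling coordination degree, which is the behavior the model is meant to capture.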

3.2.2 Natural breaks classification

This article uses QGIS software to draw spatial distribution maps of IE and CE in China, with classes determined by natural breaks classification. Natural breaks classification determines class boundaries from the characteristics of the data itself: it looks for the points at which the sorted values change markedly (the ‘natural breaks’) and places the class boundaries there, so that values within a class are as similar as possible and the classes differ from each other as much as possible. Because the boundaries are computed by a clustering algorithm rather than set by hand, the method avoids the subjectivity of manual classification, reduces human bias, and helps reveal the underlying structure of the data. To clarify the spatial distribution of IE and CE in the 30 provinces studied in this article, the natural breaks algorithm in QGIS was used to divide the data into three categories.
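The idea behind natural breaks can be sketched as follows. One common formulation (Jenks) chooses the class boundaries that minimize the sum of squared deviations within classes. The sketch below does this by exhaustive search, which is feasible for a sample as small as 30 provinces and three classes; it is an illustration of the principle, not the algorithm QGIS implements, and `natural_breaks` is a hypothetical name.

```python
from itertools import combinations

def natural_breaks(values, k=3):
    """Split values into k classes minimizing within-class squared deviations."""
    data = sorted(values)
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    def squared_deviation(chunk):
        mean = sum(chunk) / len(chunk)
        return sum((v - mean) ** 2 for v in chunk)

    best_cost, best_cuts = float("inf"), ()
    # Try every way of placing k - 1 cut points between the sorted values.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        cost = sum(squared_deviation(data[a:b]) for a, b in zip(bounds, bounds[1:]))
        if cost < best_cost:
            best_cost, best_cuts = cost, cuts
    bounds = (0, *best_cuts, n)
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]
```

For larger samples, a dynamic programming implementation would be preferable to exhaustive search.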

3.2.3 Spatial autocorrelation test method

We intend to use a spatial econometric model for regression analysis. Considering the possible spatial dependence and autocorrelation of IE and CE , we use Global Moran’s I to test the spatial autocorrelation of IE as shown in Equation 14 and CE as shown in Equation 15 . The calculation formula is as follows:

where $n$ is the number of research objects, $I$ is Moran’s I, $S^2$ is the variance, $S_{IE}^2 = \frac{\sum_{i=1}^{n} (IE_i - \overline{IE})^2}{n}$, $S_{CE}^2 = \frac{\sum_{i=1}^{n} (CE_i - \overline{CE})^2}{n}$, $\overline{IE}$ is the mean of IE , $\overline{CE}$ is the mean of CE , and $w_{ij}$ is the spatial weight matrix.

To increase the accuracy of the analysis, this paper adopts a nested weights matrix that combines an inverse-distance-based spatial weights matrix with an economic-based weights matrix ( Case et al., 1993 ).

$$w = \varphi w_1 + (1 - \varphi) w_2,$$

$$w_1 = \begin{cases} 1/d_{ij}, & i \text{ and } j \text{ have a common boundary} \\ 0, & i \text{ and } j \text{ have no common boundary or } i = j, \end{cases}$$

$$w_2 = \begin{cases} 1/\left|\bar{X}_i - \bar{X}_j\right|, & i \ne j \\ 0, & i = j. \end{cases}$$

Following Zhang et al. (2022c) , $\varphi = 0.5$. $\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}$ is the sum of all spatial weights. The value range of $I$ is [-1, 1]: $I > 0$ indicates positive spatial correlation, $I < 0$ indicates negative spatial correlation, and the closer $|I|$ is to 1, the stronger the spatial autocorrelation.
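A minimal sketch of the computation, assuming the standard form of Global Moran's I that matches the definitions given here (variance with denominator $n$, normalization by the sum of all weights): $I = \frac{\sum_i \sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{S^2 \sum_i \sum_j w_{ij}}$. Equations 14 and 15 are not reproduced in this text, so this form is an assumption, and the function names are hypothetical.

```python
import numpy as np

def nested_weights(w_distance, w_economic, phi=0.5):
    """Nested matrix w = phi * w1 + (1 - phi) * w2, with phi = 0.5 as in the text."""
    w1 = np.asarray(w_distance, dtype=float)
    w2 = np.asarray(w_economic, dtype=float)
    return phi * w1 + (1.0 - phi) * w2

def global_morans_i(x, w):
    """Global Moran's I of variable x under spatial weight matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    dev = x - x.mean()
    s2 = (dev ** 2).sum() / len(x)   # variance with denominator n
    return (dev @ w @ dev) / (s2 * w.sum())

# Toy example: four regions in a row, each bordering its immediate neighbors.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
print(global_morans_i([1, 2, 3, 4], chain))    # smooth gradient: positive I
print(global_morans_i([1, -1, 1, -1], chain))  # alternating values: negative I
```

The toy matrix is purely illustrative; in the paper the weights come from boundary-based inverse distances and economic distances between provinces.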

3.2.4 Spatial econometric model

The spatial econometric model is different from the traditional econometric model, as it can consider spatial factors and reduce the estimation error. Traditional spatial econometric models include the spatial autoregressive model ( SAR ) as shown in Equation 16 , spatial error model ( SEM ) as shown in Equation 17 , and spatial Durbin model ( SDM ) as shown in Equation 18 . The specific expressions are as follows:

where $i$ is the region, $t$ is time, $k$ indexes the influencing factors ( IE and the 8 control variables), $\beta_0$ is a constant term, $\alpha_k$ is the regression coefficient of the $k$-th influencing factor, $X_{ikt}$ is the $k$-th influencing factor in region $i$ at time $t$, $\rho$ and $\lambda$ are the spatial autoregressive coefficients, $W$ is the $n \times n$ spatial weight matrix, and $\delta_{it}$, $\epsilon_{it}$, and $\mu_{it}$ are random error terms.
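For orientation, the textbook forms of the three models can be written with the symbols defined above. Equations 16 to 18 are not reproduced in this text, so the following is a sketch under assumptions: the assignment of error terms to models may differ from the original, and $\theta_k$, the coefficient on the spatially lagged regressors in the SDM, is a symbol introduced here for illustration.

```latex
% SAR: spatial lag of the dependent variable
CE_{it} = \beta_0 + \rho \sum_{j=1}^{n} w_{ij}\, CE_{jt}
          + \sum_{k} \alpha_k X_{ikt} + \delta_{it}

% SEM: spatial dependence in the error term
CE_{it} = \beta_0 + \sum_{k} \alpha_k X_{ikt} + \mu_{it}, \qquad
\mu_{it} = \lambda \sum_{j=1}^{n} w_{ij}\, \mu_{jt} + \epsilon_{it}

% SDM: spatial lags of both the dependent variable and the regressors
CE_{it} = \beta_0 + \rho \sum_{j=1}^{n} w_{ij}\, CE_{jt}
          + \sum_{k} \alpha_k X_{ikt}
          + \sum_{k} \theta_k \sum_{j=1}^{n} w_{ij}\, X_{jkt} + \delta_{it}
```

In these forms the SDM nests the SAR (when every $\theta_k = 0$), which is why it is the natural choice when tests indicate both lag and error effects.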

To determine which spatial econometric model to use, the Lagrange Multiplier (LM) test is carried out. The Robust-LM statistics for both the spatial error and the spatial lag specifications reject the null hypothesis at the 0.01 significance level, indicating that both error and lag effects are present, so the SDM is selected. The Hausman test is then used to choose between the random effects and fixed effects models. The null hypothesis is rejected at the 0.01 significance level, so the fixed effects model is adopted. All test results are shown in Table 4 .

Table 4 Model test process and results.

3.2.5 The division of the eight major economic zones

To further analyze the regional heterogeneity of the carbon emission reduction effect of the integrated economy, we divide China (here mainly the Chinese mainland, excluding Hong Kong, Macao, Taiwan, and other places) into eight groups according to the eight economic zones in the Strategy and Policy for Coordinated Regional Development of the Development Research Center of the State Council. Figure 1 shows the distribution of China’s eight economic zones.

Figure 1 Distribution of China’s eight economic zones.

According to Figure 1 , the northern coastal comprehensive economic zone includes Beijing, Tianjin, Hebei and Shandong provinces. The Northeast Comprehensive Economic Zone includes Liaoning, Jilin and Heilongjiang provinces. The eastern coastal comprehensive economic zone includes Shanghai, Jiangsu and Zhejiang provinces. The southern coastal economic zone includes Fujian, Guangdong and Hainan provinces. The comprehensive economic zone in the middle reaches of the Yangtze River includes Hubei, Hunan, Jiangxi, and Anhui provinces. The southwest comprehensive economic zone includes Yunnan, Guizhou, Sichuan, Chongqing, and Guangxi provinces. The comprehensive economic zone of the middle reaches of the Yellow River includes Shaanxi, Shanxi, Henan, and Inner Mongolia provinces. The Northwest Comprehensive Economic Zone includes Gansu, Qinghai, Ningxia, Tibet, and Xinjiang provinces. It is worth noting that when dividing the region, Tibet belongs to the Northwest Comprehensive Economic Zone. However, due to the difficulty of counting data for Tibet, only the other four provinces in the Northwest Comprehensive Economic Zone are counted in this paper.

4.1 Measurement results of the integrated economy

According to the coupling coordination model, the results of the IE in China from 2011 to 2021 are shown in Table 5 .

Table 5 The level of IE in China from 2011 to 2021.

The IE grade of the 30 provinces of China continued to rise from 2011 to 2021, shifting overall from primary integration (A2) to good integration (A4), and the integration status has been good in recent years. In 2011, most provinces were in a state of primary integration (A2, 41.9%) or moderate integration (A3, 22.6%). In 2021, most provinces were in a state of moderate integration (22.6%) or good integration (45.2%). Guangdong has been at a high level of integration for a long time, and Beijing, Jiangsu, Zhejiang, and other leading provinces are second only to Guangdong. It is worth noting that IE in Hainan, Qinghai, Ningxia, and Tibet has remained in a state of imbalance or low integration, showing a significant gap with the other provinces.

4.2 The time evolution and spatial distribution of IE and CE

4.2.1 Trends in time evolution

The national average time evolution of IE and CE from 2011 to 2021 is shown in Figure 2 . IE shows a growth trend, and CE shows a general downward trend, which suggests that China’s policies to integrate the digital and real economies and to promote low-carbon emission reduction have had a visible effect. With the development of digital technology, manufacturing in the physical industries has begun to shift toward intelligent production, which drives the upward trend of the integrated economy. After 2019, however, the epidemic slowed the overall pace of economic development and affected both the physical and digital industries, so the integrated economy shows a slight downward trend. In recent years, President Xi Jinping has put forward the green development concept of ‘green mountains are golden mountains’ and the dual-carbon goal of ‘achieving carbon peak by 2030 and carbon neutrality by 2060’, which has increased public concern for green development and carbon reduction, and carbon emissions have begun to decline year by year. However, because of China’s large population and industrial base, energy consumption remains high, and the downward trend is not pronounced.

Figure 2 Time evolution trend of IE and CE from 2011 to 2021. The red font is a negative value.

4.2.2 Spatial distribution and evolutionary trends

In order to clarify the evolution trend of the spatial distribution of IE and CE , the spatial distribution maps of IE and CE in 2011 and 2021 are drawn respectively, as shown in Figures 3 – 6 . In this paper, the relevant maps are drawn by QGIS software, and the classification principle of drawing is based on Python’s natural breaks classification.

Figure 3 The spatial distribution of IE in 2011.

Figure 4 The spatial distribution of IE in 2021.

Figure 5 The spatial distribution of CE in 2011.

Figure 6 The spatial distribution of CE in 2021.

Figures 3 and 4 show the spatial distribution of IE in 2011 and 2021. From Figures 3 and 4 , it can be seen that, firstly, the regions with high IE values in 2011 are mainly concentrated in the eastern coastal provinces, while Qinghai, Gansu, Ningxia, Guizhou, and Hainan have low IE values, and most of the central regions have medium IE values. It can be seen that in 2011, IE had just started and had not yet been popularized nationwide, and the eastern region had been ahead of other regions in realizing the integration of the digital economy with the real economy. Second, in 2021, the IE dominant regions started to penetrate the interior, and Henan, like many coastal cities, had a high level of IE . The three western poor regions of Qinghai, Gansu, and Ningxia have relatively low IE , and most central provinces still have medium IE levels. Finally, according to the categorized data in the legend, it can be seen that from 2011 to 2021, the level of IE in each province has been increasing and the inter-provincial gap has been narrowing. In short, the spatial distribution of IE shows a trend of ‘decreasing from east to west’, with regional differences decreasing with the evolution of time.

Figures 5 and 6 show the spatial distribution of CE in 2011 and 2021. Firstly, CE in 2011 shows a polarized pattern of low in the south and high in the north, and regions with high CE account for the majority of the country. Ningxia and Shanxi may have a higher CE than the rest of the country because of heavily polluting industries such as coal, iron, and steel. Secondly, by 2021, low-CE areas among the 30 provinces far outnumber medium-CE areas, and only Shanxi Province retains a persistently high CE owing to its coal and mineral resource industries. Finally, from 2011 to 2021, CE decreased to a certain extent and the low-emission area expanded significantly, indicating that carbon emission reduction policy has achieved some success. Overall, China’s CE shows a distribution of ‘low in the south and high in the north’, with low-carbon areas continuing to spread from south to north.

4.3 Spatial autocorrelation of IE and CE

4.3.1 Global spatial autocorrelation

Figure 7 shows the evolution of spatial correlation between IE and CE from 2011 to 2021.

Figure 7 Moran’s I of IE and CE from 2011 to 2021. Red triangles mark insignificant results; all other results are significant at the 0.01 or 0.1 level. Negative values are shown in red font.

First of all, Figure 7 shows that IE has strong positive spatial autocorrelation; that is, regions with high IE tend to cluster together, as do regions with low IE . Secondly, except for 2013 to 2015 (when Moran’s I for CE was negative but not statistically significant), CE has a positive and significant spatial correlation, which indicates that CE has ‘good neighbors’ or ‘beggar neighbors’. Finally, the spatial correlation of IE is much higher than that of CE , indicating that the economic effect is more likely to form spatial agglomeration than the environmental effect.

4.3.2 Local spatial correlation

The local Moran’s I index is the key to accurately capturing the heterogeneity of local spatial elements, reflecting the correlation between the value of an attribute in a region and neighboring regions ( He et al., 2023 ). In this paper, the Moran index scatter plots of IE and CE from 2011 to 2021 are drawn to describe the local spatial correlation. Due to space limitations, only the Moran scatter plots of 2011 and 2021 are shown, as shown in Figures 8 , 9 .

Figure 8 Moran scatterplot of IE in 2011 and 2021.

Figure 9 Moran scatterplot of CE in 2011 and 2021.

According to Figure 8 , the IE of the 30 provinces is mainly concentrated in the first and third quadrants from 2011 to 2021, indicating that ‘good neighbors’ and ‘beggar neighbors’ coexist. This two-way agglomeration may widen the gap between regions. Figure 9 shows that in 2011 the CE of China’s 30 provinces is mainly concentrated in the second and third quadrants, with more provinces in the third, indicating that mixed agglomeration and ‘low-low agglomeration’ coexist and that the agglomeration of regions with lower carbon emissions is more pronounced. In 2021, the provinces are mainly concentrated in the third quadrant, with significantly more there than in 2011, indicating that the carbon emission situation has eased over the past 10 years and that low-carbon emission areas have increased and continued to cluster.

4.4 The spatial effect of IE on CE

4.4.1 Spatial econometric model results

4.4.1.1 Spatial modeling regression results

The measurement results of SDM with fixed time are shown in Table 6 . According to Table 6 , first of all, the spatial autoregressive coefficient is -0.383, which is significant at the significance level of 0.05, indicating that the more concentrated the regions with large carbon emissions, the more conducive to centralized governance and the easier it is to reduce carbon emission intensity. Secondly, IE can significantly inhibit CE , and the coefficients are -0.146 and -0.305 without considering and considering the spatial effect of the spatial matrix, respectively. It can be seen that IE has a stronger inhibitory effect on CE when considering spatial spillover. Finally, under the consideration of the spatial matrix, the control variables such as TI , RC , GG , etc. have a significant reduction effect on CE .

Table 6 Model measurement results.

4.4.1.2 Spatial Spillover Effect Decomposition

To further analyze the spatial effect of IE on CE , the spatial spillover effect is decomposed, and the results are shown in Table 7 . It can be seen from Table 7 that the direct and spatial effects of IE on CE are significant, and the indirect inhibitory effect on CE is stronger than the direct effect.

Table 7 Spatial spillover effect decomposition results.
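The decomposition into direct and indirect effects can be illustrated with a short sketch. The text does not state which decomposition method was used; the sketch assumes the partial-derivative approach that is standard for the SDM, in which the effects matrix for one regressor is $(I - \rho W)^{-1}(\beta I + \theta W)$, the direct effect is the mean of its diagonal, the total effect is the mean of its row sums, and the indirect (spillover) effect is the difference. The function name, the toy weight matrix, and the coefficient values are illustrative and are not the paper's estimates.

```python
import numpy as np

def sdm_effects(w, rho, beta, theta):
    """Direct, indirect, and total effects of one regressor in an SDM (assumed method)."""
    w = np.asarray(w, dtype=float)
    n = w.shape[0]
    identity = np.eye(n)
    # Effects matrix S = (I - rho W)^{-1} (beta I + theta W).
    s = np.linalg.solve(identity - rho * w, beta * identity + theta * w)
    direct = np.trace(s) / n      # average own-region effect
    total = s.sum() / n           # average effect summed over all regions
    return direct, total - direct, total

# Toy row-standardized weight matrix for three mutually adjacent regions.
w_toy = np.array([[0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0]])
direct, indirect, total = sdm_effects(w_toy, rho=0.2, beta=-0.1, theta=-0.3)
```

With a row-standardized weight matrix the total effect reduces to $(\beta + \theta)/(1 - \rho)$, which gives a quick check on any implementation.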

4.4.2 Robustness test

The robustness test of this paper has two parts. The first is the robustness of the model. On the one hand, the model selection process in Table 4 in section 3.2.4 indicates that the model selected in this paper is appropriate. On the other hand, to further assess the credibility of the conclusions, both OLS and the SDM with individual and time fixed effects are used for testing. The second is the robustness of the spatial matrix, for which the economic distance matrix is used. The test results are shown in Table 8 . According to Table 8 , IE has a significant reduction effect on CE (in all cases at the 0.01 level of significance), indicating that the previous results are robust.

Table 8 Results of the robustness test.

4.5 Regional heterogeneity analysis

Table 9 shows the spatial econometric regression results for the eight comprehensive economic zones, based on the time-fixed SDM selected above. In the Northern Coastal Economic Zone, IE has a reducing effect on CE , but the results are not significant. In the Northeast, Southern Coastal, and Southwest Economic Zones, IE has a significant reducing effect on CE (coefficients of -0.220, -0.092, and -0.308), and the effect is even stronger when spatial effects are taken into account (-0.344, -0.118, and -0.724). In the Eastern Coastal Economic Zone and the Middle Yangtze River Economic Zone, IE has a significant increasing effect on CE , which is no longer significant when spatial effects are considered. In the Middle Yellow River Economic Zone, IE significantly increases CE (3.890), and the effect is stronger when spatial effects are considered (11.668). In the Northwest Economic Zone, IE increases CE when spatial effects are considered (1.947). In addition, different control variables have different effects in different regions.

Table 9 The spatial econometric regression results of the eight comprehensive economic zones.

5 Discussion

5.1 Discussion of results

The main contributions of this article are reflected in the following aspects. First, it measures the integrated economy with a reasonable method and indicator system, filling a gap in the existing literature. Second, it incorporates the integrated economy and carbon emissions into the same theoretical framework, deepening theoretical research on the low-carbon economy. Finally, it analyzes the current situation of, and the inherent relationship between, the integrated economy and carbon emissions from a spatial perspective, deepening related research in spatial economics. The discussion of the test results is therefore organized in three parts: (1) the measured results of the integrated economy and carbon emissions, including the temporal evolution, spatial distribution, and spatial correlation of IE and CE ; (2) the test results for the impact of the integrated economy on carbon emissions; and (3) the regional heterogeneity of that impact across the eight economic regions.

5.1.1 In-depth discussion of measurement results

This section discusses the IE and CE measurement results in three parts: the IE results measured by the coupled coordination model, the evolutionary characteristics of IE and CE, and the spatial autocorrelation of IE and CE.

5.1.1.1 Measurement results of the integrated economy

Table 5 shows the measurement results of the integrated economy. Firstly, the IE grades of China’s 30 provinces show an upward trend during the study period, with an overall shift from A2 to A4, which is consistent with the findings of Zhang et al. (2022). This indicates that China’s economy still maintains a high level of growth, and that IE, formed by the coupling and coordination of the digital economy and the real economy, has become a new economic form. This is related to China’s policy since 2015 of focusing on the real economy while vigorously developing the digital economy. This paper constructs an indicator system that measures the development level of the real economy from three aspects (scale, structure, and development potential), which differs from the measurements used by Zhang et al. (2022) and Shi and Sun (2023) and has some innovative and practical value. Secondly, the level of IE varies among the 30 provinces. Guangdong has been highly integrated for a long time, and the leading provinces of Beijing, Jiangsu, and Zhejiang are second only to Guangdong. Notably, the IE of Hainan, Qinghai, Ningxia, and Tibet has remained in an unbalanced or low-integration state, with a large gap between their integration levels and those of other provinces. These regional differences are related to China’s policy preferences: economic development started in the eastern coastal region and penetrated from east to west (Chen, 2022). Thus Guangdong, Beijing, Jiangsu, and Zhejiang have higher levels of integrated economic development than Qinghai, Ningxia, and Tibet in the west. This suggests that China should make full use of the penetration effect of the eastern region in policy formulation to reduce regional differences.
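For readers less familiar with the coupled coordination model, a commonly used two-system form can be sketched as follows. The functional form, the equal weights `alpha` and `beta`, and the input scores are assumptions made for illustration; they are not the specification or data used in this paper, and the mapping from the resulting degree to grades such as A2 or A4 is not reproduced here.

```python
import math


def coupling_coordination(u_digital: float, u_real: float,
                          alpha: float = 0.5, beta: float = 0.5) -> float:
    """Return a coupling coordination degree D for two subsystem scores in (0, 1].

    u_digital and u_real are hypothetical composite scores for the digital
    economy and the real economy; alpha and beta are assumed weights.
    """
    if u_digital <= 0 or u_real <= 0:
        raise ValueError("subsystem scores must be positive")
    # Coupling degree C: equals 1 when the two scores are identical.
    c = 2 * math.sqrt(u_digital * u_real) / (u_digital + u_real)
    # Comprehensive development index T: weighted average of the two scores.
    t = alpha * u_digital + beta * u_real
    # D combines how well matched (C) and how developed (T) the two systems are.
    return math.sqrt(c * t)
```

Under these assumptions, two balanced scores of 0.5 give a degree of about 0.707, while scores of 0.2 and 0.8 (same average, poorer match) give about 0.632, which illustrates why an unbalanced province can rank lower even with a strong subsystem.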

5.1.1.2 The time evolution and spatial distribution of IE and CE

Figure 2 shows how the national means of IE and CE changed from 2011 to 2021. IE shows an increasing trend overall (decreasing after 2019), and CE shows a slow decreasing trend. The fluctuation of IE in 2019 is related to the impact of the COVID-19 pandemic on the development of the real economy (Takyi et al., 2023). As China’s attention to carbon and emission reduction continues to increase and policy pilots continue to expand, carbon emissions are expected to keep trending downward (Feng et al., 2024). However, given China’s large energy consumption base, carbon emissions will only decline slowly.

Figures 3 and 4 show the spatial distribution of IE in 2011 and 2021. Comparing the two figures shows that, from 2011 to 2021, the level of IE in each province increased and the inter-provincial gap narrowed. This suggests that China’s policy initiatives for IE have achieved some success and that the digital economy can effectively reduce regional disparities (Zhou et al., 2023), which is consistent with the findings of Zhang et al. (2022). The spatial distribution of IE shows a trend of “decreasing from the east to the west”, with regional disparities narrowing over time, which is similar to the results of Wu et al. (2023). This is related to China’s long-standing policy bias: the eastern coastal cities are developed, the western region is economically backward, and environmental and geographic factors strongly impede economic development there, so the regional distribution of most economic forms decreases from east to west. The results of this paper reveal the spatial evolution trend of IE and support the important role of the digital economy in narrowing regional gaps and promoting high-quality development.

Figures 5 and 6 show the spatial distribution of CE in China in 2011 and 2021. The comparison shows that the national distribution of China’s CE has changed from polarization (i.e., a large gap between CE in the north and south in 2011) to a trend of concentration and diffusion (i.e., a smaller gap between CE in the north and south in 2021, with regional agglomeration). In 2011, China’s industrial layout was dominated by heavy industries in the north and light industries and services in the south, with higher carbon emissions in the north. By 2021, after 10 years of industrial transformation and the application of decarbonization technologies, carbon emissions in the north were lower, and the CE gap between the north and the south gradually narrowed. The results of this study are similar to those of Wang et al. (2014), but this paper additionally reveals the trend and characteristics of CE, which provides a reference for understanding the current situation of CE in China’s provinces.

5.1.1.3 Spatial autocorrelation of IE and CE

Figure 7 plots the trend of the global Moran’s index of IE from 2011 to 2021. First, IE has strong spatial autocorrelation (i.e., places with strong IE tend to be positively clustered and vice versa), which is consistent with the findings of Zhang et al. (2022). IE is a new economic form with strong industrial agglomeration characteristics. Related industries cluster to exploit economies of scale; an Internet industry cluster, for example, can make full use of shared infrastructure and local knowledge spillovers. Second, CE shows significant positive spatial correlation in all years except 2013-2015, indicating that carbon emissions also have spatial agglomeration characteristics. Because energy consumption is closely related to industrial layout, high-carbon industries also tend to cluster to exploit economies of scale. This is similar to the findings of Zhang et al. (2024). Finally, the spatial correlation of IE is much higher than that of CE, indicating that economic effects are more likely than environmental effects to form spatial agglomeration, because economic activities are more affected by distance while environmental pollution spreads more easily. This study reveals the important role of economic effects in regional agglomeration theory and also shows that environmental pollution can form regional agglomeration as it diffuses to surrounding areas, enriching relevant theoretical research.
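To make the global Moran’s index concrete, the sketch below computes it for a toy example. The function name, the four-region chain of neighbors, and the values are hypothetical and chosen only for illustration; the paper’s own spatial weight matrix and provincial data are not reproduced.

```python
def global_morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    s0 = sum(sum(row) for row in weights)  # sum of all spatial weights
    # Cross-product term: co-movement of each region with its neighbors.
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    variance_term = sum(d * d for d in z)
    return (n / s0) * cross / variance_term


# Hypothetical layout: four regions in a row, each adjacent to the next.
CHAIN_WEIGHTS = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
```

With these weights, the clustered values `[1, 1, 5, 5]` give an index of 1/3 (positive autocorrelation), whereas the alternating values `[1, 5, 1, 5]` give -1 (neighbors are systematically dissimilar).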

Figures 8 and 9 show the local Moran’s index results for IE and CE in 2011 and 2021. From 2011 to 2021, the IE of the 30 provinces is mainly concentrated in the first and third quadrants. This is because regions with higher IE levels have a diffusion effect on their neighbors, promoting IE in the surrounding provinces and forming ‘high-high agglomeration’, while places with lower IE levels are not pulled along by the leading provinces and find it difficult to move up the hierarchy, thus forming ‘low-low agglomeration’. This is consistent with the findings of Zhang et al. (2022). However, this study finds that this two-way agglomeration may widen the East-West regional gap and exacerbate the Matthew effect, and it is hoped that ‘low-low agglomeration’ can be reduced through effective policy instruments. The Moran scatterplot of CE for the 30 provinces in 2011 is mainly concentrated in the second and third quadrants, with more provinces in the third quadrant. This suggests that ‘mixed agglomeration’ and ‘low-low agglomeration’ co-existed in 2011, and that agglomeration was more pronounced in places with lower carbon emissions. In 2021, the Moran scatterplot of CE is mainly concentrated in the third quadrant, mostly in the south, and the number of low-carbon emission areas has increased and continues to agglomerate. This is related to the implementation of low-carbon pilot policies (Feng et al., 2024). Unlike previous spatial agglomeration analyses of carbon emissions, this study can demonstrate the impact of policy preferences on carbon emissions; for example, taking developed coastal areas (Zhejiang, Shanghai, Jiangsu, etc.) as pilot areas for low-carbon policies can effectively reduce carbon emissions locally and in neighboring regions.
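The quadrant language used for the Moran scatterplot can be illustrated with a short sketch that labels each region by its own deviation from the mean and the average deviation of its neighbors. The function name, the toy chain of four regions, and the rule that a zero deviation counts as "low" are assumptions for illustration only, not the paper’s procedure.

```python
def moran_quadrants(values, weights):
    """Label regions by Moran scatterplot quadrant.

    Returns one label per region: "HH" (first quadrant), "LH" (second),
    "LL" (third) or "HL" (fourth). The first letter is the region's own
    level, the second letter is the level of its neighbors.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    labels = []
    for i in range(n):
        row_sum = sum(weights[i])
        # Spatial lag: row-standardized weighted average of neighbor deviations.
        lag = (sum(weights[i][j] * z[j] for j in range(n)) / row_sum
               if row_sum else 0.0)
        own = "H" if z[i] > 0 else "L"  # zero deviation is treated as low
        neighbors = "H" if lag > 0 else "L"
        labels.append(own + neighbors)
    return labels
```

For four regions in a row (each adjacent to the next), the values `[1, 2, 8, 9]` are labeled `LL, LL, HH, HH`, the two-way agglomeration pattern described for IE, while `[1, 9, 1, 9]` yields only the mixed `LH` and `HL` labels.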

5.1.2 In-depth discussion of the impact of IE on CE

5.1.2.1 Spatial modeling regression results

The results of the spatial effect test of IE on CE are shown in Table 6. Firstly, the spatial autoregressive coefficient is negative and significant. This indicates that the more concentrated the areas with large carbon emissions are, the more favorable conditions are for centralized management, and the easier it is to reduce carbon emission intensity. Secondly, IE has an obvious inhibitory effect on CE, and the inhibitory effect is stronger when the spatial spillover effect is considered. IE can thus give full play to the clean production characteristics of the digital economy and make the real economy greener, which is similar to the findings of Wu et al. (2023). The impact of IE has spatial spillovers, i.e., the development of local IE can effectively reduce carbon emissions in neighboring areas. Unlike previous studies, this paper explores the carbon reduction effect of IE from a spatial perspective, aiming to propose feasible regional policies. Finally, control variables such as TI, RC, and GG have a significant reducing effect on CE when the spatial matrix is considered (coefficients of -0.242, -0.081, and -0.016, respectively). This is an important finding that distinguishes this paper from other studies. Policymakers should therefore fully consider the coordination and linkage among technological innovation, regional coordination, green development policies, and IE to help reduce carbon emissions.

5.1.2.2 Spatial Spillover Effect Decomposition

The decomposition results of the spatial spillover effects are shown in Table 7. The direct and indirect (spatial spillover) effects of IE on CE are both significant, and the indirect inhibitory effect on CE is stronger than the direct effect. That is, IE in both the local region and adjacent areas reduces CE, and IE in adjacent areas has the stronger effect. The development of local IE has a demonstration effect on neighboring areas, prompting them to vigorously promote IE and thereby reducing their CE. These results can inform the formulation of regional development policies.
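Direct and indirect effects of this kind are commonly obtained from the partial-derivative decomposition of an SDM, in which the effects matrix is S = (I - rho W)^(-1) (beta I + theta W). The sketch below assumes that standard form; the function names, the two-region weight matrix, and the coefficient values are hypothetical and are not the estimates reported in Table 7.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def sdm_effects(rho, beta, theta, w):
    """Return (direct, indirect, total) average effects of one regressor."""
    n = len(w)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    a = [[eye[i][j] - rho * w[i][j] for j in range(n)] for i in range(n)]
    b = [[beta * eye[i][j] + theta * w[i][j] for j in range(n)] for i in range(n)]
    a_inv = invert(a)
    # Effects matrix S: entry (i, j) is the response of region i to a change in region j.
    s = [[sum(a_inv[i][k] * b[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    direct = sum(s[i][i] for i in range(n)) / n  # average own-region effect
    total = sum(sum(row) for row in s) / n       # average row sum
    return direct, total - direct, total
```

With two mutually adjacent regions and the hypothetical values rho = -0.2, beta = -0.3, and theta = -0.5, the sketch gives a direct effect of about -0.208 and an indirect effect of about -0.458, so the spillover term dominates. This mirrors only the qualitative pattern described above, not the paper’s numbers.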

5.1.3 In-depth discussion of regional heterogeneity

The spatial econometric regression results for the eight comprehensive economic zones, based on the time-fixed SDM model selected above, are shown in Table 9. Firstly, IE in the Northeast, Southern Coastal, and Southwest Comprehensive Economic Zones significantly reduces CE (similar to the results of Shi and Sun, 2023), and the carbon emission reduction effect is stronger after considering the spatial spillover effect. The Northeast Economic Zone is an old industrial base with a larger carbon emission base, so IE has a stronger emission reduction effect there. The Southern Coastal Economic Zone has a more developed digital economy, which has a double carbon reduction effect. The Southwest Comprehensive Economic Zone has a stronger carbon sink capacity, which can amplify the effect of IE to a large extent. Secondly, IE clearly increases CE in the Middle Reaches of the Yangtze River and the Middle Reaches of the Yellow River Comprehensive Economic Zones, which may be because the integrated economies of these two regions are currently dominated by high-carbon manufacturing and only supplemented by digital intelligent manufacturing. Notably, when the spatial effect is considered, the increasing effect of IE on CE is more pronounced in the Middle Reaches of the Yellow River Comprehensive Economic Zone. Finally, the effect of IE on CE in the North Coastal Comprehensive Economic Zone is negative and insignificant, which may be due to the uneven development of its provinces. IE in the East Coast Comprehensive Economic Zone increases CE, but the effect is not strong and is not significant when spatial spillover effects are considered. When the spatial effect is considered, IE in the Greater Northwest Comprehensive Economic Zone significantly increases CE, which may be due to the incomplete construction of digital infrastructure in the Northwest. These results show that the effect of IE on CE differs across China’s eight economic regions and that IE does not lower CE in every region. Policies should therefore be formulated according to the characteristics of regional development to avoid IE increasing CE. In addition, the control variables have different effects in different regions, which also contributes to the differing effects of IE on CE across the eight economic zones; this result has important implications for the harmonization of regional policies.

5.2 Policy implications

Based on these findings and discussion, this paper offers the following policy implications. These policy insights, combined with regional development characteristics and the findings of this paper, can provide a reference for policymakers to effectively reduce carbon emissions and achieve green and high-quality development.

(1) Continuously strengthen investment in digital infrastructure. The construction of new digital infrastructure such as 5G, data centers, artificial intelligence, the Internet of Things, and the industrial Internet should be accelerated in all provinces, especially in the western provinces, so as to build a firm foundation for the development of an integrated economy and promote the interconnection of the digital economy and the real economy. The real economy can then be transformed and upgraded through intelligent and collaborative new modes of production, and the divide in the development of the integrated economy can be reduced with the help of the digital economy dividend.

(2) Give full play to the spatial spillover effect of the integrated economy to reduce carbon emissions. First, the development advantages of the leading provinces, such as Beijing, Shanghai, and Jiangsu, should be used to establish ‘demonstration zones’ for the integrated development of the digital economy and the real economy, so as to form a diffusion effect in which the center drives the development of the surrounding regions. Second, the central region should cooperate fully with the east, absorb the spillover from the east, and realize a new pattern of regional green development. Finally, the disadvantaged western provinces should make full use of the ‘One Belt and One Road’ and ‘Western Development’ strategies to reduce the spatial spillover effect of the disadvantaged regions and embark on the road of ecological protection and green development based on their resource endowments and environmental characteristics.

(3) ‘Tailor-made’ regional economic development policies. Differentiated macroeconomic control policies should be implemented according to the actual situation of each economic zone, with different high-quality development policies focused on promoting integrated economic development and carbon emission reduction. On the one hand, the construction of a digital economy should be encouraged in the Northeast, Southern Coastal, and Southwest Comprehensive Economic Zones, so as to promote industrial integration through the digital economy and effectively reduce carbon emissions. On the other hand, industrial modernization should be strengthened in the Middle Reaches of the Yangtze River and the Middle Reaches of the Yellow River Comprehensive Economic Zones, reducing the proportion of high-energy-consuming industries in the integrated economy and lowering carbon emissions within these zones. In addition, regional economic development strategies should be formulated within an overall national framework to reduce differences in the integrated economy.

5.3 Research limitations

Taking China as the research object, this study analyzes the spatial impact of the integrated economy on carbon emissions using data from 30 provinces from 2011 to 2021. However, it has certain limitations in terms of variable selection, data collection, and research object, which need to be addressed in subsequent studies. First, this study examined the spatial impact of the integrated economy on carbon emissions at the national level but did not consider intermediate mechanisms. Future studies should analyze in depth the intrinsic mechanisms through which the integrated economy acts on carbon emissions. Second, because of limited data availability, the calculated results may not accurately represent the variables. Future research should therefore improve the data to enhance the accuracy and completeness of variable measurement. Finally, this study analyzed the regional heterogeneity of the impact of IE on CE across eight comprehensive economic zones using provincial data but did not consider the city, county, and district levels. Subsequent studies could focus on these finer spatial scales.

6 Conclusions

This paper takes 30 provinces in mainland China as the research subjects (Hong Kong Special Administrative Region, Macao Special Administrative Region, Taiwan, and Tibet Autonomous Region are excluded from the study due to data acquisition problems). Based on panel data from 2011 to 2021, it analyzes the spatial characteristics of the impact of the integrated economy on carbon emissions using principal component analysis, the coupled coordination degree model, the Moran index, and spatial econometrics. The contributions of this article are as follows. Firstly, using reasonable methods and indicator systems to measure the integrated economy fills a gap in the existing literature on its measurement. Secondly, incorporating the integrated economy and carbon emissions into the same theoretical framework deepens theoretical research on the low-carbon economy. Finally, analyzing the current situation of, and the inherent relationship between, the integrated economy and carbon emissions from a spatial perspective deepens relevant research in spatial economics. The main conclusions of the study are as follows.

(1) Characterizing the spatial and temporal evolution of the integrated economy and carbon emissions. Over the study period, the integrated economy showed a yearly increase while carbon emissions showed a yearly decrease. The spatial distribution of IE shows a trend of ‘decreasing from east to west’, with regional differences narrowing over time. China’s CE shows a distribution of ‘low in the south and high in the north’, with low-carbon areas continuing to spread from south to north.

(2) Analyzing the spatial correlation between the integrated economy and carbon emissions. At the global (national) level, both the integrated economy and carbon emissions have significant positive spatial correlations. At the local level, the integrated economy is mainly characterized by ‘high-high agglomeration’ and ‘low-low agglomeration’, while carbon emissions are characterized by ‘low-low agglomeration’.

(3) Exploring the spatial impact of the integrated economy on carbon emissions. Using the time-fixed SDM model, it is found that the integrated economy has a significant negative effect on carbon emissions and that the negative effect is even stronger when spatial spillover effects are considered. The result still holds under multiple robustness tests. This suggests that the integrated economy has a strong spatial effect and can effectively reduce carbon emissions in China.

(4) Discussing the spatial heterogeneity of the impact of the integrated economy on carbon emissions. The impact of the integrated economy on carbon emissions varies from one comprehensive economic zone to another. The integrated economy of the Northeast, Southern Coastal, and Southwest Comprehensive Economic Zones can significantly reduce carbon emissions. The integrated economy clearly increases carbon emissions in the Middle Reaches of the Yangtze River and the Middle Reaches of the Yellow River Comprehensive Economic Zones. The effect of the integrated economy on carbon emissions in the North Coastal Comprehensive Economic Zone is negative and insignificant. The integrated economy of the East Coast Comprehensive Economic Zone has an increasing effect on carbon emissions, but the effect is not strong.

(5) Providing insights for policy development. First, investment in digital infrastructure should be continuously strengthened: accelerate the construction of new digital infrastructure in all provinces, especially in the western provinces, and promote the interconnection of the digital economy with the real economy. Second, the spatial spillover effect of the integrated economy should be used to reduce carbon emissions: build on the development advantages of leading provinces such as Beijing, Shanghai, and Jiangsu, and establish ‘demonstration zones’ for the integrated development of the digital economy and the real economy, so that the center can drive the development of the surrounding areas. Finally, regional economic development policies should be ‘tailor-made’: implement differentiated macro-control policies based on the actual situation of each economic zone, and implement different high-quality development policies around promoting integrated economic development and carbon emission reduction.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions

YW: Funding acquisition, Methodology, Resources, Supervision, Writing – review & editing. QK: Conceptualization, Data curation, Methodology, Resources, Validation, Writing – original draft, Writing – review & editing. SL: Data curation, Methodology, Writing – review & editing, Funding acquisition.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by the National Natural Science Foundation of China project “Research on the Mechanism, Dynamic Evaluation and Implementation Path of Environmental Protection and Industrial Collaborative Development in the Yellow River Basin” [grant number 72273103], the Major Project of the Key Research Base of Humanities and Social Sciences of the Ministry of Education “The Integration Path and Policy of Digital Economy and Real Economy in Western China” [grant number 22JJD790063], and the Shaanxi Province Philosophy and Social Science Research Special Youth Program “Research on the Constraints and Promotion Mechanisms of the Integration and Development of the Digital Economy and the Real Economy in Shaanxi Province” [grant number 2024QN341].

Acknowledgments

We thank the reviewers and editors for their valuable suggestions, which have greatly helped to improve the paper.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fevo.2024.1374724/full#supplementary-material

References

Ahmadi M. H., Madvar M. D., Sadeghzadeh M., Rezaei M. H., Herrera M., Shamshirband S. (2019). Current status investigation and predicting carbon dioxide emission in Latin American countries by connectionist models. Energies 12. doi: 10.3390/en12101916

Case A. C., Rosen H. S., Hines J. R. (1993). Budget spillovers and fiscal policy interdependence: Evidence from the states. J. Public Economics 52, 285–307. doi: 10.1016/0047-2727(93)90036-S

Chang H., Ding Q., Zhao W., Hou N., Liu W. (2023). The digital economy, industrial structure upgrading, and carbon emission intensity – empirical evidence from China’s provinces. Energy Strategy Rev. 50. doi: 10.1016/j.esr.2023.101218

Chen P. (2022). Is the digital economy driving clean energy development? -New evidence from 276 cities in China. J. Cleaner Production 372, 133783. doi: 10.1016/j.jclepro.2022.133783

Chen H., Yi J., Chen A., Peng D., Yang J. (2023). Green technology innovation and CO2 emission in China: Evidence from a spatial-temporal analysis and a nonlinear spatial Durbin model. Energy Policy 172. doi: 10.1016/j.enpol.2022.113338

Cheng Z., Li L., Liu J., Zhang H. (2018). Total-factor carbon emission efficiency of China’s provincial industrial sector and its dynamic evolution. Renewable Sustain. Energy Rev. 94, 330–339. doi: 10.1016/j.rser.2018.06.015

Cheng Y., Zhou X., Li Y. (2023). The effect of digital transformation on real economy enterprises’ total factor productivity. Int. Rev. Economics Finance 85, 488–501. doi: 10.1016/j.iref.2023.02.007

Dai D., Li K., Zhao S., Zhou B. (2022). Research on prediction and realization path of carbon peak of construction industry based on EGM-BP model. Front. Energy Res. 10. doi: 10.3389/fenrg.2022.981097

Fatima S., Desouza K. C., Dawson G. S. (2020). National strategic artificial intelligence plans: A multi-dimensional analysis. Econ Anal. Policy 67, 178–194. doi: 10.1016/j.eap.2020.07.008

Feng Y., Li L., Chen H. (2023a). Carbon emission reduction effect of digital infrastructure: from the “Broadband China” Strategy. Ecol. Chem. Eng. S-Chema I Inzynieria Ekologicza S 30, 283–289. doi: 10.2478/eces-2023-0030

Feng Z., Song D., Xie W. (2023b). Digital economy helps realize the ‘double carbon’ goal: basic approaches, internal mechanisms and action strategies. J. Beijing Normal University (Social Sciences) (01), 52–61.

Feng X., Zhao Y., Yan R. (2024). Does carbon emission trading policy has emission reduction effect?—An empirical study based on quasi-natural experiment method. J. Environ. Manage. 351, 119791. doi: 10.1016/j.jenvman.2023.119791

Gao P., Yue S., Chen H. (2021). Carbon emission efficiency of China’s industry sectors: From the perspective of embodied carbon emissions. J. Cleaner Prod. 283, 124655. doi: 10.1016/j.jclepro.2020.124655

Granados N., Gupta A. (2013). Transparency strategy: competing with information in a digital world. MIS Q. 37, 637–641. doi: 10.5555/2535658.2535676

Guan Y., Shan Y., Huang Q., Chen H., Wang D., Hubacek K. (2021). Assessment to China’s recent emission pattern shifts. Earth’s Future 9, e2021EF002241. doi: 10.1029/2021EF002241

He W., Zhang K., Kong Y., Yuan L., Peng Q., Mulugeta Degefu D., et al. (2023). Reduction pathways identification of agricultural water pollution in Hubei Province, China. Ecol. Indic. 153, 110464. doi: 10.1016/j.ecolind.2023.110464

Hong Y., Ren B. (2023). Connotation and approach of deep integration of the digital economy and the real economy. China Ind. Economics (02), 5–16. doi: 10.19581/j.cnki.ciejournal.2023.02.001

Jiang H., Chen Z., Liang Y., Zhao W., Liu D., Chen Z. (2023). The impact of industrial structure upgrading and digital economy integration on China’s urban carbon emissions. Front. Ecol. Evol. 11. doi: 10.3389/fevo.2023.1231855

Jiang S., Sun Y. (2020). An empirical study on the effect of digital economy on real economy. Sci. Res. Manage. 41, 32–39.

Klippert M., Marthaler F., Spadinger M., Albers A. (2020). Industrie 4.0 – An empirical and literature-based study how product development is influenced by the digital transformation. Proc. CIRP 91, 80–86. doi: 10.1016/j.procir.2020.02.152

Li J., Huang X., Yang H., Chuai X., Li Y., Qu J., et al. (2016). Situation and determinants of household carbon emissions in Northwest China. Habitat Int. 51, 178–187. doi: 10.1016/j.habitatint.2015.10.024

Li Z., Zhou Q. (2021). Research on the spatial effect and threshold effect of industrial structure upgrading on carbon emissions in China. J. Water Climate Change 12, 3886–3898. doi: 10.2166/wcc.2021.216

Liu Z., Liu H., Lang W., Fang S., Chu C., He F. (2022b). Scaling law reveals unbalanced urban development in China. Sustain. Cities Soc. 87. doi: 10.1016/j.scs.2022.104157

Liu Y., Tan H., Chen X., Yang C. (2022a). Research on the impact of the digital economy on the investment efficiency of the real economy. China Soft Sci. (10), 20–29.

Liu H., Wang L., Shen Y. (2023). Can digital technology reduce carbon emissions? Evidence from Chinese cities. Front. Ecol. Evol. 11. doi: 10.3389/fevo.2023.1205634

Liu Y., Zheng M., Shum W. Y. (2024). On the linkages between digital finance and real economy in China: A cointegration analysis. Innovation Green Dev. 3, 100109. doi: 10.1016/j.igd.2023.100109

Lopes de Sousa Jabbour A. B., Chiappetta Jabbour C. J., Choi T.-M., Latan H. (2022). ‘Better together’: Evidence on the joint adoption of circular economy and industry 4.0 technologies. Int. J. Production Economics 252, 108581. doi: 10.1016/j.ijpe.2022.108581

Meng Z., Wang H., Wang B. (2018). Empirical analysis of carbon emission accounting and influencing factors of energy consumption in China. Int. J. Environ. Res. Public Health 15 (11), 2467. doi: 10.3390/ijerph15112467

Peng J., Chen H., Jia L., Fu S., Tian J. (2023). Impact of digital industrialization on the energy industry supply chain: evidence from the natural gas industry in China. Energies 16 (4), 1564. doi: 10.3390/en16041564

Qi Y., Yang Y., Jin F. (2013). China’s economic development stage and its spatio-temporal evolution: A prefectural-level analysis. J. Geographical Sci. 23, 297–314. doi: 10.1007/s11442-013-1011-0

Shan Y., Guan D., Zheng H., Ou J., Li Y., Meng J., et al. (2018). China CO2 emission accounts 1997–2015. Sci. Data 5, 170201. doi: 10.1038/sdata.2017.201

Shan Y., Huang Q., Guan D., Hubacek K. (2020). China CO2 emission accounts 2016–2017. Sci. Data 7, 54. doi: 10.1038/s41597-020-0393-y

Shan Y., Liu J., Liu Z., Xu X., Shao S., Wang P., et al. (2016). New provincial CO2 emission inventories in China based on apparent energy consumption data and updated emission factors. Appl. Energy 184, 742–750. doi: 10.1016/j.apenergy.2016.03.073

Shao S., Li X., Cao J., Yang L. (2016). China’s economic policy choices for governing smog pollution-based on spatial spillover effects. Econ Res. J. 51, 73–88.

Shi D., Sun G. (2023). The influence of the integration of digital economy and real economy on green innovation. Reform (02), 1–13.

Sun G., Fang J., Li J., Wang X. (2024). Research on the impact of the integration of digital economy and real economy on enterprise green innovation. Technol Forecasting Soc. Change 200, 123097. doi: 10.1016/j.techfore.2023.123097

Takyi P. O., Dramani J. B., Akosah N. K., Aawaar G. (2023). Economic activities’ response to the COVID-19 pandemic in developing countries. Sci. Afr. 20, e01642–e01642. doi: 10.1016/j.sciaf.2023.e01642

Tang K., Yang G. (2023). Does digital infrastructure cut carbon emissions in Chinese cities? Sustain. Prod. Consumption 35, 431–443. doi: 10.1016/j.spc.2022.11.022

Tian G., Yu S., Wu Z., Xia Q. (2022). Study on the emission reduction effect and spatial difference of carbon emission trading policy in China. Energies 15, 1921. doi: 10.3390/en15051921

Tong X. (2020). The spatiotemporal evolution pattern and influential factor of regional carbon emission convergence in China. Adv. Meteorol. 2020. doi: 10.1155/2020/4361570

Wang J., Cai B., Zhang L., Cao D., Liu L., Zhou Y., et al. (2014). High resolution carbon dioxide emission gridded data for China derived from point sources. Environ. Sci. Technol. 48, 7085–7093. doi: 10.1021/es405369r

Wang G., Feng Y. (2024). Analysis of carbon emission drivers and peak carbon forecasts for island economies. Ecol. Model. 489, 110611. doi: 10.1016/j.ecolmodel.2023.110611

Wang J., Wu H., Chen Y. (2020). Made in China 2025 and manufacturing strategy decisions with reverse QFD. Int. J. Production Economics 224, 107539. doi: 10.1016/j.ijpe.2019.107539

Wang H., Wu D. L., Zeng Y. M. (2023). Digital economy, market segmentation and carbon emission performance. Environ. Dev. Sustainabil. doi: 10.1007/s10668-023-03465-w

Wu R., Hua X., Peng L., Liao Y., Yuan Y. (2022). Nonlinear effect of digital economy on carbon emission intensity—Based on dynamic panel threshold model. Front. Environ. Sci. 10. doi: 10.3389/fenvs.2022.943177

Wu T., Peng Z., Yi Y., Chen J. (2023). The synergistic effect of digital economy and manufacturing structure upgrading on carbon emissions reduction: Evidence from China. Environ. Sci. Pollut. Res. 30, 87981–87997. doi: 10.1007/s11356-023-28484-y

Xu J. (2023). Study on spatiotemporal distribution characteristics and driving factors of carbon emission in Anhui Province. Sci. Rep. 13 (1), 14400. doi: 10.1038/s41598-023-41507-5

Xu Q., Dong Y.-x., Yang R., Zhang H.-o., Wang C.-j., Du Z.-w. (2019). Temporal and spatial differences in carbon emissions in the Pearl River Delta based on multi-resolution emission inventory modeling. J. Cleaner Production 214, 615–622. doi: 10.1016/j.jclepro.2018.12.280

Xu B., Li E., Zheng H., Sang F., Shi P. (2017). The remanufacturing industry and its development strategy in China. Strategic Study CAE 19, 61–65.

Xu G., Lu T., Liu Y. (2021). Symmetric reciprocal symbiosis mode of China’s digital economy and real economy based on the logistic model. Symmetry-Basel 13 (7), 1136. doi: 10.3390/sym13071136

Xu A., Song M., Wu Y., Luo Y., Zhu Y., Qiu K. (2024). Effects of new urbanization on China’s carbon emissions: A quasi-natural experiment based on the improved PSM-DID model. Technol Forecasting Soc. Change 200, 123164. doi: 10.1016/j.techfore.2023.123164

Xu L. D., Xu E. L., Li L. (2018). Industry 4.0: state of the art and future trends. Int. J. Prod. Res. 56, 2941–2962. doi: 10.1080/00207543.2018.1444806

Yi Y., Cheng R., Wang H., Yi M., Huang Y. (2023). Industrial digitization and synergy between pollution and carbon emissions control: new empirical evidence from China. Environ. Sci. Pollut. Res. 30, 36127–36142. doi: 10.1007/s11356-022-24540-1

Yuan L., Qi Y., He W., Wu X., Kong Y., Ramsey T. S., et al. (2024). A differential game of water pollution management in the trans-jurisdictional river basin. J. Cleaner Prod. 438, 140823. doi: 10.1016/j.jclepro.2024.140823

Zha Q., Liu Z., Wang J. (2023). Spatial pattern and driving factors of synergistic governance efficiency in pollution reduction and carbon reduction in Chinese cities. Ecol. Indic. 156. doi: 10.1016/j.ecolind.2023.111198

Zhang C., Fang J., Ge S., Sun G. (2024). Research on the impact of enterprise digital transformation on carbon emissions in the manufacturing industry. Int. Rev. Economics Finance 92, 211–227. doi: 10.1016/j.iref.2024.02.009

Zhang G., Wang T., Lou Y., Guan Z., Zheng H., Li Q., et al. (2022a). Research on China’s provincial carbon emission peak path based on a LSTM neural network approach. Chin. J. Manage. Sci. 11, 1–12. doi: 10.16381/j.cnki.issn1003-207x.2022.0097

Zhang L., Mu R., Zhan Y., Yu J., Liu L., Yu Y., et al. (2022b). Digital economy, energy efficiency, and carbon emissions: Evidence from provincial panel data in China. Sci. Total Environ. 852. doi: 10.1016/j.scitotenv.2022.158403

Zhang S., Wu Z., Lu Z., Zhang N. (2022c). Spatio-temporal evolution characteristics and driving factors of the integration between digital economy and real economy in China. Econ. Geogr. 42, 22–32.

Zhao B., Sun L., Qin L. (2022). Optimization of China’s provincial carbon emission transfer structure under the dual constraints of economic development and emission reduction goals. Environ. Sci. Pollut. Res. 29, 50335–50351. doi: 10.1007/s11356-022-19288-7

Zhao T., Zhang Z., Liang S. (2020). Digital economy, entrepreneurship, and high-quality economic development: empirical evidence from urban China. J. Manage. World 36, 65–76. doi: 10.19744/j.cnki.11-1235/f.2020.0154

Zhao X. G., Zhu J. (2022). Impacts of two-way foreign direct investment on carbon emissions: from the perspective of environmental regulation. Environ. Sci. Pollut. Res. 29, 52705–52723. doi: 10.1007/s11356-022-19598-w

Zhou X., Du M., Dong H. (2023). Spatial and temporal effects of China’s digital economy on rural revitalization. Front. Energy Res. 11. doi: 10.3389/fenrg.2023.1061221

Keywords: integrated economy, carbon emissions, digital economy, real economy, spatial effect, China

Citation: Wang Y, Ke Q and Lei S (2024) The spatial effect of integrated economy on carbon emissions in the era of big data: a case study of China. Front. Ecol. Evol. 12:1374724. doi: 10.3389/fevo.2024.1374724

Received: 22 January 2024; Accepted: 11 April 2024; Published: 24 April 2024.

Copyright © 2024 Wang, Ke and Lei. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Qian Ke, [email protected]

† These authors have contributed equally to this work and share first authorship

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

COMMENTS

  1. PDF Uber & Big Data a case study

    Petabytes. Uber relies heavily on making data-driven decisions at every level: forecasting rider demand during high-traffic events and addressing bottlenecks in the driver-partner signup process. It needs to store, clean, and serve over 100 petabytes of data (2017) with minimum latency. Need for a big data solution: Reliable. Scalable.

  2. Uber's Big Data Platform: 100+ Petabytes with Minute Latency

    Reza is one of the founding engineers of Uber's data team and helped scale Uber's data platform from a few terabytes to over 100 petabytes while reducing data latency from 24+ hours to minutes. Uber's Hadoop platform ensures data reliability, scalability, and ease-of-use with minimal latency.

  3. Solving Big Data Challenges with Data Science at Uber

    Ritesh Agrawal is a senior data scientist on Uber's Data Science team, leading the intelligent infrastructure and developer platform teams. His work is focused on finding innovative ways to use data science and AI to make Uber's infrastructure more adaptive and scalable and enhance developer productivity. How engineers and data scientists at ...

  4. Unleashing the power of Presto: The Uber case study

    Uber's success as a data-driven company is no accident. It's the result of a deliberate strategy to leverage cutting-edge technologies like Presto to unlock the insights hidden in vast volumes of data. Presto has become an integral part of Uber's data ecosystem, enabling the company to process petabytes of data, support diverse analytical ...

  5. How Uber uses data science to reinvent transportation?

    Understand how the ride sharing service Uber uses big data and data science to reinvent transportation and logistics globally. With more than 8 million users, 1 billion Uber trips and 160,000+ people driving for Uber across 449 cities in 66 countries - Uber is the fastest growing startup standing at the top of its game.

  6. PDF USING BIG DATA TO ESTIMATE CONSUMER SURPLUS

    four U.S. cities included in our analysis. For each dollar spent by consumers, about $1.60 of consumer surplus is generated. Back-of-the-envelope calculations suggest that the overall consumer surplus generated by the UberX service in. Peter Cohen Uber 1455 Market Street San Francisco, CA 94102 [email protected].

  7. How Uber Uses Data and Analytics (Case Study)

    How Uber Uses Data and Analytics (Case Study) Everyone knows Uber as a shared service for point-to-point transportation, but not everyone knows Uber as a data and analytics company. In this EMA technical case study, sponsored by Ahana, you'll learn about: What is Presto? The evolution of its use at Uber. The analytical use cases of Presto ...

  8. Uber knows you: how data optimizes our rides

    Drivers, in turn, get more time to earn. 1. Surge Pricing. The instant implementation of live data allows Uber to effectively operate a dynamic pricing model. Using geo-location coordinates from drivers, street traffic and ride demand data, the so called Geosurge-algorithm compares theoretical ideals with what is actually implemented in the ...

  9. (PDF) BIG DATA ANALYTICS IN UBER

    Uber's value chain analysis is a strategic analytical technique that aids in pinpointing the company's sources of value and competitive advantage. It involves Uber's core functions, including ...

  10. 42: UBER: How Big Data Is At The Centre Of Uber's Transportation

    42 UBER: How Big Data Is At The Centre Of Uber's Transportation Business. Background: Uber is a smartphone app-based taxi booking service which connects users who need to get somewhere with drivers willing to give them a ride. The service has been hugely popular. Since being launched to serve San Francisco in 2009, the service has been expanded ...

  11. Using Big Data to Estimate Consumer Surplus: The Case of Uber

    The paper uses Uber's surge pricing algorithm and individual-level data to estimate demand elasticities and consumer surplus for the UberX service in four U.S. cities. It finds that in 2015, the service generated about $2.9 billion in consumer surplus in those four cities and an estimated $6.8 billion across the U.S.

  12. Computing the User Experience via Big Data Analysis: A Case of Uber

    Based on these findings, we also provide some theoretical implications for future UX literature and some core suggestions related to establishing strategies for Uber and similar services. The proposed big data approach may be utilized in other UX studies in the future.

  13. Using 'Big Data' to understand the impacts of Uber ...

    The case study is based on New York City data which shows that the taxi market may be oversupplied and underpriced, which confirms findings from other studies and price hikes in 2012.

  14. Uber's Strategy for Global Success

    Harvard Business School assistant professor Alexander MacKay describes Uber's global market strategy and responses by regulators and local competitors in his case, " Uber: Competing Globally ...

  15. Computing the User Experience via Big Data Analysis: A Case of Uber

    Most of the studies based on Uber and similar services have explored UX by analyzing a limited (fewer than 1,000) number of samples. Thus, in the context of Uber services, we attempt to use a big data approach to explore user satisfaction, which may be strongly related to the continuance intention and loyalty of an individual [13-16]. The ...

  16. End-to-End Predictive Analysis on Uber's Data

    Uber is an international company located in 69 countries and around 900 cities around the world. Lyft, on the other hand, operates in approximately 644 cities in the US and 12 cities in Canada alone. However, in the US, it is the second-largest passenger company with a market share of 31%. From booking a taxi to paying a bill, both ...

  17. Uber Case Study

    Through hands-on upskilling on our platform, thousands of Uber employees now use data in their daily work. Operations, Marketing and Product teams use it for planning and decision-making. Download the case study and learn more about Uber's journey to data-driven decision making. Download the Uber Case Study.

  18. How Uber Uses Data to Improve Their Service and Create the New Wave of

    Focus areas. Data Analytics & Insights Gain a deeper understanding of your customers and marketing performance through forecasting, full-funnel exploration, and campaign impact analyses.; Dashboard Development Our dashboards provide easy-to-read marketing performance visuals based on your preferred metrics hierarchy.; Conversion Rate Optimization Increase conversion rates and decrease customer ...

  19. The Big Problem with Uber's Big Data: Ethics and Regulation of Data

    Big Data gave Uber enough power and agency to be able to attract workers with its ease-of-use and escape the classic employee-employer relationship, defining itself as a data-powered platform that serves as a mediator between drivers and consumers (Wilhelm 2018). ... but recent court decisions are turning the debate in favour of workers ...

  20. 5 Big Data Case Studies

    The following are interesting big data case studies: 1. Big Data Case Study - Walmart. Walmart is the largest retailer in the world and the world's largest company by revenue, with more than 2 million employees and 20,000 stores in 28 countries. It started making use of big data analytics long before the term Big Data came into the picture.

  21. Machine Learning

    Steps to Implement Machine Learning for Uber Use Case. Data Collection: Collect data on ride requests, driver locations, travel times, and other relevant metrics. Data Preprocessing: Clean and preprocess the data to remove errors, handle missing values, and prepare it for analysis. Feature Engineering:

  22. Uber's Journey to Modernizing Big Data Infrastructure with Google Cloud

    In a recent post on its official engineering blog, Uber disclosed its strategy to migrate the batch data analytics and machine learning (ML) training stack to Google Cloud Platform (GCP). Uber runs

  24. Frontiers

    1 School of Economic and Management, Xi'an University of Technology, Xi'an, China; 2 School of Business and Circulation, Shaanxi Polytechnic Institute, Xian Yang, China. The digital economy has the characteristics of resource conservation, which can solve China's high carbon emissions problems. The digital economy can quickly integrate with the real economy, forming an integrated economy.