Modern Data Science with R

3rd edition (light edits and updates)

Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton

March 24, 2024

3rd edition

This is the work-in-progress of the 3rd edition. At present, there are relatively modest changes from the second edition beyond those necessitated by changes in the R ecosystem.

Key changes include:

  • Transition to Quarto from RMarkdown
  • Transition from the magrittr pipe (%>%) to the base R pipe (|>)
  • Minor updates to specific examples (e.g., updating tables scraped from Wikipedia) and code (e.g., new group options within the dplyr package)
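The pipe change can be illustrated with a short example. Both pipes pass the left-hand side as the first argument to the next call; the base pipe needs no package but requires R 4.1 or later:

```r
# Mean fuel economy of the 6-cylinder cars in the built-in mtcars data,
# written with the base R pipe:
avg_mpg <- mtcars |>
  subset(cyl == 6, select = mpg) |>
  sapply(mean)

# The magrittr form used in the second edition (requires magrittr or dplyr):
# avg_mpg <- mtcars %>% subset(cyl == 6, select = mpg) %>% sapply(mean)

round(avg_mpg, 2)  # mpg: 19.74
```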

At the main website for the book, you will find reviews, instructor resources, errata, and other information.

Do you see issues or have suggestions? To submit corrections, please visit our website’s public GitHub repository and file an issue.

Known issues with the 3rd edition

This is a work in progress. At present there are a number of known issues:

  • nuclear reactors example (6.4.4 Example: Japanese nuclear reactors) needs to be updated to account for Wikipedia changes
  • Python code not yet implemented (Chapter 21 Epilogue: Towards “big data”)
  • Spark code not yet implemented (Chapter 21 Epilogue: Towards “big data”)
  • SQL output captions not working (Chapter 15 Database querying using SQL)
  • OpenStreetMap geocoding not yet implemented (Chapter 18 Geospatial computations)
  • ggmosaic warnings (Figure 3.19)
  • RMarkdown introduction (Appendix D — Reproducible analysis and workflow) not yet converted to Quarto examples
  • issues with references in Appendix A — Packages used in the book
  • Exercises not yet available (throughout)
  • Links have not all been verified (help welcomed here!)

2nd edition

The online version of the 2nd edition of Modern Data Science with R is available. You can purchase the book from CRC Press or from Amazon.

The main website for the book includes more information, including reviews, instructor resources, and errata.

To submit corrections, please visit our website’s public GitHub repository and file an issue.


1st edition

The 1st edition may still be available for purchase. Although much of the material has been updated and improved, the general framework is the same (reviews).

© 2021 by Taylor & Francis Group, LLC . Except as permitted under U.S. copyright law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by an electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

Background and motivation

The increasing volume and sophistication of data poses new challenges for analysts, who need to be able to transform complex data sets to answer important statistical questions. A consensus report on data science for undergraduates (National Academies of Sciences, Engineering, and Medicine 2018) noted that data science is revolutionizing science and the workplace. They defined a data scientist as “a knowledge worker who is principally occupied with analyzing complex and massive data resources.”

Michael I. Jordan has described data science as the marriage of computational thinking and inferential (statistical) thinking. Without the skills to be able to “wrangle” or “marshal” the increasingly rich and complex data that surround us, analysts will not be able to use these data to make better decisions.

Demand is strong for graduates with these skills. According to the company ratings site Glassdoor, “data scientist” was the best job in America every year from 2016 to 2019 (Columbus 2019).

New data technologies make it possible to extract data from more sources than ever before. Streamlined data processing libraries enable data scientists to express how to restructure those data into a form suitable for analysis. Database systems facilitate the storage and retrieval of ever-larger collections of data. State-of-the-art workflow tools foster well-documented and reproducible analysis. Modern statistical and machine learning methods allow the analyst to fit and assess models as well as to undertake supervised or unsupervised learning to glean information about the underlying real-world phenomena. Contemporary data science requires tight integration of these statistical, computing, data-related, and communication skills.

Intended audience

This book is intended for readers who want to develop the appropriate skills to tackle complex data science projects and “think with data” (as coined by Diane Lambert of Google). The desire to solve problems using data is at the heart of our approach.

We acknowledge that it is impossible to cover all these topics in any level of detail within a single book: Many of the chapters could productively form the basis for a course or series of courses. Instead, our goal is to lay a foundation for analysis of real-world data and to ensure that analysts see the power of statistics and data analysis. After reading this book, readers will have greatly expanded their skill set for working with these data, and should have a newfound confidence about their ability to learn new technologies on-the-fly.

This book was originally conceived to support a one-semester, 13-week undergraduate course in data science. We have found that the book will be useful for more advanced students in related disciplines, or analysts who want to bolster their data science skills. At the same time, Part I of the book is accessible to a general audience with no programming or statistics experience.

Key features of this book

Focus on case studies and extended examples.

We feature a series of complex, real-world extended case studies and examples from a broad range of application areas, including politics, transportation, sports, environmental science, public health, social media, and entertainment. These rich data sets require the use of sophisticated data extraction techniques, modern data visualization approaches, and refined computational approaches.

Context is king for such questions, and we have structured the book to foster the parallel developments of statistical thinking, data-related skills, and communication. Each chapter focuses on a different extended example with diverse applications, while exercises allow for the development and refinement of the skills learned in that chapter.

The book has three main sections plus supplementary appendices. Part I provides an introduction to data science, which includes an introduction to data visualization, a foundation for data management (or “wrangling”), and ethics. Part II extends key modeling notions from introductory statistics, including regression modeling, classification and prediction, statistical foundations, and simulation. Part III introduces more advanced topics, including interactive data visualization, SQL and relational databases, geospatial data, text mining, and network science.

We conclude with appendices that introduce the book’s R package; R and RStudio; key aspects of algorithmic thinking; reproducible analysis; a review of regression; and how to set up a local SQL database.

The book features extensive cross-referencing (given the inherent connections between topics and approaches).

Supporting materials

In addition to many examples and extended case studies, the book incorporates exercises at the end of each chapter along with supplementary exercises available online. Many of the exercises are quite open-ended, and are designed to allow students to explore their creativity in tackling data science questions. (A solutions manual for instructors is available from the publisher.)

The book website at https://mdsr-book.github.io/mdsr3e includes the table of contents, the full text of each chapter, and bibliography. The instructor’s website at https://mdsr-book.github.io/ contains code samples, supplementary exercises, additional activities, and a list of errata.

Changes in the second edition

Data science moves quickly. A lot has changed since we wrote the first edition. We have updated all chapters to account for many of these changes and to take advantage of state-of-the-art R packages.

First, the chapter on working with geospatial data has been expanded and split into two chapters. The first focuses on working with geospatial data, and the second focuses on geospatial computations. Both chapters now use the sf package and the new geom_sf() function in ggplot2 . These changes allow students to penetrate deeper into the world of geospatial data analysis.

Second, the chapter on tidy data has undergone significant revisions. A new section on list-columns has been added, and the section on iteration has been expanded into a full chapter. This new chapter makes consistent use of the functional programming style provided by the purrr package. These changes help students develop a habit of mind around scalability: if you are copying-and-pasting code more than twice, there is probably a more efficient way to do it.
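The scalability habit described above (iterate rather than copy and paste) can be sketched even without purrr; the book itself uses purrr::map_dbl() for this pattern, but base R’s vapply() below shows the same idea with no extra packages:

```r
# Instead of copying mean(mtcars$mpg), mean(mtcars$hp), mean(mtcars$wt), ...
# once per column, iterate over the columns in a single expression:
numeric_means <- vapply(mtcars[c("mpg", "hp", "wt")], mean, numeric(1))

# The purrr equivalent used in the book (requires the purrr package):
# numeric_means <- purrr::map_dbl(mtcars[c("mpg", "hp", "wt")], mean)

numeric_means
```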

Third, the chapter on supervised learning has been split into two chapters and updated to use the tidymodels suite of packages. The first chapter now covers model evaluation in generality, while the second introduces several models. The tidymodels ecosystem provides a consistent syntax for fitting, interpreting, and evaluating a wide variety of machine learning models, all in a manner that is consistent with the tidyverse . These changes significantly reduce the cognitive overhead of the code in this chapter.
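The model-evaluation workflow that chapter covers (split the data, fit on the training set, assess on the held-out set) can be sketched in base R; the book expresses the same steps with tidymodels functions such as initial_split(), but the underlying logic is:

```r
set.seed(20)  # reproducible split
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit on the training data only, then assess on the held-out data
fit   <- lm(mpg ~ wt + hp, data = train)
preds <- predict(fit, newdata = test)
rmse  <- sqrt(mean((test$mpg - preds)^2))
rmse  # out-of-sample root mean squared error
```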

The content of several other chapters has undergone more minor—but nonetheless substantive—revisions. All of the code in the book has been revised to adhere more closely to the tidyverse syntax and style. Exercises and solutions from the first edition have been revised, and new exercises have been added. The code from each chapter is now available on the book website. The book has been ported to bookdown , so that a full version can be found online at https://mdsr-book.github.io/mdsr2e .

Key role of technology

While many tools can be used effectively to undertake data science, and the technologies to undertake analyses are quickly changing, R and Python have emerged as two powerful and extensible environments. While it is important for data scientists to be able to use multiple technologies for their analyses, we have chosen to focus on the use of R and RStudio (an open source integrated development environment created by Posit) to avoid cognitive overload. We describe a powerful and coherent set of tools that can be introduced within the confines of a single semester and that provide a foundation for data wrangling and exploration.

We take full advantage of the RStudio environment. This powerful and easy-to-use front end adds innumerable features to R, including package support, code completion, integrated help, a debugger, and other coding tools. In our experience, the use of RStudio dramatically increases the productivity of R users, and by tightly integrating reproducible analysis tools, helps avoid error-prone “cut-and-paste” workflows. Our students and colleagues find RStudio to be an accessible interface. No prior knowledge or experience with R or RStudio is required: we include an introduction within the Appendix.

As noted earlier, we have comprehensively integrated many substantial improvements in the tidyverse, an opinionated set of packages that provide a more consistent interface to R (Wickham 2023). Many of the design decisions embedded in the tidyverse packages address issues that have traditionally complicated the use of R for data analysis. These decisions allow novice users to make headway more quickly and develop good habits.

We used a reproducible analysis system (knitr) to generate the example code and output in this book. Code extracted from these files is provided on the book’s website. We provide a detailed discussion of the philosophy and use of these systems. In particular, we feel that the knitr and rmarkdown packages for R, which are tightly integrated with Posit’s RStudio IDE, should become a part of every R user’s toolbox. We can’t imagine working on a project without them (and we’ve incorporated reproducibility into all of our courses).

Modern data science is a team sport. To be able to fully engage, analysts must be able to pose a question, seek out data to address it, ingest this into a computing environment, model and explore, then communicate results. This is an iterative process that requires a blend of statistics and computing skills.

How to use this book

The material from this book has supported several courses to date at Amherst, Smith, and Macalester Colleges, as well as many others around the world. From our personal experience, this includes an intermediate course in data science (in 2013 and 2014 at Smith College and since 2017 at Amherst College), an introductory course in data science (since 2016 at Smith), and a capstone course in advanced data analysis (multiple years at Amherst).

The introductory data science course at Smith has no prerequisites and includes the following subset of material:

  • Data Visualization: three weeks, covering Chapters 1 (Prologue: Why data science?) – 3 (A grammar for graphics)
  • Data Wrangling: five weeks, covering Chapters 4 (Data wrangling on one table) – 7 (Iteration)
  • Ethics: one week, covering Chapter 8 (Data science ethics)
  • Database Querying: two weeks, covering Chapter 15 (Database querying using SQL)
  • Geospatial Data: two weeks, covering Chapter 17 (Working with geospatial data) and part of Chapter 18 (Geospatial computations)

An intermediate course at Amherst followed the approach of Baumer (2015), with prerequisites of some statistics and some computer science and an integrated final project. The course generally covers the following chapters:

  • Data Visualization: two weeks, covering Chapters 1 (Prologue: Why data science?) – 3 (A grammar for graphics) and 14 (Dynamic and customized data graphics)
  • Data Wrangling: four weeks, covering Chapters 4 (Data wrangling on one table) – 7 (Iteration)
  • Unsupervised Learning: one week, covering Chapter 12 (Unsupervised learning)
  • Database Querying: one week, covering Chapter 15 (Database querying using SQL)
  • Geospatial Data: one week, covering Chapter 17 (Working with geospatial data) and some of Chapter 18 (Geospatial computations)
  • Text Mining: one week, covering Chapter 19 (Text as data)
  • Network Science: one week, covering Chapter 20 (Network science)

The capstone course at Amherst reviewed much of that material in more depth:

  • Data Visualization: three weeks, covering Chapters 1 (Prologue: Why data science?) – 3 (A grammar for graphics) and Chapter 14 (Dynamic and customized data graphics)
  • Data Wrangling: two weeks, covering Chapters 4 (Data wrangling on one table) – 7 (Iteration)
  • Simulation: one week, covering Chapter 13 (Simulation)
  • Statistical Learning: two weeks, covering Chapters 10 (Predictive modeling) – 12 (Unsupervised learning)
  • Databases: one week, covering Chapter 15 (Database querying using SQL) and Appendix F (Setting up a database server)
  • Spatial Data: one week, covering Chapter 17 (Working with geospatial data)
  • Big Data: one week, covering Chapter 21 (Epilogue: Towards “big data”)

We anticipate that this book could serve as the primary text for a variety of other courses, such as a Data Science 2 course, with or without additional supplementary material.

The content in Part I—particularly the ggplot2 visualization concepts presented in Chapter 3 (A grammar for graphics) and the dplyr data wrangling operations presented in Chapter 4 (Data wrangling on one table)—is fundamental and is assumed in Parts II and III. Each of the topics in Part III is independent of the others and of the material in Part II. Thus, while most instructors will want to cover most (if not all) of Part I in any course, the material in Parts II and III can be added with almost total freedom.

The material in Part II is designed to expose students with a beginner’s understanding of statistics (i.e., basic inference and linear regression) to a richer world of statistical modeling and statistical inference.

Acknowledgments

We would like to thank John Kimmel at Informa CRC/Chapman and Hall for his support and guidance. We also thank Jim Albert, Nancy Boynton, Jon Caris, Mine Çetinkaya-Rundel, Jonathan Che, Patrick Frenett, Scott Gilman, Maria-Cristiana Gîrjău, Johanna Hardin, Alana Horton, John Horton, Kinari Horton, Azka Javaid, Andrew Kim, Eunice Kim, Caroline Kusiak, Ken Kleinman, Priscilla (Wencong) Li, Amelia McNamara, Melody Owen, Randall Pruim, Tanya Riseman, Gabriel Sosa, Katie St. Clair, Amy Wagaman, Susan (Xiaofei) Wang, Hadley Wickham, J. J. Allaire and the Posit (formerly RStudio) developers, the anonymous reviewers, multiple classes at Smith and Amherst Colleges, and many others for contributions to the R and RStudio environment, comments, guidance, and/or helpful suggestions on drafts of the manuscript. Rose Porta was instrumental in proofreading and easing the transition from Sweave to R Markdown. Jessica Yu converted and tagged most of the exercises from the first edition to the new format based on etude.

Above all we greatly appreciate Cory, Maya, and Julia for their patience and support.

Northampton, MA and St. Paul, MN August, 2023 (third edition [light edits and updates])

Northampton, MA and St. Paul, MN December, 2020 (second edition)


R Applications – 9 Real-world Use Cases of R programming

What is R used for? R applications across 9 sectors

Data Science and Big Data have proved themselves useful, and even necessary, in many different fields and industries today. They help organizations keep up with trends and capitalize on every opportunity.

R is a tool for making sense of big data and extracting value from it. It has also proved useful in research by processing large amounts of data in less time.

Let’s take a look at an interesting fact. “According to O’Reilly, R is the most-used data science language after SQL.”

In this article, we will explore the various R applications in the real world.

Applications of R Programming

Let’s start from the beginning and explore the uses of R for research purposes:


R in Research and Academics

R is a statistical research tool. It is still used by statisticians and students to perform various statistical computations and analyses. Statistical techniques such as linear and non-linear modeling, time-series analysis, classification, classical statistical tests, and clustering are all implemented by R and its libraries.
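Several of the techniques listed above are one-liners in base R. A minimal sketch using R’s built-in datasets (cars, sleep, and iris ship with every R installation):

```r
# Linear modeling: regress stopping distance on speed (built-in cars data)
model <- lm(dist ~ speed, data = cars)

# Classical statistical test: two-sample t-test on the built-in sleep data
ttest <- t.test(extra ~ group, data = sleep)

# Clustering: k-means on the four numeric columns of iris
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

coef(model)  # intercept and slope of the fitted regression line
```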

R is also used for machine learning and deep learning research. With libraries that facilitate supervised and unsupervised learning, R is one of the most commonly used languages for machine learning.

Other research involving large data sets, such as finding genetic anomalies and patterns or analyzing various drug compositions, also uses R to sift through large collections of relevant data and to draw meaningful conclusions.

R Use Cases in Research & Academics

  • Cornell University: Cornell recommends that its researchers and students use R for research involving statistical computing.
  • UCLA: The University of California, Los Angeles uses R to teach statistics and data analysis to its students.

Apart from research, R also has its applications in IT companies.


R in IT Sector


IT companies not only use R for their own business intelligence, but also offer such services to other small-, medium-, and large-scale businesses. They use it for their machine learning products too.

They use R to build statistical computing tools and data handling products and to create other data manipulation services.

Some big IT companies that use R:

  • Tata Consultancy Services

R Use Cases in IT Sector

  • Mozilla: Mozilla uses R to visualize web activity for its browser, Firefox.
  • Microsoft: Microsoft uses R as a statistical engine within the Azure Machine Learning framework. They also use it for the Xbox matchmaking service.
  • Foursquare: R works behind the scenes in Foursquare’s recommendation engine.
  • Google: Google uses R to improve its search results, provide better search suggestions, calculate the ROI of its advertising campaigns, increase the efficiency of online advertising, and predict economic activity.

R in Finance


R is a statistical programming language, and few industries deal with statistics more than finance.

R and data science find widespread use in the finance sector. R provides an advanced statistical suite for financial tasks and computations. Moving averages, auto-regression, time-series analysis, stock-market modeling, financial data mining, and downside risk assessment are all easily done with R and its libraries.
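As one small illustration of the list above, a simple moving average (a basic smoother for price series) takes a single line of base R; real financial work would typically use dedicated packages such as quantmod or xts, which are beyond this sketch:

```r
# 3-period simple moving average of a toy price series
prices <- c(100, 102, 101, 105, 107, 106, 110)
sma3 <- stats::filter(prices, rep(1/3, 3), sides = 1)

# The first two values are NA (not enough history); the rest are
# trailing means of the current and two previous prices.
round(as.numeric(sma3), 2)
```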

R is also used to support the business decision-making process. R’s data visualization powers can represent the findings of data analysis in multiple graphical formats like candlestick charts, density plots, and drawdown plots of high quality.

This helps business decision-makers connect with the technical side of data analysis and its results. Companies like American Express, Bajaj Allianz Insurance, JP Morgan, and Standard Chartered use R.

R Use Cases in Finance

  • Lloyd’s of London: Lloyd’s of London uses R for risk analysis.
  • Bajaj Allianz Insurance: Bajaj Allianz uses R to build its upsell-propensity models and recommendation engines. It also uses R to mine data and generate actionable insights that improve customer experience.

The digital revolution has changed the world drastically. One of the most prominent changes is the fact that marketplaces have moved to the internet. The E-commerce industry makes heavy usage of R for varying purposes.


R in E-commerce

In finance and retail, analytics is useful for risk assessment and for devising marketing strategy. E-commerce goes beyond that in its use of data science. E-commerce companies use R to improve the user experience on their sites as well as for marketing and finance purposes. They use R to improve cross-product selling: when a customer is buying a product, the site suggests additional products that complement the original purchase. These suggestions also draw on products the customer has purchased in the past. Internet-based companies such as e-commerce sites gather and process structured and unstructured data from varying sources, and R proves highly useful for this.
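At its core, the cross-selling idea ("customers who bought X also bought Y") reduces to counting co-occurrences of products across orders. A toy base-R sketch, with an invented basket data set for illustration:

```r
# Each basket is the set of products in one order (invented toy data)
baskets <- list(
  c("phone", "case", "charger"),
  c("phone", "case"),
  c("phone", "charger"),
  c("laptop", "mouse")
)

# Count how often each product co-occurs with "phone"
with_phone <- Filter(function(b) "phone" %in% b, baskets)
co_counts <- sort(table(unlist(with_phone)), decreasing = TRUE)
co_counts <- co_counts[names(co_counts) != "phone"]
co_counts  # case and charger each co-occur twice with phone
```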

Apart from this, R is also used to help with marketing-strategy, targeted advertising, sales modeling, and financial data processing.

R Use Cases in E-commerce

  • Amazon: Amazon uses R and data analysis to improve its cross-product suggestions.
  • Flipkart: Flipkart uses R for predictive analysis, which helps with targeted advertisements.

Social media is the most common generator of big data today. Therefore, the most advanced and cutting-edge uses of data science can be found in the social media industry.


R in Social Media


Social media companies like Facebook use R for behavior analysis and sentiment analysis. They can alter and improve their suggestions to users based on the user’s history, and the mood and tone of their recent posts and viewed content. The advertisements shown to the user are also adjusted according to user sentiment and history. R is also used to analyze traffic, user sessions, and content, all in an effort to improve user experience.
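Sentiment analysis at its simplest is lexicon matching: score each post by counting its words against lists of positive and negative terms. A toy base-R sketch (the tiny lexicon below is invented; real work would use a package such as tidytext with a published lexicon):

```r
positive <- c("great", "love", "happy")
negative <- c("bad", "sad", "awful")

# Net sentiment: positive word count minus negative word count
sentiment_score <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  sum(words %in% positive) - sum(words %in% negative)
}

sentiment_score("I love this, it is great!")  #  2
sentiment_score("What an awful, sad day")     # -2
```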

R Use Cases in Social Media

  • Facebook: Facebook uses R to predict colleague interactions and update its social network graph.
  • Twitter: Twitter uses R for semantic clustering. They also use it for data visualization.

Another sector making great use of R’s statistical computation abilities is the banking sector.

R in Banking

Banking firms use R for credit risk modeling and other forms of risk analytics. Banks often use R along with proprietary software such as SAS. It is also used for fraud detection, mortgage haircut modeling, statistical modeling, volatility modeling, loan stress-test simulation, client assessment, and much more. Apart from statistics, banks also use R for business intelligence and data visualization.
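Credit risk modeling of the kind described above often starts from a logistic regression of default on borrower characteristics. A minimal base-R sketch on invented, simulated data (real scorecards use far richer features and far more validation):

```r
set.seed(42)
n <- 200
credit <- data.frame(
  income     = rnorm(n, mean = 50, sd = 15),  # invented applicant data
  debt_ratio = runif(n, 0, 1)
)
# Simulate default probability rising with debt and falling with income
p <- plogis(-2 + 3 * credit$debt_ratio - 0.02 * credit$income)
credit$default <- rbinom(n, 1, p)

# Logistic regression: probability of default given the two features
risk_model <- glm(default ~ income + debt_ratio, data = credit,
                  family = binomial)
predict(risk_model, type = "response",
        newdata = data.frame(income = 40, debt_ratio = 0.8))
```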

Another use of R is in the calculation of customer segmentation, customer quality, and customer retention.

R Use Cases in Banking

  • ANZ: ANZ Bank uses R for credit risk modeling and in models for mortgage loss.
  • Bank of America: Bank of America uses R for financial reporting and to calculate financial losses.

The healthcare industry is not one to be left behind when it comes to cutting-edge technologies:

R in Healthcare


With R, you can crunch data and process information, providing an essential backdrop for further analysis and data processing. Genetics, drug discovery, bioinformatics, and epidemiology are some fields in the healthcare industry that use R heavily. It is used to analyze and predict the spread of various diseases, to analyze genetic sequences, to analyze drug-safety data, and to analyze various permutations and combinations of drugs and chemicals. The Bioconductor project provides R packages for analyzing genomic data. Lastly, R is a godsend for pre-clinical trials of new drugs and medical techniques.

R Use Case in Healthcare

Merck: Merck & Co. uses the R programming language for clinical trials and drug testing.

“We use R for adaptive designs frequently because it’s the fastest tool to explore designs that interest us. Off-the-shelf software gives you off-the-shelf options. Those are a good first-order approximation, but if you really want to nail down a design, R is going to be the fastest way to do that.”

Keaven Anderson (Executive Director of Late Stage Biostatistics, Merck)

Manufacturing companies also use R to make use of big data and to be ahead of the curve.

R in Manufacturing

Various manufacturing companies use R to complement their marketing and business strategies. They analyze customer feedback to help streamline and improve their products. They also use the data to support their marketing strategies. Predicting demand and market trends to adjust their manufacturing practices is yet another use of R and data analytics.

R Use Cases in Manufacturing

  • Ford Motor Company: Ford uses R for statistical analyses that support its business strategy, and to analyze customer sentiment about its products, which helps improve future designs.
  • John Deere: John Deere uses R to forecast demand for its products and spare parts. It also uses R to forecast crop yields and uses that data in its business strategy to meet market demand and anticipate downturns.

Every government has to handle a large amount of data. A country’s worth of data! Many governmental departments across the world use R as well.

R in Governmental Use

Many governmental departments use R for record-keeping and processing their censuses. This helps them in effective law-making and governance. They also use it for essential services like drug regulation, weather forecasting, disaster-impact analysis and much more.

R Use Cases in Governmental Activities

  • Food and Drug Administration: The FDA uses R for drug evaluation and pre-clinical trials. It also uses R to predict possible reactions and medical issues caused by various food products.
  • National Weather Service: The National Weather Service uses R for weather forecasting and disaster prediction. They also use it to visualize their forecasts and predictions and to analyze the areas affected.

In this article on R applications, we learned about the various sectors and industries using R. We also explored the purposes these industries use R for. Then, we looked at some companies that use R to satisfy their various needs.

Earlier, R was used only for research and academic purposes, but times have changed: you can now find R in every industry, from IT and finance to healthcare.



Case Study Project

“Case Study Project” is an initiative that aims to promote and facilitate the use of the R programming language for data manipulation, data analysis, programming, and the computational aspects of statistical topics in authentic research applications. Specifically, it looks to solve research problems using R’s extensive database of packages related to mathematics and analytics. Furthermore, it aims to guide students and researchers through the process of approaching and solving a research problem, or a complete data analysis project, using R.

The case study project encourages you to solve a feasible statistical problem statement of reasonable complexity using R. Once it passes through our quality checks, it will become part of a common database of solved case studies. To ensure that your efforts are recognized and benefit your career, an approved case study will be awarded an eCertificate and an honorarium.

Technical Requirements: 

  • Any student (UG, PG, research scholar, etc.) or faculty member with knowledge of R can submit a case study project.
  • A clearly defined, straightforward case study problem, solved using data and statistical or machine learning models, with the expected output.


rstudio::conf 2020

R + Tidyverse in Sports

January 30, 2020

There are many ways in which R and the Tidyverse can be used to analyze sports data, along with unique considerations involved in applying statistical tools to sports problems. See more

Putting the Fun in Functional Data: A tidy pipeline to identify routes in NFL tracking data


Currently in football, many hours are spent watching game film to manually label the routes run on passing plays. See more

Professional Case Studies


The path to becoming a world-class, data-driven organization is daunting. See more

Making better spaghetti (plots): Exploring the individuals in longitudinal data with the brolgar package


There are two main challenges of working with longitudinal (panel) data: 1) Visualising the data, and 2) Understanding the model. See more

Journalism with RStudio, R, and the Tidyverse


The Associated Press data team primarily uses R and the Tidyverse as the main tool for doing data processing and analysis. See more

How Vibrant Emotional Health Connected Siloed Data Sources and Streamlined Reporting Using R


Vibrant Emotional Health is the mental health not-for-profit behind the US National Suicide Prevention Lifeline, New York City's NYC Well program, and various other emotional health contact center... See more

How to win an AI Hackathon, without using AI


Once “big data” is thrown into the mix, the AI solution is all but certain. But is AI always needed? See more

Building a new data science pipeline for the FT with RStudio Connect

We have recently implemented a new Data Science workflow and pipeline, using RStudio Connect and Google Cloud Services. See more

rstudio::conf 2018

Teach the Tidyverse to beginners

March 4, 2018

Storytelling with R

Imagine Boston 2030: Using R-Shiny to keep ourselves accountable and empower the public

How I Learned to Stop Worrying and Love the Firewall

Differentiating by data science

Agile data science

Achieving impact with advanced analytics: Breaking down the adoption barrier

A SAS-to-R success story

Our biostatistics group has historically utilized SAS for data management and analytics for biomedical research studies, with R only used occasionally for new methods or data visualization. Several years ago and with the encouragement of leadership, we initiated a movement to increase our usage of R significantly. See more

Understanding PCA using Shiny and Stack Overflow data

February 26, 2018

The unreasonable effectiveness of empathy

Rapid prototyping data products using Shiny

Phrasing: Communicating data science through tweets, gifs, and classic misdirection

Open-source solutions for medical marijuana

Developing and deploying large scale shiny applications

Accelerating cancer research with R

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

A Data Science Case Study in R

Posted on March 13, 2017 by Robert Grünwald in R bloggers | 0 Comments


Demanding data science projects are becoming more and more relevant, and conventional evaluation procedures are often no longer sufficient. For this reason, there is a growing need for tailor-made solutions that are individually fitted to the project's goal, often implemented in R. To provide our readers with support in their own R programming, we have carried out an example evaluation that demonstrates several of R's application possibilities.

Data Science Projects

Approaching your first data science project can be a daunting task. Luckily, there are rough step-by-step outlines and heuristics that can help you on your way to becoming a data ninja. In this article, we review some of these guidelines and apply them to an example project in R.

For our analysis and the R programming, we will make use of the following R packages:
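The package listing itself was lost from this copy of the post. Judging from the functions used later in the article (the %>% pipe, date handling, ggplot2 graphics, and GAMs), a plausible setup would be:

```r
# Plausible package setup (reconstructed; the original listing was lost)
library(dplyr)      # data manipulation and the %>% pipe
library(lubridate)  # working with dates
library(ggplot2)    # plotting
library(mgcv)       # generalized additive models (GAMs)
```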

Anatomy of a Data Science project

A basic data science project consists of the following six steps:

  • State the problem you are trying to solve. It has to be an unambiguous question that can be answered with data and a statistical or machine learning model. At least, specify: What is being observed? What has to be predicted?
  • Collect the data, then clean and prepare it. This is commonly the most time-consuming task, but it has to be done in order to fit a prediction model with the data.
  • Explore the data. Get to know its properties and quirks. Check numerical summaries of your metric variables, tables of the categorical data, and plot univariate and multivariate representations of your variables. By this, you also get an overview of the quality of the data and can find outliers.
  • Check if any variables may need to be transformed. Most commonly, this is a logarithmic transformation of skewed measurements such as concentrations or times. Also, some variables might have to be split up into two or more variables.
  • Choose a model and train it on the data. If you have more than one candidate model, apply each and evaluate their goodness-of-fit using independent data that was not used for training the model.
  • Use the best model to make your final predictions.

We apply these principles to an example data set that was used in the ASA's 2009 Data Expo . The data cover around 120 million commercial domestic flights within the USA between 1987 and 2008. Measured variables include departure and arrival airport, airline, and scheduled and actual departure and arrival times.

We will focus on the 2008 subset of this data. Because even this subset is around 600 MB, it makes sense to start by exploring and developing your code on a random sample, and then periodically verify against the complete data set that your results still hold.

The following commands read in our subset data and display the first three observations:
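The code block itself did not survive in this copy; a minimal sketch follows. The file name 2008.csv and the synthetic stand-in rows are assumptions for illustration; the real analysis reads the Data Expo file.

```r
# Real analysis: read the 2008 subset (file name assumed)
# flights <- read.csv("2008.csv")

# Tiny synthetic stand-in with the same column layout, so the snippet runs:
flights <- data.frame(
  Year = 2008, Month = 1, DayOfMonth = 3, DayOfWeek = 4,
  CRSDepTime = c(700, 1455, 1900),
  DepDelay   = c(-2, 10, 45),
  ArrDelay   = c(-5, 12, 60),
  Distance   = c(515, 810, 2475),
  AirTime    = c(70, 116, 280)
)
head(flights, 3)  # display the first three observations
```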

Fortunately, the ASA provides a code book with descriptions of each variable here . For example, we now know that for the variable DayOfWeek, a 1 denotes Monday, a 2 is Tuesday, and so on.

The problem

With this data, it is possible to answer many interesting questions. Examples include:

  • Do planes with a delayed departure fly at a faster average speed to make up for the delay?
  • How does the delay of arriving flights vary during the day? Are planes more delayed on weekends?
  • How has the market share of different airlines shifted over these 20 years?
  • Are there specific planes that tend to have longer delays? What characterizes them? Maybe the age, or the manufacturer?

In addition to these concrete questions, the possibilities for exploratory, sandbox-style data analysis are nearly limitless.

Here, we will focus on the first two questions.

Data cleaning

You should always check out the amount of missing values in your data. For this, we write an sapply-loop over each column in the flights data and report the percentage of missing values:
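The loop itself was lost in this copy; it might have looked like the following (demonstrated on a small synthetic frame so it runs standalone):

```r
# Synthetic stand-in: delay-reason columns are only filled for delayed flights
flights <- data.frame(
  ArrDelay     = c(25, -3, 0, 40),
  DepDelay     = c(20, -5, 2, 30),
  CarrierDelay = c(25, NA, NA, 10),
  WeatherDelay = c( 0, NA, NA, 30)
)

# Percentage of missing values in each column
sapply(flights, function(x) round(100 * mean(is.na(x)), 1))
```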

We see that most variables have at most a negligible amount of missing values. However, the last five variables, starting at CarrierDelay, are almost 80% missing. Such a high proportion of missing data would usually suggest dropping these variables from the analysis altogether, since not even a sophisticated imputation procedure can help here. But, as further inspection shows, these variables only apply to delayed flights, i.e., those with a positive value in the ArrDelay column.

When selecting only the arrival delay and the five sub-categories of delays, we see that they add up to the total arrival delay. For our analysis here, we are not interested in the delay reason, but view only the total ArrDelay as our outcome of interest.

The pipe operator %>%, by the way, is a nice feature of the magrittr package (also implemented in dplyr) that resembles the UNIX-style pipe. The following two lines mean and do exactly the same thing, but the second version is much easier to read:

The pipe operator thus takes the output of the left expression, and makes it the first argument of the right expression.
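The example lines were lost here. The idea can be sketched with base R alone; note that since R 4.1 the base pipe |> behaves the same way as %>% for this simple case:

```r
x <- c(1, 4, 9, 16)

# Nested call: read inside-out
mean(sqrt(x))

# Piped call: read left-to-right
# (%>% comes from magrittr/dplyr; base R >= 4.1 offers |> for the same pattern)
x |> sqrt() |> mean()
```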

We have surprisingly clean data where not much has to be done before proceeding to feature engineering.

Exploratory analyses

Our main variables of interest are:

  • The date, which conveniently is already split up into the columns Year, Month, and DayOfMonth, and even contains the weekday in DayOfWeek. This is rarely the case; usually you get a single column with a name like date and entries such as "2016-06-24". In that case, the R package lubridate provides helpful functions to efficiently work with and manipulate these dates.
  • CRSDepTime, the scheduled departure time. This will indicate the time of day for our analysis of when flights tend to have higher delays.
  • ArrDelay, the delay in minutes at arrival. We use this variable (rather than the delay at departure) as the outcome in our first analysis, since the arrival delay is what actually affects our day.
  • For our second question, whether planes with delayed departure fly faster, we need DepDelay, the delay in minutes at departure, as well as a measure of average speed while flying. This variable is not available, but we can compute it from the available variables Distance and AirTime. We will do that in the next section, "Feature Engineering".

Let’s have an exploratory look at all our variables of interest.

Flight date

Since these are exploratory analyses that you usually won’t show anyone else, spending time on pretty graphics does not make sense here. For quick overviews, I mostly use the standard graphics functions from R, without much decoration in terms of titles, colors, and such.

[Figure: barplots of the number of flights per month, day of month, and day of week]

Since we subsetted the data beforehand, it makes sense that all our flights are from 2008. We also see no big changes between the months. There is a slight drop after August, but the remaining changes can be explained by the number of days in a month.

The day of the month shows no influence on the number of flights, as expected. That the 31st has around half the flights of the other days is also unsurprising: only seven months of the year have a 31st day.

When plotting flights per weekday, however, we see that Saturday is the quietest day of the week, with Sunday the second most relaxed. Between the remaining weekdays, there is little variation.

Departure Time

[Figure: histogram of scheduled departure times]

A histogram of the departure time shows that the number of flights is relatively constant from 6am to around 8pm and dropping off heavily before and after that.

Arrival and departure delay

[Figure: histograms of arrival and departure delay, and a scatterplot of departure vs. arrival delay]

Both arrival and departure delay show a very asymmetric, right-skewed distribution. We should keep this in mind and think about a logarithmic transformation or some other method of acknowledging this fact later.

The structure of the third plot of departure vs. arrival delay suggests that flights that start with a delay usually don’t compensate that delay during the flight. The arrival delay is almost always at least as large as the departure delay.

To get a first overview for our question of how the departure time influences the average delay, we can also plot the departure time against the arrival delay:

[Figure: scatterplot of departure time vs. arrival delay]

Aha! Something looks weird here. There seem to be periods of times with no flights at all. To see what is going on here, look at how the departure time is coded in the data:

A departure of 2:55pm is written as an integer 1455. This explains why the values from 1460 to 1499 are impossible. In the feature engineering step, we will have to recode this variable in a meaningful way to be able to model it correctly.

Distance and AirTime

[Figure: scatterplot of Distance vs. AirTime]

Plotting the distance against the time needed, we see a linear relationship as expected, with one large outlier. This one point denotes a flight of 2762 miles and an air time of 823 minutes, suggesting an average speed of 201 mph. Commercial planes do not cruise that slowly over such a distance, so we should probably remove this observation.

Feature Engineering

Feature engineering describes the manipulation of your data set to create variables that a learning algorithm can work with. Often, this consists of transforming a variable (e.g., through a logarithm), extracting specific information from a variable (e.g., the year from a date string), or converting something like a ZIP code into a more useful representation.

For our data, we have the following tasks:

  • Convert the weekday into a factor variable so it doesn’t get interpreted linearly.
  • Create a log-transformed version of the arrival and departure delay.
  • Transform the departure time so that it can be used in a model.
  • Create the average speed from the distance and air time variables.

Converting the weekday into a factor is important because otherwise, it would be interpreted as a metric variable, which would result in a linear effect. We want the weekdays to be categories, however, and so we create a factor with nice labels:
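The snippet was dropped from this copy; a sketch of the conversion, using the coding from the ASA code book (1 = Monday, ..., 7 = Sunday):

```r
DayOfWeek <- c(1, 6, 7, 3)  # example values

# Convert the integer codes to a factor with readable labels
weekday <- factor(DayOfWeek, levels = 1:7,
                  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
weekday
```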

log-transform delay times

When looking at the delays, we note that there are a lot of negative values in the data. These denote flights that left or arrived earlier than scheduled. To allow a log-transformation, we set all negative values to zero, which we interpret as "on time":

Now, since there are zeros in these variables, we create the variables log(1+ArrDelay) and log(1+DepDelay):
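The code was lost here; the two steps can be sketched as:

```r
ArrDelay <- c(-5, 0, 12, 120)  # negative values = arrived early

# Step 1: treat early arrivals as "on time"
ArrDelay <- pmax(ArrDelay, 0)

# Step 2: log(1 + x) handles the zeros; log1p() computes this accurately
LogArrDelay <- log1p(ArrDelay)
```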

Transform the departure time

The departure time is coded in the format hhmm, which is not helpful for modelling, since we need equal distances between equal durations of time. This way, the distance between 10:10pm and 10:20pm would be 10, but the distance between 10:50pm and 11:00pm, the same 10 minutes, would be 50.

For the departure time, we therefore need to convert the time format. We will use a decimal format, so that 11:00am becomes 11, 11:15am becomes 11.25, and 11:45 becomes 11.75.

The mathematical rule to transform the "old" time in hhmm format into a decimal format is: DecimalTime = floor(hhmm / 100) + (hhmm mod 100) / 60.

Here, the first part of the sum generates the hours, and the second part takes the remainder when dividing by 100 (i.e., the last two digits), and divides them by 60 to transform the minutes into a fraction of one hour.

Let’s implement that in R:
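The implementation was lost from this copy; integer division and modulo make it a one-liner:

```r
CRSDepTime <- c(1455, 1100, 1115, 1145)  # hhmm-coded times

# hours + minutes/60
DepTimeDec <- CRSDepTime %/% 100 + (CRSDepTime %% 100) / 60
DepTimeDec
```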

Of course, you should always verify that your code did what you intended by checking the results.

Create average speed

The average flight speed is not available in the data – we have to compute it from the distance and the air time:
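The computation was lost here; since Distance is in miles and AirTime in minutes, the speed in mph is:

```r
Distance <- c(515, 810, 2475)  # miles
AirTime  <- c(70, 116, 280)    # minutes

# mph = miles / hours
Speed <- Distance / (AirTime / 60)
Speed
```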

[Figure: histogram of the computed average speeds]

We have a few outliers with very high, implausible average speeds. Domain knowledge or a quick Google search can tell us that speeds of more than 800mph are not maintainable with current passenger planes. Thus, we will remove these flights from the data:
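The filtering step was lost in this copy; a sketch with base R's subset():

```r
flights <- data.frame(Speed = c(420, 510, 1200, 460))  # synthetic speeds

# Remove implausible average speeds above 800 mph
flights <- subset(flights, Speed <= 800)
nrow(flights)
```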

Choosing an appropriate Method

For building an actual model with your data, you have the choice between two worlds: statistical modelling and machine learning.

Broadly speaking, statistical models focus more on quantifying and interpreting the relationships between input variables and the outcome. This is the case in situations such as clinical studies, where the main goal is to describe the effect of a new medication.

Machine learning methods, on the other hand, focus on achieving high accuracy in prediction, sometimes sacrificing interpretability. This results in what are called "black box" algorithms, which are good at predicting the outcome but make it hard to see how the prediction is computed. A classic example of a question where machine learning is the appropriate answer is the product recommendation algorithm on online shopping websites.

For our questions, we are interested in the effects of certain input variables (speed and time of day / week). Thus we will make use of statistical models, namely a linear model and a generalized additive model.

To answer our first question, we first plot the variables of interest to get a first impression of the relationship. Since these plots will likely make it into the final report, or at least a presentation to your supervisors, it now makes sense to spend a little time on generating a pretty image. We will use the ggplot2 package to do this:

[Figure: scatterplot of departure delay vs. average speed, with a fitted line]

It seems like there is a slight increase in average speed for planes that leave with a larger delay. Let’s fit a linear model to quantify the effect:
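The model call was lost here. A sketch on synthetic data constructed to mirror the reported slope of 0.034 (the real call would be lm(Speed ~ DepDelay, data = flights)):

```r
set.seed(1)
DepDelay <- runif(200, 0, 120)                           # minutes of delay
Speed    <- 420 + 0.034 * DepDelay + rnorm(200, sd = 5)  # synthetic mph

fit <- lm(Speed ~ DepDelay)
coef(fit)["DepDelay"]  # estimated mph gained per minute of departure delay
```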

There is a highly significant effect of 0.034 for the departure delay. This represents the increase in average speed for each minute of delay. So, a plane with 60 minutes of delay will fly 2.04mph faster on average.

Even though the effect is highly significant with a p value of less than 0.0001, its actual effect is negligibly small.

For the second question of interest, we need a slightly more sophisticated model. Since we want to know the effect of the time of day on the arrival delay, we cannot assume a linear effect of the time on the delay. Let’s plot the data:

[Figure: scatterplots of departure time vs. arrival delay and log-delay, with smoothing lines]

We plot both the actual delay and our transformed log-delay variable. The smoothing line of the second plot gives a better image of the shape of the delay. It seems that delays are highest at around 8pm, and lowest at 5am. This emphasizes the fact that a linear model would not be appropriate here.

We fit a generalized additive model , a GAM, to this data. Since the response variable is right skewed, a Gamma distribution seems appropriate for the model family. To be able to use it, we have to transform the delay into a strictly positive variable, so we compute the maximum of 1 and the arrival delay for each observation first.
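The fitting code did not survive; a sketch with mgcv (which ships with standard R installations), on synthetic data shaped to be strictly positive and right-skewed:

```r
library(mgcv)  # provides gam() and the s() smooth

set.seed(2)
DepTime  <- runif(300, 5, 24)  # decimal departure times
ArrDelay <- pmax(1, 10 + 8 * sin((DepTime - 5) / 19 * pi) + rnorm(300, sd = 4))

# Smooth effect of time of day; Gamma family with log link for the skewed delay
fit <- gam(ArrDelay ~ s(DepTime), family = Gamma(link = "log"))
```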

[Figure: fitted GAM curve of arrival delay over the time of day]

We again see the trend of lower delays in the morning before 6am, and high delays around 8pm. To differentiate between weekdays, we now include this variable in the model:

With this model, we can create an artificial data frame x_new, which we use to plot one prediction line per weekday:
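Both code chunks were lost in this copy. A sketch of the pattern, adding the weekday to the GAM and building the artificial grid x_new with expand.grid() (synthetic data again):

```r
library(mgcv)

set.seed(3)
n <- 500
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
DepTime  <- runif(n, 5, 24)
Weekday  <- factor(sample(days, n, replace = TRUE), levels = days)
ArrDelay <- pmax(1, 8 + 6 * sin((DepTime - 5) / 19 * pi) +
                    2 * (Weekday %in% c("Fri", "Sun")) + rnorm(n, sd = 3))

# Smooth time-of-day effect plus a weekday effect
fit <- gam(ArrDelay ~ s(DepTime) + Weekday, family = Gamma(link = "log"))

# One prediction line per weekday over a fine time grid
x_new <- expand.grid(DepTime = seq(5, 24, by = 0.25),
                     Weekday = factor(days, levels = days))
x_new$pred <- predict(fit, newdata = x_new, type = "response")
```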

[Figure: predicted arrival delay over the day, one line per weekday]

We now see several things:

  • The nonlinear trend over the day is the same shape on every day of the week
  • Fridays are the worst days to fly by far, with Sunday being a close second. Expected delays are around 20 minutes during rush-hour (8pm)
  • Wednesdays and Saturdays are the quietest days
  • If you can manage it, fly on a Wednesday morning to minimize expected delays.

Closing remarks

As noted in the beginning of this post, this analysis is only one of many questions that can be tackled with this enormous data set. Feel free to browse the data expo website and especially the „Posters & results“ section for many other interesting analyses.

The post "A Data Science Case Study in R" first appeared on Statistik Service .


R case studies

Practice on your own, or with others

Our case studies collection

Here we have collected R case studies and walk-throughs so that you can practice and expand your R skills. You can access them by simply creating a free Applied Epi account and clicking the appropriate link below. You will need to have R installed to follow the case studies.

Target audience

Public health practitioners, epidemiologists, clinicians, and researchers who already have a basic competency in R who want additional practice or exposure to new uses of R in public health. All of our training materials focus on challenges and solutions for frontline practitioners and are built or curated by our team with extensive ground-level experience. Read more about our educational approach

These case studies are all either built by our team, or are open-access tutorials that have been translated by our team into R from another language (e.g. from Stata or SAS). Individual credits are provided in the case studies.

Our partners

To curate this collection of case studies, we have partnered with the EPIET Alumni Network, Médecins Sans Frontières (MSF) / Doctors Without Borders, and TEPHINET.

https://github.com/appliedepi/emory_training

COVID-19 - Fulton County, Georgia, USA


This case study walks the reader through analyzing COVID-19 data from Fulton County (near Atlanta, Georgia, USA). The result is an R Markdown report with data cleaning and analyses of demographics, temporal trends, spatial (GIS) mapping, etc.

Click here to access the website to download the RStudio project, data files, and to access the accompanying slides and instructions.

Please note that all these training materials use fake example data in which no person is identifiable and the actual values have been scrambled/jittered.

Foodborne outbreak investigation - Stengen, Germany

This case study is coming soon

GIS mapping case study - Am Timan, Chad

Time series case study - Scotland, UK


  • Case Study: Exploratory Data Analysis in R
  • by Daniel Pinedo
  • Last updated over 3 years ago

A Free, Interactive Course Using Tidy Tools

Predictive modeling, or supervised machine learning, is a powerful tool for using data to make predictions about the world around us. Once you understand the basic ideas of supervised machine learning, the next step is to practice your skills so you know how to apply these techniques wisely and appropriately. In this course, you will work through four case studies using data from the real world; you will gain experience in exploratory data analysis, preparing data so it is ready for predictive modeling, training supervised machine learning models using tidymodels, and evaluating those models.

To take this course, you need some familiarity with tidyverse packages like dplyr and ggplot2 and exposure to machine learning basics. Now let's get started!

Chapter 1: Not mtcars AGAIN

In this first case study, you will predict fuel efficiency from a US Department of Energy data set for real cars of today.

Chapter 2: Stack Overflow Developer Survey

Stack Overflow is the world's largest online community for developers, and you have probably used it to find an answer to a programming question. The second chapter of this course uses data from the annual Stack Overflow Developer Survey to practice predictive modeling and find which developers are more likely to work remotely.

Chapter 3: Get out the vote

In the third case study, you will use data on attitudes and beliefs in the United States to predict voter turnout. You will apply your skills in dealing with imbalanced data and explore more resampling options.

Chapter 4: But what do the nuns think?

The last case study in this course uses an extensive survey of Catholic nuns fielded in 1967 to once more put your practical machine learning skills to use. You will predict the age of these religious women from their responses about their beliefs and attitudes.

Introduction to Data Science

Chapter 1: Getting Started with R and RStudio

R is not a programming language like C or Java. It was not created by software engineers for software development. Instead, it was developed by statisticians as an interactive environment for data analysis. You can read the full history in the paper A Brief History of S 1 . The interactivity is an indispensable feature in data science because, as you will soon learn, the ability to quickly explore data is a necessity for success in this field. However, as in other programming languages, you can save your work as scripts that can be easily executed at any moment. These scripts serve as a record of the analysis you performed, a key feature that facilitates reproducible work. If you are an expert programmer, you should not expect R to follow the conventions you are used to, since you will be disappointed. If you are patient, you will come to appreciate the unparalleled power of R when it comes to data analysis and, specifically, data visualization.

Other attractive features of R are:

  • R is free and open source 2 .
  • It runs on all major platforms: Windows, macOS, UNIX/Linux.
  • Scripts and data objects can be shared seamlessly across platforms.
  • There is a large, growing, and active community of R users and, as a result, there are numerous resources for learning and asking questions 3 4 5 .
  • It is easy for others to contribute add-ons which enables developers to share software implementations of new data science methodologies. This gives R users early access to the latest methods and to tools which are developed for a wide variety of disciplines, including ecology, molecular biology, social sciences, and geography, just to name a few examples.

1.2 The R console

Interactive data analysis usually occurs on the R console that executes commands as you type them. There are several ways to gain access to an R console. One way is to simply start R on your computer. The console looks something like this:

As a quick example, try using the console to calculate a 15% tip on a meal that cost $19.71:
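The snippet was lost from this copy; the calculation is just:

```r
# 15% tip on a $19.71 meal
0.15 * 19.71
#> [1] 2.9565
```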

Note that in this book, grey boxes are used to show R code typed into the R console. The symbol #> is used to denote what the R console outputs.

1.3 Scripts

One of the great advantages of R over point-and-click analysis software is that you can save your work as scripts. You can edit and save these scripts using a text editor. The material in this book was developed using the interactive integrated development environment (IDE) RStudio 6 . RStudio includes an editor with many R specific features, a console to execute your code, and other useful panes, including one to show figures.

Most web-based R consoles also provide a pane to edit scripts, but not all permit you to save the scripts for later use.

All the R scripts used to generate this book can be found on GitHub 7 .

1.4 RStudio

RStudio will be our launching pad for data science projects. It not only provides an editor for us to create and edit our scripts but also provides many other useful tools. In this section, we go over some of the basics.

1.4.1 The panes

When you start RStudio for the first time, you will see three panes. The left pane shows the R console. On the right, the top pane includes tabs such as Environment and History , while the bottom pane shows five tabs: File , Plots , Packages , Help , and Viewer (these tabs may change in new versions). You can click on each tab to move across the different features.

To start a new script, you can click on File, then New File, then R Script.

This starts a new pane on the left and it is here where you can start writing your script.

1.4.2 Key bindings

Many tasks we perform with the mouse can be achieved with a combination of key strokes instead. These keyboard versions for performing tasks are referred to as key bindings . For example, we just showed how to use the mouse to start a new script, but you can also use a key binding: Ctrl+Shift+N on Windows and command+shift+N on the Mac.

Although in this tutorial we often show how to use the mouse, we highly recommend that you memorize key bindings for the operations you use most . RStudio provides a useful cheat sheet with the most widely used commands. You can get it from RStudio directly:

You might want to keep this handy so you can look up key-bindings when you find yourself performing repetitive point-and-clicking.

1.4.3 Running commands while editing scripts

There are many editors specifically made for coding. These are useful because color and indentation are automatically added to make code more readable. RStudio is one of these editors, and it was specifically developed for R. One of the main advantages provided by RStudio over other editors is that we can test our code easily as we edit our scripts. Below we show an example.

When you ask for the document to be saved for the first time, RStudio will prompt you for a name. A good convention is to use a descriptive name with lowercase letters, no spaces, hyphens to separate words, followed by the suffix .R . We will call this script my-first-script.R .

Now we are ready to start editing our first script. The first lines of code in an R script are dedicated to loading the libraries we will use. Another useful RStudio feature is that once we type library() it starts auto-completing with libraries that we have installed. Note what happens when we type library(ti) :

Another feature you may have noticed is that when you type library( the second parenthesis is automatically added. This will help you avoid one of the most common errors in coding: forgetting to close a parenthesis.

Now we can continue to write code. As an example, we will make a graph showing murder totals versus population totals by state. Once you are done writing the code needed to make this plot, you can try it out by executing the code. To do this, click on the Run button on the upper right side of the editing pane. You can also use the key binding: Ctrl+Shift+Enter on Windows or command+shift+return on the Mac.

Once you run the code, you will see it appear in the R console and, in this case, the generated plot appears in the plots console. Note that the plot console has a useful interface that permits you to click back and forward across different plots, zoom in to the plot, or save the plots as files.

To run one line at a time instead of the entire script, you can use Control-Enter on Windows and command-return on the Mac.

1.4.4 Changing global options

You can change the look and functionality of RStudio quite a bit.

As an example, we show how to make a change that we highly recommend : set "Save workspace to .RData on exit" to Never, and uncheck "Restore .RData into workspace at start". By default, when you exit, R saves all the objects you have created into a file called .RData. This is done so that when you restart a session in the same folder, these objects will be loaded. We find that this causes confusion, especially when we share code with colleagues and assume they have this .RData file. To change these options, make your General settings look like this:

1.5 Installing R packages

The functionality provided by a fresh install of R is only a small fraction of what is possible. In fact, we refer to what you get after your first install as base R . The extra functionality comes from add-ons available from developers. There are currently hundreds of these available from CRAN and many others shared via other repositories such as GitHub. However, because not everybody needs all available functionality, R instead makes different components available via packages . R makes it very easy to install packages from within R. For example, to install the dslabs package, which we use to share datasets and code related to this book, you would type:
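The command was lost from this copy; installing from CRAN is a one-time setup step (it requires internet access, so it is not run here):

```r
install.packages("dslabs")
library(dslabs)  # then load it for use
```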

In RStudio, you can navigate to the Tools tab and select install packages. We can then load the package into our R sessions using the library function:

As you go through this book, you will see that we load packages without installing them. This is because once you install a package, it remains installed and only needs to be loaded with library . The package remains loaded until we quit the R session. If you try to load a package and get an error, it probably means you need to install it first.

We can install more than one package at once by feeding a character vector to this function:
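The example was lost here; a character vector installs several packages in one call (again a setup command, not run here):

```r
install.packages(c("tidyverse", "dslabs"))
```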

Note that installing tidyverse actually installs several packages. This commonly occurs when a package has dependencies, or uses functions from other packages. When you load a package using library , you also load its dependencies.

Once packages are installed, you can load them into R and you do not need to install them again, unless you install a fresh version of R. Remember packages are installed in R not RStudio.

It is helpful to keep a list of all the packages you need for your work in a script, because if you ever need to perform a fresh install of R, you can re-install them all by simply running that script.
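A minimal sketch of such a script, using the two packages mentioned in this section:

```r
# my-packages.R: run after a fresh install of R to restore the packages you use
pkgs <- c("tidyverse", "dslabs")
install.packages(pkgs)
```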

You can see all the packages you have installed using the following function:
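For example, to store the result and inspect the first few package names:

```r
# installed.packages() returns a matrix with one row per installed package
pkgs <- installed.packages()
head(rownames(pkgs))
```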

https://pdfs.semanticscholar.org/9b48/46f192aa37ca122cfabb1ed1b59866d8bfda.pdf ↩︎

https://opensource.org/history ↩︎

https://stats.stackexchange.com/questions/138/free-resources-for-learning-r ↩︎

https://www.r-project.org/help.html ↩︎

https://stackoverflow.com/documentation/r/topics ↩︎

https://www.rstudio.com/ ↩︎

https://github.com/rafalab/dsbook ↩︎





  13. RPubs

    Forgot your password? Sign InCancel. RPubs. by RStudio. Sign inRegister. Case Study: Exploratory Data Analysis in R. by Daniel Pinedo. Last updatedover 3 years ago. HideComments(-)ShareHide Toolbars.

  14. Supervised machine learning case studies in R! · A free interactive course

    The last case study in this course uses an extensive survey of Catholic nuns fielded in 1967 to once more put your practical machine learning skills to use. You will predict the age of these religious women from their responses about their beliefs and attitudes. <p>This is a free, open source course on supervised machine learning in R.

  15. Chapter 1 Getting started with R and RStudio

    1.4.1 The panes. When you start RStudio for the first time, you will see three panes. The left pane shows the R console. On the right, the top pane includes tabs such as Environment and History, while the bottom pane shows five tabs: File, Plots, Packages, Help, and Viewer (these tabs may change in new versions).

  16. Advanced R Statistical Programming and Data Models

    Each chapter starts with conceptual background information about the techniques, includes multiple examples using R to achieve results, and concludes with a case study. Written by Matt and Joshua F. Wiley, Advanced R Statistical Programming and Data Models shows you how to conduct data analysis using the popular R language. You'll delve into ...

  17. Case Study Using R Programming

    If the issue persists, it's likely a problem on our side. Unexpected token < in JSON at position 4. keyboard_arrow_up. content_copy. SyntaxError: Unexpected token < in JSON at position 4. Refresh. Explore and run machine learning code with Kaggle Notebooks | Using data from No attached data sources.

  18. Data Science Masterclass With R 8 Case Studies + 4 Projects

    Project 1 - Titanic Case Study which is based on Classification Problem. Project 2 - E-commerce Sale Data Analysis - based on Regression. Project 3 - Customer Segmentation which is based on Unsupervised learning. Final Project - Market Basket Analysis - based on Association rule mining.

  19. Data Mining with R

    The second part includes case studies, and the new edition strongly revises the R code of the case studies making it more up-to-date with recent packages that have emerged in R. ... He teaches Data Mining in R in the NYU Stern School of Business' MS in Business Analytics program. An active researcher in machine learning and data mining for ...

  20. PDF R and Data Mining: Examples and Case Studies

    This book introduces into using R for data mining. It presents many examples of various data mining functionalities in R and three case studies of real world applications. The supposed audience of this book are postgraduate students, researchers, data miners and data scientists who are interested in using R to do their data mining research and ...

  21. Elsevier Education Portal

    Skip to main content

  22. Case Studies

    Featured Case Studies. Avison Young Increases CRM Adoption from 23% to 90%. Real Estate. Team Adoption. Revenue Forecasting. Read more. Momentive Aligns Marketing Processes With HubSpot. Software & Technology. Integrations.

  23. Browse journals and books

    Abstract Domains in Constraint Programming. Book • 2015. Abstracts and Abstracting. A Genre and Set of Skills for the Twenty-First Century. Book • 2010. AC Power Conditioners. Design and Application ... Case Study-Based Assessment of Current Experience, Cross-sectorial Effects, and Socioeconomic Transformations. Book • 2019. Accelerator ...

  24. Case Study: Incorporating ODMAP Data in the Overdose Fatality Review

    This publication by the Bureau of Justice Assistance (BJA) Comprehensive Opioid, Stimulant, and Substance Use Program (COSSUP), along with Overdose Fatality Review (OFR) and ODMAP (Overdose Detection Mapping Application Program), provides a case study on the incorporation of ODMAP Data in the Overdose Fatality Review Process in Ocean County, New Jersey.

  25. What Is the Best Programming Language to Learn?

    For one, C# is a lot easier to learn. It's simpler and less complex but can still be used to create a variety of different applications. It's also a lot better for web development than C++. It is quite popular for game development and sits in the middle of the highest-salary languages.

  26. IBM Cybersecurity Analyst Professional Certificate

    Throughout the program, you will use virtual labs and internet sites that will provide you with practical skills with applicability to real jobs that employers value, including: ... Breach Response Case Studies IBM digital badge. In this course, you will learn to: Apply incident response methodologies. Research and describe a watering hole ...

  27. What Is Artificial Intelligence? Definition, Uses, and Types

    Artificial intelligence (AI) is the theory and development of computer systems capable of performing tasks that historically required human intelligence, such as recognizing speech, making decisions, and identifying patterns. AI is an umbrella term that encompasses a wide variety of technologies, including machine learning, deep learning, and ...

  28. The Old New Thing

    Lock-free reference-counting a TLS slot using atomics, part 1. June 12, 2024 Jun 12, 2024 06/12/24 Raymond Chen. First, we do it with locks. 3 0. Code. The origin story of the Windows 3D Pipes screen saver. June 11, 2024 Jun 11, 2024 06/11/24 Raymond Chen. Looking for a place to show off. ...

  29. CCNA

    Identify key elements of a security program, like user awareness and training. Demonstrate practical skills like setting up secure access to devices and networks. ... A variety of resources are available to help you study - from guided learning to self-study and a community forum. Explore exams and training. Unlock your career potential

  30. CASE STUDY: Building Mutually Beneficial Partnerships With Universities

    This publication by the Bureau of Justice Assistance (BJA) Comprehensive Opioid, Stimulant, and Substance Use Program (COSSUP) provides a case study on building relationships with universities. The Martinsburg, West Virginia, Initiative (TMI) has engaged Shepherd University in efforts to support evaluation and workforce development.