
EveThan/IBM-Applied-Data-Science-Capstone-Project


IBM Applied Data Science Capstone Project

The PowerPoint slides for this project can be found in Capstone_Presentation.pptx or Capstone_Presentation.pdf.

Executive summary

In this capstone project, we will predict if the SpaceX Falcon 9 first stage will land successfully using several machine learning classification algorithms. The main steps in this project include:

  • Data collection, wrangling, and formatting
  • Exploratory data analysis
  • Interactive data visualization
  • Machine learning prediction

Our graphs show that some features of the rocket launches are correlated with the outcome of the launches, i.e., success or failure. It is also concluded that the decision tree may be the best machine learning algorithm for predicting whether the Falcon 9 first stage will land successfully.

Introduction

In this capstone, we will predict if the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars, while other providers charge upward of 165 million dollars per launch; much of the savings comes from SpaceX's ability to reuse the first stage. Therefore, if we can determine whether the first stage will land, we can estimate the cost of a launch. This information can be used if an alternative company wants to bid against SpaceX for a rocket launch.

Most unsuccessful landings are planned. Sometimes, SpaceX will perform a controlled landing in the ocean. The main question that we are trying to answer is: given a set of features about a Falcon 9 rocket launch, such as its payload mass, orbit type, and launch site, will the first stage of the rocket land successfully?

Methodology

The overall methodology includes:

  • Data collection, wrangling, and formatting, using:
      • SpaceX API
      • Web scraping
  • Exploratory data analysis (EDA), using:
      • Pandas and NumPy
      • SQL
  • Data visualization, using:
      • Matplotlib and Seaborn
      • Folium
      • Dash
  • Machine learning prediction, using:
      • Logistic regression
      • Support vector machine (SVM)
      • Decision tree
      • K-nearest neighbors (KNN)

Data collection using SpaceX API

1_Data Collection API.ipynb

Libraries or modules used: requests, pandas, numpy, datetime

  • The API used is here.
  • The API provides data about many types of rocket launches by SpaceX; the data is therefore filtered to include only Falcon 9 launches.
  • The API is accessed using requests.get().
  • The json result is converted to a dataframe using the json_normalize() function from pandas.
  • Every missing value in the data is replaced with the mean of the column that it belongs to.
  • We end up with 90 rows or instances and 17 columns or features.
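These collection steps can be sketched in Python. The record fields below are illustrative stand-ins for the API's JSON (not the real schema), and the live requests.get() call is shown commented out so the sketch runs offline:

```python
import pandas as pd

# Live call (commented out so the sketch runs offline):
# import requests
# records = requests.get("https://api.spacexdata.com/v4/launches/past").json()

# Inline stand-in for the API's JSON; field names are illustrative.
records = [
    {"flight_number": 1, "rocket": "falcon9", "payload_mass_kg": 500.0},
    {"flight_number": 2, "rocket": "falcon9", "payload_mass_kg": None},
    {"flight_number": 3, "rocket": "falcon1", "payload_mass_kg": 300.0},
]

df = pd.json_normalize(records)                 # flatten the JSON records into a dataframe
df = df[df["rocket"] == "falcon9"].copy()       # keep only Falcon 9 launches
# Replace each missing value with the mean of its column
df["payload_mass_kg"] = df["payload_mass_kg"].fillna(df["payload_mass_kg"].mean())
```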

Data Collection with Web Scraping

2_Data Collection with Web Scraping.ipynb

Libraries or modules used: sys, requests, BeautifulSoup from bs4, re, unicodedata, pandas

  • The data is scraped from List of Falcon 9 and Falcon Heavy launches.
  • The website contains only the data about Falcon 9 launches.
  • First, the Falcon 9 Launch Wiki page is requested from the URL and a BeautifulSoup object is created from the response of requests.get().
  • Next, all column/variable names are extracted from the HTML table header by using the find_all() function from BeautifulSoup.
  • A dataframe is then created with the extracted column names and entries filled with launch records extracted from table rows.
  • We end up with 121 rows or instances and 11 columns or features.
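The same flow can be sketched with a miniature stand-in for the Wikipedia table (the HTML below is illustrative, not the real page):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Tiny stand-in for the launch table on the Wiki page (structure is illustrative).
html = """
<table>
  <tr><th>Flight No.</th><th>Launch site</th><th>Payload</th></tr>
  <tr><td>1</td><td>CCAFS</td><td>Dragon</td></tr>
  <tr><td>2</td><td>KSC</td><td>Starlink</td></tr>
</table>
"""
# With a live page: soup = BeautifulSoup(requests.get(url).text, "html.parser")
soup = BeautifulSoup(html, "html.parser")

# Extract the column names from the table header cells
columns = [th.get_text(strip=True) for th in soup.find_all("th")]
# Extract one launch record per table row, skipping the header row
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr") if tr.find_all("td")]

df = pd.DataFrame(rows, columns=columns)
```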

EDA with Pandas and Numpy

3_EDA.ipynb

Libraries or modules used: pandas, numpy

Functions from the Pandas and NumPy libraries such as value_counts() are used to derive basic information about the data collected, which includes:

  • The number of launches on each launch site
  • The number of occurrences of each orbit
  • The number of occurrences of each mission outcome
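For example, with an illustrative dataframe of launch records (column names here are assumptions for the sketch, not the real dataset):

```python
import pandas as pd

# Illustrative launch records, not the real dataset
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40", "CCAFS SLC 40", "KSC LC 39A", "VAFB SLC 4E"],
    "Orbit":      ["LEO",          "GTO",          "LEO",        "ISS"],
})

site_counts = df["LaunchSite"].value_counts()   # launches per launch site
orbit_counts = df["Orbit"].value_counts()       # occurrences of each orbit
```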

EDA with SQL

4_EDA with SQL.ipynb

Framework used: IBM DB2

Libraries or modules used: ibm_db

The data is queried using SQL to answer several questions about the data such as:

  • The names of the unique launch sites in the space mission
  • The total payload mass carried by boosters launched by NASA (CRS)
  • The average payload mass carried by booster version F9 v1.1

The SQL statements or functions used include SELECT, DISTINCT, AS, FROM, WHERE, LIMIT, LIKE, SUM(), AVG(), MIN(), BETWEEN, COUNT(), and YEAR().
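The flavor of these queries can be reproduced with Python's built-in sqlite3 in place of IBM DB2 and ibm_db; the table and column names below are assumptions for the sketch, not the notebook's actual schema:

```python
import sqlite3

# A miniature stand-in for the launches table (schema is an assumption)
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE SPACEXTBL (
    Launch_Site TEXT, Customer TEXT,
    Booster_Version TEXT, Payload_Mass_kg INTEGER)""")
con.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?, ?, ?)",
    [("CCAFS LC-40", "NASA (CRS)", "F9 v1.1", 500),
     ("CCAFS LC-40", "NASA (CRS)", "F9 v1.1", 700),
     ("KSC LC-39A", "SES", "F9 FT", 4000)],
)

# Names of the unique launch sites
sites = [r[0] for r in con.execute("SELECT DISTINCT Launch_Site FROM SPACEXTBL")]
# Total payload mass carried by boosters launched for NASA (CRS)
total = con.execute("SELECT SUM(Payload_Mass_kg) FROM SPACEXTBL "
                    "WHERE Customer = 'NASA (CRS)'").fetchone()[0]
# Average payload mass carried by booster version F9 v1.1
avg = con.execute("SELECT AVG(Payload_Mass_kg) FROM SPACEXTBL "
                  "WHERE Booster_Version = 'F9 v1.1'").fetchone()[0]
```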

Data Visualization using Matplotlib and Seaborn

5_EDA Visualization.ipynb

Libraries or modules used: pandas, numpy, matplotlib.pyplot, seaborn

Functions from the Matplotlib and Seaborn libraries are used to visualize the data through scatterplots, bar charts, and line charts. The plots and charts are used to understand more about the relationships between several features, such as:

  • The relationship between flight number and launch site
  • The relationship between payload mass and launch site
  • The relationship between success rate and orbit type

Examples of functions from seaborn that are used here are scatterplot(), barplot(), catplot(), and lineplot().
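A minimal sketch of one such chart, on illustrative data (the Agg backend is selected so the script runs without a display; column names are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data: launch outcomes (1 = success) by orbit type
df = pd.DataFrame({
    "Orbit": ["LEO", "LEO", "GTO", "GTO", "ISS", "ISS"],
    "Class": [1, 1, 0, 1, 1, 0],
})

# Success rate per orbit type, drawn as a bar chart
rates = df.groupby("Orbit")["Class"].mean().reset_index()
ax = sns.barplot(x="Orbit", y="Class", data=rates)
ax.set_ylabel("Success rate")
plt.savefig("success_rate_by_orbit.png")
```

Saving to a file rather than calling plt.show() keeps the sketch non-interactive.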


Data Visualization using Folium

6_Interactive Visual Analytics with Folium lab.ipynb

Libraries or modules used: folium, wget, pandas, math

Functions from the Folium libraries are used to visualize the data through interactive maps. The Folium library is used to:

  • Mark all launch sites on a map
  • Mark the succeeded launches and failed launches for each site on the map
  • Mark the distances from a launch site to nearby features such as the nearest city, railway, or highway

These are done using functions from folium such as add_child() and folium plugins which include MarkerCluster, MousePosition, and DivIcon.


Data Visualization using Dash

7_spacex_dash_app.py

Libraries or modules used: pandas, dash, dash_html_components, dash_core_components, Input and Output from dash.dependencies, plotly.express

Functions from Dash are used to generate an interactive site where we can toggle the input using a dropdown menu and a range slider. Using a pie chart and a scatterplot, the interactive site shows:

  • The total number of successful launches from each launch site
  • The correlation between payload mass and mission outcome (success or failure) for each launch site

The application is launched on a terminal on the IBM Skills Network website.


Machine Learning Prediction

8_Machine Learning Prediction.ipynb

Libraries or modules used: pandas, numpy, matplotlib.pyplot, seaborn, sklearn

Functions from the Scikit-learn library are used to create our machine learning models. The machine learning prediction phase includes the following steps:

  • Standardizing the data using the preprocessing.StandardScaler() function from sklearn
  • Splitting the data into training and test sets using the train_test_split function from sklearn.model_selection
  • Creating machine learning models, which include:
      • Logistic regression using LogisticRegression from sklearn.linear_model
      • Support vector machine (SVM) using SVC from sklearn.svm
      • Decision tree using DecisionTreeClassifier from sklearn.tree
      • K-nearest neighbors (KNN) using KNeighborsClassifier from sklearn.neighbors
  • Fitting the models on the training set
  • Finding the best combination of hyperparameters for each model using GridSearchCV from sklearn.model_selection
  • Evaluating the models based on their accuracy scores and confusion matrices using the score() function and confusion_matrix from sklearn.metrics
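The steps above can be sketched end-to-end on synthetic data (the feature matrix below is a random stand-in for the real launch data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the launch features (X) and landing outcome (y)
rng = np.random.default_rng(42)
X = rng.normal(size=(90, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Standardize, then split into training and test sets
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y)

# Find the best hyperparameter combination with 10-fold cross-validation
params = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6]}
tree_cv = GridSearchCV(DecisionTreeClassifier(random_state=2), params, cv=10)
tree_cv.fit(X_train, y_train)

accuracy = tree_cv.score(X_test, y_test)                # test-set accuracy
cm = confusion_matrix(y_test, tree_cv.predict(X_test))  # 2x2 confusion matrix
```

tree_cv.best_score_ holds the GridSearchCV best score of the kind used to rank the models.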

Putting the results of all 4 models side by side, we can see that they all share the same accuracy score and confusion matrix when tested on the test set. Therefore, their GridSearchCV best scores are used to rank them instead. Based on these scores, the models are ranked as follows, from best to worst:

  • Decision tree (GridSearchCV best score: 0.8892857142857142)
  • K nearest neighbors, KNN (GridSearchCV best score: 0.8482142857142858)
  • Support vector machine, SVM (GridSearchCV best score: 0.8482142857142856)
  • Logistic regression (GridSearchCV best score: 0.8464285714285713)


From the data visualization section, we can see that some features may be correlated with the mission outcome. For example, with heavy payloads, the successful landing rate is higher for the Polar, LEO, and ISS orbit types. For GTO, however, we cannot distinguish this well, as both successful and unsuccessful landings occur there.

Therefore, each feature may have a certain impact on the final mission outcome. The exact ways in which each of these features impacts the mission outcome are difficult to decipher. However, we can use machine learning algorithms to learn the patterns in past data and predict whether a mission will be successful based on the given features.

In this project, we try to predict if the first stage of a given Falcon 9 launch will land in order to determine the cost of a launch. Each feature of a Falcon 9 launch, such as its payload mass or orbit type, may affect the mission outcome in a certain way.

Several machine learning algorithms are employed to learn the patterns of past Falcon 9 launch data and produce predictive models for the outcome of a Falcon 9 launch. The predictive model produced by the decision tree algorithm performed the best among the 4 algorithms employed.

~ Project created in January 2022 ~


Data Science Capstone Project: Milestone Report

Alexey Serdyuk

Table of contents:

  • Prerequisites
  • Obtaining the data
  • Splitting the data
  • First glance on the data and general plan
  • Cleaning up and preprocessing the corpus
  • Analyzing words (1-grams)
  • Analyzing bigrams
  • Pruning bigrams
  • 3-grams to 6-grams
  • Conclusions and next steps

This is a milestone report for Week 2 of the capstone project of the Data Science Specialization, a cycle of courses offered on Coursera by Johns Hopkins University.

The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the next most probable word. The application may be used, for example, on mobile devices to provide suggestions as the user types text.

In this report we provide an initial analysis of the data and discuss our approach to building the application.

An important question is which library to use for processing and analyzing the corpora, as R provides several alternatives. Initially we attempted to use the library tm, but quickly found that the library is very memory-hungry, and an attempt to build bi- or trigrams for a large corpus is not practical. After some googling, we decided to use the library quanteda instead.

We start by loading required libraries.

To speed up processing of large data sets, we will apply the parallel version of the lapply function from the library parallel. To use all available resources, we detect the number of CPU cores and configure the library to use them all.

Here and at some points later we use caching to speed up rendering of this document. Results of long-running operations are stored and reused during the next run. If you wish to re-run all operations, just remove the cache directory.

We download the data from the URL provided in the course description, and unzip it.

The downloaded zip file contains corpora in several languages: English, German, Russian and Finnish. In our project we will use only English corpora.

The corpus for each language, including English, contains 3 files with content obtained from different sources: news, blogs, and Twitter.

As the first step, we will split each relevant file into 3 parts:

  • Training set (60%) will be used to build and train the algorithm.
  • Testing set (20%) will be used to test the algorithm during its development. This set may be used more than once.
  • Validation set (20%) will be used for a final validation and estimation of out-of-sample performance. This set will be used only once.

We define a function which splits the specified file into the parts described above:
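The report defines this function in R; an equivalent Python sketch of the 60/20/20 random split might look like this (each line is assigned independently, so the proportions are approximate, as the counts in the table below show):

```python
import random

def split_lines(lines, seed=20190331):
    """Split lines into training (60%), testing (20%) and validation (20%) sets."""
    rnd = random.Random(seed)  # seeded for reproducibility
    training, testing, validation = [], [], []
    for line in lines:
        r = rnd.random()
        if r < 0.6:
            training.append(line)
        elif r < 0.8:
            testing.append(line)
        else:
            validation.append(line)
    return training, testing, validation

train, test, valid = split_lines([f"line {i}" for i in range(10000)])
```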

To make the results reproducible, we set the seed of the random number generator.

Finally, we split each of the data files.

As a sanity check, we count the number of lines in each source file, as well as in the partial files produced by the split.

                             Blogs                News                 Twitter
                             Rows      %          Rows       %         Rows       %
Training                     539,572   59.99991   606,145    59.99998  1,416,088  59.99997
Testing                      179,858   20.00004   202,048    19.99996  472,030    20.00002
Validation                   179,858   20.00004   202,049    20.00006  472,030    20.00002
Total                        899,288  100.00000   1,010,242 100.00000  2,360,148 100.00000
Control (expected to be 0)   0         NA         0          NA        0          NA

As the table shows, we have split the data into subsets as intended.

In the section above we already counted the number of lines. Let us load the training data sets and take a look at the first 3 lines of each.

We can see that the data contains not only words, but also numbers and punctuation. The punctuation may be non-ASCII (Unicode), as the first example in the blogs sample shows (it contains the character “…”, which is different from 3 ASCII full-stop characters “...”). Some lines may contain multiple sentences, and we probably have to take this into account.

Here is our plan:

  • Split the text into sentences.
  • Clean up the corpus: remove non-language parts such as e-mail addresses and URLs, etc.
  • Preprocess the corpus: remove punctuation and numbers, change all words to lower-case.
  • Analyze distribution of words to decide if we should base our prediction on the full dictionary, or just on some sub-set of it.
  • Analyze n-grams for small n.

We decided to split the text into sentences and not attempt to predict words across sentence borders. We may still use information about sentences to improve prediction of the first word, because the frequency of a word at the start of a sentence may be very different from its average frequency.

The libraries contain some functions for cleaning up and pre-processing, but for some steps we have to write the functions ourselves.

Now we pre-process the data.

In this section we will study the distribution of words in the corpora, ignoring for the moment the interaction between words (n-grams).

We define two helper functions. The first one creates a Document Feature Matrix (DFM) for n-grams in documents, and aggregates it over all documents to a Feature Vector. The second helper function enriches the Feature Vector with additional values useful for our analysis, such as cumulated coverage of text.
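The report builds these helpers on quanteda's DFM in R; the aggregation and cumulative-coverage logic can be illustrated in Python with a plain Counter (the toy token list is purely illustrative):

```python
from collections import Counter

def feature_vector(tokens):
    """Aggregate token counts and enrich with cumulative text coverage."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows, cum = [], 0
    for word, n in counts.most_common():
        cum += n
        rows.append((word, n, cum / total))   # (word, frequency, cumulative coverage)
    return rows

def words_for_coverage(tokens, target):
    """How many most-frequent words are needed to cover `target` of the text."""
    for i, (_, _, cov) in enumerate(feature_vector(tokens), start=1):
        if cov >= target:
            return i

tokens = "the cat sat on the mat the cat ran".split()
```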

Now we may calculate frequency of words in each source, as well as in all sources together (aggregated).

The following chart displays 20 most-frequent words in each source, as well as in the aggregated corpora.

As the chart shows, the top 20 most-frequent words differ between sources. For example, the most frequent word in news is “said”, but this word is not in the top-20 list for blogs or Twitter at all. At the same time, some words are shared between lists: the word “can” is the 2nd most frequent in blogs, the 3rd most frequent in Twitter, and the 5th in news.

Our next step is to analyze the intersection, that is, to find how many words are common to all sources and how many are unique to a particular source. Not only is the number of words important, but also the source coverage: what percentage of the whole text of a particular source is covered by a particular subset of all words.

The following Venn diagram shows a number of unique words (stems) used in each source, as well as a percentage of the aggregated corpora covered by those words.

As we may see, 46686 words are shared by all 3 corpora, but those words cover 97.46% of the aggregated corpora. On the other hand, there are 83185 words unique to blogs, but these words appear very infrequently, covering just 0.43% of the aggregated corpora.

The Venn diagram indicates that we may get a high coverage of all corpora by choosing common words. Coverage by words specific to a particular corpus is negligible.

The next step in our analysis is to find out how many common words we should choose to achieve a decent coverage of the text. From the Venn diagram we already know that by choosing 46,686 words we will cover 97.46% of the aggregated corpora, but perhaps we can reduce the number of words without significantly reducing the coverage.

The following chart shows the number of unique words in each source which cover a particular percentage of the text. For example, the 1000 most-frequent words cover 68.09% of the Twitter corpus. An interesting observation is that Twitter requires fewer words to cover a particular percentage of the text, whereas news requires more.

Corpora Coverage Blogs News Twitter Aggregated
75% 2,004 2,171 1,539 2,136
90% 6,395 6,718 5,325 6,941
95% 13,369 13,689 11,922 15,002
99% 63,110 53,294 71,575 88,267
99.9% 149,650 126,585 161,873 302,693

The table shows that in order to cover 95% of blogs, we require 13,369 words. The same coverage of news requires 13,689 words, and of Twitter 11,922 words. To cover 95% of the aggregated corpora, we require 15,002 unique words. We may use this fact later to reduce the number of n-grams required for predictions.

In this section we will study the distribution of bigrams, that is, combinations of two words.

Using previously defined functions, we may calculate frequency of bigrams in each source, as well as in all sources together (aggregated).

The following chart displays 20 most-frequent bigrams in each source, as well as in the aggregated corpora.

We immediately see a difference from the lists of top 20 words: there were many more words common to all sources than there are common bigrams. There are still some common bigrams, but the intersection is smaller.

Similar to how we proceeded with words, we will now analyze intersections, that is, find how many bigrams are common to all sources and how many are unique to a particular source. We also calculate the percentage of each source covered by a particular subset of all bigrams.

The following Venn diagram shows a number of unique bigrams used in each source, as well as a percentage of the aggregated corpora covered by those bigrams.

The difference between words and bigrams is even more pronounced here. Bigrams common to all sources cover just 46.23% of the text, compared to more than 95% covered by words common to all sources.

The next step in our analysis is to find out how many common bigrams we should choose to achieve a decent coverage of the text.

The following chart shows a number of unique bigrams in each source which cover particular percentage of the text. For example, 1000 most-frequent bigrams cover 8.66% of the Twitter corpus.

Corpora Coverage Blogs News Twitter Aggregated
75% 1,945,493 1,810,320 1,154,697 2,697,841
90% 3,449,146 3,393,854 2,329,516 6,772,302
95% 3,950,364 3,921,699 2,721,122 8,192,971
99% 4,351,338 4,343,975 3,034,407 9,329,506
99.9% 4,441,557 4,438,987 3,104,896 9,585,226

The table shows that in order to cover 95% of blogs, we require 3,950,364 bigrams. The same coverage of news requires 3,921,699 bigrams, and of Twitter 2,721,122 bigrams. To cover 95% of the aggregated corpora, we require 8,192,971 bigrams.

The chart is also very different from the similar chart for words. The curve for words had an “S”-shape, that is, its growth slowed down after some number of words, so that adding more words yields diminishing returns. For bigrams there is no point of diminishing returns: the curves just keep rising.

As we found in the section Analyzing words (1-grams), our corpora contain \(N_1=\) 335,906 unique word stems. Potentially there could be \(N_1^2=\) 112,832,840,836 bigrams, but we observed only \(N_2=\) 9,613,640, that is, 0.0085% of all possible bigrams. Still, the number of observed bigrams is pretty large. In the same section we found that we may cover a large part of the corpus with a relatively small number of unique word stems. In the next section we will see whether we may reduce the number of unique 2-grams by utilizing that knowledge.

We found in the section Analyzing words (1-grams) that our corpora contain \(N_1=\) 335,906 unique word stems, but just 15,002 of them cover 95% of the corpus. In this section we will analyze whether we may reduce the number of bigrams by utilizing that knowledge.

We will replace rare words with a special token UNK. This will reduce the number of bigrams, because different word sequences may now produce the same bigram if those sequences contain rare words. For example, our word list contains the names “Andrei”, “Charley” and “Fabio”, but these words do not belong to the subset of most common words required to achieve 95% coverage of the corpus. If our corpus contains the bigrams “Andrei told”, “Charley told” and “Fabio told”, we will replace them all with the single bigram “UNK told”.
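The report implements this pruning in R; a minimal Python sketch of the idea, using the example words above:

```python
def prune(tokens, whitelist, unk="UNK"):
    """Replace every token not present in the white-list with the UNK token."""
    return [t if t in whitelist else unk for t in tokens]

def bigrams(tokens):
    """All adjacent word pairs in a token sequence."""
    return list(zip(tokens, tokens[1:]))

whitelist = {"told", "me", "a", "story"}

# Sentences with different rare names collapse onto the same pruned tokens
s1 = prune("Andrei told me a story".split(), whitelist)
s2 = prune("Charley told me a story".split(), whitelist)
```

Both sentences now begin with the same bigram ("UNK", "told"), so the bigram table shrinks.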

Since we will apply the same approach to 3-grams, 4-grams, etc., to save time we prune the corpora once and save the results to files which we may load later.

We start by defining a function that accepts a sequence of words, a white-list and a replacement token. All words in the sentence which are not included in the white-list are replaced by the token.

Now we create a white-list that contains:

  • frequent words which cover 95% of the corpus,
  • stop-words, that is, functional words like “a” or “the”, which we excluded from our word frequency analysis,
  • the special Start-Of-Sentence and End-Of-Sentence tokens introduced earlier.

And now we apply the function defined above to replace all words not included in the white-list with the token UNK .

After pruning rare words, we re-calculate the bigrams. From now on, we will analyze only the aggregated corpus.

The chart shows the coverage of the corpora by pruned bigrams, where different types of bigrams are indicated by different colors. For several counts, the chart also marks the points where bigrams were encountered a particular number of times. For example, there are 104,625 unique bigrams encountered more than 30 times.

By pruning we have reduced the number of unique bigrams from 9,613,640 to 7,341,432, that is, by 23.64%. At this stage it is hard to tell whether pruning makes sense: on the one hand, it reduces the number of unique 2-grams and thus the memory requirements of our application; on the other hand, it removes information which may be required to achieve a good prediction rate.

After analyzing bigrams, it is time to take a look at longer n-grams. We decided to analyze 3-grams to 6-grams.

The charts below show the coverage of the corpora by pruned 2- to 6-grams, where different colors indicate n-grams with a different number of pruned words (UNK tokens). As with 2-grams, the charts also mark, for several counts, the points where n-grams were encountered a particular number of times.

As \(n\) grows, the number of repeated n-grams decreases. This property is quite obvious: for example, there are many more common 2-grams (like “last year” or “good luck”) than common 6-grams. A consequence of this property is less obvious, but clearly visible in the charts: as \(n\) grows, one requires more and more unique n-grams to cover the same percentage of the text. For single words we could choose a small subset that covers 95% of the corpora, but for n-grams, achieving high corpora coverage with a small subset is impossible.

Corpora Coverage 2-grams 3-grams 4-grams 5-grams 6-grams
25% 0.27 8.40 22.18 23.75 24.07
50% 3.06 38.82 48.12 49.17 49.38
75% 20.12 69.41 74.06 74.58 74.69
95% 80.34 93.88 94.81 94.92 94.94

The table above shows the percentage of n-grams required to cover a particular percentage of the aggregated corpus for various n. For example, one requires 3.06% of 2-grams to cover 50% of the corpus, but the same coverage requires 38.82% of 3-grams. As we can see, even for 2-grams we cannot significantly reduce the number of unique n-grams without significantly reducing the coverage as well.

Conclusions from the data analysis:

  • If we keep only the most-often-used words required for 95% coverage of the corpora, we may significantly reduce the number of distinct words, but the number of 2- to 6-grams is not significantly affected.
  • To get a decent coverage of the corpora by 2- to 6-grams, we require millions of entries. To reduce memory usage, we probably have to use some encoding scheme. Here we may again return to the idea of keeping only the most-often-used words to reduce the size of the encoded data.
  • For small \(n\), many n-grams are encountered in the corpora multiple times, but for large \(n\) most n-grams are encountered just once. This is intuitively obvious: we expect common bigrams like “last year” or “good luck” to be repeated many times, but the probability that a particular 6-gram repeats is pretty low. When developing a prediction model, we have to test whether it makes sense at all to include n-grams with large \(n\), since this could be overfitting to the training set without any benefit to prediction quality.

Open questions:

  • Should we include stop-words in our prediction set? On one hand, stop-words are too common and may be considered a syntactical “garbage”, on the other hand they are an important part of a natural language.
  • How should we use the stemming? On one hand, stemming may reduce the number of n-grams and improve prediction quality. On the other hand, we want to predict full words, not just stems. At the moment we are inclined to stem words used for the prediction, but not the predicted word. This approach may require a custom implementation of the tokenization algorithm.
  • Should we replace rare words with the special token UNK, or should we keep such words?
  • Does it make sense to use n-grams for large \(n\) , or could we overfit our model with them?
  • Should we replace common abbreviations with full text before training our algorithm, or should we keep the abbreviations (for example, AFAIK = “as far as I know”)?
  • Should we remove profanity from the corpora, or should we keep it as a part of the natural language? At the moment we are inclined to keep profanity in the dictionary, but never predict it. Keeping profanity in the dictionary may improve the quality of prediction for non-profane words. For example, after the bigram “for fuck’s” we may predict the non-profane word “sake”. If we had excluded profanity from our dictionary, we might miss the right prediction in this case.

To answer most questions above, we have to create several models and run them against a test data set.

Next steps:

  • Implement the simplest possible prediction algorithm.
  • Test the algorithm on the test data set, analyze quality of prediction and optimize the algorithm.

Coursera Data Science Capstone - Milestone Report

Linnaeus Bundalian, January 08, 2021

Introduction

The Coursera Data Science Capstone - Milestone Report (aka, “the report” ) is intended to give an introductory look at analyzing the SwiftKey data set and figuring out:

  • What the data consists of, and
  • Identifying the standard tools and models used for this type of data.

The report is then to be written in a clear, concise style that a data scientist or a non-data scientist can understand and make sense of.

The purpose of the report is a four-fold exploratory data analysis that will:

Demonstrate that the data has been downloaded from Swiftkey (via Coursera) and successfully loaded into R.

Create a basic report of summary statistics about the data sets to include:

  • Word counts, line counts and basic data tables,
  • Basic plots, such as histograms and pie charts, to illustrate features of the data.

Report any interesting findings about the data so far amassed.

Present the basic plan behind creating a prediction algorithm and Shiny app from the data.

Load Data, Calculate Size, Word Count, and Summarize

The Swiftkey data consists of four datasets, each in a different language (one each in German, English, Russian, and Finnish), containing random:

  • blog entries
  • news entries
  • twitter feeds

For this report, we will process the English data, and reference the German, Finnish, and Russian sets to possibly match foreign-language characters and/or words embedded in the English data.

Download the data and unzip on Windows 10 PC

Read the data into R from the connection (simple text files), then:

  • Calculate the size of the English dataset (in megabytes) and display it
  • Calculate the line count of the English dataset and display it
  • Calculate the word count of the English dataset and display it
  • Summarize the English dataset and display two sample entries from each file

Basic Plots

Because of the amount of data that needs processing, and because this is an exploratory data analysis , we will extract the seventy-five most frequently used words from within each data file, and then proceed with some basic plotting.

Reduced Sample Size - training sets

We will use 1/100th of each data file for our reduced sample size, and create the necessary data subsets from them.

Combine the Subsets - final training set

We will now create a corpus from the data subsets. The tm library will assist us in this task. The process involves removing all non-ASCII character data, punctuation marks, excess white space, and numeric data, converting the remaining alphabetic characters to lower case, and generating the entire corpus as plain text. A brief summary of the corpus is provided.
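The report performs this cleanup in R with the tm library; a rough Python equivalent of the same steps, for illustration:

```python
import re

def clean(text):
    """Apply the cleanup steps described above to one document."""
    text = text.encode("ascii", "ignore").decode()   # drop non-ASCII characters
    text = re.sub(r"[0-9]+", "", text)               # drop numeric data
    text = re.sub(r"[^\w\s]", "", text)              # drop punctuation marks
    text = re.sub(r"\s+", " ", text).strip()         # collapse excess white space
    return text.lower()                              # lower-case the rest

sample = "Héllo, World!!  It is   2021..."
```

Note that ASCII-stripping also removes accented letters entirely (clean(sample) yields "hllo world it is"), which may or may not be desirable.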

Display as a HISTOGRAM

We will now use our cleaned data subset to generate a histogram of the thirty most frequently used words in the corpus. The libraries slam and ggplot2 will help with this task.

Display as a PIE CHART

We will now use our cleaned data subset to generate a pie chart of the five most frequently used words in the corpus. The library plotrix will help with this task.

Interesting Findings

With the exploratory data analysis done on the English data set to this point, the findings regarding the top 30 most frequently occurring words, including the five most frequent, are not that surprising: the bulk of them are articles and pronouns. Further analysis using bigrams and trigrams would give a better picture of the most frequently used phrases. This type of finding could then be used to predict trends in the data and to create a predictive model of English text.

Basic Project Plan

The basic plan is to use the initial data analysis presented herein to further progress with the prediction algorithm necessary for the Shiny application - a predictive model of English text. One way of doing this might be to investigate what is possible using Markov Chains. Further analysis will be done using NGram modeling, to predict next-word selection with accuracy. All will be incorporated into a user-friendly Shiny front end that will allow the user to interact with the data and make logical next-word selections.

CodeAvail

21 Interesting Data Science Capstone Project Ideas [2024]


Data science, encompassing the analysis and interpretation of data, stands as a cornerstone of modern innovation. 

Capstone projects in data science education play a pivotal role, offering students hands-on experience to apply theoretical concepts in practical settings. 

These projects serve as a culmination of their learning journey, providing invaluable opportunities for skill development and problem-solving. 

Our blog is dedicated to guiding prospective students through the selection process of data science capstone project ideas. It offers curated ideas and insights to help them embark on a fulfilling educational experience. 

Join us as we navigate the dynamic world of data science, empowering students to thrive in this exciting field.

Data Science Capstone Project: A Comprehensive Overview


Data science capstone projects are an essential component of data science education, providing students with the opportunity to apply their knowledge and skills to real-world problems. 

Capstone projects challenge students to acquire and analyze data to solve real-world problems. These projects are designed to test students’ skills in data visualization, probability, inference and modeling, data wrangling, data organization, regression, and machine learning. 

In addition, capstone projects are conducted with industry, government, and academic partners, and most projects are sponsored by an organization. 

The projects are drawn from real-world problems, and students work in teams consisting of two to four students and a faculty advisor. 

Ultimately, the goal of the capstone project is to create a usable, public data product that can be used to show students' skills to potential employers. 

Best Data Science Capstone Project Ideas – According to Skill Level

Data science capstone projects are a great way to showcase your skills and apply what you’ve learned in a real-world context. Here are some project ideas categorized by skill level:


Beginner-Level Data Science Capstone Project Ideas


1. Exploratory Data Analysis (EDA) on a Dataset

Start by analyzing a dataset of your choice and exploring its characteristics, trends, and relationships. Practice using basic statistical techniques and visualization tools to gain insights and present your findings clearly and understandably.
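
As a minimal illustration (pure Python, with a made-up `ages` column), a first EDA pass often starts with summary statistics like these:

```python
import statistics

def summarize(values):
    """Basic EDA summary: size, center, spread, and range of a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

ages = [23, 25, 31, 35, 35, 40, 52]  # hypothetical column from a dataset
summary = summarize(ages)
```

In practice you would compute these per column with pandas and pair them with histograms and scatter plots.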

2. Predictive Modeling with Linear Regression

Build a simple linear regression model to predict a target variable based on one or more input features. Learn about model evaluation techniques such as mean squared error and R-squared, and interpret the results to make meaningful predictions.
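
The closed-form least-squares fit and both evaluation metrics can be sketched in a few lines of pure Python; the data below is a made-up, perfectly linear example:

```python
def fit_simple_linear(xs, ys):
    """Least-squares slope and intercept for y = a + b*x (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def evaluate(xs, ys, a, b):
    """Mean squared error and R-squared for the fitted line."""
    preds = [a + b * x for x in xs]
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return ss_res / len(ys), 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # noiseless toy data: y = 2x
a, b = fit_simple_linear(xs, ys)
mse, r2 = evaluate(xs, ys, a, b)
```

On this noiseless data the fit recovers y = 2x exactly, so MSE is 0 and R-squared is 1; real data will land somewhere in between.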

3. Classification with Decision Trees

Use decision tree algorithms to classify data into distinct categories. Learn how to preprocess data, train a decision tree model, and evaluate its performance using metrics like accuracy, precision, and recall. Apply your model to practical scenarios like predicting customer churn or classifying spam emails.
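
Once a classifier (a decision tree or otherwise) has produced predictions, all three metrics follow directly from the label pairs; the churn labels below are hypothetical:

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall from paired true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# 1 = churned, 0 = retained (invented example labels)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, prec, rec = confusion_metrics(y_true, y_pred)
```

In a real project you would get the same numbers from `sklearn.metrics` after training the tree.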

4. Clustering with K-Means

Explore unsupervised learning by applying the K-Means algorithm to group similar data points together. Practice feature scaling and model evaluation to identify meaningful clusters within your dataset. Apply your clustering model to segment customers or analyze patterns in market data.
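
A bare-bones K-Means loop (pure Python, toy 2-D points, fixed seed; a real project would use scikit-learn) alternates between assigning points to the nearest centroid and re-averaging each cluster:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: assign to nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:                             # keep old centroid if cluster empties
                centroids[i] = (sum(x for x, _ in cl) / len(cl),
                                sum(y for _, y in cl) / len(cl))
    return centroids, clusters

# two well-separated invented groups of customers
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 8), (9, 9)]
centroids, clusters = kmeans(points, k=2)
```

Feature scaling matters in practice: without it, a feature with a large range dominates the distance calculation.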

5. Sentiment Analysis on Text Data

Dive into natural language processing (NLP) by analyzing text data to determine sentiment polarity (positive, negative, or neutral). 

Learn about tokenization, text preprocessing, and sentiment analysis techniques using libraries like NLTK or spaCy. Apply your skills to analyze product reviews or social media comments.
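
Before reaching for NLTK or spaCy, the core idea can be sketched with a tiny hand-made lexicon (the word lists below are illustrative, not a real sentiment lexicon):

```python
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(text):
    """Tiny lexicon-based polarity: positive, negative, or neutral."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]  # crude tokenization
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Real lexicons contain thousands of scored words, and library tokenizers handle punctuation, contractions, and negation far better than `str.split`.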

6. Time Series Forecasting

Predict future trends or values based on historical time series data. Learn about time series decomposition, trend analysis, and seasonal patterns using methods like ARIMA or exponential smoothing. Apply your forecasting skills to predict stock prices, weather patterns, or sales trends.
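
Simple exponential smoothing, the most basic of these methods, fits in a few lines; `sales` is a made-up series, and the last smoothed level serves as the one-step-ahead forecast:

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing; the final level is the one-step forecast."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

sales = [100, 110, 105, 115, 120]  # invented monthly sales figures
forecast = exponential_smoothing(sales, alpha=0.5)
```

The smoothing factor `alpha` trades responsiveness against noise: values near 1 track the latest observation, values near 0 average over long history. ARIMA adds trend and autocorrelation structure on top of this idea.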

7. Image Classification with Convolutional Neural Networks (CNNs)

Explore deep learning concepts by building a basic CNN model to classify images into different categories. 

Learn about convolutional layers, pooling, and fully connected layers, and experiment with different architectures to improve model performance. Apply your CNN model to tasks like recognizing handwritten digits or classifying images of animals.
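
The convolution operation at the heart of a CNN can be illustrated without any deep learning framework; this sketch slides a hand-picked vertical-edge kernel over a toy image (like most frameworks, it actually computes cross-correlation):

```python
def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation over nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# toy image with a dark-to-bright vertical edge in the middle
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]       # responds strongly where intensity jumps left-to-right
feature_map = conv2d(image, kernel)
```

In a trained CNN the kernel weights are learned rather than hand-picked, and pooling layers then downsample the resulting feature maps.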

Intermediate-Level Data Science Capstone Project Ideas


8. Customer Segmentation and Market Basket Analysis

Utilize advanced clustering techniques to segment customers based on their purchasing behavior. Conduct market basket analysis to identify frequent item associations and recommend personalized product suggestions. 

Implement techniques like the Apriori algorithm or association rule mining to uncover valuable insights for targeted marketing strategies.
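
The counting that underlies Apriori-style association rules can be sketched directly; the `baskets` data is invented, and a full implementation would also prune itemsets below a minimum support:

```python
from collections import Counter
from itertools import combinations

def pair_stats(baskets):
    """Support counts for single items and item pairs across baskets."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))            # dedupe, canonical pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts

def confidence(item_counts, pair_counts, a, b):
    """Estimated P(b in basket | a in basket)."""
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / item_counts[a]

baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
items, pairs = pair_stats(baskets)
```

Here confidence(bread → milk) is 2/3: milk appears in two of the three baskets that contain bread, which is exactly the kind of rule a recommender would surface.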

9. Time Series Anomaly Detection

Apply anomaly detection algorithms to identify unusual patterns or outliers in time series data. Utilize techniques such as moving average, Z-score, or autoencoders to detect anomalies in various domains, including finance, IoT sensors, or network traffic. 

Develop robust anomaly detection models to enhance data security and predictive maintenance.
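
A Z-score detector is the simplest of these techniques; this sketch flags readings far from the mean (the sensor values are made up, and the 2.5 threshold is a common but arbitrary choice):

```python
import statistics

def zscore_anomalies(series, threshold=2.5):
    """Indices of points whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(series)
    stdev = statistics.stdev(series)
    return [i for i, x in enumerate(series)
            if abs((x - mean) / stdev) > threshold]

readings = [10, 11, 10, 12, 11, 10, 50, 11, 10]  # invented sensor stream
anomalies = zscore_anomalies(readings)
```

For streaming data the mean and deviation would be computed over a rolling window so the detector adapts as the baseline drifts.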

10. Recommendation System Development

Build a recommendation engine to suggest personalized items or content to users based on their preferences and behavior. Implement collaborative filtering, content-based filtering, or hybrid recommendation approaches to improve user engagement and satisfaction. 

Evaluate the performance of your recommendation system using metrics like precision, recall, and mean average precision.
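
A minimal user-based collaborative filter (toy ratings, cosine similarity over co-rated items) can be sketched as follows; production systems would use matrix factorization or a recommender library instead:

```python
import math

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = math.sqrt(sum(u[i] ** 2 for i in shared))
    nv = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (nu * nv)

def recommend(ratings, user):
    """Suggest the unseen item best liked by the most similar other user."""
    others = [u for u in ratings if u != user]
    best = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    unseen = {i: r for i, r in ratings[best].items() if i not in ratings[user]}
    return max(unseen, key=unseen.get) if unseen else None

ratings = {                      # hypothetical user -> {movie: rating}
    "ann": {"matrix": 5, "dune": 4},
    "bob": {"matrix": 5, "dune": 4, "alien": 5},
    "cal": {"titanic": 5, "notebook": 4},
}
suggestion = recommend(ratings, "ann")
```

Because "bob" rates the co-rated items exactly like "ann", his unseen favorite is recommended; "cal" shares no items and contributes nothing.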

11. Natural Language Processing for Topic Modeling

Dive deeper into NLP by exploring topic modeling techniques to extract meaningful topics from text data. 

Implement algorithms like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) to identify hidden themes or subjects within large text corpora. Apply topic modeling to analyze customer feedback, news articles, or academic papers.

12. Fraud Detection in Financial Transactions

Develop a fraud detection system using machine learning algorithms to identify suspicious activities in financial transactions. Utilize supervised learning techniques such as logistic regression, random forests, or gradient boosting to classify transactions as fraudulent or legitimate. 

Employ feature engineering and model evaluation to improve fraud detection accuracy and minimize false positives.

13. Predictive Maintenance for Industrial Equipment

Implement predictive maintenance techniques to anticipate equipment failures and prevent costly downtime. 

Analyze sensor data from machinery using machine learning algorithms like support vector machines or recurrent neural networks to predict when maintenance is required. Optimize maintenance schedules to minimize downtime and maximize operational efficiency.

14. Healthcare Data Analysis and Disease Prediction

Utilize healthcare datasets to analyze patient demographics, medical history, and diagnostic tests to predict the likelihood of disease occurrence or progression. 

Apply machine learning algorithms such as logistic regression, decision trees, or support vector machines to develop predictive models for diseases like diabetes, cancer, or heart disease. Evaluate model performance using metrics like sensitivity, specificity, and area under the ROC curve.

Advanced Level Data Science Capstone Project Ideas


15. Deep Learning for Image Generation

Explore generative adversarial networks (GANs) or variational autoencoders (VAEs) to generate realistic images from scratch. Experiment with architectures like DCGAN or StyleGAN to create high-resolution images of faces, landscapes, or artwork. 

Evaluate image quality and diversity using perceptual metrics and human judgment.

16. Reinforcement Learning for Game Playing

Implement reinforcement learning algorithms like deep Q-learning or policy gradients to train agents to play complex games like Atari or board games. 

Experiment with exploration-exploitation strategies and reward-shaping techniques to improve agent performance and achieve superhuman levels of gameplay.
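
Tabular Q-learning, the simplest member of this family, can be demonstrated on a toy chain environment (all parameters here are illustrative; game playing would swap in a real environment and a neural network in place of the table):

```python
import random

def train_q(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=1):
    """Tabular Q-learning on a chain: actions 0 (left) / 1 (right), reward 1 at the right end."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]   # q[state][action]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy exploration
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            reward = 1.0 if s2 == n_states - 1 else 0.0
            # standard Q-learning update toward the bootstrapped target
            q[s][a] += alpha * (reward + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train_q()
greedy = [0 if q[s][0] > q[s][1] else 1 for s in range(4)]
```

After training, the greedy policy moves right in every state, which is the shortest path to the reward; the epsilon parameter is the exploration-exploitation dial mentioned above.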

17. Anomaly Detection in Streaming Data

Develop real-time anomaly detection systems to identify abnormal behavior in streaming data sources such as network traffic, sensor readings, or financial transactions. 

Utilize online learning algorithms like streaming k-means or Isolation Forest to detect anomalies and trigger timely alerts for intervention.

18. Multi-Modal Sentiment Analysis

Extend sentiment analysis to incorporate multiple modalities such as text, images, and audio to capture rich emotional expressions. 

Utilize deep learning architectures like multimodal transformers or fusion models to analyze sentiment across different modalities and improve understanding of complex human emotions.

19. Graph Neural Networks for Social Network Analysis

Apply graph neural networks (GNNs) to model and analyze complex relational data in social networks. Use techniques like graph convolutional networks (GCNs) or graph attention networks (GATs) to learn node embeddings and predict node properties such as community detection or influential users.

20. Time Series Forecasting with Deep Learning

Explore advanced deep learning architectures like long short-term memory (LSTM) networks or transformer-based models for time series forecasting. 

Utilize attention mechanisms and multi-horizon forecasting to capture long-term dependencies and improve prediction accuracy in dynamic and volatile environments.

21. Adversarial Robustness in Machine Learning

Investigate techniques to improve the robustness of machine learning models against adversarial attacks. 

Explore methods like adversarial training, defensive distillation, or certified robustness to mitigate vulnerabilities and ensure model reliability under adversarial perturbations, particularly in critical applications like autonomous vehicles or healthcare.

These project ideas cater to various skill levels in data science, ranging from beginners to experts. Choose a project that aligns with your interests and skill level, and don’t hesitate to experiment and learn along the way!

Factors to Consider When Choosing a Data Science Capstone Project

Choosing the right data science capstone project is crucial for your learning experience and effectively showcasing your skills. Here are some factors to consider when selecting a data science capstone project:

Personal Interest

Select a project that aligns with your passions and career goals to stay motivated and engaged throughout the process.

Data Availability

Ensure access to relevant and sufficient data to complete the project and draw meaningful insights effectively.

Complexity Level

Consider your current skill level and choose a project that challenges you without overwhelming you, allowing for growth and learning.

Real-World Impact

Aim for projects with practical applications or societal relevance to showcase your ability to solve tangible problems.

Resource Requirements

Evaluate the availability of resources such as time, computing power, and software tools needed to execute the project successfully.

Mentorship and Support

Seek projects with opportunities for guidance and feedback from mentors or peers to enhance your learning experience.

Novelty and Innovation

Explore projects that push boundaries and explore new techniques or approaches to demonstrate creativity and originality in your work.

Tips for Successfully Completing a Data Science Capstone Project

Successfully completing a data science capstone project requires careful planning, effective execution, and strong communication skills. Here are some tips to help you navigate through the process:

  • Plan and Prioritize: Break down the project into manageable tasks and create a timeline to stay organized and focused.
  • Understand the Problem: Clearly define the project objectives, requirements, and expected outcomes before analyzing.
  • Explore and Experiment: Experiment with different methodologies, algorithms, and techniques to find the most suitable approach.
  • Document and Iterate: Document your process, results, and insights thoroughly, and iterate on your analyses based on feedback and new findings.
  • Collaborate and Seek Feedback: Collaborate with peers, mentors, and stakeholders, actively seeking feedback to improve your work and decision-making.
  • Practice Communication: Communicate your findings effectively through clear visualizations, reports, and presentations tailored to your audience’s understanding.
  • Reflect and Learn: Reflect on your challenges, successes, and lessons learned throughout the project to inform your future endeavors and continuous improvement.

By following these tips, you can successfully navigate the data science capstone project and demonstrate your skills and expertise in the field.

Wrapping Up

In wrapping up, data science capstone project ideas are invaluable in bridging the gap between theory and practice, offering students a chance to apply their knowledge in real-world scenarios.

They are a cornerstone of data science education, fostering critical thinking, problem-solving, and practical skills development. 

As you embark on your journey, don’t hesitate to explore diverse and challenging project ideas. Embrace the opportunity to push boundaries, innovate, and make meaningful contributions to the field. 

Share your insights, challenges, and successes with others, and invite fellow enthusiasts to exchange ideas and experiences. 

FAQs

1. What is the purpose of a data science capstone project?

A data science capstone project serves as a culmination of a student’s learning experience, allowing them to apply their knowledge and skills to solve real-world problems in the field of data science. It provides hands-on experience and showcases their ability to analyze data, derive insights, and communicate findings effectively.

2. What are some examples of data science capstone projects?

Data science capstone projects can cover a wide range of topics and domains, including predictive modeling, natural language processing, image classification, recommendation systems, and more. Examples may include analyzing customer behavior, predicting stock prices, sentiment analysis on social media data, or detecting anomalies in financial transactions.

3. How long does it typically take to complete a data science capstone project?

The duration of a data science capstone project can vary depending on factors such as project complexity, available resources, and individual pace. Generally, it may take several weeks to several months to complete a project, including tasks such as data collection, preprocessing, analysis, modeling, and presentation of findings.


Data Science: Capstone

To become an expert you need practice and experience.

Show what you’ve learned from the Professional Certificate Program in Data Science.


What You'll Learn

To become an expert data scientist you need practice and experience. By completing this capstone project you will get an opportunity to apply the knowledge and skills in R data analysis that you have gained throughout the series. This final project will test your skills in data visualization, probability, inference and modeling, data wrangling, data organization, regression, and machine learning.

Unlike the rest of our Professional Certificate Program in Data Science , in this course, you will receive much less guidance from the instructors. When you complete the project you will have a data product to show off to potential employers or educational programs, a strong indicator of your expertise in the field of data science.

The course will be delivered via edX and will connect learners around the world. By the end of the course, participants will understand the following concepts:

  • How to apply the knowledge base and skills learned throughout the series to a real-world problem
  • How to independently work on a data analysis project

Your Instructors

Rafael Irizarry

Professor of Biostatistics at Harvard University.

Ways to take this course

When you enroll in this course, you will have the option of pursuing a Verified Certificate or Auditing the Course.

A Verified Certificate costs $149 and provides unlimited access to full course materials, activities, tests, and forums. At the end of the course, learners who earn a passing grade can receive a certificate. 

Alternatively, learners can Audit the course for free and have access to select course material, activities, tests, and forums.  Please note that this track does not offer a certificate for learners who earn a passing grade.


Capstone Projects

The culminating experience in the Master's in Applied Data Science program is a Capstone Project where you'll put your knowledge and skills into practice. You will immerse yourself in a real business problem and will gain valuable, data-driven insights using authentic data. Together with project sponsors, you will develop a data science solution to address organizational problems, enhance analytics capabilities, and expand talent pools and employment opportunities. Leveraging the university's rich research portfolio, you also have the option to join a research-focused team.

Selected Capstone Projects

  • COPD Readmission and Cost Reduction Assessment
  • An NFL Ticket Pricing Study: Optimizing Revenue Using Variable and Dynamic Pricing Methods
  • Using Image Recognition to Identify Yoga Poses
  • Using Image Recognition to Measure the Speed of a Pitch
  • Real-Time Credit Card Fraud Detection

Interested in Becoming a Capstone Sponsor?

The Master’s in Applied Data Science program accepts projects year-round for placement at the beginning of every quarter, with the Spring quarter being the largest cohort. All projects must be submitted no later than one month prior to the beginning of the preferred starting quarter based on the UChicago academic calendar .

Capstone Sponsor Incentives

Sponsors derive measurable benefits from this unique opportunity to support higher education. Partner organizations propose real-world problems, untested ideas, or research queries. Students review them from the perspective of data scientists trained to generate actionable insights that provide long-term value. Through the project, Capstone partners gain access to a pool of world-class students, highly accomplished instructors, and widely cited researchers, resulting in optimized use of modern data science methods on your data. Further, for many sponsors, the project becomes a meaningful source of recruitment through the excellent pool of students who work on your project.

Capstone Sponsor Obligations

While there is no monetary cost or contract necessary to sponsor a project, we do consider this a partnership. Teams composed of four students and guided by an instructor and subject matter expert are provided with expectations from the capstone sponsor and with learning objectives, assignments, and evaluation requirements from instructors. In turn, Capstone partners should be prepared to provide the following:

  • A detailed problem statement with a description of the data and expected results
  • Two or more points of contact
  • Access to data relevant to the project by the first week of the applicable quarter
  • Engagement through regular meetings (typically bi-weekly) while classes are in session
  • If requested, a non-disclosure agreement that may be completed by the student team

Interested in Becoming a Capstone or Industry Research Partner?

Get in touch with us to submit your idea for a collaboration or ask us questions about how the partnership process works.


Data Science Capstone Experience

Capstone Experience – 1 Course/Final Project

The Capstone Experience in Data Science (EN.553.806) is a research-oriented project which must be approved by the research supervisor, academic advisor and the Internal Oversight Committee.  The Capstone Experience can be taken in multiple semesters, but the total number of credits required for successful completion is six (6).  Students must complete a Data Science Capstone Experience Proposal form and follow instructions below to submit for approval before enrollment in EN.553.806 will be approved by academic staff.

All students are REQUIRED to present their research findings in poster format at the event held in their final semester. A list of upcoming dates is provided below. Students must also submit a final report to their capstone supervisor. The grade for this course is based, in large part, upon the poster event and your final report. For more information on the poster and the report, please read below.

Fall 2023: F 12/8, T 11/28
Spring 2024: F 4/26, T 4/16/24
Fall 2024: F 12/6, T 11/26/24
Spring 2025: F 4/28, T 4/15/25
Fall 2025: F 12/5, T 11/25/25
Spring 2026: F 4/27, T 4/14/26

Finding a Research Supervisor

The student must identify and contact a research supervisor who will agree to supervise the capstone experience. The research supervisor must be a JHU faculty member.

The following list includes JHU faculty members who are willing to be contacted by DS students to supervise their capstone project. Click on a name to go to that faculty member's webpage, where you can learn more about their research interests and see if they align with yours. This list is not exhaustive, and students should feel free to contact other JHU faculty with whom they would be interested in working.

  • Raman Arora  (Computer Science)
  • Amitabh Basu  (Applied Mathematics and Statistics)
  • Alexis Battle  (Biomedical Engineering)
  • Vladimir Braverman (Computer Science)
  • Tamás Budavári  (Applied Mathematics and Statistics)
  • Brian Caffo (Biostatistics, School of Public Health)
  • Adam Charles  (Biomedical Engineering)
  • Jason Eisner  (Computer Science)
  • David Elbert (HEMI – Hopkins Extreme Materials Institute)
  • Jean Fan  (Biomedical Engineering)
  • Mahyar Fazlyab  (Electrical and Computer Engineering)
  • Elana Fertig  (Oncology Center, Biostatistics/Bioinformatics, School of Medicine)
  • Helyette Geman  (Applied Mathematics and Statistics)
  • Edinah Gnang  (Applied Mathematics and Statistics)
  • Jeffrey J. Gray (Chemical and Biomolecular Engineering)
  • Matthew Ippolito, MD (School of Medicine, Molecular Microbiology and Immunology)
  • Rachel Karchin (Biomedical Engineering)
  • Michael Kazhdan  (Computer Science)
  • Yannis Kevrekidis  (Chemical and Biomolecular Engineering / Applied Mathematics and Statistics)
  • Sergey Kushnarev  (Applied Mathematics and Statistics)
  • Nicolas Loizou (Applied Mathematics and Statistics)
  • Mauro Maggioni  (Mathematics / Applied Mathematics and Statistics)
  • Enrique Mallada  (Electrical and Computer Engineering)
  • Mario Micheli  (Applied Mathematics and Statistics)
  • Daniel Q. Naiman  (Applied Mathematics and Statistics)
  • Vishal M. Patel  (Electrical and Computer Engineering)
  • Carey Priebe  (Applied Mathematics and Statistics) (currently on Sabbatical until Jan 2024)
  • Fadil Santosa  (Applied Mathematics and Statistics)
  • Ilya Shpitser  (Computer Science)
  • James C. Spall  (Applied Physics Laboratory / Applied Mathematics and Statistics)
  • Jeremias Sulam  (Biomedical Engineering)
  • Trac D. Tran  (Electrical and Computer Engineering)
  • Soledad Villar  (Applied Mathematics and Statistics)
  • Yanxun Xu  (Applied Mathematics and Statistics)
  • Laurent Younes  (Applied Mathematics and Statistics)

Writing the Proposal

The student will download and complete a Proposal Request for the Capstone Experience in Data Science describing the project goals.

The proposal should be written as follows:

  • Title of proposed project.
  • Project description, with enough details for evaluation (e.g., 200 words).
  • Completion timeline. Be sure to consider adequate time for review and edits by your research supervisor before the end of the semester.
  • Name and signature of capstone (faculty) supervisor.

TEAM PROJECTS:  Teams (of no more than 2 students) are acceptable to fulfill the capstone experience, with the following requirements:

  • The team submits a single proposal that, in addition to the aforementioned requirements, describes the composition of the group, and indicates how the work is divided among the members of the group. The project must be divided into subtasks and the proposal must indicate which group member will be in charge of each subtask, ensuring that the amount of work expected by each member aligns with the number of credits associated with the capstone (3 credits = 100 hours).
  • Teams should not include more than two students.
  • The poster may be presented individually or as a team.
  • The option to complete a team project is solely at the discretion of the research supervisor.

Signature Required for Capstone Proposal

The student should email the proposal to the project supervisor, and the supervisor should send the signed form back to the student.

ANY changes to an approved proposal (including title, team members, and nature of research) require the student(s) to resubmit a proposal for review.  

Submitting your Capstone Proposal for Review

  • Once the proposal has been reviewed and signed by the research supervisor, the student will upload the form here. Academic staff will receive notification of the file upload and begin the committee review.
  • If the deadline has passed for registration, the student must follow the instructions for a late add as indicated on the Registrar’s website .

Capstone Poster Presentation

  • The student should be prepared to present their research findings in the form of a poster presentation, which will be scheduled during one of the final dates of the Department Seminar (EN.553.801).
  • All poster presentations must be IN PERSON .
  • Presentation Date: Poster presentations take place on Tuesday (at the same time as the Department Seminar), typically 1 week prior to the last day of classes. For example, in the Fall semester, the presentations are held the last Tuesday of November. In the Spring semester, presentations take place on the last Tuesday of April. Below are upcoming dates for poster presentations:
Fall 2023: F 12/8, T 11/28
Spring 2024: F 4/26, T 4/16/24
Fall 2024: F 12/6, T 11/26/24
Spring 2025: F 4/28, T 4/15/25
Fall 2025: F 12/5, T 11/25/25
Spring 2026: F 4/27, T 4/14/26

Poster Specs: The poster should be 24” x 36”. There is no specific template. You are encouraged to look at other professional posters of this kind for design ideas. You can also do a quick internet search for NeurIPS or ICML posters to see examples.

Printing and Cost: The department recommends FedEx to print posters. If you use the FedEx on Charles Street, the cost will be covered by the department. Additional details on contacting FedEx will be provided to you via email.

Final Report

  • Each student must write a paper or research report that must be approved by the research supervisor and advisor. The final paper should be 6-12 pages in LaTeX full-page format (1-inch margins, Times, 12 pt) or an MS Word equivalent. Members of team projects must submit individual reports.
  • Submit your final report to your capstone supervisor no later than 1 week before the last day of classes so that your supervisor has time to review the paper and provide feedback to you. Once any corrections or updates have been made, you will submit your report again to your capstone supervisor for grading. The program coordinator (Lisa Wetzelberger) will contact your supervisor regarding your final grade and will upload those grades to SIS.

Completion and Grading

At the completion of the project, capstone supervisors will be contacted by the program coordinator (Lisa Wetzelberger) to provide a PASS/FAIL grade which will then be uploaded to SIS.

Previous Capstone Titles

  • Unix Cluster-Based Training of Large Language Models for User Query Processing
  • Smartwatch Insights: Predictive Health Analytics with Machine Learning
  • Enhancing Sepsis Management: A Machine Learning Approach to Predicting 30-Day Mortality in Sepsis Patients Using the MIMIC-III Database
  • Sports Games Data Extraction with Multiple Object Tracking
  • A model-free controller for supply chain management
  • Analysis of default and prepayment trend of Fannie Mae single family loan
  • Analysis of rare sinonasal tumors
  • Applying causal inference to determine the effect of gender of PI on the safety of clinical trials
  • Bias in college admissions
  • Changes in US methane emissions during the COVID-19 pandemic
  • Data analysis and prediction of successful startups
  • Development of a machine learning model to predict spatially localized gene expression from histology images
  • Online lidar camera calibration using deep learning
  • Self-identification in speech transcription using natural language processing models
  • Stock price prediction using deep learning with feature extension
  • US food desert analysis



© 2024 By the Rector and Visitors of the University of Virginia

University of Virginia
Jun 29, 2024
Undergraduate Record 2024-2025

The Bachelor of Arts (B.A.) in Applied Statistics provides students with a solid grounding in the field of statistics, with particular attention paid to applications. Knowledge of statistics is becoming increasingly important in many fields, so students completing this major will have many options available upon graduation.

Students completing this major will be well prepared to design experimental studies, analyze data, and communicate results in a wide range of subject areas. They will also be well prepared to enter M.S. programs in statistics and related fields. With a modest amount of advance planning, students are able to complete an M.S. in Statistics at UVA with one additional year of study. Students interested in the B.A./M.S. program should visit the Department of Statistics website.

Students who declare the B.A. in Applied Statistics have the option of choosing one of three concentrations within the major. These concentrations are Finance and Business, Biostatistics, and Data Science. The details of these concentrations are given below. The prerequisites to declare any of the concentrations are those listed below.

Universal Curriculum Requirements

To be awarded a degree from the College of Arts and Sciences, students are required to complete universal curriculum requirements in addition to the program requirements provided below. The school universal curriculum requirements can be found on the school's Degree Programs page.

Program Requirements

The BA in Applied Statistics requires five core courses and five restricted elective courses. In total, the BA in Applied Statistics requires 30 credit hours, plus prerequisite courses. There are two lists of restricted elective courses: those that focus on data analysis and those that are more computational. Of the five restricted elective courses, at least three must be taken from the Data Analysis list. A grade of C- or higher is required for all prerequisite and major courses.

Prerequisites: 15-17 credit hours

Students must have completed all prerequisite courses to declare the major. Students may use AP credit to satisfy the prerequisites.

Calculus II: Fulfilled by one of the following courses:

  • MATH 1220 - A Survey of Calculus II Credits: 3
  • MATH 1320 - Calculus II Credits: 4
  • APMA 1110 - Single Variable Calculus II Credits: 4

Introductory Statistics: Fulfilled by one of the following courses:

  • STAT 1100 - Chance: An Introduction to Statistics Credits: 3
  • STAT 1120 - Introduction to Statistics Credits: 3
  • STAT 2020 - Statistics for Biologists Credits: 4
  • STAT 2120 - Introduction to Statistical Analysis Credits: 4
  • APMA 3110 - Applied Statistics and Probability Credits: 3
  • APMA 3120 - Statistics Credits: 3

Introductory Programming: Fulfilled by one of the following courses:

  • STAT 1601 - Introduction to Data Science with R Credits: 3
  • STAT 1602 - Introduction to Data Science with Python Credits: 3
  • CS 1110 - Introduction to Programming Credits: 3
  • CS 1111 - Introduction to Programming Credits: 3
  • CS 1112 - Introduction to Programming Credits: 3
  • CS 1113 - Introduction to Programming Credits: 3

EACH OF THE FOLLOWING: 

  • STAT 3110 - Foundations of Statistics Credits: 3 (MATH 3100 can also be used when combined with MATH 3350 or MATH 3351)
  • STAT 3220 - Introduction to Regression Analysis Credits: 3

Core Courses: 18 credit hours

ONE OF THE FOLLOWING:

  • STAT 3110 - Foundations of Statistics Credits: 3
  • MATH 3100 - Introduction to Probability Credits: 3 AND one of:
      • MATH 3351 - Elementary Linear Algebra Credits: 3
      • MATH 3350 - Applied Linear Algebra Credits: 3

ONE OF THE FOLLOWING:

  • STAT 3130 - Design and Analysis of Sample Surveys Credits: 3
  • STAT 4160 - Experimental Design Credits: 3

EACH OF THE FOLLOWING: 

  • STAT 3080 - From Data to Knowledge Credits: 3
  • STAT 3120 - Introduction to Mathematical Statistics Credits: 3
  • STAT 4996 - Capstone Credits: 3

Restricted Electives:

The Data Analysis restricted electives and the Computational restricted electives are listed below. Students must take four restricted electives, with at least two from the Data Analysis list. At most one of the four restricted electives may be drawn from a non-STAT mnemonic. (This limit on non-STAT courses also applies to the concentrations listed below.)

Data Analysis Restricted Electives: Minimum of 6 credit hours

The purpose of the additional required coursework in data analysis is to further prepare students to apply data science tools to generate insights from data and identify and predict trends. This coursework will allow students to gain additional breadth and depth in data analytics software and applications.

  • STAT 3480 - Nonparametric and Rank-Based Statistics Credits: 3
  • STAT 4120 - Applied Linear Models Credits: 3
  • STAT 4130 - Applied Multivariate Statistics Credits: 3
  • STAT 4170 - Financial Time Series and Forecasting Credits: 3
  • STAT 4220 - Applied Analytics for Business Credits: 3
  • STAT 4630 - Statistical Machine Learning Credits: 3
  • STAT 4800 - Advanced Sports Analytics I Credits: 3
  • STAT 5140 - Survival Analysis and Reliability Theory Credits: 3
  • STAT 5170 - Applied Time Series Credits: 3
  • STAT 5310 - Clinical Trials Methodology Credits: 3
  • STAT 5330 - Data Mining Credits: 3
  • STAT 5390 - Exploratory Data Analysis Credits: 3
  • STAT 5630 - Statistical Machine Learning Credits: 3
  • ECON 3720 - Introduction to Econometrics Credits: 4
  • ECON 4720 - Econometric Methods Credits: 3
  • SOC 5110 - Survey Research Methods Credits: 3
  • SYS 4021 - Linear Statistical Models Credits: 4

Computational Restricted Electives:

The purpose of the additional required coursework in computation is to further prepare students to apply computational tools and methods for statistical modeling and analysis. This coursework will allow students to gain additional breadth and depth in modern computing software and applications.

  • STAT 3250 - Data Analysis with Python Credits: 3
  • STAT 3280 - Data Visualization and Management Credits: 3
  • ASTR 4140 - Research Methods in Astrophysics Credits: 3
  • COMM 3220 - Data Management for Decision Making Credits: 3
  • CS 4444 - Introduction to Parallel Computing Credits: 3
  • CS 4740 - Cloud Computing Credits: 3
  • CS 4750 - Database Systems Credits: 3
  • PHYS 5630 - Computational Physics I Credits: 3

Course Duplication Limitations

  • Only one of STAT 4630 and STAT 5630 will satisfy the major requirements, as these are both versions of a machine learning course.
  • Only one of STAT 4260, ASTR 4140, COMM 3220, and CS 4750 will satisfy the major requirements, as these are all versions of a database course. 
  • Only one of STAT 4120, ECON 3720, ECON 4720, and SYS 4021 will satisfy the major requirements, as these are all versions of an advanced regression course. 

Applied Statistics Concentrations

Those declaring the B.A. in Applied Statistics have the option of choosing a major concentration. The concentrations are Finance and Business, Biostatistics, and Data Science. The requirements for these concentrations are given below. The prerequisites to declare any of the concentrations are the same as described earlier.

Biostatistics Concentration

Eight required common core courses and two restricted elective courses.

  • Two restricted elective courses, at least one from the Data Analysis list.

Data Science Concentration

Nine required common core courses and one restricted elective course.

ONE OF THE FOLLOWING:  

  • STAT 5630 - Statistical Machine Learning Credits: 3
  • One restricted elective course from the Data Analysis list.

Finance and Business Concentration

  • STAT 5170 - Applied Time Series Credits: 3

Description of Capstone

For the capstone, students will work in teams of 3 or 4 to complete an extensive data analysis project. The students and capstone faculty will work collaboratively to develop a hands-on project for each team to demonstrate knowledge and skill in data analysis, interpretation, and communication. Each project will require the team to determine the nature of the questions of interest; prepare data for analysis; select and perform the appropriate analysis; determine conclusions; and present the results. The capstone project will provide an opportunity to observe how students work through all aspects of a statistical analysis.

Students will be guided and evaluated by the capstone faculty. The capstone experience will culminate with the submission of a final report and a formal presentation. If a student fails the capstone course, the Director of Undergraduate Programs will meet with the student to determine a set of revisions and/or alternative academic activities to complete their project. A student who fails to complete their project may retake the course in a subsequent semester.

ICS Project Expo-nential Growth

On a warm Wednesday evening in May, 280 students from UC Irvine’s Donald Bren School of Information and Computer Sciences (ICS) gathered to present their work at the fourth annual ICS Project Expo. The event, held at the end of spring quarter, marks the culmination of more than 20 weeks’ worth of work on more than 75 projects across six capstone programs.

With 550 attendees — including local industry leaders, UCI faculty and alumni, and ICS students and their family and friends — this was by far the largest ICS Project Expo to date. The event’s continued growth exemplifies the high level of interest in leveraging corporate partnerships and alumni relations to provide students with practical, hands-on experience as part of their ICS education.

“We had a great turnout of project partners, alumni and also students not currently enrolled in a capstone course,” says Mimi Anderson, Associate Director of the ICS Capstone Projects program. The impressive turnout illustrates people’s genuine interest in the program and its emphasis on fostering productive partnerships. “Witnessing student passion and ingenuity transform into innovative projects is truly inspiring,” says Anderson. “ICS capstone projects thrive thanks to our industry partners, whose crucial mentorship and real-world challenges prepare our students to become future tech leaders and ensure long-term success beyond graduation.”

Real-World Collaboration

For more than 15 years, ICS has used capstone projects to ensure students have the opportunity to apply their classroom knowledge in a practical setting. ICS now offers undergraduate capstone courses in computer science, data science, game design and informatics, as well as for the ICS Honors Program. This year’s ICS Project Expo also featured capstone projects from the Master of Computer Science professional program.

“This year’s Expo was a leap in both quantity and quality of student projects,” says Informatics Professor Hadar Ziv, who has been teaching the Informatics capstone course since 2009. Both undergraduate and masters-level students took on demanding projects related to a variety of hot topics, including AI and machine learning, cloud computing, mobile apps, data science, and web development. “One student team used APIs and face-recognition techniques to allow renters to interact with ‘holograms’ of their property managers,” says Ziv, “while another team developed a fun interactive VR game for NASA that teaches young players about Psyche, a metal-rich asteroid between Mars and Jupiter.”

Ziv also noted “an increase in projects from mid- to large-size companies, who are major players — and employers — in the Southern California ecosystem.” For example, one group of computer science students worked with Partner Engineering & Science to deploy an AI-powered PDF parser that can sift through old PDF reports, meticulously extract crucial data components, and seamlessly import them into a contemporary report writer.

“This was our first time working with students, and personally I had a great experience,” says Kun Liu, a data scientist at Partner Engineering & Science, who led two of the company’s four projects. The PDF Parser project Kun oversaw won third place for the computer science capstone program. “I really enjoyed the Expo,” says Liu. “I also browsed other projects at the event and saw some truly inspiring ideas.”

Another group of students worked on a cybersecurity chatbot for Raytheon. “Working with the capstone program’s staff, faculty and especially students was a great experience for my organization and team,” says Jose Romero-Mariona, an ICS alumnus and technical fellow at Raytheon. “The students’ ability to implement cutting-edge technologies and pivot with the latest advancements was both impressive and useful for our organization.” The project took second place for the Informatics program.

Innovative Projects

The 79 capstone projects on display ranged from fraud detection apps and educational tools to novel models for healthcare analytics to action-packed video games. A group of 26 judges, composed of industry leaders, ICS alumni and faculty, scored the projects using the RocketJudge app. Once judging closed, ICS Dean Marios Papaefthymiou announced the winners for each of the following capstone programs:

  • Computer Science: AWS (Neeraja Beesetti, Jessica Bhalerao, David Horta, Ulises Maldonado and Xiling Tian)
  • Data Science: Response Prediction Model for Bridge Structures (Emily Truong, Louis Chu, Brandon Keung and Vicki Bui)
  • Game Design: The Ninth Circle (Ryan Wong, Henry Olmstead, Cameron Romeis, Christopher Pena, David Rizko, Hasan “Soni” Rakipi, Jacob Ho, Whittaker Worland, Leyna Ho and Pedro Longo)
  • Honors Capstone: The Impact of Virtual Social Interactions on Real-Life Trust and Perceived Character (Alaina Klaes)
  • Informatics: MNDYRR — Mentoring Nurturement for Dynamic Youth Resilience & Restoration (Matt Cho, Neal Lowry, Jaylen Luc, Shiyi Mu and Jibreel Rasheed)
  • Master of Computer Science: PrepWiser (Kriti Taparia, Nisargi Vipulbhai Shah and Bhavini Piyush Mamtora)

These first-place teams each received $2,000, while the second- and third-place teams received $1,200 and $650, respectively. The top honors student received $375, while the second- and third-place honors students received $250 and $125, respectively.

“The capstone program gave me an invaluable opportunity to work in an environment that closely mirrors industry conditions, but without the typical stresses,” says software engineering major Jibreel Rasheed. “Collaborating with a supportive project partner to create something beneficial to the world helped me realize my passion for design and leadership, and the creative freedoms I had throughout the project allowed me to expand my skills in a personally meaningful way.” His team’s project for MNDYRR resulted in Mendy, an empathetic conversational AI chatbot designed to address the youth mental health crisis.

Increased Engagement

While the judging element was first added to the Expo in 2023, new for this year was a partner appreciation dinner. “We wanted to host a dinner after the main event to celebrate our project partners and to provide an additional opportunity for networking among the partners,” says Anderson. “We’re constantly looking for new ways to increase engagement and build stronger relationships with industry leaders and local alumni.”

One such alumna is Pooja Lohia Pai, an independent business consultant and ICS Alumni Chapter board member who served as a returning judge this year. “It is an honor and privilege to judge the capstone projects. There were so many innovative projects and not enough time to see them all,” she says, adding that she’s pleased to see growing interest. “It is so exciting to see how much the program has grown, evolved and expanded in the past few years with the leadership and support of local companies. The program is bursting at the seams.”

In fact, talks of a larger venue are already in the works for 2025. When it comes to connecting current and next-generation computer scientists, software designers, game developers and tech entrepreneurs, it’s a win-win for students and their partners.

“I highly recommend more companies and alumni get involved and partner with ICS students on a capstone project,” says Pai, “as it is rewarding and fulfilling for all parties involved.”

If you are a company interested in partnering on a capstone project, contact Mimi Anderson at [email protected].

— Shani Murray

SSU students showcase their Spring 2024 Senior Capstone Projects

May 20, 2024

Spring 2024 Advanced Software Design Project (CS 470) students presented their senior capstone projects. Students made video demos of their apps that are accessible to a professional audience. Check out each team's presentation below.

Crisis Companion Ethan Martinez, Nicolas Randazzo, Jacob Franco

The Crisis Companion project is dedicated to enhancing disaster response capabilities through a comprehensive web-based platform that supports community communication and local authority efforts during emergencies. This platform enables users to actively report and track incidents such as weather anomalies and crowdsource members of the community who have skills to serve as volunteers (e.g., firefighters, heavy machinery operators, medics). The core functionality includes user registration, incident reporting, volunteer coordination, and real-time incident mapping, all designed to maintain communication and community awareness even in scenarios where communities traditionally lack coordinated response mechanisms for disasters. Utilizing a robust technology stack comprising MariaDB/MySQL, Amazon Web Services, Node.js, Express.js, and React.js, Crisis Companion integrates sophisticated backend services with a user-friendly front-end interface. The system is designed to be intuitive, allowing for quick user adoption and efficient use of real-time data to inform decision-making processes. By fostering a proactive approach to crisis management, Crisis Companion aims to empower communities with the tools needed to respond swiftly and efficiently to emergencies, ultimately minimizing impact and enhancing recovery efforts.

Training Grounds (Pokemon) Adam Lyday, Ian Boskin, Benito Sanchez, Erika Diaz Ramirez

The objective of the project was to make a Pokémon battle emulator in which you can create an account, create Pokémon teams, and battle against other people online in real time. The Pokémon implemented were restricted to generation 1. Each team was allowed up to 6 Pokémon, with duplicates allowed. We were able to implement a battle with a random user with a preselected team. Upon entering a match, a random level between 50 and 80 was assigned to each Pokémon for balance. Additionally, you are able to add friends and message the friends you have added.

Online Poker Game Homero Arellano, Zach Gassner, Diego Rivera

We made an online 3D multiplayer poker game for our final project: no-limit Texas hold’em, which is probably the most popular variant of poker today. Our online poker app requires you to log in or create an account upon connecting. Each new account starts with 20,000 chips. The buy-in for our poker game is 10,000 chips, and our game is cash game style. Cash game poker refers to the fact that the blinds are fixed (50 for the small blind and 100 for the big blind). The big blind and small blind are forced bets that rotate one spot clockwise each hand. Another characteristic of cash games is that a player can join and leave the game whenever they would like. If a player leaves a game and still has chips (meaning they didn’t lose all their chips during the game), those chips are added back to their account. Cash game poker differs from tournament poker in that tournament poker has an initial buy-in of a certain amount, and the blinds periodically increase after a set amount of time. In tournament poker, if you leave before the tournament is over, you do not get your money back; instead, the top 20-30% of players typically get paid, depending on the tournament structure.

Our online poker application also has a friends page where you can add friends and join games from the friends tab. The friends tab allows users to send requests to other players, which they can either accept or reject. The use of WebSockets also allows users to see which of their friends are online or offline, indicated by a green or red circle respectively. On top of that, our application has a shop where you can purchase avatars with the chips you acquire from playing the game.

When loading into a match by creating or joining one, you are met with a 3D model of a poker table and chairs. From there, players can select where they would like to sit and will see other players who have already taken a seat. Our game allows for first-person (FP) controls, and all cards are rendered as 3D models as well. Once the game has been started by the host, it proceeds on a turn-based system until the end of the hand. Winners are indicated by a green overlay and losers with a red one, and all players’ hands are revealed in the bottom right corner.

Wow-Teamz Kyle Drewes

Wow-Teamz is an application used to store roster data for a guild’s raid team in World of Warcraft. In WoW (World of Warcraft), the main content that players partake in can involve 5-30 people. Managing all these people can be a hassle, so WoW-Teamz provides an interactive and easy way to add characters from the game into a database and manage them. First, the user must create an account, which allows them to create a raid team. A raid team can house up to 30 players, and the user can also select the raid times for that raid team. After doing this, the user can select the raid and will be prompted with a “+” that, when clicked, opens a text box where the user can simply enter a character name; this makes a call to the Warcraft API, gathers important information on that character, and adds it to the raid team. Characters can then be designated one of the game’s three roles: tank, healer, and damage dealer. Once this is done, the user can view the breakdown of their raid team: a graph is generated that displays how many players fill each role, along with a list of the classes and specializations the raid team is still missing, as each raid team needs one of each.

GlobeNomad Michael Seutin, William Cottrell, Anudari Gereltod

GlobeNomad is an all-in-one travel website where users can use their Google or GitHub account to create an account, which they can later use to log in. After logging in, users can create trips by selecting a city from the autogenerated suggestions and choosing dates. Each trip has information about the city, like current time, weather, and a map. The dashboard is a place where you can showcase your adventure history and see your personal information. This includes your name, business information, packing list, bucket list, countdown to your upcoming trip, photos, and more. A user can also follow other nomads and see their profiles for inspiration. Private or public chats are available for nomads to exchange information. Featured cities provide other useful information gathered from the internet.

Work Scheduler App Nathan Brin, Brody Lang, Kyle Pallo

The Work Scheduler App is a web and iOS application, built to aid businesses in dynamically scheduling employees on a week-to-week basis. It’s designed to be as simple and accessible as possible, since existing applications with similar functionality are often quite complex and clunky to use. With that in mind, the application’s back-end is built upon a dead-simple and reliable MySQL database and a Koa.js-powered API. The web UI is built with React and Material UI, and the iOS app uses native Swift and UIKit. The core functionality is as follows: Admins (i.e. managers of businesses) can use the web application to manage their employees’ training, time off, availability, and time clock punches. Admins can also define schedules with shifts that need to be filled, and automatically assign employees to each shift based on their training, time off, availability, and existing schedules. All users can use the web application or iOS application to view their schedules and make time off and availability requests. The iOS application can additionally be used to punch in and out of shifts and meals with biometric authentication.
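The automatic shift-assignment behavior described above (matching employees to shifts based on training, time off, availability, and existing schedules) can be sketched as a simple eligibility filter plus a greedy fill. This is a hypothetical illustration only, not the team's actual Koa.js/MySQL implementation; all field names are invented:

```python
# Hypothetical sketch of the automatic shift-assignment rule described
# above: an employee qualifies for a shift only if they are trained for
# its role, not on time off, marked available, and not already booked.
# All field names are invented for illustration.

def eligible(employee, shift):
    return (
        shift["role"] in employee["training"]
        and shift["date"] not in employee["time_off"]
        and shift["date"] in employee["availability"]
        and shift["date"] not in employee["scheduled"]
    )

def auto_assign(shifts, employees):
    """Greedily fill each shift with the first eligible employee."""
    roster = {}
    for shift in shifts:
        for emp in employees:
            if eligible(emp, shift):
                roster[shift["id"]] = emp["name"]
                emp["scheduled"].add(shift["date"])  # block double-booking
                break
    return roster

staff = [{"name": "kay", "training": {"cashier"}, "time_off": set(),
          "availability": {"mon", "tue"}, "scheduled": set()}]
week = [{"id": 1, "role": "cashier", "date": "mon"}]
print(auto_assign(week, staff))
```

A production scheduler would of course also balance hours across employees and surface unfilled shifts to the admin, but the same four eligibility checks the team lists are the core of the feature.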

Petlove Kristen Cocciante, Phi Do, Bella Gonzalez

PetLove is a pet organizational tool. It was designed as a hub to share useful information about a pet between co-owners and sitters. This information ranges from meal times and meal quantities to vet appointments and medication. We added features to allow users to send friend requests, send messages, and stay in touch with other pet owners. Our app’s primary components include a dashboard, a profile page, a calendar, and a social page. The app also has a few secondary components, such as a login screen and a settings page. The profile page is used to view and create or edit individual pet profiles, which feature notes, allergies, mealtimes, medications, and user and veterinarian contact information. The calendar is used to check recorded appointments, including relevant information such as the nature of the appointment, the relevant pets (whether they be your own or ones you are sitting), and important notes. Last is the social page, used to message users and search for users in order to add them as friends (which is necessary in order to add them as co-owners and sitters on your pets).

Quiz Social Evan Walters, Kathy Yuen, Hangpei Zhang

QuizSocial is an online study tool that aims to provide a simple and easy way to learn, allowing users to create, share, and discover study tools. After making an account, you can create a quiz and add what are essentially flash cards, each with a question and an answer. After doing this, you can use one of four study methods: flash cards, fill in the blank, memory match, and fast multiple choice. The web app also allows you to visit the profiles and quizzes made by other users by means of a search page. This way, you can follow users to easily view their quizzes as well as any new quizzes they create. When visiting a quiz, you can also rate it and even favorite it for easy and specific access. To sum it up, the app is a social network that makes it easy to create and find study tools.

Capstone Projects

Online M.S. in Data Science students are required to complete a capstone project. Capstone projects challenge students to acquire and analyze data to solve real-world problems. Project teams consist of two to four students and a faculty advisor. Teams select their capstone project in term 4 and work on the project in term 5, which is their final term.

Most projects are sponsored by an organization—academic, commercial, non-profit, and government—seeking valuable recommendations to address strategic and operational issues. Depending on the needs of the sponsor, teams may develop web-based applications that can support ongoing decision-making. The capstone project concludes with a paper and presentation.

Key takeaways:

  • Synthesizing the concepts you have learned throughout the program in various courses (this requires that the question posed by the project be complex enough to require the application of appropriate analytical approaches learned in the program and that the available data be of sufficient size to qualify as ‘big’)
  • Experience working with ‘raw’ data exposing you to the data pipeline process you are likely to encounter in the ‘real world’  
  • Demonstrating oral and written communication skills through a formal paper and presentation of project outcomes  
  • Acquisition of team building skills on a long-term, complex, data science project 
  • Addressing an actual client's need by building a data product that can be shared with the client

Capstone projects have been sponsored by a variety of organizations and industries, including: Capital One, City of Charlottesville, Deloitte Consulting LLP, Metropolitan Museum of Art, MITRE Corporation, a multinational banking firm, The Public Library of Science, S&P Global Market Intelligence, UVA Brain Institute, UVA Center for Diabetes Technology, UVA Health System, U.S. Army Research Laboratory, Virginia Department of Health, Virginia Department of Motor Vehicles, Virginia Office of the Governor, Wikipedia, and more.

Sponsor a Capstone Project  

View previous examples of capstone projects  and check out answers to frequently asked questions. 

What does the process look like?

  • The School of Data Science periodically puts out a  Call for Proposals . Prospective project sponsors submit official proposals, vetted by the Associate Director for Research Development, Capstone Director, and faculty.
  • Sponsors present their projects to students at “Pitch Day” during Semester 4, where students have the opportunity to ask questions.
  • Students individually rank their top project choices. An algorithm sorts students into capstone groups of approximately 3 to 4 students per group.
  • Adjustments are made by hand as necessary to finalize groups.
  • Each group is assigned a faculty mentor, who will meet groups each week in a seminar-style format.  
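The ranking-and-matching step above can be sketched as a simple greedy assignment. This is purely a hypothetical illustration (the School's actual matching algorithm is not described in this document, and the student and project names are invented):

```python
# Hypothetical sketch of a preference-based matching pass: each student
# ranks projects best-first, and a greedy pass seats each student at
# their highest-ranked project that still has room (about 4 per group).

def assign_groups(preferences, capacity=4):
    """preferences: dict mapping student -> list of projects, best first."""
    groups = {}
    for student, ranked in preferences.items():
        for project in ranked:
            members = groups.setdefault(project, [])
            if len(members) < capacity:
                members.append(student)
                break
    return groups

prefs = {
    "student_a": ["wikipedia", "uva-health"],
    "student_b": ["wikipedia", "dmv"],
    "student_c": ["dmv", "wikipedia"],
}
print(assign_groups(prefs))
```

A greedy pass like this can leave a student unplaced when every ranked project is full, which is exactly why the process above ends with by-hand adjustments to finalize groups.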

What is the seminar approach to mentoring capstones?

We utilize a seminar approach to managing capstones to provide faculty mentorship and streamlined logistics. This approach involves one mentor supervising three to four loosely related projects and meeting with these groups on a regular basis. Project teams often encounter similar roadblocks and issues so meeting together to share information and report on progress toward key milestones is highly beneficial.

Do all capstone projects have corporate sponsors?

Not necessarily. Generally, each group works with a sponsor from outside the School of Data Science. Some sponsors are corporations, some are from nonprofit and governmental organizations, and some are from other departments at UVA.

Among the challenges we continue to encounter when curating capstone projects with external sponsors are appropriately scoping and defining a question of sufficient depth for our students, obtaining data of sufficient size, obtaining access to the data in sufficient time for adequate analysis to be performed, and navigating a myriad of legal issues (including conflicts of interest). While we continue to strive to use sponsored projects and work to solve these issues, we also look for ways to leverage openly available data to solve interesting societal problems which allow students to apply the skills learned throughout the program. While not all capstones have sponsors, all capstones have clients. That is, the work is being done for someone who cares about and has investment in the outcome.

Why do we have to work in groups?

Because data science is a team sport!

All capstone projects are completed through group work. While this requires additional coordination, this collaborative component of the program reflects the way companies expect their employees to work. Building this skill is one of our core learning objectives for the program.

I didn’t get my first choice of capstone project from the algorithm matching. What can I do?

Remember that the point of the capstone projects isn’t the subject matter; it’s the data science. Professional data scientists may find themselves in positions in which they work on topics assigned to them, but they use methods they enjoy and still learn much through the process. That said, there are many ways to tackle a subject, and we are more than happy to work with you to find an approach to the work that most aligns with your interests.

Why don’t we have a say in the capstone topics?

Your ability to influence which project you work on comes through the ranking process after “pitch day” and through encouraging your company or department to submit a proposal during the Call for Proposals process. At a minimum, it takes several months to work with a sponsor to adequately scope a project, confirm access to the data, and put the appropriate legal agreements into place. Before you ever see a project presented on pitch day, a lot of work has taken place to get it to that point!

Can I work on a project for my current employer?

Each spring, we put forward a public call for capstone projects. You are encouraged to share this call widely with your community, including your employer, non-profit organizations, or any entity that might have a big data problem that we can help solve. As a reminder, capstone projects are group projects, so the project would require sufficient student interest after ‘pitch day’. In addition, you (the student) cannot serve as the project sponsor; someone else within your employer organization must serve in that capacity.

If my project doesn’t have a corporate sponsor, am I losing out on a career opportunity?

The capstone project will provide you with the opportunity to do relevant, high-quality work which can be included on a resume and discussed during job interviews. The project paper and your code on GitHub will provide more career opportunities than the sponsor of the project. Although it does happen from time to time, it is rare that capstones lead to a direct job offer with the capstone sponsor's company. Capstone projects are just one networking opportunity available to you in the program.

Capstone Project Reflections From Alumni

Theo Braimoh, MSDS Online Graduate and Admissions Student Ambassador

“For my Capstone project, I used Python to train machine learning models for visual analysis – also known as computer vision. Computer vision helped my Capstone team analyze the ergonomic posture of workers at risk of developing musculoskeletal injuries. We automated the process, and hope our work further protects the health and safety of American workers.” — Theophilus Braimoh, MSDS Online Program 2023, Admissions Student Ambassador

Haley Egan, MSDS Online 2023 and Admissions Student Ambassador

“My Capstone experience with the ALMA Observatory and NRAO was a pivotal chapter in my UVA Master’s in Data Science journey. It fostered profound growth in my data science expertise and instilled a confidence that I'm ready to make meaningful contributions in the professional realm.” — Haley Egan, MSDS Online Program 2023, Admissions Student Ambassador

Mina Kim, MSDS/PhD 2023

“Our Capstone projects gave us the opportunity to gain new domain knowledge and answer big data questions beyond the classroom setting.”  — Mina Kim, MSDS Residential Program 2023, Ph.D. in Psychology Candidate

Capstone Project Reflections From Sponsors

“For us, the level of expertise, and special expertise, of the capstone students gives us ‘extra legs’ and an extra push to move a project forward. The team was asked to provide a replicable prototype air quality sensor that connected to the Cville Things Network, a free and community supported IoT network in Charlottesville. Their final product was a fantastic example that included clear circuit diagrams for replication by citizen scientists.” — Lucas Ames, Founder, Smart Cville
“Working with students on an exploratory project allowed us to focus on the data part of the problem rather than the business part, while testing with little risk. If our hypothesis falls flat, we gain valuable information; if it is validated or exceeded, we gain valuable information and are a few steps closer to a new product offering than when we started.” — Ellen Loeshelle, Senior Director of Product Management, Clarabridge


What can you do with a business analytics degree?

  • June 28, 2024
  • by People of Pacific


All businesses collect data, and most collect more than they know what to do with. Sales and financial information gathered during customer interactions can show a business how to refine its sales techniques, improve its marketing or find ways to save money. But data-driven decision-making can only happen when someone is able to turn raw numbers into insights that can inform and guide business leaders’ actions.  

Business analysts are trained to do just this. Business analytics students gain a deep knowledge of both business operations and data analysis, so they can translate business questions into data queries and those queries into analytics reports, which provide business predictions and recommendations.  

They learn to address business problems by:  

  • Identifying and defining the problem 
  • Determining what information is needed to address the problem 
  • Obtaining, managing and cleaning that data, as necessary 
  • Analyzing the data using analytical tools including Excel, Tableau and Power BI 
  • Interpreting the data 
  • Creating visualizations of the data to communicate its lessons to leadership 
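
As an illustration only, the steps above map naturally onto a few lines of code. The sketch below uses made-up sales records and Python's standard library; the tools named in the list (Excel, Tableau, Power BI) accomplish the same workflow without programming.

```python
# Minimal sketch of the analyst workflow above, on hypothetical sales data.
from statistics import mean

# 1. Obtain the raw data (hard-coded here; one record is incomplete).
raw = [
    {"region": "West", "sales": 1200.0},
    {"region": "East", "sales": 980.0},
    {"region": "West", "sales": None},   # missing value to clean out
    {"region": "East", "sales": 1410.0},
    {"region": "West", "sales": 1550.0},
]

# 2. Clean: drop records with missing sales figures.
clean = [r for r in raw if r["sales"] is not None]

# 3. Analyze: average sales per region.
regions = {r["region"] for r in clean}
avg_by_region = {
    reg: round(mean(r["sales"] for r in clean if r["region"] == reg), 2)
    for reg in regions
}

# 4. Interpret: flag the strongest region for the report to leadership.
best = max(avg_by_region, key=avg_by_region.get)
print(avg_by_region, best)
```

The final communication step would turn `avg_by_region` into a chart or dashboard for leadership rather than a printed dictionary.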

Business analysts serve as catalysts for the companies they work for, helping leadership better understand the ins and outs of operations and enabling them to improve outcomes in many settings.

“It’s absolutely amazing, that as an analyst, I get to work with the senior management and that they depend on my analytical abilities to become the best decision-makers in their positions,” said Arooj Rizvi, who earned a Bachelor of Science in Business Administration with a business analytics concentration from University of the Pacific in 2021 and a Master of Science in Business Analytics in 2022. She works as an institutional research analyst at San Joaquin Delta College.  

Business analytics vs. data science

The advantage that business analytics students have over statistics or data science majors is that they are better equipped to relate data to a company's strategic objectives and operational processes, leading to more data-driven decision-making. They can speak the languages of both business and analytics and translate between them, which improves communication within a business and enables it to work more efficiently.

There are similarities between business analytics and data science programs' coursework: both teach statistics and how to use similar analytical tools. But data science students learn computer programming, whereas business analytics students learn how to apply these tools and techniques to solve business problems and drive strategic decision-making.

Business analytics graduates have an advantage in a variety of analytical jobs because they can combine strong analytical skills with a deep understanding of business operations and strategy.  

For this reason, Leili Javadpour, an associate professor in Pacific's Eberhardt School of Business, encourages her business analytics students to double-major in another field of business such as finance or marketing, arguing that doing so makes them competitive with both business and data science graduates.

Business analytics jobs

“Business analytics is a rewarding career choice that gives us the skills to support the core of the organization,” said Rizvi. “Inevitably, all organizations will depend on the existence of data being captured for business intelligence.” 

Jobs that business analytics graduates commonly obtain include:  

  • Financial analyst: Financial analysts advise the leaders of their organizations about the best ways to responsibly manage their funds and increase their earnings. This might mean doing market research to figure out what products or offerings might sell best or analyzing financial trends to ensure that an organization’s investments are optimized. They are key staff involved in creating budgets and financial reporting.    
  • Marketing analyst: Marketing analysts determine the effectiveness of sales techniques and assess how well advertising campaigns worked—and why. They figure out what the most effective ways to increase sales are, to help companies spend their advertising budgets efficiently and to help marketing and sales staff spend their time on the efforts that are most likely to provide the greatest benefit to the company.  
  • Accounting analyst: Accounting analysts are responsible for monitoring and reporting on a company’s financial wellbeing, both overall and at the level of individual business units. They conduct audits to make sure the company is complying with all relevant financial regulations, create invoices and billing statements, monitor and track financial transactions and assist with tax preparations.  
  • Institutional research analyst: Institutional research analysts work at schools or nonprofit institutions to analyze internal data about members of a community and their work. Research could be undertaken for many reasons: to demonstrate compliance with regulations, to forecast enrollment numbers to inform budgets, to meet accreditation requirements and many other purposes. 
  • Database architect/administrator: Database architects and administrators ensure that an organization has access to the data it needs; that the data is well organized, so needed reports can be pulled easily; that the database itself is maintained and updated as necessary; and that the database integrates with other platforms as needed. They may also provide and interpret reports based on the data they manage.

Pacific’s business analytics degree

Business analytics students at Pacific experience the university’s hallmark small class sizes and benefit from personal attention from faculty members who are invested in the success of each of their students.  

“Pacific is offering a competitive program allowing mentorship from seasoned professionals and a competitive curriculum right from the start,” says Rizvi. 

Javadpour says that she and her colleagues regularly rework their classes to incorporate the most recent market trends and introduce new tools and technologies. The Pacific business analytics degree builds on that foundation.  

“We are focusing on teaching students problem solving skills rather than just how to work with software,” says Javadpour. “Software is changing and we won’t know what will be the tool to use in four years when they are graduating. Analytical skills and a problem-solving mindset will equip them with what they need in their future jobs.”   

Flexibility in the curriculum allows motivated students to make the most of their time at Pacific and makes it comparatively easy to double-major and gain practical experience.

Business analytics students complete an experiential learning capstone project in their final semester. Rather than writing a thesis, students are assigned to clients—either corporations or business units at Pacific—and work with them to find solutions to existing problems. At the end of the semester, they present their work to their clients and professors.  

Learn more about business analytics at Pacific .  
