
By Avijeet Biswal

Top 90+ Data Science Interview Questions and Answers for 2024

What is Data Science?

Data Science combines statistics, mathematics, specialised programming, artificial intelligence, and machine learning. It is the application of specific principles and analytic techniques to extract information from data for use in strategic planning, decision making, and similar tasks. Simply put, data science means analysing data for actionable insights, drawing on core data science skills such as data visualization, processing of large data sets, and statistical analysis and computing to unlock the value in data.

10 Most Asked Data Science Interview Questions 

  • Differentiate between Data Analytics and Data Science
  • What are the differences between supervised and unsupervised learning?
  • Explain the steps in making a decision tree.
  • Differentiate between univariate, bivariate, and multivariate analysis.
  • How should you maintain a deployed model?
  • What is a Confusion Matrix?
  • How is logistic regression done?
  • What is the significance of p-value?
  • Mention some techniques used for sampling.

Basic and Advanced Data Science Interview Questions

Here's a list of the most popular technical data science interview questions that you can expect to face, and how to frame your answers.

1. What are the differences between supervised and unsupervised learning?

Supervised learning uses labelled data: the model learns a mapping from the input features to known target values and is used for tasks such as classification and regression. Unsupervised learning works on unlabelled data: the model has to find structure on its own, for example by clustering similar observations (k-means) or by reducing dimensionality (PCA).

2. How is logistic regression done?

Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).


The formula for the sigmoid function is:

sigmoid(x) = 1 / (1 + e^(-x))

It maps any real-valued input to a value between 0 and 1, which is interpreted as a probability.
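As a rough illustration (not the article's original figure), the sigmoid and a logistic regression fit can be sketched in Python with scikit-learn; the one-feature data set below is made up purely for demonstration:

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # maps any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# made-up illustrative data: one feature, binary label
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# the predicted probability is sigmoid(w * x + b) under the hood
print(model.predict_proba([[3.5]]))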

3. Explain the steps in making a decision tree.

  • Take the entire data set as input
  • Calculate entropy of the target variable, as well as the predictor attributes
  • Calculate the information gain of all attributes (the information we gain by sorting different objects from each other)
  • Choose the attribute with the highest information gain as the root node 
  • Repeat the same procedure on every branch until the decision node of each branch is finalized

For example, let's say you want to build a decision tree to decide whether you should accept or decline a job offer. The decision tree for this case is as shown:

build a decision tree

It is clear from the decision tree that an offer is accepted if:

  • Salary is greater than $50,000
  • The commute is less than an hour 
  • Incentives are offered 
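As a quick sketch of steps two and three (entropy and information gain) in plain Python; the accept/decline labels and the split below are hypothetical:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent_labels, child_splits):
    # reduction in entropy achieved by splitting the parent into child subsets
    total = len(parent_labels)
    weighted_child_entropy = sum(len(child) / total * entropy(child)
                                 for child in child_splits)
    return entropy(parent_labels) - weighted_child_entropy

# hypothetical attribute that splits the accept/decline decisions into two branches
parent = ['accept', 'accept', 'accept', 'decline', 'decline']
split = [['accept', 'accept', 'accept'], ['decline', 'decline']]
print(information_gain(parent, split))   # higher gain -> better candidate for the root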

4. How do you build a random forest model?

A random forest is built up of a number of decision trees. If you split the data into different packages and make a decision tree in each of the different groups of data, the random forest brings all those trees together.

Steps to build a random forest model:

  • Randomly select 'k' features from a total of 'm' features where k << m
  • Among the 'k' features, calculate the node D using the best split point
  • Split the node into daughter nodes using the best split
  • Repeat steps two and three until leaf nodes are finalized 
  • Build forest by repeating steps one to four for 'n' times to create 'n' number of trees 
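A minimal scikit-learn sketch of these steps; the library performs the random feature sampling and tree building internally, and the bundled iris data set is used only as a convenient example:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number of trees 'n'; max_features='sqrt' picks k << m features per split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))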

5. How can you avoid overfitting your model?

Overfitting refers to a model that fits the training data too closely: it captures noise that is specific to that small sample and ignores the bigger picture, so it generalises poorly to new data. There are three main methods to avoid overfitting:

  • Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
  • Use cross-validation techniques, such as k folds cross-validation 
  • Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting
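A short sketch of the last two points, combining k-fold cross-validation with LASSO regularisation in scikit-learn; the diabetes data set is just a convenient built-in example:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# LASSO penalises large coefficients; alpha controls the strength of the penalty
model = Lasso(alpha=0.1)

# 5-fold cross-validation gives a more honest performance estimate than a single split
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())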


6. Differentiate between univariate, bivariate, and multivariate analysis.

Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it. 

Example: height of students 

The patterns can be studied by drawing conclusions using mean, median, mode, dispersion or range, minimum, maximum, etc.

Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.

Example: temperature and ice cream sales in the summer season

Here, it is clear that temperature and sales are directly proportional to each other. The hotter the temperature, the better the sales.

Data that involves three or more variables is categorized as multivariate. It is similar to bivariate data but contains more than one dependent variable.

Example: data for house price prediction 

The patterns can be studied by drawing conclusions using mean, median, and mode, dispersion or range, minimum, maximum, etc. You can start describing the data and using it to guess what the price of the house will be.

7. What are the feature selection methods used to select the right variables?

There are two main methods for feature selection: filter methods and wrapper methods.

Filter Methods

These involve:

  • Linear discriminant analysis (LDA)
  • ANOVA
  • Chi-Square

The best analogy for selecting features is "bad data in, bad answer out." When we're limiting or selecting the features, it's all about cleaning up the data coming in.

Wrapper Methods

  • Forward Selection: We test one feature at a time and keep adding them until we get a good fit
  • Backward Selection: We test all the features and start removing them to see what works better
  • Recursive Feature Elimination: Recursively looks through all the different features and how they pair together

Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method. 
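An illustrative scikit-learn sketch of one filter method (an ANOVA F-test via SelectKBest) and one wrapper method (recursive feature elimination); the breast-cancer data set is only a convenient stand-in:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# filter method: score each feature independently and keep the top 10
filter_selector = SelectKBest(score_func=f_classif, k=10)
X_filtered = filter_selector.fit_transform(X, y)

# wrapper method: repeatedly fit a model and drop the weakest features
wrapper_selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapped = wrapper_selector.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)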


8. In your choice of language, write a program that prints the numbers ranging from one to 50.

But for multiples of three, print "Fizz" instead of the number, and for the multiples of five, print "Buzz." For numbers which are multiples of both three and five, print "FizzBuzz" 

The code is shown below:

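Since the original screenshot is not reproduced here, the following is a minimal Python version of the program described:

for num in range(1, 51):
    if num % 3 == 0 and num % 5 == 0:
        print("FizzBuzz")
    elif num % 3 == 0:
        print("Fizz")
    elif num % 5 == 0:
        print("Buzz")
    else:
        print(num)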

Note that range(1, 51) produces the numbers one to 50, as the question asks; writing range(51) instead would start the sequence from zero.

The output of the above code begins: 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, FizzBuzz, ...

9. You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?

The following are ways to handle missing data values:

If the data set is large, we can simply remove the rows with missing data values. This is the quickest way; we then use the rest of the data to build the model.

For smaller data sets, we can substitute missing values with the mean or average of the rest of the data using a pandas DataFrame in Python. There are different ways to do so, such as df.mean() combined with df.fillna(df.mean()).
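A brief pandas sketch of both approaches; the DataFrame below is a made-up example with missing entries in a numeric column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 47, np.nan, 52]})

# large data set: simply drop the rows that contain missing values
df_dropped = df.dropna()

# small data set: substitute the missing values with the column mean
df_filled = df.fillna(df.mean())

print(df_filled)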

10. For the given points, how will you calculate the Euclidean distance in Python?

plot1 = [1,3]

plot2 = [2,5]

The Euclidean distance can be calculated as follows:

from math import sqrt

euclidean_distance = sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )

Check out Simplilearn's video on data science interview questions, curated by industry experts, to help you prepare for an interview.

11. What are dimensionality reduction and its benefits?

Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) while conveying similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there's no point in storing a value in two different units (meters and inches). 
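As an illustration, PCA is one common dimensionality-reduction technique (the answer above is not limited to it); here scikit-learn's 64-dimensional digits data is compressed to two components:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 64 features per image

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # compressed to 2 features

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_)     # share of the variance kept by each component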


12. How will you calculate eigenvalues and eigenvectors of the following 3x3 matrix?

The characteristic equation is det(A − λI) = 0, where A is the given matrix. Expanding the determinant along the first row:

(−2 − λ)[(1 − λ)(5 − λ) − 2×2] + 4[(−2)(5 − λ) − 4×2] + 2[(−2)×2 − 4(1 − λ)] = 0

−λ³ + 4λ² + 27λ − 90 = 0, i.e.

λ³ − 4λ² − 27λ + 90 = 0

This is the characteristic (algebraic) equation whose roots are the eigenvalues.

By trial, λ = 3 is a root:

3³ − 4×3² − 27×3 + 90 = 27 − 36 − 81 + 90 = 0

Hence, (λ − 3) is a factor:

λ³ − 4λ² − 27λ + 90 = (λ − 3)(λ² − λ − 30) = (λ − 3)(λ + 5)(λ − 6)

So the eigenvalues are 3, −5 and 6.

To calculate an eigenvector for λ = 3, write it as (X, Y, Z) and set X = 1. The first two equations of (A − 3I)(X, Y, Z)ᵀ = 0 become:

−5 − 4Y + 2Z = 0

−2 − 2Y + 2Z = 0

Subtracting the second equation from the first:

−3 − 2Y = 0, so Y = −(3/2)

Substituting Y back into the second equation gives Z = −(1/2), so an eigenvector for λ = 3 is (1, −3/2, −1/2).

Similarly, we can calculate the eigenvectors for −5 and 6.
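The result can also be checked numerically with NumPy. The matrix below is an assumption reconstructed from the determinant expansion shown above; it is not printed explicitly in the original question:

import numpy as np

# assumed 3x3 matrix, consistent with the cofactor expansion given above
A = np.array([[-2, -4, 2],
              [-2,  1, 2],
              [ 4,  2, 5]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)     # approximately 3, -5, 6
print(eigenvectors)    # one eigenvector per column (scaled to unit length)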

13. How should you maintain a deployed model?

The steps to maintain a deployed model are:

Constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure it's doing what it's supposed to do.

Evaluation metrics of the current model are calculated to determine if a new algorithm is needed. 

The new models are compared to each other to determine which model performs the best. 

The best-performing model is re-built on the current state of data.

14. What are recommender systems?

A recommender system predicts what a user would rate a specific product based on their preferences. It can be split into two different areas:

Collaborative Filtering

As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: "Users who bought this also bought…"

Content-based Filtering

As an example, Pandora uses the properties of a song to recommend music with similar properties. Here, we look at the content of the music itself, instead of at who else is listening to it.

15. How do you find RMSE and MSE in a linear regression model?

RMSE and MSE are two of the most common measures of accuracy for a linear regression model. 

RMSE indicates the Root Mean Square Error:

RMSE = sqrt( (1/n) * Σ (yᵢ − ŷᵢ)² )

MSE indicates the Mean Square Error:

MSE = (1/n) * Σ (yᵢ − ŷᵢ)²

where yᵢ are the observed values, ŷᵢ the predicted values, and n the number of observations.
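A quick sketch of both measures with NumPy and scikit-learn; the y_true and y_pred arrays are made-up illustrative values:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])     # made-up observed values
y_pred = np.array([2.8, 5.4, 7.0, 10.5])     # made-up model predictions

mse = mean_squared_error(y_true, y_pred)     # mean of the squared errors
rmse = np.sqrt(mse)                          # square root of the MSE

print(mse, rmse)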

16. How can you select k for k-means? 

We use the elbow method to select k for k-means clustering. The idea of the elbow method is to run k-means clustering on the data set for a range of values of k, where 'k' is the number of clusters, and to compute the within-cluster sum of squares (WSS) for each k.

The WSS is defined as the sum of the squared distances between each member of a cluster and its centroid. The value of k after which the WSS stops dropping sharply (the 'elbow' of the plot) is chosen.
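A minimal sketch of the elbow method with scikit-learn; the fitted model exposes the WSS as its inertia_ attribute, and make_blobs just generates synthetic demonstration data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# run k-means for a range of k and record the within-cluster sum of squares (WSS)
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, model.inertia_)

# the 'elbow' is the value of k after which the WSS stops dropping sharply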

17. What is the significance of p-value?

p-value typically ≤ 0.05

This indicates strong evidence against the null hypothesis; so you reject the null hypothesis.

p-value typically > 0.05

This indicates weak evidence against the null hypothesis, so you fail to reject (i.e., retain) the null hypothesis.

p-value at cutoff 0.05 

This is considered to be marginal, meaning it could go either way.

18. How can outlier values be treated?

You can drop an outlier if it is a garbage value.

Example: height of an adult = abc ft. This cannot be true, as height cannot be a string value. In this case, outliers can be removed.

If the outliers have extreme values, they can also be removed. For example, if all the data points are clustered between zero and 10, but one point lies at 100, then we can remove this point.

If you cannot drop outliers, you can try the following:

  • Try a different model. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model.
  • Try normalizing the data. This way, the extreme data points are pulled to a similar range.
  • You can use algorithms that are less affected by outliers; an example would be random forests . 

19. How can time-series data be declared stationary?

It is stationary when the variance and mean of the series are constant with time. 

Here is a visual example: 


In the first graph, the variance is constant with time. Here, X is the time factor and Y is the variable. The value of Y goes through the same points all the time; in other words, it is stationary.

In the second graph, the waves get bigger, which means it is non-stationary and the variance is changing with time.

20. How can you calculate accuracy using a confusion matrix?

Consider this confusion matrix :


You can see the values for total data, actual values, and predicted values.

The formula for accuracy is:

Accuracy = (True Positive + True Negative) / Total Observations

= (262 + 347) / 650

= 609 / 650

As a result, we get an accuracy of approximately 93.7 percent.

21. Write the equation and calculate the precision and recall rate.

Consider the same confusion matrix used in the previous question.

Precision = (True Positive) / (True Positive + False Positive)

= 262 / 277

≈ 0.95

Recall Rate = (True Positive) / (True Positive + False Negative)

= 262 / 288

≈ 0.91
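Working through the same counts in Python, with FP and FN derived from the ratios above (277 − 262 = 15 and 288 − 262 = 26):

TP, TN, FP, FN = 262, 347, 15, 26

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 609 / 650
precision = TP / (TP + FP)                   # 262 / 277
recall = TP / (TP + FN)                      # 262 / 288

print(round(accuracy, 3), round(precision, 3), round(recall, 3))   # 0.937 0.946 0.91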

22. 'People who bought this also bought…' recommendations seen on Amazon are a result of which algorithm?

The recommendation engine is accomplished with collaborative filtering. Collaborative filtering uses the behavior of other users and their purchase history in terms of ratings, selections, and so on.

The engine makes predictions on what might interest a person based on the preferences of other users. In this algorithm, item features are unknown.


For example, a sales page shows that a certain number of people buy a new phone and also buy tempered glass at the same time. Next time, when a person buys a phone, he or she may see a recommendation to buy tempered glass as well.

23. Write a basic SQL query that lists all orders with customer information.

Usually, we have an Order table and a Customer table that contain the following columns:

Order Table: CustomerId, OrderNumber, TotalAmount

Customer Table: Id, FirstName, LastName, City, Country

The SQL query is:

SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM [Order]
JOIN Customer
ON [Order].CustomerId = Customer.Id

(Order is a reserved word in SQL, so the table name is bracketed in the FROM and ON clauses.)

24. You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96 percent. Why shouldn't you be happy with your model performance? What can you do about it?

Cancer detection results in imbalanced data, and in an imbalanced dataset accuracy should not be used as the sole measure of performance. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection, and can greatly improve a patient's prognosis.

Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine the class-wise performance of the classifier.

25. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?

  • K-means clustering
  • Linear regression 
  • K-NN (k-nearest neighbor)
  • Decision trees 

The K-nearest neighbor (k-NN) algorithm can be used because, when a value is missing, it can impute it from the nearest neighbours computed on the other features.

When you're dealing with K-means clustering or linear regression, you need to handle missing values in your pre-processing, otherwise they'll crash. Decision trees have the same problem, although there is some variation among implementations.
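For the continuous case, scikit-learn ships a KNN-based imputer; a minimal sketch (the small matrix is made up, with np.nan marking the missing entries, and categorical columns would need to be encoded numerically first):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

imputer = KNNImputer(n_neighbors=2)      # fill each gap from the 2 nearest rows
print(imputer.fit_transform(X))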

26. Below are the eight actual values of the target variable in the train file. What is the entropy of the target variable?

[0, 0, 0, 1, 1, 1, 1, 1] 

Choose the correct answer.

  • -(5/8 log(5/8) + 3/8 log(3/8))
  • 5/8 log(5/8) + 3/8 log(3/8)
  • 3/8 log(5/8) + 5/8 log(3/8)
  • 5/8 log(3/8) – 3/8 log(5/8)

The positive class of the target variable, in this case, is 1: five of the eight values are 1 (p = 5) and three are 0 (n = 3).

The formula for calculating the entropy is:

Entropy = −( p/(p+n) log(p/(p+n)) + n/(p+n) log(n/(p+n)) )

Putting p = 5 and n = 3 (so p + n = 8), we get

Entropy = A = -(5/8 log(5/8) + 3/8 log(3/8))
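Checking the arithmetic in Python (using log base 2):

import math

values = [0, 0, 0, 1, 1, 1, 1, 1]
p = values.count(1) / len(values)        # 5/8
n = values.count(0) / len(values)        # 3/8

entropy = -(p * math.log2(p) + n * math.log2(n))
print(entropy)                           # about 0.954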

27. We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case?

Choose the correct option:

  • Logistic Regression 
  • Linear Regression
  • K-means clustering 
  • Apriori algorithm

The most appropriate algorithm for this case is A, logistic regression. 

28. After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?

  • K-means clustering
  • Linear regression
  • Association rules
  • Decision trees

As we are looking to group people together by four specific kinds of similarity, which indicates the value of k, K-means clustering (answer A) is the most appropriate algorithm for this study.

29. You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?

Choose the right answer:

  • {banana, apple, grape, orange} must be a frequent itemset
  • {banana, apple} => {orange} must be a relevant rule
  • {grape} => {banana, apple} must be a relevant rule
  • {grape, apple} must be a frequent itemset

The answer is: {grape, apple} must be a frequent itemset. Because {banana, apple} => {grape} is a relevant rule, the itemset {banana, apple, grape} must be frequent, and every subset of a frequent itemset, including {grape, apple}, must also be frequent.

30. Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?

  • One-way ANOVA 
  • Association rules 
  • Student's t-test 

The answer is A: One-way ANOVA

31. What do you understand about true positive rate and false-positive rate?

  • The True Positive Rate (TPR) defines the probability that an actual positive will be predicted as positive.

The True Positive Rate (TPR) is calculated as the ratio of the True Positives (TP) to the sum of the True Positives (TP) and False Negatives (FN).

The formula is:

TPR = TP / (TP + FN)

  • The False Positive Rate (FPR) defines the probability that an actual negative will be predicted as positive, i.e. the probability that the model will generate a false alarm.

The False Positive Rate (FPR) is calculated as the ratio of the False Positives (FP) to the sum of the True Negatives (TN) and False Positives (FP).

FPR = FP / (TN + FP)

32. What is the ROC curve?

The ROC curve is the graph of the True Positive Rate (on the y-axis) against the False Positive Rate (on the x-axis), and it is used to evaluate binary classification models.

The False Positive Rate (FPR) is calculated as the ratio between the False Positives and the total number of negative samples, and the True Positive Rate (TPR) is calculated as the ratio between the True Positives and the total number of positive samples.

To construct the ROC curve, the TPR and FPR values are plotted at multiple threshold values. The area under the ROC curve (AUC) ranges between 0 and 1. A completely random model, represented by the diagonal straight line, has an AUC of 0.5. The amount by which the ROC curve deviates from this straight line indicates how effective the model is.

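A short scikit-learn sketch of how the curve and the area under it are computed; y_true and y_score below are made-up example values:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                        # made-up actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]     # made-up predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # one (FPR, TPR) pair per threshold
print(fpr, tpr)
print(roc_auc_score(y_true, y_score))                    # 0.5 would indicate a random model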

33. What is a Confusion Matrix?

A Confusion Matrix is a summary of the prediction results for a classification problem. It is a table used to describe the performance of a model: an n×n matrix (where n is the number of classes) that compares the predicted labels against the actual labels.

34. What do you understand about the true-positive rate and false-positive rate?

TRUE-POSITIVE RATE: The true-positive rate gives the proportion of correct predictions of the positive class. It is also used to measure the percentage of actual positives that are accurately verified.

FALSE-POSITIVE RATE: The false-positive rate gives the proportion of actual negatives that are incorrectly predicted as positive. A false positive occurs when the model predicts the positive class for an observation that is actually negative.

35. How is Data Science different from traditional application programming?

The primary and vital difference between Data Science and traditional application programming is that in traditional programming, one has to create rules to translate the input to output. In Data Science, the rules are automatically produced from the data.

36. What is the difference between the long format data and wide format data?

LONG FORMAT DATA: It contains values that repeat in the first column. In this format, each row is a one-time point per subject.

WIDE FORMAT DATA: In the Wide Format Data, the data’s repeated responses will be in a single row, and each response can be recorded in separate columns.

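A small pandas sketch of converting between the two formats; the subject/test/score columns are made-up illustrative names:

import pandas as pd

# wide format: one row per subject, repeated measurements in separate columns
wide = pd.DataFrame({"subject": ["A", "B"], "test1": [80, 75], "test2": [85, 70]})

# long format: one row per subject per measurement
long = wide.melt(id_vars="subject", var_name="test", value_name="score")

# and back to wide again
wide_again = long.pivot(index="subject", columns="test", values="score")

print(long)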

37. Mention some techniques used for sampling. 

Sampling is the selection of individual members or a subset of the population in order to estimate the characteristics of the whole population. There are two broad types of sampling: Probability Sampling (for example, simple random, stratified, systematic, and cluster sampling) and Non-Probability Sampling (for example, convenience, quota, and snowball sampling).

38. Why is Python used for Data Cleaning in DS?

Data scientists and technical analysts must turn huge volumes of raw data into usable, reliable data. Data cleaning includes removing corrupted or invalid records, outliers, inconsistent values, redundant formatting, etc. Python libraries such as Pandas and NumPy are the most widely used tools for this kind of data cleaning.

39. What are the popular libraries used in Data Science?

The popular libraries used in Data Science are:

  • TensorFlow
  • Pandas
  • NumPy
  • SciPy
  • Scikit-learn
  • Matplotlib
  • Keras

40. What is variance in Data Science?

Variance measures how the individual values in a data set spread out around the mean: it is the average of the squared differences between each value and the mean. Data scientists use variance to understand the distribution of a data set.

41. What is pruning in a decision tree algorithm?

In Data Science and Machine Learning, pruning is a technique related to decision trees. Pruning simplifies a decision tree by removing branches (rules) that add little predictive power, which reduces complexity and improves accuracy on unseen data. Reduced-error pruning and cost-complexity pruning are the most common types.

42. What is entropy in a decision tree algorithm?

Entropy is the measure of randomness or disorder in a group of observations. It determines how a decision tree chooses where to split the data, and it is also used to check the homogeneity of the given data. If the entropy is zero, the sample of data is entirely homogeneous, and if the entropy is one, the sample is equally divided between the classes.

43. What is information gain in a decision tree algorithm?

Information gain is the expected reduction in entropy, and it decides how the tree is built. For a parent node R and a set E of K training examples, it is calculated as the difference between the entropy before and after the split; the attribute with the highest information gain is chosen for the split, which makes the decision tree smarter.

44. What is k-fold cross-validation?

The k-fold cross validation is a procedure used to estimate the model's skill in new data. In k-fold cross validation, every observation from the original dataset may appear in the training and testing set. K-fold cross-validation estimates the accuracy but does not help you to improve the accuracy.

45. What is a normal distribution?

Normal Distribution is also known as the Gaussian Distribution. The normal distribution shows the data near the mean and the frequency of that particular data. When represented in graphical form, normal distribution appears like a bell curve. The parameters included in the normal distribution are Mean, Standard Deviation, Median etc.

46. What is Deep Learning?

Deep Learning is one of the essential areas of Data Science, building on statistics and machine learning. Its algorithms are loosely modelled on the human brain: multiple layers are stacked between the raw input and the output, and each layer extracts progressively higher-level features from the data.

47. What is an RNN (recurrent neural network)?

An RNN is an algorithm that works on sequential data. RNNs are used in language translation, voice recognition, image captioning, etc. There are different types of RNN architectures, such as one-to-one, one-to-many, many-to-one and many-to-many. RNNs are used in Google's voice search and Apple's Siri.

Basic Data Science Interview Questions

Let us begin with a few basic data science interview questions!

48. What are the feature vectors?

A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that's easy to analyze.

49. What are the steps in making a decision tree?

  • Take the entire data set as input.
  • Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
  • Apply the split to the input data (divide step).
  • Re-apply steps one and two to the divided data.
  • Stop when you meet any stopping criteria.
  • Clean up the tree if you went too far making splits; this step is called pruning.

50. What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if removing it from the problem-fault sequence prevents the final undesirable event from recurring.

51. What is logistic regression?

Logistic regression is also known as the logit model. It is a technique used to forecast the binary outcome from a linear combination of predictor variables.

52. What are recommender systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

53. Explain cross-validation.

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in backgrounds where the objective is to forecast and one wants to estimate how accurately a model will accomplish in practice. 

The goal of cross-validation is to term a data set to test the model in the training phase (i.e. validation data set) to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.

54. What is collaborative filtering?

Most recommender systems use this filtering process to find patterns and information by collaborating perspectives, numerous data sources, and several agents.

55. Do gradient descent methods always converge to similar points?

They do not, because in some cases, they reach a local minima or a local optima point. You would not reach the global optima point. This is governed by the data and the starting conditions.

56. What is the goal of A/B Testing?

This is statistical hypothesis testing for randomized experiments with two variables, A and B. The objective of A/B testing is to detect any changes to a web page to maximize or increase the outcome of a strategy.

57. What are the drawbacks of the linear model?

  • The assumption of linearity of the errors
  • It can't be used for count outcomes or binary outcomes
  • There are overfitting problems that it can't solve

58. What is the law of large numbers?

It is a theorem that describes the result of performing the same experiment very frequently. This theorem forms the basis of frequency-style thinking. It states that as the number of trials grows, the sample mean, sample variance, and sample standard deviation converge to the quantities they are trying to estimate.

59.  What are the confounding variables?

These are extraneous variables in a statistical model that correlates directly or inversely with both the dependent and the independent variable. The estimate fails to account for the confounding factor.

60. What is star schema?

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.

61. How regularly must an algorithm be updated?

You will want to update an algorithm when:

  • You want the model to evolve as data streams through infrastructure
  • The underlying data source is changing
  • There is a case of non-stationarity

62.  What are eigenvalue and eigenvector?

Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; they are used for understanding linear transformations.

Eigenvalues are the factors by which the transformation stretches or compresses the data along those directions. In data analysis, we usually calculate the eigenvectors (and eigenvalues) of a correlation or covariance matrix.

63. Why is resampling done?

Resampling is done in any of these cases:

  • Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points
  • Substituting labels on data points when performing significance tests
  • Validating models by using random subsets (bootstrapping, cross-validation)

64. What is selection bias?

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.

65. What are the types of biases that can occur during sampling?

  • Selection bias
  • Undercoverage bias
  • Survivorship bias

66. What is survivorship bias?

Survivorship bias is the logical error of focusing on aspects that support surviving a process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

67. How do you work towards a random forest?

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:

  • Build several decision trees on bootstrapped training samples of the data
  • Each time a split is considered on a tree, choose a random sample of m predictors as split candidates out of all p predictors
  • Rule of thumb: at each split, m = √p
  • Predictions: made by majority rule (for classification) or by averaging (for regression)
This exhaustive list is sure to strengthen your preparation for data science interview questions.

68. What is a bias-variance trade-off?

Bias: Bias is the error introduced in a model by oversimplifying the machine learning algorithm. The model makes overly simple assumptions at training time to make the target function easier to learn, which can lead to underfitting.

Some of the popular machine learning algorithms which are low on the bias scale are -

Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Decision Trees.

Algorithms that are high on the bias scale -

Logistic Regression and Linear Regression.

Variance: When a machine learning algorithm is very complex, the model learns even the noise in the training data set and therefore performs badly on the test data set. This error in a Machine Learning model is called Variance, and it can generate overfitting and hyper-sensitivity in Machine Learning models.

While trying to get over bias in our model, we try to increase the complexity of the machine learning algorithm. Though it helps in reducing the bias, after a certain point, it generates an overfitting effect on the model hence resulting in hyper-sensitivity and high variance.


Bias-Variance trade-off: To achieve the best performance, the main target of a supervised machine learning algorithm is to have low variance and low bias. Bias and variance are inversely related: an increase in bias results in a decrease in variance, and an increase in variance results in a decrease in bias, so the aim is to find the balance that minimises the total error.

The following things are observed regarding some of the popular machine learning algorithms -

  • The Support Vector Machine algorithm (SVM) has high variance and low bias. In order to change the trade-off, we can increase the parameter C. The C parameter results in a decrease in the variance and an increase in bias by influencing the margin violations allowed in training datasets.
  • Like the SVM, the K-Nearest Neighbors (KNN) Machine Learning algorithm has high variance and low bias. To change the trade-off of this algorithm, we can increase the number of neighbors K that influence the prediction, thus increasing the model bias (and lowering its variance).

69. Describe Markov chains.

A Markov chain is a process in which the probability of a future state depends only on the current state.

Markov chains belong to the category of stochastic processes.

A good example of a Markov chain is a word-recommendation system. The model recognises and recommends the next word based only on the immediately preceding word and nothing before that: it is trained on existing text and generates its suggestion for the current position from the previous word alone.

70. Why is R used in Data Visualization?

R is widely used in Data Visualizations for the following reasons-

  • We can create almost any type of graph using R.
  • R has multiple libraries like lattice, ggplot2, leaflet, etc., and so many inbuilt functions as well.  
  • It is easier to customize graphics in R compared to Python.
  • R is used in feature engineering and in exploratory data analysis as well.

71. What is the difference between a box plot and a histogram?

Both box plots and histograms visually represent the frequency of a feature's values.

Box plots are more often used when comparing several data sets; compared to histograms, they take less space and contain fewer details. Histograms are used to understand the probability distribution underlying a data set.

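A matplotlib sketch that draws both plots for the same made-up sample of heights:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(42).normal(loc=170, scale=10, size=500)   # made-up heights

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(data)          # summary view: median, quartiles, outliers
ax1.set_title("Box plot")
ax2.hist(data, bins=20)    # shape of the underlying distribution
ax2.set_title("Histogram")
plt.show()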

72. What does NLP stand for?

NLP is short for Natural Language Processing. It is the study of how computers are programmed to understand and learn from large amounts of textual data. A few popular NLP techniques are stemming, sentiment analysis, tokenization, and removal of stop words.

73. Difference between an error and a residual error

An error is the difference between an observed value and its true (population) value, which is generally unobservable because the true value is unknown. A residual is the difference between an observed value and the value predicted by the model, so it can be calculated directly from the sample.

74. Difference between Normalisation and Standardization

Normalisation (min-max scaling) rescales the values of a feature into a fixed range, usually [0, 1], using the minimum and maximum of the data. Standardization (z-score scaling) rescales the values so that they have a mean of 0 and a standard deviation of 1. Normalisation is bounded and more sensitive to outliers, while standardization is unbounded and is often preferred when the data are roughly Gaussian.
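A brief scikit-learn sketch of both transformations; the single column of values is made up:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])     # made-up values

print(MinMaxScaler().fit_transform(X))     # normalisation: rescaled to the [0, 1] range
print(StandardScaler().fit_transform(X))   # standardization: mean 0, standard deviation 1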

75. Difference between point estimates and confidence interval

Confidence Interval: A range of values likely containing the population parameter is given by the confidence interval. Further, it even tells us how likely that particular interval can contain the population parameter. The Confidence Coefficient (or Confidence level) is denoted by 1-alpha, which gives the probability or likeness. The level of significance is given by alpha. 

Point Estimates: An estimate of the population parameter is given by a particular value called the point estimate. Some popular methods used to derive Population Parameters’ Point estimators are - Maximum Likelihood estimator and the Method of Moments.


One-on-One Data Science Interview Questions

Cracking a data science interview is no walk in the park. It requires in-depth knowledge and expertise in various topics. Furthermore, the projects that you have worked on can significantly boost your potential in a lot of interviews. In order to help you with your interviews, we have compiled a set of questions for you to relate to. Since data science is an extensive field, there are no limitations on the type of questions that can be asked. With that being said, you can answer each of these questions depending on the projects you have worked on and the industries you have been in. Try to answer each one of these sample questions and then share your answer with us through the comments.

Pro Tip: No matter how basic a question may seem, always try to view it from a technical perspective and use each question to demonstrate your unique technical skills and abilities.

76. Which is your favorite machine learning algorithm and why?

One of the popular and versatile machine learning algorithms is the Random Forest. It's an ensemble method that combines multiple decision trees, providing high accuracy, handling both classification and regression tasks, and reducing overfitting. Its ability to handle large datasets and diverse feature types makes it a powerful choice in various applications.

77. Which according to you is the most important skill that makes a good data scientist?

The most important skill that makes a good data scientist is a strong foundation in statistics. Data scientists need to understand statistical concepts to analyze and interpret data accurately, draw meaningful insights, and make data-driven decisions. This skill allows them to select appropriate modeling techniques, handle uncertainty, and effectively communicate findings to stakeholders, ensuring the success of data-driven projects.

78. Why do you think data science is so popular today?

Data science is popular today due to the explosion of data and the potential to extract valuable insights from it. Organizations across various industries recognize the importance of data-driven decision-making to gain a competitive edge. Moreover, advancements in technology and accessible tools have made data science more approachable, attracting professionals from diverse backgrounds to harness data's power for innovation and problem-solving.

79. Explain the most challenging data science project that you worked on.

The most challenging data science project I encountered involved analyzing vast amounts of unstructured text data from various sources. Extracting meaningful insights required advanced natural language processing techniques, sentiment analysis, and topic modeling. Additionally, handling data quality issues and ensuring scalable processing posed significant hurdles. Collaborating with domain experts and iteratively refining models were crucial to deliver accurate and actionable results.

80. How do you usually prefer working on a project - individually, small team, or large team?

For projects, I can work individually, in small teams, or as part of larger teams. My adaptability allows me to contribute effectively in diverse settings, leveraging my capabilities to meet project requirements and deliver successful outcomes regardless of team size.

81. Based on your experience in the industry, tell me about your top 5 predictions for the next 10 years.

  • AI continues to evolve, becoming an integral part of daily life, driving innovation in healthcare, education, and transportation.
  • Climate change initiatives reshape economies, with renewable energy, electric vehicles, and sustainable practices mainstream.
  • Advances in biotech lead to breakthroughs in personalized medicine and anti-aging therapies.
  • Virtual and augmented reality become widespread, transforming entertainment, work, and social interactions.
  • Cryptocurrencies and blockchain technologies gain wide acceptance, changing the financial landscape.

82. What are some unique skills that you can bring to the team as a data scientist?

As a data scientist, I bring expert knowledge in machine learning, statistical modeling, and data visualization. My ability to translate complex data into actionable insights is valuable. I have proficiency in programming languages like Python, R, and SQL, crucial for data manipulation and analysis. Additionally, my experience with big data platforms and tools, along with strong problem-solving skills, uniquely position me to contribute.

83. Were you always in the data science field? If not, what made you change your career path and how did you upgrade your skills? 

No, I switched to the data science field recently due to the ever-increasing opportunities in the domain.

84. If we give you a random data set, how will you figure out whether it suits the business needs or not?

To ensure a random dataset suits business needs, first understand the business objectives and key performance indicators. Then, assess the dataset's relevance, quality, and completeness with respect to these objectives. If necessary, perform exploratory data analysis to uncover patterns or trends. Confirm that the dataset contains actionable insights that can drive business decisions.

85. Given a chance, if you could pick a career other than being a data scientist, what would you choose?

The role of a Data Engineer is a vital and rewarding profession. They are responsible for designing, building, and managing the data infrastructure. They create the architecture that enables data generation, processing, storage, and retrieval. Their work allows data scientists to perform analyses and make meaningful contributions.

86. Given the constant change in the data science field, how quickly can you adapt to new technologies?

I'm a keen learner and always ready to upskill. I think I will be able to adapt to new technologies in no time.

87. Have you ever been in a conflict with your colleagues regarding different strategies to go about a project? How were you able to resolve it?

Yes, I remember one instance; however, it was resolved quickly.

Resolving differences in strategies among colleagues requires:

  • Open Communication: Initiate a dialogue to understand each person's perspective.
  • Active Listening: Allow each colleague to express their views fully.
  • Find Common Ground: Identify shared goals or priorities that everyone agrees on.
  • Collaborative Decision-Making: Encourage participation in the decision-making process.
  • Compromise: Recognize that a perfect solution may not exist and compromise might be needed.
  • Feedback and Follow-up: Regularly review the strategy's progress and adjust as needed.

88. Can you break down an algorithm you have used on a recent project?

Yes, I can do that.

89. What tools did you use in your last project and why?

  • Programming Languages: Python, R, SQL for data manipulation and analysis.
  • Libraries: Pandas, NumPy, Scikit-learn for data processing and machine learning.
  • Visualization Tools: Matplotlib, Seaborn, Tableau for data visualization.
  • Big Data Platforms: Hadoop, Spark for handling large datasets.
  • Machine Learning Platforms: TensorFlow, PyTorch for creating ML models.
  • Notebooks: Jupyter, Google Colab for prototyping and sharing work.

90. What is your most favored strategy to clean a big data set and why?

My most favored strategy is iterative cleaning, where data is cleaned in stages or chunks, rather than all at once. This approach, often combined with automation tools, is efficient and manageable for large datasets. It allows for quality checks at each stage, minimizes the risk of data loss, and enables timely error detection.

91. Do you contribute to any open source projects?

I have contributed to open source projects in several ways:

  • Developing algorithms and models
  • Improving data processing
  • Visualizations and dashboard creation
  • Documentation and Tutorials
  • Bug fixing and feature requests

Are you looking forward to becoming a Data Science expert? This career guide is a perfect read to get you started in the thriving field of Data Science. Download the eBook now!

Stay Sharp With Our Data Science Interview Questions

For data scientists, the work isn't easy, but it's rewarding and there are plenty of available positions out there. These data science interview questions can help you get one step closer to your dream job. So, prepare yourself for the rigors of interviewing and stay sharp with the nuts and bolts of data science.

Simplilearn's comprehensive Post Graduate Program in Data Science, in partnership with Purdue University and in collaboration with IBM, will prepare you for one of the world's most exciting technology frontiers.


About the author.

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.




A selection of practice exams that will test your current data science knowledge. Identify key areas of improvement to strengthen your theoretical preparation, critical thinking, and practical problem-solving skills so you can get one step closer to realizing your professional goals.

Green cover of Excel Mechanics. This practice exam is from 365 Data Science.

Excel Mechanics

Imagine if you had to apply the same Excel formatting adjustment to both Sheet 1 and Sheet 2 (i.e., adjust font, adjust fill color of the sheets, add a couple of empty rows here and there) which contain thousands of rows. That would cost an unjustifiable amount of time. That is where advanced Excel skills come in handy as they optimize your data cleaning, formatting and analysis process and shortcut your way to a job well-done. Therefore, asses your Excel data manipulation skills with this free practice exam.  

Green cover of Formatting Excel Spreadsheets. This practice exam is from 365 Data Science.

Formatting Excel Spreadsheets

Did you know that more than 1 in 8 people on the planet uses Excel and that Office users typically spend a third of their time in Excel. But how many of them use the popular spreadsheet tool efficiently? Find out where you stand in your Excel skills with this free practice exam where you are a first-year investment banking analyst at one of the top-tier banks in the world. The dynamic nature of your position will test your skills in quick Excel formatting and various Excel shortcuts 

Green cover of Hypothesis Testing. This practice exam is from 365 Data Science.

Hypothesis Testing

Whenever we need to verify the results of a test or experiment we turn to hypothesis testing. In this free practice exam you are a data analyst at an electric car manufacturer, selling vehicles in the US and Canada. Currently the company offers two car models – Apollo and SpeedX.  You will need to download a free Excel file containing the car sales of the two models over the last 3 years in order find out interesting insights and  test your skills in hypothesis testing. 

Green cover of Confidence Intervals. This practice exam is from 365 Data Science.

Confidence Intervals

Confidence Intervals refers to the probability of a population parameter falling between a range of certain values. In this free practice exam, you lead the research team at a portfolio management company with over $50 billion dollars in total assets under management. You are asked to compare the performance of 3 funds with similar investment strategies  and are given a table with the return of the three portfolios over the last 3 years. You will have to use the data to answer questions that will test your knowledge in confidence intervals. 

Green cover of Fundamentals of Inferential Statistics. This practice exam is from 365 Data Science.

Fundamentals of Inferential Statistics

While descriptive statistics helps us describe and summarize a dataset, inferential statistics allows us to make predictions based off data. In this free practice exam, you are a data analyst at a leading statistical research company. Much of your daily work relates to understanding data structures and processes, as well as applying analytical theory to real-world problems on large and dynamic datasets. You will be given an excel dataset and will be tested on normal distribution, standardizing a dataset, the Central Limit Theorem among other inferential statistics questions.   

Green cover of Fundamentals of Descriptive Statistics. This practice exam is from 365 Data Science.

Fundamentals of Descriptive Statistics

Descriptive statistics helps us understand the actual characteristics of a dataset by generating summaries about data samples. The most popular types of descriptive statistics are measures of center: median, mode and mean. In this free practice exam you have been appointed as a Junior Data Analyst at a property developer company in the US, where you are asked to evaluate the renting prices in 9 key states. You will work with a free excel dataset file that contains the rental prices and houses over the last years.

Yellow Cover of Jupyter Notebook Shortcuts. This practice exam is from 365 Data Science.

Jupyter Notebook Shortcuts

In this free practice exam you are an experienced university professor in Statistics who is looking to upskill in data science and has joined the data science apartment. As on of the most popular coding environments for Python, your colleagues recommend you learn Jupyter Notebook as a beginner data scientist. Therefore, in this quick assessment exam you are going to be tested on some basic theory regarding Jupyter Notebook and some of its shortcuts which will determine how efficient you are at using the environment. 

Yellow cover of Intro to Jupyter Notebooks. This practice exam is from 365 Data Science.

Intro to Jupyter Notebooks

Jupyter is a free, open-source interactive web-based computational notebook. As one of the most popular coding environments for Python and R, you are inevitably  going to encounter Jupyter at some point in you data science journey, if you have not already. Therefore, in this free practice exam you are a professor of Applied Economics and Finance who is learning how to use Jupyter. You are going to be tested on the very basics of the Jupyter environment like how to set up the environment and some Jupyter keyboard shortcuts. 

Yellow cover of Black-Scholes-Merton Model in Python. This practice exam is from 365 Data Science.

Black-Scholes-Merton Model in Python

The Black Scholes formula is one of the most popular financial instruments used in the past 40 years. Derived by Fisher, Black Myron Scholes and Robert Merton in 1973, it has become the primary tool for derivative pricing. In this free practice exam, you are a finance student whose Applied Finance is approaching and is asked to perform the Black-Scholes-Merton formula in Python  by working on a dataset containing Tesla’s stock prices for the period between mid-2010 and mid-2020.  

Yellow cover of Python for Financial Analysis. This practice exam is from 365 Data Science.

Python for Financial Analysis

In a heavily regulated industry like fintech, simplicity and efficiency is key. Which is why Python is the preferred choice for programming language over the likes of Java or C++. In this free practice exam you are a university professor of Applied Economics and Finance, who is focused on running regressions and applying the CAPM model on the NASDAQ and The Coca-Cola Company Dataset for the period between 2016 and 2020 inclusive. Make sure to have the following packages running to complete your practice test: pandas, numpy, api, scipy, and pyplot as plt. 

Yellow cover of Python Finance. This practice exam is from 365 Data Science.

Python Finance

Python has become the ideal programming language for the financial industry, as more and more hedge funds and large investment banks are adopting this general multi-purpose language to solve their quantitative problems. In this free practice exam on Python Finance, you are part of the IT team of a huge company, operating in the US stock market, where you are asked to analyze the performance of three market indices. The packages you need to have running are numpy, pandas and pyplot as plt.   

Yellow cover of Machine Learning with KNN. This template resource is from 365 Data Science.

Machine Learning with KNN

KNN is a popular supervised machine learning algorithm that is used for solving both classification and regression problems. In this free practice exam, this is exactly what you are going to be asked to do, as you are required to create 2 datasets for 2 car dealerships in Jupyter Notebook, fit the models to the training data, find the set of parameters that best classify a car, construct a confusion matrix and more.

Green cover of Excel Functions. This practice exam is from 365 Data Science.

Excel Functions

The majority of data comes in spreadsheet format, making Excel the #1 tool of choice for professional data analysts. The ability to work effectively and efficiently in Excel is highly desirable for any data practitioner who is looking to bring value to a company. As a matter of fact, being proficient in Excel has become the new standard, as 82% of middle-skill jobs require competent use of the productivity software. Take this free Excel Functions practice exam and test your knowledge on removing duplicate values, transferring data from one sheet to another, rand using the VLOOKUP and SUMIF function.

Green Cover of Useful Tools in Excel. This practice exam is from 365 Data Science.

Useful Tools in Excel

What Excel lacks in data visualization tools compared to Tableau, or computational power for analyzing big data compared to Python, it compensates with accessibility and flexibility. Excel allows you to quickly organize, visualize and perform mathematical functions on a set of data, without the need for any programming or statistical skills. Therefore, it is in your best interest to learn how to use the various Excel tools at your disposal. This practice exam is a good opportunity to test your excel knowledge in the text to column functions, excel macros, row manipulation and basic math formulas.

Green Cover of Excel Basics. This practice exam is from 365 Data Science.

Excel Basics

Ever since its first release in 1985, Excel has remained the most popular spreadsheet application, with approximately 750 million users worldwide, thanks to its flexibility and ease of use. Whether you are a data scientist or not, knowing how to use Excel will greatly improve and optimize your workflow. In this free Excel Basics practice exam, you work with a dataset from a company in the fast-moving consumer goods sector as an aspiring data analyst and test your knowledge of basic Excel functions and shortcuts.


A/B Testing for Social Media

In this free A/B Testing for Social Media practice exam, you are an experienced data analyst working at a new social media company called FilmIt. You are tasked with increasing user engagement by applying the correct modifications to how users move on to the next video. You decide that the best approach is to conduct an A/B test in a controlled environment. To complete this task successfully, you are tested on statistical significance, two-tailed tests, and choosing the right success metrics.


Fundamentals of A/B Testing

A/B testing is a powerful statistical tool used to compare the results between two versions of the same marketing asset, such as a webpage or email, in a controlled environment. An example of A/B testing is when Electronic Arts created a variant of the sales page for the popular SimCity 5 simulation game, which performed 40% better than the control page. Speaking of video games, in this free practice test you are a data analyst tasked with conducting A/B testing for a game developer. You are asked to choose the best way to perform an A/B test, identify the null hypothesis, choose the right evaluation metrics, and ultimately increase revenue through in-game ads.


Introduction to Data Science Disciplines

The term “Data Science” dates back to the 1960s, to describe the emerging field of working with large amounts of data that drives organizational growth and decision-making. While the essence has remained the same, the data science disciplines have changed a lot over the past decades thanks to rapid technological advancements. In this free introduction to data science practice exam, you will test your understanding of the modern day data science disciplines and their role within an organization.


Advanced SQL

In this free Advanced SQL practice exam you are a sophomore Business student who has decided to focus on improving your coding and analytical skills in the areas of relational database management systems. You are given an employee dataset containing information like titles, salaries, birth dates and department names, and are required to come up with the correct answers. This free SQL practice test will evaluate your knowledge on MySQL aggregate functions , DML statements (INSERT, UPDATE) and other advanced SQL queries.


Data Science Interview Questions and Answers

What is Data Science?

Basic Data Science Interview Questions for Freshers

  • Q.1 What is marginal probability?
  • Q.2 What are the probability axioms?
  • Q.3 What is conditional probability?
  • Q.4 What is Bayes' Theorem and when is it used in data science?
  • Q.5 Define variance and conditional variance.
  • Q.6 Explain the concepts of mean, median, mode, and standard deviation.
  • Q.7 What is the normal distribution and standard normal distribution?
  • Q.8 What is SQL, and what does it stand for?
  • Q.9 Explain the differences between SQL and NoSQL databases.
  • Q.10 What are the primary SQL database management systems (DBMS)?
  • Q.11 What is the ER model in SQL?
  • Q.12 What is data transformation?
  • Q.13 What are the main components of a SQL query?
  • Q.14 What is a primary key?
  • Q.15 What is the purpose of the GROUP BY clause, and how is it used?
  • Q.16 What is the WHERE clause used for, and how is it used to filter data?
  • Q.17 How do you retrieve distinct values from a column in SQL?
  • Q.18 What is the HAVING clause?
  • Q.19 How do you handle missing or NULL values in a database table?
  • Q.20 What is the difference between supervised and unsupervised machine learning?
  • Q.21 What is linear regression, and what are the different assumptions of linear regression algorithms?
  • Q.22 Logistic regression is a classification technique, so why is it called regression and not logistic classification?
  • Q.23 What is the logistic function (sigmoid function) in logistic regression?
  • Q.24 What is overfitting and how can it be overcome?
  • Q.25 What is a Support Vector Machine (SVM), and what are its key components?
  • Q.26 Explain the k-Nearest Neighbors (KNN) algorithm.
  • Q.27 What is the Naive Bayes algorithm, and what are the different assumptions of Naive Bayes?
  • Q.28 What are decision trees, and how do they work?
  • Q.29 Explain the concepts of entropy and information gain in decision trees.
  • Q.30 What is the difference between the bagging and boosting model?
  • Q.31 Describe random forests and their advantages over single decision trees.
  • Q.32 What is K-Means, and how does it work?
  • Q.33 What is a confusion matrix? Explain with an example.
  • Q.34 What is a classification report? Explain the parameters used to interpret the results of classification tasks, with an example.

Intermediate Data Science Interview Questions

  • Q.35 Explain the uniform distribution.
  • Q.36 Describe the Bernoulli distribution.
  • Q.37 What is the binomial distribution?
  • Q.38 Explain the exponential distribution and where it's commonly used.
  • Q.39 Describe the Poisson distribution and its characteristics.
  • Q.40 Explain the t-distribution and its relationship with the normal distribution.
  • Q.41 Describe the chi-squared distribution.
  • Q.42 What is the difference between z-test, F-test, and t-test?
  • Q.43 What is the Central Limit Theorem, and why is it significant in statistics?
  • Q.44 Describe the process of hypothesis testing, including null and alternative hypotheses.
  • Q.45 How do you calculate a confidence interval, and what does it represent?
  • Q.46 What is a p-value in statistics?
  • Q.47 Explain Type I and Type II errors in hypothesis testing.
  • Q.48 What is the significance level (alpha) in hypothesis testing?
  • Q.49 How can you calculate the correlation coefficient between two variables?
  • Q.50 What is covariance, and how is it related to correlation?
  • Q.51 Explain how to perform a hypothesis test for comparing two population means.
  • Q.52 Explain the concept of normalization in database design.
  • Q.53 What is database denormalization?
  • Q.54 Define different types of SQL functions.
  • Q.55 Explain the difference between INNER JOIN and LEFT JOIN.
  • Q.56 What is a subquery, and how can it be used in SQL?
  • Q.57 How do you perform mathematical calculations in SQL queries?
  • Q.58 What is the purpose of the CASE statement in SQL?
  • Q.59 What is the difference between a database and a data warehouse?
  • Q.60 What is regularization in machine learning? State the differences between L1 and L2 regularization.
  • Q.61 Explain the concept of the bias-variance trade-off in machine learning.
  • Q.62 How do we choose the appropriate kernel function in SVM?
  • Q.63 How does Naive Bayes handle categorical and continuous features?
  • Q.64 What is Laplace smoothing (add-one smoothing) and why is it used in Naive Bayes?
  • Q.65 What are imbalanced datasets and how can we handle them?
  • Q.66 What are outliers in the dataset and how can we detect and remove them?
  • Q.67 What is the curse of dimensionality and how can we overcome it?
  • Q.68 How does the random forest algorithm handle feature selection?
  • Q.69 What is feature engineering? Explain the different feature engineering methods.
  • Q.70 How will we deal with categorical text values in machine learning?
  • Q.71 What is DBSCAN and how will we use it?
  • Q.72 How does the EM (Expectation-Maximization) algorithm work in clustering?
  • Q.73 Explain the concept of silhouette score in clustering evaluation.
  • Q.74 What is the relationship between eigenvalues and eigenvectors in PCA?
  • Q.75 What is the cross-validation technique in machine learning?
  • Q.76 What are ROC and AUC? Explain their significance in binary classification.
  • Q.77 Describe gradient descent and its role in optimizing machine learning models.
  • Q.78 Describe batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
  • Q.79 Explain Apriori association rule mining.

Data Science Interview Questions for Experienced

  • Q.80 Explain multivariate distribution in data science.
  • Q.81 Describe the concept of conditional probability density function (PDF).
  • Q.82 What is the cumulative distribution function (CDF), and how is it related to PDF?
  • Q.83 What is ANOVA? What are the different ways to perform ANOVA tests?
  • Q.84 How can you prevent gradient descent from getting stuck in local minima?
  • Q.85 Explain the gradient boosting algorithms in machine learning.
  • Q.86 Explain the convolution operation in CNN architecture.
  • Q.87 What is a feed-forward network and how is it different from a recurrent neural network?
  • Q.88 Explain the difference between generative and discriminative models.
  • Q.89 What are forward and backward propagation in deep learning?
  • Q.90 Describe the use of Markov models in sequential data analysis.
  • Q.91 What is generative AI?
  • Q.92 What are the different neural network architectures used to generate artificial data in deep learning?
  • Q.93 What is the deep reinforcement learning technique?
  • Q.94 What is transfer learning, and how is it applied in deep learning?
  • Q.95 What is the difference between object detection and image segmentation?
  • Q.96 Explain the concept of word embeddings in natural language processing (NLP).
  • Q.97 What is a seq2seq model?
  • Q.98 What are artificial neural networks?
  • Q.99 What is marginal probability?
  • Q.100 What are the probability axioms?



We all know that data science is a field where data scientists mine raw data, analyze it, and extract useful insights from it. This article outlines the questions most frequently asked during data science interviews. Practising the questions below will help you prepare for a career as a data scientist.

What is Data Science?

Data science is a field that extracts knowledge and insights from structured and unstructured data by using scientific methods, algorithms, processes, and systems. It combines expertise from various domains, such as statistics, computer science, machine learning, data engineering, and domain-specific knowledge, to analyze and interpret complex data sets.

Furthermore, data scientists typically work with a combination of languages such as Python and R, and rely on data analysis and machine learning libraries like pandas, NumPy, and scikit-learn.

After exploring the brief of data science, let’s dig into the data science interview questions and answers.

A key idea in statistics and probability theory is marginal probability, which is also known as marginal distribution. With reference to a certain variable of interest, it is the likelihood that an event will occur, without taking into account the results of other variables. Basically, it treats the other variables as if they were “marginal” or irrelevant and concentrates on one.

Marginal probabilities are essential in many statistical analyses, including estimating anticipated values, computing conditional probabilities, and drawing conclusions about certain variables of interest while taking other variables’ influences into account.

The fundamental rules that control the behaviour and characteristics of probabilities in probability theory and statistics are referred to as the probability axioms, sometimes known as the probability laws or probability principles.

There are three fundamental axioms of probability:

  • Non-Negativity Axiom: the probability of any event is greater than or equal to zero, P(A) \geq 0
  • Normalization Axiom: the probability of the entire sample space is one, P(S) = 1
  • Additivity Axiom: for mutually exclusive events, P(A \cup B) = P(A) + P(B)

The probability of an event or outcome occurring given that a prior event or outcome has already occurred is known as conditional probability. It is computed as the probability of both events occurring together divided by the probability of the conditioning event: P(A|B) = \frac{P(A \cap B)}{P(B)}.

Q.4 What is Bayes’ Theorem and when is it used in data science?

Bayes' theorem gives the probability of an event based on prior knowledge of conditions related to that event, and is a direct consequence of conditional probability. It is sometimes called the formula for the "probability of causes" and is written as P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}.

In data science, Bayes’ Theorem is used primarily in:

  • Bayesian Inference
  • Text Classification
  • Medical Diagnosis
  • Predictive Modeling

When working with ambiguous or sparse data, Bayes’ Theorem is very helpful since it enables data scientists to continually revise their assumptions and come to more sensible conclusions.
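
As a quick illustration of how the theorem updates a prior belief with new evidence, here is a minimal Python sketch; the prevalence and test-accuracy figures below are hypothetical:

```python
# Hypothetical example: probability a patient has a disease given a positive test.
p_disease = 0.01              # P(D): prior probability of the disease
p_pos_given_disease = 0.95    # P(+|D): test sensitivity
p_pos_given_healthy = 0.05    # P(+|not D): false positive rate

# Total probability of a positive test, P(+)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D) * P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161: still fairly unlikely despite the positive test
```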

A statistical concept known as variance quantifies the spread or dispersion of a group of data points within a dataset. It sheds light on how widely individual data points depart from the dataset’s mean (average). It assesses the variability or “scatter” of data.

Conditional Variance

As the name implies, conditional variance is a measure of the dispersion or variability of a random variable under certain circumstances or in the presence of a particular event. It is the variance of one random variable computed conditional on the known value of another, written \mathrm{Var}(Y \mid X = x).

Mean:  The mean, often referred to as the average, is calculated by summing up all the values in a dataset and then dividing by the total number of values.

Median:  When data are sorted in either ascending or descending order, the median is the value in the middle of the dataset. The median is the average of the two middle values when the number of data points is even. In comparison to the mean, the median is less impacted by extreme numbers, making it a more reliable indicator of central tendency.

Mode:  The value that appears most frequently in a dataset is the mode. One mode (unimodal), several modes (multimodal), or no mode (if all values occur with the same frequency) can all exist in a dataset.

Standard deviation : The spread or dispersion of data points in a dataset is measured by the standard deviation. It is the square root of the variance and is expressed in the same units as the data.

The normal distribution, also known as the Gaussian distribution or bell curve, is a continuous probability distribution that is characterized by its symmetric bell-shaped curve. The normal distribution is defined by two parameters: the mean ( μ ) and the standard deviation ( σ ). The mean determines the center of the distribution, and the standard deviation determines the spread or dispersion of the distribution. The distribution is symmetric around its mean, and the bell curve is centered at the mean. The probabilities for values that are further from the mean taper off equally in both directions. Similar rarity applies to extreme values in the two tails of the distribution. Not all symmetrical distributions are normal, even though the normal distribution is symmetrical.

The standard normal distribution, also known as the Z distribution, is a special case of the normal distribution where the mean ( μ ) is 0 and the standard deviation ( σ ) is 1. It is a standardized form of the normal distribution, allowing for easy comparison of scores or observations from different normal distributions.
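
A small sketch with scipy.stats (the numbers are purely illustrative) shows how a normal distribution relates to its standardized form:

```python
from scipy.stats import norm

# Standard normal distribution: mean 0, standard deviation 1
print(norm.cdf(1.96))       # ~0.975: P(Z <= 1.96)
print(norm.ppf(0.975))      # ~1.96: the 97.5th percentile

# Any normal distribution can be standardized with z = (x - mu) / sigma
mu, sigma, x = 100, 15, 130
z = (x - mu) / sigma
print(norm.cdf(x, loc=mu, scale=sigma), norm.cdf(z))  # both ~0.9772
```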

Q.8 What is SQL, and what does it stand for?

SQL stands for Structured Query Language. It is a specialized programming language used for managing and manipulating relational databases, designed for tasks related to database management, data retrieval, data manipulation, and data definition.

Both SQL (Structured Query Language) and NoSQL (Not Only SQL) databases differ in their data structures, schemas, query languages, and use cases. The main differences are that SQL databases are relational, store data in tables with a fixed schema, and support ACID transactions, whereas NoSQL databases are non-relational, use flexible schemas (document, key-value, column-family, or graph models), and are designed to scale horizontally.

Relational database systems, both open source and commercial, are the main SQL (Structured Query Language) database management systems (DBMS), which are widely used for managing and processing structured data. Some of the most popular SQL database management systems are listed below:

  • Microsoft SQL Server
  • Oracle Database

The structure and relationships between the data entities in a database are represented by the Entity-Relationship (ER) model, a conceptual framework used in database architecture. The ER model is frequently used in conjunction with SQL for creating the structure of relational databases even though it is not a component of the SQL language itself.

The process of transforming data from one structure, format, or representation into another is referred to as data transformation. In order to make the data more suited for a given goal, such as analysis, visualisation, reporting, or storage, this procedure may involve a variety of actions and changes to the data. Data integration, cleansing, and analysis depend heavily on data transformation, which is a common stage in data preparation and processing pipelines.

A relational database’s data can be retrieved, modified, or managed via a SQL (Structured Query Language) query. The operation of a SQL query is defined by a number of essential components, each serving a different function, such as SELECT (the columns to return), FROM (the tables involved), WHERE (row filters), GROUP BY, HAVING, and ORDER BY.

A primary key is a column (or combination of columns) in a relational database table whose value uniquely identifies each record. Every row must have a primary key value, that value must be unique across the table, and it cannot be NULL.

In SQL, the GROUP BY clause is used to create summary rows out of rows that have the same values in a set of specified columns. It is frequently used in conjunction with aggregate functions like SUM, COUNT, AVG, MAX, or MIN in order to perform computations on groups of rows rather than individual rows. With the GROUP BY clause we can produce summary reports and perform more in-depth data analysis.

In SQL, the WHERE clause is used to filter rows from a table or result set according to predetermined criteria. It enables us to pick only the rows that satisfy particular requirements or follow a pattern. A key element of SQL queries, the WHERE clause is frequently used for data retrieval and manipulation.

Using the DISTINCT keyword in combination with the SELECT command, we can extract distinct values from a column in SQL. The DISTINCT keyword filters out duplicate values and returns only the unique values from the specified column.

To filter query results depending on the output of aggregation functions, the HAVING clause, a SQL clause, is used along with the GROUP BY clause. The HAVING clause filters groups of rows after they have been grouped by one or more columns, in contrast to the WHERE clause, which filters rows before they are grouped.
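
These clauses can be tried end-to-end from Python with the built-in sqlite3 module; the table and values below are made up for illustration:

```python
import sqlite3

# Minimal in-memory example (hypothetical table and values) showing GROUP BY with HAVING.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("alice", 80.0), ("bob", 40.0), ("carol", 300.0)],
)

# Total spend per customer, keeping only customers whose total exceeds 100
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer HAVING SUM(amount) > 100"
).fetchall()
print(rows)  # e.g. [('alice', 200.0), ('carol', 300.0)]
conn.close()
```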

Missing or NULL values can arise due to various reasons, such as incomplete data entry, optional fields, or data extraction processes.

  • Replace NULL with Placeholder Values
  • Handle NULL Values in Queries
  • Use Default Values

The differences between supervised and unsupervised learning are as follows: supervised learning trains a model on labelled data to predict a known target (for example, classification and regression), whereas unsupervised learning finds structure in unlabelled data without a target variable (for example, clustering and dimensionality reduction).

Linear Regression – It is a type of supervised learning in which we model a linear relationship between the predictor (independent) variables and the response (dependent) variable. It is based on the linear equation:

\hat{y} = \beta_1 x + \beta_0

There are 4 assumptions we make about a Linear regression problem:

  • Linear relationship: there is a linear relationship between the predictor and the response variable, i.e., as the predictor variable changes, the response variable changes linearly (either increases or decreases).
  • Normality: the dataset (more precisely, the residuals) is normally distributed, i.e., symmetric about the mean.
  • Independence: the features are independent of each other; there is no correlation among the predictor variables of the dataset.
  • Homoscedasticity: the residuals have equal (constant) variance across all values of the predictor variables.
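
A minimal scikit-learn sketch, fitting the equation above on a small made-up dataset, might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset: y is roughly 2*x + 1 plus a little noise
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # estimated slope (~2) and intercept (~1)
print(model.predict([[6.0]]))             # prediction for an unseen value
```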

While logistic regression is used for classification, it still maintains a regression structure underneath. The key idea is to model the probability of an event occurring (e.g., class 1 in binary classification) using a linear combination of features, and then apply a logistic (Sigmoid) function to transform this linear combination into a probability between 0 and 1. This transformation is what makes it suitable for classification tasks.

In essence, while logistic regression is indeed used for classification, it retains the mathematical and structural characteristics of a regression model, hence the name.

Sigmoid Function:  It is a mathematical function characterized by its S-shaped curve. The sigmoid function squashes any real number into a value between 0 and 1, which is why it is also called a squashing function. It is given as:

\sigma(x) = \frac{1}{1 + e^{-x}}

Some of the properties of the sigmoid function are:

  • Range: (0, 1) – the output never actually reaches 0 or 1
  • Monotonically increasing, with an output of 0.5 at x = 0
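
A minimal NumPy sketch of the sigmoid function and its squashing behaviour:

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0))                      # 0.5
print(sigmoid(np.array([-5, 0, 5])))   # ~[0.0067, 0.5, 0.9933]
```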

Overfitting occurs when a model fits the training data so closely, including its noise, that it fails to generalize to unseen/future data. This typically happens when the model is trained on noisy data and learns the noise as if it were signal.

To avoid overfitting and overcome this problem in machine learning, one can use the following techniques:

  • Feature selection :  Sometimes the training data has too many features which might not be necessary for our problem statement. In that case, we use only the necessary features that serve our purpose
  • Cross Validation :  This technique is a very powerful method to overcome overfitting. In this, the training dataset is divided into a set of mini training batches, which are used to tune the model.
  • Regularization :  Regularization is the technique to supplement the loss with a penalty term so as to reduce overfitting. This penalty term regulates the overall loss function, thus creating a well trained model.
  • Ensemble models :  These models learn the features and combine the results from different training models into a single prediction.

Support Vector Machines are a type of supervised algorithm that can be used for both regression and classification problems. In SVMs, the main goal is to find a hyperplane that separates the data points into classes; any new data point is then classified based on this hyperplane.

Support Vector Machines are highly effective in high-dimensional spaces and can handle non-linear data very well. However, if the number of features is much greater than the number of data samples, they are susceptible to overfitting.

The key components of SVM are:

  • Kernel function: a mapping function that projects data points into a higher-dimensional feature space.
  • Hyperplane: the decision boundary used to separate the classes of data points.
  • Margin: the distance between the hyperplane and the nearest support vectors.
  • C: a regularization parameter that balances margin maximization against misclassification minimization.
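
A minimal scikit-learn sketch on synthetic data, showing the kernel and the C parameter in use:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel with regularization parameter C
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(clf.score(X_test, y_test))        # accuracy on held-out data
print(clf.support_vectors_.shape)       # the support vectors found
```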

The k-Nearest Neighbors (KNN) algorithm is a simple and versatile supervised machine learning algorithm used for both  classification and regression  tasks. KNN makes predictions by memorizing the data points rather than building a model about it. This is why it is also called “ lazy learner ” or “ memory based ” model too.

KNN relies on the principle that similar data points tend to belong to the same class or have similar target values. This means that, In the training phase, KNN stores the entire dataset consisting of feature vectors and their corresponding class labels (for classification) or target values (for regression). It then calculates the distances between that point and all the points in the training dataset. (commonly used distance metrics are Euclidean distance and Manhattan distance).

(Note : Choosing an appropriate value for k is crucial. A small k may result in noisy predictions, while a large k can smooth out the decision boundaries. The choice of distance metric and feature scaling also impact KNN’s performance.)
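
A minimal scikit-learn sketch of KNN classification; the dataset and the value of k are just illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5 neighbours with Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))        # accuracy on held-out data
print(knn.predict(X_test[:3]))          # class labels for three samples
```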

Q.27 What is the Naïve Bayes algorithm, what are the different assumptions of Naïve Bayes?

The Naïve Bayes algorithm is a probabilistic classification algorithm based on Bayes’ theorem with a “naïve” assumption of feature independence within each class. It is commonly used for both binary and multi-class classification tasks, particularly in situations where simplicity, speed, and efficiency are essential.

The main assumptions that Naïve Bayes theorem makes are:

  • Feature independence  – It assumes that the features involved in Naïve Bayes algorithm are conditionally independent, i.e., the presence/ absence of one feature does not affect any other feature
  • Equality  – This assumes that the features are equal in terms of importance (or weight).
  • Normality – For continuous features it assumes that the feature distribution within each class is normal, i.e., the data is distributed symmetrically around its mean (this is the Gaussian Naïve Bayes variant).
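
A minimal scikit-learn sketch of Gaussian Naïve Bayes on an illustrative dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian Naive Bayes: assumes each feature is normally distributed within a class
nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))          # accuracy
print(nb.predict_proba(X_test[:2]))      # class posterior probabilities
```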

Decision trees are a popular machine learning algorithm used for both classification and regression tasks. They work by creating a tree-like structure of decisions based on input features to make predictions or decisions. Let's dive briefly into their core concepts and how they work:

  • Decision trees consist of nodes and edges.
  • The tree starts with a root node and branches into internal nodes that represent features or attributes.
  • These nodes contain decision rules that split the data into subsets.
  • Edges connect nodes and indicate the possible decisions or outcomes.
  • Leaf nodes represent the final predictions or decisions.

Decision-Tree

The objective is to increase data homogeneity, which is often measured using standards like mean squared error (for regression) or Gini impurity (for classification). Decision trees can handle a variety of attributes and can effectively capture complex data relationships. They can, however, overfit, especially when deep or complex. To reduce overfitting, strategies like pruning and restricting tree depth are applied.
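
A minimal scikit-learn sketch of a depth-limited decision tree (the dataset and depth are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting depth is one simple way to reduce overfitting
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))    # accuracy on held-out data
print(tree.get_depth())              # actual depth of the fitted tree
```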

Entropy : Entropy is the measure of randomness or impurity in our dataset. For a node whose classes occur with proportions p_i, it is given as:

E = -\sum_{i} p_i \log_2 p_i

Information gain : It is defined as the reduction in entropy achieved by splitting the data on a feature. When a split produces more than one child node, the weighted average of the children's entropies is subtracted from the parent's entropy:

IG = E_{parent} - \sum_{j} w_j \, E_{child_j}

where E is the entropy and w_j is the fraction of samples that fall into child node j.
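
A small NumPy sketch, using hypothetical node labels, of how entropy and information gain are computed for a candidate split:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # hypothetical node labels
left, right = parent[:4], parent[4:]          # a candidate split

# Information gain = parent entropy - weighted child entropy
weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(entropy(parent), entropy(parent) - weighted_child)
```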

Random Forests are an ensemble learning technique that combines multiple decision trees to improve predictive accuracy and reduce overfitting. The advantages it has over single decision trees are:

  • Improved Generalization : Single decision trees are prone to overfitting, especially when they become deep and complex. Random Forests mitigate this issue by averaging predictions from multiple trees, resulting in a more generalized model that performs better on unseen data
  • Better Handling of High-Dimensional Data :  Random Forests are effective at handling datasets with a large number of features. They select a random subset of features for each tree, which can improve the performance when there are many irrelevant or noisy features
  • Robustness to Outliers:  Random Forests are more robust to outliers because they combine predictions from multiple trees, which can better handle extreme cases
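
A minimal scikit-learn sketch of a random forest, also printing the per-feature importance scores:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of 100 trees, each trained on a bootstrap sample with feature bagging
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(rf.score(X_test, y_test))       # accuracy
print(rf.feature_importances_)        # per-feature importance scores
```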

K-Means is an unsupervised machine learning algorithm used for clustering or grouping similar data points together. It aims to partition a dataset into K clusters, where each cluster represents a group of data points that are close to each other in terms of some similarity measure. K-Means works as follows (a minimal sketch appears after the steps):

  • Choose the number of clusters K
  • For each data point in the dataset, calculate its distance to each of the K centroids and then assign each data point to the cluster whose centroid is closest to it
  • Recalculate the centroids of the K clusters based on the current assignment of data points.
  • Repeat the assignment and update steps until the centroids stop changing (i.e., the algorithm converges).
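
The minimal scikit-learn sketch referenced above, run on a small made-up dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three small, well-separated blobs (made-up data)
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1],
              [4, 15], [4.1, 14.8], [3.9, 15.2]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                 # cluster assignment for each point
print(kmeans.cluster_centers_)        # final centroids
print(kmeans.predict([[8.1, 8.0]]))   # assign a new point to a cluster
```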

Confusion matrix is a table used to evaluate the performance of a classification model by presenting a comprehensive view of the model’s predictions compared to the actual class labels. It provides valuable information for assessing the model’s accuracy, precision, recall, and other performance metrics in a binary or multi-class classification problem.

A famous example demonstration would be Cancer Confusion matrix:

  • TP (True Positive) = The number of instances correctly predicted as the positive class
  • TN (True Negative) = The number of instances correctly predicted as the negative class
  • FP (False Positive) = The number of instances incorrectly predicted as the positive class
  • FN (False Negative) = The number of instances incorrectly predicted as the negative class
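
A minimal scikit-learn sketch with hypothetical labels, producing the 2×2 matrix described above:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical cancer-screening labels: 1 = cancer, 0 = healthy
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```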

A classification report is a summary of the performance of a classification model, providing various metrics that help assess the quality of the model’s predictions on a classification task.

The parameters used in a classification report typically include:

  • Precision : the ratio of true positive predictions to the total predicted positives, TP / (TP + FP). It measures the accuracy of the positive predictions made by the model.
  • Recall (Sensitivity or True Positive Rate) : the ratio of true positive predictions to the total actual positives, TP / (TP + FN). It measures the model's ability to identify all positive instances correctly.
  • Accuracy : the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances, (TP + TN) / (TP + TN + FP + FN). It measures the overall correctness of the model's predictions.
  • F1-Score : the harmonic mean of precision and recall, 2 × (Precision × Recall) / (Precision + Recall). It provides a balanced measure of both and is particularly useful when dealing with imbalanced datasets.

Here TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
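
A minimal scikit-learn sketch that prints these metrics for the same hypothetical predictions used in the confusion-matrix sketch above:

```python
from sklearn.metrics import classification_report

# Same hypothetical predictions as in the confusion-matrix sketch
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Prints precision, recall, F1-score and support for each class, plus averages
print(classification_report(y_true, y_pred))
```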

A fundamental probability distribution in statistics is the uniform distribution, commonly referred to as the rectangle distribution. A constant probability density function (PDF) across a limited range characterises it. In simpler terms, in a uniform distribution, every value within a specified range has an equal chance of occurring.

The Bernoulli distribution is a discrete probability distribution for a random variable that takes only two possible values: 1 ("success") with probability p and 0 ("failure") with probability 1 − p. It models a single trial with exactly two outcomes, such as a single coin toss.

The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, where each trial has only two possible outcomes: success or failure. The outcomes are often referred to as “success” and “failure,” but they can represent any dichotomous outcome, such as heads or tails, yes or no, or defective or non-defective.

The fundamental assumptions of a binomial distribution are that each trial has exactly two possible outcomes, the probability of success is the same for every trial, and the trials are independent of one another.

Q.38 Explain the exponential distribution and where it’s commonly used.

The probability distribution of the amount of time between events in a Poisson point process is known as the exponential distribution. It is a special case of the gamma distribution, and it is the continuous analogue of the geometric distribution.

Common applications of the exponential distribution include:

  • Reliability Engineering
  • Queueing Theory
  • Telecommunications
  • Natural Phenomena
  • Survival Analysis

The Poisson distribution is a probability distribution that describes the number of events that occur within a fixed interval of time or space when the events happen at a constant mean rate and are independent of the time since the last event.

Key characteristics of the Poisson distribution include:

  • Discreteness: The Poisson distribution is used to model the number of discrete events that occur within a fixed interval.
  • Constant Mean Rate: The events occur at a constant mean rate per unit of time or space.
  • Independence: The occurrences of events are assumed to be independent of each other. The probability of multiple events occurring in a given interval is calculated based on the assumption of independence.

The t-distribution, also known as the Student’s t-distribution, is used in statistics for inferences about population means when the sample size is small and the population standard deviation is unknown. The shape of the t-distribution is similar to the normal distribution, but it has heavier tails.

Relationship between T-Distribution and Normal Distribution: The t-distribution converges to the normal distribution as the degrees of freedom increase. In fact, when the degrees of freedom become very large, the t-distribution approaches the standard normal distribution (normal distribution with mean 0 and standard deviation 1). This is a result of the Central Limit Theorem.

The chi-squared distribution is a continuous probability distribution that arises in statistics and probability theory. It is commonly denoted as χ² (chi-squared) and is parameterized by its degrees of freedom. The chi-squared distribution models the distribution of the sum of squared independent standard normal random variables. It is also used to test whether data series are independent, to assess the goodness of fit of a data distribution, and to build confidence intervals for the variance and standard deviation of a normally distributed random variable.

The z-test, t-test, and F-test are all statistical hypothesis tests used in different situations and for different purposes. In brief: the z-test compares means when the population variance is known (or the sample is large), the t-test compares means when the population variance is unknown and the sample is small (using the t-distribution), and the F-test compares variances between two or more groups and underlies ANOVA.

In summary, the choice between a z-test, t-test, or F-test depends on the specific research question and the characteristics of the data.

The Central Limit Theorem states that, regardless of the shape of the population distribution, the distribution of the sample means approaches a normal distribution as the sample size increases. This is true even if the population distribution is not normal. The larger the sample size, the closer the sampling distribution of the sample mean will be to a normal distribution.

Hypothesis testing is a statistical method used to make inferences about population parameters based on sample data. It is a systematic way of evaluating statements or hypotheses about a population using observed sample data. To identify which statement is best supported by the sample data, it compares two mutually exclusive statements about a population.

  • Null hypothesis (H0): The null hypothesis is the default assumption that there is no effect or no association between the measured groups or variables; it is the statement the test tries to find evidence against.
  • Alternative hypothesis (H1): The alternative hypothesis is the claim that there is an effect or association; it is accepted when the sample evidence leads us to reject the null hypothesis.

A confidence interval (CI) is a statistical range or interval estimate for a population parameter, such as the population mean or population proportion, based on sample data. The following steps are used to calculate a confidence interval:

  • Collect Sample Data
  • Choose a Confidence Level
  • Select the Appropriate Statistical Method
  • Calculate the Margin of Error (MOE)
  • Calculate the Confidence Interval
  • Interpret the Confidence Interval

Confidence interval represents a range of values within which we believe, with a specified level of confidence (e.g., 95%), that the true population parameter lies.
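
A minimal SciPy sketch of a 95% t-based confidence interval for the mean of a small hypothetical sample:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of measurements
sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])

mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))    # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)    # two-sided 95% critical value

margin = t_crit * sem
print(mean - margin, mean + margin)   # 95% confidence interval for the mean
```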

The term "p-value", which stands for "probability value", is a key concept in statistics and hypothesis testing. It measures the evidence against a null hypothesis and helps determine whether the findings of a statistical test are statistically significant. Specifically, the p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true; a small p-value (below the chosen significance level) is taken as evidence against the null hypothesis.

Rejecting a null hypothesis that is actually true in the population results in a type I error (false-positive); failing to reject a null hypothesis that is actually untrue in the population results in a type II error (false-negative).

Although Type I and Type II errors cannot be completely avoided, the investigator can reduce their likelihood by increasing the sample size (a larger sample is less likely to differ substantially from the population).

The significance level, usually denoted as α (alpha), is a crucial threshold in hypothesis testing that establishes the bar for judging whether the outcome of a statistical test is statistically significant. It represents the maximum acceptable probability of committing a Type I error, i.e., mistakenly rejecting a true null hypothesis.

Key aspects of the significance level in hypothesis testing:

  • Setting the Significance Level
  • Interpreting the Significance Level
  • Hypothesis Testing Using Significance Level
  • Choice of Significance Level

The degree and direction of the linear link between two variables are quantified by the correlation coefficient. The Pearson correlation coefficient is the most widely used method for determining the correlation coefficient. The Pearson correlation coefficient can be calculated as follows.

  • Collect Data
  • Calculate the Means
  • Calculate the Covariance
  • Calculate the Standard Deviations
  • Calculate the Pearson Correlation Coefficient (r)
  • Interpret the Correlation Coefficient.
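
A minimal SciPy/NumPy sketch on made-up data:

```python
import numpy as np
from scipy import stats

# Two made-up variables with a roughly linear relationship
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

r, p_value = stats.pearsonr(x, y)     # correlation coefficient and its p-value
print(r, p_value)
print(np.corrcoef(x, y)[0, 1])        # same coefficient computed via NumPy
```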

Both covariance and correlation are statistical metrics that describe how two variables are related to one another. However, they serve slightly different purposes and have different interpretations.

  • Covariance : Covariance measures the degree to which two variables change together. It expresses how much the values of one variable tend to rise or fall in relation to changes in the other variable.
  • Correlation : Correlation is a standardized measure of the strength and direction of the linear relationship between two variables. It is obtained by dividing the covariance by the product of the two variables' standard deviations, which bounds it between −1 and 1.

When comparing two population means, a hypothesis test is used to determine whether there is sufficient statistical support to claim that the means of the two distinct populations differ significantly. Commonly used tests include the paired t-test (for dependent samples) and the two-sample t-test (for independent samples). The general procedure for carrying out such a test is as follows:

  • Formulate Hypotheses
  • Choose the Significance Level
  • Define Test Statistic
  • Draw a Conclusion
  • Final Results
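
A minimal SciPy sketch of a two-sample (Welch's) t-test on hypothetical group scores:

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for two independent groups
group_a = np.array([72, 75, 78, 71, 74, 77, 73])
group_b = np.array([68, 70, 65, 72, 66, 69, 71])

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)   # reject H0 at alpha = 0.05 if p_value < 0.05
```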

By minimising data duplication and enhancing data integrity, normalisation is a method in database design that aids in the effective organisation of data. It involves dividing a big, complicated table into smaller, associated tables while making sure that connections between data elements are preserved. The basic objective of normalisation is to reduce data anomalies, which can happen when data is stored in an unorganised way, and which include insertion, update, and deletion anomalies.

 Q.53 What is database denormalization?

Database denormalization is the process of intentionally introducing redundancy into a relational database by merging tables or incorporating redundant data to enhance query performance. Unlike normalization, which minimizes data redundancy for consistency, denormalization prioritizes query speed. By reducing the number of joins required, denormalization can improve read performance for complex queries. However, it may lead to data inconsistencies and increased maintenance complexity. Denormalization is often employed in scenarios where read-intensive operations outweigh the importance of maintaining a fully normalized database structure. Careful consideration and trade-offs are essential to strike a balance between performance and data integrity.

SQL functions can be categorized into several types based on their functionality.

  • Scalar Functions
  • Aggregate Functions
  • Window Functions
  • Table-Valued Functions
  • System Functions
  • User-Defined Functions
  • Conversion Functions
  • Conditional Functions

INNER JOIN and LEFT JOIN are two types of SQL JOIN operations used to combine data from multiple tables in a relational database. The main difference is that an INNER JOIN returns only the rows that have matching values in both tables, whereas a LEFT JOIN returns all rows from the left table together with the matching rows from the right table, filling in NULLs where no match exists.

A subquery is a query that is nested within another SQL query, also referred to as an inner query or nested query. On the basis of the outcomes of another query, we can use it to get data from one or more tables. SQL’s subqueries capability is employed for a variety of tasks, including data retrieval, computations, and filtering.

In SQL, we can perform mathematical calculations in queries using arithmetic operators and functions. Here are some common methods for performing mathematical calculations.

  • Arithmetic Operators
  • Mathematical Functions
  • Custom Expressions

The SQL CASE statement is a flexible conditional expression that may be used to implement conditional logic inside a query. It lets us specify different actions or values based on predetermined criteria.

Database:  Consistency and real-time data processing are prioritised, and they are optimised for storing, retrieving, and managing structured data. Databases are frequently used for administrative functions like order processing, inventory control, and customer interactions.

Data Warehouse:  Data warehouses are made for processing analytical data. They are designed to facilitate sophisticated querying and reporting by storing and processing massive amounts of historical data from various sources. Business intelligence, data analysis, and decision-making all employ data warehouses.

Regularization : Regularization is the technique to restrict the model overfitting during training by inducing a penalty to the loss. The penalty imposed on the loss function is added so that the complexity of the model can be controlled, thus overcoming the issue of overfitting in the model.

The following are the key differences between L1 and L2 regularization: L1 regularization (Lasso) adds the sum of the absolute values of the coefficients as the penalty term and can shrink some coefficients exactly to zero, effectively performing feature selection, whereas L2 regularization (Ridge) adds the sum of the squared coefficients as the penalty and shrinks all coefficients towards zero without eliminating any of them.

When creating predictive models, the bias-variance trade-off is a key concept in machine learning that deals with finding the right balance between two sources of error, bias and variance. It plays a crucial role in model selection and understanding the generalization performance of a machine learning algorithm. Here’s an explanation of these concepts:

  • Bias : Bias is simply described as the model's inability to predict the true value due to some difference or inaccuracy. These differences between the actual or expected values and the predicted values are known as error, bias error, or error due to bias.
  • Variance : Variance is a measure of data dispersion from its mean location. In machine learning, variance is the amount by which a predictive model’s performance differs when trained on different subsets of the training data. More specifically, variance is the model’s variability in terms of how sensitive it is to another subset of the training dataset, i.e. how much it can adapt on the new subset of the training dataset.

As data science professionals, our focus should be on achieving the best-fit model, i.e., low bias and low variance. A model with low bias and low variance suggests that it can capture the underlying patterns in the data (low bias) and is not overly sensitive to changes in the training data (low variance). This is the perfect circumstance for a machine learning model, since it can generalize effectively to new, previously unknown data and deliver consistent and accurate predictions. However, in practice, this is not fully achievable.


If the algorithm is too simple (a hypothesis with a linear equation), it may suffer from high bias and low variance, making it error-prone. If the algorithm fits too complicated a hypothesis (a high-degree equation), it may have high variance and low bias, in which case it will underperform on new entries. The sweet spot between these two situations is known as the trade-off, or bias-variance trade-off: an algorithm cannot be both more complex and less complex at the same time.

A kernel function is responsible for converting the original data points into a high dimensionality feature space. Choosing the appropriate kernel function in a Support Vector Machine is a crucial step, as it determines how well the SVM can capture the underlying patterns in your data. Below mentioned are some of the ways to choose the suitable kernel function:

  • If the dataset exhibits linear relationship

In this case, we should use Linear Kernel function. It is simple, computationally efficient and less prone to overfitting. For example, text classification, sentiment analysis, etc.

  • If the dataset requires probabilistic approach

The sigmoid kernel is suitable when the data resembles a sigmoid function or when you have prior knowledge suggesting this shape. For example, Risk assessment, Financial applications, etc.

  • If the dataset is Simple Non Linear in nature

In this case, use a Polynomial Kernel Function. Polynomial functions are useful when we are trying to capture moderate level of non linearity. For example, Image and Speech Recognition, etc.

  • If the dataset is Highly Non-Linear in Nature/ we do not know about the underlying relationship

In that case, a Radial basis function is the best choice. RBF kernel can handle highly complex dataset and is useful when you’re unsure about the data’s underlying distribution. For example, Financial forecasting, bioinformatics, etc.

Q.63 How does Naïve Bayes handle categorical and continuous features?

Naive Bayes is a probabilistic approach which assumes that the features are independent of each other. For categorical features, it calculates the conditional probability of each feature value given a class, P(feature | class), from the observed frequencies in the training data; to make a prediction it computes the posterior probability of each class given the observed feature values and selects the class with the highest posterior (the maximum a posteriori, or MAP, decision rule). For continuous features, Gaussian Naive Bayes assumes each feature follows a normal distribution within each class and uses the corresponding probability density in place of a frequency-based estimate.

Q.64 What is Laplace smoothing (add-one smoothing) and why is it used in Naïve Bayes?

In Naïve Bayes, the conditional probability of an event given a class label is determined as P(event | class). When using this in a classification problem (say, text classification), there could be a word that did not appear in a particular class. In those cases, the probability of the feature given that class label will be zero, which creates a big problem when making predictions from the training data.

To overcome this problem, we use Laplace smoothing. Laplace smoothing addresses the zero probability problem by adding a small constant (usually 1) to the count of each feature in each class and to the total count of features in each class. Without smoothing, if any feature is missing in a class, the probability of that class given the features becomes zero, making the classifier overly confident and potentially leading to incorrect classifications
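
A minimal scikit-learn sketch: the alpha parameter of MultinomialNB applies this additive smoothing (the word counts below are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical bag-of-words counts: rows are documents, columns are words
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 0, 3],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])   # two classes

# alpha=1.0 is Laplace (add-one) smoothing; without it, unseen words get zero probability
clf = MultinomialNB(alpha=1.0).fit(X, y)
print(clf.predict([[1, 1, 0]]))      # predicted class for a new document
print(clf.feature_log_prob_)         # smoothed log P(feature | class)
```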

Imbalanced datasets are datasets in which the distribution of class labels (or target values) is heavily skewed, meaning that one class has significantly more instances than any other class. Imbalanced datasets pose challenges because models trained on such data can have a bias toward the majority class, leading to poor performance on the minority class, which is often of greater interest. This will lead to the model not generalizing well on the unseen data.

To handle imbalanced datasets, we can approach the following methods:

  • Up-sampling : In this case, we can increase the classes for minority by either sampling without replacement or generating synthetic examples. Some of the popular examples are SMOTE (Synthetic Minority Over-sampling Technique), etc.
  • Down-sampling : Another case would be to randomly cut down the majority class such that it is comparable to minority class.
  • Bagging  : Techniques like Random Forests, which can mitigate the impact of class imbalance by constructing multiple decision trees from bootstrapped samples
  • Boosting : Algorithms like AdaBoost and XGBoost can give more importance to misclassified minority class examples in each iteration, improving their representation in the final model

An outlier is a data point that is significantly different from the other data points. Outliers usually lie in the extremes of the distribution and stand out compared to the rest of the data.

For detecting Outliers we can use the following approaches:

  • Visual inspection:  This is the easiest way which involves plotting the data points into scatter plot/box plot, etc.
  • statistics : By using measure of central tendency, we can determine if a data point falls significantly far from its mean, median, etc. making it a potential outlier.
  • Z-score:  if a data point has very high Z-score, it can be identified as Outlier

For removing the outliers, we can use the following:

  • Removal of outliers manually
  • Doing transformations like applying logarithmic transformation or square rooting the outlier
  • Performing imputations wherein the outliers are replaced with different values like mean, median, mode, etc.
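
A minimal NumPy sketch of Z-score-based detection followed by one simple treatment (median imputation); the data and the |z| > 2 threshold are illustrative:

```python
import numpy as np

# Made-up data with one obvious outlier
data = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]     # |z| > 2 (or 3) is a common rule of thumb
print(z_scores.round(2))
print(outliers)                            # [95.]

# One simple treatment: replace the flagged outliers with the median
cleaned = np.where(np.abs(z_scores) > 2, np.median(data), data)
print(cleaned)
```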

When dealing with a dataset that has high dimensionality (a large number of features), we often encounter various issues and problems. Some of the issues faced while dealing with high-dimensional datasets are listed below:

  • Computational expense : The biggest problem with handling a dataset with vast number of features is that it takes a long time to process and train the model on it. This can lead to wastage of both time and monetary resources.
  • Data sparsity : Many times data points are far from each other (high sparsity). This makes it harder to find the underlying patterns between features and can be a hindrance to proper analysis.
  • Visualising issues and overfitting : It is rather easy to visualize 2d and 3d data. But beyond this order, it is difficult to properly visualize our data. Furthermore, more data features can be correlated and provide misleading information to the model training and cause overfitting.

These issues are what are generally termed as “Curse of Dimensionality”.

To overcome this, we can follow different approaches – some of which are mentioned below:

  • Feature Selection : Many a times, not all the features are necessary. It is the user’s job to select out the features that would be necessary in solving a given problem statement.
  • Feature engineering : Sometimes, we may need a feature that is the combination of many other features. This method can, in general, reduce the feature count in the dataset.
  • Dimensionality Reduction techniques : These techniques reduce the number of features in a dataset while preserving as much useful information as possible. Some of the famous Dimensionality reduction techniques are: Principle component analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), etc.
  • Regularization:  Some regularization techniques like L1 and L2 regularizations are useful when deciding the impact each feature has on the model training.
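
A minimal scikit-learn sketch of one such technique, PCA, on an illustrative dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (150, 2): 4 features reduced to 2
print(pca.explained_variance_ratio_)    # share of variance captured by each component
```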

Mentioned below is how Random forest handles feature selection

  • When creating individual trees in the Random Forest ensemble, a subset of features is assigned to each tree which is called Feature Bagging. Feature Bagging introduces randomness and diversity among the trees.
  • After the training, the features are assigned a “importance score” based on how well those features performed by reducing the error of the model. Features that consistently contribute to improving the model’s accuracy across multiple trees are deemed more important
  • Then the features are ranked based on their importance scores. Features with higher importance scores are considered more influential in making predictions.

Feature Engineering : It can be defined as a method of preprocessing of data for better analysis purpose which involves different steps like selection, transformation, deletion of features to suit our problem at hand. Feature Engineering is a useful tool which can be used for:

  • Improving the model’s performance and Data interpretability
  • Reduce computational costs
  • Include hidden patterns for elevated Analysis results.

Some of the different methods of doing feature engineering are mentioned below:

  • Principle Component Analysis (PCA)  : It identifies orthogonal axes (principal components) in the data that capture the maximum variance, thereby reducing the data features.
  • One-Hot Encoding  – When we need to encode Nominal Categorical Data
  • Label Encoding  – When we need to encode Ordinal Categorical Data
  • Feature Transformation : Sometimes, we can create new columns essential for better modelling just by combining or modifying one or more columns.

Q.70 How will we deal with categorical text values in machine learning?

Often times, we are encountered with data that has Categorical text values. For example, male/female, first-class/second-class/third-class, etc. These Categorical text values can be divided into two types and based on that we deal with them as follows:

  • If it is Categorical Nominal Data: If the data does not have any hidden order associated with it (e.g., male/female), we perform One-Hot encoding on the data to convert it into binary sequence of digits
  • If it is Categorical Ordinal Data : When there is a pattern associated with the text data, we use Label encoding. In this, the numerical conversion is done based on the order of the text data. (e.g., Elementary/ Middle/ High/ Graduate,etc.)
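
Building on the two cases above, here is a minimal pandas sketch; the columns and the ordinal mapping are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],             # nominal
    "education": ["Elementary", "High", "Middle", "Graduate"],  # ordinal
})

# One-hot encoding for nominal data
one_hot = pd.get_dummies(df["gender"], prefix="gender")

# Label (ordinal) encoding that respects the natural order
order = {"Elementary": 0, "Middle": 1, "High": 2, "Graduate": 3}
df["education_encoded"] = df["education"].map(order)

print(pd.concat([df, one_hot], axis=1))
```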

Q.71 What is DBSCAN and how will we use it?

Density-Based Spatial Clustering of Applications with Noise (DBSCAN), is a density-based clustering algorithm used for grouping together data points that are close to each other in high-density regions and labeling data points in low-density regions as outliers or noise. Here is how it works:

  • For each data point in the dataset, DBSCAN calculates the distance between that point and all other data points
  • DBSCAN identifies dense regions by connecting core points that are within each other’s predefined threshold (eps) neighborhood.
  • DBSCAN forms clusters by grouping together data points that are density-reachable from one another.
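
A minimal scikit-learn sketch on made-up points, where eps and min_samples are the two key parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (made-up data)
X = np.array([[1, 1], [1.1, 0.9], [0.9, 1.2], [1.0, 1.1],
              [8, 8], [8.1, 7.9], [7.9, 8.2], [8.2, 8.1],
              [50, 50]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # cluster ids; -1 marks noise/outliers (the isolated point)
```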

The Expectation-Maximization (EM) algorithm is a probabilistic approach used for clustering data when dealing with mixture models. EM is commonly used when the true cluster assignments are not known and when there is uncertainty about which cluster a data point belongs to. Here is how it works:

  • First, the number of clusters K to be formed is specified.
  • Then, for each data point, the likelihood of it belonging to each of the K clusters is calculated. This is called the Expectation (E) step.
  • Based on these likelihoods, the model parameters (e.g., the cluster means and covariances) are updated. This is called the Maximization (M) step.
  • Convergence is checked by comparing the change in the log-likelihood or in the parameter values between iterations.
  • If it converges, then we have achieved our purpose. If not, then the E-step and M-step are repeated until we reach convergence.
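
In scikit-learn, a Gaussian Mixture Model is fitted with the EM algorithm under the hood. The sketch below uses synthetic data; the number of components is an assumption for this example.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two overlapping Gaussian blobs as toy data
    X = np.vstack([rng.normal(0, 1.0, (200, 2)), rng.normal(4, 1.5, (200, 2))])

    # fit() runs the E-step and M-step repeatedly until convergence
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    print(gmm.means_)                 # estimated cluster centres (M-step result)
    print(gmm.predict_proba(X[:3]))   # soft assignments (E-step responsibilities)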

Q.73 What is the silhouette score and how is it calculated?

The silhouette score is a metric used to evaluate the quality of clusters produced by a clustering algorithm. For each data point, it works as follows (a short sketch follows the steps):

  • The average distance between the data point and all other data points in the same cluster is calculated first; call this (a).
  • Then, for the same data point, the average distance (b) between it and all data points in the nearest neighbouring cluster (i.e., the closest cluster to which it is not assigned) is calculated.
  • The silhouette score is then S = (b - a) / max(a, b), interpreted as follows:
  • if -1<S<0, it signifies that data point is closer to a neighboring cluster than to its own cluster.
  • if S is close to zero, data point is on or very close to the decision boundary between two neighboring clusters.
  • if 0<S<1, data point is well within its own cluster and far from neighboring clusters.
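
A minimal scikit-learn sketch that compares average silhouette scores for K-Means clusterings with different numbers of clusters on toy data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

    # A higher average silhouette (closer to 1) indicates better-separated clusters
    for k in (2, 3, 4, 5):
        labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))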

Q.74 What is the role of eigenvalues and eigenvectors in PCA?

In Principal Component Analysis (PCA), eigenvalues and eigenvectors play a crucial role in the transformation of the original data into a new coordinate system. Let us first define the essential terms:

  • Eigen Values : Eigenvalues are associated with each eigenvector and represent the magnitude of the variance (spread or extent) of the data along the corresponding eigenvector
  • Eigen Vectors : Eigenvectors are the directions or axes in the original feature space along which the data varies the most or exhibits the most variance

The relationship between them is given as:

AV = \lambda V

where:

A = the feature (covariance) matrix being decomposed

V = an eigenvector of A

\lambda = the corresponding eigenvalue

A larger eigenvalue implies that the corresponding eigenvector captures more of the variance in the data. The sum of all eigenvalues equals the total variance in the original data; therefore, the proportion of total variance explained by each principal component can be calculated by dividing its eigenvalue by the sum of all eigenvalues.
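
A minimal NumPy sketch that eigen-decomposes a covariance matrix and reports the proportion of variance explained by each component; the synthetic data is an assumption for the example.

    import numpy as np

    rng = np.random.default_rng(1)
    # Synthetic data with correlated features
    X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.0],
                                              [0.0, 1.0, 0.1],
                                              [0.0, 0.0, 0.2]])

    # Eigen decomposition of the covariance matrix (eigh suits symmetric matrices)
    cov = np.cov(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Proportion of total variance explained by each principal component
    explained = eigenvalues / eigenvalues.sum()
    print(sorted(explained, reverse=True))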

Q.75 What is cross-validation? What are some common cross-validation techniques?

Cross-validation is a resampling technique used in machine learning to assess and validate the performance of a predictive model. It helps in estimating how well a model is likely to perform on unseen data, making it a crucial step in model evaluation and selection, and it is also helpful for avoiding overfitting. Some widely known cross-validation techniques are listed below, followed by a short sketch:

  • K-Fold Cross-Validation : In this, the data is divided into K subsets, and K iterations of training and testing are performed.
  • Stratified K-Fold Cross-Validation : This technique ensures that each fold has approximately the same proportion of classes as the original dataset (helpful in handling data imbalance)
  • Shuffle-Split Cross-Validation : It randomly shuffles the data and splits it into training and testing sets.
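
The sketch below runs stratified 5-fold cross-validation; the dataset and the model are arbitrary choices for illustration.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # Each fold keeps roughly the same class proportions as the full dataset
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv)

    print(scores, scores.mean())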

Q.76 What are the ROC curve and AUC?

The Receiver Operating Characteristic (ROC) curve is a graphical representation of a binary classifier's performance. It plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds.

True positive rate (TPR) : It is the ratio of true positive predictions to the total actual positives.

False positive rate (FPR) : It is the ratio of false positive predictions to the total actual negatives.

AUC-ROC-Curve

Area Under the Curve (AUC) as the name suggests is the area under the ROC curve. The AUC is a scalar value that quantifies the overall performance of a binary classification model and ranges from 0 to 1, where a model with an AUC of 0.5 indicates random guessing, and an AUC of 1 represents a perfect classifier.
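
A minimal scikit-learn sketch that computes the ROC curve points and the AUC for a simple classifier on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]           # probability of the positive class

    fpr, tpr, thresholds = roc_curve(y_te, scores)   # TPR vs FPR at each threshold
    print("AUC:", roc_auc_score(y_te, scores))       # 0.5 = random, 1.0 = perfect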

Q.77 Describe gradient descent and its role in optimizing machine learning models.

Gradient descent is a fundamental optimization algorithm used to minimize a cost or loss function in machine learning and deep learning. Its primary role is to iteratively adjust the parameters of a machine learning model to find the values that minimize the cost function, thereby improving the model's predictive performance. Here's how gradient descent helps in optimizing machine learning models:

  • Minimizing Cost functions : The primary goal of gradient descent is to find parameter values that result in the lowest possible loss on the training data.
  • Convergence : The algorithm continues to iterate and update the parameters until it meets a predefined convergence criterion, which can be a maximum number of iterations or achieving a desired level of accuracy.
  • Generalization : By finding parameters that minimize the training loss (usually together with regularization and validation), gradient descent helps the optimized model generalize well to new, unseen data.

Q.78 Explain the difference between Batch, Stochastic, and Mini-Batch Gradient Descent.

Batch Gradient Descent:  In Batch Gradient Descent, the entire training dataset is used to compute the gradient of the cost function with respect to the model parameters (weights and biases) in each iteration. This means that all training examples are processed before a single parameter update is made. It converges to a more accurate minimum of the cost function but can be slow, especially in a high-dimensional space.

Stochastic Gradient Descent:  In Stochastic Gradient Descent, only one randomly selected training example is used to compute the gradient and update the parameters in each iteration. The selection of examples is done independently for each iteration. This is capable of faster updates and can handle large datasets because it processes one example at a time but high variance can cause it to converge slower.

Mini-Batch Gradient Descent:  Mini-Batch Gradient Descent strikes a balance between BGD and SGD. It divides the training dataset into small, equally sized subsets called mini-batches. In each iteration, a mini-batch is randomly sampled, and the gradient is computed based on this mini-batch. It utilizes parallelism well and takes advantage of modern hardware like GPUs, but it can still exhibit some variance in updates compared to Batch Gradient Descent. A NumPy sketch of mini-batch gradient descent is shown below.
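
The sketch uses synthetic linear-regression data; the learning rate, batch size, and number of epochs are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

    w = np.zeros(3)                      # model parameters
    lr, batch_size = 0.1, 32             # learning rate and mini-batch size

    for epoch in range(50):
        idx = rng.permutation(len(X))    # shuffle the data each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch
            grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * grad               # parameter update step

    print(w)   # should end up close to [2, -1, 0.5]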

Q.79 Explain the Apriori — Association Rule Mining

Association rule mining is a technique for finding relationships between two or more items. Apriori is one of the most frequently used and simplest association rule mining algorithms. It uses prior knowledge of frequent itemset properties and is based on the Apriori property, which states that:

“All non-empty subsets of a frequent itemset must also be frequent”
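
A minimal pure-Python sketch of the support counting behind Apriori on made-up transactions; candidate pairs are generated only from items that are already frequent, which is exactly the pruning the Apriori property allows.

    from itertools import combinations

    # Made-up market-basket transactions
    transactions = [
        {"milk", "bread", "butter"},
        {"milk", "bread"},
        {"bread", "butter"},
        {"milk", "butter"},
        {"milk", "bread", "butter"},
    ]
    min_support = 0.6   # an itemset must appear in at least 60% of transactions

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    items = sorted({item for t in transactions for item in t})

    # Level 1: frequent single items
    frequent_items = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

    # Level 2: candidate pairs built only from frequent items (Apriori pruning)
    kept = {i for f in frequent_items for i in f}
    frequent_pairs = [frozenset(c) for c in combinations(sorted(kept), 2)
                      if support(frozenset(c)) >= min_support]

    print(frequent_items, frequent_pairs)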

Q.80 What is a multivariate normal distribution?

A vector of several normally distributed variables is said to have a multivariate normal distribution if any linear combination of the variables also has a normal distribution. The multivariate normal distribution is used to approximately represent the features of certain characteristics in machine learning, and it is also important in extending the central limit theorem to several variables.

Q.81 What is a conditional probability density function (PDF)?

In probability theory and statistics, the conditional probability density function (PDF) is a notion that represents the probability distribution of a random variable under a certain condition or constraint. It measures the probability density of the random variable taking particular values given that a set of circumstances or events holds.

Q.82 What is the difference between the probability density function (PDF) and the cumulative distribution function (CDF)?

The Probability Density Function (PDF) describes the relative likelihood that a continuous random variable takes on particular values within a range, whereas the Cumulative Distribution Function (CDF) gives the cumulative probability that the random variable falls below a given value. Both concepts are used in probability theory and statistics to describe and analyse probability distributions. They are related by integration and differentiation: F(x) = \int_{-\infty}^{x} f(t)\,dt and f(x) = \frac{d}{dx}F(x), i.e., the PDF is the derivative of the CDF.

Q.83 What is ANOVA? What are its different types?

The statistical method known as ANOVA, or Analysis of Variance, is used to examine the variation in a dataset and determine whether there are statistically significant differences between group means. It is frequently used when comparing the means of several groups or treatments to find out whether there are any notable differences.

There are several different ways to perform ANOVA tests, each suited for different types of experimental designs and data structures:

  • One-Way ANOVA
  • Two-Way ANOVA
  • Three-Way ANOVA

When conducting ANOVA tests we typically calculate an F-statistic and compare it to a critical value or use it to calculate a p-value.
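
A minimal SciPy sketch of a one-way ANOVA on made-up group measurements:

    from scipy import stats

    # Made-up measurements for three groups
    group_a = [23, 25, 27, 22, 26]
    group_b = [30, 31, 29, 32, 28]
    group_c = [24, 26, 25, 27, 23]

    # One-way ANOVA: are the group means significantly different?
    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f_stat, p_value)   # a small p-value -> reject the equal-means null hypothesis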

Q.84 What is the local minima problem and how can it be mitigated?

The local minima problem occurs when the optimization algorithm converges to a solution that is a minimum within a small neighbourhood of the current point but may not be the global minimum of the objective function.

To mitigate the local minima problem, we can use the following techniques:

  • Use initialization techniques like Xavier/Glorot and He to set appropriate initial values for the trainable parameters before the optimization process starts.
  • Use Adam or RMSProp as the optimizer; these adaptive learning rate algorithms adapt the learning rate for individual parameters based on historical gradients.
  • Introduce stochasticity in the optimization process using mini-batches, which can help the optimizer escape local minima by adding noise to the gradient estimates.
  • Add more layers or neurons, which can create a loss landscape in which poor local minima are less of a problem.
  • Tune hyperparameters with RandomizedSearchCV or GridSearchCV to explore the parameter space more thoroughly, which suggests suitable hyperparameters for training and reduces the risk of getting stuck in local minima.

Q.85 Explain the working of the gradient boosting algorithm.

Gradient boosting implementations such as XGBoost and CatBoost are used for regression and classification problems. Gradient boosting is a boosting algorithm that combines the predictions of weak learners to create a strong model. The key steps, followed by a short scikit-learn sketch, are:

  • Initialize the model with a weak learner, such as a shallow decision tree.
  • Calculate the residuals, i.e., the difference between the target values and the predictions made by the current model.
  • Fit a new weak learner to these residuals so that it captures the errors made by the current ensemble.
  • Update the model by adding a fraction of the new weak learner's predictions; this fraction is controlled by the learning rate.
  • Repeat steps 2 to 4, with each iteration focusing on correcting the errors made by the previous model.
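
The sketch below fits a gradient boosting regressor on synthetic data; the hyperparameters are arbitrary choices.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Shallow trees are fitted sequentially on residuals; learning_rate shrinks each update
    gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                    max_depth=3, random_state=0).fit(X_tr, y_tr)

    print("R^2 on test data:", gbr.score(X_te, y_te))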

Q.86 Explain the convolution operation in a CNN architecture.

In a CNN architecture, convolution operations involve applying small filters (also called kernels) to the input data to extract features. These filters slide over the input image, covering one small part of the input at a time and computing dot products at each position, which creates a feature map. This operation captures the similarity between the filter's pattern and the local features in the input. Strides determine how much the filter moves between positions. The resulting feature maps capture patterns such as edges, textures, or shapes, and are essential for image recognition tasks. Convolution operations help reduce the spatial dimensions of the data and make the network translation-invariant, allowing it to recognize features in different parts of an image. Pooling layers are often used after convolutions to further reduce dimensions while retaining important information.
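
A minimal NumPy sketch of the sliding-window operation described above (strictly speaking it computes cross-correlation, which is what most CNN frameworks implement under the name convolution); the image and the filter are made up.

    import numpy as np

    def conv2d(image, kernel, stride=1):
        """Valid (no padding) 2-D convolution of a single-channel image."""
        kh, kw = kernel.shape
        out_h = (image.shape[0] - kh) // stride + 1
        out_w = (image.shape[1] - kw) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
        return out

    image = np.random.rand(6, 6)
    vertical_edge_filter = np.array([[1, 0, -1],
                                     [1, 0, -1],
                                     [1, 0, -1]])

    print(conv2d(image, vertical_edge_filter).shape)   # (4, 4) feature map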

Q.87 What is a feed-forward network and how is it different from a recurrent neural network?

Feed-forward neural networks and recurrent neural networks are two basic deep learning architectures. They are employed for different tasks, and they differ in their structure and in how they handle sequential data. A minimal Keras sketch follows the two lists below.

Feed Forward Neural Network

  • In FFNN, the information flows in one direction, from input to output, with no loops
  • It consists of multiple layers of neurons, typically organized into an input layer, one or more hidden layers, and an output layer.
  • Each neuron in a layer is connected to every neuron in the subsequent layer through weighted connections.
  • FNNs are primarily used for tasks such as classification and regression, where they take a fixed-size input and produce a corresponding output

Recurrent Neural Network

  • A recurrent neural network is designed to handle sequential data, where the order of input elements matters. Unlike FNNs, RNNs have connections that loop back on themselves, allowing them to maintain a hidden state that carries information from previous time steps.
  • This hidden state enables RNNs to capture temporal dependencies and context in sequential data, making them well-suited for tasks like natural language processing, time series analysis, and sequence generation.
  • However, standard RNNs have limitations in capturing long-range dependencies due to the vanishing gradient problem.
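
This Keras sketch contrasts the two architectures, assuming TensorFlow is installed; the layer sizes and input shapes are arbitrary.

    from tensorflow.keras import layers, models

    # Feed-forward network: fixed-size input, information flows straight through
    ffnn = models.Sequential([
        layers.Input(shape=(20,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

    # Recurrent network: the input is a sequence (10 time steps, 8 features each);
    # the SimpleRNN layer carries a hidden state across time steps
    rnn = models.Sequential([
        layers.Input(shape=(10, 8)),
        layers.SimpleRNN(32),
        layers.Dense(1, activation="sigmoid"),
    ])

    ffnn.summary()
    rnn.summary()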

Q.88 What is the difference between generative and discriminative models?

Generative models focus on generating new data samples, while discriminative models concentrate on classification and prediction tasks based on input data.

Generative Models:

  • Objective: Model the joint probability distribution P(X, Y) of input X and target Y.
  • Use: Generate new data, often for tasks like image and text generation.
  • Examples: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs).

Discriminative Models:

  • Objective: Model the conditional probability distribution P(Y | X) of target Y given input X.
  • Use: Classify or make predictions based on input data.
  • Examples: Logistic Regression, Support Vector Machines, Convolutional Neural Networks (CNNs) for image classification.

Q.89 What are forward and backward propagation in deep learning?

Forward and backward propagations are key processes that occur during neural network training in deep learning. They are essential for optimizing network parameters and learning meaningful representations from input.

The process by which input data is passed through the neural network to generate predictions or outputs is known as forward propagation. The procedure begins at the input layer, where data is fed into the network. Each neuron in a layer calculates the weighted total of its inputs, applies an activation function, and sends the result to the next layer. This process continues through the hidden layers until the final output layer produces predictions or scores for the given input data.

The technique of computing gradients of the loss function with regard to the network’s parameters is known as backward propagation. It is utilized to adjust the neural network parameters during training using optimization methods such as gradient descent.

The process starts with the computation of the loss, which measures the difference between the network’s predictions and the actual target values. Gradients are then computed by using the chain rule of calculus to propagate this loss backward through the network. This entails figuring out how much each parameter contributed to the error. The computed gradients are used to adjust the network’s weights and biases, reducing the error in subsequent forward passes.
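
A minimal NumPy sketch of forward and backward propagation for a single sigmoid neuron trained with binary cross-entropy; the synthetic data and the learning rate are assumptions for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 3))                    # batch of 64 samples, 3 features
    y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

    W = rng.normal(scale=0.1, size=(3, 1))          # weights of a single sigmoid neuron
    b = np.zeros((1, 1))
    lr = 0.5

    for step in range(300):
        # Forward propagation: predictions and binary cross-entropy loss
        y_hat = 1 / (1 + np.exp(-(X @ W + b)))
        y_hat = np.clip(y_hat, 1e-9, 1 - 1e-9)
        loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

        # Backward propagation: the chain rule gives the gradients w.r.t. W and b
        dz = (y_hat - y) / len(X)
        dW, db = X.T @ dz, dz.sum(axis=0, keepdims=True)

        # Gradient descent update
        W -= lr * dW
        b -= lr * db

    print("final loss:", round(float(loss), 4))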

Q.90 What are Markov models? Where are they used?

Markov models are effective methods for capturing and modeling dependencies between successive data points or states in a sequence. They are especially useful when the current condition depends on the preceding states. They rely on the Markov property, which asserts that the future state or observation depends only on the current state and is independent of all prior states. There are two types of Markov models commonly used in sequential data analysis (a small simulation sketch follows the applications list):

  • Markov chains are the simplest form of Markov models, consisting of a set of states and transition probabilities between these states. Each state represents a possible condition or observation, and the transition probabilities describe the likelihood of moving from one state to another.
  • Hidden Markov Models extend the concept of Markov chains by introducing a hidden layer of states and observable emissions associated with each hidden state. The true state of the system (hidden state) is not directly observable, but the emissions are observable.

Applications:

  • HMMs are used to model phonemes and words in speech recognition systems, allowing for accurate transcription of spoken language
  • HMMs are applied in genomics for gene prediction and sequence alignment tasks. They can identify genes within DNA sequences and align sequences for evolutionary analysis.
  • Markov models are used in modeling financial time series data, such as stock prices, to capture the dependencies between consecutive observations and make predictions.
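
The sketch below simulates a two-state Markov chain with a made-up transition matrix; each next state depends only on the current state.

    import numpy as np

    states = ["sunny", "rainy"]
    # Transition matrix: row = current state, column = next state (made-up probabilities)
    P = np.array([[0.8, 0.2],
                  [0.4, 0.6]])

    rng = np.random.default_rng(0)
    state = 0                                # start in "sunny"
    sequence = [states[state]]

    for _ in range(10):
        state = rng.choice(2, p=P[state])    # Markov property: depends only on the current state
        sequence.append(states[state])

    print(sequence)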

Q.91 What is Generative AI?

Generative AI is an abbreviation for Generative Artificial Intelligence, which refers to a class of artificial intelligence systems and algorithms designed to generate new, unique data or material that is comparable to, or indistinguishable from, human-created data. It is a subset of artificial intelligence that focuses on the creative component of AI, allowing machines to develop novel outputs such as writing, graphics, audio, and more. There are several generative AI models and methodologies, each adapted to different sorts of data and applications, such as:

  • Generative AI models such as GPT (Generative Pretrained Transformer) can generate human-like text. Natural language synthesis, automated content production, and chatbot responses are all common uses for these models.
  • Images are generated using Generative Adversarial Networks (GANs). GANs are made up of a generator network that generates images and a discriminator network that determines the authenticity of the generated images. Because of the struggle between the generator and discriminator, high-quality, realistic images are produced.
  • Generative AI can also create audio content, such as speech synthesis and music composition. Audio content is generated using models such as WaveGAN and Magenta.

Q.92 What are the different neural network architectures used to generate artificial data in deep learning?

Various neural networks are used to generate artificial data. Here are some of the neural network architectures used for generating artificial data:

  • GANs consist of two components – a generator and a discriminator – which are trained simultaneously through adversarial training. They are used to generate high-quality images, such as photorealistic faces, artwork, and even entire scenes.
  • VAEs are generative models that learn a probabilistic mapping from the data space to a latent space. They also consist of encoder and decoder. They are used for generating images, reconstructing missing parts of images, and generating new data samples. They are also applied in generating text and audio.
  • RNNs are a class of neural networks with recurrent connections that can generate sequences of data. They are often used for sequence-to-sequence tasks. They are used in text generation, speech synthesis, music composition.
  • Transformers are a type of neural network architecture that has gained popularity for sequence-to-sequence tasks. They use self-attention mechanisms to capture dependencies between different positions in the input data. They are used in natural language processing tasks like machine translation, text summarization, and language generation.
  • Autoencoders are neural networks that are trained to reconstruct their input data. Variants like denoising autoencoders and contractive autoencoders can be used for data generation. They are used for image denoising, data inpainting, and generating new data samples.

Q.93 What is Deep Reinforcement Learning (DRL)?

Deep Reinforcement Learning (DRL) is a machine learning technique that combines the principles of reinforcement learning with the capability of deep neural networks. It has garnered significant attention for its ability to let machines learn difficult tasks independently by interacting with their environments, similar to how people learn via trial and error.

DRL is made up of three fundamental components:

  • The agent interacts with the environment and takes decision.
  • The environment is the outside world with which the agent interacts and receives feedback.
  • The reward signal is a scalar value provided by the environment after each action, guiding the agent toward maximizing cumulative rewards over time.

Some applications of DRL:

  • In robotics, DRL is used for robot control, manipulation, and navigation.
  • DRL plays a role in self-driving cars and vehicle control
  • Can also be used for customized recommendations

Q.94 What is transfer learning? Explain how it works.

Transfer learning is a powerful machine learning and deep learning technique that allows models to apply knowledge obtained from one task or domain to a new but related one. It is motivated by the notion that what we learn in one setting can be applied to a new but comparable challenge.

Benefits of Transfer Learning:

  • We may utilize knowledge from a large dataset by starting with a pretrained model, making it easier to adapt to a new task with limited data.
  • Training a deep neural network from scratch can be time-consuming and costly in terms of compute. Transfer learning enables us to bypass the earliest phases of training, saving both time and resources.
  • Pretrained models frequently learn rich data representations. Models that use these representations can generalize better, even when the target task has a smaller dataset.

Transfer Learning Process:

  • It starts from a pretrained model, the foundation step in transfer learning. The pretrained model has already been trained on a large and diverse dataset for a related task.
  • To leverage this knowledge, the output layers of the pretrained model are removed, leaving the layers responsible for feature extraction. The target data is passed through these layers to extract feature information.
  • Using these extracted features, the model captures patterns and representations from the data.
  • After the feature extraction process, the model is fine-tuned for the specific target task.
  • New output layers are added to the model, and these layers are designed to produce the desired output for the target task.
  • Backpropagation is used to iteratively update the model's weights during fine-tuning. This allows the model to tailor its representations and decision boundaries to the specifics of the target task.
  • Even as the model focuses on the target task, the knowledge and features learned in the pretrained layers continue to contribute to its understanding. This dual learning process improves the model's performance and enables it to thrive in tasks with little data or few resources. A minimal Keras sketch of this process is shown below.
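
This sketch assumes TensorFlow/Keras is installed; the choice of MobileNetV2, the 160x160 input size, and the 10 output classes are arbitrary assumptions for illustration, and the training data names are placeholders.

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import MobileNetV2

    # 1. Pretrained feature extractor with its original output layers removed
    base = MobileNetV2(input_shape=(160, 160, 3), include_top=False, weights="imagenet")
    base.trainable = False                       # freeze the pretrained layers

    # 2. New output head for the target task
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=5)   # fine-tune on the (hypothetical) target data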

Q.95 What is the difference between object detection and image segmentation?

Object detection and Image segmentation are both computer vision tasks that entail evaluating and comprehending image content, but they serve different functions and give different sorts of information.

Object Detection:

  • The goal of object detection is to identify and locate objects, representing each detected object with a bounding box and its respective label.
  • It is used in applications like autonomous driving for detecting pedestrians and vehicles.

Image Segmentation:

  • It focuses on partitioning an image into multiple regions, where each segment corresponds to a coherent part of the image.
  • It provides pixel-level labelling of the entire image.
  • It is used in applications that require pixel-level understanding, such as medical image analysis for organ and tumour delineation.

Q.96 Explain the concept of word embeddings in NLP.

In NLP, word embeddings are used to capture semantic and contextual information. Word embeddings are dense representations of words or phrases as continuous-valued vectors in a high-dimensional space. Each word is mapped to a vector of real numbers, and these vectors are learned from large corpora of text data.

Word embeddings are based on the Distributional Hypothesis, which suggests that words that appear in similar context have similar meanings. This idea is used by word embedding models to generate vector representations that reflect the semantic links between words depending on how frequently they co-occur with other words in the text.

The most common techniques are listed below, followed by a short TF-IDF sketch:

  • Bag of Words (BoW)
  • GloVe: Global Vectors for Word Representation
  • Term frequency-inverse document frequency (TF-IDF)
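
The sketch below vectorizes a made-up corpus with TF-IDF; note that TF-IDF produces sparse, count-based document vectors rather than dense learned embeddings.

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "data science extracts insights from data",
        "machine learning is a part of data science",
        "deep learning uses neural networks",
    ]

    # Rare but distinctive words receive higher weights than very common ones
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))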

Q.97 What is a Sequence-to-Sequence (Seq2Seq) model?

A Sequence-to-Sequence (Seq2Seq) model is a neural network architecture designed to cope with sequences of data, making it particularly helpful for tasks involving variable-length input and output sequences. It is used extensively in natural language processing for machine translation, text summarization, question answering, and other tasks.

A Seq2Seq model consists of two main components: an encoder and a decoder. The encoder takes the input sequence and converts it into a fixed-length context vector that captures the features and context of the sequence. The decoder takes this vector as input and generates the output sequence, typically autoregressively, with each prediction influenced by the preceding one.

Q.98 What are artificial neural networks?

Artificial neural networks (ANNs) take inspiration from the structure and functioning of the human brain. The computational units in an ANN are called neurons, and these neurons are responsible for processing information and passing it on to the next layer.

ANN has three main components:

  • Input Layer : where the network receives input features.
  • Hidden Layer:  one or more layers of interconnected neurons responsible for learning patterns in the data
  • Output Layer : provides final output on processed information.

Q.99 What is marginal probability?

A key idea in statistics and probability theory is marginal probability, also known as the marginal distribution. It is the probability of an event occurring with respect to a particular variable of interest, without taking into account the outcomes of the other variables. Essentially, it treats the other variables as "marginal" or irrelevant and concentrates on one.

Marginal probabilities are essential in many statistical analyses, including estimating anticipated values, computing conditional probabilities, and drawing conclusions about certain variables of interest while taking other variables’ influences into account.

Data science is a growing career, and if you are looking for a future in data science, the following sections cover additional frequently asked interview questions.


Introduction:

Data science is an interdisciplinary field that mines raw data, analyses it, and comes up with patterns that are used to extract valuable insights from it. Statistics, computer science, machine learning, deep learning, data analysis, data visualization, and various other technologies form the core foundation of data science.

Over the years, data science has gained widespread importance due to the importance of data. Data is considered the new oil of the future, which, when analyzed and harnessed properly, can prove to be very beneficial to the stakeholders. Not just this, a data scientist gets exposure to working in diverse domains, solving real-life practical problems with trendy technologies. A common real-time application is fast food delivery in apps such as Uber Eats, which aids the delivery person by showing the fastest possible route from the restaurant to the destination.

Data Science is also used in item recommendation systems in e-commerce sites like Amazon, Flipkart, etc which recommend the user what item they can buy based on their search history. Not just recommendation systems, Data Science is becoming increasingly popular in fraud detection applications to detect any fraud involved in credit-based financial applications. A successful data scientist can interpret data, perform innovation and bring out creativity while solving problems that help drive business and strategic goals. This makes it the most lucrative job of the 21st century.


In this article, we will explore what are the most commonly asked Data Science Technical Interview Questions which will help both aspiring and experienced data scientists.

Data Science Interview Questions for Freshers

1. What is Data Science?

An interdisciplinary field that constitutes various scientific processes, algorithms, tools, and machine learning techniques working to help find common patterns and gather sensible insights from the given raw input data using statistical and mathematical analysis is called Data Science.

The following figure represents the life cycle of data science.

data-science-lifecycle

  • It starts with gathering the business requirements and relevant data.
  • Once the data is acquired, it is maintained by performing data cleaning, data warehousing, data staging, and data architecture.
  • Data processing does the task of exploring the data, mining it, and analyzing it which can be finally used to generate the summary of the insights extracted from the data.
  • Once the exploratory steps are completed, the cleansed data is subjected to various algorithms like predictive analysis, regression, text mining, recognition patterns, etc depending on the requirements.
  • In the final stage, the results are communicated to the business in a visually appealing manner. This is where the skill of data visualization, reporting, and different business intelligence tools come into the picture.

2. What is the difference between data analytics and data science?

  • Data science involves the task of transforming data by using various technical analysis methods to extract meaningful insights using which a data analyst can apply to their business scenarios.
  • Data analytics deals with checking the existing hypothesis and information and answers questions for a better and effective business-related decision-making process.
  • Data Science drives innovation by answering questions that build connections and answers for futuristic problems. Data analytics focuses on getting present meaning from existing historical context whereas data science focuses on predictive modeling.
  • Data Science can be considered as a broad subject that makes use of various mathematical and scientific tools and algorithms for solving complex problems whereas data analytics can be considered as a specific field dealing with specific concentrated problems using fewer tools of statistics and visualization.

The following Venn diagram depicts the difference between data science and data analytics clearly:

data-science-vs-data-analytics

3. What are some of the techniques used for sampling? What is the main advantage of sampling?

Data analysis cannot be performed on the whole volume of data at a time, especially when it involves larger datasets. It becomes crucial to take data samples that represent the whole population and then perform analysis on them. While doing this, it is very necessary to carefully take sample data out of the huge data such that it truly represents the entire dataset. The main advantage of sampling is that it makes analysis feasible and faster while still allowing valid conclusions about the population.


There are majorly two categories of sampling techniques based on the usage of statistics, they are:

  • Probability Sampling techniques: Clustered sampling, Simple random sampling, Stratified sampling.
  • Non-Probability Sampling techniques: Quota sampling, Convenience sampling, snowball sampling, etc.

4. List down the conditions for Overfitting and Underfitting.

Overfitting: The model performs well only on the sample training data. If any new data is given as input, the model fails to generalize and produces poor results. These conditions occur due to low bias and high variance in the model. Decision trees are more prone to overfitting.


Underfitting: Here, the model is so simple that it is not able to identify the correct relationship in the data, and hence it does not perform well even on the test data. This can happen due to high bias and low variance. Linear regression is more prone to Underfitting.


5. Differentiate between the long and wide format data.

Wide format and long format are two common layouts of the same data:

  • Wide format: each subject (or entity) occupies a single row, and its repeated measurements or attributes appear in separate columns.
  • Long format: each row is a single observation; one column identifies the subject, another identifies the variable (or time point), and another holds its value.

Long format is generally easier to aggregate, analyse, and plot, while wide format is more compact and easier to read at a glance.


6. What are Eigenvectors and Eigenvalues?

Eigenvectors are column vectors or unit vectors whose length/magnitude is equal to 1. They are also called right vectors. Eigenvalues are the coefficients applied to eigenvectors that scale these vectors, giving them different values of length or magnitude.


A matrix can be decomposed into Eigenvectors and Eigenvalues and this process is called Eigen decomposition. These are then eventually used in machine learning methods like PCA (Principal Component Analysis) for gathering valuable insights from the given matrix.
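
A minimal NumPy sketch of eigen decomposition that also verifies the defining relation Av = λv; the matrix is made up.

    import numpy as np

    A = np.array([[4.0, 2.0],
                  [1.0, 3.0]])

    eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of eigenvectors are unit eigenvectors
    print(eigenvalues)                             # for this matrix: 5 and 2

    # Verify A v = lambda v for the first eigenpair
    v, lam = eigenvectors[:, 0], eigenvalues[0]
    print(np.allclose(A @ v, lam * v))             # True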

7. What does it mean when the p-values are high and low?

A p-value is the measure of the probability of having results equal to or more than the results achieved under a specific hypothesis assuming that the null hypothesis is correct. This represents the probability that the observed difference occurred randomly by chance.

  • A low p-value (≤ 0.05) means that the null hypothesis can be rejected; the observed data would be unlikely if the null hypothesis were true.
  • A high p-value (≥ 0.05) indicates strength in favour of the null hypothesis; the observed data is likely under a true null hypothesis.
  • A p-value of exactly 0.05 is marginal, and the decision can go either way.

8. When is resampling done?

Resampling is a methodology used to sample data for improving accuracy and quantify the uncertainty of population parameters. It is done to ensure the model is good enough by training the model on different patterns of a dataset to ensure variations are handled. It is also done in the cases where models need to be validated using random subsets or when substituting labels on data points while performing tests.

9. What do you understand by Imbalanced Data?

Data is said to be highly imbalanced if it is distributed unequally across different categories. Such datasets degrade model performance because the majority class dominates training, and they result in inaccurate models.

10. Are there any differences between the expected value and mean value?

There are not many differences between these two, but it is to be noted that these are used in different contexts. The mean value generally refers to the probability distribution whereas the expected value is referred to in the contexts involving random variables.

11. What do you understand by Survivorship Bias?

This bias refers to the logical error while focusing on aspects that survived some process and overlooking those that did not work due to lack of prominence. This bias can lead to deriving wrong conclusions.

12. Define the terms KPI, lift, model fitting, robustness and DOE.

  • KPI: KPI stands for Key Performance Indicator that measures how well the business achieves its objectives.
  • Lift: This is a performance measure of the target model measured against a random choice model. Lift indicates how good the model is at prediction versus if there was no model.
  • Model fitting: This indicates how well the model under consideration fits given observations.
  • Robustness: This represents the system’s capability to handle differences and variances effectively.
  • DOE: stands for the design of experiments, which represents the task design aiming to describe and explain information variation under hypothesized conditions to reflect variables.

13. Define confounding variables.

Confounding variables are also known as confounders. These variables are a type of extraneous variable that influences both the independent and dependent variables, causing spurious associations and mathematical relationships between variables that are associated but not causally related to each other.

14. Define and explain selection bias?

Selection bias occurs when the researcher has to decide which participants to study and the selection is not random. It is also called the selection effect, and it results from the method of sample collection.

Four types of selection bias are explained below:

  • Sampling Bias: As a result of a population that is not random at all, some members of a population have fewer chances of getting included than others, resulting in a biased sample. This causes a systematic error known as sampling bias.
  • Time interval: Trials may be stopped early when an extreme value is reached, but if all variables are otherwise similar, the variable with the highest variance has a higher chance of reaching the extreme value first.
  • Data: It is when specific data is selected arbitrarily and the generally agreed criteria are not followed.
  • Attrition: Attrition in this context means the loss of the participants. It is the discounting of those subjects that did not complete the trial.

15. Define bias-variance trade-off?

Let us first understand the meaning of bias and variance in detail:

Bias: It is a kind of error in a machine learning model when an ML Algorithm is oversimplified. When a model is trained, at that time it makes simplified assumptions so that it can easily understand the target function. Some algorithms that have low bias are Decision Trees, SVM, etc. On the other hand, logistic and linear regression algorithms are the ones with a high bias.

Variance: Variance is also a kind of error. It is introduced into an ML model when the algorithm is made highly complex. Such a model also learns noise from the training data set and then performs badly on the test data set. This can lead to overfitting as well as high sensitivity.

When the complexity of a model is increased, a reduction in the error is seen. This is caused by the lower bias in the model. But this does not always happen until we reach a particular point called the optimal point. After this point, if we keep on increasing the complexity of the model, it will be overfitted and will suffer from the problem of high variance. We can represent this situation with the help of a graph as shown below:

bias-variance-tradeoff

As you can see from the image above, before the optimal point, increasing the complexity of the model reduces the error (bias). However, after the optimal point, we see that the increase in the complexity of the machine learning model increases the variance.

Trade-off Of Bias And Variance: So, as we know that bias and variance, both are errors in machine learning models, it is very essential that any machine learning model has low variance as well as a low bias so that it can achieve good performance.

Let us see some examples. The K-Nearest Neighbor Algorithm is a good example of an algorithm with low bias and high variance. This trade-off can easily be reversed by increasing the k value which in turn results in increasing the number of neighbours. This, in turn, results in increasing the bias and reducing the variance.

Another example is the support vector machine. This algorithm also has high variance and low bias, and we can reverse the trade-off by adjusting the regularization parameter C: in the usual soft-margin formulation (as in scikit-learn's SVC), decreasing C strengthens the regularization, which increases the bias and decreases the variance.

So, the trade-off is simple. If we increase the bias, the variance will decrease and vice versa.

16. Define the confusion matrix?

It is a matrix that has 2 rows and 2 columns. It has 4 outputs that a binary classifier provides to it. It is used to derive various measures like specificity, error rate, accuracy, precision, sensitivity, and recall.

confusion-matrix

The test data set should contain the correct and predicted labels. For instance, if the binary classifier performs perfectly, the predicted labels exactly match the actual labels; in real-world scenarios they match only part of the observed labels. The four outcomes shown above in the confusion matrix mean the following:

  • True Positive: This means that the positive prediction is correct.
  • False Positive: This means that the positive prediction is incorrect.
  • True Negative: This means that the negative prediction is correct.
  • False Negative: This means that the negative prediction is incorrect.

The formulas for calculating basic measures that comes from the confusion matrix are:

  • Error rate : (FP + FN)/(P + N)
  • Accuracy : (TP + TN)/(P + N)
  • Sensitivity = TP/P
  • Specificity = TN/N
  • Precision = TP/(TP + FP)
  • F-Score = (1 + b²)(Precision × Recall)/(b² × Precision + Recall). Here, b is commonly 0.5, 1, or 2.

In these formulas:

FP = false positive, FN = false negative, TP = true positive, TN = true negative

Sensitivity is the measure of the True Positive Rate. It is also called recall. Specificity is the measure of the true negative rate. Precision is the measure of a positive predicted value. F-score is the harmonic mean of precision and recall.
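
A minimal scikit-learn sketch that builds a confusion matrix from made-up labels and derives precision, recall (sensitivity), and F1 from the same counts:

    from sklearn.metrics import classification_report, confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    # Rows correspond to actual classes, columns to predicted classes
    print(confusion_matrix(y_true, y_pred))

    # Precision, recall (sensitivity) and F1 computed from the same four counts
    print(classification_report(y_true, y_pred))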

17. What is logistic regression? State an example where you have recently used logistic regression.

Logistic Regression is also known as the logit model. It is a technique to predict the binary outcome from a linear combination of variables (called the predictor variables). 

For example , let us say that we want to predict the outcome of elections for a particular political leader. So, we want to find out whether this leader is going to win the election or not. So, the result is binary i.e. win (1) or loss (0). However, the input is a combination of linear variables like the money spent on advertising, the past work done by the leader and the party, etc. 
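
For illustration, a minimal scikit-learn sketch of the election example; the feature values and labels are entirely made up.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical campaign data: [money spent (in millions), past approval rating]
    X = np.array([[1.0, 40], [2.5, 55], [0.5, 30], [3.0, 65], [1.5, 45], [2.8, 60]])
    y = np.array([0, 1, 0, 1, 0, 1])            # 1 = won the election, 0 = lost

    model = LogisticRegression().fit(X, y)
    print(model.predict([[2.0, 50]]))           # predicted outcome for a new candidate
    print(model.predict_proba([[2.0, 50]]))     # estimated loss/win probabilities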

18. What is Linear Regression? What are some of the major drawbacks of the linear model?

Linear regression is a technique in which the score of a variable Y is predicted using the score of a predictor variable X. Y is called the criterion variable. Some of the drawbacks of Linear Regression are as follows:

  • The assumption of linearity of errors is a major drawback.
  • It cannot be used for binary outcomes. We have Logistic Regression for that.
  • There are overfitting problems that cannot easily be solved within the plain linear model.

19. What is a random forest? Explain it’s working.

Classification is very important in machine learning. It is very important to know to which class does an observation belongs. Hence, we have various classification algorithms in machine learning like logistic regression, support vector machine, decision trees, Naive Bayes classifier, etc. One such classification technique that is near the top of the classification hierarchy is the random forest classifier. 

So, firstly we need to understand a decision tree before we can understand the random forest classifier and how it works. Let us say that we have a string of characters with the following features:

So, we have the string with 5 ones and 4 zeroes and we want to classify the characters of this string using their features. These features are colour (red or green in this case) and whether the observation (i.e. character) is underlined or not. Now, let us say that we are only interested in red and underlined observations. So, the decision tree would look something like this:

decision-tree-for-red-underlined-characters

So, we started with the colour first as we are only interested in the red observations and we separated the red and the green-coloured characters. After that, the “No” branch i.e. the branch that had all the green coloured characters was not expanded further as we want only red-underlined characters. So, we expanded the “Yes” branch and we again got a “Yes” and a “No” branch based on the fact whether the characters were underlined or not. 

So, this is how we draw a typical decision tree. However, the data in real life is not this clean but this was just to give an idea about the working of the decision trees. Let us now move to the random forest.

Random Forest

It consists of a large number of decision trees that operate as an ensemble. Basically, each tree in the forest gives a class prediction and the one with the maximum number of votes becomes the prediction of our model. For instance, in the example shown below, 4 decision trees predict 1, and 2 predict 0. Hence, prediction 1 will be considered.

random-forest-majority-voting

The underlying principle of a random forest is that several weak learners combine to form a strong learner. The steps to build a random forest are as follows:

  • Build several decision trees on the samples of data and record their predictions.
  • Each time a split is considered for a tree, choose a random sample of m predictors as the split candidates out of all the p predictors. This happens for every tree in the random forest.
  • Apply the rule of thumb: at each split, m = √p.
  • Aggregate the predictions using the majority-voting rule, as shown in the sketch below.
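
This scikit-learn sketch fits a random forest on synthetic data and uses the sqrt rule of thumb for the number of predictors considered at each split; all settings are arbitrary choices for illustration.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # 100 trees; each split considers a random subset of sqrt(n_features) predictors
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X_tr, y_tr)

    print("accuracy:", rf.score(X_te, y_te))
    print("feature importances:", rf.feature_importances_.round(3))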

20. In a time interval of 15-minutes, the probability that you may see a shooting star or a bunch of them is 0.2. What is the percentage chance of you seeing at least one star shooting from the sky if you are under it for about an hour?

Let us say that Prob is the probability that we may see a minimum of one shooting star in 15 minutes.

So, Prob = 0.2

Now, the probability that we may not see any shooting star in the time duration of 15 minutes is = 1 - Prob

1-0.2 = 0.8

The probability that we may not see any shooting star for an hour is: 

= (1-Prob)(1-Prob)(1-Prob)(1-Prob) = 0.8 * 0.8 * 0.8 * 0.8 = (0.8)⁴ ≈ 0.41

So, the probability that we will see at least one shooting star in the time interval of an hour is = 1 - 0.41 = 0.59

So, there is approximately a 59% (roughly 60%) chance that we may see a shooting star within an hour.
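
A tiny Python check of the arithmetic above, assuming (as the written solution does) that the four 15-minute windows are independent:

    p_15min = 0.2                       # chance of at least one shooting star in 15 minutes
    p_none_hour = (1 - p_15min) ** 4    # no star in four consecutive 15-minute windows
    print(1 - p_none_hour)              # 0.5904 -> roughly a 59-60% chance in an hour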

21. What is deep learning? What is the difference between deep learning and machine learning?

Deep learning is a paradigm of machine learning. In deep learning,  multiple layers of processing are involved in order to extract high features from the data. The neural networks are designed in such a way that they try to simulate the human brain. 

Deep learning has shown incredible performance in recent years because of the fact that it shows great analogy with the human brain.

The difference between machine learning and deep learning is that deep learning is a paradigm or a part of machine learning that is inspired by the structure and functions of the human brain, called artificial neural networks.

22. What is a Gradient and Gradient Descent?

Gradient: A gradient is a measure of how much the output changes with respect to a small change in the input. In other words, it is a measure of the change in the weights with respect to the change in error. The gradient can be mathematically represented as the slope of a function.


Gradient Descent: Gradient descent is a minimization algorithm that minimizes the cost (loss) function. It can minimize any differentiable function given to it, but it is usually applied to the model's loss function.

Gradient descent, as the name suggests, means descent or a decrease in something. The analogy of gradient descent is often taken as a person climbing down a hill/mountain. The following equation describes what gradient descent does:

b = a - \gamma \nabla F(a)

So, if a person is climbing down the hill, the next position that the climber has to come to is denoted by "b" in this equation, and "a" is the current position. The minus sign denotes minimization (as gradient descent is a minimization algorithm). The gamma (γ) is a weighting factor, i.e., the learning rate or step size, and the remaining gradient term, ∇F(a), shows the direction of the steepest descent.

This situation can be represented in a graph as follows:

gradient-descent-graph

Here, we are somewhere at the “Initial Weights” and we want to reach the Global minimum. So, this minimization algorithm will help us do that.

Data Science Interview Questions for Experienced

1. How are time series problems different from other regression problems?

  • Time series data can be thought of as an extension of linear regression, which uses terms like autocorrelation and moving averages to summarize historical data of the y-axis variable in order to predict a better future.
  • Forecasting and prediction is the main goal of time series problems where accurate predictions can be made but sometimes the underlying reasons might not be known.
  • Having Time in the problem does not necessarily mean it becomes a time series problem. There should be a relationship between target and time for a problem to become a time series problem.
  • Observations close to one another in time are expected to be more similar than observations far apart, which accounts for seasonality and autocorrelation. For instance, today's weather would be similar to tomorrow's weather but not to the weather 4 months from today. Hence, weather prediction based on past data becomes a time series problem.

2. What are RMSE and MSE in a linear regression model?

RMSE: RMSE stands for Root Mean Square Error. In a linear regression model, RMSE is used to test the performance of the machine learning model. It is used to evaluate the data spread around the line of best fit. So, in simple words, it is used to measure the deviation of the residuals.

RMSE is calculated using the formula:

RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2}

  • Yi is the actual value of the output variable.
  • Y(Cap) is the predicted value and,
  • N is the number of data points.

MSE: Mean Squared Error is used to find how close the line is to the actual data. We take the difference between each data point and the line's prediction and square it; the sum of these squared differences divided by the total number of data points gives us the Mean Squared Error (MSE).

So, if we are taking the squared difference of N data points and dividing the sum by N, what does it mean? Yes, it represents the average of the squared difference of a data point from the line i.e. the average of the squared difference between the actual and the predicted values. The formula for finding MSE is given below:

MSE = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2

  • Yi is the actual value of the output variable (the ith data point)
  • Y(cap) is the predicted value and,
  • N is the total number of data points.

So, RMSE is the square root of MSE .
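
A minimal sketch of MSE and RMSE on made-up actual and predicted values:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    y_actual = np.array([3.0, 5.0, 7.5, 10.0])
    y_predicted = np.array([2.8, 5.4, 7.0, 10.5])

    mse = mean_squared_error(y_actual, y_predicted)
    rmse = np.sqrt(mse)                 # RMSE is simply the square root of MSE
    print(mse, rmse)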

3. What are Support Vectors in SVM (Support Vector Machine)?

svm-support-vectors

In the above diagram, we can see that the thin lines mark the distance from the classifier to the closest data points (darkened data points). These are called support vectors. So, we can define the support vectors as the data points or vectors that are nearest (closest) to the hyperplane. They affect the position of the hyperplane. Since they support the hyperplane, they are known as support vectors.
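
A minimal scikit-learn sketch that fits a linear SVM on toy data and inspects its support vectors:

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, random_state=6)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # The training points closest to the separating hyperplane
    print(clf.support_vectors_)
    print("support vectors per class:", clf.n_support_)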

4. So, you have done some projects in machine learning and data science, and we see you are a bit experienced in the field. Let's say your laptop's RAM is only 4GB and you want to train your model on a 10GB data set. What will you do? Have you experienced such an issue before?

In such types of questions, we first need to ask what ML model we have to train. After that, it depends on whether we have to train a model based on Neural Networks or SVM.

The steps for Neural Networks are given below:

  • The data can be loaded as a memory-mapped NumPy array (e.g., np.load with mmap_mode). This never stores the entire data in RAM; it just creates a mapping to the data on disk.
  • Now, in order to get some desired data, pass the index into the NumPy Array.
  • This data can be used to pass as an input to the neural network maintaining a small batch size.

The steps for SVM are given below:

  • For SVM, smaller data sets can be obtained by dividing the big data set.
  • Each subset of the data set can be used as input via the partial_fit function (available for SGD-based linear models in scikit-learn).
  • Repeat the partial_fit step for the other subsets as well.

Now, you may describe the situation if you have faced such an issue in your projects or working in machine learning/ data science.
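
For illustration, a minimal out-of-core training sketch along the lines described above. The file names, array shapes, and chunk size are hypothetical, and SGDClassifier with a hinge loss stands in for a linear SVM because it supports partial_fit.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Memory-map the arrays on disk instead of loading 10GB into 4GB of RAM
    X = np.load("features.npy", mmap_mode="r")      # hypothetical pre-saved arrays
    y = np.load("labels.npy", mmap_mode="r")

    clf = SGDClassifier(loss="hinge")               # linear SVM trained with SGD
    classes = np.unique(y[:1000])                   # class labels must be given up front

    for start in range(0, X.shape[0], 10_000):      # stream the data in chunks
        end = start + 10_000
        clf.partial_fit(X[start:end], y[start:end], classes=classes)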

5. Explain Neural Network Fundamentals.

In the human brain, different neurons are present. These neurons combine and perform various tasks. The Neural Network in deep learning tries to imitate human brain neurons. The neural network learns the patterns from the data and uses the knowledge that it gains from various patterns to predict the output for new data, without any human assistance.

A perceptron is the simplest neural network that contains a single neuron that performs 2 functions. The first function is to perform the weighted sum of all the inputs and the second is an activation function.


There are some other neural networks that are more complicated. Such networks consist of the following three layers:

  • Input Layer: The neural network has the input layer to receive the input.
  • Hidden Layer: There can be multiple hidden layers between the input layer and the output layer. The initial hidden layers are used for detecting low-level patterns, whereas the later layers combine the outputs of previous layers to find more complex patterns.
  • Output Layer: This layer outputs the prediction.


6. What is Generative Adversarial Network?

This approach can be understood with the famous example of the wine seller. Let us say that there is a wine seller who has his own shop. This wine seller purchases wine from the dealers who sell him the wine at a low cost so that he can sell the wine at a high cost to the customers. Now, let us say that the dealers whom he is purchasing the wine from, are selling him fake wine. They do this as the fake wine costs way less than the original wine and the fake and the real wine are indistinguishable to a normal consumer (customer in this case). The shop owner has some friends who are wine experts and he sends his wine to them every time before keeping the stock for sale in his shop. So, his friends, the wine experts, give him feedback that the wine is probably fake. Since the wine seller has been purchasing the wine for a long time from the same dealers, he wants to make sure that their feedback is right before he complains to the dealers about it. Now, let us say that the dealers also have got a tip from somewhere that the wine seller is suspicious of them.

So, in this situation, the dealers will try their best to sell the fake wine whereas the wine seller will try his best to identify the fake wine. Let us see this with the help of a diagram shown below:

gan-wine-seller-analogy

From the image above, it is clear that a noise vector is entering the generator (dealer) and he generates the fake wine and the discriminator has to distinguish between the fake wine and real wine. This is a Generative Adversarial Network (GAN).

In a GAN, there are 2 main components viz. the Generator and the Discriminator. The generator is a neural network (often a CNN for image data) that keeps producing images, and the discriminator tries to identify the real images from the fake ones.

7. What is a computational graph?

A computational graph is also known as a "Dataflow Graph". Everything in the famous deep learning library TensorFlow is based on the computational graph. It is a network of nodes in which each node performs an operation: the nodes represent operations and the edges represent tensors.

8. What are auto-encoders?

Auto-encoders are learning networks. They transform inputs into outputs with the minimum possible error, so, basically, the output should be almost equal to, or as close as possible to, the input.

Multiple layers are added between the input and the output layer, and the layers in between are smaller than the input layer. An auto-encoder receives unlabelled input, which is encoded (compressed) and then decoded to reconstruct the input later.

9. What are Exploding Gradients and Vanishing Gradients?

  • Exploding Gradients: Let us say that you are training an RNN. Say, you saw exponentially growing error gradients that accumulate, and as a result of this, very large updates are made to the neural network model weights. These exponentially growing error gradients that update the neural network weights to a great extent are called Exploding Gradients .
  • Vanishing Gradients: Let us say again, that you are training an RNN. Say, the slope became too small. This problem of the slope becoming too small is called Vanishing Gradient . It causes a major increase in the training time and causes poor performance and extremely low accuracy.

10. What is the p-value and what does it indicate in the Null Hypothesis?

The p-value is a number between 0 and 1. In a statistical hypothesis test, the p-value tells us how strong the evidence is. The claim that is put on trial is called the Null Hypothesis.

  • A low p-value (less than or equal to 0.05) indicates strong evidence against the Null Hypothesis, which means the Null Hypothesis can be rejected.
  • A high p-value (greater than 0.05) indicates weak evidence against the Null Hypothesis, which means we fail to reject the Null Hypothesis.
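A small illustration with SciPy, using made-up data: a one-sample t-test whose p-value is compared against the 0.05 threshold described above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=5, size=40)        # made-up measurements

# H0: the population mean is 50.
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(p_value)
print("Reject H0" if p_value <= 0.05 else "Fail to reject H0")
```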

11. Since you have experience in the deep learning field, can you tell us why TensorFlow is the most preferred library in deep learning?

TensorFlow is a very famous library in deep learning, and the reason is fairly simple. It provides both C++ and Python APIs, which makes it very easy to work with. TensorFlow also has a fast compilation speed compared to other popular deep learning libraries such as Keras and Torch. In addition, TensorFlow supports both GPU and CPU computing devices. Hence, it is a major success and a very popular library for deep learning.

12. Suppose there is a dataset having variables with missing values of more than 30%, how will you deal with such a dataset?

Depending on the size of the dataset, we follow the below ways:

  • In case the dataset is small, the missing values are substituted with the mean or average of the remaining data. In pandas, this can be done with mean = df.mean(), where df is the pandas DataFrame representing the dataset and mean() calculates the mean of each column. The missing values can then be substituted with df.fillna(mean); see the sketch after this list.
  • For larger datasets, the rows with missing values can be removed and the remaining data used for modelling.
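A short sketch of both strategies with pandas, using a tiny made-up dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                   "salary": [50, 60, np.nan, 80, 90]})

# Small dataset: fill missing values with the column means.
df_imputed = df.fillna(df.mean(numeric_only=True))

# Larger dataset: simply drop the incomplete rows.
df_dropped = df.dropna()

print(df_imputed)
print(df_dropped)
```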

13. What is Cross-Validation?

Cross-validation is a statistical technique used for improving and assessing a model’s performance. The model is trained and tested in rotation on different samples of the training dataset to ensure that it performs well on unseen data: the training data is split into several groups, and the model is trained and validated against these groups in rotation.


The most commonly used techniques are:

  • K-fold method
  • Leave p-out method
  • Leave-one-out method
  • Holdout method
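For instance, a minimal K-fold cross-validation sketch with scikit-learn on a built-in dataset (the fold count and model choice are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train and validate in rotation over 5 folds, then average the scores.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))
print(scores, scores.mean())
```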

14. What are the differences between correlation and covariance?

Although these two terms are used for establishing a relationship and dependency between any two random variables, the following are the differences between them:

  • Correlation: This technique measures and estimates the quantitative relationship between two variables, expressed in terms of how strongly the variables are related.
  • Covariance: It represents the extent to which the two variables change together. It describes the systematic relationship between a pair of variables, where a change in one is reflected in a change in the other.

Mathematically, consider two random variables X and Y whose means are μX and μY, whose standard deviations are σX and σY, and let E denote the expected value operator. Then:

  • cov(X, Y) = E[(X − μX)(Y − μY)]
  • corr(X, Y) = E[(X − μX)(Y − μY)] / (σX σY) = cov(X, Y) / (σX σY)

Based on the above formulas, we can deduce that correlation is dimensionless, whereas covariance is expressed in units obtained by multiplying the units of the two variables.
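A quick numeric check of these formulas with NumPy, using made-up data (population versions of covariance and standard deviation):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

cov_xy = np.cov(x, y, bias=True)[0, 1]            # E[(X - muX)(Y - muY)]
corr_xy = cov_xy / (x.std() * y.std())            # dimensionless, always in [-1, 1]
print(cov_xy, corr_xy, np.corrcoef(x, y)[0, 1])   # corrcoef matches the manual value
```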


15. How do you approach solving any data analytics based project?

Generally, we follow the below steps:

  • The first step is to thoroughly understand the business requirement/problem.
  • Next, explore the given data and analyze it carefully. If any data is missing, get the requirements clarified with the business.
  • Perform data cleanup and preparation next, which is then used for modelling. Here, missing values are handled and variables are transformed.
  • Run the model against the data, build meaningful visualizations, and analyze the results to extract meaningful insights.
  • Release the model implementation, and track the results and performance over a specified period to analyze its usefulness.
  • Perform cross-validation of the model.

Check out the list of data analytics projects.


16. How regularly must we update an algorithm in the field of machine learning?

We do not want to update an algorithm on a regular basis, because an algorithm is a well-defined step-by-step procedure for solving a problem; if the steps keep changing, it can no longer be called well defined. Frequent changes also create problems for systems that already implement the algorithm, since it becomes difficult to absorb continuous updates. So, we should update an algorithm only in the following cases:

  • If you want the model to evolve as data streams through infrastructure, it is fair to make changes to an algorithm and update it accordingly.
  • If the underlying data source is changing, it almost becomes necessary to update the algorithm accordingly.
  • If there is a case of non-stationarity, we may update the algorithm.
  • One of the most important reasons for updating any algorithm is its underperformance and lack of efficiency. So, if an algorithm lacks efficiency or underperforms it should be either replaced by some better algorithm or it must be updated.

17. Why do we need selection bias?

Selection bias happens when there is no proper randomization while picking a part of the dataset for analysis. The bias means that the sample analyzed does not represent the whole population that was meant to be analyzed.

  • For example, if the sample we selected does not fully represent the population of interest, we should question whether we have chosen the right data for the analysis.


18. Why is data cleaning crucial? How do you clean the data?

While running an algorithm on any data, to gather proper insights, it is very much necessary to have correct and clean data that contains only relevant information. Dirty data most often results in poor or incorrect insights and predictions which can have damaging effects.

For example, suppose that before launching a big marketing campaign, our data analysis tells us to target a product that in reality has no demand. If the campaign is launched anyway, it is bound to fail, resulting in a loss of revenue for the company. This is where the importance of having proper, clean data comes into the picture.

  • Cleaning data coming from multiple sources helps transform it into a format that data scientists can actually work with.
  • Properly cleaned data increases the accuracy of the model and leads to much better predictions.
  • If the dataset is very large, running models on it becomes cumbersome, and the cleanup step alone can take a large share of the effort (often cited as around 80% of the time), so it cannot simply be folded into the modelling run. Cleaning the data before running the model therefore increases the model’s speed and efficiency.
  • Data cleaning helps to identify and fix any structural issues in the data. It also helps in removing any duplicates and helps to maintain the consistency of the data.
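A small, hedged sketch of typical cleaning steps with pandas on a made-up dataframe (the column names and rules are illustrative):

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": ["25", "31", "31", None],        # stored as strings, one missing
    "country": ["IN", "in", "in", "US "],   # inconsistent casing / whitespace
})

clean = raw.drop_duplicates(subset="customer_id").copy()       # remove duplicate records
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")    # fix the data type
clean["age"] = clean["age"].fillna(clean["age"].median())      # impute missing ages
clean["country"] = clean["country"].str.strip().str.upper()    # consistent formatting
print(clean)
```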


19. What are the available feature selection methods for selecting the right variables for building efficient predictive models?

While using a dataset in data science or machine learning algorithms, not all the variables are necessary and useful to build a model. Smarter feature selection methods are required to avoid redundant features and to increase the efficiency of the model. The three main families of feature selection methods are filter, wrapper, and embedded methods.

  • Filter methods: These pick up only the intrinsic properties of features, measured via univariate statistics rather than cross-validated performance. They are straightforward, generally faster, and require fewer computational resources than wrapper methods.
  • There are various filter methods such as the Chi-square test, Fisher’s score, correlation coefficient, variance threshold, mean absolute difference (MAD), and dispersion ratios.


  • Wrapper methods: These greedily search over possible feature subsets, assessing the quality of each subset by training and evaluating a classifier on it.
  • The selection technique is built around the machine learning algorithm that the given dataset needs to fit.
  • Forward Selection: Here, features are added one at a time, and new features keep being added until a good fit is obtained.
  • Backward Selection: Here, we start with all the features and eliminate the poorly fitting ones one by one, checking at each step which subset works better.
  • Recursive Feature Elimination: The features are recursively evaluated and ranked by how well they perform, and the weakest are removed.
  • These methods are generally computationally intensive and require high-end resources for analysis, but they usually lead to better predictive models with higher accuracy than filter methods.


  • Embedded methods: These combine the advantages of both filter and wrapper methods by including feature interactions while maintaining reasonable computational costs.
  • These methods are iterative: each model iteration is examined, and the features contributing most to the training in that iteration are extracted.
  • Examples of embedded methods: LASSO Regularization (L1), Random Forest Importance.
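A sketch of one method from each family using scikit-learn; the dataset, the choice of 10 features, and the use of LassoCV here (a regression-style illustration on 0/1 labels) are all assumptions for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LassoCV, LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: rank features with a univariate statistic (ANOVA F-score).
filter_sel = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper method: recursive feature elimination around an estimator.
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded method: L1/LASSO shrinks some coefficients exactly to zero.
lasso = LassoCV(cv=5).fit(X, y)

print(filter_sel.get_support().sum(), wrapper_sel.get_support().sum(), (lasso.coef_ != 0).sum())
```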


20. During analysis, how do you treat the missing values?

To identify the extent of missing values, we first identify the variables that have them. If a pattern is identified, the analyst should concentrate on it, as it could lead to interesting and meaningful insights. If no patterns are identified, the missing values can be substituted with the mean or median values, or simply ignored.

If the variable is categorical, the common strategies for handling missing values include:

  • Assigning a New Category: You can assign a new category, such as "Unknown" or "Other," to represent the missing values.
  • Mode imputation: You can replace missing values with the mode, which represents the most frequent category in the variable.
  • Using a Separate Category: If the missing values carry significant information, you can create a separate category to indicate missing values.

It's important to select an appropriate strategy based on the nature of the data and the potential impact on subsequent analysis or modelling.

If 80% of the values are missing for a particular variable, then we would drop the variable instead of treating the missing values.
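For example, the categorical strategies above can be applied with pandas (made-up data):

```python
import pandas as pd

s = pd.Series(["red", "blue", None, "red", None])

as_unknown = s.fillna("Unknown")     # assign a new category
as_mode = s.fillna(s.mode()[0])      # mode imputation ("red" is the most frequent value)
print(as_unknown.tolist(), as_mode.tolist())
```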

21. Will treating categorical variables as continuous variables result in a better predictive model?

Yes, in some cases. A categorical variable is a variable that can be assigned to two or more categories with no definite ordering between them. An ordinal variable is similar, but with a proper, clear ordering defined over its categories. So, if the variable is ordinal, treating its values as continuous can result in better predictive models.
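For example, an ordinal variable can be given an ordered numeric encoding in pandas (the categories below are made up):

```python
import pandas as pd

size = pd.Series(["small", "large", "medium", "small"])
ordered = pd.Categorical(size, categories=["small", "medium", "large"], ordered=True)
print(ordered.codes)   # [0 2 1 0] -- an ordering-aware numeric feature
```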

22. How will you treat missing values during data analysis?

The impact of missing values can be known after identifying what type of variables have missing values.

  • If the data analyst finds any pattern in these missing values, then there are chances of finding meaningful insights.
  • If no patterns are found, the missing values can either be ignored or replaced with default values such as the mean, minimum, maximum, or median value.
  • Assigning a new category: You can assign a new category, such as "Unknown" or "Other," to represent the missing values.
  • Using a separate category: If the missing values carry significant information, you can create a separate category to indicate them. It's important to select an appropriate strategy based on the nature of the data and the potential impact on subsequent analysis or modelling.
  • If 80% of the values are missing, it is up to the analyst to either replace them with default values or drop the variable.

23. What does the ROC Curve represent and how to create it?

ROC (Receiver Operating Characteristic) curve is a graphical representation of the contrast between false-positive rates and true positive rates at different thresholds. The curve is used as a proxy for a trade-off between sensitivity and specificity.

The ROC curve is created by plotting the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 − specificity) at various thresholds. The TPR is the proportion of positive observations correctly predicted as positive out of all positive observations, while the FPR is the proportion of negative observations incorrectly predicted as positive out of all negative observations. In medical testing, for example, the TPR represents the rate at which people are correctly tested positive for a particular disease.
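A short scikit-learn sketch that computes the points of an ROC curve (the dataset and model are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)    # FPR vs TPR at each threshold
print(roc_auc_score(y_te, probs))                # area under that curve
# Plotting fpr against tpr (e.g. with matplotlib) draws the ROC curve itself.
```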


24. What are the differences between univariate, bivariate and multivariate analysis?

Statistical analyses are classified based on the number of variables processed at a given time: univariate analysis deals with a single variable (for example, a histogram of sales), bivariate analysis studies the relationship between two variables (for example, a scatter plot of temperature against sales), and multivariate analysis involves more than two variables at once (for example, a regression with several predictors).

25. What is the difference between the Test set and validation set?

The test set is used to test or evaluate the performance of the trained model; it measures the model's predictive power. The validation set is a part of the training set that is used to select model parameters (hyperparameters) and to avoid overfitting.
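A common way to carve out the three sets with scikit-learn (the 60/20/20 split below is just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First hold out the test set, then split a validation set off the remaining training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 30 / 10 / 10
```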

26. What do you understand by a kernel trick?

Kernel functions are generalized dot-product functions used to compute the dot product of vectors x and y in a high-dimensional feature space. The kernel trick solves a non-linear problem with a linear classifier by implicitly transforming linearly inseparable data into a higher-dimensional space where it becomes separable.
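A small scikit-learn illustration: concentric circles are not linearly separable, but an RBF kernel handles them without explicitly computing the high-dimensional feature map (the dataset parameters are illustrative, and training accuracy is used only for a quick comparison).

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

print(SVC(kernel="linear").fit(X, y).score(X, y))   # typically a poor fit
print(SVC(kernel="rbf").fit(X, y).score(X, y))      # typically a near-perfect fit
```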


27. Differentiate between box plot and histogram.

Box plots and histograms are both visualizations of data distributions used to communicate information efficiently. A histogram is a bar-chart representation of the frequency of a numerical variable's values; it is useful for estimating the probability distribution, variation, and outliers. A box plot communicates different aspects of the distribution: the exact shape is not visible, but insights such as the median, spread, and outliers can still be gathered. Box plots take less space than histograms, so they are useful for comparing several distributions at the same time.


28. How will you balance/correct imbalanced data?

There are different techniques to correct or balance imbalanced data: the number of samples of the minority classes can be increased, or the number of samples of the classes with extremely many data points can be decreased. Before resampling, it also helps to use evaluation metrics that are more informative than plain accuracy, such as the following:

  • Specificity/Precision: Indicates the proportion of selected instances that are relevant.
  • Sensitivity (Recall): Indicates the proportion of relevant instances that are selected.
  • F1 score: It represents the harmonic mean of precision and sensitivity.
  • MCC (Matthews correlation coefficient): It represents the correlation coefficient between observed and predicted binary classifications.
  • AUC (Area Under the Curve): This represents a relation between the true positive rates and false-positive rates.

For example, consider a training set in which 99.9% of the labels are "0". If we measure accuracy in terms of predicting "0"s, the model's accuracy would be very high (99.9%), yet the model would not provide any valuable information. In such cases, we apply the different evaluation metrics stated above instead of plain accuracy.


  • Under-sampling: This balances the data by reducing the size of the abundant class and is used when the data quantity is sufficient. A new, balanced dataset is obtained, which can then be used for further modeling.
  • Over-sampling: This is used when the data quantity is not sufficient. The dataset is balanced by increasing the size of the rare class: instead of getting rid of abundant samples, new rare samples are generated and introduced through repetition, bootstrapping, etc.
  • Perform K-fold cross-validation correctly: Cross-validation needs to be applied properly when over-sampling is used. The data should be split into folds before over-sampling; otherwise information leaks between folds and the model effectively overfits to a specific result. To avoid this, resampling is done repeatedly, within each fold, with different ratios.
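A sketch of random over-sampling with scikit-learn's resample utility on made-up, imbalanced labels:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)                  # 95% vs 5%: imbalanced

# Randomly re-draw minority samples (with replacement) until the classes match.
X_min_up, y_min_up = resample(X[y == 1], y[y == 1], replace=True, n_samples=950, random_state=0)

X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))                           # [950 950]
```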

29. What is better - random forest or multiple decision trees?

A random forest is usually better than multiple independent decision trees: it is more robust and accurate and less prone to overfitting, because it is an ensemble method in which many weak decision trees are combined to learn a strong model.

30. Consider a case where you know the probability of finding at least one shooting star in a 15-minute interval is 30%. Evaluate the probability of finding at least one shooting star in a one-hour duration?

The probability of seeing at least one shooting star in a 15-minute interval is 0.3, so the probability of seeing none in 15 minutes is 1 − 0.3 = 0.7. Assuming the four 15-minute intervals in an hour are independent, the probability of seeing no shooting star in the whole hour is 0.7^4 = 0.2401. So the probability of seeing at least one shooting star in an hour is 1 − 0.2401 = 0.7599, i.e. about 76%.
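A one-line check of this calculation:

```python
# P(no shooting star in 15 min) = 0.7; the four quarter-hours are assumed independent.
print(1 - 0.7 ** 4)   # 0.7599
```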

31. Toss the selected coin 10 times from a jar of 1000 coins. Out of 1000 coins, 999 coins are fair and 1 coin is double-headed, assume that you see 10 heads. Estimate the probability of getting a head in the next coin toss.

We know that there are two types of coins - fair and double-headed. Hence, there are two possible ways of choosing a coin. The first is to choose a fair coin and the second is to choose a coin having 2 heads.

P(selecting a fair coin) = 999/1000 = 0.999
P(selecting the double-headed coin) = 1/1000 = 0.001

Using Bayes' rule:

P(10 heads | fair coin) = (1/2)^10 = 1/1024 ≈ 0.000977, and P(10 heads | double-headed coin) = 1.

P(fair coin | 10 heads) = (0.999 × 0.000977) / (0.999 × 0.000977 + 0.001 × 1) ≈ 0.4939
P(double-headed coin | 10 heads) ≈ 1 − 0.4939 = 0.5061

P(head on the next toss) = 0.4939 × 0.5 + 0.5061 × 1 ≈ 0.7531

So, the answer is 0.7531, or about 75.3%.
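A quick numeric check of the computation above:

```python
p_fair, p_double = 0.999, 0.001
p_10h_fair, p_10h_double = 0.5 ** 10, 1.0

p_fair_given = p_fair * p_10h_fair / (p_fair * p_10h_fair + p_double * p_10h_double)
p_next_head = p_fair_given * 0.5 + (1 - p_fair_given) * 1.0
print(round(p_next_head, 4))   # 0.7531
```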

32. What are some examples when false positive has proven important than false negative?

Before citing instances, let us understand what are false positives and false negatives.

  • False Positives are those cases that were wrongly identified as an event even if they were not. They are called Type I errors.
  • False Negatives are those cases that were wrongly identified as non-events despite being an event. They are called Type II errors.

Some examples where false positives are more important than false negatives:

  • In the medical field: Consider a lab report that predicts cancer for a patient who does not actually have cancer. This is a false positive error. Starting chemotherapy for that patient is dangerous, as chemotherapy would damage healthy cells and might even lead to cancer itself.
  • In the e-commerce field: Suppose a company starts a campaign that gives $100 gift vouchers to customers who purchase $10,000 worth of items, assuming it will yield at least 20% profit on those sales. If the vouchers are mistakenly given to customers who have not purchased anything but were wrongly marked as having spent $10,000, that is a false positive error and directly costs the company money.

33. Give one example where both false positives and false negatives are important equally?

In banking: Lending is a main source of income for banks, but if the repayment rate is poor, the bank risks huge losses instead of profit. Giving out loans is therefore a gamble: a bank cannot afford to lose good customers, yet at the same time it cannot afford to acquire bad customers. This is a classic example where false positives and false negatives are equally important.

34. Is it good to do dimensionality reduction before fitting a Support Vector Machine?

If the number of features is greater than the number of observations, then performing dimensionality reduction before fitting an SVM (Support Vector Machine) generally improves it.

35. What are various assumptions used in linear regression? What would happen if they are violated?

Linear regression is done under the following assumptions:

  • The sample data used for modeling represents the entire population.
  • There exists a linear relationship between the X-axis variable and the mean of the Y variable.
  • The residual variance is the same for all values of X. This is called homoscedasticity.
  • The observations are independent of one another.
  • Y is distributed normally for any value of X.

Extreme violations of the above assumptions make the results unreliable or misleading. Smaller violations result in greater variance or bias of the estimates.
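A hedged sketch with statsmodels on made-up data that satisfies the assumptions; the Durbin–Watson statistic (close to 2) is one quick check of residual independence.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(0, 1, 200)        # linear relationship + normal errors

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params)                          # roughly [3, 2]
print(durbin_watson(model.resid))            # ~2 suggests independent residuals
```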

36. How is feature selection performed using the regularization method?

Regularization adds penalties to the parameters of a machine learning model, reducing the model's freedom and thereby avoiding overfitting. There are various regularization methods, such as linear model regularization and Lasso/L1 regularization. Linear model regularization applies a penalty to the coefficients that multiply the predictors. Lasso/L1 regularization can shrink some coefficients exactly to zero, making those features eligible to be removed from the model; this is how it performs feature selection.

37. How do you identify if a coin is biased?

To identify this, we perform a hypothesis test. The null hypothesis states that the coin is unbiased, i.e. the probability of flipping heads is 50%. The alternative hypothesis states that the coin is biased, i.e. the probability is not equal to 50%. Then perform the below steps:

  • Flip the coin 500 times and record the number of heads.
  • Calculate the p-value.
  • If the p-value > alpha (for example 0.05): the null hypothesis holds good and the coin is unbiased.
  • If the p-value < alpha: the null hypothesis is rejected and the coin is biased.
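A sketch with SciPy (binomtest requires SciPy 1.7 or newer); the 275 heads out of 500 flips are made-up numbers.

```python
from scipy import stats

# H0: p(heads) = 0.5. Observed: 275 heads in 500 flips.
result = stats.binomtest(k=275, n=500, p=0.5, alternative="two-sided")
print(result.pvalue)   # roughly 0.03 < 0.05 -> reject H0: the coin looks biased
```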

38. What is the importance of dimensionality reduction?

Dimensionality reduction is the process of reducing the number of features in a dataset in order to avoid overfitting and reduce variance. Its main advantages are:

  • This reduces the storage space and time for model execution.
  • Removes the issue of multi-collinearity thereby improving the parameter interpretation of the ML model.
  • Makes it easier for visualizing data when the dimensions are reduced.
  • Avoids the curse of dimensionality.
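For example, with scikit-learn's PCA (the dataset and the 95% variance threshold are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)            # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)   # 30 features reduced to far fewer components
```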

39. How is the grid search parameter different from the random search tuning strategy?

Tuning strategies are used to find the right set of hyperparameters. Hyperparameters are those properties that are fixed and model-specific before the model is tested or trained on the dataset. Both the grid search and random search tuning strategies are optimization techniques to find efficient hyperparameters.

  • Grid search: Here, every combination from a preset list of hyperparameter values is tried out and evaluated.
  • The search pattern is similar to searching in a grid: the values form a matrix, each parameter set is tried out, and its accuracy is tracked. After every combination has been tried, the model with the highest accuracy is chosen as the best one.
  • The main drawback is that the technique suffers as the number of hyperparameters increases: the number of evaluations can grow exponentially with each additional hyperparameter. This is called the problem of dimensionality in a grid search.


  • Random search: In this technique, random combinations of hyperparameter values are tried and evaluated to find the best solution. To optimize the search, the model is tested at random configurations across the parameter space.
  • Because the configurations are sampled randomly, there is a good chance of finding near-optimal parameters without exhaustively evaluating every combination.
  • This search works best when there is a lower number of dimensions, as it takes less time to find the right set.
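A compact scikit-learn sketch of both strategies (the model, parameter ranges, and n_iter are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Grid search: every combination in a preset grid is evaluated.
grid = GridSearchCV(model, {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X, y)

# Random search: a fixed number of random configurations is evaluated.
rand = RandomizedSearchCV(model, {"C": loguniform(1e-3, 1e2)},
                          n_iter=10, cv=5, random_state=0).fit(X, y)
print(grid.best_params_, rand.best_params_)
```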


Conclusion:

Data Science is a very vast field comprising many topics such as data mining, data analysis, data visualization, machine learning, and deep learning, and it rests on a foundation of mathematical concepts such as linear algebra and statistical analysis. Because becoming a good professional data scientist has many prerequisites, the perks and benefits are correspondingly large, and data scientist has become one of the most sought-after job roles today.

Looking for a comprehensive course on Data Science? Check out Scaler’s Data Science Course.

Useful Resources:

  • Best Data Science Courses
  • Python Data Science Interview Questions
  • Google Data Scientist Salary
  • Spotify Data Scientist Salary
  • Data Scientist Salary
  • Data Science Resume
  • Data Analyst: Career Guide
  • Tableau Interview
  • Additional Technical Interview Questions

1. How do I prepare for a data science interview?

Some of the preparation tips for data science interviews are as follows:

  • Resume Building: First, prepare your resume well; preferably keep it to a single page, especially as a fresher, and give careful thought to the format, as it matters a lot. Data science interviews tend to focus on topics such as linear and logistic regression, SVM, root cause analysis, and random forest, so prepare well for data-science-specific questions like those discussed in this article, make sure your resume mentions these important topics, and know them thoroughly. Your resume should also contain some data science projects: a group project or internship in the field is ideal, but personal projects also make a good impression, so aim for at least 2-3 projects that demonstrate your skill and knowledge level. Do not list any skill you do not actually possess; if you are only familiar with a technology and have not studied it at an advanced level, mark it with a beginner tag.
  • Prepare Well: Apart from the data science questions, freshers in particular can expect questions on core subjects such as Database Management Systems (DBMS), Operating Systems (OS), Computer Networks (CN), and Object-Oriented Programming (OOPS), so prepare well for those too.
  • Data structures and algorithms are the basic building blocks of programming, so you should be well versed in them as well.
  • Research the Company: This is the tip most people miss, and it is very important. If you are interviewing with any company, read about it beforehand; for data science roles in particular, learn which libraries the company uses, what kinds of models they build, and so on. This gives you an edge over most other candidates.

2. Are data science interviews hard?

An honest reply is “yes”. This is because the field is newly emerging and will keep evolving. In almost every interview, you have to answer many tough and challenging questions with full confidence, and your concepts must be strong enough to satisfy the interviewer. However, with enough practice anything can be achieved, so follow the tips discussed above and keep practising and learning; you will definitely succeed.

3. What are the top 3 technical skills of a data scientist?

The top 3 skills of a data scientist are:

  • Mathematics: Data science requires a lot of mathematics and a good data scientist is strong in it. It is not possible to become a good data scientist if you are weak in mathematics.
  • Machine Learning and Deep Learning : A data scientist should be very skilled in Artificial Intelligence technologies like deep learning and machine learning. Some good projects and a lot of hands-on practice will help in achieving excellence in that field.
  • Programming: This is an obvious yet most important skill. Being good at problem solving does not by itself mean a person can write good code; programming is the ability to write clean, industry-standard code. This is the skill most freshers lack because of limited exposure to industry-level code, and it also improves with practice and experience.

4. Is data science a good career?

Yes, data science is one of the most future-proof and rewarding career fields; today, tomorrow, and even years from now, the field will only keep expanding. The reason is simple: data can be compared to gold, as it is the key to selling almost everything in the world, and data scientists know how to turn that data into outputs that were not even imaginable before. That makes it a great career.

5. Are coding questions asked in data science interviews?

Yes, coding questions are asked in data science interviews. It is also worth noting that data scientists are generally strong problem solvers, since their work involves a lot of rigorous mathematics. Hence, interviewers expect data science candidates to know data structures and algorithms and to come up with solutions to most of the problems they are given.

6. Is python and SQL enough for data science?

Yes, Python and SQL are sufficient for data science roles. However, knowing the R programming language can also have a positive impact: if you know all three, you have an edge over most competitors. That said, Python and SQL are enough for data science interviews.

7. What are Data Science tools?

There are various data science tools available in the market nowadays, and several of them are widely used. TensorFlow is one of the most famous; other well-known tools include BigML, SAS (Statistical Analysis System), KNIME, scikit-learn, and PyTorch.

