
Shashank Shanu

9 months ago

#### Data Science Interview Questions

After months and years of learning, one of the most important parts of your data science journey is the interview process. Interviews are a rigorous process in which candidates are judged on different areas of expertise, such as technical and coding skills, knowledge and clarity of basic data science concepts, statistics, machine learning, and more. If you are going to apply for data science jobs, it is very important to know what kind of questions interviewers, recruiters, and hiring managers may ask.

In this article, I will cover the top 10 questions an interviewer may ask during your interview process.

So, without wasting much time, let's start…

Univariate, Bivariate and Multivariate Analysis

Univariate analysis involves a single variable, so it cannot examine relationships or causes. It is mostly used to summarize the data and find patterns within it that support actionable decisions.

Bivariate analysis deals with the relationship between two variables. These paired variables come from related sources or samples. Bivariate analysis tests the strength of the correlation between the two variables.

Multivariate analysis tries to find relationships among more than two variables. In the real world, this is the most important and most widely used type of analysis.
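To make the bivariate case concrete, here is a minimal sketch of the Pearson correlation coefficient (the usual measure of linear correlation strength between two paired variables) in plain Python; the `hours`/`score` numbers are invented for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative paired data: study hours vs. exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 58, 63, 71, 80]
print(round(pearson_r(hours, score), 3))  # close to 1: strong positive correlation
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates a weak one.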

Normal Distribution

The normal distribution curve is symmetrical. As the sample size increases, the distribution of sample means approaches a normal distribution even when the underlying data are not normal; this is known as the Central Limit Theorem. The theorem is easy to apply, and it helps make sense of random data by creating order and interpreting the results with a bell-shaped graph.
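The theorem can be sketched with Python's standard `random` module: means of repeated samples drawn from a clearly non-normal (uniform) population cluster tightly around the population mean. The sample size and trial count below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)

# A clearly non-normal (uniform) population.
population = [random.uniform(0, 1) for _ in range(10_000)]

def sample_means(n, trials=1000):
    """Means of `trials` random samples of size n from the population."""
    return [statistics.mean(random.sample(population, n)) for _ in range(trials)]

means_30 = sample_means(30)
# Sample means center on the population mean, with spread shrinking
# roughly as 1/sqrt(n) -- the Central Limit Theorem at work.
print(statistics.stdev(means_30) < statistics.stdev(population))  # True
```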

Linear Regression

Linear regression consists of the following three steps:

- Determining and analyzing the correlation and direction of the data.
- Deploying the estimation of the model.
- Ensuring the usefulness and validity of the model.

It is extensively used in scenarios where a cause-and-effect model comes into play: for example, when you want to know the effect of a certain action in order to determine the extent to which the cause influences the final outcome.
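To illustrate the estimation step, here is a closed-form ordinary-least-squares fit for a single predictor in plain Python; the `spend`/`sales` numbers are made up so the fit comes out exact:

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (closed-form solution)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Illustrative cause-effect data: advertising spend vs. sales.
spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
a, b = fit_line(spend, sales)
print(a, b)  # intercept 1.0, slope 2.0 (the data lie exactly on a line)
```

The slope `b` quantifies the effect: each extra unit of spend is associated with `b` extra units of sales.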

R-squared

R-squared is the percentage of the response-variable variation that is explained by a linear model. It is always between 0% and 100%:

- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all of the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data.
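The definition above can be computed directly from residuals; a small sketch in plain Python, with invented predictions:

```python
def r_squared(y, y_hat):
    """Fraction of response variation explained by the model (R-squared)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total
    return 1 - ss_res / ss_tot

y = [3, 5, 7, 10]          # observed responses
y_hat = [3.1, 4.9, 7.2, 9.8]  # model predictions (made up)
print(round(r_squared(y, y_hat), 3))  # close to 1: the fit is tight
```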

Machine Learning

Machine learning is the
scientific study of algorithms and statistical models that computer systems use
to effectively perform a specific task without using explicit instructions,
relying on patterns and inference instead.

In practice, we build a model by learning the patterns in historical data and the relationships within it, and then use the model to make data-driven predictions. Machine learning is commonly divided into three types:

- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning

In a supervised learning model, the algorithm learns from a labelled dataset in order to generate reasonable predictions for new data (forecasting the outcome of new data). The two main supervised tasks are:

- Regression
- Classification

An unsupervised model, in contrast, is given unlabeled data, which the algorithm tries to make sense of by extracting features, co-occurrences, and underlying patterns on its own. We use unsupervised learning for:

- Clustering
- Anomaly detection
- Association
- Autoencoders

Reinforcement learning is less supervised: a learning agent determines the output by exploring different possible ways to arrive at the best possible solution.

Mean Square Error

Mean squared error (MSE) is the average of the squared differences between the predicted and the actual values. The smaller the MSE, the closer the model's predictions are to the data.

Logistic vs. Linear Regression

Linear regression models the data with a continuous numeric output, whereas logistic regression models binary outcomes.

Linear regression requires a linear relationship between the dependent and independent variables; this is not necessary for logistic regression.

In linear regression, the independent variables can be correlated with each other. In logistic regression, by contrast, the variables should not be correlated with each other.
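One way to see the first difference: logistic regression passes the same kind of linear score through a sigmoid, so the output lands in (0, 1) and can be read as a class probability. The weights below are arbitrary illustrative values, not a fitted model:

```python
import math

def linear_predict(x, w=0.8, b=-2.0):
    # Linear regression output: any real number.
    return w * x + b

def logistic_predict(x, w=0.8, b=-2.0):
    # Logistic regression squashes the same linear score into (0, 1)
    # with the sigmoid function.
    return 1 / (1 + math.exp(-linear_predict(x, w, b)))

for x in [0, 2.5, 5]:
    print(linear_predict(x), round(logistic_predict(x), 3))
```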

Decision Trees for Numerical and Categorical Data

Every split in a decision tree is based on a feature.

1. **If the feature is categorical, the split is done with the elements belonging to a particular class.**
2. **If the feature is continuous, the split is done with the elements higher than a threshold.**

At every split, the decision tree picks the best variable at that moment, according to an impurity measure computed over the split branches. Whether the variable used for the split is categorical or continuous is irrelevant (in fact, decision trees handle continuous variables by creating binary regions with a threshold).
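The threshold split for a continuous feature can be sketched with Gini impurity in plain Python; the toy values and labels below are invented, and the split search is the simple exhaustive version:

```python
def gini(labels):
    """Gini impurity of a set of class labels (0 means pure)."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Best binary split of a continuous feature by weighted Gini impurity."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # candidate thresholds
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_threshold(values, labels))  # splits cleanly at 3.0 with impurity 0.0
```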

Finally, a good approach is to always convert your **categorical variables to numeric** using **LabelEncoder** or **OneHotEncoding**.

Treating Missing Values

- Understand the problem statement and the data before giving an answer. Getting into the data is important.
- Assign a default value, which can be the mean, minimum, or maximum of the variable.
- If it is a categorical variable, assign the missing entries a default value.
- If the data follows a normal distribution, impute the mean value.
- Whether we should treat missing values at all is another important point to consider: if 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
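A simple sketch of default-value imputation in plain Python, assuming missing entries are represented as `None`; the `impute` helper and its `default` fallback are hypothetical names chosen for this example:

```python
import statistics

def impute(column, default="unknown"):
    """Fill None entries: mean for numeric columns, a default for categorical.

    Assumes the column has at least one non-missing value.
    """
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = statistics.mean(present)   # numeric: impute the mean
    else:
        fill = default                    # categorical: impute a default value
    return [fill if v is None else v for v in column]

print(impute([1.0, None, 3.0]))       # numeric -> mean
print(impute(["red", None, "blue"]))  # categorical -> "unknown"
```

In libraries such as pandas or scikit-learn, the same idea corresponds to `fillna` or a `SimpleImputer` strategy.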

Cleaning Data from Multiple Sources

These are some of the most common interview questions and answers asked by interviewers. But there are many other areas an interviewer may ask about, so it is very important to be well prepared before facing an interview round.

I hope you enjoyed reading this article. Later, I will try to bring you more interesting and important data science interview questions.

For more such blogs and courses on data science, machine learning, artificial intelligence, and emerging new technologies, do visit us at InsideAIML.

Thanks for reading…

Happy Learning…