1. What is the difference between Supervised and Unsupervised machine learning?
Supervised learning requires training labeled data. For example, in order to do classification (a supervised learning task), you’ll need to first label the data you’ll use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling data explicitly.
2. What is the difference between classification and regression?
Classification is used to produce discrete results, classification is used to classify data into some specific categories .for example classifying e-mails into spam and non-spam categories. Whereas, We use regression analysis when we are dealing with continuous data, for example predicting stock prices at a certain point of time.
3. What is meant by ‘Training set’ and ‘Test Set’?
‘Training set’ is the portion of the dataset used to train the model. ‘Testing set’ is the portion of the dataset used to test the trained model.
4. How do you handle missing or corrupted data in a dataset?
You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value.
In Pandas, there are two very useful methods: isnone() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
5. How is KNN different from k-means clustering
K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data you want to classify an unlabeled point into (thus the nearest neighbor part). K-means clustering requires only a set of unlabeled points and a threshold: the algorithm will take unlabeled points and gradually learn how to cluster them into groups by computing the mean of the distance between different points.
The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn’t — and is thus unsupervised learning. KNN algorithm tries to classify an unlabeled observation based on its k (can be any number ) surrounding neighbors. It is also known as a lazy learner because it involves minimal training of the model. Hence, it doesn’t use training data to make generalizations on the unseen data set.
6. What is the main advantage of Naive Bayes?
A Naive Bayes classifier converges very quickly as compared to other models like logistic regression. As a result, we need less training data in the case of naive Bayes classifier.
7. What’s the difference between Type I and Type II error?
Don’t think that this is a trick question! Many machine learning interview questions will be an attempt to lob basic questions at you just to make sure you’re on top of your game and you’ve prepared all of your bases.
Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is.
A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.
8. What’s the difference between a generative and discriminative model?
A generative model will learn categories of data while a discriminative model will simply learn the distinction between different categories of data.
Discriminative models will generally outperform generative models on classification tasks.
9. What are Parametric models?
Parametric models are those with a finite number of parameters. To predict new data, you only need to know the parameters of the model. Examples include linear regression, logistic regression, and linear SVMs.
Non-parametric models are those with an unbounded number of parameters, allowing for more flexibility. To predict new data, you need to know the parameters of the model and the state of the data that has been observed. Examples include decision trees, k-nearest neighbors, and topic models using latent Dirichlet analysis.
10. How to ensure that your model is not overfitting?
Keep the design of the model simple. Try to reduce the noise in the model by considering fewer variables and parameters. Cross-validation techniques such as K-folds cross-validation help us keep overfitting under control. Regularization techniques such as LASSO help in avoiding overfitting by penalizing certain parameters if they are likely to cause overfitting.
11. How Much Data You should have to use For Training and Testing your Model?
You have to find a balance, and there's no right answer for every problem.
If your test set is too small, you'll have an unreliable estimation of model performance (performance statistic will have high variance). If your training set is too small, your actual model parameters will have a high variance.
A good rule of thumb is to use an 80/20 train/test split. Then, your train set can be further split into train/validation or into partitions for cross-validation.
12. What should you do when your model is suffering from low bias and high variance?
When the model’s predicted value is very close to the actual value the condition is known as low bias. In this condition, we can use bagging algorithms like random forest regressor
13. What Is Bagging Algorithm?
Bagging, or Bootstrap Aggregating, is an ensemble method in which the dataset is first divided into multiple subsets through resampling.
Then, each subset is used to train a model, and the final predictions are made through voting or averaging the component models.
Bagging is performed in parallel.
14. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?
Low bias occurs when the model’s predicted values are near to actual values. In other words, the model becomes flexible enough to mimic the training data distribution. While it sounds like a great achievement, but not to forget, a flexible model has no generalization capabilities. It means, when this model is tested on unseen data, it gives disappointing results.
In such situations, we can use a bagging algorithm (like random forest) to tackle high variance problems. Bagging algorithms divide a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression).
Also, to combat high variance, we can:
Use the regularization techniques, where higher model coefficients get penalized, hence lowering model complexity.
Use top n features from the variable importance chart. Maybe, with all the variables in the data set, the algorithm is having difficulty in finding a meaningful signal.
15. List Down Advantages and Disadvantages of Neural Network
Advantages: Neural networks (specifically deep NNs) have led to performance breakthroughs for unstructured datasets such as images, audio, and video. Their incredible flexibility allows them to learn patterns that no other ML algorithm can learn.
Disadvantages: However, they require a large amount of training data to converge. It's also difficult to pick the right architecture, and the internal "hidden" layers are incomprehensible.
16. How do you think Google is training data for self-driving cars?
Google is currently using Recaptcha to source labeled data on storefronts and traffic signs. They are also building on training data collected by Sebastian Thrun at GoogleX — some of which was obtained by his grad students driving buggies on desert dunes!
17. How would you evaluate a logistic regression model?
A subsection of the question above. You have to demonstrate an understanding of what the typical goals of a logistic regression are (classification, prediction, etc.) and bring up a few examples and use cases.
18. What is Convex Hull?
In the case of linearly separable data, the convex hull represents the outer boundaries of the two groups of data points. Once the convex hull is created, we get maximum margin hyperplane (MMH) as a perpendicular bisector between two convex hulls.
MMH is the line which attempts to create the greatest separation between two groups.
Questions you should know before starting a data science career.