## Interview Questions and Answers

- Suppose a friend invites you to a party where you meet total strangers. Since you know nothing about them, you mentally classify them on the basis of gender, age group, clothing, and so on.
- In this scenario, the strangers represent unlabeled data, and the process of classifying unlabeled data points is nothing but unsupervised learning.
- Since you did not use any prior knowledge about these people and classified them on the fly, this is an unsupervised learning (clustering) problem.
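In code, this on-the-fly grouping is what a clustering algorithm does. Below is a minimal k-means sketch in NumPy; the two-dimensional "guest features" and the choice of k = 2 are made-up illustrations, not part of the original text:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means: alternate between assigning points to the nearest
    center and moving each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# two obvious "groups of strangers" in a made-up feature space
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centers = kmeans(X, k=2)
```

The algorithm never sees labels; it discovers the two groups from the data alone, which is exactly the party scenario.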

- Selection bias is a statistical error that introduces bias into the sampling portion of an experiment.
- The error causes one sampling group to be selected more often than the other groups included in the experiment.
- If it is not identified, selection bias can produce inaccurate conclusions.

Let me explain this with an analogy:

- Imagine that your girlfriend has given you a birthday surprise every year for the last 10 years. One day, she asks you: ‘Sweetie, do you remember all the birthday surprises from me?’
- To stay on good terms with your girlfriend, you need to recall all 10 events from memory. Therefore, **recall** is the ratio of the number of events you can correctly recall to the total number of events.
- If you can recall all 10 events correctly, your recall ratio is 1.0 (100%); if you can recall 7 events correctly, your recall ratio is 0.7 (70%).
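In classifier terms, the surprises you remember are true positives and the ones you forget are false negatives, so the same ratio can be computed as a one-liner (the counts below just restate the analogy):

```python
def recall(true_positives, false_negatives):
    """Recall = events correctly recalled / total events that occurred."""
    return true_positives / (true_positives + false_negatives)

print(recall(7, 3))   # 7 of 10 surprises remembered -> 0.7
print(recall(10, 0))  # all 10 remembered -> 1.0
```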

Let’s consider a scenario of a fire emergency:

**True Positive:** the alarm goes off when there is a fire.

*Fire is positive and the prediction made by the system is true.*

**False Positive:** the alarm goes off, but there is no fire.

*The system predicted fire (positive), which is wrong, so the prediction is false.*

**False Negative:** the alarm does not ring, but there is a fire.

*The system predicted no fire (negative), which is false since there was a fire.*

**True Negative:** the alarm does not ring and there is no fire.

*The fire is negative and this prediction is true.*

- TN = True Negative
- TP = True Positive
- FN = False Negative
- FP = False Positive
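Using those abbreviations, the four counts can be tallied directly from lists of actual outcomes and alarm predictions; the toy labels below (1 = fire, 0 = no fire) are illustrative:

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = fire, 0 = no fire)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 1, 0, 1]  # whether there really was a fire
predicted = [1, 1, 0, 0, 1]  # whether the alarm rang
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```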

- *Inductive learning is the process of using observations to draw conclusions.*
- *Deductive learning is the process of using conclusions to form observations.*

It depends on the question as well as on the domain for which we are trying to solve the problem. If you’re using Machine Learning in the domain of medical testing, then a false negative is very risky, since the report will not show any health problem when a person is actually unwell. Similarly, if Machine Learning is used in spam detection, then a false positive is very risky because the algorithm may classify an important email as spam.

Well, you must know that model accuracy is only a subset of model performance. While better overall performance usually goes hand in hand with more accurate predictions, accuracy alone can be misleading: on an imbalanced dataset, a model that always predicts the majority class scores high accuracy yet performs poorly, so metrics such as precision and recall should be considered alongside it.

- Gini Impurity and Entropy are the metrics used for deciding how to split a Decision Tree.
- The Gini measure is the probability that a randomly chosen sample would be classified correctly if you randomly picked a label according to the label distribution in the branch.
- Entropy measures the lack of information (impurity) in a node. You calculate the Information Gain (the decrease in entropy) produced by a split; choosing the split with the highest Information Gain reduces the uncertainty about the output label the most.

- Entropy is an indicator of how messy your data is. It decreases as you move closer to the leaf nodes.
- Information Gain is based on the decrease in entropy after a dataset is split on an attribute. It keeps increasing as you move closer to the leaf nodes.
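A small sketch of these two quantities for a binary node; the split fractions in the example are made up, but a perfectly mixed node split into two pure children gaining exactly 1 bit is a standard sanity check:

```python
import math

def entropy(p):
    """Binary entropy of a node whose success probability is p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node carries no uncertainty
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

def information_gain(parent_p, left, right):
    """left/right = (fraction of samples, success probability) after a split."""
    (wl, pl), (wr, pr) = left, right
    return entropy(parent_p) - (wl * entropy(pl) + wr * entropy(pr))

print(entropy(0.5))                                    # 1.0 (fully mixed)
print(information_gain(0.5, (0.5, 1.0), (0.5, 0.0)))   # 1.0 (pure children)
```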

*Overfitting occurs when a model learns the training data to such an extent that it negatively influences the model's performance on new data.*

This means that the disturbance in the training data is recorded and learned as concepts by the model. But the problem here is that these concepts do not apply to the testing data and negatively impact the model’s ability to classify the new data, hence reducing the accuracy on the testing data.

Three main methods to avoid overfitting:

- Collect more data so that the model can be trained with varied samples.
- Keep the model simple: take fewer variables into account, which removes some of the noise captured from the training data.
- Use ensembling methods, such as Random Forest. It is based on the idea of bagging, which reduces the variance of the predictions by combining the results of multiple Decision Trees built on different samples of the data set.
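One way to see overfitting concretely is to fit polynomials of increasing degree to noisy data. In this NumPy sketch (the sine data, noise level, and degrees are all made-up illustrations), the high-degree model memorises the training noise, driving its training error down while its test error stays worse:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test)  # noise-free ground truth

errors = {}
for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (
        np.mean((np.polyval(coeffs, x_train) - y_train) ** 2),  # train MSE
        np.mean((np.polyval(coeffs, x_test) - y_test) ** 2),    # test MSE
    )
print(errors)
```

The degree-9 polynomial passes almost exactly through the 10 noisy training points (near-zero training error) yet generalises worse between them.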

The following methods can be used to screen outliers:

**Box plot:** A box plot represents the distribution of the data and its variability. Since the box spans the Inter-Quartile Range (IQR), it is a standard way to detect outliers: data points that lie outside this range (typically beyond 1.5 × IQR from the quartiles) are flagged as outliers.

**Probabilistic and statistical models:** Statistical models such as the normal and exponential distributions can be used to detect variations in the distribution of data points. Any data point found outside the expected range of the distribution is treated as an outlier.

**Linear models:** Linear models, such as regression, can be fitted to the data, and points with unusually large residuals can be flagged as outliers.

**Proximity-based models:** An example of this kind of model is K-means clustering, wherein data points form ‘k’ clusters based on similarity or distance. Since similar data points fall into the same clusters, outliers either form their own small cluster or lie far from every cluster centre, so proximity-based models can easily help detect them.
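As a concrete example of the box-plot method, here is a small IQR-based screen; the 1.5 × IQR fence is the usual box-plot convention, and the sample data is made up:

```python
import numpy as np

def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], as a box plot does."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

data = [10, 12, 11, 13, 12, 11, 10, 95]  # 95 is an obvious outlier
print(iqr_outliers(data))  # [95]
```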

- Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression have some correlation.
- Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.
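A quick way to spot (multi)collinearity is the correlation matrix of the predictors. In this NumPy sketch, the data is synthetic: x2 is constructed to be nearly collinear with x1, while x3 is drawn independently:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                       # independent predictor

corr = np.corrcoef([x1, x2, x3])
print(corr.round(2))  # off-diagonal entry for (x1, x2) is close to 1
```

In practice, highly correlated predictors inflate the variance of regression coefficients, which is why they are screened before fitting.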

**Eigenvectors:** *Eigenvectors are vectors whose direction remains unchanged when a linear transformation is applied to them.*

**Eigenvalues:** *An eigenvalue is the scalar by which the corresponding eigenvector is scaled under that transformation.*
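NumPy can verify the defining property A·v = λ·v directly; the diagonal matrix below is an arbitrary illustration:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

# A scales each eigenvector by its eigenvalue without changing its direction
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))  # True
```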

- A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It is used to compare two models that use different predictor variables in order to check which fits a given sample of data best.
- Consider a scenario where you’ve created two models (using different predictor variables) that can be used to recommend products for an e-commerce platform.
- A/B Testing can be used to compare these two models to check which one best recommends products to a customer.
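A common way to score such an experiment is a two-proportion z-test on the conversion rates of the two variants. This pure-Python sketch uses made-up conversion counts for the two hypothetical recommendation models:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for whether the conversion rates of A and B differ."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_a - p_b) / se
    # two-sided p-value from the normal CDF, via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# hypothetical data: model A converts 120/1000 users, model B converts 150/1000
z, p = two_proportion_z(120, 1000, 150, 1000)
print(round(z, 3), round(p, 4))
```

A small p-value suggests the difference in conversion rates is unlikely to be chance, so the better-converting model would be preferred.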

- It is a process of randomly selecting intact groups within a defined population, sharing similar characteristics.
- Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
- For example, if you’re clustering the total number of managers in a set of companies, in that case, managers (samples) will represent elements and companies will represent clusters.

- Measures such as the Gini Index and Entropy can be used to decide which variable is best fitted for splitting the Decision Tree at the root node.
- We can calculate Gini as follows:
- Calculate Gini for the sub-nodes using the formula p^2 + q^2, the sum of the squares of the probabilities of success and failure.
- Calculate Gini for the split using the weighted Gini score of each node of that split.
- Entropy is the measure of impurity or randomness in the data; for a binary class:

Entropy = -p * log2(p) - q * log2(q)

Here p and q are the probabilities of success and failure, respectively, in that node.
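Following the text's convention (p^2 + q^2 for a node, then a weighted average over the split), a short sketch with made-up node sizes:

```python
def gini_score(p):
    """Node Gini score as in the text: p^2 + q^2 (1 = pure, 0.5 = fully mixed).
    Note: libraries like scikit-learn report the complementary impurity,
    1 - (p^2 + q^2), instead."""
    q = 1 - p
    return p * p + q * q

def gini_for_split(left, right):
    """Weighted Gini score of a split; left/right = (n_samples, success prob.)."""
    (nl, pl), (nr, pr) = left, right
    n = nl + nr
    return (nl / n) * gini_score(pl) + (nr / n) * gini_score(pr)

print(gini_score(0.5))                      # 0.5 (fully mixed node)
print(gini_for_split((5, 1.0), (5, 0.0)))   # 1.0 (both children pure)
```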

- NumPy is part of the broader SciPy ecosystem; SciPy is built on top of NumPy.
- NumPy defines arrays along with basic numerical operations on them, such as indexing, sorting, and reshaping.
- SciPy uses NumPy's functionality to implement higher-level computations such as numerical integration and optimization.
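A quick illustration of the division of labour, assuming both libraries are installed: NumPy handles the array mechanics, SciPy the higher-level computation.

```python
import numpy as np
from scipy import integrate

# NumPy: define and manipulate arrays (sorting, reshaping)
a = np.array([3, 1, 2, 4, 6, 5])
reshaped = np.sort(a).reshape(2, 3)

# SciPy: higher-level computation built on NumPy, e.g. numerical integration
area, _ = integrate.quad(lambda x: x ** 2, 0, 1)  # integral of x^2 on [0, 1]
print(reshaped)
print(area)  # close to 1/3
```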