
MACHINE LEARNING
1. Difference between feature selection and extraction?
Feature selection is the task of choosing the most relevant features from the original set. Features that clearly do not help in determining the model's predictions are discarded.
Feature extraction, on the other hand, is the process of deriving features from raw data: it transforms the raw data into a set of features that can be used to train an ML model (for example, creating new features that are combinations of the original ones).
Both techniques are important because they control which features the ML model is trained on, which in turn affects the model's accuracy. A short sketch contrasting the two follows.
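A minimal sketch of the contrast, assuming scikit-learn is available; SelectKBest performs feature selection and PCA performs feature extraction. The dataset and parameter choices here are illustrative, not prescriptive:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)        # 4 original features

# Feature selection: keep the 2 original features most related to y.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as combinations of all 4.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # (150, 2) (150, 2)
```

Note the difference in interpretability: the selected columns are still original measurements, while the extracted components are linear combinations of all of them.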
2. Five Assumptions for Linear Regression?
- Assumption 1 - Linearity: there is a straight-line relationship between the independent variables (predictors) and the dependent variable (outcome). Mathematically, it implies that the change in the dependent variable due to a one-unit change in any one of the independent variables is constant.
- Assumption 2 - Independence of errors: the residuals (the differences between the observed values and the values predicted by the linear model) are independent of each other. The error for one observation is not influenced by the error for any other observation, a property that underpins the construction of confidence intervals and hypothesis tests.
- Assumption 3 - Homoscedasticity: the variance of the errors is constant across all predicted values. In other words, the size of the error does not vary with the values of the independent variables. When this assumption is violated (heteroscedasticity), the variance of the errors differs at different levels of the independent variables, which compromises the efficiency of the regression estimates.
- Assumption 4 - Normality of errors: the errors follow a normal distribution. This assumption matters most for small sample sizes; for large samples, the central limit theorem ensures that the distribution of residuals is approximately normal.
- Assumption 5 - Independence of predictors (no multicollinearity): the model's independent variables are not highly correlated with each other. High correlation leads to multicollinearity, which makes it difficult to interpret the effect of each independent variable on the dependent variable. A sketch of common diagnostics for these assumptions follows the list.
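A rough sketch of standard diagnostics for these assumptions, assuming statsmodels and scipy are installed; the toy data and the thresholds mentioned in the comments are illustrative assumptions, not hard rules:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # toy predictors
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

X_const = sm.add_constant(X)                         # add intercept column
model = sm.OLS(y, X_const).fit()
resid = model.resid

# Linearity is usually judged visually via a residuals-vs-fitted plot.
print("Durbin-Watson (independence of errors, ~2 is good):",
      durbin_watson(resid))
print("Breusch-Pagan p-value (homoscedasticity):",
      het_breuschpagan(resid, X_const)[1])
print("Shapiro-Wilk p-value (normality of errors):", shapiro(resid)[1])
print("VIFs (multicollinearity, values above ~5-10 are a red flag):",
      [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])])
```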
3. Difference between linear and non-linear regression?
Linear regression is a method used to find the relationship between a dependent variable and one or more independent variables. The model finds the best-fit line, a linear function, fitted so that the error between predictions and observations is minimal.
Non-linear regression models the relationship between a dependent variable and one or more independent variables with a non-linear equation. Non-linear regression models are more flexible and can capture more complex relationships between variables, as the sketch below illustrates.
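A small illustrative sketch, assuming NumPy and SciPy: a straight-line fit versus a non-linear (exponential) fit on the same data. The exponential functional form and the starting values are assumptions made for this example:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import linregress

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 100)
y = 2.0 * np.exp(0.5 * x) + rng.normal(scale=0.5, size=x.size)  # curved data

# Linear regression: fit a straight line y = a*x + b.
lin = linregress(x, y)
print("linear fit:     y = %.2f*x + %.2f" % (lin.slope, lin.intercept))

# Non-linear regression: fit y = a * exp(b*x) directly.
popt, _ = curve_fit(lambda t, a, b: a * np.exp(b * t), x, y, p0=(1.0, 1.0))
print("non-linear fit: y = %.2f * exp(%.2f*x)" % tuple(popt))
```

On curved data like this, the straight line systematically over- and under-shoots, while the non-linear model can match the true generating shape.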
4. Identifying Underfitting and Overfitting in a Model?
Underfitting occurs when a statistical model or machine learning algorithm is not able to capture the underlying patterns of the data. This can happen for a variety of reasons, but one common cause is that the model is too simple for the complexity of the data.
The training error of an underfitting model will be high, i.e., the model fails to learn from the training data and performs poorly on it. The validation error will also be high, since the model performs poorly on new, unseen data.
Overfitting occurs when the model memorizes the training data, including its noise, and so performs extremely well on the training data but poorly on the test data: the testing error is high compared to the training error. The sketch below shows both regimes by varying model complexity.
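A minimal sketch, assuming scikit-learn: train and validation error are compared as polynomial degree grows, so both regimes show up in the numbers. The degrees 1, 4, and 15 are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)   # noisy sine wave
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    va = mean_squared_error(y_va, model.predict(X_va))
    # degree 1: both errors high (underfit); degree 15: train error low but
    # validation error typically much higher (overfit); degree 4: a balance.
    print(f"degree {degree:2d}: train MSE {tr:.3f}, validation MSE {va:.3f}")
```

The diagnostic pattern is exactly the one described above: high error on both splits signals underfitting, while a large gap between low training error and high validation error signals overfitting.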
