Introduction to Machine Learning and Scikit-Learn

Machine Learning

Machine learning is the study of algorithms that learn from examples. Simply put, we feed an algorithm data, it recognizes patterns in that data, and the trained model can then be applied to future observations.

Model

Let’s say we have one-dimensional data with a single feature $X$ and corresponding values $y$. Our model is a relation that maps the feature $X$ to the value $y$, i.e., $f(X) \approx y$.

Supervised and Unsupervised Learning

In supervised learning, we use labelled data to train the model. We try to find a predictive relationship between the features of our data and some output label.
In unsupervised learning, we want to find structure in our features without using any target labels. Common examples are clustering and reducing the dimensionality of the data.
A supervised learning problem can be stated formally as:
Given a matrix $X$ of dimensions $n \times p$ and a vector $y$ of dimension $n$, find a predictive relationship (or function) $f$ such that $f(X) \approx y$. $X$ is referred to as the feature matrix and $y$ as the labels.

Linear Regression

Given features and corresponding values, we can fit a line to the data we have and extrapolate it to predict values. In the context of machine learning, this is called linear regression.
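As a minimal sketch of this idea, the snippet below fits a straight line with np.polyfit and extends it to a new input. The toy data (points lying exactly on $y = 2x + 1$) and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit is a least-squares straight-line fit;
# np.polyfit returns the coefficients from highest degree to lowest
slope, intercept = np.polyfit(x, y, 1)

# Extend the fitted line to a point outside the observed range
x_new = 10.0
y_new = slope * x_new + intercept

print(slope, intercept, y_new)
```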

Overfitting and Underfitting

Suppose we are trying to fit a curve to our data using np.polyfit(). We can tune the model through its hyperparameters (in this case, the degree of the polynomial), making it more or less flexible.
If we allow the model too much flexibility, it will fit all the given data, including the noise, which results in inaccurate predictions on unseen data. This is called overfitting.
If we do not allow enough flexibility, the model cannot capture the underlying pattern in the data, which also results in inaccurate predictions. This is called underfitting.
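One way to see this effect is to fit polynomials of different degrees and compare the error on the fitted points with the error on points the fit never saw. The sketch below assumes a made-up ground truth (a noisy sine wave) and arbitrary degrees 1, 3 and 9; none of these choices come from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(x):
    # Hypothetical ground truth (a sine wave) plus Gaussian noise
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

x_train = np.linspace(0, 1, 20)
y_train = noisy_sine(x_train)
x_test = np.linspace(0.025, 0.975, 20)  # points not used for fitting
y_test = noisy_sine(x_test)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train),
                      mse(coeffs, x_test, y_test))
    print(degree, errors[degree])
```

The training error does not increase as the degree grows, so it is the error on the unseen points that indicates whether the extra flexibility is helping or merely fitting noise.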

Scikit-learn

Scikit-learn is one of the most popular packages for machine learning in Python. It mainly provides two things: machine learning algorithms and a few datasets.

Datasets

Scikit-learn comes with a few small datasets which we can use to understand various machine learning models. Details about these datasets can be found in the scikit-learn documentation.

Machine Learning Algorithms

Scikit-learn implements machine learning algorithms as classes that we can import. Class names follow the conventional PascalCase. For example, Ridge is the class representing the ridge regression model. It can be used as:

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1) #alpha is a hyperparameter of ridge model
Hyperparameters are set before training, and they influence the values that the model parameters take after training. In scikit-learn, hyperparameters are set when an instance of the class is created.
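The difference between a hyperparameter and a learned parameter can be sketched as follows. The four-point dataset is a made-up assumption; get_params, fit, coef_ and intercept_ are part of the scikit-learn estimator interface.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical toy data following y = 3x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 3.0, 6.0, 9.0])

ridge = Ridge(alpha=0.1)  # the hyperparameter is chosen by us, before training
print(ridge.get_params()["alpha"])

# Learned parameters do not exist until the model has been fitted
has_coef_before = hasattr(ridge, "coef_")

ridge.fit(X, y)
has_coef_after = hasattr(ridge, "coef_")

print(has_coef_before, has_coef_after)
print(ridge.coef_, ridge.intercept_)
```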
Scikit-learn refers to machine learning algorithms as estimators. There are three types of estimators in scikit-learn:

  1. Classifiers
  2. Regressors
  3. Transformers

These estimators fall into two groups: predictors (classifiers and regressors) and transformers.
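As a rough illustration of this grouping, the sketch below classifies three estimators using the is_classifier and is_regressor helpers from sklearn.base, and treats anything with a transform method as a transformer. The three estimators chosen and the kinds dictionary are assumptions for the example.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.preprocessing import StandardScaler

estimators = [LogisticRegression(), Ridge(), StandardScaler()]

kinds = {}
for est in estimators:
    name = type(est).__name__
    if is_classifier(est):
        kinds[name] = "classifier"
    elif is_regressor(est):
        kinds[name] = "regressor"
    elif hasattr(est, "transform"):
        # Transformers expose transform() rather than predict()
        kinds[name] = "transformer"

print(kinds)
```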

Typical Workflow

Loading Data

Let’s load a standard dataset from scikit-learn:

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X = data['data'] #The features of the dataset
y = data['target'] #The target value

print(data['DESCR']) #Prints description of the dataset
In the above snippet, fetch_california_housing returns a dictionary-like object, which is stored in the variable data. Details about the dataset are stored under the DESCR key, and print(data['DESCR']) displays them. The feature matrix is stored under the data key and the target values (i.e., the labels) under the target key. So, X is our feature matrix and y is the vector of target values.
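To check that the loaded arrays have the $n \times p$ and $n$ shapes described earlier, we can inspect them directly. This sketch swaps in load_diabetes, a small dataset bundled with scikit-learn, so that no download is needed; the same keys are used as in the snippet above.

```python
from sklearn.datasets import load_diabetes

# load_diabetes ships with scikit-learn, so nothing is fetched over the network
data = load_diabetes()
X = data["data"]      # feature matrix, shape (n, p)
y = data["target"]    # target vector, shape (n,)

print(X.shape)
print(y.shape)
print(data["feature_names"])  # one name per column of X
```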

Predictors

Classifiers and regressors are called predictors, as they are models that make predictions. The basic workflow when dealing with a predictor is (assuming model is an instance of an estimator class):

  1. Fit the data using model.fit(X, y). This is the training part.
  2. Score the model using model.score(X, y). This measures how well the model performs. The metric depends on the type of model: for regressors such as linear regression it is $R^2$, and for classifiers it is the mean accuracy.
  3. Predict new values using model.predict(X).

We usually split our dataset and use one part for training and the other part for testing the model:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y) #By default, splits 75% to training and 25% to testing
Here is a snippet that implements the above steps:
from sklearn.linear_model import LinearRegression

# create model and train/fit
model = LinearRegression()
model.fit(X, y)

# predict label values on X
y_pred = model.predict(X)

print(y_pred)
The above snippet fits a linear regression model to the given data. Here, the same data used for training is passed to predict, purely for illustration. In a real case, we would split the data into training and testing sets. In the above case, our model (linear regression) is:
$$ y(X) = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_0. $$
The $\beta$ values are the parameters of the linear regression model, and they are calculated when the model is fitted to the data. Here $\beta_0$ is the intercept and the other $\beta$ values are the coefficients. They are stored as attributes (coef_ and intercept_) of the model object. We can see these values as:

print("β_0: {}".format(model.intercept_))

for i in range(8):
    print("β_{}: {}".format(i+1, model.coef_[i]))
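As a quick check of the linear model formula, we can rebuild the predictions by hand from coef_ and intercept_ and compare them with the output of predict. This sketch assumes the bundled load_diabetes dataset (10 features rather than 8) to avoid a download; the variable manual is introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

# y(X) = beta_1 * x_1 + ... + beta_p * x_p + beta_0, written as a matrix product
manual = X @ model.coef_ + model.intercept_

print(np.allclose(manual, model.predict(X)))
```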
The process is very similar if we want to use a model other than linear regression. Here is an example using a gradient boosting regressor:

from sklearn.ensemble import GradientBoostingRegressor

# create model and train/fit
model = GradientBoostingRegressor()
model.fit(X, y)

# predict label values on X
y_pred = model.predict(X)

print(y_pred)
print("R^2: {:g}".format(model.score(X, y)))
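The snippets above score and predict on the same data used for training. A sketch of the full workflow with a held-out test set might look like the following. It assumes the bundled load_diabetes dataset and a fixed random_state so the split is reproducible; both are choices made for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Default split: 75% of the rows for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Fit on the training part only
model = LinearRegression()
model.fit(X_train, y_train)

# 2. Score on both parts; the test score estimates performance on unseen data
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# 3. Predict labels for the held-out rows
y_pred = model.predict(X_test)

print("train R^2: {:g}".format(train_r2))
print("test R^2: {:g}".format(test_r2))
```

A test score that is much lower than the training score is a common sign of overfitting.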
