Introduction to Machine Learning and Scikit-Learn
Machine Learning
Machine learning is the study of algorithms that can learn from examples. Simply put, we feed an algorithm data, it recognizes patterns and learns from them, and it can then be deployed to make predictions on future observations.
Model
Let’s say we have one-dimensional data with a single feature $X$ and corresponding values $y$. Our model is a relation that maps the feature $X$ to the value $y$, i.e., $f(X) \approx y$.
Supervised and Unsupervised Learning
In supervised learning, we use labelled data to train the model. We are trying to find a predictive relationship between the features of our data and some sort of output label.
In unsupervised learning, we want to find trends in our features without using any target labels. It often involves reducing the dimensionality of the data.
So, a supervised learning problem can be formally put as:
Given a matrix $X$ of dimensions $n \times p$, create a predictive relationship (or function) $f$ such that $f(X) \approx y$, where $y$ is a vector of dimension $n$. $X$ is referred to as the feature matrix and $y$ as the labels.
Linear Regression
Given features and corresponding values, we can fit a line to the data we have and extrapolate it to predict new values. In the context of machine learning, this is called linear regression.
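As a minimal sketch of this idea (the toy data below is hypothetical), we can fit and extrapolate a line with NumPy:
import numpy as np
# hypothetical one-dimensional data
X_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([2.1, 3.9, 6.2, 8.1])
slope, intercept = np.polyfit(X_toy, y_toy, deg=1)  # least-squares fit of a straight line
print(slope * 5.0 + intercept)  # extrapolate to the unseen point X = 5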
Overfitting and Underfitting
Suppose we are trying to fit a curve to our data using np.polyfit(). We can tune our model using hyperparameters (in this case, the degree of the polynomial) to make it more or less flexible in fitting the data we have.
If we allow the model too much flexibility, it will fit all of the given data, including the noise in it, which results in inaccurate predictions on unseen data. This is called overfitting.
If we do not allow enough flexibility, the model will not be able to capture the trend in the data, which also results in inaccurate predictions. This is called underfitting. The sketch below illustrates both cases.
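A minimal sketch of this trade-off using np.polyfit() (the quadratic toy data and the chosen degrees are assumptions for illustration):
import numpy as np
# noisy samples from a quadratic, as hypothetical toy data
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y_noisy = x**2 + rng.normal(scale=2.0, size=x.shape)
# the degree controls flexibility: 1 tends to underfit, 9 tends to overfit the noise
for degree in (1, 2, 9):
    coefs = np.polyfit(x, y_noisy, deg=degree)
    y_fit = np.polyval(coefs, x)
    print(degree, np.mean((y_noisy - y_fit) ** 2))  # training error shrinks as degree grows
Note that the training error alone keeps shrinking as the degree grows; this is exactly why we evaluate models on data they have not seen.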
Scikit-learn
It is the most popular package for machine learning in Python. Scikit-learn mainly provides two things: machine learning algorithms and a few datasets.
Datasets
Scikit-learn comes with a few small datasets which we can use to understand various machine learning models. Details about these datasets can be found in the scikit-learn documentation.
Machine Learning Algorithms
Scikit-learn implements machine learning algorithms as classes which we can import. Class names follow the conventional PascalCase. For example, Ridge is a class representing the ridge regression model. It can be used as:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)  # alpha is a hyperparameter of the ridge model
Scikit-learn refers to machine learning algorithms as estimators. There are three types of estimators in scikit-learn:
- Classifiers
- Regressors
- Transformers
Estimators can also be divided into two groups: predictors (classifiers and regressors) and transformers.
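To make the distinction concrete, here is a small sketch of the transformer side (StandardScaler is one of many transformers, and the data is hypothetical): a transformer learns its parameters with fit and then rewrites the features with transform.
import numpy as np
from sklearn.preprocessing import StandardScaler
X_demo = np.array([[1.0], [2.0], [3.0]])  # tiny hypothetical feature matrix
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)  # learns the mean and std, then standardizes
print(X_scaled)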
Typical Workflow
Loading Data
Let’s load a standard dataset from scikit-learn:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = data['data']  # the features of the dataset
y = data['target']  # the target values
print(data['DESCR'])  # prints a description of the dataset
fetch_california_housing returns a dictionary-like object, which we store in the variable data. Details about the dataset are stored as the value of the DESCR key, and we can use print(data['DESCR']) to view them. The feature matrix of the dataset is stored under the data key, and the target values (i.e., the labels) under the target key. So X is our feature matrix and y holds the target values in this case.
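As a quick sanity check, we can inspect the shapes (the California housing data should contain 20,640 samples with 8 features each):
print(X.shape)  # expected (20640, 8): samples by features
print(y.shape)  # expected (20640,): one target value per sample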
Predictors
Classifiers and regressors are called predictors, as they are models that make predictions. The basic workflow when dealing with predictors is (assuming model is an object of an estimator class):
- Fit the data using model.fit(X, y). This is the training step.
- Score the model using model.score(X, y). This calculates the accuracy of our model; the evaluation method differs between models. In the case of linear regression, it is $R^2$.
- Predict new values using model.predict(X).
We usually split our dataset and use one part for training and the other part for testing the accuracy of the model:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)  # by default, splits 75% to training and 25% to testing
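The split can also be made explicit and reproducible via the optional test_size and random_state keyword arguments (the values below are just illustrative):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)  # reproducible 75/25 split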
from sklearn.linear_model import LinearRegression
# create the model and fit it on the training set
model = LinearRegression()
model.fit(X_train, y_train)
# predict label values on the test set
y_pred = model.predict(X_test)
print(y_pred)
After fitting, the learned parameters are stored as attributes (coef_ and intercept_) in the model object. We can see these values as:
print("β_0: {}".format(model.intercept_))
for i in range(8):
print("β_{}: {}".format(i+1, model.coef_[i]))
from sklearn.ensemble import GradientBoostingRegressor
# create the model and fit it on the training set
model = GradientBoostingRegressor()
model.fit(X_train, y_train)
# predict label values on the test set
y_pred = model.predict(X_test)
print(y_pred)
print("R^2: {:g}".format(model.score(X, y)))