Neural Network Basics
Logistic Regression as a Neural Network
Logistic regression is an algorithm for binary classification.
Binary Classification
The type of classification problem in which the response is one of two possible values.
Example: Given an image input, we want to output a label recognizing the image as a cat (label = 1) or not (label = 0).
Let’s say we have a 64x64 pixel image. We will have three 64x64 matrices corresponding to the red, green and blue intensity values. To represent these as input features, we unroll the matrices into a single vector $x$ having 3x64x64 = 12288 elements (each corresponding to one pixel intensity value). We use $n_x$ to denote the dimension of the input feature vector $x$. In this example, $n_x = 12288$. We can also write it as $n$ for brevity.
Notations
A single training example is represented by a pair $(x,y)$ where $x\in\mathbb{R}^{n_x}$ is an $n_x$-dimensional feature vector and $y\in\{0,1\}$ is the output label.
We have $m$ (or $m_{train}$) training examples, $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$. For the number of test examples, we use $m_{test}$. To put all of the training examples in a more compact form, we define a matrix $X$ as:
$$
X = \begin{bmatrix}
\rvert&\rvert&&\rvert \\
x^{(1)}&x^{(2)}&\cdots&x^{(m)} \\
\rvert&\rvert&&\rvert
\end{bmatrix}
$$
Each column of $X$ is a training example (having $n_x$ features). Since there are $m$ training examples, $X$ has $m$ columns, and it has $n_x$ rows since each $x^{(i)}$ has $n_x$ features.
Similarly, the output labels $y$ can also be stacked into a row vector $Y = \begin{bmatrix}y^{(1)}&y^{(2)}&\cdots&y^{(m)}\end{bmatrix}$, where each $y^{(i)}$ is 0 or 1.
Logistic Regression
It is a learning algorithm that we use when the output labels $y$ are all either zero or one, i.e. for binary classification.
Given input $x$, we want output $\hat{y}$ as the probability $P(y=1 \mid x)$, i.e, $\hat{y}$ is the chance that the given image is a cat picture.
We have, $x\in\mathbb{R}^{n_x}$ and $\hat{y}\in[0,1]$.
Let us define two parameters, $w\in\mathbb{R}^{n_x}$ and $b\in\mathbb{R}$.
We need to generate the output $\hat{y}$ from the input $x$ and the parameters $w$ and $b$. In the case of linear regression, we could do $\hat{y}=w^{\intercal}x+b$, but this is not suitable for logistic regression, because here $\hat{y}$ is a probability and should be between 0 and 1. We use the sigmoid function ($\sigma$) to squash the value between 0 and 1, i.e. $\hat{y}=\sigma(w^{\intercal}x+b)$, where $\sigma(z)=\frac{1}{1+e^{-z}}$. As $z$ gets larger, the value of the sigmoid function gets closer to 1; as $z$ gets smaller (a large negative number), the value gets closer to 0.
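For example, $\sigma(0)=0.5$, $\sigma(5)\approx0.993$ and $\sigma(-5)\approx0.007$.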
In logistic regression, we try to learn the values of the parameters $w$ and $b$ so that $\hat{y}$ becomes a good estimate of the chance that $y=1$ (cat picture).
Logistic Regression Cost function
Given training examples, $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),…,(x^{(m)},y^{(m)})\}$, we want $\hat{y}^{(i)}\approx y^{(i)}$.
Loss (error) Function
To measure how well our algorithm is doing, we define the loss, or error, between the prediction $\hat{y}$ and the true value $y$. The squared error, $\mathcal{L}(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2$, is not used with logistic regression as it can lead to a cost with multiple local optima, making the global optimum hard to find. We use the following loss function in logistic regression: $$\mathcal{L}(\hat{y},y)=-(y\log\hat{y}+(1-y)\log(1-\hat{y}))$$ The loss function becomes small when $\hat{y}$ is close to $y$. Substitute $y=1$ and $y=0$ in the above equation and verify.
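Concretely:
$$y=1:\quad \mathcal{L}(\hat{y},1)=-\log\hat{y},\ \text{which is small only when } \hat{y} \text{ is close to } 1$$
$$y=0:\quad \mathcal{L}(\hat{y},0)=-\log(1-\hat{y}),\ \text{which is small only when } \hat{y} \text{ is close to } 0$$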
Cost Function
The loss function measures how well we are doing on a single training example; the cost function measures it over the entire training set. It is defined as: $$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right)$$ which is simply the average of the losses over all training examples.
Gradient Descent
Our goal is to find parameters $w$ and $b$ that minimize $J(w,b)$.
To find the minimum value of our cost function, which is convex (bowl-shaped in higher dimensions), gradient descent starts at an initial point and takes a step in the direction of steepest descent. This is repeated until the minimum is reached.
Considering the case of a single one-dimensional parameter $w$, each step of gradient descent updates $w$ as:
$$w := w-\alpha\frac{\partial{J(w)}}{\partial{w}}$$
where $\alpha$ is the learning rate, which determines the size of the step we take on each iteration. Observing the above equation: if $w$ is to the right of the minimum, the slope $\frac{\partial{J(w)}}{\partial{w}}$ is positive and the update decreases $w$; if $w$ is to the left of the minimum, the slope is negative and the update increases $w$, so in both cases $w$ moves toward the minimum. When writing code, the term $\frac{\partial{J(w)}}{\partial{w}}$ is represented by the variable dw, i.e. w := w - α*dw.
. Taking the case of parameter $b$ as well, we get:
$$w := w - \alpha\frac{\partial{J(w,b)}}{\partial{w}}$$
$$b := b - \alpha\frac{\partial{J(w,b)}}{\partial{b}}$$
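As a minimal sketch (not from the notes), the update rule can be tried on a toy one-dimensional cost $J(w)=(w-3)^2$, whose derivative is $2(w-3)$; the cost and the values below are made up purely for illustration:
def dJ_dw(w):
    return 2*(w - 3)        # derivative of the illustrative cost J(w) = (w - 3)^2

w = 0.0                     # initial guess
alpha = 0.1                 # learning rate
for _ in range(100):
    w = w - alpha*dJ_dw(w)  # w := w - α*dw
print(w)                    # approaches the minimum at w = 3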
Derivatives
The derivative at a point gives the slope of the function at that point. $\frac{\partial{J(w,b)}}{\partial{w}}$ is the rate at which $J(w,b)$ changes when $w$ changes by an infinitesimally small amount.
Computation graph
Computations of a neural network are organized in terms of a forward pass, or forward propagation step, where we compute the output of the network, and a backward pass, or backward propagation, where we compute gradients or derivatives. A computation graph can be used to show why it is organized this way. For $J(a,b,c)=3(a+bc)$, we introduce intermediate variables $u=bc$ and $v=a+u$, so that $J=3v$. We can compute the output by doing a left-to-right pass on the computation graph (computing $u$, then $v$, and finally $J$).
Derivative calculation using computation graph
Take $a=5$, $b=3$, $c=2$, so $u = 6$, $v = 11$ and $J = 33$. Doing a right-to-left pass (backpropagation), $\frac{dJ}{dv}=3$. Similarly, $\frac{dJ}{da}=\frac{dJ}{dv}\cdot\frac{dv}{da}=3\cdot1=3$. When writing Python code, we will use the variable name dJdvar, or just dvar, to represent these derivatives. So $\frac{dJ}{dv}$ is represented as just dv, $\frac{dJ}{da}$ as da, and so on. Now, $\frac{dJ}{du}=\frac{dJ}{dv}\cdot\frac{dv}{du}=3\cdot1=3$ and $\frac{dJ}{db}=\frac{dJ}{du}\cdot\frac{du}{db}=3\cdot c=3\cdot2=6$. Similarly, we obtain $\frac{dJ}{dc}=\frac{dJ}{du}\cdot\frac{du}{dc}=3\cdot b=3\cdot3=9$.
We see from the above discussion that the most efficient way to compute all the derivatives is a right-to-left pass: the derivatives with respect to variables closest to the output (like $v$ and $u$) are computed first, and the chain rule then makes it cheap to compute the derivatives with respect to variables further to the left.
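As a small check, the same forward and backward passes can be written directly in Python (a sketch using the example values above):
a, b, c = 5, 3, 2
# Forward pass (left to right)
u = b*c       # u = 6
v = a + u     # v = 11
J = 3*v       # J = 33
# Backward pass (right to left), applying the chain rule
dv = 3        # dJ/dv
da = dv*1     # dJ/da = dJ/dv * dv/da = 3
du = dv*1     # dJ/du = dJ/dv * dv/du = 3
db = du*c     # dJ/db = dJ/du * du/db = 6
dc = du*b     # dJ/dc = dJ/du * du/dc = 9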
Logistic Regression Gradient Descent
Consider a single training example with two features, so that $z = w_1x_1 + w_2x_2 + b$, $a = \hat{y} = \sigma(z)$, and the loss is $\mathcal{L}(a,y)$. As discussed in the previous section, $d\mathcal{L}/dx_1$, $d\mathcal{L}/dx_2$, $d\mathcal{L}/dw_1$, $d\mathcal{L}/dw_2$ and $d\mathcal{L}/db$ can be calculated using backpropagation. Writing $dz$ for $d\mathcal{L}/dz$, these values are found to be $$\frac{d\mathcal{L}}{dw_1}=x_1\,dz$$ $$\frac{d\mathcal{L}}{dw_2}=x_2\,dz$$
$$\frac{d\mathcal{L}}{db}=dz=a-y$$
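For completeness, $dz = a - y$ follows from the chain rule:
$$\frac{d\mathcal{L}}{da}=-\frac{y}{a}+\frac{1-y}{1-a},\qquad \frac{da}{dz}=a(1-a),\qquad \frac{d\mathcal{L}}{dz}=\frac{d\mathcal{L}}{da}\cdot\frac{da}{dz}=a-y$$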
In code, these are the variables dx1, dx2, dw1, dw2 and db. After computing these, w1 (and w2) and b are updated as discussed previously:
$$w_1 := w_1 - \alpha dw_1$$
$$b := b - \alpha db$$
Gradient Descent on m examples
Recall that the cost function $J$ is the average of the loss $\mathcal{L}$ over the $m$ training examples.
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(a^{(i)},y^{(i)})$$
$$\frac{\partial{}}{\partial{w_1}}J(w,b)=\frac{1}{m}\sum_{i=1}^{m}\underbrace{\frac{\partial{}}{\partial{w_1}}\mathcal{L}(a^{(i)},y^{(i)})}_{dw_{1}^{(i)}}$$
Keeping in mind the variable naming convention described in the previous section, we can compute the overall gradient over the $m$ training examples (assuming two features x1 and x2) with the following algorithm, which performs one step of gradient descent:
Initialize J=0, dw1=0, dw2=0, db=0
For i=1 to m:
    zi = w'*xi + b
    ai = σ(zi)
    J += -(yi*log(ai) + (1-yi)*log(1-ai))
    dzi = ai - yi
    ## Here we only have two features x1 and x2
    ## When we have nx features, there will be a for loop here
    dw1 += xi1*dzi  # As derived in the previous section
    dw2 += xi2*dzi
    db += dzi
J /= m
dw1 /= m
dw2 /= m
db /= m
w1 := w1 - α*dw1
w2 := w2 - α*dw2
b := b - α*db
In this algorithm, we have two for loops (one over the training examples and one over the features) that will slow things down when we have a large amount of data. The same computation can be implemented without explicit for loops using vectorization.
Python and vectorization
Vectorization
Vectorization makes use of SIMD (Single Instruction, Multiple Data) instructions to do matrix and vector operations without explicit for loops. It is much faster.
Pseudo code for calculating $z=w^{\intercal}x+b$ without vectorization ($w$ and $x$ are $n_x$-dimensional column vectors):
z = 0
for i=0 to nx-1:
    z += w[i]*x[i]
z += b
Vectorized code using numpy would be np.dot(w,x) + b.
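As an illustration (a sketch, not from the notes; the array size is arbitrary), the speed difference can be measured like this:
import time
import numpy as np

n = 1000000
w = np.random.rand(n)
x = np.random.rand(n)

# Vectorized version
tic = time.time()
z = np.dot(w, x)
toc = time.time()
print("Vectorized:", 1000*(toc - tic), "ms")

# Explicit for-loop version
tic = time.time()
z = 0
for i in range(n):
    z += w[i]*x[i]
toc = time.time()
print("For loop:", 1000*(toc - tic), "ms")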
More vectorization examples
Whenever possible, avoid using explicit for loops and use built-in functions instead.
Multiplying a matrix and a vector
Say we need u = Av, where A is a matrix (with num_rows rows and num_columns columns) and v is a vector. Non-vectorized version:
u = np.zeros((num_rows, 1))
for i in range(num_rows):
    for j in range(num_columns):
        u[i] += A[i][j]*v[j]
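The vectorized version is a single call:
u = np.dot(A, v)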
Vector and matrix valued functions
Say we have a column vector v and we need to take the exponential of each element and output a vector u. Non-vectorized version:
u = np.zeros((n,1))  # n x 1 zero vector
for i in range(n):
    u[i] = math.exp(v[i])
Vectorized version: u = np.exp(v).
There are many such functions, like np.log, np.abs, np.max, np.maximum, v**2 (for element-wise squaring), etc.
Logistic Regression Derivatives
Non-vectorized version:
Initialize J=0, dw1=0, dw2=0, db=0
For i=1 to m:
    zi = w'*xi + b
    ai = σ(zi)
    J += -(yi*log(ai) + (1-yi)*log(1-ai))
    dzi = ai - yi
    ## Here we only have two features x1 and x2
    ## When we have nx features, there will be a for loop here
    dw1 += xi1*dzi  # As derived in the previous section
    dw2 += xi2*dzi
    db += dzi
J /= m
dw1 /= m
dw2 /= m
db /= m
w1 := w1 - α*dw1
w2 := w2 - α*dw2
b := b - α*db
First, we are going to remove the second for loop (the one over the features). Instead of using dw1, dw2, etc. and looping through them, we can use a single vector dw as follows:
dw = np.zeros((n_x,1))  # instead of dw1, dw2, ..., dwn_x
#...
dw += xi*dzi            # inside the loop over training examples
#...
dw /= m
Vectorizing Logistic Regression
Logistic regression using no for loops. Instead of calculating z1 = w'*x1 + b, z2 = w'*x2 + b, etc. separately, we can compute Z = np.dot(w.T,X) + b, where X is the compact matrix containing the input features of all training examples, as discussed in the notations section earlier (b is a scalar that numpy broadcasts across the row).
Now, a1 = σ(z1), a2 = σ(z2), etc. can also be replaced using a vectorized sigmoid function, as A = [a1 a2 ... am] = σ(Z).
Vectorizing Logistic Regression’s Gradient Descent Output
Now, to remove the first loop for dzi = ai - yi, we can use dz = A - Y, where A = [a1 a2 a3 ...] and Y = [y1 y2 y3 ...].
So, the vectorized logistic regression algorithm would be:
Z = np.dot(w.T,X) + b        # np.dot for matrix multiplication; b is broadcast
A = σ(Z)                     # element-wise sigmoid
dz = A - Y
dw = (1/m)*np.dot(X,dz.T)    # so that dw stacks dw1 = Σ x1*dz, dw2 = Σ x2*dz, ... averaged over m
db = (1/m)*np.sum(dz)
w := w - α*dw
b := b - α*db
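Putting it all together, here is a minimal runnable sketch in Python/numpy (the synthetic data, number of iterations and learning rate are made up for illustration):
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# Synthetic data: n_x features, m examples (made-up shapes)
n_x, m = 4, 100
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.1  # learning rate

for _ in range(1000):
    Z = np.dot(w.T, X) + b        # shape (1, m)
    A = sigmoid(Z)                # shape (1, m)
    dz = A - Y                    # shape (1, m)
    dw = (1/m)*np.dot(X, dz.T)    # shape (n_x, 1)
    db = (1/m)*np.sum(dz)
    w = w - alpha*dw
    b = b - alpha*db

cost = -(1/m)*np.sum(Y*np.log(A) + (1 - Y)*np.log(1 - A))
print(cost)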
Broadcasting in Python
We can use broadcasting to make our Python programs simpler and faster.
Broadcasting examples on a 2x2 matrix A and a 2x1 matrix B:
Y = A + B   # Add elements in A to corresponding elements in B; the single column of B is copied to match the dimensions of A
Y = A - B   # Element-wise subtraction, with B broadcast in the same way
Y = A + 2   # Add 2 to all elements of A
Y = A*B     # Element-wise multiplication
Y = A/B     # Element-wise division
General Principle
Consider a matrix A of shape (m,n) combined (using +, -, * or /) with a vector of shape (1,n) or (m,1). The vector is broadcast (copied across rows or columns) so that the result is a matrix of shape (m,n).
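For instance (a made-up illustration, not from the notes), dividing every column of a matrix by its column sum relies on exactly this rule:
import numpy as np

A = np.random.rand(3, 4)                  # shape (3,4)
col_sums = A.sum(axis=0, keepdims=True)   # shape (1,4)
P = A/col_sums                            # (3,4) / (1,4): the row vector is broadcast down the rows
print(P.sum(axis=0))                      # each column of P now sums to 1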
A note on Python/numpy vectors
If we do a = np.random.randn(5), a will have a shape of (5,). This is called a rank 1 array, and it is better to avoid dealing with it. We should instead have a row or column vector of shape (1,5) or (5,1) in this case. We can either reshape a or create it directly with a = np.random.randn(1,5) or a = np.random.randn(5,1).
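A small sketch of both options (the assert is just a sanity check on the shape):
import numpy as np

a = np.random.randn(5)       # rank 1 array, shape (5,)
a = a.reshape(5, 1)          # now an explicit column vector
assert a.shape == (5, 1)

b = np.random.randn(5, 1)    # created directly as a column vector
assert b.shape == (5, 1)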
From Jupyter Notebooks
Snippets of useful code from this module's Jupyter notebooks.
Sigmoid Function using broadcasting
import numpy as np

def sigmoid(x):
    s = 1/(1 + np.exp(-x))  # since we're using broadcasting, this works on scalars, arrays and matrices
    return s
x = np.array([1, 2, 3])
sigmoid(x)
Gradient of sigmoid
We need to calculate $\sigma'(x) = s(1-s)$, where $s=\sigma(x)$.
def sigmoid_derivative(x):
    s = 1/(1 + np.exp(-x))   # s = sigmoid(x)
    ds = s*(1 - s)           # derivative of the sigmoid
    return ds
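Usage, continuing the example above (values approximate):
sigmoid_derivative(np.array([1, 2, 3]))   # ≈ [0.1966, 0.1050, 0.0452]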
Reshaping arrays
This function is often used in deep learning to unroll image intro a 1D vector. An image is a matrix of shape (length,height,depth=3)
, where depth is used for color.
def image2vector(image):
    v = image.reshape(image.shape[0]*image.shape[1]*image.shape[2], 1)  # v will have length*height*depth rows and 1 column
    return v
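For example, with a made-up 2x2x3 "image":
image = np.random.randn(2, 2, 3)
image2vector(image).shape   # (12, 1)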
Matrix multiplication
np.dot(A,B) # Matrix multiplication
np.multiply(A,B) # Element-wise multiplication
Unrolling training set images
A trick to flatten an array X of shape (a,b,c,d) into a matrix X_flatten of shape (b*c*d, a) is to use:
X_flatten = X.reshape(X.shape[0], -1).T
This is useful when a is the number of training examples.
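For example, with shapes matching the 64x64 cat-image setup from earlier (the number of examples, 209, is made up here):
X = np.random.randn(209, 64, 64, 3)       # 209 hypothetical 64x64 RGB training images
X_flatten = X.reshape(X.shape[0], -1).T
X_flatten.shape                           # (12288, 209)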