Neural Network Basics
Logistic Regression as a Neural Network
Logistic regression is an algorithm for binary classification.
Binary Classification
The type of classification problem in which the response is one of two possible values.
Example: Given an image input, we want to output a label recognizing the image as a cat (label = 1) or not (label = 0).
Let’s say we have a 64x64 pixel image. We will have three 64x64 matrices corresponding to the red, green and blue intensity values. To represent these as input features, we unroll the matrices into a single vector $x$ having 3x64x64 = 12288 elements (each corresponding to one pixel intensity value). We use $n_x$ to denote the dimension of the input feature vector $x$. In this example, $n_x = 12288$. We can also write it as $n$ for brevity.
Notations
A single training example is represented by a pair $(x,y)$ where $x\in\mathbb{R}^{n_x}$ is an $n_x$-dimensional feature vector and $y\in\{0,1\}$ is the output label.
We have $m$ (or $m_{train}$) training examples, $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$. For the number of test examples, we use $m_{test}$. To put all of the training examples in a more compact form, we define a matrix $X$ as:
$$
X = \begin{bmatrix}
\rvert&\rvert&&\rvert \\
x^{(1)}&x^{(2)}&\cdots&x^{(m)} \\
\rvert&\rvert&&\rvert
\end{bmatrix}
$$
Each column of $X$ is a training example (having $n_x$ features). Since there are $m$ training examples, $X$ has $m$ columns, and it has $n_x$ rows since each $x^{(i)}$ has $n_x$ features.
Similarly, the output labels $y$ can also be stacked into a row vector $Y = \begin{bmatrix}y^{(1)}&y^{(2)}&\cdots&y^{(m)}\end{bmatrix}$, where each $y^{(i)}$ is 0 or 1.
Logistic Regression
It is a learning algorithm that we use when the output labels $y$ are all either zero or one, i.e. for binary classification.
Given input $x$, we want output $\hat{y}$ as the probability $P(y=1 \mid x)$, i.e, $\hat{y}$ is the chance that the given image is a cat picture.
We have, $x\in\mathbb{R}^{n_x}$ and $\hat{y}\in[0,1]$.
Let us define two parameters, $w\in\mathbb{R}^{n_x}$ and $b\in\mathbb{R}$.
We need to generate the output $\hat{y}$ from the input $x$ and the parameters $w$ and $b$. In the case of linear regression, we could do $\hat{y}=w^{\intercal}x+b$, but this is not suitable for logistic regression, because here $\hat{y}$ is a probability and should be between 0 and 1. We use the sigmoid function ($\sigma$) to squash the value between 0 and 1, i.e. $\hat{y}=\sigma(w^{\intercal}x+b)$, where $\sigma(z)=\frac{1}{1+e^{-z}}$. As $z$ gets larger, the value of the sigmoid function gets closer to 1; as $z$ gets smaller (a large negative number), the value gets closer to 0.
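For example, $\sigma(0)=0.5$, $\sigma(5)\approx0.993$ and $\sigma(-5)\approx0.007$.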
In logistic regression, we try to learn the values of the parameters $w$ and $b$ so that $\hat{y}$ becomes a good estimate of the chance that $y=1$ (cat picture).
Logistic Regression Cost function
Given training examples, $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),…,(x^{(m)},y^{(m)})\}$, we want $\hat{y}^{(i)}\approx y^{(i)}$.
Loss (error) Function
To measure how well our algorithm is doing, we define the loss, or error, between the prediction $\hat{y}$ and the true value $y$. The squared error, $\mathcal{L}(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2$, is not used with logistic regression as it can lead to a cost with multiple local optima, making the global optimum hard to find. We use the following loss function in logistic regression: $$\mathcal{L}(\hat{y},y)=-(y\log\hat{y}+(1-y)\log(1-\hat{y}))$$ The loss function becomes small when $\hat{y}$ is close to $y$. Substitute $y=1$ and $y=0$ in the above equation and verify.
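Concretely:
$$y=1:\quad \mathcal{L}(\hat{y},1)=-\log\hat{y},\ \text{which is small only when } \hat{y} \text{ is close to } 1$$
$$y=0:\quad \mathcal{L}(\hat{y},0)=-\log(1-\hat{y}),\ \text{which is small only when } \hat{y} \text{ is close to } 0$$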
Cost Function
The loss function measures how well we are doing on a single training example; the cost function measures it over the entire training set. It is defined as: $$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right)$$ which is simply the average of the losses over all training examples.
Gradient Descent
Our goal is to find parameters $w$ and $b$ that minimize $J(w,b)$.
To find the minimum value of our cost function, which is convex (bowl-shaped in higher dimensions), gradient descent starts at an initial point and takes a step in the direction of steepest descent. This is repeated until the minimum is reached.
Considering the case of a single one-dimensional parameter $w$, each step of gradient descent updates $w$ as:
$$w := w-\alpha\frac{\partial{J(w)}}{\partial{w}}$$
where $\alpha$ is the learning rate, which determines the size of the step we take on each iteration. Observing the above equation: if $w$ is to the right of the minimum, the slope $\frac{\partial{J(w)}}{\partial{w}}$ is positive and the update decreases $w$; if $w$ is to the left of the minimum, the slope is negative and the update increases $w$, so in both cases $w$ moves toward the minimum. When writing code, the term $\frac{\partial{J(w)}}{\partial{w}}$ is represented by the variable dw, i.e. w := w - α*dw.
. Taking the case of parameter $b$ as well, we get:
$$w := w - \alpha\frac{\partial{J(w,b)}}{\partial{w}}$$
$$b := b - \alpha\frac{\partial{J(w,b)}}{\partial{b}}$$
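As a minimal sketch (not from the notes), the update rule can be tried on a toy one-dimensional cost $J(w)=(w-3)^2$, whose derivative is $2(w-3)$; the cost and the values below are made up purely for illustration:
def dJ_dw(w):
    return 2*(w - 3)        # derivative of the illustrative cost J(w) = (w - 3)^2

w = 0.0                     # initial guess
alpha = 0.1                 # learning rate
for _ in range(100):
    w = w - alpha*dJ_dw(w)  # w := w - α*dw
print(w)                    # approaches the minimum at w = 3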
Derivatives
The derivative at a point gives the slope of the function at that point. $\frac{\partial{J(w,b)}}{\partial{w}}$ is the rate at which $J(w,b)$ changes when $w$ changes by an infinitesimally small amount.
Computation graph
Computations of a neural network are organized in terms of a forward pass, or forward propagation step, where we compute the output of the network, and a backward pass, or backward propagation, where we compute gradients or derivatives. A computation graph can be used to show why it is organized this way. For $J(a,b,c)=3(a+bc)$, we introduce intermediate variables $u=bc$ and $v=a+u$, so that $J=3v$. We can compute the output by doing a left-to-right pass on the computation graph (computing $u$, then $v$, and finally $J$).
Derivative calculation using computation graph
Take $a=5$, $b=3$, $c=2$, so $u = 6$, $v = 11$ and $J = 33$. Doing a right-to-left pass (backpropagation), $\frac{dJ}{dv}=3$. Similarly, $\frac{dJ}{da}=\frac{dJ}{dv}\cdot\frac{dv}{da}=3\cdot1=3$. When writing Python code, we will use the variable name dJdvar, or just dvar, to represent these derivatives. So $\frac{dJ}{dv}$ is represented as just dv, $\frac{dJ}{da}$ as da, and so on. Now, $\frac{dJ}{du}=\frac{dJ}{dv}\cdot\frac{dv}{du}=3\cdot1=3$ and $\frac{dJ}{db}=\frac{dJ}{du}\cdot\frac{du}{db}=3\cdot c=3\cdot2=6$. Similarly, we obtain $\frac{dJ}{dc}=\frac{dJ}{du}\cdot\frac{du}{dc}=3\cdot b=3\cdot3=9$.
We see from the above discussion that the most efficient way to compute all the derivatives is a right-to-left pass: the derivatives with respect to variables closest to the output (like $v$ and $u$) are computed first, and the chain rule then makes it cheap to compute the derivatives with respect to variables further to the left.
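As a small check, the same forward and backward passes can be written directly in Python (a sketch using the example values above):
a, b, c = 5, 3, 2
# Forward pass (left to right)
u = b*c       # u = 6
v = a + u     # v = 11
J = 3*v       # J = 33
# Backward pass (right to left), applying the chain rule
dv = 3        # dJ/dv
da = dv*1     # dJ/da = dJ/dv * dv/da = 3
du = dv*1     # dJ/du = dJ/dv * dv/du = 3
db = du*c     # dJ/db = dJ/du * du/db = 6
dc = du*b     # dJ/dc = dJ/du * du/dc = 9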
Logistic Regression Gradient Descent
Consider a single training example with two features, so that $z = w_1x_1 + w_2x_2 + b$, $a = \hat{y} = \sigma(z)$, and the loss is $\mathcal{L}(a,y)$. As discussed in the previous section, $d\mathcal{L}/dx_1$, $d\mathcal{L}/dx_2$, $d\mathcal{L}/dw_1$, $d\mathcal{L}/dw_2$ and $d\mathcal{L}/db$ can be calculated using backpropagation. Writing $dz$ for $d\mathcal{L}/dz$, these values are found to be $$\frac{d\mathcal{L}}{dw_1}=x_1\,dz$$ $$\frac{d\mathcal{L}}{dw_2}=x_2\,dz$$
$$\frac{d\mathcal{L}}{db}=dz=a-y$$
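For completeness, $dz = a - y$ follows from the chain rule:
$$\frac{d\mathcal{L}}{da}=-\frac{y}{a}+\frac{1-y}{1-a},\qquad \frac{da}{dz}=a(1-a),\qquad \frac{d\mathcal{L}}{dz}=\frac{d\mathcal{L}}{da}\cdot\frac{da}{dz}=a-y$$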
In code, these are the variables dx1, dx2, dw1, dw2 and db. After computing these, w1 (and w2) and b are updated as discussed previously:
$$w_1 := w_1 - \alpha dw_1$$
$$b := b - \alpha db$$
Gradient Descent on m examples
Recall that the cost function $J$ is the average of the loss $\mathcal{L}$ over the $m$ training examples.
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(a^{(i)},y^{(i)})$$
$$\frac{\partial{}}{\partial{w_1}}J(w,b)=\frac{1}{m}\sum_{i=1}^{m}\underbrace{\frac{\partial{}}{\partial{w_1}}\mathcal{L}(a^{(i)},y^{(i)})}_{dw_{1}^{(i)}}$$
Keeping in mind the variable naming convention described in the previous section, we can compute the overall gradient over the $m$ training examples (assuming two features x1 and x2) with the following algorithm, which performs one step of gradient descent:
Initialize J=0, dw1=0, dw2=0, db=0
For i=1 to m:
    zi = w'*xi + b
    ai = σ(zi)
    J += -(yi*log(ai) + (1-yi)*log(1-ai))
    dzi = ai - yi
    ## Here we only have two features x1 and x2
    ## When we have nx features, there will be a for loop here
    dw1 += xi1*dzi  # As derived in the previous section
    dw2 += xi2*dzi
    db += dzi
J /= m
dw1 /= m
dw2 /= m
db /= m
w1 := w1 - α*dw1
w2 := w2 - α*dw2
b := b - α*db
In this algorithm, we have two for loops (one over the training examples and one over the features) that will slow things down when we have a large amount of data. The same computation can be implemented without explicit for loops using vectorization.
Python and vectorization
Vectorization
Vectorization makes use of SIMD (Single Instruction, Multiple Data) instructions to do matrix and vector operations without explicit for loops. It is much faster.
Pseudo code for calculating $z=w^{\intercal}x+b$ without vectorization ($w$ and $x$ are $n_x$-dimensional column vectors):
z = 0
for i=0 to nx-1:
    z += w[i]*x[i]
z += b
Vectorized code using numpy would be np.dot(w,x) + b.
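As an illustration (a sketch, not from the notes; the array size is arbitrary), the speed difference can be measured like this:
import time
import numpy as np

n = 1000000
w = np.random.rand(n)
x = np.random.rand(n)

# Vectorized version
tic = time.time()
z = np.dot(w, x)
toc = time.time()
print("Vectorized:", 1000*(toc - tic), "ms")

# Explicit for-loop version
tic = time.time()
z = 0
for i in range(n):
    z += w[i]*x[i]
toc = time.time()
print("For loop:", 1000*(toc - tic), "ms")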
More vectorization examples
Whenever possible, avoid using explicit for loops and use built-in functions instead.
Multiplying a matrix and a vector
Say we need u = Av, where A is a matrix (with num_rows rows and num_columns columns) and v is a vector. Non-vectorized version:
u = np.zeros((num_rows, 1))
for i in range(num_rows):
    for j in range(num_columns):
        u[i] += A[i][j]*v[j]
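The vectorized version is a single call:
u = np.dot(A, v)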
Vector and matrix valued functions
Say we have a column vector v and we need to take the exponential of each element and output a vector u. Non-vectorized version:
u = np.zeros((n,1))  # n x 1 zero vector
for i in range(n):
    u[i] = math.exp(v[i])
Vectorized version: u = np.exp(v).
There are many such functions, like np.log, np.abs, np.max, np.maximum, v**2 (for element-wise squaring), etc.
Logistic Regression Derivatives
Non-vectorized version:
Initialize J=0, dw1=0, dw2=0, db=0
For i=1 to m:
    zi = w'*xi + b
    ai = σ(zi)
    J += -(yi*log(ai) + (1-yi)*log(1-ai))
    dzi = ai - yi
    ## Here we only have two features x1 and x2
    ## When we have nx features, there will be a for loop here
    dw1 += xi1*dzi  # As derived in the previous section
    dw2 += xi2*dzi
    db += dzi
J /= m
dw1 /= m
dw2 /= m
db /= m
w1 := w1 - α*dw1
w2 := w2 - α*dw2
b := b - α*db
First, we are going to remove the second for loop (the one over the features). Instead of using dw1, dw2, etc. and looping through them, we can use a single vector dw as follows:
dw = np.zeros((n_x,1))  # instead of dw1, dw2, ..., dwn_x
#...
dw += xi*dzi            # inside the loop over training examples
#...
dw /= m
Vectorizing Logistic Regression
Logistic regression using no for loops. Instead of calculating z1 = w'*x1 + b, z2 = w'*x2 + b, etc. separately, we can compute Z = np.dot(w.T,X) + b, where X is the compact matrix containing the input features of all training examples, as discussed in the notations section earlier (b is a scalar that numpy broadcasts across the row).
Now, a1 = σ(z1), a2 = σ(z2), etc. can also be replaced using a vectorized sigmoid function, as A = [a1 a2 ... am] = σ(Z).
Vectorizing Logistic Regression’s Gradient Descent Output
Now, to remove the first loop for dzi = ai - yi, we can use dz = A - Y, where A = [a1 a2 a3 ...] and Y = [y1 y2 y3 ...].
So, the vectorized logistic regression algorithm would be:
Z = np.dot(w.T,X) + b        # np.dot for matrix multiplication; b is broadcast
A = σ(Z)                     # element-wise sigmoid
dz = A - Y
dw = (1/m)*np.dot(X,dz.T)    # so that dw stacks dw1 = Σ x1*dz, dw2 = Σ x2*dz, ... averaged over m
db = (1/m)*np.sum(dz)
w := w - α*dw
b := b - α*db
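Putting it all together, here is a minimal runnable sketch in Python/numpy (the synthetic data, number of iterations and learning rate are made up for illustration):
import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# Synthetic data: n_x features, m examples (made-up shapes)
n_x, m = 4, 100
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.1  # learning rate

for _ in range(1000):
    Z = np.dot(w.T, X) + b        # shape (1, m)
    A = sigmoid(Z)                # shape (1, m)
    dz = A - Y                    # shape (1, m)
    dw = (1/m)*np.dot(X, dz.T)    # shape (n_x, 1)
    db = (1/m)*np.sum(dz)
    w = w - alpha*dw
    b = b - alpha*db

cost = -(1/m)*np.sum(Y*np.log(A) + (1 - Y)*np.log(1 - A))
print(cost)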
Broadcasting in Python
We can use broadcasting to make our Python programs simpler and faster.
Broadcasting examples on a 2x2 matrix A and a 2x1 matrix B:
Y = A + B   # Add elements in A to corresponding elements in B; the single column of B is copied to match the dimensions of A
Y = A - B   # Element-wise subtraction, with B broadcast in the same way
Y = A + 2   # Add 2 to all elements of A
Y = A*B     # Element-wise multiplication
Y = A/B     # Element-wise division
General Principle
Consider a matrix A of shape (m,n) combined (using +, -, * or /) with a vector of shape (1,n) or (m,1). The vector is broadcast (copied across rows or columns) so that the result is a matrix of shape (m,n).
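For instance (a made-up illustration, not from the notes), dividing every column of a matrix by its column sum relies on exactly this rule:
import numpy as np

A = np.random.rand(3, 4)                  # shape (3,4)
col_sums = A.sum(axis=0, keepdims=True)   # shape (1,4)
P = A/col_sums                            # (3,4) / (1,4): the row vector is broadcast down the rows
print(P.sum(axis=0))                      # each column of P now sums to 1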
A note on Python/numpy vectors
If we do a = np.random.randn(5), a will have a shape of (5,). This is called a rank 1 array, and it is better to avoid dealing with it. We should instead have a row or column vector of shape (1,5) or (5,1) in this case. We can either reshape a or create it directly with a = np.random.randn(1,5) or a = np.random.randn(5,1).
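A small sketch of both options (the assert is just a sanity check on the shape):
import numpy as np

a = np.random.randn(5)       # rank 1 array, shape (5,)
a = a.reshape(5, 1)          # now an explicit column vector
assert a.shape == (5, 1)

b = np.random.randn(5, 1)    # created directly as a column vector
assert b.shape == (5, 1)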
From Jupyter Notebooks
Snippets of useful code from this module's Jupyter notebooks.
Sigmoid Function using broadcasting
import numpy as np

def sigmoid(x):
    s = 1/(1 + np.exp(-x))  # since we're using broadcasting, this works on scalars, arrays and matrices
    return s
x = np.array([1, 2, 3])
sigmoid(x)
Gradient of sigmoid
We need to calculate $\sigma'(x) = s(1-s)$, where $s=\sigma(x)$.
def sigmoid_derivative(x):
    s = 1/(1 + np.exp(-x))   # s = sigmoid(x)
    ds = s*(1 - s)           # derivative of the sigmoid
    return ds
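Usage, continuing the example above (values approximate):
sigmoid_derivative(np.array([1, 2, 3]))   # ≈ [0.1966, 0.1050, 0.0452]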
Reshaping arrays
This function is often used in deep learning to unroll image intro a 1D vector. An image is a matrix of shape (length,height,depth=3)
, where depth is used for color.
def image2vector(image):
    v = image.reshape(image.shape[0]*image.shape[1]*image.shape[2], 1)  # v will have length*height*depth rows and 1 column
    return v
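For example, with a made-up 2x2x3 "image":
image = np.random.randn(2, 2, 3)
image2vector(image).shape   # (12, 1)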
Matrix multiplication
np.dot(A,B) # Matrix multiplication
np.multiply(A,B) # Element-wise multiplication
Unrolling training set images
A trick to flatten an array X of shape (a,b,c,d) into a matrix X_flatten of shape (b*c*d, a) is to use:
X_flatten = X.reshape(X.shape[0], -1).T
This is useful when a is the number of training examples.
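For example, with shapes matching the 64x64 cat-image setup from earlier (the number of examples, 209, is made up here):
X = np.random.randn(209, 64, 64, 3)       # 209 hypothetical 64x64 RGB training images
X_flatten = X.reshape(X.shape[0], -1).T
X_flatten.shape                           # (12288, 209)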