{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculus - Week 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Dive into the fundamentals of calculus for machine learning and data science. This week you'll learn about optimization of single and multi-layer neural networks, second derivatives, the Hessian and second-order optimization algorithms like Newton's method.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "import re\n", "import warnings\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "import sympy as sp\n", "from IPython.core.getipython import get_ipython\n", "from IPython.display import display, HTML, Math\n", "from matplotlib.animation import FuncAnimation\n", "from scipy.interpolate import interp1d\n", "\n", "plt.style.use(\"seaborn-v0_8-whitegrid\")\n", "pio.renderers.default = \"plotly_mimetype+notebook\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimizing neural networks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Single-neuron network with linear activation and Mean Squared Error (MSE) loss function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let the linear model $Z = wX + b$, where\n", "\n", "$X$ is a $(k, m)$ matrix of $k$ features for $m$ samples,\n", "\n", "$w$ is a $(1, k)$ matrix (row vector) containing $k$ weights,\n", "\n", "$b$ is a $(1, 1)$ matrix (scalar) containing 1 bias, such that\n", "\n", "$Z = \\begin{bmatrix}w_1&&w_2&&\\dots w_k\\end{bmatrix} \\begin{bmatrix}x_{11}&&x_{12}&&\\dots&&x_{1m}\\\\x_{21}&&x_{22}&&\\dots&&x_{2m}\\\\\\vdots&&\\vdots&&\\ddots&&\\vdots\\\\x_{k1}&&x_{k2}&&\\dots&&x_{km}\n", "\\end{bmatrix} + \\begin{bmatrix}b\\end{bmatrix}$\n", "\n", "$Z$ is a $(1, m)$ matrix.\n", "\n", "Let $Y$ a $(1, m)$ matrix (row vector) containing the labels of $m$ samples, such that\n", "\n", "$Y = \\begin{bmatrix}y_1&&y_2&&\\dots&&y_m\\end{bmatrix}$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = 40\n", "k = 2\n", "w = np.array([[0.37, 0.57]])\n", "b = np.array([[0.1]])\n", "\n", "rng = np.random.default_rng(4)\n", "X = rng.standard_normal((k, m))\n", "Y = w @ X + b + rng.normal(size=(1, m))\n", "\n", "scatter = go.Scatter3d(\n", " z=Y.squeeze(),\n", " x=X[0],\n", " y=X[1],\n", " mode=\"markers\",\n", " marker=dict(color=\"#1f77b4\", size=5),\n", " name=\"data\",\n", ")\n", "\n", "fig = go.Figure(scatter)\n", "fig.update_layout(\n", " title=\"Regression data\",\n", " autosize=False,\n", " width=600,\n", " height=600,\n", " scene_aspectmode=\"cube\",\n", " margin=dict(l=10, r=10, b=10, t=30),\n", " scene=dict(\n", " xaxis=dict(title=\"x1\", range=[-2.5, 2.5]),\n", " yaxis=dict(title=\"x2\", range=[-2.5, 2.5]),\n", " zaxis_title=\"y\",\n", " camera_eye=dict(x=1.2, y=-1.8, z=0.5),\n", " ),\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The predictions $\\hat{Y}$ are the result of passing $Z$ to a linear activation function, so that $\\hat{Y} = I(Z)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def linear(x, m):\n", " return x * m\n", "\n", "\n", "def init_neuron_params(k):\n", " w = rng.uniform(size=(1, k)) * 0.5\n", " b = np.zeros((1, 1))\n", " return {\"w\": w, \"b\": b}\n", "\n", "\n", "xx1, xx2 = np.meshgrid(np.linspace(-2.5, 2.5, 100), np.linspace(-2.5, 2.5, 100))\n", "w, b = init_neuron_params(k).values()\n", "random_model_plane = go.Surface(\n", " z=linear(w[0, 0] * xx1 + w[0, 1] * xx2 + b, m=1),\n", " x=xx1,\n", " y=xx2,\n", " colorscale=[[0, \"#FF8920\"], [1, \"#FF8920\"]],\n", " showscale=False,\n", " opacity=0.5,\n", " name=\"init params\",\n", ")\n", "fig.add_trace(random_model_plane)\n", "fig.update_layout(title=\"Random model\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a single sample the squared loss is\n", "\n", "$\\ell(w, b, y_i) = (y_i - \\hat{y}_i)^2 = (y_i - wx_i - b)^2$\n", "\n", "For the whole sample the mean squared loss is \n", "\n", "$\\mathcal{L}(w, b, Y) = \\cfrac{1}{m}\\sum \\limits_{i=1}^{m} \\ell(w, b, y_i) = \\cfrac{1}{m}\\sum \\limits_{i=1}^{m} (y_i - wx_i - b)^2$\n", "\n", "So that we don't have a lingering 2 in the partial derivatives, we can rescale it by 0.5\n", "\n", "$\\mathcal{L}(w, b, Y) = \\cfrac{1}{2m}\\sum \\limits_{i=1}^{m} (y_i - wx_i - b)^2$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def neuron_output(X, params, activation, *args):\n", " w = params[\"w\"]\n", " b = params[\"b\"]\n", " Z = w @ X + b\n", " Y_hat = activation(Z, *args)\n", " return Y_hat\n", "\n", "\n", "def compute_mse_loss(Y, Y_hat):\n", " m = Y_hat.shape[1]\n", " return np.sum((Y - Y_hat) ** 2) / (2 * m)\n", "\n", "\n", "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, linear, 1)\n", "print(f\"MSE loss of random model: {compute_mse_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The partial derivatives of the loss function are\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial w}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial b}$\n", "\n", "Let's calculate $\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}$, $\\cfrac{\\partial \\hat{Y}}{\\partial w}$ and $\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}} = \\cfrac{\\partial}{\\partial \\hat{Y}} \\cfrac{1}{m}\\sum \\limits_{i=1}^{m} (Y - \\hat{Y})^2 = \\cfrac{1}{m}\\sum \\limits_{i=1}^{m} 2(Y - \\hat{Y})(- 1) = -\\cfrac{1}{m}\\sum \\limits_{i=1}^{m}(Y - \\hat{Y})$\n", "\n", "$\\cfrac{\\partial \\hat{Y}}{\\partial w} = \\cfrac{\\partial}{\\partial w} wX + b = X$\n", "\n", "$\\cfrac{\\partial \\hat{Y}}{\\partial b} = wX + b = X = 1$\n", "\n", "Let's put it all together\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = -\\cfrac{1}{m}\\sum \\limits_{i=1}^{m} (Y - \\hat{Y}) \\cdot X^T$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = -\\cfrac{1}{m}\\sum \\limits_{i=1}^{m} (Y - \\hat{Y})$\n", "\n", "> 🔑 $\\cfrac{\\partial \\mathcal{L}}{\\partial w}$ contains all the partial derivative wrt to each of the $k$ elements of $w$; it's a $(k, m)$ matrix, because the dot product is between a matrix of shape $(1, m)$ and $X^T$ which is $(m, k)$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compute_grads(X, Y, Y_hat):\n", " m = Y_hat.shape[1]\n", " dw = -1 / m * np.dot(Y - Y_hat, X.T) # (1, k)\n", " db = -1 / m * np.sum(Y - Y_hat, axis=1, keepdims=True) # (1, 1)\n", " return {\"w\": dw, \"b\": db}\n", "\n", "\n", "def update_params(params, grads, learning_rate=0.1):\n", " params = params.copy()\n", " for k in grads.keys():\n", " params[k] = params[k] - learning_rate * grads[k]\n", " return params\n", "\n", "\n", "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, linear, 1)\n", "print(f\"MSE loss before update: {compute_mse_loss(Y, Y_hat):.2f}\")\n", "grads = compute_grads(X, Y, Y_hat)\n", "params = update_params(params, grads)\n", "Y_hat = neuron_output(X, params, linear, 1)\n", "print(f\"MSE loss after update: {compute_mse_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find the best parameters with gradient descent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, linear, 1)\n", "loss = compute_mse_loss(Y, Y_hat)\n", "print(f\"Iter 0 - MSE loss={loss:.6f}\")\n", "for i in range(1, 50 + 1):\n", " grads = compute_grads(X, Y, Y_hat)\n", " params = update_params(params, grads)\n", " Y_hat = neuron_output(X, params, linear, 1)\n", " loss_new = compute_mse_loss(Y, Y_hat)\n", " if loss - loss_new <= 1e-4:\n", " print(f\"Iter {i} - MSE loss={loss:.6f}\")\n", " print(\"The algorithm has converged\")\n", " break\n", " loss = loss_new\n", " if i % 5 == 0:\n", " print(f\"Iter {i} - MSE loss={loss:.6f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the final model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w, b = params.values()\n", "final_model_plane = go.Surface(\n", " z=w[0, 0] * xx1 + w[0, 1] * xx2 + b,\n", " x=xx1,\n", " y=xx2,\n", " colorscale=[[0, \"#2ca02c\"], [1, \"#2ca02c\"]],\n", " showscale=False,\n", " opacity=0.5,\n", " name=\"final params\",\n", ")\n", "fig.add_trace(final_model_plane)\n", "fig.data[1].visible = False\n", "fig.update_layout(title=\"Final model\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Single-neuron network with sigmoid activation and Log loss function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, let the linear model $Z = wX + b$ and $Y$ a row vector containing the labels of $m$ samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = 40\n", "k = 2\n", "neg_centroid = [-1, -1]\n", "pos_centroid = [1, 1]\n", "\n", "rng = np.random.default_rng(1)\n", "X = np.r_[\n", " rng.standard_normal((m // 2, k)) + neg_centroid,\n", " rng.standard_normal((m // 2, k)) + pos_centroid,\n", "].T\n", "Y = np.array([[0] * (m // 2) + [1] * (m // 2)])\n", "\n", "plt.scatter(X[0], X[1], c=np.where(Y.squeeze() == 0, \"tab:orange\", \"tab:blue\"))\n", "plt.gca().set_aspect(\"equal\")\n", "plt.xlabel(\"$x_1$\")\n", "plt.ylabel(\"$x_2$\")\n", "plt.xlim(-5, 5)\n", "plt.ylim(-5, 5)\n", "plt.title(\"Classification data\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, though, the predictions $\\hat{Y}$ are the result of passing $Z$ to a sigmoid function, so that $\\hat{Y} = \\sigma(Z)$.\n", "\n", "$\\sigma(Z) = \\cfrac{1}{1+e^{-Z}}$\n", "\n", "To visualize the sigmoid function let's add another axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neg_scatter = go.Scatter3d(\n", " z=np.full(int(m / 2), 0),\n", " x=X[0, : int(m / 2)],\n", " y=X[1, : int(m / 2)],\n", " mode=\"markers\",\n", " marker=dict(color=\"#ff7f0e\", size=5),\n", " name=\"negative class\",\n", ")\n", "pos_scatter = go.Scatter3d(\n", " z=np.full(int(m / 2), 1),\n", " x=X[0, int(m / 2) :],\n", " y=X[1, int(m / 2) :],\n", " mode=\"markers\",\n", " marker=dict(color=\"#1f77b4\", size=5),\n", " name=\"positive class\",\n", ")\n", "\n", "fig = go.Figure([pos_scatter, neg_scatter])\n", "fig.update_layout(\n", " title=\"Classification data\",\n", " autosize=False,\n", " width=600,\n", " height=600,\n", " margin=dict(l=10, r=10, b=10, t=30),\n", " scene=dict(\n", " xaxis=dict(title=\"x1\", range=[-5, 5]),\n", " yaxis=dict(title=\"x2\", range=[-5, 5]),\n", " zaxis_title=\"y\",\n", " camera_eye=dict(x=0, y=0.3, z=2.5),\n", " camera_up=dict(x=0, y=np.sin(np.pi), z=0),\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's plot the predictions of a randomly initialized model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sigmoid(x):\n", " return 1 / (1 + np.exp(-x))\n", "\n", "\n", "xx1, xx2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))\n", "w, b = init_neuron_params(k).values()\n", "random_model_plane = go.Surface(\n", " z=sigmoid(w[0, 0] * xx1 + w[0, 1] * xx2 + b),\n", " x=xx1,\n", " y=xx2,\n", " colorscale=[[0, \"#ff7f0e\"], [1, \"#1f77b4\"]],\n", " showscale=False,\n", " opacity=0.5,\n", " name=\"init params\",\n", ")\n", "fig.add_trace(random_model_plane)\n", "fig.update_layout(\n", " title=\"Random model\",\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that the output of the sigmoid activation is the probability $p$ of a sample belonging to the positive class, which implies that $(1-p)$ is the probability of a sample belonging to the negative class.\n", "\n", "Intuitively, a loss function will be small when $p_i$ is close to 1.0 and $y_i = 1$ and when $p_i$ is close to 0.0 and $y_i = 0$.\n", "\n", "For a single sample this loss function might look like this\n", "\n", "$\\ell(w, b, y_i) = -p_i^{y_i}(1-p_i)^{1-y_i}$\n", "\n", "For the whole sample the loss would be\n", "\n", "$\\mathcal{L}(w, b, Y) = -\\prod \\limits_{i=1}^{m} p_i^{y_i}(1-p_i)^{1-y_i}$\n", "\n", "In [Univariate optimization (week 1)](ca_w1.html#Univariate-optimization) we've seen how it's easier to calculate the derivative of the logarithm of the PMF of the Binomial Distribution.\n", "\n", "We can do the same here to obtain a more manageable loss function.\n", "\n", "$\\mathcal{L}(w, b, Y) = -\\sum \\limits_{i=1}^{m} y_i \\ln p_i + (1-y_i) \\ln (1-p_i)$\n", "\n", "It turns out it's standard practice to minimize a function and to average the loss over the sample (to manage the scale of the loss for large datasets), so we'll use this instead:\n", "\n", "$\\mathcal{L}(w, b, Y) = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} y_i \\ln p_i + (1-y_i) \\ln (1-p_i)$\n", "\n", "Finally, let's substitute back $\\hat{y}_i$ for $p_i$.\n", "\n", "$\\mathcal{L}(w, b, Y) = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} y_i \\ln \\hat{y}_i + (1-y_i) \\ln (1-\\hat{y}_i)$\n", "\n", "Recall that $\\hat{y}_i = \\sigma(z_i)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compute_log_loss(Y, Y_hat):\n", " m = Y_hat.shape[1]\n", " loss = (-1 / m) * (\n", " np.dot(Y, np.log(Y_hat).T) + np.dot((1 - Y), np.log(1 - Y_hat).T)\n", " )\n", " return loss.squeeze()\n", "\n", "\n", "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "print(f\"Log loss of random model: {compute_log_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The partial derivatives of the loss function are\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial w}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial b}$\n", "\n", "Because we have an activation function around $Z$, we have another chain rule to apply.\n", "\n", "For the MSE loss, the activation was an identity function so it didn't pose the same challenge.\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}\\cfrac{\\partial \\sigma(Z)}{\\partial w} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}\\cfrac{\\partial \\sigma(Z)}{\\partial Z}\\cfrac{\\partial Z}{\\partial w}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}\\cfrac{\\partial \\sigma(Z)}{\\partial b} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}\\cfrac{\\partial \\sigma(Z)}{\\partial Z}\\cfrac{\\partial Z}{\\partial b}$\n", "\n", "Let's calculate $\\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}$, $\\cfrac{\\partial \\sigma(Z)}{\\partial Z}$, $\\cfrac{\\partial Z}{\\partial w}$ and $\\cfrac{\\partial Z}{\\partial b}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculation of partial derivative of Log loss wrt sigmoid activation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)} = \\cfrac{\\partial}{\\partial \\sigma(Z)} -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} Y\\ln\\sigma(Z)+(Y-1)\\ln(1-\\sigma(Z))$ \n", "\n", "$= -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y}{\\sigma(Z)}+\\cfrac{Y-1}{1-\\sigma(Z)}$\n", "\n", "$= -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y(1-\\sigma(Z)) + (Y-1)\\sigma(Z)}{\\sigma(Z)(1-\\sigma(Z))}$ \n", "\n", "$= -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y -\\sigma(Z)}{\\sigma(Z)(1-\\sigma(Z))}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculation of the derivative of the sigmoid function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\cfrac{\\partial \\sigma(Z)}{\\partial Z} = \\cfrac{\\partial}{\\partial Z} \\cfrac{1}{1+e^{-Z}}$\n", "\n", "$= (1+e^{-Z})^{-1}$ \n", "\n", "$= -(1+e^{-Z})^{-2}(-e^{-Z})$ \n", "\n", "$= (1+e^{-Z})^{-2}e^{-Z}$\n", "\n", "$= \\cfrac{e^{-Z}}{(1+e^{-Z})^2}$ \n", "\n", "$= \\cfrac{1 +e^{-Z} - 1}{(1+e^{-Z})^2}$ \n", "\n", "$= \\cfrac{1 +e^{-Z}}{(1+e^{-Z})^2}-\\cfrac{1}{(1+e^{-Z})^2}$ \n", "\n", "$= \\cfrac{1}{1+e^{-Z}}-\\cfrac{1}{(1+e^{-Z})^2}$\n", "\n", "Because the factor of $x-x^2$ is $x(1-x)$ we can factor it to\n", "\n", "$= \\cfrac{1}{1+e^{-Z}}\\left(1 - \\cfrac{1}{1+e^{-Z}} \\right)$ \n", "\n", "$= \\sigma(Z)(1 - \\sigma(Z))$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The other two partial derivatives are the same as the MSE loss'.\n", "\n", "$\\cfrac{\\partial Z}{\\partial w} = \\cfrac{\\partial}{\\partial w} wX+b = X$\n", "\n", "$\\cfrac{\\partial Z}{\\partial b} = \\cfrac{\\partial}{\\partial b} wX+b = 1$\n", "\n", "Let's put it all together\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y -\\sigma(Z)}{\\sigma(Z)(1-\\sigma(Z))} \\sigma(Z)(1 - \\sigma(Z)) X^T = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} (Y -\\sigma(Z))X^T = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} (Y - \\hat{Y}) X^T$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y -\\sigma(Z)}{\\sigma(Z)(1-\\sigma(Z))} \\sigma(Z)(1 - \\sigma(Z)) = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} Y - \\hat{Y}$\n", "\n", "> 🔑 The partial derivatives of the Log loss $\\cfrac{\\partial \\mathcal{L}}{\\partial w}$ and $\\cfrac{\\partial \\mathcal{L}}{\\partial b}$ are the same as the MSE loss'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "print(f\"Log loss before update: {compute_log_loss(Y, Y_hat):.2f}\")\n", "grads = compute_grads(X, Y, Y_hat)\n", "params = update_params(params, grads)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "print(f\"Log loss after update: {compute_log_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find the best parameters using gradient descent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "loss = compute_log_loss(Y, Y_hat)\n", "print(f\"Iter 0 - Log loss={loss:.6f}\")\n", "for i in range(1, 500 + 1):\n", " grads = compute_grads(X, Y, Y_hat)\n", " params = update_params(params, grads)\n", " Y_hat = neuron_output(X, params, sigmoid)\n", " loss_new = compute_log_loss(Y, Y_hat)\n", " if loss - loss_new <= 1e-4:\n", " print(f\"Iter {i} - Log loss={loss:.6f}\")\n", " print(\"The algorithm has converged\")\n", " break\n", " loss = loss_new\n", " if i % 100 == 0:\n", " print(f\"Iter {i} - Log loss={loss:.6f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the final model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w, b = params.values()\n", "final_model_plane = go.Surface(\n", " z=sigmoid(w[0, 0] * xx1 + w[0, 1] * xx2 + b),\n", " x=xx1,\n", " y=xx2,\n", " colorscale=[[0, \"#ff7f0e\"], [1, \"#1f77b4\"]],\n", " showscale=False,\n", " opacity=0.5,\n", " name=\"final params\",\n", ")\n", "fig.add_trace(final_model_plane)\n", "fig.data[2].visible = False\n", "fig.update_layout(title=\"Final model\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Neural network with 1 hidden layer of 3 neurons and sigmoid activations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Motivation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create non-linear classification data to see the shortcomings of the previous model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = 100\n", "k = 2\n", "linspace_out = np.linspace(0, 2 * np.pi, m // 2, endpoint=False)\n", "linspace_in = np.linspace(0, 2 * np.pi, m // 2, endpoint=False)\n", "outer_circ_x = np.cos(linspace_out)\n", "outer_circ_y = np.sin(linspace_out)\n", "inner_circ_x = np.cos(linspace_in) * 0.3\n", "inner_circ_y = np.sin(linspace_in) * 0.3\n", "\n", "rng = np.random.default_rng(1)\n", "X = np.vstack(\n", " [np.append(outer_circ_x, inner_circ_x), np.append(outer_circ_y, inner_circ_y)]\n", ")\n", "X += rng.normal(scale=0.1, size=X.shape)\n", "X = (X - np.mean(X, axis=1, keepdims=True)) / np.std(X, axis=1, keepdims=True)\n", "Y = np.array([[0] * (m // 2) + [1] * (m // 2)])\n", "\n", "plt.scatter(X[0], X[1], c=np.where(Y.squeeze() == 0, \"tab:orange\", \"tab:blue\"))\n", "plt.gca().set_aspect(\"equal\")\n", "plt.xlabel(\"$x_1$\")\n", "plt.ylabel(\"$x_2$\")\n", "plt.xlim(-3, 3)\n", "plt.ylim(-3, 3)\n", "plt.title(\"Non-linear classification data\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before let's add another axis to visualize the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neg_scatter = go.Scatter3d(\n", " z=np.full(int(m / 2), 0),\n", " x=X[0, : int(m / 2)],\n", " y=X[1, : int(m / 2)],\n", " mode=\"markers\",\n", " marker=dict(color=\"#ff7f0e\", size=5),\n", " name=\"negative class\",\n", ")\n", "pos_scatter = go.Scatter3d(\n", " z=np.full(int(m / 2), 1),\n", " x=X[0, int(m / 2) :],\n", " y=X[1, int(m / 2) :],\n", " mode=\"markers\",\n", " marker=dict(color=\"#1f77b4\", size=5),\n", " name=\"positive class\",\n", ")\n", "\n", "fig = go.Figure([pos_scatter, neg_scatter])\n", "fig.update_layout(\n", " title=\"Non-linear classification data\",\n", " autosize=False,\n", " width=600,\n", " height=600,\n", " margin=dict(l=10, r=10, b=10, t=30),\n", " scene=dict(\n", " xaxis=dict(title=\"x1\", range=[-3, 3]),\n", " yaxis=dict(title=\"x2\", range=[-3, 3]),\n", " zaxis_title=\"y\",\n", " aspectmode=\"cube\",\n", " camera_eye=dict(x=0, y=0, z=2.5),\n", " camera_up=dict(x=0, y=np.sin(np.pi), z=0),\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train a single-neuron model with sigmoid activation using a Log loss function and plot the respective plane." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "loss = compute_log_loss(Y, Y_hat)\n", "print(f\"Iter 0 - Log loss={loss:.6f}\")\n", "for i in range(1, 500 + 1):\n", " grads = compute_grads(X, Y, Y_hat)\n", " params = update_params(params, grads)\n", " Y_hat = neuron_output(X, params, sigmoid)\n", " loss_new = compute_log_loss(Y, Y_hat)\n", " if loss - loss_new <= 1e-4:\n", " print(f\"Iter {i} - Log loss={loss:.6f}\")\n", " print(\"The algorithm has converged\")\n", " break\n", " loss = loss_new\n", " if i % 100 == 0:\n", " print(f\"Iter {i} - Log loss={loss:.6f}\")\n", "\n", "w, b = params.values()\n", "xx1, xx2 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))\n", "final_model_plane = go.Surface(\n", " z=sigmoid(w[0, 0] * xx1 + w[0, 1] * xx2 + b),\n", " x=xx1,\n", " y=xx2,\n", " colorscale=[[0, \"#ff7f0e\"], [1, \"#1f77b4\"]],\n", " showscale=False,\n", " opacity=0.5,\n", " name=\"final params\",\n", ")\n", "fig.add_trace(final_model_plane)\n", "fig.update_layout(\n", " title=\"Linear model on non-linear data\",\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to improve it by adding more neurons." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Neural network notation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The neural network we will be building is shown below.\n", "\n", "\n", "\n", "A note on the indices of the weights:\n", "\n", "For example, $w_{121}$ ($w_{\\text{AT },\\text{FROM },\\text{TO}}$) means \"weight AT layer 1, FROM node 2 (previous layer) TO node 1 (current layer)\".\n", "\n", "Let $Z_1$ the $(n, m)$ pre-activation matrix of the first and only hidden layer.\n", "\n", "$Z_1 = \\begin{bmatrix}\n", "w_{111}x_{11}+w_{121}x_{21}+b_{1}&&\n", "w_{111}x_{12}+w_{121}x_{22}+b_{1}&&\n", "\\cdots&&\n", "w_{111}x_{1m}+w_{121}x_{2m}+b_{1}\\\\\n", "w_{112}x_{11}+w_{122}x_{21}+b_{1}&&\n", "w_{112}x_{12}+w_{122}x_{22}+b_{1}&&\n", "\\cdots&&\n", "w_{112}x_{1m}+w_{122}x_{2m}+b_{1}\\\\\n", "w_{113}x_{11}+w_{123}x_{21}+b_{1}&&\n", "w_{113}x_{12}+w_{123}x_{22}+b_{1}&&\n", "\\cdots&&\n", "w_{113}x_{1m}+w_{123}x_{2m}+b_{1}\n", "\\end{bmatrix}$\n", "\n", "A more compact matrix notation is shown below.\n", "\n", "$Z_1 = \\begin{bmatrix}\n", "w_{111}&&w_{121}\\\\\n", "w_{112}&&w_{122}\\\\\n", "w_{113}&&w_{123}\\\\\n", "\\end{bmatrix}\n", "\\begin{bmatrix}\n", "x_{11}&&x_{12}&&\\cdots&&x_{1m}\\\\\n", "x_{21}&&x_{22}&&\\cdots&&x_{2m}\n", "\\end{bmatrix} +\n", " \\begin{bmatrix}b_{1}\\end{bmatrix}$\n", "\n", "Let $A_1 = h(Z_1)$ where $h$ is an activation function (eg linear, sigmoid). For the hidden layers it's common to use other a ReLU activation function.\n", "\n", "Let $Z_2$ a $(3, m)$ pre-activation matrix of the output layer.\n", "\n", "$Z_2 = \\begin{bmatrix}w_{211}&&w_{221}&&w_{231}\\end{bmatrix} \n", "\\begin{bmatrix}\n", "a_{111}&&a_{112}&&\\dots&&a_{11m}\\\\\n", "a_{121}&&a_{122}&&\\dots&&a_{12m}\\\\\n", "a_{131}&&a_{132}&&\\dots&&a_{13m}\n", "\\end{bmatrix} +\n", " \\begin{bmatrix}b_{2}\\end{bmatrix}$\n", "\n", "$\\hat{Y} = \\sigma(Z_2)$\n", "\n", "Note that we could use $a_{0km}$ instead of $x_{km}$ if we want to define a generic $Z_{LAYER}$ matrix, but it's better to keep $x$ and $a$ distinct although it can help thinking that $x$ is basically just $a_0$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Partial derivatives of a neural network" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our network we need 5 steps to go from the inputs through the hidden layer to the loss function.\n", "\n", "For 1 sample (omitting the $i$ index for conciseness):\n", "\n", "1. calculate the linear combinations $z_{1n}$ for all $n$ neurons in the first hidden layer.\n", "\n", "$z_{1n} = w_{11n}x_1 + w_{12n}x_2 + b_1$\n", "\n", "2. pass them to the sigmoid activation function\n", "\n", "$a_{1n} = \\sigma(z_{1n})$\n", "\n", "3. calculate the linear combination $z_{21}$ for the final output node\n", "\n", "$z_{21} = w_{211}a_{11} + w_{221}a_{12} + w_{231}a_{13} + b_2$\n", "\n", "4. pass it to the sigmoid activation function\n", "\n", "$\\hat{Y} = \\sigma(z_{21})$\n", "\n", "5. compute the Log loss\n", "\n", "$\\mathcal{L}(w, b, Y) = -y \\ln \\hat{y} + (1-y) \\ln (1-\\hat{y})$\n", "\n", "It turns out that to work out the chain rule of the partial derivative of the loss wrt to a parameter of an input node we just need to chain the partial derivatives of these 5 functions.\n", "\n", "For example, let's workout the chain rule for $\\cfrac{\\partial \\mathcal{L}}{\\partial w_{111}}$\n", "\n", "Moving backwards, the first derivative we encounter is at 5. $\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}$, then at 4. $\\cfrac{\\partial \\hat{Y}}{\\partial z_{21}}$, then at 3. $\\cfrac{\\partial z_{21}}{\\partial a_{11}}$, then at 2. $\\cfrac{\\partial a_{11}}{\\partial z_{11}}$ and finally at 1. $\\cfrac{\\partial z_{11}}{\\partial w_{111}}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w_{111}} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}} \\cfrac{\\partial \\hat{Y}}{\\partial z_{21}} \\cfrac{\\partial z_{21}}{\\partial a_{11}} \\cfrac{\\partial a_{11}}{\\partial z_{11}} \\cfrac{\\partial z_{11}}{\\partial w_{111}}$\n", "\n", "From [Calculation of partial derivative of Log loss wrt sigmoid activation](#Calculation-of-partial-derivative-of-Log-loss-wrt-sigmoid-activation) we know\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}} = -\\cfrac{Y - \\hat{Y}}{\\hat{Y}(1-\\hat{Y})}$\n", "\n", "From [Calculation of the derivative of the sigmoid function](#Calculation-of-the-derivative-of-the-sigmoid-function) we know\n", "\n", "$\\cfrac{\\partial \\hat{Y}}{\\partial z_{21}} = \\hat{Y}(1-\\hat{Y})$\n", "\n", "And we know that these two simplify to\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial z_{21}} = - (Y - \\hat{Y})$\n", "\n", "Similarly to $\\cfrac{\\partial \\hat{Y}}{\\partial z_{21}}$ (partial derivative of the sigmoid function in the output node wrt its argument) we obtain $\\cfrac{\\partial a_{11}}{\\partial z_{11}}$ (partial derivative of the sigmoid function in the first node of the hidden layer wrt its argument)\n", "\n", "$\\cfrac{\\partial a_{11}}{\\partial z_{11}} = a_{11}(1-a_{11})$\n", "\n", "Then the others are very easy to calculate.\n", "\n", "$\\cfrac{\\partial z_{21}}{\\partial a_{11}} = \\cfrac{\\partial}{\\partial a_{11}} w_{211}a_{11} + w_{221}a_{12} + w_{231}a_{13} + b_2 = w_{211}$\n", "\n", "$\\cfrac{\\partial z_{11}}{\\partial w_{111}} = \\cfrac{\\partial}{\\partial w_{111}} w_{111}x_1 + w_{121}x_2 + b_1 = x_1$\n", "\n", "Let's put it all together.\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w_{111}} = - (Y - \\hat{Y})w_{211}a_{11}(1-a_{11})x_1$\n", "\n", "The partial derivative of the loss wrt of the weights in the hidden layer are the same as when we had a single-neuron neural network.\n", "\n", "As an example, let's calculate $\\cfrac{\\partial \\mathcal{L}}{\\partial w_{231}}$.\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w_{231}} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}} \\cfrac{\\partial \\hat{Y}}{\\partial z_{21}} \\cfrac{\\partial z_{21}}{\\partial w_{231}}$\n", "\n", "The only one we need to calculate is\n", "\n", "$\\cfrac{\\partial z_{21}}{\\partial w_{231}} = \\cfrac{\\partial}{\\partial w_{231}}w_{211}a_{11} + w_{221}a_{12} + w_{231}a_{13} + b_2 = a_{13}$\n", "\n", "Let's put it all together.\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w_{231}} = - (Y - \\hat{Y})a_{13}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Forward and backward propagation\n", "\n", "It turns out there is an algorithm to efficiently train a neural network.\n", "\n", "> 🔑 Forward propagation refers to the calculation and storage of output layers for backward propagation\n", "\n", "> 🔑 Backward propagation (aka backprop) refers to the calculation of partial derivatives for each layer where hidden layers both output the partial derivative of the loss wrt to the parameters used in the layer but also the partial derivative of the loss wrt the output of the previous layer\n", "\n", "Forward propagation doesn't need a lot of explanation. A key aspect, though, is the storage of the outputs.\n", "\n", "Backward propagation is also quite intuitive at the very high level as that's sort of how, for example, we naturally approached the chain rule of $\\cfrac{\\partial \\mathcal{L}}{\\partial w_{111}}$.\n", "\n", "Recall we moved backwards starting from the partial derivative of the loss wrt to $\\hat{Y}$ and we ended up with\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w_{111}} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}} \\cfrac{\\partial \\hat{Y}}{\\partial z_{21}} \\cfrac{\\partial z_{21}}{\\partial a_{11}} \\cfrac{\\partial a_{11}}{\\partial z_{11}} \\cfrac{\\partial z_{11}}{\\partial w_{111}}$\n", "\n", "

 

\n", "\n", "The forward and backward propagation algorithm is shown below.\n", "\n", "\n", "\n", "Note how the partial derivative of the loss wrt the parameters is built progressively and carried over from one layer to the next (or previous depends on how you look at it).\n", "\n", "The gray bit of the chain rule is what's been carried over and the black bit is what's been appended at each step.\n", "\n", "Also note how the process \"drops\" the partial derivative of the loss wrt the parameters for the final layer as it goes, while it keeps building the partial derivative for the parameters used in the first layer.\n", "\n", "That's the main idea behind its efficiency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The backward propagation for a network like the one we will be building will look like this.\n", "\n", "**Step 1**\n", "\n", "Inputs: $Y$, $\\hat{Y}$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}} = -\\cfrac{Y - \\hat{Y}}{\\hat{Y}(1-\\hat{Y})}$\n", "\n", "Outputs: $\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}$ (carry over to the next step)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 2**\n", "\n", "Inputs: $\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}$ (from previous step), $A_1$, $W_2$, $Z_2$ (from the layer's cache)\n", "\n", "compute $\\cfrac{\\partial \\hat{Y}}{\\partial Z_{2}} = \\cfrac{\\partial}{\\partial Z_{2}}\\sigma(Z_2)$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial Z_{2}} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial Z_{2}}$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial A_{1}} = W_2^T \\cdot \\cfrac{\\partial \\mathcal{L}}{\\partial Z_{2}}$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial W_{2}} = \\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{\\partial \\mathcal{L}}{\\partial Z_{2}} \\cdot A_1^T$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial b_{2}} = \\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{\\partial \\mathcal{L}}{\\partial Z_{2}}$\n", "\n", "Outputs: $\\cfrac{\\partial \\mathcal{L}}{\\partial A_{1}}$ (carry over to the next step), $\\cfrac{\\partial \\mathcal{L}}{\\partial W_{2}}$, $\\cfrac{\\partial \\mathcal{L}}{\\partial b_{2}}$ (send to update params process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 3**\n", "\n", "Inputs: $\\cfrac{\\partial \\mathcal{L}}{\\partial A_{1}}$ (from the previous step), $X$, $W_1$, $Z_1$ (from the layer's cache)\n", "\n", "compute $\\cfrac{\\partial A_{1}}{\\partial Z_{1}} = \\cfrac{\\partial}{\\partial Z_{1}}\\text{ReLU}(Z_1)$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial Z_{1}} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial A_{1}}{\\partial Z_{1}}$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial X} = W_1^T \\cdot \\cfrac{\\partial \\mathcal{L}}{\\partial Z_{1}}$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial W_{1}} = \\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{\\partial \\mathcal{L}}{\\partial Z_{1}} \\cdot X^T$\n", "\n", "compute $\\cfrac{\\partial \\mathcal{L}}{\\partial b_{1}} = \\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{\\partial \\mathcal{L}}{\\partial Z_{1}}$\n", "\n", "Outputs: $\\cfrac{\\partial \\mathcal{L}}{\\partial A_{0}}$ (do not use further), $\\cfrac{\\partial \\mathcal{L}}{\\partial W_{1}}$, $\\cfrac{\\partial \\mathcal{L}}{\\partial b_{1}}$ (send to update params process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More generally, the pseudo-code might be something along these lines.\n", "\n", "$\\begin{array}{l}\n", "\\textbf{Algorithm: } \\text{Forward and Backward Propagation} \\\\\n", "\\textbf{Input: } X, Y, \\text{initial parameter} [W_1, b_1, \\dots, W_L, b_L] \\\\\n", "\\textbf{Output: } \\text{gradients} [\\nabla W_1, \\nabla b_1, \\dots, \\nabla W_L, \\nabla b_L] \\\\\n", "\\phantom{0}1 : L \\gets \\text{len}([W_1, b_1, \\dots, W_L, b_L]) / 2 \\\\\n", "\\phantom{0}2 : A_{i-1} \\gets X \\\\\n", "\\phantom{0}3 : \\text{Forward propagation:} \\\\\n", "\\phantom{0}4 : \\text{for i = 1 to L do} \\\\\n", "\\phantom{0}5 : \\quad \\text{if } i < L \\text{ then} \\\\\n", "\\phantom{0}6 : \\quad \\quad \\mathcal{f}_i \\gets ReLU \\\\\n", "\\phantom{0}7 : \\quad \\text{else } \\\\\n", "\\phantom{0}8 : \\quad \\quad \\mathcal{f}_i \\gets \\sigma \\\\\n", "\\phantom{0}9 : \\quad \\text {end if} \\\\\n", "10 : \\quad Z_i \\gets W_iA_{i-1} + b_i \\\\\n", "11 : \\quad A_i \\gets \\mathcal{f}_i(Z_i) \\\\\n", "12 : \\quad \\text{cache}_i \\gets [A_{i-1}, \\mathcal{f}_i, W_i, Z_i] \\\\\n", "13 : \\quad \\text{if } i = L \\text{ then} \\\\\n", "14 : \\quad \\quad \\hat{Y} \\gets A_L \\\\\n", "15 : \\quad \\text {end if} \\\\\n", "16 : \\text{end for} \\\\\n", "17 : \\text{Backward propagation:} \\\\\n", "18 : dA_L \\gets -(Y - \\hat{Y}) / \\hat{Y}(1-\\hat{Y}) \\\\\n", "19 : \\text{for i = L down to 1 do} \\\\\n", "20 : \\quad [A_{i-1}, \\mathcal{f}_i, W_i, Z_i] \\gets \\text{unpack } \\text{cache}_i \\\\\n", "21 : \\quad dZ_i = dA_i \\mathcal{f'}_i(Z_i) \\\\\n", "22 : \\quad \\nabla W_i = (1/m) \\sum_{i=1}^{m} dZ_i A_{i-1}^T \\\\\n", "23 : \\quad \\nabla b_i = (1/m) \\sum_{i=1}^{m} dZ_i \\\\\n", "24 : \\quad dA_{i-1} = W_i^T dZ_i \\\\\n", "25 : \\text{end for} \\\\\n", "26 : \\text{return } [\\nabla W_1, \\nabla b_1, \\dots, \\nabla W_L, \\nabla b_L] \\\\\n", "\\end{array}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Building a neural network" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to randomly initialize all the parameters. \n", "\n", "A common initialization of weights in neural networks is the Glorot Uniform, which takes into account the size of the previous layer and the size of the current layer.\n", "\n", "$W_{i} \\sim U\\left(-\\cfrac{\\sqrt{6}}{\\sqrt{fan_{i-1} + fan_{i}}}, \\cfrac{\\sqrt{6}}{\\sqrt{fan_{i-1} + fan_{i}}}\\right)$, where $U(a, b)$ is the Uniform distribution and $fan_{i-1}$ is the size of the previous layer and $fan_i$ is the size of the current layer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def init_nn_params(n):\n", " params = {}\n", " for i in range(len(n) - 1):\n", " fan_in, fan_out = n[i : i + 2]\n", " limit = np.sqrt(6 / (fan_in + fan_out))\n", " params[\"W\" + str(i + 1)] = rng.uniform(-limit, limit, size=(fan_out, fan_in))\n", " params[\"b\" + str(i + 1)] = np.zeros((n[i + 1], 1))\n", " return params\n", "\n", "\n", "params = init_nn_params([X.shape[0], 3, Y.shape[0]])\n", "params" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we do forward propagation and compute the loss." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def relu(x):\n", " # np.maximum: element-wise max, not np.max along axis\n", " return np.maximum(0, x)\n", "\n", "\n", "def step_forward_prop(A_prev, layer, params, activation):\n", " W = params[\"W\" + str(layer)]\n", " b = params[\"b\" + str(layer)]\n", " Z = W @ A_prev + b\n", " A = activation(Z)\n", " cache = A_prev, W, Z, activation.__name__\n", " return A, cache\n", "\n", "\n", "def forward_prop(X, params, hidden_activation=relu, ouput_activation=sigmoid):\n", " A_prev = X\n", " caches = []\n", " n_layers = len(params) // 2\n", " for layer in range(1, n_layers + 1):\n", " if layer <= n_layers - 1:\n", " activation = hidden_activation\n", " else:\n", " activation = ouput_activation\n", " A, cache = step_forward_prop(A_prev, layer, params, activation)\n", " A_prev = A\n", " caches.append(cache)\n", " return A, caches\n", "\n", "\n", "Y_hat, caches = forward_prop(X, params)\n", "print(f\"Output layer is {caches[-1][2].shape} with {caches[-1][3]} activation\")\n", "for layer in range(len(caches) - 2, -1, -1):\n", " print(\n", " f\"Hidden layer {layer + 1} is {caches[layer][2].shape} with {caches[layer][3]} activation\"\n", " )\n", "print(f\"\\nLog loss of random model: {compute_log_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we do backward propagation.\n", "\n", "In [Proof that ReLU is not differentiable at 0](ca_w1.html#Proof-that-ReLU-is-not-differentiable-at-0) we said that the convention in machine learning is to consider its derivative at 0, well, 0.\n", "\n", "Let's see how just one undetermined derivative can wreak havoc on the parameters gradients during backward propagation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def d_sigmoid(x):\n", " return sigmoid(x) * (1 - sigmoid(x))\n", "\n", "\n", "def d_relu(x):\n", " if isinstance(x, np.ndarray):\n", " x = x.copy()\n", " x[x == 0] = np.nan\n", " elif x == 0:\n", " x = np.nan\n", " return np.sign(relu(x))\n", "\n", "\n", "dA2 = -(Y - Y_hat) / np.multiply(Y_hat, (1 - Y_hat))\n", "A1, W2, Z2, _ = caches[1]\n", "dZ2 = dA2 * d_sigmoid(Z2)\n", "dA1 = np.dot(W2.T, dZ2)\n", "X, W1, Z1, _ = caches[0]\n", "# let's introduce one NaN in Z1\n", "Z1[0, 0] = np.nan\n", "dZ1 = dA1 * d_relu(Z1)\n", "dW1 = (1 / dZ1.shape[1]) * np.dot(dZ1, X.T)\n", "dW1 = dW1.round(2)\n", "\n", "display(\n", " Math(\n", " \"\\\\langle \\\\nabla W_{111}, \\\\nabla W_{121} \\\\rangle=\"\n", " + sp.latex(sp.Matrix(dW1[0]))\n", " ),\n", " Math(\n", " \"\\\\langle \\\\nabla W_{112}, \\\\nabla W_{122} \\\\rangle=\"\n", " + sp.latex(sp.Matrix(dW1[1]))\n", " ),\n", " Math(\n", " \"\\\\langle \\\\nabla W_{113}, \\\\nabla W_{123} \\\\rangle=\"\n", " + sp.latex(sp.Matrix(dW1[2]))\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can't simply do this.\n", "\n", "$\\cfrac{\\partial A_{1}}{\\partial Z_{1}} = \\cfrac{\\partial}{\\partial Z_{1}}\\text{ReLU}(Z_1) = \\begin{cases}1 \\text{ if } x > 0\\\\0 \\text{ if } x < 0\\\\\\text{undefined if } x = 0\\end{cases}$\n", "\n", "As per the convention, we'll do this instead.\n", "\n", "$\\cfrac{\\partial A_{1}}{\\partial Z_{1}} = \\begin{cases}1 \\text{ if } x > 0\\\\0 \\text{ if } x \\le 0\\end{cases}$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def init_backward_prop(Y_hat):\n", " dAL = -(Y - Y_hat) / np.multiply(Y_hat, (1 - Y_hat))\n", " return dAL\n", "\n", "\n", "def step_backward_prop(dA_prev, cache):\n", " # extract elements from cache\n", " A_prev, W, Z, activation = cache\n", " # propagate dA gradient through activation function\n", " # d_relu is nan when x is 0\n", " # instead of propagating nan's we make them 0\n", " func = {\"sigmoid\": d_sigmoid, \"relu\": lambda x: np.nan_to_num(d_relu(x))}\n", " dZ = dA_prev * func[activation](Z)\n", " # compute dW, db, dA_prev\n", " # dZ2 is (1, m) and A1 is (3, m), so dW1 is (1, 3)\n", " # dZ1 is (3, m) and X is (2, m), so dW0 is (3, 2)\n", " dW = (1 / dZ.shape[1]) * np.dot(dZ, A_prev.T)\n", " # dZ2 is (1, m) so db2 is (1, 1)\n", " # dZ1 is (3, m) so db0 is (1, 3)\n", " db = (1 / dZ.shape[1]) * np.sum(dZ, axis=1, keepdims=True)\n", " # W1 is (1, 3) and dZ2 is (1, m), so dA1 is (1, m)\n", " # W0 is (3, 2) and dZ1 is (3, m), so dX is (3, m) (not used though)\n", " dA_next = np.dot(W.T, dZ)\n", " return dA_next, dW, db\n", "\n", "\n", "def backward_prop(Y_hat, caches):\n", " dA_prev = init_backward_prop(Y_hat)\n", " grads = params.copy()\n", " for layer in range(len(caches), 0, -1):\n", " dA, dW, db = step_backward_prop(dA_prev, caches[layer - 1])\n", " grads[\"W\" + str(layer)] = dW\n", " grads[\"b\" + str(layer)] = db\n", " dA_prev = dA\n", " return grads\n", "\n", "\n", "grads = backward_prop(Y_hat, caches)\n", "grads" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point it's the usual gradient descent algorithm.\n", "\n", "Let's run it 5 times (each with up to 500 iterations of gradient descent) and pick the run with the lowest training loss." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "runs = []\n", "for _ in range(5):\n", " history = \"\"\n", " params = init_nn_params([X.shape[0], 3, Y.shape[0]])\n", " Y_hat, caches = forward_prop(X, params)\n", " loss = compute_log_loss(Y, Y_hat)\n", " history += f\"Iter 0 - Log loss={loss:.6f}\\n\"\n", " for i in range(1, 500 + 1):\n", " grads = backward_prop(Y_hat, caches)\n", " params = update_params(params, grads, learning_rate=0.5)\n", " Y_hat, caches = forward_prop(X, params)\n", " loss_new = compute_log_loss(Y, Y_hat)\n", " if loss - loss_new <= 1e-4:\n", " history += f\"Iter {i} - Log loss={loss:.6f}\\n\"\n", " history += f\"The algorithm has converged\\n\"\n", " break\n", " loss = loss_new\n", " if i % 100 == 0:\n", " history += f\"Iter {i} - Log loss={loss:.6f}\\n\"\n", " runs.append((loss, params, history))\n", "\n", "runs.sort(key=lambda x: x[0])\n", "params = runs[0][1]\n", "print(runs[0][2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "XX = np.c_[xx1.flatten(), xx2.flatten()].T\n", "Y_hat, _ = forward_prop(XX, params)\n", "\n", "final_model_plane = go.Surface(\n", " z=Y_hat.reshape(xx1.shape),\n", " x=xx1,\n", " y=xx2,\n", " colorscale=[[0, \"#ff7f0e\"], [1, \"#1f77b4\"]],\n", " showscale=False,\n", " opacity=0.5,\n", " name=\"final params\",\n", ")\n", "fig.data[2].visible = False\n", "fig.add_trace(final_model_plane)\n", "fig.update_layout(\n", " title=\"Neural Network model\",\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Newton-Raphson method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Newton-Raphson method with one variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> 🔑 Newton-Raphson method is an algorithm to find the zero of a function\n", "\n", "It can be used to optimize a function $f(x)$ by finding the zero of $f'(x)$.\n", "\n", "The method works by starting from a random point $x_0$ on $f'(x)$, finding the tangent at $f'(x_0)$ and finding where the tangent crosses the x-axis.\n", "\n", "The idea is that this new point is closer than $x_0$ to a minimum or maximum." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fixup_animation_js(html_animation):\n", " html_animation = html_animation.replace(\n", " '
',\n", " '
',\n", " )\n", " animation_id = re.findall(r\"onclick=\\\"(.*)\\.\", html_animation)[0]\n", " img_id = re.findall(r\"\n", "setupAnimationIntersectionObserver('{animation_id}', '{img_id}');\n", "\n", "\"\"\"\n", " return html_animation\n", "\n", "\n", "def _update(frame):\n", " global fscat, dscat, dtang\n", " fscat.remove()\n", " dscat.remove()\n", " dtang.remove()\n", " fscat = ax.scatter(xs[frame], f(xs[frame]), color=\"tab:blue\")\n", " dscat = ax.scatter(xs[frame], df(xs[frame]), color=\"tab:orange\")\n", " (dtang,) = ax.plot(\n", " xx,\n", " ddf(xs[frame]) * (xx - xs[frame]) + df(xs[frame]),\n", " \"--\",\n", " color=\"tab:orange\",\n", " alpha=0.5,\n", " )\n", "\n", "\n", "x = sp.symbols(\"x\")\n", "f = sp.exp(x) - sp.log(x)\n", "df = sp.lambdify(x, sp.diff(f, x, 1))\n", "ddf = sp.lambdify(x, sp.diff(f, x, 2))\n", "f = sp.lambdify(x, f)\n", "\n", "xs = [1.75]\n", "for _ in range(4):\n", " xs.append(xs[-1] - df(xs[-1]) / ddf(xs[-1]))\n", "\n", "fig, ax = plt.subplots()\n", "xx = np.linspace(1e-3, 2)\n", "ax.plot(xx, f(xx), color=\"tab:blue\", label=\"$f(x)$\")\n", "ax.plot(xx, df(xx), color=\"tab:orange\", label=\"$f'(x)$\")\n", "fscat = ax.scatter([], [])\n", "dscat = ax.scatter([], [])\n", "(dtang,) = ax.plot([], [])\n", "ax.set_ylim(-2, 11)\n", "ax.set_xlim(0, 2)\n", "ax.spines[[\"left\", \"bottom\"]].set_position((\"data\", 0))\n", "ax.text(1, 0, \"$x$\", transform=ax.get_yaxis_transform())\n", "ax.text(0, 1.02, \"$f(x)$\", transform=ax.get_xaxis_transform())\n", "plt.legend(loc=\"upper right\")\n", "plt.title(\"Minimazation of $f(x)$ using Newton's method\")\n", "\n", "ani = FuncAnimation(fig=fig, func=_update, frames=4, interval=1000)\n", "html_animation = ani.to_jshtml(default_mode=\"loop\")\n", "if \"runtime\" not in get_ipython().config.IPKernelApp.connection_file:\n", " html_animation = fixup_animation_js(html_animation)\n", "display(HTML(html_animation))\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The equation of a line that passes for $\\left(x_0, f(x_0)\\right)$ is the point-slope form\n", "\n", "$f(x) - f(x_0) = m(x-x_0)$\n", "\n", "$f(x) = m(x - x_0) + f(x_0)$\n", "\n", "We substitute $f'(x)$ for $f(x)$, $f'(x_0)$ for $f(x_0)$, and the derivative of $f'(x)$ at $x_0$ (ie the second derivative $f''(x_0)$) for $m$ and we obtain\n", "\n", "$f'(x) = f''(x_0)(x - x_0) + f'(x_0)$\n", "\n", "Now we want to find the $x$ such that $f'(x) = 0$, that is the point where the tangent crosses the x-axis.\n", "\n", "$0 = f''(x_0)(x - x_0) + f'(x_0)$\n", "\n", "$-f''(x_0)(x - x_0) = f'(x_0)$\n", "\n", "$-(x - x_0) = \\cfrac{f'(x_0)}{f''(x_0)}$\n", "\n", "$-x + x_0 = \\cfrac{f'(x_0)}{f''(x_0)}$\n", "\n", "$-x = - x_0 + \\cfrac{f'(x_0)}{f''(x_0)}$\n", "\n", "$x = x_0 - \\cfrac{f'(x_0)}{f''(x_0)}$\n", "\n", "The pseudo-code might be looking something along these lines\n", "\n", "$\\begin{array}{l}\n", "\\textbf{Algorithm: } \\text{Newton's Method for Univariate Optimization} \\\\\n", "\\textbf{Input: } \\text{function to minimize } \\mathcal{f}, \\text{initial guess } x_0, \\text{number of iterations } K, \\text{convergence } \\epsilon \\\\\n", "\\textbf{Output: } x \\\\\n", "1 : x \\gets x_0 \\\\\n", "2 : \\text{for k = 1 to K do} \\\\\n", "3 : \\quad x_{new} = x - f'(x) / f''(x) \\\\\n", "4 : \\quad \\text{if } \\|f'(x_{new}) / f''(x_{new})\\| < \\epsilon \\text{ then} \\\\\n", "5 : \\quad \\quad \\text{return } x \\\\\n", "6 : \\quad \\text {end if} \\\\\n", "7 : \\quad x \\gets x_{new} \\\\\n", "8 : \\text{return } x \\\\\n", "\\end{array}$\n", "\n", "Another way to find $x$ such that $f'(x) = 0$ is through the rise over run equation.\n", "\n", "$f''(x_0) = \\cfrac{f'(x_0)}{x_0 - x}$\n", "\n", "$\\cfrac{1}{f''(x_0)} = \\cfrac{x_0 - x}{f'(x_0)}$\n", "\n", "$\\cfrac{f'(x_0)}{f''(x_0)} = x_0 - x$\n", "\n", "$\\cfrac{f'(x_0)}{f''(x_0)} - x_0 = - x$\n", "\n", "$x_0 - \\cfrac{f'(x_0)}{f''(x_0)} = x$\n", "\n", "A third way to find $x$ such that $f'(x) = 0$ is through expressing $f(x)$ as a quadratic approximation (Taylor series truncated at the second order).\n", "\n", "A Taylor series of $f(x)$ that is differentiable at $x_0$ is\n", "\n", "$f(x)=\\sum_{n=0}^\\infty\\cfrac{f^{(n)}(x_0)(x-x_0)^n}{n!}$\n", "\n", "Before proceeding let's take a look at what a Taylor series is." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def _update(order):\n", " global taylor\n", " taylor.remove()\n", " series = np.zeros((order, 50))\n", " for n in range(1, order + 1):\n", " series[n - 1] = (\n", " (1 / math.factorial(n))\n", " * sp.diff(expr, x, n).evalf(subs={x: 0})\n", " * (xx**n)\n", " )\n", " (taylor,) = ax.plot(xx, np.sum(series, axis=0), color=\"tab:orange\")\n", "\n", "\n", "x = sp.symbols(\"x\")\n", "expr = sp.sin(x)\n", "sin = sp.lambdify(x, expr, \"numpy\")\n", "xx = np.linspace(-10, 10)\n", "fig, ax = plt.subplots()\n", "ax.plot(xx, sin(xx), color=\"tab:blue\", label=\"$\\sin(x)$\")\n", "(taylor,) = ax.plot([], [], color=\"tab:orange\", label=\"Taylor series\")\n", "ax.set_aspect(\"equal\")\n", "ax.set_ylim(-10, 10)\n", "ax.set_xlim(-10, 10)\n", "plt.legend(loc=\"upper right\")\n", "plt.title(\"Taylor series of order 1, 3, ..., 25 of sine function\")\n", "\n", "\n", "ani = FuncAnimation(fig=fig, func=_update, frames=range(1, 27, 2))\n", "html_animation = ani.to_jshtml(default_mode=\"loop\")\n", "if \"runtime\" not in get_ipython().config.IPKernelApp.connection_file:\n", " html_animation = fixup_animation_js(html_animation)\n", "display(HTML(html_animation))\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> 🔑 The Taylor series of a function is an infinite sum of terms (polynomials) that are expressed in terms of the function derivatives at a point.\n", "\n", "The quadratic approximation of $f(x)$ (Taylor series truncated at the second degree polynomial) is\n", "\n", "$f(x) \\approx f(x_0) + f'(x_0)(x-x_0) + \\cfrac{1}{2}f''(x_0)(x-x_0)^2$\n", "\n", "To find $x$ such that $f'(x) = 0$, we set the derivative of the quadratic approximation to zero.\n", "\n", "$\\cfrac{d}{dx}f(x_0) + f'(x_0)(x-x_0) + \\cfrac{1}{2}f''(x_0)(x-x_0)^2 = 0$\n", "\n", "$f'(x_0) + f''(x_0)(x-x_0) = 0$\n", "\n", "$f''(x_0)(x-x_0) = -f'(x_0)$\n", "\n", "$x-x_0 = -\\cfrac{f'(x_0)}{f''(x_0)}$\n", "\n", "$x = x_0 -\\cfrac{f'(x_0)}{f''(x_0)}$\n", "\n", "This derivation of the Newton's method update formula from the quadratic approximation of a function has a few important implications:\n", "\n", "- At each step, Newton's method finds the minimum of the quadratic approximation of $f(x)$\n", "\n", "- When $f(x)$ is indeed quadratic, Newton's method will find the minimum or maximum in one single step\n", "\n", "- When $f(x)$ is not quadratic, Newton's method will exhibit quadratic convergence near the minimum or maximum. Quadratic convergence refers to the rate at which the sequence ${x_k}$ approaches the minimum or maximum. The \"error\" will decrease quadratically at each step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def _update(frame):\n", " global fscat, taylor, taylor_min\n", " fscat.remove()\n", " taylor.remove()\n", " taylor_min.remove()\n", " fscat = ax.scatter(xs[frame], f(xs[frame]), color=\"tab:blue\")\n", " (taylor,) = ax.plot(\n", " xx,\n", " f(xs[frame])\n", " + df(xs[frame]) * (xx - xs[frame])\n", " + (1 / 2) * ddf(xs[frame]) * (xx - xs[frame]) ** 2,\n", " \"--\",\n", " color=\"tab:orange\",\n", " alpha=0.5,\n", " label=r\"$f(x_0) + f'(x_0)(x-x_0) + \\dfrac{1}{2}f''(x_0)(x-x_0)^2$\",\n", " )\n", " taylor_min = ax.scatter(\n", " xs[frame + 1],\n", " f(xs[frame])\n", " + df(xs[frame]) * (xs[frame + 1] - xs[frame])\n", " + (1 / 2) * ddf(xs[frame]) * (xs[frame + 1] - xs[frame]) ** 2,\n", " color=\"tab:orange\",\n", " )\n", "\n", "\n", "x = sp.symbols(\"x\")\n", "f = sp.exp(x) - sp.log(x)\n", "df = sp.lambdify(x, sp.diff(f, x, 1))\n", "ddf = sp.lambdify(x, sp.diff(f, x, 2))\n", "f = sp.lambdify(x, f)\n", "\n", "xs = [1.75]\n", "for _ in range(4):\n", " xs.append(xs[-1] - df(xs[-1]) / ddf(xs[-1]))\n", "\n", "fig, ax = plt.subplots()\n", "xx = np.linspace(1e-3, 2)\n", "ax.plot(xx, f(xx), color=\"tab:blue\", label=\"$f(x)$\")\n", "fscat = ax.scatter([], [])\n", "(taylor,) = ax.plot(\n", " [],\n", " [],\n", " color=\"tab:orange\",\n", " alpha=0.5,\n", " label=r\"$f(x_0) + f'(x_0)(x-x_0) + \\dfrac{1}{2}f''(x_0)(x-x_0)^2$\",\n", ")\n", "taylor_min = ax.scatter([], [])\n", "ax.set_ylim(-2, 11)\n", "ax.set_xlim(0, 2)\n", "ax.spines[[\"left\", \"bottom\"]].set_position((\"data\", 0))\n", "ax.text(1, 0, \"$x$\", transform=ax.get_yaxis_transform())\n", "ax.text(0, 1.02, \"$f(x)$\", transform=ax.get_xaxis_transform())\n", "plt.legend(loc=\"upper right\")\n", "plt.title(\n", " \"At each step, Newton's method finds\\nthe minimum of the quadratic approximation of $f(x)$\"\n", ")\n", "\n", "ani = FuncAnimation(fig=fig, func=_update, frames=4, interval=1000)\n", "html_animation = ani.to_jshtml(default_mode=\"loop\")\n", "if \"runtime\" not in get_ipython().config.IPKernelApp.connection_file:\n", " html_animation = fixup_animation_js(html_animation)\n", "display(HTML(html_animation))\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's inspect the algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def univariate_newtons_method(f, symbols, initial, steps):\n", " x = symbols\n", " df = sp.lambdify(x, sp.diff(f, x, 1))\n", " ddf = sp.lambdify(x, sp.diff(f, x, 2))\n", " p = np.zeros((steps, 1))\n", " g = np.zeros((steps, 1))\n", " h = np.zeros((steps, 1))\n", " step_vector = np.zeros((steps, 1))\n", " p[0] = initial\n", " g[0] = df(p[0])\n", " h[0] = ddf(p[0])\n", " step_vector[0] = df(p[0]) / ddf(p[0])\n", " for i in range(1, steps):\n", " p[i] = p[i - 1] - step_vector[i - 1]\n", " g[i] = df(p[i])\n", " h[i] = ddf(p[i])\n", " step_vector[i] = df(p[i]) / ddf(p[i])\n", " if np.linalg.norm(step_vector[i]) < 1e-4:\n", " break\n", " return p[: i + 1], g[: i + 1], h[: i + 1], step_vector[: i + 1]\n", "\n", "\n", "x = sp.symbols(\"x\")\n", "f = sp.exp(x) - sp.log(x)\n", "p, _, _, _ = univariate_newtons_method(f, symbols=x, initial=1.75, steps=10)\n", "display(\n", " Math(\n", " \"\\\\arg \\\\min \\\\limits_{x}\" + sp.latex(f) + f\"={p[-1].squeeze():.4f}\"\n", " )\n", ")\n", "print(f\"Newton's method converged after {len(p)-1} steps\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consider the quadratic function $f(x) = x^2$ which has a minimum at 0.\n", "\n", "Based on what we said above we expect Newton's method to find this minimum in one step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = sp.symbols(\"x\")\n", "f = x**2\n", "p, _, _, _ = univariate_newtons_method(f, symbols=x, initial=1.75, steps=10)\n", "display(\n", " Math(\n", " \"\\\\arg \\\\min \\\\limits_{x}\" + sp.latex(f) + f\"={p[-1].squeeze():.4f}\"\n", " )\n", ")\n", "print(f\"Newton's method converged after {len(p)-1} steps\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consider the quadratic function $f(x) = -x^2$ which has a maximum at 0.\n", "\n", "Based on what we said above we expect Newton's method to find this maximum in one step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = sp.symbols(\"x\")\n", "f = -(x**2)\n", "p, _, _, _ = univariate_newtons_method(f, symbols=x, initial=1.75, steps=10)\n", "display(\n", " Math(\n", " \"\\\\arg \\\\min \\\\limits_{x}\"\n", " + sp.latex(f)\n", " + f\"={p[-1].squeeze():.4f}\"\n", " )\n", ")\n", "print(f\"Newton's method converged after {len(p)-1} steps\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Second derivatives\n", "\n", "In the previous section we introduced second derivatives without too much explanation of what they are.\n", "\n", "After all, analytically, it's just the derivative of a derivative.\n", "\n", "But what do they represent and what they're used for?\n", "\n", "Consider this analogy.\n", "\n", "> 🔑 Let $f(t)$ be the distance, then the first derivative is velocity (instantaneous rate of change of **distance** wrt time), then the second derivative is acceleration (instantaneous rate of change of **velocity** wrt time)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def _update(raw_frame):\n", " global distance_point, velocity_point, acceleration_point\n", " distance_point.remove()\n", " velocity_point.remove()\n", " acceleration_point.remove()\n", " frame = int(frame_func(raw_frame))\n", " xxf = xx[: frame + 1]\n", " distance_line = ax.plot(\n", " xxf, sp.lambdify(x, norm_pdf, \"numpy\")(xxf), c=\"tab:blue\"\n", " )\n", " velocity_line = ax.plot(\n", " xxf,\n", " sp.lambdify(x, sp.diff(norm_pdf, x, 1), \"numpy\")(xxf),\n", " c=\"tab:orange\",\n", " )\n", " acceleration_line = ax.plot(\n", " xxf,\n", " sp.lambdify(x, sp.diff(norm_pdf, x, 2), \"numpy\")(xxf),\n", " c=\"tab:green\",\n", " )\n", " distance_point = ax.scatter(\n", " xx[frame], sp.lambdify(x, norm_pdf, \"numpy\")(xx[frame]), color=\"tab:blue\"\n", " )\n", " velocity_point = ax.scatter(\n", " xx[frame],\n", " sp.lambdify(x, sp.diff(norm_pdf, x, 1), \"numpy\")(xx[frame]),\n", " color=\"tab:orange\",\n", " )\n", " acceleration_point = ax.scatter(\n", " xx[frame],\n", " sp.lambdify(x, sp.diff(norm_pdf, x, 2), \"numpy\")(xx[frame]),\n", " color=\"tab:green\",\n", " )\n", "\n", "\n", "x = sp.symbols(\"x\")\n", "norm_pdf = (1 / sp.sqrt(2 * np.pi)) * sp.exp(-(x**2) / 2)\n", "xx = np.linspace(-3, 3, 100)\n", "fig, ax = plt.subplots()\n", "distance_line = ax.plot([], [], c=\"tab:blue\", label=\"distance\")\n", "velocity_line = ax.plot([], [], c=\"tab:orange\", label=\"velocity\")\n", "acceleration_line = ax.plot([], [], c=\"tab:green\", label=\"acceleration\")\n", "distance_point = ax.scatter([], [])\n", "velocity_point = ax.scatter([], [])\n", "acceleration_point = ax.scatter([], [])\n", "ax.spines[\"bottom\"].set_position((\"data\", 0))\n", "plt.xlim(-3, 3)\n", "plt.ylim(-0.5, 0.5)\n", "plt.legend()\n", "plt.title(\"Distance, velocity and acceleration analogy\")\n", "\n", "# Create velocity as a function of frame\n", "# Takes values between 0 and 100 as arguments and returns the velocity-adjusted frame value.\n", "# For example between 0 and 20 velocity is slow\n", "# Frames instead of going 0, 1, 2 will go 0, 0, 1 etc.\n", "velocity_func = sp.lambdify(x, sp.diff(norm_pdf, x, 1), \"numpy\")\n", "velocity_magnitudes = np.abs(velocity_func(xx))\n", "normalized_magnitudes = (\n", " velocity_magnitudes / np.sum(velocity_magnitudes) * (len(xx) - 1)\n", ")\n", "adjusted_frame_positions = np.cumsum(normalized_magnitudes)\n", "frame_func = interp1d(\n", " np.arange(len(xx)),\n", " adjusted_frame_positions,\n", " kind=\"linear\",\n", " fill_value=\"extrapolate\",\n", ")\n", "\n", "ani = FuncAnimation(fig=fig, func=_update, frames=len(xx))\n", "html_animation = ani.to_jshtml(default_mode=\"loop\")\n", "if \"runtime\" not in get_ipython().config.IPKernelApp.connection_file:\n", " html_animation = fixup_animation_js(html_animation)\n", "display(HTML(html_animation))\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> 🔑 The second derivative tells us about the curvature of a function, or in other words if a function is concave up or concave down.\n", "\n", "When the second derivative is **positive** the function is **concave up**.\n", "\n", "When the second derivative is **negative** the function is **concave down**.\n", "\n", "When the second derivative is **0** the curvature of the function is **undetermined**.\n", "\n", "This information in combination with the first derivative tells us about our position relative to a minimum or maximum.\n", "\n", "$f'(x) > 0$ and $f''(x) > 0$: we're on the right hand side of a concave up function (minimum to our left) (like between -3 and -1 on the graph above).\n", "\n", "$f'(x) > 0$ and $f''(x) < 0$: we're on the left hand side of a concave down function (maximum to our right) (like between -1 and 0 on the graph above).\n", "\n", "$f'(x) < 0$ and $f''(x) < 0$: we're on the right hand side of a concave down function (maximum to our left) (like between 0 and 1 on the graph above).\n", "\n", "$f'(x) < 0$ and $f''(x) > 0$: we're on the left hand side of a concave up function (minimum to our right) (like between 1 and 3 on the graph above)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what happens when the (vanilla) Newton's method gets to a point where the second derivative is 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with warnings.catch_warnings():\n", " warnings.filterwarnings(\"error\")\n", " _, _, _, _ = univariate_newtons_method(norm_pdf, symbols=x, initial=1, steps=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Second-order partial derivatives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For functions of more than one variable we have partial derivatives as well as second-order partial derivatives.\n", "\n", "For example, let $f(x, y) = 2x^2+3y^2 - xy$. We have 2 partial derivatives\n", "\n", "$\\cfrac{\\partial f}{\\partial x} = 4x - y$\n", "\n", "$\\cfrac{\\partial f}{\\partial y} = 6y - x$\n", "\n", "and 4 second-order partial derivatives\n", "\n", "$\\partial\\cfrac{\\partial f / \\partial x}{\\partial x} = \\cfrac{\\partial^2 f}{\\partial x^2} = 4$\n", "\n", "$\\partial\\cfrac{\\partial f / \\partial x}{\\partial y} = \\cfrac{\\partial^2 f}{\\partial x \\partial y} = -1$\n", "\n", "$\\partial\\cfrac{\\partial f / \\partial y}{\\partial y} = \\cfrac{\\partial^2 f}{\\partial y^2} = 6$\n", "\n", "$\\partial\\cfrac{\\partial f / \\partial y}{\\partial x} = \\cfrac{\\partial^2 f}{\\partial y \\partial x} = -1$\n", "\n", "The fact that $\\cfrac{\\partial^2 f}{\\partial x \\partial y} = \\cfrac{\\partial^2 f}{\\partial y \\partial x}$ is not a coincidence but rather the symmetry property of partial second-order derivatives.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hessian matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we can summarize partial derivatives into a gradient vector $\\nabla_f = \\langle \\cfrac{\\partial f}{\\partial x}, \\cfrac{\\partial f}{\\partial y} \\rangle$, it seems almost obvious that we can summarize second-order partial derivatives into a matrix.\n", "\n", "Such matrix is called Hessian.\n", "\n", "$H_{f(x, y)} = \\begin{bmatrix}\n", "\\cfrac{\\partial^2 f}{\\partial x^2}&&\\cfrac{\\partial^2 f}{\\partial x \\partial y}\\\\\n", "\\cfrac{\\partial^2 f}{\\partial y \\partial x}&&\\cfrac{\\partial^2 f}{\\partial y^2}\n", "\\end{bmatrix}$\n", "\n", "> 🔑 The Hessian matrix is a symmetric matrix containing the second-order partial derivatives of a function\n", "\n", "Continuing the previous example, the Hessian matrix of $f(x, y) = 2x^2+3y^2 - xy$ (at any point over its domain) is\n", "\n", "$H = \\begin{bmatrix}4&&-1\\\\-1&&6\\end{bmatrix}$\n", "\n", "We can also show that the original function can be expressed as \n", "\n", "$q(x, y) = \\cfrac{1}{2}\\begin{bmatrix}x&&y\\end{bmatrix}\\begin{bmatrix}4&&-1\\\\-1&&6\\end{bmatrix}\\begin{bmatrix}x\\\\y\\end{bmatrix}$\n", "\n", "$= \\cfrac{1}{2}\\begin{bmatrix}x&&y\\end{bmatrix}\\begin{bmatrix}4x - y\\\\6y - x\\end{bmatrix}$\n", "\n", "$= \\cfrac{1}{2}(x(4x - y)+y(6y - x))$\n", "\n", "$= \\cfrac{1}{2}(4x^2 - xy + 6y^2 - xy)$\n", "\n", "$= 2x^2 + 3y^2 - xy$\n", "\n", "It turns out that $q(x, y) = x^THx$ creates a concave up function if $H$ is a positive-definite matrix ($x^THx>0$) and a concave down function if H is a negative-definite matrix ($x^THx<0$).\n", "\n", "How do we determine the positive definiteness of $H$?\n", "\n", "With its eigenvalues.\n", "\n", "Recall that an eigenvector is a vector $x$ such that $Hx = \\lambda x$.\n", "\n", "If we multiply both sides by $x^T$ we get $x^THx = \\lambda x^Tx$.\n", "\n", "An eigenvector is usually scaled such that its norm is 1. The scaling doesn't change its status of eigenvector nor the eigenvalue. This because $H(cx) = \\lambda (cx)$ where $c$ is a scaling constant.\n", "\n", "So if $\\|x\\| = x^Tx = 1$, we get that $x^THx = \\lambda$.\n", "\n", "Let's verify this, before proceeding any further." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "H = np.array([[4, -1], [-1, 6]])\n", "eigvals, eigvecs = np.linalg.eig(H)\n", "xTHx = np.dot(eigvecs[:, 0].T, H).dot(eigvecs[:, 0])\n", "lamxTx = eigvals[0] * np.dot(eigvecs[:, 0].T, eigvecs[:, 0].T)\n", "\n", "assert np.isclose(xTHx, lamxTx)\n", "assert np.isclose(xTHx, eigvals[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the eigenvalues of $H$ we can conclude one of the following about the curvature of a function:\n", "- the function at a point is concave up if the the eigenvalues of $H$ are all strictly positive (positive-definite)\n", "- the function at a point is concave down if the the eigenvalues of $H$ are all strictly negative (negative-definite)\n", "- the point of the function is a saddle point if the the eigenvalues of $H$ are both strictly positive and negative (indefinite)\n", "- we can't conclude anything if one of the eigenvalues is zero (positive semi-definite or negative semi-definite.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Netwon-Raphson method with more than one variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall in the univariate case the update of the parameter was $x_{new} = x - f'(x) / f''(x)$.\n", "\n", "Let $p = (x, y)$. In the bivariate case, the update of the parameter is $p_{new} = p - H_f^{-1}(p) \\cdot \\nabla_f(p)$, where $H_f^{-1}(p)$ is 2x2 matrix and $\\nabla_f(p)$ is a 2x1 column vector, so it results that $p_{new}$ is a 2x1 column vector.\n", "\n", "$\\begin{array}{l}\n", "\\textbf{Algorithm: } \\text{Newton's Method for Multivariate Optimization} \\\\\n", "\\textbf{Input: } \\text{function to minimize } \\mathcal{f}, \\text{initial guess } p_0, \\text{number of iterations } K, \\text{convergence } \\epsilon \\\\\n", "\\textbf{Output: } p \\\\\n", "1 : p \\gets p_0 \\\\\n", "2 : \\text{for k = 1 to K do} \\\\\n", "3 : \\quad p_{new} = p - H^{-1}(p)\\nabla f(p) \\\\\n", "4 : \\quad \\text{if } \\|H^{-1}(p)\\nabla f(p)\\| < \\epsilon \\text{ then} \\\\\n", "5 : \\quad \\quad \\text{return } p \\\\\n", "6 : \\quad \\text {end if} \\\\\n", "7 : \\quad p \\gets p_{new} \\\\\n", "8 : \\text{return } p \\\\\n", "\\end{array}$\n", "\n", "Continuing the previous example, the gradient and the Hessian are\n", "\n", "$\\nabla = \\begin{bmatrix}4x-y\\\\-x+6y\\end{bmatrix}$\n", "\n", "$H = \\begin{bmatrix}4&&-1\\\\-1&&6\\end{bmatrix}$\n", "\n", "Let's practice inverting the Hessian using the row-echelon form method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x, y = sp.symbols(\"x, y\")\n", "f = 2 * x**2 + 3 * y**2 - x * y\n", "\n", "H = sp.hessian(f, (x, y))\n", "HI = H.row_join(sp.Matrix.eye(2))\n", "HI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "HI = HI.elementary_row_op(\"n->n+km\", row=1, k=1 / 4, row1=1, row2=0)\n", "HI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "HI = HI.elementary_row_op(\"n->kn\", row=1, k=4 / 23)\n", "HI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "HI = HI.elementary_row_op(\"n->n+km\", row=0, k=1, row1=0, row2=1)\n", "HI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "HI = HI.elementary_row_op(\"n->kn\", row=0, k=1 / 4)\n", "HI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The inverse of the Hessian is" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "HI[-2:, -2:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's verify it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert np.array_equal(\n", " np.array(H.inv(), dtype=\"float\"), np.array(HI[-2:, -2:], dtype=\"float\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's calculate $H_f^{-1} \\cdot \\nabla_f$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "H.inv() @ sp.Matrix([sp.diff(f, x), sp.diff(f, y)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So $\\begin{bmatrix}x_{new}\\\\y_{new}\\end{bmatrix} = \\begin{bmatrix}x\\\\y\\end{bmatrix} - \\begin{bmatrix}x\\\\y\\end{bmatrix}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, regardless of our starting point we get that the minimum is at [0, 0] which is consistent with\n", "\n", "$\\begin{cases}4x-y=0\\\\-x+6y=0\\end{cases}$\n", "\n", "$\\begin{cases}x=\\cfrac{y}{4}\\\\-\\cfrac{y}{4}+6y=0\\end{cases}$\n", "\n", "$\\begin{cases}x=\\cfrac{y}{4}\\\\\\cfrac{23y}{4}=0\\end{cases}$\n", "\n", "$\\begin{cases}x=0\\\\y=0\\end{cases}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's verify we get the same result with the Newton's method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def bivariate_newtons_method(f, symbols, initial, steps):\n", " x, y = symbols\n", " H = sp.hessian(f, (x, y))\n", " hessian_eval = sp.lambdify((x, y), H, \"numpy\")\n", " nabla = sp.Matrix([sp.diff(f, x), sp.diff(f, y)])\n", " nabla_eval = sp.lambdify((x, y), nabla, \"numpy\")\n", " H_dot_nabla = sp.lambdify((x, y), H.inv() @ nabla, \"numpy\")\n", " p = np.zeros((steps, 2))\n", " g = np.zeros((steps, 2))\n", " h = np.zeros((steps, 2, 2))\n", " step_vector = np.zeros((steps, 2))\n", " p[0] = initial\n", " g[0] = nabla_eval(*p[0]).squeeze()\n", " h[0] = hessian_eval(*p[0]).squeeze()\n", " step_vector[0] = H_dot_nabla(*p[0]).squeeze()\n", " for i in range(1, steps):\n", " p[i] = p[i - 1] - step_vector[i - 1]\n", " g[i] = nabla_eval(*p[i]).squeeze()\n", " h[i] = hessian_eval(*p[i]).squeeze()\n", " step_vector[i] = H_dot_nabla(*p[i]).squeeze()\n", " if np.linalg.norm(step_vector[i]) < 1e-4:\n", " break\n", " return p[: i + 1], g[: i + 1], h[: i + 1], step_vector[: i + 1]\n", "\n", "\n", "x, y = sp.symbols(\"x, y\")\n", "f = 2 * x**2 + 3 * y**2 - x * y\n", "p, _, h, _ = bivariate_newtons_method(\n", " f, symbols=(x, y), initial=np.array([1, 3]), steps=10\n", ")\n", "\n", "display(\n", " Math(\n", " \"\\\\arg \\\\min \\\\limits_{x,y}\"\n", " + sp.latex(f)\n", " + \"=\"\n", " + sp.latex(np.array2string(p[-1], precision=2, suppress_small=True))\n", " )\n", ")\n", "print(f\"Newton's method converged after {len(p)-1} steps\")\n", "print(\n", " f\"Eigenvalues of H: {np.array2string(np.linalg.eigvals(h[-1]), precision=2, suppress_small=True)}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the function is quadratic Newton's method only needed one step to find the critical point, which is a minimum because the Hessian at the critical point is positive-definite.\n", "\n", "Let's consider a non-quadratic function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x, y = sp.symbols(\"x, y\")\n", "g = x**4 + 0.8 * y**4 + 4 * x**2 + 2 * y**2 - x * y - 0.2 * x**2 * y\n", "p, _, h, _ = bivariate_newtons_method(\n", " g, symbols=(x, y), initial=np.array([4, 4]), steps=10\n", ")\n", "\n", "display(\n", " Math(\n", " \"\\\\arg \\\\min \\\\limits_{x,y}\"\n", " + sp.latex(g)\n", " + \"=\"\n", " + sp.latex(np.array2string(p[-1], precision=2, suppress_small=True))\n", " )\n", ")\n", "print(f\"Newton's method converged after {len(p)-1} steps\")\n", "print(\n", " f\"Eigenvalues of H: {np.array2string(np.linalg.eigvals(h[-1]), precision=2, suppress_small=True)}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The critical point $[0, 0]$ is a minimum because the Hessian is positive-definite." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }