{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculus - Week 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "import re\n", "import warnings\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import plotly.graph_objects as go\n", "import plotly.io as pio\n", "import sympy as sp\n", "from IPython.core.getipython import get_ipython\n", "from IPython.display import display, HTML, Math\n", "from matplotlib.animation import FuncAnimation\n", "from scipy.interpolate import interp1d\n", "\n", "plt.style.use(\"seaborn-v0_8-whitegrid\")\n", "pio.renderers.default = \"plotly_mimetype+notebook\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimizing neural networks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Single-neuron network with linear activation and Mean Squared Error (MSE) loss function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let the linear model $Z = wX + b$, where\n", "\n", "$X$ is a $(k, m)$ matrix of $k$ features for $m$ samples,\n", "\n", "$w$ is a $(1, k)$ matrix (row vector) containing $k$ weights,\n", "\n", "$b$ is a $(1, 1)$ matrix (scalar) containing 1 bias, such that\n", "\n", "$Z = \\begin{bmatrix}w_1&&w_2&&\\dots w_k\\end{bmatrix} \\begin{bmatrix}x_{11}&&x_{12}&&\\dots&&x_{1m}\\\\x_{21}&&x_{22}&&\\dots&&x_{2m}\\\\\\vdots&&\\vdots&&\\ddots&&\\vdots\\\\x_{k1}&&x_{k2}&&\\dots&&x_{km}\n", "\\end{bmatrix} + \\begin{bmatrix}b\\end{bmatrix}$\n", "\n", "$Z$ is a $(1, m)$ matrix.\n", "\n", "Let $Y$ a $(1, m)$ matrix (row vector) containing the labels of $m$ samples, such that\n", "\n", "$Y = \\begin{bmatrix}y_1&&y_2&&\\dots&&y_m\\end{bmatrix}$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = 40\n", "k = 2\n", "w = np.array([[0.37, 0.57]])\n", "b = np.array([[0.1]])\n", "\n", "rng = np.random.default_rng(4)\n", "X = rng.standard_normal((k, m))\n", "Y = w @ X + b + rng.normal(size=(1, m))\n", "\n", "scatter = 
go.Scatter3d(\n", " z=Y.squeeze(),\n", " x=X[0],\n", " y=X[1],\n", " mode=\"markers\",\n", " marker=dict(color=\"#1f77b4\", size=5),\n", " name=\"data\",\n", ")\n", "\n", "fig = go.Figure(scatter)\n", "fig.update_layout(\n", " title=\"Regression data\",\n", " autosize=False,\n", " width=600,\n", " height=600,\n", " scene_aspectmode=\"cube\",\n", " margin=dict(l=10, r=10, b=10, t=30),\n", " scene=dict(\n", " xaxis=dict(title=\"x1\", range=[-2.5, 2.5]),\n", " yaxis=dict(title=\"x2\", range=[-2.5, 2.5]),\n", " zaxis_title=\"y\",\n", " camera_eye=dict(x=1.2, y=-1.8, z=0.5),\n", " ),\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The predictions $\\hat{Y}$ are the result of passing $Z$ to a linear activation function, so that $\\hat{Y} = I(Z)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def linear(x, m):\n", " return x * m\n", "\n", "\n", "def init_neuron_params(k):\n", " w = rng.uniform(size=(1, k)) * 0.5\n", " b = np.zeros((1, 1))\n", " return {\"w\": w, \"b\": b}\n", "\n", "\n", "xx1, xx2 = np.meshgrid(np.linspace(-2.5, 2.5, 100), np.linspace(-2.5, 2.5, 100))\n", "w, b = init_neuron_params(k).values()\n", "random_model_plane = go.Surface(\n", " z=linear(w[0, 0] * xx1 + w[0, 1] * xx2 + b, m=1),\n", " x=xx1,\n", " y=xx2,\n", " colorscale=[[0, \"#FF8920\"], [1, \"#FF8920\"]],\n", " showscale=False,\n", " opacity=0.5,\n", " name=\"init params\",\n", ")\n", "fig.add_trace(random_model_plane)\n", "fig.update_layout(title=\"Random model\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a single sample the squared loss is\n", "\n", "$\\ell(w, b, y_i) = (y_i - \\hat{y}_i)^2 = (y_i - wx_i - b)^2$\n", "\n", "For the whole sample the mean squared loss is \n", "\n", "$\\mathcal{L}(w, b, Y) = \\cfrac{1}{m}\\sum \\limits_{i=1}^{m} \\ell(w, b, y_i) = \\cfrac{1}{m}\\sum \\limits_{i=1}^{m} (y_i - wx_i - b)^2$\n", "\n", "So that we don't have a lingering 2 in 
the partial derivatives, we can rescale it by 0.5\n", "\n", "$\\mathcal{L}(w, b, Y) = \\cfrac{1}{2m}\\sum \\limits_{i=1}^{m} (y_i - wx_i - b)^2$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def neuron_output(X, params, activation, *args):\n", "    w = params[\"w\"]\n", "    b = params[\"b\"]\n", "    Z = w @ X + b\n", "    Y_hat = activation(Z, *args)\n", "    return Y_hat\n", "\n", "\n", "def compute_mse_loss(Y, Y_hat):\n", "    m = Y_hat.shape[1]\n", "    return np.sum((Y - Y_hat) ** 2) / (2 * m)\n", "\n", "\n", "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, linear, 1)\n", "print(f\"MSE loss of random model: {compute_mse_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The partial derivatives of the loss function are\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial w}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial b}$\n", "\n", "Let's calculate $\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}$, $\\cfrac{\\partial \\hat{Y}}{\\partial w}$ and $\\cfrac{\\partial \\hat{Y}}{\\partial b}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}} = \\cfrac{\\partial}{\\partial \\hat{Y}} \\cfrac{1}{2m}\\sum \\limits_{i=1}^{m} (Y - \\hat{Y})^2 = \\cfrac{1}{2m}\\sum \\limits_{i=1}^{m} 2(Y - \\hat{Y})(- 1) = -\\cfrac{1}{m}\\sum \\limits_{i=1}^{m}(Y - \\hat{Y})$\n", "\n", "$\\cfrac{\\partial \\hat{Y}}{\\partial w} = \\cfrac{\\partial}{\\partial w} (wX + b) = X$\n", "\n", "$\\cfrac{\\partial \\hat{Y}}{\\partial b} = \\cfrac{\\partial}{\\partial b} (wX + b) = 1$\n", "\n", "Let's put it all together\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = -\\cfrac{1}{m}\\sum \\limits_{i=1}^{m} (Y - \\hat{Y}) \\cdot X^T$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = -\\cfrac{1}{m}\\sum 
\\limits_{i=1}^{m} (Y - \\hat{Y})$\n", "\n", "> 🔑 $\\cfrac{\\partial \\mathcal{L}}{\\partial w}$ contains the partial derivatives with respect to each of the $k$ elements of $w$; it's a $(1, k)$ matrix, because the product is between a matrix of shape $(1, m)$ and $X^T$, which is $(m, k)$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compute_grads(X, Y, Y_hat):\n", "    m = Y_hat.shape[1]\n", "    dw = -1 / m * np.dot(Y - Y_hat, X.T)  # (1, k)\n", "    db = -1 / m * np.sum(Y - Y_hat, axis=1, keepdims=True)  # (1, 1)\n", "    return {\"w\": dw, \"b\": db}\n", "\n", "\n", "def update_params(params, grads, learning_rate=0.1):\n", "    params = params.copy()\n", "    for k in grads.keys():\n", "        params[k] = params[k] - learning_rate * grads[k]\n", "    return params\n", "\n", "\n", "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, linear, 1)\n", "print(f\"MSE loss before update: {compute_mse_loss(Y, Y_hat):.2f}\")\n", "grads = compute_grads(X, Y, Y_hat)\n", "params = update_params(params, grads)\n", "Y_hat = neuron_output(X, params, linear, 1)\n", "print(f\"MSE loss after update: {compute_mse_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find the best parameters with gradient descent." 
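, "\n", "\n", "Since every gradient descent step depends on the gradients returned by `compute_grads`, it can be worth checking the formulas numerically first. The snippet below is a minimal sketch, not part of the original notebook: it uses hypothetical toy data and `_check` names (so nothing defined earlier is overwritten) and compares the analytic gradients of the rescaled MSE loss with central finite differences.

```python
import numpy as np

# Hypothetical toy data with the notebook layout: X is (k, m), Y is (1, m).
rng_check = np.random.default_rng(0)
k_check, m_check = 2, 40
X_check = rng_check.standard_normal((k_check, m_check))
Y_check = rng_check.standard_normal((1, m_check))
w_check = rng_check.uniform(size=(1, k_check))
b_check = np.zeros((1, 1))


def half_mse(w, b):
    # Rescaled MSE loss: sum of squared errors divided by 2m.
    Y_hat = w @ X_check + b
    return np.sum((Y_check - Y_hat) ** 2) / (2 * m_check)


# Analytic gradients, following the formulas derived for the MSE loss.
residual = Y_check - (w_check @ X_check + b_check)
dw = -residual @ X_check.T / m_check  # shape (1, k)
db = -np.sum(residual, axis=1, keepdims=True) / m_check  # shape (1, 1)

# Central finite differences, perturbing one parameter at a time.
eps = 1e-6
dw_num = np.zeros_like(w_check)
for j in range(k_check):
    step = np.zeros_like(w_check)
    step[0, j] = eps
    loss_plus = half_mse(w_check + step, b_check)
    loss_minus = half_mse(w_check - step, b_check)
    dw_num[0, j] = (loss_plus - loss_minus) / (2 * eps)
db_num = (half_mse(w_check, b_check + eps) - half_mse(w_check, b_check - eps)) / (2 * eps)

print(dw.shape, db.shape)
print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db.item(), db_num, atol=1e-6))
```

If the derivation holds, both comparisons should report agreement, and `dw` should have one entry per weight."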
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, linear, 1)\n", "loss = compute_mse_loss(Y, Y_hat)\n", "print(f\"Iter 0 - MSE loss={loss:.6f}\")\n", "for i in range(1, 50 + 1):\n", "    grads = compute_grads(X, Y, Y_hat)\n", "    params = update_params(params, grads)\n", "    Y_hat = neuron_output(X, params, linear, 1)\n", "    loss_new = compute_mse_loss(Y, Y_hat)\n", "    if loss - loss_new <= 1e-4:\n", "        print(f\"Iter {i} - MSE loss={loss:.6f}\")\n", "        print(\"The algorithm has converged\")\n", "        break\n", "    loss = loss_new\n", "    if i % 5 == 0:\n", "        print(f\"Iter {i} - MSE loss={loss:.6f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the final model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w, b = params.values()\n", "final_model_plane = go.Surface(\n", "    z=w[0, 0] * xx1 + w[0, 1] * xx2 + b,\n", "    x=xx1,\n", "    y=xx2,\n", "    colorscale=[[0, \"#2ca02c\"], [1, \"#2ca02c\"]],\n", "    showscale=False,\n", "    opacity=0.5,\n", "    name=\"final params\",\n", ")\n", "fig.add_trace(final_model_plane)\n", "fig.data[1].visible = False\n", "fig.update_layout(title=\"Final model\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Single-neuron network with sigmoid activation and Log loss function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, let the linear model be $Z = wX + b$ and let $Y$ be a row vector containing the labels of $m$ samples." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = 40\n", "k = 2\n", "neg_centroid = [-1, -1]\n", "pos_centroid = [1, 1]\n", "\n", "rng = np.random.default_rng(1)\n", "X = np.r_[\n", " rng.standard_normal((m // 2, k)) + neg_centroid,\n", " rng.standard_normal((m // 2, k)) + pos_centroid,\n", "].T\n", "Y = np.array([[0] * (m // 2) + [1] * (m // 2)])\n", "\n", "plt.scatter(X[0], X[1], c=np.where(Y.squeeze() == 0, \"tab:orange\", \"tab:blue\"))\n", "plt.gca().set_aspect(\"equal\")\n", "plt.xlabel(\"$x_1$\")\n", "plt.ylabel(\"$x_2$\")\n", "plt.xlim(-5, 5)\n", "plt.ylim(-5, 5)\n", "plt.title(\"Classification data\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, though, the predictions $\\hat{Y}$ are the result of passing $Z$ to a sigmoid function, so that $\\hat{Y} = \\sigma(Z)$.\n", "\n", "$\\sigma(Z) = \\cfrac{1}{1+e^{-Z}}$\n", "\n", "To visualize the sigmoid function let's add another axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neg_scatter = go.Scatter3d(\n", " z=np.full(int(m / 2), 0),\n", " x=X[0, : int(m / 2)],\n", " y=X[1, : int(m / 2)],\n", " mode=\"markers\",\n", " marker=dict(color=\"#ff7f0e\", size=5),\n", " name=\"negative class\",\n", ")\n", "pos_scatter = go.Scatter3d(\n", " z=np.full(int(m / 2), 1),\n", " x=X[0, int(m / 2) :],\n", " y=X[1, int(m / 2) :],\n", " mode=\"markers\",\n", " marker=dict(color=\"#1f77b4\", size=5),\n", " name=\"positive class\",\n", ")\n", "\n", "fig = go.Figure([pos_scatter, neg_scatter])\n", "fig.update_layout(\n", " title=\"Classification data\",\n", " autosize=False,\n", " width=600,\n", " height=600,\n", " margin=dict(l=10, r=10, b=10, t=30),\n", " scene=dict(\n", " xaxis=dict(title=\"x1\", range=[-5, 5]),\n", " yaxis=dict(title=\"x2\", range=[-5, 5]),\n", " zaxis_title=\"y\",\n", " camera_eye=dict(x=0, y=0.3, z=2.5),\n", " camera_up=dict(x=0, y=np.sin(np.pi), z=0),\n", " 
),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's plot the predictions of a randomly initialized model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sigmoid(x):\n", " return 1 / (1 + np.exp(-x))\n", "\n", "\n", "xx1, xx2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))\n", "w, b = init_neuron_params(k).values()\n", "random_model_plane = go.Surface(\n", " z=sigmoid(w[0, 0] * xx1 + w[0, 1] * xx2 + b),\n", " x=xx1,\n", " y=xx2,\n", " colorscale=[[0, \"#ff7f0e\"], [1, \"#1f77b4\"]],\n", " showscale=False,\n", " opacity=0.5,\n", " name=\"init params\",\n", ")\n", "fig.add_trace(random_model_plane)\n", "fig.update_layout(\n", " title=\"Random model\",\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that the output of the sigmoid activation is the probability $p$ of a sample belonging to the positive class, which implies that $(1-p)$ is the probability of a sample belonging to the negative class.\n", "\n", "Intuitively, a loss function will be small when $p_i$ is close to 1.0 and $y_i = 1$ and when $p_i$ is close to 0.0 and $y_i = 0$.\n", "\n", "For a single sample this loss function might look like this\n", "\n", "$\\ell(w, b, y_i) = -p_i^{y_i}(1-p_i)^{1-y_i}$\n", "\n", "For the whole sample the loss would be\n", "\n", "$\\mathcal{L}(w, b, Y) = -\\prod \\limits_{i=1}^{m} p_i^{y_i}(1-p_i)^{1-y_i}$\n", "\n", "In [Univariate optimization (week 1)](ca_w1.html#Univariate-optimization) we've seen how it's easier to calculate the derivative of the logarithm of the PMF of the Binomial Distribution.\n", "\n", "We can do the same here to obtain a more manageable loss function.\n", "\n", "$\\mathcal{L}(w, b, Y) = -\\sum \\limits_{i=1}^{m} y_i \\ln p_i + (1-y_i) \\ln (1-p_i)$\n", "\n", "It turns out it's standard practice to minimize a function and to average the loss over the sample (to manage the scale of the loss for large 
datasets), so we'll use this instead:\n", "\n", "$\\mathcal{L}(w, b, Y) = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} y_i \\ln p_i + (1-y_i) \\ln (1-p_i)$\n", "\n", "Finally, let's substitute back $\\hat{y}_i$ for $p_i$.\n", "\n", "$\\mathcal{L}(w, b, Y) = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} y_i \\ln \\hat{y}_i + (1-y_i) \\ln (1-\\hat{y}_i)$\n", "\n", "Recall that $\\hat{y}_i = \\sigma(z_i)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compute_log_loss(Y, Y_hat):\n", " m = Y_hat.shape[1]\n", " loss = (-1 / m) * (\n", " np.dot(Y, np.log(Y_hat).T) + np.dot((1 - Y), np.log(1 - Y_hat).T)\n", " )\n", " return loss.squeeze()\n", "\n", "\n", "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "print(f\"Log loss of random model: {compute_log_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The partial derivatives of the loss function are\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial w}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\hat{Y}}\\cfrac{\\partial \\hat{Y}}{\\partial b}$\n", "\n", "Because we have an activation function around $Z$, we have another chain rule to apply.\n", "\n", "For the MSE loss, the activation was an identity function so it didn't pose the same challenge.\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}\\cfrac{\\partial \\sigma(Z)}{\\partial w} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}\\cfrac{\\partial \\sigma(Z)}{\\partial Z}\\cfrac{\\partial Z}{\\partial w}$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}\\cfrac{\\partial \\sigma(Z)}{\\partial b} = \\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}\\cfrac{\\partial 
\\sigma(Z)}{\\partial Z}\\cfrac{\\partial Z}{\\partial b}$\n", "\n", "Let's calculate $\\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)}$, $\\cfrac{\\partial \\sigma(Z)}{\\partial Z}$, $\\cfrac{\\partial Z}{\\partial w}$ and $\\cfrac{\\partial Z}{\\partial b}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculation of partial derivative of Log loss wrt sigmoid activation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\cfrac{\\partial \\mathcal{L}}{\\partial \\sigma(Z)} = \\cfrac{\\partial}{\\partial \\sigma(Z)} -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\left(Y\\ln\\sigma(Z)+(1-Y)\\ln(1-\\sigma(Z))\\right)$ \n", "\n", "$= -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\left(\\cfrac{Y}{\\sigma(Z)}+\\cfrac{Y-1}{1-\\sigma(Z)}\\right)$\n", "\n", "$= -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y(1-\\sigma(Z)) + (Y-1)\\sigma(Z)}{\\sigma(Z)(1-\\sigma(Z))}$ \n", "\n", "$= -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y -\\sigma(Z)}{\\sigma(Z)(1-\\sigma(Z))}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculation of the derivative of the sigmoid function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\cfrac{\\partial \\sigma(Z)}{\\partial Z} = \\cfrac{\\partial}{\\partial Z} \\cfrac{1}{1+e^{-Z}}$\n", "\n", "$= \\cfrac{\\partial}{\\partial Z} (1+e^{-Z})^{-1}$ \n", "\n", "$= -(1+e^{-Z})^{-2}(-e^{-Z})$ \n", "\n", "$= (1+e^{-Z})^{-2}e^{-Z}$\n", "\n", "$= \\cfrac{e^{-Z}}{(1+e^{-Z})^2}$ \n", "\n", "$= \\cfrac{1 +e^{-Z} - 1}{(1+e^{-Z})^2}$ \n", "\n", "$= \\cfrac{1 +e^{-Z}}{(1+e^{-Z})^2}-\\cfrac{1}{(1+e^{-Z})^2}$ \n", "\n", "$= \\cfrac{1}{1+e^{-Z}}-\\cfrac{1}{(1+e^{-Z})^2}$\n", "\n", "Because $x-x^2$ factors as $x(1-x)$, we can rewrite this as\n", "\n", "$= \\cfrac{1}{1+e^{-Z}}\\left(1 - \\cfrac{1}{1+e^{-Z}} \\right)$ \n", "\n", "$= \\sigma(Z)(1 - \\sigma(Z))$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The other two partial derivatives are the same as for the MSE loss.\n", "\n", "$\\cfrac{\\partial Z}{\\partial w} = 
\\cfrac{\\partial}{\\partial w} (wX+b) = X$\n", "\n", "$\\cfrac{\\partial Z}{\\partial b} = \\cfrac{\\partial}{\\partial b} (wX+b) = 1$\n", "\n", "Let's put it all together\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial w} = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y -\\sigma(Z)}{\\sigma(Z)(1-\\sigma(Z))} \\sigma(Z)(1 - \\sigma(Z)) X^T = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} (Y -\\sigma(Z))X^T = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} (Y - \\hat{Y}) X^T$\n", "\n", "$\\cfrac{\\partial \\mathcal{L}}{\\partial b} = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} \\cfrac{Y -\\sigma(Z)}{\\sigma(Z)(1-\\sigma(Z))} \\sigma(Z)(1 - \\sigma(Z)) = -\\cfrac{1}{m} \\sum \\limits_{i=1}^{m} (Y - \\hat{Y})$\n", "\n", "> 🔑 The partial derivatives of the Log loss $\\cfrac{\\partial \\mathcal{L}}{\\partial w}$ and $\\cfrac{\\partial \\mathcal{L}}{\\partial b}$ have the same form as those of the MSE loss; the only difference is that $\\hat{Y} = \\sigma(Z)$ instead of $\\hat{Y} = Z$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "print(f\"Log loss before update: {compute_log_loss(Y, Y_hat):.2f}\")\n", "grads = compute_grads(X, Y, Y_hat)\n", "params = update_params(params, grads)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "print(f\"Log loss after update: {compute_log_loss(Y, Y_hat):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find the best parameters using gradient descent." 
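, "\n", "\n", "Before training, the two results this section relies on can be checked numerically: the sigmoid derivative and the form of the Log loss gradients. The snippet below is a minimal sketch, not part of the original notebook: the toy data, the `sigmoid_fn` and `log_loss` helpers and the `_check` names are hypothetical and chosen so that nothing defined earlier is overwritten.

```python
import numpy as np


def sigmoid_fn(z):
    return 1 / (1 + np.exp(-z))


# 1. Sigmoid derivative: compare sigma(z) * (1 - sigma(z)) with central differences.
z = np.linspace(-4, 4, 9)
eps = 1e-6
dsig_analytic = sigmoid_fn(z) * (1 - sigmoid_fn(z))
dsig_numeric = (sigmoid_fn(z + eps) - sigmoid_fn(z - eps)) / (2 * eps)

# 2. Log loss gradients on hypothetical toy data: X is (k, m), Y is (1, m).
rng_check = np.random.default_rng(0)
k_check, m_check = 2, 40
X_check = rng_check.standard_normal((k_check, m_check))
Y_check = (rng_check.uniform(size=(1, m_check)) > 0.5).astype(float)
w_check = rng_check.uniform(size=(1, k_check)) * 0.5
b_check = np.zeros((1, 1))


def log_loss(w, b):
    # Averaged negative log-likelihood of the labels under sigmoid outputs.
    Y_hat = sigmoid_fn(w @ X_check + b)
    return -np.mean(Y_check * np.log(Y_hat) + (1 - Y_check) * np.log(1 - Y_hat))


# Analytic gradients: same expressions as for the MSE loss, with sigmoid outputs.
Y_hat_check = sigmoid_fn(w_check @ X_check + b_check)
dw = -(Y_check - Y_hat_check) @ X_check.T / m_check  # shape (1, k)
db = -np.sum(Y_check - Y_hat_check, axis=1, keepdims=True) / m_check  # shape (1, 1)

# Central finite differences, perturbing one parameter at a time.
dw_num = np.zeros_like(w_check)
for j in range(k_check):
    step = np.zeros_like(w_check)
    step[0, j] = eps
    loss_plus = log_loss(w_check + step, b_check)
    loss_minus = log_loss(w_check - step, b_check)
    dw_num[0, j] = (loss_plus - loss_minus) / (2 * eps)
db_num = (log_loss(w_check, b_check + eps) - log_loss(w_check, b_check - eps)) / (2 * eps)

print(np.allclose(dsig_analytic, dsig_numeric, atol=1e-8))
print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db.item(), db_num, atol=1e-6))
```

Agreement here is what justifies reusing `compute_grads` unchanged for the sigmoid neuron."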
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "loss = compute_log_loss(Y, Y_hat)\n", "print(f\"Iter 0 - Log loss={loss:.6f}\")\n", "for i in range(1, 500 + 1):\n", "    grads = compute_grads(X, Y, Y_hat)\n", "    params = update_params(params, grads)\n", "    Y_hat = neuron_output(X, params, sigmoid)\n", "    loss_new = compute_log_loss(Y, Y_hat)\n", "    if loss - loss_new <= 1e-4:\n", "        print(f\"Iter {i} - Log loss={loss:.6f}\")\n", "        print(\"The algorithm has converged\")\n", "        break\n", "    loss = loss_new\n", "    if i % 100 == 0:\n", "        print(f\"Iter {i} - Log loss={loss:.6f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the final model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "w, b = params.values()\n", "final_model_plane = go.Surface(\n", "    z=sigmoid(w[0, 0] * xx1 + w[0, 1] * xx2 + b),\n", "    x=xx1,\n", "    y=xx2,\n", "    colorscale=[[0, \"#ff7f0e\"], [1, \"#1f77b4\"]],\n", "    showscale=False,\n", "    opacity=0.5,\n", "    name=\"final params\",\n", ")\n", "fig.add_trace(final_model_plane)\n", "fig.data[2].visible = False\n", "fig.update_layout(title=\"Final model\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Neural network with 1 hidden layer of 3 neurons and sigmoid activations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Motivation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create non-linear classification data to see the shortcomings of the previous model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = 100\n", "k = 2\n", "linspace_out = np.linspace(0, 2 * np.pi, m // 2, endpoint=False)\n", "linspace_in = np.linspace(0, 2 * np.pi, m // 2, endpoint=False)\n", "outer_circ_x = np.cos(linspace_out)\n", "outer_circ_y = np.sin(linspace_out)\n", "inner_circ_x = np.cos(linspace_in) * 0.3\n", "inner_circ_y = np.sin(linspace_in) * 0.3\n", "\n", "rng = np.random.default_rng(1)\n", "X = np.vstack(\n", " [np.append(outer_circ_x, inner_circ_x), np.append(outer_circ_y, inner_circ_y)]\n", ")\n", "X += rng.normal(scale=0.1, size=X.shape)\n", "X = (X - np.mean(X, axis=1, keepdims=True)) / np.std(X, axis=1, keepdims=True)\n", "Y = np.array([[0] * (m // 2) + [1] * (m // 2)])\n", "\n", "plt.scatter(X[0], X[1], c=np.where(Y.squeeze() == 0, \"tab:orange\", \"tab:blue\"))\n", "plt.gca().set_aspect(\"equal\")\n", "plt.xlabel(\"$x_1$\")\n", "plt.ylabel(\"$x_2$\")\n", "plt.xlim(-3, 3)\n", "plt.ylim(-3, 3)\n", "plt.title(\"Non-linear classification data\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before let's add another axis to visualize the model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neg_scatter = go.Scatter3d(\n", " z=np.full(int(m / 2), 0),\n", " x=X[0, : int(m / 2)],\n", " y=X[1, : int(m / 2)],\n", " mode=\"markers\",\n", " marker=dict(color=\"#ff7f0e\", size=5),\n", " name=\"negative class\",\n", ")\n", "pos_scatter = go.Scatter3d(\n", " z=np.full(int(m / 2), 1),\n", " x=X[0, int(m / 2) :],\n", " y=X[1, int(m / 2) :],\n", " mode=\"markers\",\n", " marker=dict(color=\"#1f77b4\", size=5),\n", " name=\"positive class\",\n", ")\n", "\n", "fig = go.Figure([pos_scatter, neg_scatter])\n", "fig.update_layout(\n", " title=\"Non-linear classification data\",\n", " autosize=False,\n", " width=600,\n", " height=600,\n", " margin=dict(l=10, r=10, b=10, t=30),\n", " scene=dict(\n", " xaxis=dict(title=\"x1\", range=[-3, 3]),\n", " yaxis=dict(title=\"x2\", range=[-3, 3]),\n", " zaxis_title=\"y\",\n", " aspectmode=\"cube\",\n", " camera_eye=dict(x=0, y=0, z=2.5),\n", " camera_up=dict(x=0, y=np.sin(np.pi), z=0),\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train a single-neuron model with sigmoid activation using a Log loss function and plot the respective plane." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = init_neuron_params(k)\n", "Y_hat = neuron_output(X, params, sigmoid)\n", "loss = compute_log_loss(Y, Y_hat)\n", "print(f\"Iter 0 - Log loss={loss:.6f}\")\n", "for i in range(1, 500 + 1):\n", "    grads = compute_grads(X, Y, Y_hat)\n", "    params = update_params(params, grads)\n", "    Y_hat = neuron_output(X, params, sigmoid)\n", "    loss_new = compute_log_loss(Y, Y_hat)\n", "    if loss - loss_new <= 1e-4:\n", "        print(f\"Iter {i} - Log loss={loss:.6f}\")\n", "        print(\"The algorithm has converged\")\n", "        break\n", "    loss = loss_new\n", "    if i % 100 == 0:\n", "        print(f\"Iter {i} - Log loss={loss:.6f}\")\n", "\n", "w, b = params.values()\n", "xx1, xx2 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))\n", "final_model_plane = go.Surface(\n", "    z=sigmoid(w[0, 0] * xx1 + w[0, 1] * xx2 + b),\n", "    x=xx1,\n", "    y=xx2,\n", "    colorscale=[[0, \"#ff7f0e\"], [1, \"#1f77b4\"]],\n", "    showscale=False,\n", "    opacity=0.5,\n", "    name=\"final params\",\n", ")\n", "fig.add_trace(final_model_plane)\n", "fig.update_layout(\n", "    title=\"Linear model on non-linear data\",\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to improve it by adding more neurons." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Neural network notation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The neural network we will be building is shown below.\n", "\n", "\n", "\n", "The forward and backward propagation algorithm is shown below.\n", "\n", "