
Neural nets

Dense layer

Consider a dense affine layer with 10 inputs and 20 outputs. How many parameters does it have (assume there is a “bias” term)?
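
If you want to check your count, a minimal PyTorch sketch (an assumption; any framework that reports parameter shapes would do) is:

```python
# Count the parameters of a dense affine layer with 10 inputs, 20 outputs,
# and a bias term. (Sketch assuming PyTorch; not part of the exercise.)
import torch.nn as nn

layer = nn.Linear(in_features=10, out_features=20, bias=True)

# nn.Linear stores a (20 x 10) weight matrix and a length-20 bias vector.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # compare with your hand count
```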

Convolutional layer

Consider a 2D convolutional layer with filter width 2 and filter height 6, 10 input channels, and 20 output channels. We are using the layer for images with width 50 and height 50. How many parameters does the layer have?
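
Again, a PyTorch sketch (an assumption) can serve as a check; note that the parameter count is computed from the layer alone, before any image is passed through it:

```python
# Count the parameters of a 2D convolution with 10 input channels, 20 output
# channels, and a 6 (height) x 2 (width) filter. (Sketch assuming PyTorch.)
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=10, out_channels=20, kernel_size=(6, 2), bias=True)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # the count does not involve the 50 x 50 image size

# Sanity check: the layer does accept a 50 x 50 image (batch of one).
out = conv(torch.zeros(1, 10, 50, 50))
print(out.shape)
```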

Composing layers

Consider a network architecture formed by composing three simple families of functions. The final network will involve an affine layer

$$(f(x;\beta))_k = \beta_{k0} + \sum_j \beta_{kj} x_j,$$

another affine layer

$$(g(z;\gamma))_k = \gamma_{k0} + \sum_j \gamma_{kj} z_j,$$

and a nonlinearity $(\mathrm{ReLU}(x))_k = \max(x_k, 0)$. Which would be a more typical way to compose these layers into a final neural network architecture? (A code sketch of the three candidate compositions appears after the options.)

  1. $h(x;\beta,\gamma) = \mathrm{ReLU}(f(g(x;\gamma);\beta))$

  2. $h(x;\beta,\gamma) = f(\mathrm{ReLU}(g(x;\gamma));\beta)$

  3. $h(x;\beta,\gamma) = f(g(\mathrm{ReLU}(x);\gamma);\beta)$
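
For concreteness, here is a NumPy sketch of the three building blocks (the dimensions and random parameter values are arbitrary assumptions), so each candidate composition can be written out and applied to an input:

```python
# Sketch of the two affine layers and the ReLU nonlinearity, so the three
# candidate compositions can be written directly. Shapes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
gamma0, gamma = rng.normal(size=4), rng.normal(size=(4, 3))  # g: R^3 -> R^4
beta0, beta = rng.normal(size=2), rng.normal(size=(2, 4))    # f: R^4 -> R^2

def g(z):
    return gamma0 + gamma @ z

def f(x):
    return beta0 + beta @ x

def relu(x):
    return np.maximum(x, 0.0)

x = rng.normal(size=3)
h1 = relu(f(g(x)))  # option 1
h2 = f(relu(g(x)))  # option 2
h3 = f(g(relu(x)))  # option 3
print(h1, h2, h3)
```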

Convolution by hand

Compute the convolution of the input $X = (1, 1, 2, 3, 4, 5, 6, 6)$ with the filter $(1, -2, 1)$.
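
To check your hand computation, a NumPy sketch (an assumption: only positions where the filter fully overlaps the input are kept, i.e. no padding; adjust if your convention differs) is:

```python
# Check the hand-computed convolution of X with the filter (1, -2, 1).
import numpy as np

x = np.array([1, 1, 2, 3, 4, 5, 6, 6])
w = np.array([1, -2, 1])

# np.convolve flips the filter; because (1, -2, 1) is symmetric, this agrees
# with the cross-correlation convention used by most deep learning libraries.
print(np.convolve(x, w, mode="valid"))
```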

Function families

Consider the following three functions.

  • $g_1(x) = x^2 - \cos x$

  • $g_2(x) = (x+3)^2$

  • $g_3(x) = (x-3)^2 + 20 \cos x$

Devise a two-parameter family of functions that could include all three of those functions. Your answer should define a layer $f(x;\beta_1,\beta_2)$ that takes a scalar input $x$ and two parameters $\beta_1,\beta_2$ and returns a number. The layer should be designed so that, for each function $g_i$ listed above, we can write $g_i(x) = f(x;\beta_1,\beta_2)$ by choosing $\beta_1,\beta_2$ appropriately.
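
A candidate layer can be sanity-checked numerically; the sketch below (assuming NumPy, with an arbitrary grid of test points) compares your proposed $f$ and parameter choices against each $g_i$:

```python
# Generic numerical checker: does f(x; beta1, beta2) reproduce g(x) on a grid?
# You supply your own f and your claimed (beta1, beta2) for each target.
import numpy as np

def matches(f, beta1, beta2, g, xs=np.linspace(-5.0, 5.0, 201)):
    return np.allclose(f(xs, beta1, beta2), g(xs))

# The three target functions from the exercise:
g1 = lambda x: x**2 - np.cos(x)
g2 = lambda x: (x + 3)**2
g3 = lambda x: (x - 3)**2 + 20 * np.cos(x)

# Example usage (hypothetical layer and parameters): matches(my_layer, b1, b2, g1)
```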

More fun with function families

Consider the family of linear functions with positive slope: $F = \{g : g(x) = \beta_0 + \beta_1 x\}_{\beta_1 > 0}$. Devise a two-parameter architecture that includes all these functions and does not include any functions not in $F$. That is, write a layer $f(x;\beta)$ that takes a scalar input and a parameter $\beta \in \mathbb{R}^2$ and returns a number. The layer should be designed so that for any $g \in F$ it is possible to find $\beta$ so that $f(x;\beta) = g(x)$.

Gradient descent by hand

Consider the objective $L(\beta) = (\beta - 5)^2$. Run one iteration of gradient descent with $\beta$ initialized at 6. Use a learning rate of 3. What is the loss at initialization? What is the new value of $\beta$ you get? What is the loss with the new value of $\beta$? What can we say about the choice of learning rate in this case?
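
The single step described above is small enough to do by hand, but a few lines of plain Python (no libraries assumed) will confirm the arithmetic:

```python
# One step of gradient descent on L(beta) = (beta - 5)^2, starting at beta = 6
# with learning rate 3.
def loss(beta):
    return (beta - 5.0) ** 2

def grad(beta):
    return 2.0 * (beta - 5.0)  # derivative of (beta - 5)^2

beta, lr = 6.0, 3.0
print("loss at initialization:", loss(beta))
beta = beta - lr * grad(beta)
print("beta after one step:", beta)
print("loss at new beta:", loss(beta))
```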

Improvements to vanilla gradient descent

Consider the loss $L(\beta_1,\beta_2) = \cos(\beta_1 - 10)e^{\beta_1^2} + 500 \cos(\beta_2 - 3)^2 e^{\beta_2^2}$. This loss can be difficult to minimize using gradient descent. Of the following, which would be the most typical method we might employ to help gradient descent find a local optimum more quickly? (A runnable sketch of this loss appears after the options.)

  1. Minibatch stochastic gradient descent

  2. Gradient descent with momentum

  3. Using GPUs (allowing us to quickly compute a large number of iterations of SGD with very small step-sizes)

  4. Ridge regularization on $\beta$
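
To experiment with the options above against a plain baseline, here is a sketch of the loss together with vanilla gradient descent; the finite-difference gradient, starting point, learning rate, and step count are all arbitrary assumptions:

```python
# The loss from the exercise, plus plain (vanilla) gradient descent as a
# baseline to compare the listed variants against.
import numpy as np

def loss(b):
    b1, b2 = b
    return np.cos(b1 - 10) * np.exp(b1**2) + 500 * np.cos(b2 - 3)**2 * np.exp(b2**2)

def num_grad(f, b, eps=1e-6):
    # Central finite differences (an implementation convenience, not required).
    g = np.zeros_like(b)
    for i in range(len(b)):
        e = np.zeros_like(b)
        e[i] = eps
        g[i] = (f(b + e) - f(b - e)) / (2 * eps)
    return g

b = np.array([0.5, -0.5])
lr = 1e-4
for _ in range(100):
    b = b - lr * num_grad(loss, b)
print(b, loss(b))
```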

The chain rule

All autodiff methods rely on the chain rule to help them figure out how to take derivatives of complicated functions. If $h(x) = f(g(x))$, the chain rule states that... (the candidates below can also be checked numerically with the sketch after the options)

  1. $h'(x) = f'(g(x))\,g'(x)$

  2. $h'(x) = f(g'(x))\,g(x)$

  3. $h'(x) = f'(g(x))/g'(x)$

  4. $h'(x) = f(g'(x))/g(x)$
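
The candidates can be checked numerically; the sketch below (assuming NumPy, with $f = \sin$ and $g(x) = x^2$ chosen arbitrarily) compares each formula against a finite-difference estimate of $h'(x)$:

```python
# Numerical check of the four candidate formulas against a finite-difference
# estimate of h'(x), with f = sin and g(x) = x^2 as arbitrary test functions.
import numpy as np

f, fp = np.sin, np.cos                    # f and its derivative f'
g, gp = lambda x: x**2, lambda x: 2 * x   # g and its derivative g'
h = lambda x: f(g(x))

x, eps = 1.3, 1e-6
print((h(x + eps) - h(x - eps)) / (2 * eps))  # finite-difference h'(x)
print(fp(g(x)) * gp(x))  # candidate 1
print(f(gp(x)) * g(x))   # candidate 2
print(fp(g(x)) / gp(x))  # candidate 3
print(f(gp(x)) / g(x))   # candidate 4
```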