Neural nets
Dense layer¶
Consider a dense affine layer with 10 inputs and 20 outputs. How many parameters does it have (assume there is a “bias” term)?
Solutions
There are 10 × 20 = 200 weight parameters and 20 bias parameters, so 220 in total.
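As a quick sanity check, a sketch assuming PyTorch is available:

```python
import torch

# A dense affine layer with 10 inputs and 20 outputs (bias included by default).
layer = torch.nn.Linear(in_features=10, out_features=20, bias=True)

# Weight matrix is 20 x 10 = 200 parameters, plus 20 bias parameters.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 220
```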
Convolutional layer¶
Consider a 2D convolutional layer with filter width 2 and filter height 6, 10 input channels, and 20 output channels. We are using the layer for images with width 50 and height 50. How many parameters does the layer have?
Solutions
Image size doesn’t matter one whit.
There are 20 filters, each with 2 × 6 × 10 = 120 parameters, so 2400 filter parameters. Then there is a bias term for each output channel, so 2420 in total.
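The same kind of sanity check for the convolutional layer, again assuming PyTorch (note the 50 × 50 image size never enters the count):

```python
import torch

# 10 input channels, 20 output channels, 6 x 2 filters (height x width).
layer = torch.nn.Conv2d(in_channels=10, out_channels=20, kernel_size=(6, 2), bias=True)

# Each filter has 2 * 6 * 10 = 120 weights; 20 filters -> 2400, plus 20 biases.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 2420
```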
Composing layers¶
Consider a network architecture formed by composing three simple families of functions. The final network will involve an affine layer, another affine layer, and a nonlinearity. Which would be a more typical way to compose these layers into a final neural network architecture?
Solutions
Middle answer is most typical.
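For reference, the usual pattern is to put the nonlinearity between the two affine layers, since composing two affine layers back-to-back just gives another affine layer. A minimal sketch (sizes made up, assuming PyTorch):

```python
import torch

# A typical composition: affine layer, then a nonlinearity, then another affine layer.
network = torch.nn.Sequential(
    torch.nn.Linear(10, 20),  # first affine layer
    torch.nn.ReLU(),          # elementwise nonlinearity
    torch.nn.Linear(20, 1),   # second affine layer
)

x = torch.randn(5, 10)   # batch of 5 inputs
print(network(x).shape)  # torch.Size([5, 1])
```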
Convolution by hand¶
Compute the convolution of the input with the filter .
Solutions
First window is , so we’re looking at .
Next window is , so we’re looking at .
Next window is , so we’re looking at . Ah there’s a pattern here.
Next window is , so we’re looking at . Yeah these will all be zero.
... will be different. We’ll get .
Final answer is .
Note there is some ambiguity about what we meant by convolution here (because sometimes people think about flipping the filter before applying it at each window). Here we avoided that ambiguity by making the filter symmetric.
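To make the mechanics concrete, here is a sketch with made-up numbers (the point is the sliding-window dot product, and using a symmetric filter makes the flip-or-not question moot):

```python
import numpy as np

# Made-up input and filter, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, -2.0, 1.0])  # symmetric filter, so flipping it changes nothing

# Slide the filter over the input and take a dot product at each window.
out = np.array([x[i:i + len(w)] @ w for i in range(len(x) - len(w) + 1)])
print(out)                              # [0. 0. 0.]

# np.convolve flips the filter before sliding it; same result here because w is symmetric.
print(np.convolve(x, w, mode="valid"))  # [0. 0. 0.]
```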
Function families¶
Consider the following three functions.
Devise a two-parameter family of functions that could include all three of those functions. Your answer should define a layer that takes a scalar input and two parameters and returns a number. The layer should be designed so that each function listed above can be recovered by choosing the two parameters appropriately.
Solutions
would suffice.
More fun with function families¶
Consider the family of linear functions with positive slope, i.e. functions of the form $f(x) = a x$ with $a > 0$. Devise a one-parameter architecture that includes all of these functions and does not include any functions not in the family. That is, write a layer that takes a scalar input $x$ and a parameter $\beta$ and returns a number. The layer should be designed so that for any function in the family it is possible to find $\beta$ so that the layer reproduces that function.
Solutions
will suffice.
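One natural candidate (an assumption, not necessarily the expression intended above) is

$$\hat{f}(x \mid \beta) = e^{\beta}\, x,$$

which hits every positive slope and nothing else, since $\beta \mapsto e^{\beta}$ maps the real line onto $(0, \infty)$.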
Gradient descent by hand¶
Consider the objective . Run one iteration of gradient descent with β initialized at 6. Use a learning rate of 3. What is the loss at initialization? What is the new value of β you get? What is the loss with the new value of β? What can we say about the choice of learning rate in this case?
Solutions
At initialization the loss is the objective evaluated at $\beta = 6$, so we can also read off the gradient at initialization. We then take a descent step, $\beta \leftarrow \beta - \eta \, L'(\beta)$ with $\eta = 3$, which gives the new value of $\beta$. The loss at the new $\beta$ is larger than it was at initialization. Not good! Things got worse! The learning rate was way too big.
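As a concrete sketch, suppose the objective were $L(\beta) = \beta^2$ (an assumed stand-in; the arithmetic below is specific to that choice, but the moral is the same):

```python
# Assumed stand-in objective: L(beta) = beta**2, so L'(beta) = 2 * beta.
beta = 6.0
lr = 3.0

loss_init = beta ** 2        # 36.0
grad = 2 * beta              # 12.0
beta_new = beta - lr * grad  # 6 - 3 * 12 = -30.0
loss_new = beta_new ** 2     # 900.0 -- far worse than 36: the step overshot badly

print(loss_init, beta_new, loss_new)
```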
Improvements to vanilla gradient descent¶
Consider the loss . This loss can be difficult to minimize using gradient descent. Of the following, which would be the most typical method we might employ to help gradient descent find a local optimum more quickly?
Minibatch stochastic gradient descent
Gradient descent with momentum
Using GPUs (allowing us to quickly compute a large number of iterations of SGD with very small step-sizes)
Ridge regularization on β
Solutions
Second option. Momentum can help with the unequal magnitude of curvature (the function is much curvier in some parameter directions than in others).
Minibatch stochastic gradient descent makes no sense here because we can very rapidly compute the gradient of the full loss.
GPUs won’t help here because computing the gradient at each step is already fast. GPUs help when we need to compute many different quantities at the same time, but they won’t speed up a problem like this, where the result at each stage depends on the result from the previous stage.
Ridge regularization could conceivably help, maybe? But I can’t actually think how right now. Would not be typical. Ridge regularization is usually employed in the context of improving the final estimate, not to make gradient descent converge faster.
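Here is a minimal sketch of gradient descent with momentum on an assumed stand-in loss that is much curvier in one coordinate than the other:

```python
import numpy as np

# Assumed stand-in loss: L(beta) = beta_0**2 + 100 * beta_1**2,
# which is much curvier in the second coordinate than in the first.
def grad(beta):
    return np.array([2 * beta[0], 200 * beta[1]])

beta = np.array([1.0, 1.0])
velocity = np.zeros(2)
lr, rho = 0.005, 0.9  # step size and momentum coefficient

for _ in range(200):
    velocity = rho * velocity - lr * grad(beta)  # accumulate a running descent direction
    beta = beta + velocity                       # step along the accumulated direction

print(beta)  # close to the optimum at (0, 0)
```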
The chain rule¶
All autodiff methods rely on the chain rule to help them figure out how to take derivatives of complicated functions. If , the chain rule states that...
Solutions
First answer is correct.
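For reference, for a composition $f(x) = g(h(x))$ the chain rule states that

$$f'(x) = g'\big(h(x)\big)\, h'(x),$$

i.e. the derivative of the outer function evaluated at the inner function, times the derivative of the inner function. Autodiff systems apply this rule repeatedly, one elementary operation at a time.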