Neural nets
Dense layer¶
Consider a dense affine layer with 10 inputs and 20 outputs. How many parameters does it have (assume there is a “bias” term)?
Solutions
There are 10 × 20 = 200 weight parameters and 20 bias parameters, so 220 in total.
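As a quick sanity check, a sketch assuming PyTorch is available:

```python
import torch

# A dense affine layer with 10 inputs and 20 outputs (bias included by default).
layer = torch.nn.Linear(in_features=10, out_features=20, bias=True)

# Weight matrix is 20 x 10 = 200 parameters, plus 20 bias parameters.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 220
```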
Convolutional layer¶
Consider a 2D convolutional layer with filter width 2 and filter height 6, 10 input channels, and 20 output channels. We are using the layer for images with width 50 and height 50. How many parameters does the layer have?
Solutions
Image size doesn’t matter one whit.
There are 20 filters, each with 2 × 6 × 10 = 120 parameters, so 2400 filter parameters. Then there is a bias term for each output channel, so 2420 in total.
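The same kind of sanity check for the convolutional layer, again assuming PyTorch (note the 50 × 50 image size never enters the count):

```python
import torch

# 10 input channels, 20 output channels, 6 x 2 filters (height x width).
layer = torch.nn.Conv2d(in_channels=10, out_channels=20, kernel_size=(6, 2), bias=True)

# Each filter has 2 * 6 * 10 = 120 weights; 20 filters -> 2400, plus 20 biases.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 2420
```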
Composing layers¶
Consider a network architecture formed by composing three simple families of functions. The final network will involve an affine layer, another affine layer, and a nonlinearity. Which would be a more typical way to compose these layers into a final neural network architecture?
Solutions
Middle answer is most typical.
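For reference, the usual pattern is to put the nonlinearity between the two affine layers, since composing two affine layers back-to-back just gives another affine layer. A minimal sketch (sizes made up, assuming PyTorch):

```python
import torch

# A typical composition: affine layer, then a nonlinearity, then another affine layer.
network = torch.nn.Sequential(
    torch.nn.Linear(10, 20),  # first affine layer
    torch.nn.ReLU(),          # elementwise nonlinearity
    torch.nn.Linear(20, 1),   # second affine layer
)

x = torch.randn(5, 10)   # batch of 5 inputs
print(network(x).shape)  # torch.Size([5, 1])
```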
Convolution by hand¶
Compute the convolution of the input with the filter .
Solutions
First window is , so we’re looking at .
Next window is , so we’re looking at .
Next window is , so we’re looking at . Ah there’s a pattern here.
Next window is , so we’re looking at . Yeah these will all be zero.
... will be different. We’ll get .
Final answer is .
Note there is some ambiguity about what we meant by convolution here (because sometimes people think about flipping the filter before applying it at each window). Here we avoided that ambiguity by making the filter symmetric.
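To make the mechanics concrete, here is a sketch with made-up numbers (the point is the sliding-window dot product, and using a symmetric filter makes the flip-or-not question moot):

```python
import numpy as np

# Made-up input and filter, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, -2.0, 1.0])  # symmetric filter, so flipping it changes nothing

# Slide the filter over the input and take a dot product at each window.
out = np.array([x[i:i + len(w)] @ w for i in range(len(x) - len(w) + 1)])
print(out)                              # [0. 0. 0.]

# np.convolve flips the filter before sliding it; same result here because w is symmetric.
print(np.convolve(x, w, mode="valid"))  # [0. 0. 0.]
```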
Function families¶
Consider the following three functions.
Devise a two-parameter family of functions that could include all three of those functions. Your answer should define a layer that takes a scalar input and two parameters and returns a number. The layer should be designed so that each function listed above can be recovered by choosing the two parameters appropriately.
Solutions
would suffice.
More fun with function families¶
Consider the family of linear functions with positive slope, i.e. functions of the form $f(x) = a x$ with $a > 0$. Devise a one-parameter architecture that includes all of these functions and does not include any functions not in the family. That is, write a layer that takes a scalar input $x$ and a parameter $\beta$ and returns a number. The layer should be designed so that for any function in the family it is possible to find $\beta$ so that the layer reproduces that function.
Solutions
will suffice.
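One natural candidate (an assumption, not necessarily the expression intended above) is

$$\hat{f}(x \mid \beta) = e^{\beta}\, x,$$

which hits every positive slope and nothing else, since $\beta \mapsto e^{\beta}$ maps the real line onto $(0, \infty)$.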
Gradient descent by hand¶
Consider the objective . Run one iteration of gradient descent with β initialized at 6. Use a learning rate of 3. What is the loss at initialization? What is the new value of β you get? What is the loss with the new value of β? What can we say about the choice of learning rate in this case?
Solutions
At initialization the loss is the objective evaluated at $\beta = 6$, so we can also read off the gradient at initialization. We then take a descent step, $\beta \leftarrow \beta - \eta \, L'(\beta)$ with $\eta = 3$, which gives the new value of $\beta$. The loss at the new $\beta$ is larger than it was at initialization. Not good! Things got worse! The learning rate was way too big.
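As a concrete sketch, suppose the objective were $L(\beta) = \beta^2$ (an assumed stand-in; the arithmetic below is specific to that choice, but the moral is the same):

```python
# Assumed stand-in objective: L(beta) = beta**2, so L'(beta) = 2 * beta.
beta = 6.0
lr = 3.0

loss_init = beta ** 2        # 36.0
grad = 2 * beta              # 12.0
beta_new = beta - lr * grad  # 6 - 3 * 12 = -30.0
loss_new = beta_new ** 2     # 900.0 -- far worse than 36: the step overshot badly

print(loss_init, beta_new, loss_new)
```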
Improvements to vanilla gradient descent¶
Consider the loss . This loss can be difficult to minimize using gradient descent. Of the following, which would be the most typical method we might employ to help gradient descent find a local optimum more quickly?
Minibatch stochastic gradient descent
Gradient descent with momentum
Using GPUs (allowing us to quickly compute a large number of iterations of SGD with very small step-sizes)
Ridge regularization on β
Solutions
Second option. Momentum can help with the unequal magnitude of curvature (the function is much curvier in some parameter directions than in others).
Minibatch stochastic gradient descent makes no sense here because we can very rapidly compute the gradient of the full loss.
GPUs won’t help here because computing the gradient at each step is already fast. GPUs help when we need to compute many different quantities at the same time, but they won’t speed up a problem like this, where the result at each stage depends on the result from the previous stage.
Ridge regularization could conceivably help, maybe? But I can’t actually think how right now. Would not be typical. Ridge regularization is usually employed in the context of improving the final estimate, not to make gradient descent converge faster.
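Here is a minimal sketch of gradient descent with momentum on an assumed stand-in loss that is much curvier in one coordinate than the other:

```python
import numpy as np

# Assumed stand-in loss: L(beta) = beta_0**2 + 100 * beta_1**2,
# which is much curvier in the second coordinate than in the first.
def grad(beta):
    return np.array([2 * beta[0], 200 * beta[1]])

beta = np.array([1.0, 1.0])
velocity = np.zeros(2)
lr, rho = 0.005, 0.9  # step size and momentum coefficient

for _ in range(200):
    velocity = rho * velocity - lr * grad(beta)  # accumulate a running descent direction
    beta = beta + velocity                       # step along the accumulated direction

print(beta)  # close to the optimum at (0, 0)
```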
The chain rule¶
All autodiff methods rely on the chain rule to help them figure out how to take derivatives of complicated functions. If , the chain rule states that...
Solutions
First answer is correct.
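For reference, for a composition $f(x) = g(h(x))$ the chain rule states that

$$f'(x) = g'\big(h(x)\big)\, h'(x),$$

i.e. the derivative of the outer function evaluated at the inner function, times the derivative of the inner function. Autodiff systems apply this rule repeatedly, one elementary operation at a time.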