Dimensionality reduction
Fitting an autoencoder by hand
Let’s say you have the following dataset, $X$, shown below.
Sample   Height   Weight
-------- -------- --------
#1       1        2
#2       1        4
#3       1        6
#4       1        13
Say you use this data and find an optimal linear autoencoder with one latent dimension. You get two functions: an encoder $f: \mathbb{R}^2 \to \mathbb{R}$ and a decoder $g: \mathbb{R} \to \mathbb{R}^2$.
What will be the value of the mean squared reconstruction error, $\frac{1}{4} \sum_{i=1}^{4} \lVert x_i - g(f(x_i)) \rVert^2$?
(Hint: it may be helpful to try to explicitly find suitable functions $f$ and $g$.)
Solutions
Zero. Here is a suitable autoencoder: $f(h, w) = w$ and $g(z) = (1, z)$. With these choices, $g(f(x)) = x$ for every point in the dataset, so the reconstruction error is zero.
The dataset has no diversity in the Height variable, so, as far as this dataset is concerned, we can summarize everything about each sample by looking at Weight alone.
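As a quick sanity check, here is a minimal NumPy sketch (the code and variable names are ours, not part of the original problem) that runs this hand-fit autoencoder over the table above and confirms the mean squared reconstruction error is zero.

```python
import numpy as np

# The dataset from the table above: columns are (Height, Weight).
X = np.array([[1, 2],
              [1, 4],
              [1, 6],
              [1, 13]], dtype=float)

# Hand-fit autoencoder: the encoder keeps Weight; the decoder
# re-attaches the constant Height of 1.
def f(x):
    return x[1]

def g(z):
    return np.array([1.0, z])

reconstructions = np.array([g(f(x)) for x in X])
mse = np.mean(np.sum((X - reconstructions) ** 2, axis=1))
print(mse)  # 0.0 -- every sample is reconstructed exactly
```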
Reconstruction error
You have been given an autoencoder: an encoder $f$ that maps a point to its summary, and a decoder $g$ that maps a summary back to feature space.
What is the squared reconstruction error for a point $x$?
Solutions
The summary is $z = f(x)$ and the reconstruction is $\hat{x} = g(z) = g(f(x))$. So the squared reconstruction error is $\lVert x - \hat{x} \rVert^2 = \lVert x - g(f(x)) \rVert^2$.
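The specific encoder, decoder, and point from the original problem did not survive, so the sketch below uses hypothetical stand-ins of the same shape to show how this computation goes in general; only `squared_reconstruction_error` itself reflects the formula above.

```python
import numpy as np

def squared_reconstruction_error(x, f, g):
    """Squared Euclidean distance between x and its reconstruction g(f(x))."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(g(f(x)), dtype=float)
    return float(np.sum((x - x_hat) ** 2))

# Hypothetical stand-in autoencoder with one latent dimension
# (NOT the problem's original f and g, which were lost):
f = lambda x: x[1]                # encode: keep the second coordinate
g = lambda z: np.array([1.0, z])  # decode: guess 1.0 for the first coordinate

print(squared_reconstruction_error([2.0, 5.0], f, g))  # (2 - 1)^2 + (5 - 5)^2 = 1.0
```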
PCA and scaling
You have a dataset with $n$ samples and $d$ features per sample. You run PCA to get a 2d latent representation (summary) for each point. You store these summaries in an $n \times 2$ matrix, $Z$. Then you standardize your dataset with the standard scaler, run PCA on the standardized data, and get a different matrix of summaries, $\tilde{Z}$. Which is true?
$Z = \tilde{Z}$.
There exists a matrix $A$ such that $\tilde{Z} = Z A$.
None of the above.
Solutions
None of the above is correct. Running PCA on standardized data will give you fundamentally different latent representations. One way to see why is to consider that PCA minimizes the mean squared reconstruction error, and that reconstruction error is measured in terms of Euclidean distance in the feature space. If you rescale the feature space, you change what distance means, and you change what counts as a “large” error.
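To make this concrete, here is an illustrative sketch (ours, not part of the original exercise) using scikit-learn's `PCA` and `StandardScaler` on the wine dataset, whose features live on very different scales. It fits the best linear map $A$ from $Z$ to $\tilde{Z}$ by least squares; if $\tilde{Z} = Z A$ held for some $A$, the residual would be numerically zero, but it should come out substantially nonzero.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data  # 13 features on very different scales

Z = PCA(n_components=2).fit_transform(X)            # summaries of the raw data
X_std = StandardScaler().fit_transform(X)
Z_tilde = PCA(n_components=2).fit_transform(X_std)  # summaries after standardizing

# Best linear map A minimizing ||Z A - Z_tilde|| in the least-squares sense.
A, _, _, _ = np.linalg.lstsq(Z, Z_tilde, rcond=None)
residual = np.linalg.norm(Z @ A - Z_tilde)
print(residual)  # substantially nonzero: no A turns Z into Z_tilde
```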
Variance explained
Let $X \in \mathbb{R}^d$ be a random vector. Let $\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathbb{E}\left[\lVert X - \mu \rVert^2\right]$. Say you have fit an autoencoder $x \mapsto g(f(x))$. Say the mean squared reconstruction error is given by $\varepsilon^2 = \mathbb{E}\left[\lVert X - g(f(X)) \rVert^2\right]$. What is the formula for the proportion of variance explained by the autoencoder?
Solutions
$1 - \varepsilon^2 / \sigma^2$. The autoencoder leaves $\varepsilon^2$ of the total variance $\sigma^2$ unexplained, so the explained proportion is the remainder.
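As a sanity check of this formula (our own sketch, not from the original notes), the code below uses a two-component PCA as the fitted linear autoencoder and verifies that $1 - \varepsilon^2 / \sigma^2$ matches the sum of scikit-learn's `explained_variance_ratio_`. Both quantities are estimated with matching $n - 1$ denominators so that they agree exactly.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

X = load_wine().data
n = X.shape[0]

# A linear autoencoder: encode with PCA, decode with its inverse transform.
pca = PCA(n_components=2).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))

sigma2 = np.sum((X - X.mean(axis=0)) ** 2) / (n - 1)  # total variance
eps2 = np.sum((X - X_hat) ** 2) / (n - 1)             # mean squared reconstruction error

print(1 - eps2 / sigma2)                    # proportion of variance explained
print(pca.explained_variance_ratio_.sum())  # same value, straight from sklearn
```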