This page contains my answers to questions about my papers, general comments that may be inappropriate to include in the actual papers, and references to related and follow-up work.

Fast dropout

Some slides by Naiyan Wang

post-ICML discussions


The equation before (7): \(s\) should be changed to \( s^2 \).


Q: Did you run any experiments with regularized LR? I don't see any.

I did not mean plain as in unregularized. The provided code scans over L2 regularization strengths to show that no choice of L2 strength beats this Gaussian dropout, at least on some datasets.

Q: What is MC Dropout, and Real Dropout?

Sorry about the inconsistency. MC means Monte Carlo, and Real means using MC to do real dropout.
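To make the distinction concrete, here is a minimal sketch that compares real dropout estimated by Monte Carlo (sampling Bernoulli masks) with the Gaussian approximation for a single logistic unit. The weights, the inputs, the keep probability of 0.5, and all function names are made up for illustration and are not taken from the released code. The closed form assumes formula (7) reads \(\sigma(\mu/\sqrt{1+\pi s^2/8})\).

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def mc_real_dropout(w, x, keep_prob, n_samples, rng):
    """Average the sigmoid output over sampled Bernoulli dropout masks."""
    total = 0.0
    for _ in range(n_samples):
        # Each input is kept independently with probability keep_prob.
        pre_activation = sum(
            wi * xi for wi, xi in zip(w, x) if rng.random() < keep_prob
        )
        total += sigmoid(pre_activation)
    return total / n_samples

def gaussian_dropout(w, x, keep_prob):
    """Treat the pre-activation as Gaussian and apply the closed-form approximation."""
    mean = keep_prob * sum(wi * xi for wi, xi in zip(w, x))
    var = keep_prob * (1.0 - keep_prob) * sum((wi * xi) ** 2 for wi, xi in zip(w, x))
    return sigmoid(mean / math.sqrt(1.0 + math.pi * var / 8.0))

# Hypothetical toy problem: 50 inputs, all equal to 1, with random weights.
rng = random.Random(0)
w = [rng.gauss(0.0, 0.3) for _ in range(50)]
x = [1.0] * 50

mc_estimate = mc_real_dropout(w, x, keep_prob=0.5, n_samples=20000, rng=rng)
gaussian_estimate = gaussian_dropout(w, x, keep_prob=0.5)
print("MC real dropout:", mc_estimate)
print("Gaussian approximation:", gaussian_estimate)
```

The Monte Carlo estimate needs many samples per example, whereas the Gaussian version needs only the mean and variance of the pre-activation.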

Q: Where does the approximation formula (7) come from?

I got this trick from this paper:

MacKay, David J.C. The evidence framework applied to classification networks

First, we stress that this trick is not essential to the main point of the Fast Dropout paper, since accurately computing the value of any smooth function in 1D or 2D is probably quite easy by tabulating and interpolating.

However, the trick is quite interesting, and it does give us some insight into the effect of dropout. So here is how it goes: Let \( \Phi(x)=\int_{-\infty}^{x} f(t)\ dt\) be the Gaussian cumulative distribution function, with \( f(x) = \frac{1}{\sqrt{2\pi}}\exp(-x^2/2) \) being the Gaussian density. The main point is that we have the following integral (Eq. 1)

The substitution rule (chain rule, since \(\Phi'(x) = f(x)\)) suggests that the above can be evaluated analytically. So we substitute \(z=\frac{x-\mu}{\sigma}\), and we get \(I(\mu, \sigma)=\int_{-\infty}^{\infty} \Phi(\sigma z + \mu) f(z) \sigma dz\). If we differentiate with respect to \(\mu\), we get \(\frac{\partial I}{\partial \mu}=\int_{-\infty}^{\infty} f(\sigma z + \mu) f(z) \sigma dz\). Since the product of two Gaussians is a Gaussian (in \(z\)), this integral is just the normalization constant of a Gaussian density in \(z\), and that constant is itself a Gaussian density function in \(\mu\) (a few lines of algebra are omitted and may be a good exercise). Lastly, we can integrate over \(\mu\) to get back another Gaussian cumulative distribution, as in (Eq. 1).
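For readers who want the omitted algebra, the steps can be written out as follows. This is a reconstruction from the surrounding definitions, not a copy of the original display equations. It assumes that Eq. 1 integrates \(\Phi(x)\) against the unnormalized Gaussian \(f((x-\mu)/\sigma)\), which is what the factor \(\sigma\, dz\) after the substitution implies.

```latex
\begin{align}
I(\mu,\sigma)
  &= \int_{-\infty}^{\infty} \Phi(x)\, f\!\left(\frac{x-\mu}{\sigma}\right) dx
   = \sigma \int_{-\infty}^{\infty} \Phi(\sigma z+\mu)\, f(z)\, dz, \\
\frac{\partial I}{\partial \mu}
  &= \sigma \int_{-\infty}^{\infty} f(\sigma z+\mu)\, f(z)\, dz, \\
(\sigma z+\mu)^2 + z^2
  &= (1+\sigma^2)\left(z+\frac{\sigma\mu}{1+\sigma^2}\right)^{2}
   + \frac{\mu^2}{1+\sigma^2}, \\
\frac{\partial I}{\partial \mu}
  &= \frac{\sigma}{\sqrt{1+\sigma^2}}\,
     f\!\left(\frac{\mu}{\sqrt{1+\sigma^2}}\right), \\
I(\mu,\sigma)
  &= \sigma\, \Phi\!\left(\frac{\mu}{\sqrt{1+\sigma^2}}\right).
\end{align}
```

The last line uses \(I \to 0\) as \(\mu \to -\infty\). Dividing by \(\sigma\) gives the expectation of \(\Phi(x)\) under a normalized Gaussian with mean \(\mu\) and standard deviation \(\sigma\), namely \(\Phi(\mu/\sqrt{1+\sigma^2})\).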

So far everything is exact. Now we make the approximation \(\sigma(x) \approx \Phi(\sqrt{\pi/8}\, x)\), where \(\sigma(x)\) here denotes the logistic function rather than the standard deviation used above, to get the desired approximation. If one were to use probit regression instead of logistic regression, then this whole chain would be exact. Page 12 of my slides plots the errors. However, the inaccuracy from making the Gaussian assumption is a lot larger than the error of this approximation, so this is not at all the weakest link.
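A quick numerical check of both claims can be sketched as below, using only the Python standard library. It assumes formula (7) has the form \(\sigma(\mu/\sqrt{1+\pi s^2/8})\) for a Gaussian input with mean \(\mu\) and standard deviation \(s\). The function names, the trapezoid-rule integrator, and the test points are choices made for this illustration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_expectation(g, mu, s, half_width=10.0, n_steps=4000):
    """E[g(mu + s*Z)] for Z ~ N(0, 1), by the trapezoid rule on [-half_width, half_width]."""
    h = 2.0 * half_width / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        z = -half_width + i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * g(mu + s * z) * std_normal_pdf(z)
    return total * h

def probit_closed_form(mu, s):
    """Exact value of E[Phi(mu + s*Z)]."""
    return std_normal_cdf(mu / math.sqrt(1.0 + s * s))

def sigmoid_approximation(mu, s):
    """Approximate value of E[sigmoid(mu + s*Z)] via the probit trick."""
    return sigmoid(mu / math.sqrt(1.0 + math.pi * s * s / 8.0))

for mu, s in [(0.0, 1.0), (1.0, 2.0), (-2.0, 0.5)]:
    probit_gap = abs(gaussian_expectation(std_normal_cdf, mu, s) - probit_closed_form(mu, s))
    sigmoid_gap = abs(gaussian_expectation(sigmoid, mu, s) - sigmoid_approximation(mu, s))
    print(f"mu={mu}, s={s}: probit gap={probit_gap:.2e}, sigmoid gap={sigmoid_gap:.2e}")
```

The probit gap should sit at the level of integration error, while the sigmoid gap reflects the approximation \(\sigma(x) \approx \Phi(\sqrt{\pi/8}\, x)\) itself.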

Baselines and Bigrams

Q: The data structure seems weird. Why is it not just a sparse design matrix?

All the presented algorithms indeed just use a sparse design matrix as input. That is, these bag-of-words models do not make use of the order in which words appear. But the .mat data being loaded does contain order information.
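As a small illustration of the difference, the sketch below turns ordered token sequences into sparse count features. The token ids and helper names are hypothetical and do not reflect the actual layout of the .mat files. Two documents with the same words in a different order get identical unigram rows, while features that depend on adjacency, such as bigram counts, can only be computed if the order is stored.

```python
from collections import Counter

def unigram_counts(token_ids):
    """Sparse bag-of-words row: token id -> count. Word order is discarded."""
    return dict(Counter(token_ids))

def bigram_counts(token_ids):
    """Sparse row over adjacent token pairs. This needs the original order."""
    return dict(Counter(zip(token_ids, token_ids[1:])))

# Hypothetical documents: same tokens, different order.
doc_a = [3, 1, 4, 1, 5]
doc_b = [1, 5, 4, 1, 3]

print(unigram_counts(doc_a) == unigram_counts(doc_b))
print(bigram_counts(doc_a) == bigram_counts(doc_b))
```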