<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Sida I. Wang</title>
    <description></description>
    <link>http://www.sidaw.xyz/</link>
    <atom:link href="http://www.sidaw.xyz/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sat, 10 Jan 2026 18:31:42 +0000</pubDate>
    <lastBuildDate>Sat, 10 Jan 2026 18:31:42 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Why measure all the noises</title>
        <description>&lt;p&gt;For my paper &lt;a href=&quot;https://arxiv.org/abs/2512.21326&quot;&gt;Measuring all the noises of LLM evals&lt;/a&gt;,
to decompose and actually measure all noise types seemed to me the most useful message,
but I want to share the story behind it here and pose some remaining questions.
It started as a small analysis for &lt;a href=&quot;https://crux-eval.github.io/&quot;&gt;CRUXEval&lt;/a&gt; as power analysis and quality assurance,
but noise has been an obsession for me since my first paper &lt;a href=&quot;https://aclanthology.org/P12-2018.pdf&quot;&gt;Baselines and Bigrams&lt;/a&gt; in 2012, where in Table 1 we calculated the noise upper bound
based on which eval (test set) was used. At the time I did not understand paired comparisons well but had an appreciation for the bootstrap,
which is probably common for someone with an ML background.&lt;/p&gt;

&lt;p&gt;There is no doubt that guessable classification evals such as MNIST/MMLU need to be large enough and they are (~10k questions), since any correct answer could just be a random guess (~10 choices).
However, this is less clear with hard generative evals, initially represented by &lt;a href=&quot;https://github.com/openai/human-eval/&quot;&gt;HumanEval&lt;/a&gt;, which only has 164 problems.
Real human evals such as IMO, IOI, Putnam are similarly small. 
For our benchmark work &lt;a href=&quot;https://crux-eval.github.io/&quot;&gt;CRUXEval&lt;/a&gt;, we did a power analysis that showed 1k questions are needed to reliably distinguish the scaling law effect of doubling the model size, but then we thought it must be enough to be 5x larger than the leading generative eval of the time, HumanEval, so we settled on 800. Around the same time, I also attempted to increase the signal by modelling the difficulty of the questions, similar to &lt;a href=&quot;https://en.wikipedia.org/wiki/Item_response_theory&quot;&gt;Item Response Theory&lt;/a&gt;, but every attempt failed, a result also reported by &lt;a href=&quot;https://arxiv.org/abs/2406.10229&quot;&gt;Madaan et al, 2024&lt;/a&gt;. Projects such as &lt;a href=&quot;https://github.com/evalplus/evalplus&quot;&gt;EvalPlus&lt;/a&gt; improved the quality of the tests and audited the questions for errors. While this is intuitively very appealing to me, I saw no improvement in the noise levels nor the &lt;a href=&quot;https://github.com/crux-eval/eval-arena/blob/main/doc/eval-arena-readme.md#signal-to-noise-ratio&quot;&gt;signal-to-noise&lt;/a&gt; ratio. &lt;a href=&quot;https://openai.com/index/introducing-swe-bench-verified/&quot;&gt;SWEBench-Verified&lt;/a&gt; is another project where some human labour was used to improve the subjective data quality while reducing the data size. Another piece of evidence is that MBPP, a subjectively poor quality eval where a random sample likely contains errors, still gave measurably higher signal to noise than HumanEval, a subjectively high quality eval where most of the supposed &lt;a href=&quot;https://github.com/openai/human-eval/pull/23&quot;&gt;errors&lt;/a&gt; reflect realistic cases that you’d want to evaluate on rather than errors that give misleading signals. As an eval creator, I wondered if we should just focus on getting more questions vs. getting better quality questions. 
The noise level measurements suggest we should just have more questions, but intuitively it seems hard to trust a larger eval that is full of errors. Despite these more impressive generative evals, we rightly trusted the guessable MMLU more for model development, which has a larger size and &lt;a href=&quot;https://all-the-noises.github.io/signal_noise.html&quot;&gt;high signal-to-noise&lt;/a&gt;.
Some practitioners do not trust evals at all, and would rather trust their own interactive testing.&lt;/p&gt;

&lt;p&gt;As LLM evals start to capture even more interesting and not-guessable problems like LiveCodeBench and SWEBench, and as I contributed to a few more generative evals, I was confused by how small and noisy our evals have become in the pursuit of difficulty and agents, and by how to reconcile this with the fact that answering one hard, not-guessable question could remove all doubts. While we are rightly impressed by these hard evals, they are quite noisy, so we partly cope with the following methods: “look at the variance of the training curve”, “let’s average over multiple seeds”, and “let’s average over multiple predictions”, which we know to be nonsense in theory because the data noise is completely ignored by these methods.
With the unpaired analysis, since question difficulties are roughly bimodal beta distributed, the data noise is sure to be significant.
For the paired case, I also got some initial empirical evidence from CRUXEval: for the results sampled at T=0.2, the paired data noise is a bit higher than the prediction noise, but this turned out to be more of an exception.
With CoT, reasoning models, and sampling at more natural temperatures, prediction noise is typically higher.
At this time, practitioners often drew multiple predictions per question (HumanEval, CRUXEval) to measure pass@k, but perhaps they are also sensing the prediction noise as such metrics are developed.
With multiple predictions per question on CRUXEval, I tried to measure the data noise vs. prediction noise by resampling different objects with the bootstrap, which is correct but it was not clear (to me) how each variance relates to the total.
The traditional hypothesis testing framework also provides some answers, which are consistent with the bootstrap, but it was still unclear (to me) how to treat the prediction noise.
Then I read &lt;a href=&quot;https://arxiv.org/abs/2411.00640&quot;&gt;Miller, 2024&lt;/a&gt; more carefully, and their variance decomposition really clarified that these are meaningful components that can be measured. So I reanalyzed all the data using this method instead of the bootstrap and the sign test. While I still appreciate the bootstrap, when the answer is var(A) or p(1-p) or var(mean(A, axis=1)), “let’s use the bootstrap” is pretty underwhelming. More importantly, we still need to be clear about what exactly we are resampling: the data, the predictions, or both, with averaging vs. not, paired vs. unpaired. So the typical “we computed the 95% confidence intervals with the bootstrap” can be meaningless.
&lt;!-- I adapted  multiple predictions into estimators like `var(mean(A, axis=1))` and `var(mean(A, axis=1)-mean(B, axis=1))`, worked out the issues. --&gt;
&lt;!-- With a careful reading, Miller recommended averaging (resample) but their main example is for the unpaired case, which could not reduce the data noise too much, and so I dismissed the recommendation on my initial reading. --&gt;
&lt;!-- Their power analysis appendix did have a clear indication of the paired plus averaging. Still, all the numbers and examples are not based on real data.   --&gt;&lt;/p&gt;
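
&lt;p&gt;To make the decomposition concrete, here is a minimal sketch of the plug-in estimators named above, using only the Python standard library. The function name &lt;code&gt;decompose_noise&lt;/code&gt;, the simulated Beta(0.5, 0.5) question difficulties, and the sizes are illustrative assumptions, not the code or data behind the paper. &lt;code&gt;A[i][k]&lt;/code&gt; is 1 if sample &lt;code&gt;k&lt;/code&gt; on question &lt;code&gt;i&lt;/code&gt; is correct, and every question is assumed to have the same number of samples.&lt;/p&gt;

```python
import random
from statistics import fmean, pvariance


def decompose_noise(A):
    """Split the variance of 0/1 outcomes into data and prediction parts.

    A is a list of rows, one row per question; each row holds the 0/1
    correctness of several sampled predictions for that question.
    All rows are assumed to have the same length.
    """
    row_means = [fmean(row) for row in A]
    data_var = pvariance(row_means)                  # var(mean(A, axis=1))
    pred_var = fmean([pvariance(row) for row in A])  # mean(var(A, axis=1))
    p = fmean(row_means)
    total_var = p * (1 - p)                          # var(A) for 0/1 outcomes
    return data_var, pred_var, total_var


# Hypothetical eval: 500 questions, 8 samples each, bimodal difficulties.
rng = random.Random(0)
pass_probs = [rng.betavariate(0.5, 0.5) for _ in range(500)]
A = [rng.choices([1, 0], weights=[p, 1 - p], k=8) for p in pass_probs]

data_var, pred_var, total_var = decompose_noise(A)
# Law of total variance: the two parts add up to the total.
print(data_var, pred_var, data_var + pred_var, total_var)
```

&lt;p&gt;With only a few samples per question, &lt;code&gt;var(mean(A, axis=1))&lt;/code&gt; still absorbs a share of the prediction noise, so these should be read as rough plug-in estimates rather than unbiased ones.&lt;/p&gt;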

&lt;p&gt;So I took the opportunity to carefully measure all the noise types on all the eval data I could get, with clearly separated contributions of prediction noise vs. data noise, paired vs. unpaired, averaged vs. not. From these measurements, much of my confusion can be explained by the fact that prediction noise usually dominates the paired data noise:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Methods that attempt to model the questions like the &lt;a href=&quot;https://en.wikipedia.org/wiki/Item_response_theory&quot;&gt;Item Response Theory&lt;/a&gt; won’t work if most of the noise comes from model predictions instead of the questions, so we must first reduce the prediction noise before trying to model question difficulty.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Answering one hard question can only be very significant if it’s not guessable by the baseline model. While hard questions are not guessable by picking at random, they do seem to be guessable by weaker LLMs. So far we do not see these 0 to 1 improvements: most questions that can be answered by stronger models are also guessable by weaker models, just with lower probability. We can only observe this using the paired comparison with multiple prediction samples per question.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Looking at the training curve, sampling more predictions and using multiple seeds are actually sound methods when the prediction noise dominates. So it’s an example of not dismissing practical wisdom on the basis of theory.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The bonus is the main result of the paper: because LLMs are still very similar to each other, they also generate predictable total noise levels, so these measurements are actually more general than the specific pairs measured, and can be used to assess noise in context. Furthermore, we can actually get a lot more signal from current evals if we reduce the prediction noise by averaging multiple samples, detecting effect sizes 2-6x smaller, which is pretty nice. The paper goes into more detail on these main findings.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
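
&lt;p&gt;The effect of averaging on a paired comparison can be illustrated with a small simulation. This is a sketch under made-up assumptions (1000 questions, Beta(0.5, 0.5) difficulties, model A better than model B by 0.03 on every question); the helper names &lt;code&gt;paired_standard_error&lt;/code&gt; and &lt;code&gt;sample_outcomes&lt;/code&gt; are hypothetical and the numbers are not results from the paper.&lt;/p&gt;

```python
import math
import random
from statistics import fmean, pvariance


def paired_standard_error(A, B):
    """Return (mean difference, standard error) for a paired comparison.

    A[i] and B[i] hold 0/1 outcomes of two models on the same question i.
    Each question contributes the difference of its per-model averages,
    so the variance used is var(mean(A, axis=1) - mean(B, axis=1)).
    """
    diffs = [fmean(a) - fmean(b) for a, b in zip(A, B)]
    return fmean(diffs), math.sqrt(pvariance(diffs) / len(diffs))


def sample_outcomes(pass_probs, k, rng):
    """Draw k independent 0/1 predictions per question."""
    return [rng.choices([1, 0], weights=[p, 1 - p], k=k) for p in pass_probs]


# Hypothetical setup: model A is slightly better than model B everywhere.
rng = random.Random(0)
probs_b = [rng.betavariate(0.5, 0.5) for _ in range(1000)]
probs_a = [min(1.0, p + 0.03) for p in probs_b]

for k in (1, 16):
    diff, se = paired_standard_error(
        sample_outcomes(probs_a, k, rng), sample_outcomes(probs_b, k, rng)
    )
    print(f"k={k}: difference={diff:.3f}, standard error={se:.4f}")
```

&lt;p&gt;In this toy setup almost all of the paired noise is prediction noise, so averaging more samples per question shrinks the standard error substantially; when the paired data noise is larger, the gain from averaging is smaller.&lt;/p&gt;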

&lt;p&gt;A few related big-picture questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;I see that LLMs have high prediction noise, but compared to what? Do humans have similarly high prediction noise? Groups of humans also seem to have pretty high prediction noise looking at the &lt;a href=&quot;https://www.imo-official.org/year_individual_r.aspx?year=2025&quot;&gt;IMO&lt;/a&gt; and IOI question level results among different humans. While I don’t know how to measure this for individual humans, I think individual humans have less prediction noise when they have a grounded understanding of the material and enough time, where they tend to know if they really answered a question.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;How to get more bits of feedback from the trajectory of solving a hard problem? This will be important for both training and evals. 99% of my learning does not come with a label, and when the label/reward comes it’s usually noisy and unreliable anyway.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Can we define guessable and relatedly pass@inf more precisely? Everything is guessable with a random string, and pass@inf is always 1. So somehow there needs to be a tier of generations that’s better than random, maybe like in the typical set of the model.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few related technical questions, roughly in increasing vagueness:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Can we still reduce the prediction noise by lowering the temp if we also have some noisy model changes like a small number of training steps? My feeling is we cannot reduce the prediction noise that way, just like how we cannot reduce the prediction noise robustly via fixing the random seed. This can easily be tested but I have not.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;So after reducing prediction noise, is it worth revisiting IRT and the like? There may still be other obstacles or maybe it’s a big success.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Is there something obvious causing the bimodal Beta distribution? If not, does it say something about what LLMs are?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Recommending averaging still causes some discomfort to me. The paper has a caveat section elaborating on that where I tried but failed to come up with something that really defeats averaging. With these non-fatal flaws as discussed, it seems better than having large prediction noise. But maybe there is something worse.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Comments and discussion about this work are welcome. Email or &lt;a href=&quot;https://github.com/crux-eval/eval-arena&quot;&gt;github issues&lt;/a&gt; are both good formats.&lt;/p&gt;

</description>
        <pubDate>Fri, 26 Dec 2025 07:00:00 +0000</pubDate>
        <link>http://www.sidaw.xyz/research/2025/12/26/allthenoise.html</link>
        <guid isPermaLink="true">http://www.sidaw.xyz/research/2025/12/26/allthenoise.html</guid>
        
        
        <category>research</category>
        
      </item>
    
      <item>
        <title>Interactive language learning</title>
        <description>&lt;p&gt;&lt;em&gt;Nadav Lidor, Sida Wang&lt;/em&gt; (cross-posted from &lt;a href=&quot;http://nlp.stanford.edu/blog/interactive-language-learning/&quot;&gt;StanfordNLP&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Today, natural language interfaces (NLIs) on computers or phones are often trained once and deployed, and users must just live with their limitations. Allowing users to demonstrate or teach the computer appears to be a central component to enable more natural and usable NLIs. In language acquisition research, there is considerable evidence suggesting that human children require interactions to learn language, as opposed to passively absorbing language, such as when watching TV (&lt;a href=&quot;/res/2017-1-2-interactivelanguage/kuhl_nature_neuroscience_reviews_2004.pdf&quot;&gt;Kuhl et al., 2003&lt;/a&gt;, &lt;a href=&quot;https://www.cambridge.org/core/journals/applied-psycholinguistics/article/language-learning-with-restricted-input-case-studies-of-two-hearing-children-of-deaf-parents/4F5BF799996DCD5977A94BC5F1233578&quot;&gt;Sachs et al., 1981&lt;/a&gt;). Research suggests that when learning a language, rather than consciously analyzing increasingly complex linguistic structures (e.g. sentence forms, word conjugations), humans advance their linguistic ability through meaningful interactions (&lt;a href=&quot;/res/2017-1-2-interactivelanguage/principles_and_practice.pdf&quot;&gt;Krashen, 1983&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In contrast, the standard machine learning dataset setting has no interaction. The feedback stays the same and does not depend on the state of the system or the actions taken. We think that interactivity is important, and that an interactive language learning setting will enable adaptive and customizable systems, especially for resource-poor languages and new domains where starting from close to scratch is unavoidable.&lt;/p&gt;

&lt;p&gt;We describe two attempts towards interactive language learning — an agent for manipulating blocks, and a calendar scheduler.&lt;/p&gt;

&lt;h2 id=&quot;language-games-a-blocks-world-domain-to-learn-language-interactively&quot;&gt;Language Games: A Blocks-World Domain to Learn Language Interactively&lt;/h2&gt;

&lt;p&gt;Inspired by the human language acquisition process, we investigated a simple setting where language learning starts from scratch. We explored the idea of language games, where the computer and the human user need to collaboratively accomplish a goal even though they do not initially speak a common language. Specifically, in our pilot we created a game called SHRDLURN, in homage to the seminal work of Terry Winograd. As shown in Figure 1a, the objective is to transform a start state into a goal state, but the only action the human can take is entering an utterance. The computer parses the utterance and produces a ranked list of possible interpretations according to its current model. The human scrolls through the list and chooses the intended one, simultaneously advancing the state of the blocks and providing feedback to the computer. Both the human and the computer wish to reach the goal state (only known to the human) with as little scrolling as possible. For the computer to be successful, it has to learn the human’s language quickly over the course of the game, so that the human can accomplish the goal more efficiently. Conversely, the human can also speed up progress by accommodating to the computer, by at least partially understanding what it can and cannot currently do.&lt;/p&gt;

&lt;p&gt;We model the computer as a semantic parser (&lt;a href=&quot;/res/2017-1-2-interactivelanguage/uai05.pdf&quot;&gt;Zettlemoyer and Collins, 2005&lt;/a&gt;; &lt;a href=&quot;/res/2017-1-2-interactivelanguage/liang-potts-semantics.pdf&quot;&gt;Liang and Potts, 2015&lt;/a&gt;), which maps natural language utterances (e.g., ‘remove red’) into logical forms (e.g., remove(with(red))). The semantic parser has no seed lexicon and no annotated logical forms, so it just generates many candidate logical forms. From the human’s feedback, it learns by adjusting the parameters corresponding to simple and generic lexical features. It is crucial that the computer learns quickly, or users are frustrated and the system is less usable. In addition to feature engineering and tuning online learning algorithms, we achieved higher learning speed by incorporating pragmatics.&lt;/p&gt;

&lt;p&gt;However, what is special here is the real-time nature of learning, in which the human also learns and adapts to the computer, thus making it easier to achieve good task performance. While the human can teach the computer any language - in our pilot, Mechanical Turk users tried English, Arabic, Polish, and a custom programming language - a good human player will choose to use utterances so that the computer is more likely to learn quickly.&lt;/p&gt;

&lt;p&gt;You can find more information in the &lt;a href=&quot;https://arxiv.org/abs/1606.02447&quot;&gt;SHRDLURN paper&lt;/a&gt;, a &lt;a href=&quot;http://shrdlurn.sidaw.xyz&quot;&gt;demo&lt;/a&gt;, code, data, and experiments on &lt;a href=&quot;https://worksheets.codalab.org/worksheets/0x9fe4d080bac944e9a6bd58478cb05e5e&quot;&gt;CodaLab&lt;/a&gt; and the &lt;a href=&quot;https://github.com/sidaw/shrdlurn/tree/acl16-demo&quot;&gt;client side code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/res/2017-1-2-interactivelanguage/SHRDLURN%20photo.png&quot; alt=&quot;alt text&quot; title=&quot;SHRDLURN Interface&quot; /&gt;  &lt;img src=&quot;/res/2017-1-2-interactivelanguage/SCHEDLURN%20photo.png&quot; alt=&quot;alt text&quot; title=&quot;SCHEDLURN Interface&quot; /&gt;&lt;/p&gt;

&lt;p&gt;1a SHRDLURN (top) and 1b SCHEDLURN (bottom)&lt;/p&gt;

&lt;p&gt;Figure 1: 1a: A pilot for learning language through user interaction. The system attempts an action in response to a user instruction and the user indicates whether it has chosen correctly. This feedback allows the system to learn word meaning and grammar. 1b: the interface for interactive learning in the calendars domain.&lt;/p&gt;

&lt;h2 id=&quot;a-calendar-employing-community-learning-with-demonstration&quot;&gt;A Calendar Employing Community Learning with Demonstration&lt;/h2&gt;

&lt;p&gt;Many challenges remain if we want to advance to NLIs for broader domains. First, in order to scale to more open, complex action spaces, we need richer feedback signals that are both natural for humans and useful for the computer. Second, to allow for quick, generalizable data collection, we seek to support collective, rather than individual, languages, in a community-based learning framework. We now outline our first attempt at addressing these challenges and scaling the framework to a calendar setting. You can find a &lt;a href=&quot;https://youtu.be/PfW4_3tCiw0&quot;&gt;short video overview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Event scheduling is a common yet unsolved task: while several available calendar programs allow limited natural language input, in our experience they all fail as soon as they are given something slightly complicated, such as ‘Move all the tuesday afternoon appointments back an hour’. We think interactive learning can give us a better NLI for calendars, which would have more real-world impact than the blocks world. Furthermore, aiming to expand our learning methodology from definition to demonstration, we chose this domain as most users are already familiar with the common calendar GUI and have an intuition for its manual manipulation. Additionally, as calendar NLIs are already deployed, particularly on mobile, we hoped users would naturally be inclined to use natural language style phrasing rather than a more technical language as we saw in the blocks world domain. Lastly, a calendar is a considerably more complex domain, with a wider set of primitives and possible actions, and will allow us to test our framework with a larger action space.&lt;/p&gt;

&lt;h3 id=&quot;learning-from-demonstration-and-community&quot;&gt;Learning from Demonstration and Community&lt;/h3&gt;

&lt;p&gt;In our pilot, user feedback was provided by scrolling and selecting the proper action for a given utterance - a process both unnatural and un-scalable for large action spaces. Feedback signals in human communication include reformulation, paraphrases, repair sequences etc. (&lt;a href=&quot;/res/2017-1-2-interactivelanguage/Clark.UsingLanguage.Ch12.96.pdf&quot;&gt;Clark, 1996&lt;/a&gt;). We expanded our system to receive feedback through demonstration, as it is 1) natural for people, especially using a calendar, allowing for easy data collection, and 2) informative for language learning and can be leveraged by current machine learning methods. In practice, if the correct interpretation is not among the top choices, the system falls back to a GUI and the user uses the GUI to show the system what they meant. Algorithms for learning from denotations are well-suited for this, where the interactivity can potentially help in the search for the latent logical forms.&lt;/p&gt;

&lt;p&gt;While learning and adapting to each user provided a clean setting for the pilot study, we would not expect good coverage if each person has to teach the computer everything from scratch. Despite individual variations, there should be much in common across users which allows the computer to learn faster and generalize better. For our calendar, we abandoned the individualized user-specific language model for a collective community model where a model consists of a set of grammar rules and parameters collected across all users and interactions. Each user contributes to the expressiveness and complexity of the language where jargons and conventions are invented, modified, or rejected in a distributed way.&lt;/p&gt;

&lt;h3 id=&quot;preliminary-results&quot;&gt;Preliminary Results&lt;/h3&gt;

&lt;p&gt;Using Amazon Mechanical Turk (AMT), we paid 20 workers 2 dollars each to play with our calendar. Out of 356 total utterances, in 196 cases the worker selected a state out of the suggested ranked list as the desired calendar state, and in 68 cases the worker used the calendar GUI to manually modify and submit feedback by demonstration.&lt;/p&gt;

&lt;p&gt;A small subset of the commands collected is displayed in Figure 2. While a large percentage involved relatively simple commands (Basic), AMT workers did challenge the system with complex tasks using non-trivial phrasing (Advanced). As we hoped, users were highly inclined to use natural language, and did not develop a technical, artificial language. A small number of commands were questionable in nature, with unusual calendar commands (see Questionable).&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Basic&lt;/th&gt;
      &lt;th&gt;Advanced&lt;/th&gt;
      &lt;th&gt;Questionable&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;move “ideas dinner tressider” to Saturday&lt;/td&gt;
      &lt;td&gt;change “family room” to “game night” and add location “family room”&lt;/td&gt;
      &lt;td&gt;duplicate all calendar entries&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;cancel “team lunch” Friday between 12 pm and 1 pm&lt;/td&gt;
      &lt;td&gt;Duplicate the “family dinner” event to 9pm today&lt;/td&gt;
      &lt;td&gt;remove all appointments for the entire week&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Change “golf lesson” to 5pm&lt;/td&gt;
      &lt;td&gt;remove all appointments on monday&lt;/td&gt;
      &lt;td&gt;Remove all entries&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Schedule a “meeting with Bob” Tuesday at 10:30am&lt;/td&gt;
      &lt;td&gt;change all “team lunch” to after 2 pm&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Figure 2. A categorized sample of commands collected in our experiment&lt;/p&gt;

&lt;p&gt;To assess learning performance, we measure the system’s ability to correctly predict the correct calendar action given a natural language command. We see that the top-ranked action is correct about 60% of the time, and the correct meaning is in the top three system-ranked actions about 80% of the time.&lt;/p&gt;

&lt;h2 id=&quot;discussion&quot;&gt;Discussion&lt;/h2&gt;

&lt;p&gt;The key challenge is figuring out which feedback signals are both usable for the computer and natural for humans. We explored providing alternatives and learning from demonstration. We are also trying definitions and rephrasing. For example, when a user rephrases “my meetings tomorrow morning” as “my meetings tomorrow after 7 am and before noon”, we can infer the meaning of “morning”.&lt;/p&gt;

&lt;p&gt;Looking forward, we believe NLIs must learn through interaction with users, and improve over time. NLIs have the potential to replace GUIs and scripting for many tasks, and doing so can bridge the great digital divide of skills and enable all of us to better make use of computers.&lt;/p&gt;

</description>
        <pubDate>Sun, 01 Jan 2017 07:00:00 +0000</pubDate>
        <link>http://www.sidaw.xyz/research/2017/01/01/interactivelanguage.html</link>
        <guid isPermaLink="true">http://www.sidaw.xyz/research/2017/01/01/interactivelanguage.html</guid>
        
        
        <category>research</category>
        
      </item>
    
      <item>
        <title>questions about some of my papers (old)</title>
        <description>&lt;p&gt;This page contains my answers to questions about my papers, general comments that maybe inappropriate to include in the actual paper and references to related and followup works.&lt;/p&gt;

&lt;h2 id=&quot;fast-dropout&quot;&gt;Fast dropout&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://winsty.net/talks/dropout.pptx&quot;&gt;Some slides by Naiyan Wang&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://hips.seas.harvard.edu/blog/2013/08/01/icml-highlight-fast-dropout-training/&quot;&gt;post-ICML discussions&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;errata&quot;&gt;Errata&lt;/h3&gt;

&lt;p&gt;The equation before (7): \(s\) should be changed to \( s^2 \).&lt;/p&gt;

&lt;h3 id=&quot;questions&quot;&gt;Questions&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Q: Did you have any experiments with the regularized LR? I don’t see any&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I did not mean plain as in unregularized. The provided code does a scan over all L2 regularization parameters to show that you cannot choose any L2 strength to beat this Gaussian dropout, at least on some datasets…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What is MC Dropout, and Real Dropout?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sorry about the inconsistency. MC means Monte Carlo, and Real means using MC to do real dropout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Where does the approximation formula (7) come from?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I got this trick from this paper:&lt;/p&gt;

&lt;p&gt;MacKay, David J.C. The evidence framework applied to classification networks&lt;/p&gt;

&lt;p&gt;Firstly, we stress that this trick is non-essential to the main point of the Fast Dropout paper, since accurately computing the value of any smooth function in 1D or 2D is probably quite easy by tabulating and interpolating.&lt;/p&gt;

&lt;p&gt;However, the trick is quite interesting and it does give us some insight into the effect of dropout. So here is how it goes:
Let
\( \Phi(x)=\int_{-\infty}^{x} f(t)\ dt\) be the Gaussian cumulative distribution with \( f(x) = \frac{1}{\sqrt{2 \pi}}\exp(-x^2/2) \) being the Gaussian density. The main point is that we have the following integral (Eq. 1)&lt;/p&gt;

\[\int_{-\infty}^{\infty} \Phi(x) \frac{1}{\sigma} f\left(\frac{x-\mu}{\sigma}\right) dx = \Phi\left(\frac{\mu}{\sqrt{1+\sigma^2}}\right)\]

&lt;p&gt;The substitution rule (chain rule, since \(\Phi'(x) = f(x)\)) suggests that the above can be evaluated analytically. So we substitute \(z=\frac{x-\mu}{\sigma}\), so that \(dx = \sigma\, dz\) cancels the \(\frac{1}{\sigma}\), and we get
\(I(\mu, \sigma)=\int_{-\infty}^{\infty} \Phi(\sigma z + \mu) f(z)\, dz\). If we differentiate with respect to \(\mu\) we get:
\(\frac{\partial I(\mu, \sigma)}{\partial \mu}=\int_{-\infty}^{\infty} f(\sigma z + \mu) f(z)\, dz\).
Since the product of two Gaussians is a Gaussian (in \(z\)), the above integral is just the normalization constant of the Gaussian density in \(z\), which is itself a Gaussian density function in \(\mu\) with variance \(1+\sigma^2\) (a few lines of algebra omitted, and may be a good exercise). Lastly, we can integrate over \(\mu\) to get back the Gaussian cumulative distribution in (Eq. 1).&lt;/p&gt;

&lt;p&gt;So far everything is exact, and now we make the approximation that \(\sigma(x) \approx \Phi(\sqrt{\pi/8}\, x)\), where \(\sigma(\cdot)\) here denotes the logistic sigmoid rather than the standard deviation in Eq. 1, to get the desired approximation. If one were to use probit regression instead of logistic regression, then this whole chain is exact.
Page 12 of my slides plots the errors.
However, the inaccuracy from making the Gaussian assumption is a lot larger than this approximation here so this is not at all the weakest link.&lt;/p&gt;
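
&lt;p&gt;As a sanity check, both steps can be verified numerically. The sketch below uses only the Python standard library and a plain Riemann sum; the helper names and the test values \(\mu=1\), \(\sigma=2\) are arbitrary assumptions. It checks that \(E[\Phi(X)] = \Phi(\mu/\sqrt{1+\sigma^2})\) for \(X \sim N(\mu, \sigma^2)\), and compares \(E[\mathrm{sigmoid}(X)]\) with \(\mathrm{sigmoid}(\mu/\sqrt{1+\pi\sigma^2/8})\), which is what combining Eq. 1 with the probit approximation gives.&lt;/p&gt;

```python
import math


def Phi(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_expectation(g, mu, sigma, n=20001, width=10.0):
    """Riemann-sum estimate of E[g(X)] for X ~ N(mu, sigma^2).

    The grid covers mu plus or minus width * sigma, where the density
    outside is negligible.
    """
    lo = mu - width * sigma
    dx = 2.0 * width * sigma / (n - 1)
    xs = [lo + i * dx for i in range(n)]
    return sum(g(x) * normal_pdf(x, mu, sigma) for x in xs) * dx


mu, sigma = 1.0, 2.0  # arbitrary test values

# Exact identity (Eq. 1).
lhs_exact = gaussian_expectation(Phi, mu, sigma)
rhs_exact = Phi(mu / math.sqrt(1.0 + sigma ** 2))

# Approximation obtained by replacing the sigmoid with a scaled probit.
lhs_approx = gaussian_expectation(sigmoid, mu, sigma)
rhs_approx = sigmoid(mu / math.sqrt(1.0 + math.pi * sigma ** 2 / 8.0))

print("exact identity:", lhs_exact, rhs_exact)
print("sigmoid approximation:", lhs_approx, rhs_approx)
```

&lt;p&gt;The first pair should agree up to numerical integration error, while the second pair differs by the small error of the sigmoid-to-probit approximation.&lt;/p&gt;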

&lt;h2 id=&quot;baselines-and-bigrams&quot;&gt;Baselines and Bigrams&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: The data structure seems weird, why is it not just a sparse design matrix?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All the presented algorithms indeed just use a sparse design matrix as input. That is, these bag-of-words models do not make use of the order in which words appear. But the .mat data being loaded does contain order information.&lt;/p&gt;
</description>
        <pubDate>Mon, 25 Mar 2013 07:17:46 +0000</pubDate>
        <link>http://www.sidaw.xyz/discussion/fastdropout/errata/2013/03/25/discussion.html</link>
        <guid isPermaLink="true">http://www.sidaw.xyz/discussion/fastdropout/errata/2013/03/25/discussion.html</guid>
        
        
        <category>discussion</category>
        
        <category>fastdropout</category>
        
        <category>errata</category>
        
      </item>
    
  </channel>
</rss>
