Naftali Harris: Statistician, Hacker, and Climber
https://www.naftaliharris.com
Blog on Statistics, hacking, and climbing

Map Transformation
https://www.naftaliharris.com/blog/map-transformation/
Note: I was describing my map transformation project to Sasha Trubetskoy recently, who is even more into maps than I am. This is a project I completed in March 2011 with Shir Yehoshua for a CS class, where we transformed a map of Washington DC so that the Euclidean distances on the transformed map became more proportional to travel times. I've described this project many times over the years to various friends, and figured that it was long past time I wrote it up. What follows is a write-up that's partly based on contemporaneous emails I wrote about this project, and partly based on my now-hazy recollection:
Fri, 23 Mar 2018 00:00:00 PST

Style for Python Multiline If-Statements
https://www.naftaliharris.com/blog/python-multiline-if-statements/
PEP 8 gives a number of acceptable ways of handling multiple line if-statements in Python. But to be honest, most of the styles I've seen--even those that conform with the PEP--seem ugly and hard to read for me. So here's my favorite style:
Mon, 10 Apr 2017 00:00:00 PST

Hypothesis Tests for Machine Learning
https://www.naftaliharris.com/blog/machine-learning-hypothesis-tests/
Statisticians have spent a lot of time attempting to do complicated inference for various machine learning models. In fact, there's an enormously simple and naive way to do this in complete generality: Simply use a paired T-test to compare the performance of two models on your test set!
Mon, 27 Mar 2017 00:00:00 PST

The Communication Loss Function
https://www.naftaliharris.com/blog/communication-loss-function/
My first ever task writing software professionally was to make some small change to the Kaggle server. I spent a day or so painstakingly moving down the call stack from the API endpoint to figure out which file I needed to make my changes in. I proudly showed my mentor the five line change I'd made. His response: "Oh, that file is actually automatically generated. Your changes are going to get wiped out during the build."
Mon, 20 Mar 2017 00:00:00 PST

Logistic Regression Isn't Interpretable
https://www.naftaliharris.com/blog/logistic-regression-uninterpretable/
Suppose two events A and B are independent, with the odds of A occurring being 4, and the odds of B being 5. What are the odds of both A and B occurring? I'll give you a hint: it's not 20.
Mon, 13 Mar 2017 00:00:00 PST

Nontrivial: Exception Handling in Python
https://www.naftaliharris.com/blog/nontrivial-exception-handling-python/
Most code can fail in multiple places and for multiple reasons. Handling these failures seems pretty trivial, something you'd cover in the basic tutorial to your programming language. Actually, I think that doing this well can be surprisingly subtle, and ultimately dances around the control flow constructs of your programming language of choice. Let me illustrate with a simple example in Python, my second-favorite programming language:
Mon, 06 Mar 2017 00:00:00 PST

Implementing "nonlocal" in Tauthon: Part I
https://www.naftaliharris.com/blog/nonlocal/
Tauthon is a fork of Python 2.7 with syntax, builtins, and libraries backported from Python 3. It aspires to be able to run all valid Python 2 and 3 code. In this article, I begin discussing how I was able to backport the "nonlocal" keyword from Python 3. I hope this post is useful for people who are interested in hacking on the CPython interpreter or CPython forks: it sounds hard, and it can be a bit tedious, but it's actually a lot easier than you'd think.
Mon, 27 Feb 2017 00:00:00 PST

Day-to-Day Operations of Palo Alto
https://www.naftaliharris.com/blog/palo-alto-finances/
Palo Alto runs a pretty open city government, with a number of interesting documents available for download on their website. Of particular interest are their annual budgets and annual financial reports, both of which are for the fiscal year ending June 30. These documents are a few hundred pages long each and full of accounting tables, but with a bit of persistence and the help of some friendly city employees I think I was able to figure out much of what is going on. In this post I give an overview of what the city of Palo Alto does on a day-to-day basis (i.e., excluding long-term capital projects).
Mon, 20 Feb 2017 00:00:00 PST

Continuous Time Lending
https://www.naftaliharris.com/blog/continuous-time-lending/
Assume a borrower takes out an installment loan of size $1$ and makes continuous-time payments on it. The installment loan starts at time $0$, ends at time $T$, and has an interest rate of $r$, compounding continuously. We'll let $P(t)$ be the principal owed by the borrower at time $t$, and $I(t)$ be the interest they've paid on the loan so far. Unlike in discrete time lending, we don't need to keep track of the amount of unpaid interest owed by the borrower; it's always zero since the borrower makes continuous payments. To check your comprehension, $P(0) = 1$, $P(T) = 0$, $I(0) = 0$, and $I(T)$ is the total amount of interest the borrower will pay on their loan (assuming no defaults, fees, or prepayment).
Sun, 15 Jan 2017 00:00:00 PST

An Easy Chess Puzzle
https://www.naftaliharris.com/blog/chess-puzzle/
I was looking through Markovian, my old chess engine, recently, and came across the first game it won against another chess engine. Stepping through the game, it seems that both engines actually played pretty poorly. Even so, I was proud that Markovian found a long forced mate to end the game. Here's that mate in puzzle form; play black, and find the fastest mate. (If there's more than one, any will be accepted).
Thu, 29 Dec 2016 00:00:00 PST

Why I'm Making Tauthon
https://www.naftaliharris.com/blog/why-making-python-2.8/
For the past two months I've been spending half my time on Tauthon. Tauthon is a backwards-compatible Python interpreter that runs Python 2 code and C-extensions exactly as-is, while also allowing Python 2 programmers to use the most exciting new language features from Python 3. These new backported language features include async/await syntax, function annotations and typing support, keyword-only arguments, and new metaclass syntax, among many others. I use Tauthon as my system python now, and haven't had any problems running my old 2.7 code or using packages like IPython, pip, numpy, pandas, requests, and flask. I've been enjoying the new language features--I especially like underscores in numeric literals!
Wed, 30 Nov 2016 00:00:00 PST

Desperation Motivated Creativity
https://www.naftaliharris.com/blog/desperation-motivated-creativity/
I am not the strongest climber. Some of the people I've climbed with are so strong that they can do a one-arm pull-up, and then--while locking off with one arm--sing the "Head, Shoulders, Knees and Toes" song and use the other arm to do the corresponding dance. This is not in my future anytime soon.
Mon, 25 Jul 2016 00:00:00 PST

OHMS Lessons Learned
https://www.naftaliharris.com/blog/ohms-lessons-learned/
Note: I found the following post as an almost complete draft as I was reading some of my unpublished posts. I wrote it around October 1st, 2013, at the beginning of what would end up being my last year at Stanford. Later that quarter I learned a lot more lessons from OHMS, including--the hard way, at 12:30AM the night before a homework was due--not to run SQLite over a network file system. Wanting to write up some of those additional lessons probably contributed to my never publishing this until now. I've corrected typos or completed sentences in four or five places to make this post publishable, but the rest of it is exactly as it was in October 2013:
Sun, 10 Jul 2016 00:00:00 PST

Machine Learning over JSON
https://www.naftaliharris.com/blog/machine-learning-json/
Supervised machine learning is the problem of approximating functions X -> Y from many example (x, y) pairs. Now, the vast majority of supervised learning algorithms assume that X is p-dimensional Euclidean space. As I'll argue, this is a poor model for many problems, and modeling X as a JSON document fits much better.
Wed, 20 May 2015 00:00:00 PST

Visualizing DBSCAN Clustering
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
A previous post covered clustering with the k-means algorithm. In this post, we consider a fundamentally different, density-based approach called DBSCAN. In contrast to k-means, which modeled clusters as sets of points near to their center, density-based approaches like DBSCAN model clusters as high-density clumps of points. To begin, choose a data set below:
Sat, 24 Jan 2015 00:00:00 PST

You Can't Predict Small Counts
https://www.naftaliharris.com/blog/predicting-small-counts/
A small restaurant is interested in predicting how many customers will come in on a given night. This is valuable information to know ahead of time, for example, so that the restaurant can figure out whether to ask employees to work extra shifts. Unfortunately, under reasonable conditions no amount of data will permit even the most talented data scientist or statistician to make particularly good predictions.
Sat, 17 Jan 2015 00:00:00 PST

Half the Decimal Trick
https://www.naftaliharris.com/blog/half-the-decimal-trick/
If something happened 1,234 out of 10,000 times, we'd estimate that the true probability of occurrence is about 0.1234. Of course, we wouldn't expect the true probability to be exactly 0.1234, and to quantify the uncertainty in this estimate statisticians have long computed confidence intervals. But in this particular case, there's a simple eyeballing trick we can use to get approximate error bars: we round the proportion to half the number of decimal places, (0.12|34 becomes 0.12), and add a plus or minus 1 in the least significant digit, (0.12 +/- 0.01).
Fri, 09 Jan 2015 00:00:00 PST

How to Forge an Email
https://www.naftaliharris.com/blog/forging-emails/
Most people don't realize how easy it is to forge an email. Say my brother John Doe uses the email address john.doe@example.com. If I get an email from that address, it's natural to assume that John actually sent it. In fact, it's also remarkably easy for an attacker to have sent it.
Mon, 22 Dec 2014 00:00:00 PST

T-Tests Aren't Monotonic
https://www.naftaliharris.com/blog/t-test-non-monotonic/
R. A. Fisher and Karl Pearson play a heated round of golf. Being Statisticians, they agree before the round to run a two-sided paired T-test to see if either of them is statistically significantly better. After the first 17 holes, Fisher is ahead by 19 strokes, and openly gloating. On the 18th hole, he sinks a 20-foot putt for birdie, and smirks at Pearson. Pearson then "accidentally" hits his ball into several sand traps, trees, and water hazards, taking 100 strokes on the last hole.
Wed, 22 Oct 2014 00:00:00 PST

Robust Machine Learning
https://www.naftaliharris.com/blog/robust-machine-learning/
Real data often has incorrect values in it. Origins of incorrect data include programmer errors, ("oops, we're double counting!"), surprise API changes, (a function used to return proportions, suddenly it instead returns percents), or poorly scraped data. When working with data, a desirable property of whatever you're doing is that it be robust, or continue to work in the presence of some incorrect values.
Sun, 05 Oct 2014 00:00:00 PST

Python Subclass Relationships Aren't Transitive
https://www.naftaliharris.com/blog/python-subclass-intransitivity/
Subclass relationships are not transitive in Python. That is, if A is a subclass of B, and B is a subclass of C, it is not necessarily true that A is a subclass of C. The reason for this is that with PEP 3119, anyone is allowed to define their own, arbitrary __subclasscheck__ in a metaclass.
Tue, 26 Aug 2014 00:00:00 PST

Sensitivity of Independence Assumptions
https://www.naftaliharris.com/blog/sensitivity-of-independence-assumption/
Recently I was considering an interesting problem: Several people interview a potential job candidate, and each of them scores that candidate numerically on some scale. What's the variation associated with the average score?
Sun, 03 Aug 2014 00:00:00 PST

Visualizing Lasso Polytope Geometry
https://www.naftaliharris.com/blog/lasso-polytope-geometry/
Some recent research about the lasso exploits a beautiful geometric picture: Suppose you fix the design matrix X and the regularization parameter $\lambda$. For a particular value of y, the n-dimensional response variable, you can then solve the lasso problem and examine the signs of the coefficients. Now if you partition n-dimensional space based upon the signs of the coefficients you'd get if you solved the lasso problem at each value of y, then you'll find that each of the resulting regions is a convex polytope. This is kind of a mouthful, so I made the visualization below to illustrate this partitioning. You can drag x1, x2, and x3 to rotate them, and slide the slider to increase or decrease $\lambda$.
Sun, 08 Jun 2014 00:00:00 PST

College Interview Tips
https://www.naftaliharris.com/blog/college-interview-tips/
The college admissions interview is a valuable component of college applications because it provides admissions officers with a holistic evaluation from a source that has no vested interest in your success. This contrasts with your teacher evaluations, which are holistic but (hopefully) written by people who want you to be admitted, and your standardized test scores, which aren't biased towards your success the way that teacher evaluations are, but which aren't holistic at all.
Thu, 15 May 2014 00:00:00 PST

A College Waitlisting Model
https://www.naftaliharris.com/blog/college-waitlist-model/
Suppose a selective college wants $N_0$ students in their freshman class. How many students should they admit, and what's the distribution of the number of students they'll admit off the waitlist? Of course, you could just look for data from previous years, but in the fast-changing world of college admissions, data goes stale quickly. And honestly, I really enjoy simple probability models!
Sun, 04 May 2014 00:00:00 PST

Don't Double Major
https://www.naftaliharris.com/blog/dont-double-major/
Double majoring in college is a very suboptimal strategy. The reason is simple: It adds a substantial set of constraints to the courses you can take, but in return gives you only a very modest extra credential.
Tue, 08 Apr 2014 00:00:00 PST

A Statistical Analysis of Climbing
https://www.naftaliharris.com/blog/climbing-statistical-analysis/
Recently, 28 of us on the Stanford Climbing Team completed a short survey on our climbing abilities. Although the survey was intended to assess our interest in different clinics, the answers to the survey questions also shed light on some interesting climbing questions, like how bouldering grades compare to top-rope grades, how much harder leading is than top-roping, and what different "climber types" there are. These questions really excited me, so I asked for permission to analyze this data, which the team graciously granted.
Mon, 17 Feb 2014 00:00:00 PST

How to Solve Problems
https://www.naftaliharris.com/blog/problem-solving/
I spend a lot of my time solving problems: I solve well-defined math problems in grad school, open-ended problems in statistical consulting, physical puzzles when I'm rock climbing, architectural problems when I build software systems, and mysteries when I debug them. Here are some strategies that have worked for me in solving the problems in these domains, presented in the order that I usually try them in. My hope is that they prove useful to you in solving your problems:
Tue, 04 Feb 2014 00:00:00 PST

Visualizing K-Means Clustering
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
Suppose you plotted the screen width and height of all the devices accessing this website. You'd probably find that the points form three clumps: one clump with small dimensions, (smartphones), one with moderate dimensions, (tablets), and one with large dimensions, (laptops and desktops). Getting an algorithm to recognize these clumps of points without help is called clustering. To gain insight into how common clustering techniques work (and don't work), I've been making some visualizations that illustrate three fundamentally different approaches. This post, the first in this series of three, covers the k-means algorithm. To begin, click an initialization strategy below:
Sun, 19 Jan 2014 00:00:00 PST

How I Got 2x Speedup with One Line of Code
https://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/
If you had asked me whether or not it was possible to get a 2x speedup for my LazySorted project by adding a single line of code, I would have told you "No way, substantial speedups can really only come from algorithm changes." But surprisingly, I was able to do so by adding a single line using the __builtin_prefetch function in GCC and Clang. Here's the story about how adding this got me a 2x speedup.
Thu, 14 Nov 2013 00:00:00 PST

The LaTex Numbers
https://www.naftaliharris.com/blog/latex-numbers/
Let's define the LaTex numbers to be the set of all real numbers that can be unambiguously expressed with the LaTex type system. This set of numbers has a few fun properties, not least of which, as we'll see later, is that it doesn't quite exist.
Sat, 12 Oct 2013 00:00:00 PST

The Ten Best Ideas in Statistics
https://www.naftaliharris.com/blog/ten-stat-ideas/
I've been studying Statistics for six years now, seriously for the last four years, and as my main focus for the last three. Now that I've finished the core PhD curriculum at Stanford, I've spent some time reflecting on the best ideas I've learned in Probability and Statistics over the years. I've compiled a list of brilliant and beautiful ideas, ones that I'm still impressed with every time I think about them.
Fri, 04 Oct 2013 00:00:00 PST

The Zero Times Infinity Problem
https://www.naftaliharris.com/blog/zero-times-infinity/
There are two ways to keep yourself safe while rock climbing: The first option is to protect yourself carefully with ropes and gear, so that if you fall you won't fall too far or hard. The second option is to make sure not to fall. I think that the option climbers choose reflects their perception of what I like to call the "zero times infinity" problem, in which you multiply a near-zero probability by a near-infinite loss.
Sat, 24 Aug 2013 00:00:00 PST

Martingale Implications Graph
https://www.naftaliharris.com/blog/mg-graph/
Here's another directed graph of statement implications that I used to study for quals. This one is about convergence of stochastic processes with martingales. Like my Markov Chain Implications Graph, each node is a statement about a stochastic process, and each edge is an implication.
Mon, 19 Aug 2013 00:00:00 PST

Markov Chain Implications Graph
https://www.naftaliharris.com/blog/mc-graph/
I've been studying for quals for the last several weeks. Today, I was reviewing basic Markov Chain theory, and decided to understand it by drawing a graph of various statements about Markov Chains. Each statement is a node, and each edge is an implication, (so if A points to B then A implies B). If you're not familiar with the various definitions in the nodes, I'd recommend taking a look at Professor Lalley's very intuitive and readable lecture notes, or at Amir Dembo's comprehensive probability notes, which the edges in the graph below reference for proofs.
Mon, 29 Jul 2013 00:00:00 PST

Visualizing the James-Stein Estimator
https://www.naftaliharris.com/blog/steinviz/
In the words of one of my professors, "Stein's Paradox may very well be the most significant result in Mathematical Statistics since World War II." The problem is this: You observe $X_1, \ldots, X_n \sim \mathcal{N}_p(\mu, \sigma^2 I_p)$, with $\sigma^2$ known, and wish to estimate the mean vector $\mu \in \mathbb{R}^p$. The obvious thing to do, of course, is to use the sample mean $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ as an estimator of $\mu$. Stein's Paradox is the counterintuitive fact that in dimension $p \ge 3$, this estimator is inadmissible under squared error loss.
Mon, 06 May 2013 00:00:00 PST

Memory Locality and Python Objects
https://www.naftaliharris.com/blog/heapobjects/
I've been obsessed with sorting over the last few weeks as I write a python C-extension implementing a lazily-sorted list. Among the many algorithms that this lazily-sorted list implements is quickselect, which finds the kth smallest element in a list in expected linear time. To examine the performance of this implementation, I decided to plot the time it takes to compute the median of a random list divided by the length of the list. Theoretically, since quickselect runs in expected linear time, this plot should be roughly constant. In fact, this is what the plot looks like:
Mon, 15 Apr 2013 00:00:00 PST

The Hottest Person in the Group
https://www.naftaliharris.com/blog/tenpeople/
Suppose you think you're pretty good-looking. In fact, you think you're at about the 90th percentile--you're hotter than 90% of people and not as hot as the other 10%. If I put you with another nine random people, you'd think, at least heuristically, that you're probably the hottest one out of the ten. This is in fact not true.
Thu, 04 Apr 2013 00:00:00 PST

Goldbach's Conjecture and Coding Length
https://www.naftaliharris.com/blog/goldbach/
Goldbach's conjecture is that every even integer greater than or equal to four can be written as the sum of two prime numbers. (Try it: $4 = 2 + 2$, $6 = 3 + 3$, $8 = 3 + 5$, $10 = 3 + 7$, $12 = 5 + 7$...) It occurred to me that if this conjecture were true, it could be used as a way to encode even integers greater than four, and that this encoding would need to be no more efficient than the most efficient encoding, which simply enumerates the even integers $n \ge 4$. If it were more efficient, this would constitute a counterproof of the conjecture, which is widely believed to be true.
Tue, 02 Apr 2013 00:00:00 PST

Don't Trust Asymptotics: Part I
https://www.naftaliharris.com/blog/asymptotics/
Suppose I give you a sequence of real numbers $x_n$, and tell you that $\lim_n x_n = \infty$. What can you tell me about $x_{100}$? How about $x_{1,000,000}$?
Mon, 25 Mar 2013 00:00:00 PST

Being the First
https://www.naftaliharris.com/blog/first/
I was at the climbing gym a few weeks ago with a friend of mine. We were just messing around, making up bouldering problems for ourselves to do. One of the problems my friend came up with was a nice, muscular traverse moving through the overhanging wall, with pretty good hands but not much for feet. For whatever reason, our attempts on this problem started getting a lot of attention from the other climbers at the gym, and pretty soon we had a sizeable group watching us and asking about which holds were on. As these other people started trying the problem, I felt that strong climber's desire to bag the first ascent, especially before the others started learning the beta, (tricks for how to do it).
Sat, 16 Mar 2013 00:00:00 PST

CSS Gotchas
https://www.naftaliharris.com/blog/cssgotchas/
I'm still learning HTML and CSS. As I debug simple websites I write, (including this one), I've encountered a lot of behavior that seemed very counterintuitive to me. I thought I'd share some of this so that future people (especially myself) can avoid the same mistakes I've made.
Thu, 10 Jan 2013 00:00:00 PST

How My Chess Engine Works
https://www.naftaliharris.com/blog/chess/
I have always had a lot of respect for chess, despite the fact that I'm not very good at it myself. As I learned more about the game, I also heard about the successes of computer chess AIs, in particular the sensational defeat of Garry Kasparov by Deep Blue. This inspired me to write a primitive chess engine in Python in high school, which played rather abysmally.
Wed, 26 Dec 2012 00:00:00 PST

Finding Isomorphisms Between Finite Groups
https://www.naftaliharris.com/blog/groupiso/
One of the most interesting problems I came across as I was building my Abstract Algebra package was that of finding an isomorphism between two finite groups G and H, represented by their Cayley tables, or proving that G and H aren't isomorphic. Before reading further, give it a little thought--how would you do it?
Fri, 23 Nov 2012 00:00:00 PST

Popping the Hood
https://www.naftaliharris.com/blog/popthehood/
Lots of things appear to be magic: Computers, the Banach-Tarski Paradox, cars, the phenomenal success of companies like Facebook, airplanes, the Internet, the Central Limit Theorem, and the fact that bicycles and spinning tops don't fall over are just a few examples.
Sun, 28 Oct 2012 00:00:00 PST