# Map Transformation

March 23, 2018

*
Note: I was describing my map transformation project to Sasha Trubetskoy recently, who is even more into maps than I am. This is a project I completed in March 2011 with Shir Yehoshua for a CS class, where we transformed a map of Washington DC so that the Euclidean distances on the transformed map became more proportional to travel times. I've described this project many times over the years to various friends, and figured that it was long past time I wrote it up. What follows is a write-up that's partly based on contemporaneous emails I wrote about this project, and partly based on my now-hazy recollection:
*

(show me more)

# Style for Python Multiline If-Statements

April 10, 2017

PEP 8 gives a number of acceptable ways of handling multiple line if-statements in Python. But to be honest, most of the styles I've seen--even those that conform with the PEP--seem ugly and hard to read for me. So here's my favorite style:

(show me more)

# Hypothesis Tests for Machine Learning

March 27, 2017

Statisticians have spent a lot of time attempting to do complicated inference for various machine learning models. In fact, there's an enormously simple and naive way to do this in complete generality: Simply use a paired T-test to compare the performance of two models on your test set!

(show me more)

# The Communication Loss Function

March 20, 2017

My first ever task writing software professionally was to make some small change to the Kaggle server. I spent a day or so following painstakingly moving down the call stack from the API endpoint to figure out which file I needed to make my changes in. I proudly showed my mentor the five line change I'd made. His response: "Oh, that file is actually automatically generated. Your changes are going to get wiped out during the build."

(show me more)

# Logistic Regression Isn't Interpretable

March 13, 2017

Suppose two events A and B are independent, with the odds of A occurring being 4, and the odds of B being 5. What are the odds of both A and B occurring? I'll give you a hint: it's not 20.

(show me more)

# Nontrivial: Exception Handling in Python

March 6, 2017

Most code can fail in multiple places and for multiple reasons. Handling these failures seems pretty trivial, something you'd cover in the basic tutorial to your programming language. Actually, I think that doing this well can be surprisingly subtle, and ultimately dances around the control flow constructs of your programming language of choice. Let me illustrate with a simple example in Python, my second-favorite programming language:

(show me more)

# Implementing "nonlocal" in Tauthon: Part I

February 27, 2017

Tauthon is a fork of Python 2.7 with syntax, builtins, and libraries backported from Python 3. It aspires to be able to run all valid Python 2 and 3 code. In this article, I begin discussing how I was able to backport the "nonlocal" keyword from Python 3. I hope this post is useful for people who are interested in hacking on the CPython interpreter or CPython forks: it sounds hard, and it can be a bit tedious, but it's actually a lot easier than you'd think.

(show me more)

# Day-to-Day Operations of Palo Alto

February 20, 2017

Palo Alto runs a pretty open city government, with a number of interesting documents available for download on their website. Of particular interest are their annual budgets and annual financial reports, both of which are for the fiscal year ending June 30. These documents are a few hundred pages long each and full of accounting tables, but with a bit of persistence and the help of some friendly city employees I think I was able to figure out much of what is going on. In this post I give an overview of what the city of Palo Alto does on a day-to-day basis--(i.e., excluding long-term capital projects).

(show me more)

# Continuous Time Lending

January 15, 2017

Assume a borrower takes out an installment loan of size $1$ and makes continuous-time payments on it. The installment loan starts at time $0$, ends at time $T$, and has an interest rate of $r$, compounding continually. We'll let $P(t)$ be the principal owed by the borrower at time $t$, and $I(t)$ be the interest they've paid on the loan so far. Unlike in discrete time lending, we don't need to keep track of the amount of unpaid interest owed by the borrower; it's always zero since the borrower makes continuous payments. To check your comprehension, $P(0) = 1$, $P(T) = 0$, $I(0) = 0$, and $I(T)$ is the total amount of interest the borrower will pay on their loan (assuming no defaults, fees, or prepayment).

(show me more)

# An Easy Chess Puzzle

December 29, 2016

I was looking through Markovian, my old chess engine, recently, and came across the first game it won against another chess engine. Stepping through the game, it seems that both engines actually played pretty poorly. Even so, I was proud that Markovian found a long forced mate to end the game. Here's that mate in puzzle form; play black, and find the fastest mate. (If there's more than one, any will be accepted).

(show me more)

# Why I'm Making Tauthon

November 30, 2016

For the past two months I've been spending half my time on Tauthon. Tauthon is a backwards-compatible Python interpreter that runs Python 2 code and C-extensions exactly as-is, while also allowing Python 2 programmers to use the most exciting new language features from Python 3. These new backported language features include async/await syntax, function annotations and typing support, keyword-only arguments, and new metaclass syntax, among many others. I use Tauthon as my system python now, and haven't had any problems running my old 2.7 code or using packages like IPython, pip, numpy, pandas, requests, and flask. I've been enjoying the new language features--I especially like underscores in numeric literals!

(show me more)

# Desperation Motivated Creativity

July 25th, 2016

I am not the strongest climber. Some of the people I've climbed with are so strong that they can do a one-arm pull-up, and then--while locking off with one arm--sing the "Head, Shoulders, Knees and Toes" song and use the other arm to do the corresponding dance. This is not in my future anytime soon.

(show me more)

# OHMS Lessons Learned

July 10th, 2016

*
Note: I found the following post as an almost complete draft as I was reading some of my unpublished posts. I wrote it around October 1st, 2013, at the beginning of what would end up being my last year at Stanford. Later that quarter I learned a lot more lessons from OHMS, including--the hard way, at 12:30AM the night before a homework was due--not to run SQLite over a network file system. Wanting to write up some of those additional lessons probably contributed to my never publishing this until now. I've corrected typos or completed sentences in four or five places to make this post publishable, but the rest of it is exactly as it was in October 2013:
*

(show me more)

# Machine Learning over JSON

May 20, 2015

Supervised machine learning is the problem of approximating functions X -> Y from many example (x, y) pairs. Now, the vast majority of supervised learning algorithms assume that X is p-dimensional Euclidean space. As I'll argue, this is a poor model for many problems, and modeling X as a JSON document fits much better.

(show me more)

# Visualizing DBSCAN Clustering

January 24, 2015

A previous post covered clustering with the k-means algorithm. In this post, we consider a fundamentally different, density-based approach called DBSCAN. In contrast to k-means, which modeled clusters as sets of points near to their center, density-based approaches like DBSCAN model clusters as high-density clumps of points. To begin, choose a data set below:

(show me more)

# You Can't Predict Small Counts

January 17, 2015

A small restaurant is interested in predicting how many customers will come in on a given night. This is valuable information to know ahead of time, for example, so that the restaurant can figure out whether to ask employees to work extra shifts. Unfortunately, under reasonable conditions no amount of data will permit even the most talented data scientist or statistician to make particularly good predictions.

(show me more)

# Half the Decimal Trick

January 9, 2015

If something happened 1,234 out of 10,000 times, we'd estimate that the true probability of occurence is about 0.1234. Of course, we wouldn't expect the true probability to be exactly 0.1234, and to quantify the uncertainty in this estimate statisticians have long computed confidence intervals. But in this particular case, there's a simple eyeballing trick we can use to get approximate error bars: we round the proportion to half the number of decimal places, (0.12|34 becomes 0.12), and add a plus or minus 1 in the least significant digit, (0.12 +/- 0.01).

(show me more)

# How to Forge an Email

December 22, 2014

Most people don't realize how easy it is to forge an email. Say my brother John Doe uses the email address john.doe@example.com. If I get an email from that address, it's natural to assume that John actually sent it. In fact, it's also remarkably easy for an attacker to have sent it.

(show me more)

# T-Tests Aren't Monotonic

October 22, 2014

R. A. Fisher and Karl Pearson play a heated round of golf. Being Statisticians, they agree before the round to run a two-sided paired T-test to see if either of them is statistically significantly better. After the first 17 holes, Fisher is ahead by 19 strokes, and openly gloating. On the 18th hole, he sinks a 20-foot putt for birdey, and smirks at Pearson. Pearson then "accidentally" hits his ball into several sand traps, trees, and water hazards, taking 100 strokes on the last hole.

(show me more)

# Robust Machine Learning

October 5, 2014

Real data often has incorrect values in it. Origins of incorrect data include programmer errors, ("oops, we're double counting!"), surprise API changes, (a function used to return proportions, suddenly it instead returns percents), or poorly scraped data. When working with data, a desirable property of whatever you're doing is that it be *robust*, or continue to work in the presence of some incorrect values.

(show me more)

# Python Subclass Relationships Aren't Transitive

August 26, 2014

Subclass relationships are not transitive in Python. That is, if A is a subclass of B, and B is a subclass of C, it is not necessarily true that A is a subclass of C. The reason for this is that with PEP 3119, anyone is allowed to define their own, arbitrary __subclasscheck__ in a metaclass.

(show me more)

# Sensitivity of Independence Assumptions

August 3, 2014

Recently I was considering an interesting problem: Several people interview a potential job candidate, and each of them scores that candidate numerically on some scale. What's the variation associated with the average score?

(show me more)

# Visualizing Lasso Polytope Geometry

June 8, 2014

Some recent research about the lasso exploits a beautiful geometric picture: Suppose you fix the design matrix X and the regularization parameter $\lambda$. For a particular value of y, the n-dimensional response variable, you can then solve the lasso problem and examine the signs of the coefficients. Now if you partition n-dimensional space based upon the signs of the coefficients you'd get if you solved the lasso problem at each value of y, then you'll find that each of the resulting regions are convex polytopes. This is kind of a mouthful, so I made the visualization below to illustrate this partitioning. You can drag x1, x2, and x3 to rotate them, and slide the slider to increase or decrease $\lambda$.

(show me more)

# College Interview Tips

May 15, 2014

The college admissions interview is a valuable component of college applications because it provides admissions officers with a holistic evaluation from a source that has no vested interest in your success. This contrasts with your teacher evaluations, which are holistic but (hopefully) written by people who want you to be admitted, and your standardized tests scores, which aren't biased towards your success the way that teacher evaluations are, but which aren't holistic at all.

(show me more)

# A College Waitlisting Model

May 4, 2014

Suppose a selective college wants $N_0$ students in their freshman class. How many students should they admit, and what's the distribution of the number of students they'll admit off the waitlist? Of course, you could just look for data from previous years, but in the fast-changing world of college admissions, data goes stale quickly. And honestly, I really enjoy simple probability models!

(show me more)

# Don't Double Major

April 8, 2014

Double majoring in college is a very suboptimal strategy. The reason is simple: It adds a substantial set of constraints to the courses you can take, but in return gives you only a very modest extra credential.

(show me more)

# A Statistical Analysis of Climbing

February 17, 2014

Recently, 28 of us on the Stanford Climbing Team completed a short survey on our climbing abilities. Although the survey was intended to assess our interest in different clinics, the answers to the survey question also shed light on some interesting climbing questions, like how bouldering grades compare to top-rope grades, how much harder leading is than top-roping, and what different "climber types" there are. These questions really excited me, so I asked for permission to analyze this data, which the team graciously granted.

(show me more)

# How to Solve Problems

February 4, 2014

I spend a lot of my time solving problems: I solve well-defined math problems in grad school, open-ended problems in statistical consulting, physical puzzles when I'm rock climbing, architectural problems when I build software systems, and mysteries when I debug them. Here are some strategies that have worked for me in solving the problems in these domains, presented in the order that I usually try them in. My hope is that they prove useful to you in solving your problems:

(show me more)

# Visualizing K-Means Clustering

January 19, 2014

Suppose you plotted the screen width and height of all the devices accessing this website. You'd probably find that the points form three clumps: one clump with small dimensions, (smartphones), one with moderate dimensions, (tablets), and one with large dimensions, (laptops and desktops). Getting an algorithm to recognize these clumps of points without help is called *clustering*. To gain insight into how common clustering techniques work (and don't work), I've been making some visualizations that illustrate three fundamentally different approaches. This post, the first in this series of three, covers the k-means algorithm. To begin, click an initialization strategy below:

(show me more)

# How I Got 2x Speedup with One Line of Code

November 14, 2013

If you had asked me whether or not it was possible to get a 2x speedup for my LazySorted project by adding a single line of code, I would have told you "No way, substantial speedups can really only come from algorithm changes." But surprisingly, I was able to do so by adding a single line using the `__builtin_prefetch`

function in GCC and Clang. Here's the story about how adding this got me a 2x speedup.

(show me more)

# The LaTex Numbers

October 12, 2013

Let's define the *LaTex numbers* to be the set of all real numbers that can be unambiguously expressed with the LaTex type system. This set of numbers has a few fun properties, not least of which, as we'll see later, is that it doesn't quite exist.

(show me more)

# The Ten Best Ideas in Statistics

October 4, 2013

I've been studying Statistics for six years now, seriously for the last four years, and as my main focus for the last three. Now that I've finished the core PhD curriculum at Stanford, I've spent some time reflecting on the best ideas I've learned in Probability and Statistics over the years. I've compiled a list of brilliant and beautiful ideas, ones that I'm still impressed with every time I think about them.

(show me more)

# The Zero Times Infinity Problem

August 24, 2013

There are two ways to keep yourself safe while rock climbing: The first option is to protect yourself carefully with ropes and gear, so that if you fall you won't fall too far or hard. The second option is to make sure not to fall. I think that the option climbers choose reflects their perception of what I like to call the "zero times infinity" problem, in which you multiply a near-zero probability by a near-infinite loss.

(show me more)

# Martingale Implications Graph

August 19, 2013

Here's another directed graph of statement implications that I used to study for quals. This one is about convergence of stochastic processes with martingales. Like my Markov Chain Implications Graph, each node is a statement about a stochastic process, and each edge is an implication.

(show me more)

# Markov Chain Implications Graph

July 29, 2013

I've been studying for quals for the last several weeks. Today, I was reviewing basic Markov Chain theory, and decided to understand it by drawing a graph of various statements about Markov Chains. Each statement is a node, and each edge is an implication, (so if A points to B then A implies B). If you're not familiar with the various definitions in the nodes, I'd recommend taking a look at Professor Lalley's very intuitive and readable lecture notes, or at Amir Dembo's comprehensive probability notes, which the edges in the graph below reference for proofs.

(show me more)

# Visualizing the James-Stein Estimator

May 6, 2013

In the words of one of my professors, "Stein's Paradox may very well be the most significant result in Mathematical Statistics since World War II." The problem is this: You observe $X_1, \ldots, X_n \sim \mathcal{N}_p(\mu, \sigma^2 I_p)$, with $\sigma^2$ known, and wish to estimate the mean vector $\mu \in \mathbb{R}^p$. The obvious thing to do, of course, is to use the sample mean $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ as an estimator of $\mu$. Stein's Paradox is the counterintuitive fact that in dimension $p \ge 3$, this estimator is inadmissible under squared error loss.

(show me more)

# Memory Locality and Python Objects

April 15, 2013

I've been obsessed with sorting over the last few weeks as I write a python C-extension implementing a lazily-sorted list. Among the many algorithms that this lazily-sorted list implements is quickselect, which finds the kth smallest element in a list in expected linear time. To examine the performance of this implementation, I decided to plot the time it takes to compute the median of a random list divided by the length of the list. Theoretically, since quickselect runs in expected linear time, this plot should be roughly constant. In fact, this is what the plot looks like:

(show me more)

# The Hottest Person in the Group

April 4, 2013

Suppose you think you're pretty good-looking. In fact, you think you're at about the 90th percentile--you're hotter than 90% of people and not as hot as the other 10%. If I put you with another nine random people, you'd think, at least heuristically, that you're probably the hottest one out of the ten. This is in fact not true.

(show me more)

# Goldbach's Conjecture and Coding Length

April 2, 2013

Goldbach's conjecture is that every even integer greater or equal to four can be written as the sum of two prime numbers. (Try it: $4 = 2 + 2$, $6 = 3 + 3$, $8 = 3 + 5$, $10 = 3 + 7$, $12 = 5 + 7$...) It occurred to me that if this conjecture were true, it could be used as a way to encode even integers greater than four, and that this encoding would need to be no more efficient than the most efficient encoding, which simply enumerates the even integers $n \ge 4$. If it were more efficient, this would constitute a counterproof of the conjecture, which is widely believed to be true.

(show me more)

# Don't Trust Asymptotics: Part I

March 25, 2013

Suppose I give you a sequence of real numbers $x_n$, and tell you that $\lim_n x_n = \infty$. What can you tell me about $x_{100}$? How about $x_{1,000,000}$?

(show me more)

# Being the First

March 16, 2013

I was at the climbing gym a few weeks ago with a friend of mine. We were just messing around, making up with bouldering problems for ourselves to do. One of the problems my friend came up with was a nice, muscular traverse moving through the overhanging wall, with pretty good hands but not much for feet. For whatever reason, our attempts on this problem started getting a lot of attention from the other climbers at the gym, and pretty soon we had a sizeable group watching us and asking about which holds were on. As these other people started trying the problem, I felt that strong climber's desire to bag the first ascent, especially before the others started learning the beta, (tricks for how to do it).

(show me more)

# CSS Gotchas

January 10, 2013

I'm still learning html and css. As I debug simple websites I write, (including this one), I've encountered a lot of behavior that seemed very counterintuitive to me. I thought I'd share some of this so that future people (especially myself) can avoid the same mistakes I've made.

(show me more)

# How My Chess Engine Works

December 26, 2012

I have always had a lot of respect for chess, despite the fact that I'm not very good at it myself. As I learned more about the game, I also heard about the successes of computer chess AIs, in particular the sensational defeat of Gary Kasparov by Deep Blue. This inspired me to write a primitive chess engine in Python in high school, which played rather abysmally.

(show me more)

# Finding Isomorphisms Between Finite Groups

November 23, 2012

One of the most interesting problems I came across as I was building my Abstract Algebra package was that of finding an isomorphism between two finite groups G and H, represented by their Cayley tables, or proving that G and H aren't isomorphic. Before reading further, give it a little thought--how would you do it?

(show me more)

# Popping the Hood

October 28, 2012

Lots of things appear to be magic: Computers, the Banach-Tarski Paradox, cars, the phenomenal success of companies like Facebook, airplanes, the Internet, the Central Limit Theorem, and the fact that bicycles and spinning tops don't fall over are just a few examples.

(show me more)