posts – alex hayes

Modified

April 6, 2024

what i want from academic code

the purpose, audience, and value of methods software

It is a bit of a meme that academic software is disappointing, because it either fails to solve applied problems or behaves unreliably in production. I don’t think these are…

fairness in repeated lotteries

how to run repeated lotteries in a fair way

Lotteries are a reasonably fair way to allocate scarce resources. One interesting feature of lotteries, however, is that they can quickly become unfair when they happen on a…

how to effectively ask for statistics help

help me help you

You’re probably here because you reached out with a stats question and I pointed you to this post. Thank you so much for your question! Chances are that I’d love to answer…

the accumulation of small risks

a simple plot that informs how i make decisions

When we take small risks repeatedly, that risk accumulates, and it can accumulate surprisingly quickly.
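
A minimal sketch of how quickly a small risk can accumulate. It assumes each attempt is independent and carries the same probability `p` of a bad outcome; that assumption, and the function name `accumulated_risk`, are illustrative and not taken from the post itself.

```python
def accumulated_risk(p: float, n: int) -> float:
    """Probability of at least one bad outcome across n attempts.

    Assumes (for illustration) that attempts are independent and that
    each one has the same probability p of going wrong.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # P(no bad outcome in n attempts) = (1 - p)^n, so take the complement.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # A 1% risk repeated 10, 100, and 500 times.
    for n in (10, 100, 500):
        print(n, round(accumulated_risk(0.01, n), 3))
```

Under these assumptions, a 1% risk taken 100 times gives roughly a 63% chance of at least one bad outcome.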

hypothesis testing by example

some pointers on things that can go right and wrong

In a data science course that I am currently TA-ing, we just gave out the following problem. Suppose you have the following six sequences of coin flips. Exactly one sequence…

avoiding data races with intel mkl

set the MKL_THREADING_LAYER environment variable to GNU

If you are using Intel MKL for your BLAS/LAPACK implementation on Linux, you can set the environment variable MKL_THREADING_LAYER to GNU to avoid data races during matrix…
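
One way to set the variable from inside a Python session is sketched below. It assumes the variable has to be in the environment before any MKL-backed library is first loaded, so the assignment comes ahead of those imports; exporting it in the shell before starting the process is an alternative.

```python
import os

# Assumption: set this before importing NumPy, SciPy, or anything else
# linked against MKL, so the library sees it when it is first loaded.
os.environ["MKL_THREADING_LAYER"] = "GNU"

# Import MKL-backed libraries only after the variable is set, e.g.:
# import numpy as np
```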

many models workflows in python ii

a tidymodels workflow in python using list columns in pandas dataframes, now with hyperparameter sweep and parallelism

In this followup to my earlier post on modeling workflows in Python, I demonstrate how to integrate sample splitting, parallel processing, exception handling and caching…

many models workflows in python i

a tidymodels workflow in python using list columns in pandas dataframes

This summer I worked on my first substantial research project in Python. I’ve used Python for a number of small projects before, but this was the first time that it was…

using the data twice

my best attempt at explaining why frequentism is about sampling procedures

Berna Devezer, Danielle Navarro, Joachim Vandekerckhove, and Erkan Ozge Buzbas recently posted a pre-print, Devezer et al. (2020), responding to various claims within the…

synthetic control: elon’s tweet tanked tesla’s stock

a terse analysis of tesla stock prices and how one of elon’s tweets moves them

At 2020-05-01 15:11:26 UTC Elon Musk tweeted

to transform or not to transform

a likelihood ratio test to check if transforming your data leads to better model fit

You may have heard that it is impossible to compare models when the outcome has been transformed in one model but not the other. This is not the case. Models fit to…

overfitting: a guided tour

fleshing out intuition about structure in random processes beyond the standard bias-variance decomposition

This post introduces overfitting, describes how overfitting influences both prediction and inference problems, provides supervised and unsupervised examples of overfitting…

consistency and the linear probability model

an explainer about ordinary least squares regression and when it is an acceptable estimator

A while back Twitter once again lost its collective mind and decided to rehash the logistic regression versus linear probability model debate for the umpteenth time. The…

an annotated bibliography on stochastic blockmodels

some pointers to papers that were helpful when i got started in spectral network analysis, a woefully incomplete list

I’ve been reading a lot of papers on network analysis recently. I thought I’d write down some takeaways and point out papers that I’ve found helpful. This collection of…

testing statistical software

an exploration of what it would take to meaningfully probe the correctness of computations in modeling software

Recently I’ve been implementing and attempting to extend some computationally intense methods. These methods are from papers published in the last several years, and haven’t…

type stable estimation

my thesis about why modeling software is an enormous mess

This post discusses how the mathematical objects we use in formal data modeling are represented in statistical software. First I introduce these objects, then I argue that…

implementing the super learner with tidymodels

a demonstration of low-level tidymodels infrastructure to build sophisticated tools in a hurry

In this post I demonstrate how to implement the Super Learner using tidymodels infrastructure. The Super Learner is an ensembling strategy that relies on cross-validation to…

overlapping confidence intervals: correcting bad intuition

some math to show that confidence intervals of significantly different parameters can overlap

In this post I work through a recent homework exercise that illustrates why you shouldn’t compare means by checking for confidence interval overlap. I calculate the type I…

some things i’ve learned about stan

what i wish my mother had told me about sampling from posteriors

Yesterday, for the first time ever, I coded up a model in Stan and it actually did what I wanted. My current knowledge of Stan is, at best, nascent, but I’ll show you the…

understanding multinomial regression with partial dependence plots

some intuition for multinomial regression’s initially intimidating functional form

This post assumes you are familiar with logistic regression and that you just fit your first or second multinomial logistic regression model. While there is an…

a summer with rstudio

reflections on a great internship

Today is the last day of my summer internship with RStudio. This is the first year that RStudio has had an official internship program, and I couldn’t be happier to have…

speeding up GPX ingest: profiling, Rcpp and furrr

a demonstration of how to profile r code on a toy problem

This post is a casual case study in speeding up R code. I work through several iterations of a function to read and process GPS running data from Strava stored in the GPX…

comparing runs with riegel’s formula and gams

a quick analysis of my running fitness using splines

Runners often vary the distance and intensity of their workouts. In this post I demonstrate how to compare runs of different lengths using Riegel’s formula. The formula…

predictive performance via bootstrap variants

resampling based approaches to estimating the risk of a predictive model

When we build a predictive model, we are interested in how the model will perform on data it hasn’t seen before. If we have lots of data, we can split it into training and…

numerical gradient checks

how to use a computer to check your derivative calculations

Suppose you have some loss function $L(\beta) : \mathbb{R}^n \to \mathbb{R}$ you want to minimize with respect to some model parameters $\beta$. You understand…
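
A minimal sketch of one common way to check a hand-derived gradient numerically, using central finite differences. This may differ from what the post itself does. The helper names, the step size `eps`, the tolerance `tol`, and the sum-of-squares example loss are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def numerical_gradient(
    loss: Callable[[Vector], float], beta: Vector, eps: float = 1e-6
) -> List[float]:
    """Approximate the gradient of loss at beta by central differences."""
    grad = []
    for i in range(len(beta)):
        up, down = list(beta), list(beta)
        up[i] += eps
        down[i] -= eps
        # (L(beta + eps * e_i) - L(beta - eps * e_i)) / (2 * eps)
        grad.append((loss(up) - loss(down)) / (2.0 * eps))
    return grad


def check_gradient(
    loss: Callable[[Vector], float],
    grad: Callable[[Vector], Sequence[float]],
    beta: Vector,
    eps: float = 1e-6,
    tol: float = 1e-5,
) -> bool:
    """Return True if the analytic gradient matches the numerical one."""
    approx = numerical_gradient(loss, beta, eps)
    exact = grad(beta)
    # Compare coordinate-wise with a tolerance scaled by the magnitudes.
    return all(
        abs(a - e) <= tol * max(1.0, abs(a), abs(e))
        for a, e in zip(approx, exact)
    )


# Hypothetical example: L(beta) = sum of squares, with gradient 2 * beta.
def example_loss(beta: Vector) -> float:
    return sum(b * b for b in beta)


def example_grad(beta: Vector) -> List[float]:
    return [2.0 * b for b in beta]


if __name__ == "__main__":
    print(check_gradient(example_loss, example_grad, [1.0, -2.0, 0.5]))
```

Checking at a few different values of $\beta$, rather than a single point, makes it less likely that a wrong derivative passes by coincidence.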