alex hayes
posts
Modified
April 6, 2024
fairness in repeated lotteries
how to run repeated lotteries in a fair way
Lotteries are a reasonably fair way to allocate scarce resources. One interesting feature of lotteries, however, is that they can quickly become unfair when they happen on a…
Jun 9, 2024
13 min
how to effectively ask for statistics help
help me help you
You’re probably here because you reached out with a stats question and I pointed you to this post. Thank you so much for your question! Chances are that I’d love to answer…
Dec 20, 2023
5 min
the accumulation of small risks
a simple plot that informs how i make decisions
When we take small risks repeatedly, that risk accumulates, and it can accumulate surprisingly quickly.
Dec 12, 2023
3 min
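As a rough illustration of how quickly small risks add up, here is a minimal sketch. It assumes every exposure is independent and carries the same probability of a bad outcome, which is a simplification of mine and not necessarily the setup used in the post; the function name is hypothetical.

```python
def cumulative_risk(p: float, n: int) -> float:
    """Probability of at least one bad outcome in n exposures.

    Assumes (for illustration only) that exposures are independent
    and that each one goes wrong with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # P(at least one) = 1 - P(none) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p) ** n


# A 1% risk taken 100 times is no longer a small risk.
risk_after_100 = cumulative_risk(0.01, 100)
print(f"{risk_after_100:.3f}")  # roughly 0.634
```

Under these assumptions, a 1% risk repeated 100 times leaves only about a 37% chance of avoiding the bad outcome entirely.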
hypothesis testing by example
some pointers on things that can go right and wrong
In a data science course that I am currently TA-ing, we just gave out the following problem. Suppose you have the following six sequences of coin flips. Exactly one sequence…
Nov 2, 2022
15 min
avoiding data races with intel mkl
set the MKL_THREADING_LAYER environment variable to GNU
If you are using Intel MKL for your BLAS/LAPACK implementation on Linux, you can set the environment variable MKL_THREADING_LAYER to GNU to avoid data races during matrix…
Oct 30, 2022
3 min
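One way to apply this setting from inside a Python script is sketched below. The variable name and value come from the post summary above; the helper name is hypothetical, and the assumption that the variable must be set before any MKL-backed library is imported is mine. Exporting the variable in the shell before launching the process is an equally valid route.

```python
import os


def use_gnu_threading_layer() -> None:
    """Ask Intel MKL to use the GNU threading layer.

    Assumption: this is called before importing any library that
    loads MKL (for example an MKL-linked NumPy build), because the
    variable is typically read when the library is first loaded.
    """
    os.environ["MKL_THREADING_LAYER"] = "GNU"


use_gnu_threading_layer()

# Imports of MKL-backed libraries would go after this point.
print(os.environ["MKL_THREADING_LAYER"])
```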
many models workflows in python ii
a tidymodels workflow in python using list columns in pandas dataframes, now with hyperparameter sweep and parallelism
In this followup to my earlier post on modeling workflows in Python, I demonstrate how to integrate sample splitting, parallel processing, exception handling and caching…
Mar 28, 2021
8 min
many models workflows in python i
a tidymodels workflow in python using list columns in pandas dataframes
This summer I worked on my first substantial research project in Python. I’ve used Python for a number of small projects before, but this was the first time that it was…
Aug 25, 2020
10 min
using the data twice
my best attempt at explaining why frequentism is about sampling procedures
Berna Devezer, Danielle Navarro, Joachim Vandekerckhove, and Erkan Ozge Buzbas recently posted a pre-print, Devezer et al. (2020), responding to various claims within the…
May 4, 2020
11 min
synthetic control: elon’s tweet tanked tesla’s stock
a terse analysis of tesla stock prices and how one of elon’s tweets moves them
At 2020-05-01 15:11:26 UTC Elon Musk tweeted
May 1, 2020
3 min
to transform or not to transform
a likelihood ratio test to check if transforming your data leads to better model fit
You may have heard that it is impossible to compare models when the outcome has been transformed in one model but not the other. This is not the case. Models fit to…
Mar 22, 2020
16 min
overfitting: a guided tour
fleshing out intuition about structure in random processes beyond the standard bias-variance decomposition
This post introduces overfitting, describes how overfitting influences both prediction and inference problems, provides supervised and unsupervised examples of overfitting…
Jan 6, 2020
20 min
consistency and the linear probability model
an explainer about ordinary least squares regression and when it is an acceptable estimator
A while back Twitter once again lost its collective mind and decided to rehash the logistic regression versus linear probability model debate for the umpteenth time. The…
Aug 31, 2019
18 min
an annotated bibliography on stochastic blockmodels
some pointers to papers that were helpful when i got started in spectral network analysis, a woefully incomplete list
I’ve been reading a lot of papers on network analysis recently. I thought I’d write down some takeaways and point out papers that I’ve found helpful. This collection of…
Jul 26, 2019
13 min
testing statistical software
an exploration of what it would take to meaningfully probe the correctness of computations in modeling software
Recently I’ve been implementing and attempting to extend some computationally intense methods. These methods are from papers published in the last several years, and haven’t…
Jun 7, 2019
17 min
type stable estimation
my thesis about why modeling software is an enormous mess
This post discusses how the mathematical objects we use in formal data modeling are represented in statistical software. First I introduce these objects, then I argue that…
May 21, 2019
22 min
implementing the super learner with tidymodels
a demonstration of using low-level tidymodels infrastructure to build sophisticated tools in a hurry
In this post I demonstrate how to implement the Super Learner using tidymodels infrastructure. The Super Learner is an ensembling strategy that relies on cross-validation to…
Apr 13, 2019
14 min
overlapping confidence intervals: correcting bad intuition
some math to show that confidence intervals of significantly different parameters can overlap
In this post I work through a recent homework exercise that illustrates why you shouldn’t compare means by checking for confidence interval overlap. I calculate the type I…
Jan 31, 2019
9 min
some things i’ve learned about stan
what i wish my mother had told me about sampling from posteriors
Yesterday, for the first time ever, I coded up a model in Stan and it actually did what I wanted. My current knowledge of Stan is, at best, nascent, but I’ll show you the…
Dec 24, 2018
24 min
understanding multinomial regression with partial dependence plots
some intuition for multinomial regression’s initially intimidating functional form
This post assumes you are familiar with logistic regression and that you just fit your first or second multinomial logistic regression model. While there is an…
Oct 23, 2018
7 min
a summer with rstudio
reflections on a great internship
Today is the last day of my summer internship with RStudio. This is the first year that RStudio has had an official internship program, and I couldn’t be happier to have…
Aug 10, 2018
10 min
speeding up GPX ingest: profiling, Rcpp and furrr
a demonstration of how to profile r code on a toy problem
This post is a casual case study in speeding up R code. I work through several iterations of a function to read and process GPS running data from Strava stored in the GPX…
Jun 15, 2018
19 min
comparing runs with riegel’s formula and gams
a quick analysis of my running fitness using splines
Runners often vary the distance and intensity of their workouts. In this post I demonstrate how to compare runs of different lengths using Riegel’s formula. The formula…
May 16, 2018
11 min
predictive performance via bootstrap variants
resampling based approaches to estimating the risk of a predictive model
When we build a predictive model, we are interested in how the model will perform on data it hasn’t seen before. If we have lots of data, we can split it into training and…
May 3, 2018
13 min
numerical gradient checks
how to use a computer to check your derivative calculations
Suppose you have some loss function \(\mathcal{L}(\beta) : \mathbb{R}^n \to \mathbb{R}\) you want to minimize with respect to some model parameters \(\beta\). You understand…
Oct 18, 2017
4 min
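A numerical gradient check of this kind can be sketched as follows. This is a minimal illustration using central finite differences, which is one common choice and not necessarily the scheme used in the post; the quadratic loss, the function names, and the step size eps are all assumptions made for the example.

```python
def numerical_gradient(loss, beta, eps=1e-6):
    """Approximate the gradient of loss at beta by central differences."""
    grad = []
    for i in range(len(beta)):
        up = list(beta)
        down = list(beta)
        up[i] += eps
        down[i] -= eps
        # (L(beta + eps * e_i) - L(beta - eps * e_i)) / (2 * eps)
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad


def loss(beta):
    """Illustrative loss: the sum of squared parameters."""
    return sum(b ** 2 for b in beta)


def analytic_gradient(beta):
    """Hand-derived gradient of the illustrative loss: 2 * beta."""
    return [2 * b for b in beta]


beta = [1.0, -2.0, 0.5]
approx = numerical_gradient(loss, beta)
exact = analytic_gradient(beta)

# The hand-derived gradient passes the check if this is close to zero.
max_abs_error = max(abs(a - e) for a, e in zip(approx, exact))
print(max_abs_error)
```

If the largest discrepancy is far from zero, the hand-derived gradient (or the loss itself) most likely contains a mistake.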
gentle tidy eval with examples
copy-pasteable example code for programming with the tidyverse.
I’ve been using the tidy eval framework introduced with dplyr 0.7 for about two months now, and it’s time for an update to my original post on tidy eval. My goal is not to…
Aug 7, 2017
4 min