code

Modified

April 6, 2024

n.b. For the last several years, I have written code purely for the purposes of furthering my academic research. The goal of this code is not to be broadly usable. I only commit to maintaining and explaining my research code to the extent that it assists others in their own academic research. I have also written code designed for broad consumption; see #rstats below for some details about my open source work.

My research code, as well as miscellaneous personal projects in various stages of completion, lives on my GitHub. I primarily use R, and most of my work as a developer is on methods packages. I am also a proficient Python user, and have passing exposure to SQL, Julia, and C++.

research software

  • vsp performs semi-parametric estimation of latent factors in random dot product graphs by computing varimax rotations of the spectral embeddings of graphs. The resulting factors are sparse and interpretable. The theory was developed by Rohe and Zeng (2022+); I ended up using varimax rotation a lot in my own data analysis and wrapped some of the infrastructure I developed into this package. I am committed to maintenance of this package and will respond quickly to feature requests or questions about how you might use it in your own research.

  • fastRG samples large, sparse random dot product graphs very efficiently and is especially useful when running simulation studies for spectral network estimators. The fastRG sampling algorithm is described in Rohe et al. (2018). I am committed to maintenance of this package and will respond quickly to feature requests or questions about how you might use it in your own research.

  • fastadi is a proof-of-concept implementation of AdaptiveImpute, a self-tuning matrix completion algorithm with adaptive thresholding that is closely related to softImpute (Cho, Kim, and Rohe 2019, 2018). As part of my work with Karl Rohe on citation networks, I extended AdaptiveImpute to the computationally challenging case where the entire upper triangle is observed. This is research code rather than code intended for broad consumption. I make no commitments to maintaining or improving this code unless something about it is blocking an ongoing research project.

  • aPPR approximates Personalized PageRanks in large graphs, including those that can only be queried via an API. aPPR additionally performs degree correction and regularization, allowing users to recover blocks from stochastic blockmodels (see Chen, Zhang, and Rohe 2020). Originally aPPR was designed to be used together with the neocache backend to sample large portions of the Twitter following graph with high Personalized PageRanks around seed nodes (joint work with Nathan Kolbow). I am no longer maintaining neocache, however, and cannot commit any development time to keeping up with the Twitter API shenanigans. slides
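To make the idea behind vsp (and the kind of simulation fastRG is meant for) a bit more concrete, here is a minimal Python sketch under stated assumptions. It is not the API of either package: the function names `sample_rdpg`, `spectral_embedding`, and `varimax` are hypothetical, the sampler is a naive O(n^2) Bernoulli draw rather than the fastRG algorithm, and the packages handle details such as centering, scaling, and sparse matrices that are omitted here.

```python
import numpy as np


def sample_rdpg(X, rng):
    """Naive sampler (NOT the fastRG algorithm): edge i~j ~ Bernoulli(<X_i, X_j>)."""
    P = np.clip(X @ X.T, 0.0, 1.0)                 # edge probabilities
    upper = np.triu(rng.random(P.shape) < P, k=1)  # strict upper triangle only
    return (upper | upper.T).astype(float)         # symmetric, no self-loops


def spectral_embedding(A, k):
    """Top-k eigenvectors of A (by |eigenvalue|), scaled by sqrt(|eigenvalue|)."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:k]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))


def varimax(X, max_iter=100, tol=1e-8):
    """Standard iterative varimax; returns rotated loadings and the rotation R."""
    n, k = X.shape
    R = np.eye(k)
    obj = 0.0
    for _ in range(max_iter):
        L = X @ R
        # Gradient of the varimax criterion with respect to the rotation
        G = X.T @ (L**3 - L @ np.diag(np.sum(L**2, axis=0)) / n)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                                 # nearest orthogonal matrix
        new_obj = s.sum()
        if new_obj < obj * (1 + tol):              # stop when progress stalls
            break
        obj = new_obj
    return X @ R, R


rng = np.random.default_rng(0)

# Hypothetical latent positions: two blocks of 100 nodes each
latent = np.repeat(np.array([[0.6, 0.1], [0.1, 0.6]]), 100, axis=0)

A = sample_rdpg(latent, rng)
embedding = spectral_embedding(A, k=2)
rotated, R = varimax(embedding)
```

After rotation, each column of `rotated` should load mostly on one of the two blocks, which is the sense in which the rotated factors are sparse and interpretable. Columns are only identified up to sign and ordering.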

design of statistical software

I am interested in the design of statistical software and have contributed to ROpenSci’s statistical software reviewing guidelines, as well as early versions of the tidymodels implementation principles. I have some long-form explorations of modeling software design on my blog.

I review for the Journal of Open Source Software and the R Journal.

#rstats

I have been involved in a number of open source projects in the tidyverse and tidymodels orbits. I previously maintained the broom package, and am responsible for the 0.5.0 release and a portion of the 0.7.0 release. For these contributions I was generously given authorship on the tidyverse paper. I intermittently participate in the Stan and ROpenSci communities.

I also wrote the distributions3 package, which provides an S3 interface to distribution functions, with an emphasis on good documentation and beginner-friendly design. The vignettes in particular are designed to walk students in intro stat courses through a litany of classic hypothesis tests. I do not actively maintain distributions3, but there is a small community of invested contributors.

References

Chen, Fan, Yini Zhang, and Karl Rohe. 2020. “Targeted Sampling from Massive Block Model Graphs with Personalized PageRank.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (1): 99–126. https://doi.org/10.1111/rssb.12349.
Cho, Juhee, Donggyu Kim, and Karl Rohe. 2018. “Asymptotic Theory for Estimating the Singular Vectors and Values of a Partially-Observed Low Rank Matrix with Noise.” Statistica Sinica. https://doi.org/10.5705/ss.202016.0205.
———. 2019. “Intelligent Initialization and Adaptive Thresholding for Iterative Matrix Completion: Some Statistical and Algorithmic Theory for Adaptive-Impute.” Journal of Computational and Graphical Statistics 28 (2): 323–33. https://doi.org/10.1080/10618600.2018.1518238.
Rohe, Karl, Jun Tao, Xintian Han, and Norbert Binkiewicz. 2018. “A Note on Quickly Sampling a Sparse Matrix with Low Rank Expectation.” Journal of Machine Learning Research 19: 1–13.
Rohe, Karl, and Muzhe Zeng. 2022+. “Vintage Factor Analysis with Varimax Performs Statistical Inference.” arXiv:2004.05387 [Math, Stat], 2022+. https://arxiv.org/abs/2004.05387.