Here are some projects I have pursued over the last 3 years. I code primarily in Python, R and SQL.
Predicting the mechanism of action of anti-tuberculosis drugs using machine learning
Skills: Data wrangling, Logistic regression, LASSO, Random forests, Recursive feature elimination, Imbalanced datasets
Tuberculosis causes over a million deaths every year, and 20% of cases exhibit resistance to one or more drugs. Predicting the mechanism of action of anti-tuberculosis drugs speeds up drug discovery and can help design drugs that are less likely to induce resistance. As a postdoctoral associate at the Ehrt & Schnappinger lab, I worked on predicting how novel anti-tuberculosis drugs work by measuring how effective each drug is against different kinds (strains) of tuberculosis. As a secondary goal, I helped reduce data acquisition costs by minimising the number of strains required to make predictions.
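One common way to handle imbalanced classes, such as mechanisms of action with very few known drugs, is to weight each class inversely to its frequency. The sketch below is only an illustration of that idea, not necessarily the approach used in this project; the label names and counts are hypothetical. The formula is the same heuristic scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical mechanism-of-action labels for ten compounds (made-up data).
moa_labels = ["cell wall"] * 8 + ["protein synthesis"] * 2
weights = balanced_class_weights(moa_labels)
```

Here the rarer class receives the larger weight, so errors on it count for more during training.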
Predicting customer churn for a lawn services company
Skills: Solving a real-life business problem, Survival analysis, Model ensembling, FastAPI, AWS
As an Insight Data Science Fellow, I consulted for a lawn services startup that suffers from a 30% annual churn rate*. I provided them with their first customer-level churn prediction model. I used model ensembling to combine predictions from two sub-models, improving model performance. To deal with seasonality, I built two separate models, each using a different churn window. I deployed the model as an API for rapid querying and additionally generated actionable insights for improving customer retention efforts.
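A minimal sketch of the ensembling idea, assuming the two sub-models each output a churn probability per customer and are combined by a weighted average. The function name, the weight, and the scores are hypothetical; the actual combination rule used in the project may differ.

```python
from typing import List, Sequence


def blend_probabilities(
    p_a: Sequence[float],
    p_b: Sequence[float],
    weight_a: float = 0.5,
) -> List[float]:
    """Weighted average of churn probabilities from two sub-models."""
    if len(p_a) != len(p_b):
        raise ValueError("both sub-models must score the same customers")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]


# Hypothetical scores for three customers from two sub-models.
model_a_scores = [0.10, 0.80, 0.40]
model_b_scores = [0.30, 0.60, 0.40]
blended = blend_probabilities(model_a_scores, model_b_scores, weight_a=0.5)
```

In practice the blending weight would be tuned on held-out data, not fixed at 0.5.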
*Since the results of this project are public, the name of the startup is masked.
Sense and sensitivity analysis
Skills: Data visualization, causal inference, open source software
Making causal inferences from observational data is challenging because the conclusions are vulnerable to unobserved confounding. This project, a collaboration with Victor Veitch, developed a flexible new method for sensitivity analysis, a technique that estimates how strongly a confounder would have to be related to both treatment assignment and outcome to change a study's conclusions. I tested this new method on real-life datasets using tree-based models and converted the mathematical theory into a practical data visualization tool.
Jigsaw Unintended Bias in Toxicity Classification
Skills: Natural language processing, deep learning, word embeddings, bias and fairness in machine learning
Jigsaw Unintended Bias in Toxicity Classification was a Kaggle competition to classify comments by toxicity while reducing unintended bias. In an earlier iteration of this competition, models learnt to associate identity-related words (such as ‘gay’) with toxicity. To correct this issue, this version of the competition strongly penalized identity-associated false positives.
I used deep neural language models to address this problem. I modified Google’s BERT model to accommodate sample weights. Additionally, I used word embeddings with a Bi-LSTM. My final model blended the outputs of the BERT and Bi-LSTM models.
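To illustrate what accommodating sample weights means for the training loss, here is a minimal pure-Python sketch of a sample-weighted binary cross-entropy. It is not the modified BERT code itself; the weighting scheme (up-weighting a non-toxic comment that mentions an identity term) and all numbers are assumptions made for the example.

```python
import math
from typing import Sequence


def weighted_bce(
    y_true: Sequence[float],
    y_prob: Sequence[float],
    sample_weight: Sequence[float],
    eps: float = 1e-7,
) -> float:
    """Sample-weighted binary cross-entropy, normalised by the total weight."""
    if not (len(y_true) == len(y_prob) == len(sample_weight)):
        raise ValueError("inputs must have the same length")
    total_weight = sum(sample_weight)
    if total_weight <= 0:
        raise ValueError("sample weights must sum to a positive value")
    total = 0.0
    for y, p, w in zip(y_true, y_prob, sample_weight):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / total_weight


# Hypothetical batch: the second comment is non-toxic but was scored 0.7,
# i.e. a false positive; assume it mentions an identity term and is up-weighted.
labels = [1.0, 0.0, 0.0]
probs = [0.9, 0.7, 0.1]
uniform_loss = weighted_bce(labels, probs, [1.0, 1.0, 1.0])
upweighted_loss = weighted_bce(labels, probs, [1.0, 3.0, 1.0])
```

With the larger weight on the false positive, the loss increases, so the optimiser is pushed harder to correct exactly the kind of error the competition penalized.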
MtbTnDB: A webapp for querying transposon sequencing data
Skills: Interactive web-apps (Dash/Plotly), data visualization
Gene essentiality profiles are useful for predicting gene function. However, these profiles are buried in the supplementary materials of dozens of research papers. In collaboration with Adrian Jinich and Michael DeJesus, I built an interactive web app using Dash to aggregate these data and allow easy querying. I used Plotly to build data visualization tools that enable rapid data exploration.