Projects

Kaggle Competitions Expert (top 1% globally, top 10 Kaggler in Boston)

  • Led a team that developed deep learning CNN models with transfer learning to detect
    oil palm plantations in satellite imagery, reaching a test AUC of 0.999. WiDS Cambridge Datathon Workshop winner.
  • Built deep learning CNN models (fastai) on the PCam 277k dataset to predict whether the center 32x32px region of a patch contains at least one pixel of tumor tissue. PrLB AUC: 0.97.
  • Built ensemble models on 8.9 million samples to predict whether a machine will soon be hit with malware. Won a Silver medal in Microsoft Malware Prediction with a PrLB AUC of 0.64.
  • Built an ensemble model to predict whether each applicant will repay a loan. Won a Bronze medal in Home Credit Default Risk. PrLB AUC: 0.79 vs. best AUC of 0.80.
  • Built an ensemble model to predict whether a driver will initiate an auto insurance claim in the next year. Won a Bronze medal in Porto Seguro’s Safe Driver Prediction.
  • Built multiclass classification models (XGBoost and Keras) using NLP to classify genetic mutations based on clinical evidence (text) from Memorial Sloan Kettering Cancer Center.

Analytics and its Application to Healthcare

Designing Metabolic Division of Labor in Microbial Communities

Microbes face a trade-off between being metabolically independent and relying on neighboring organisms for the supply of some essential metabolites. This balance of conflicting strategies affects microbial community structure and dynamics, with important implications for microbiome research and synthetic ecology. A “gedanken” (thought) experiment to investigate this trade-off would involve monitoring the rise of mutual dependence as the number of metabolic reactions allowed in an organism is increasingly constrained. The expectation is that below a certain number of reactions, no individual organism would be able to grow in isolation and cross-feeding partnerships and division of labor would emerge. We implemented this idealized experiment using in silico genome-scale models. In particular, we used mixed-integer linear programming to identify trade-off solutions in communities of Escherichia coli strains. The strategies that we found revealed a large space of opportunities in nuanced and nonintuitive metabolic division of labor, including, for example, splitting the tricarboxylic acid (TCA) cycle into two separate halves. The systematic computation of possible solutions in division of labor for 1-, 2-, and 3-strain consortia resulted in a rich and complex landscape. This landscape displayed a nonlinear boundary, indicating that the loss of an intracellular reaction was not necessarily compensated for by a single imported metabolite. Different regions in this landscape were associated with specific solutions and patterns of exchanged metabolites. Our approach also predicts the existence of regions in this landscape where independent bacteria are viable but are outcompeted by cross-feeding pairs, providing a possible incentive for the rise of division of labor.

Predicting Chronic Disease Hospitalizations from Electronic Health Records: An Interpretable Classification Approach

Urban living in modern large cities has significant adverse effects on health, increasing the risk of several chronic diseases. We focus on the two leading clusters of chronic disease, heart disease and diabetes, and develop data-driven methods to predict hospitalizations due to these conditions. We base these predictions on the patients’ medical history, recent and more distant, as described in their Electronic Health Records (EHR). We formulate the prediction problem as a binary classification problem and consider a variety of machine learning methods, including kernelized and sparse Support Vector Machines (SVM), sparse logistic regression, and random forests. To strike a balance between accuracy and interpretability of the prediction, which is important in a medical setting, we propose two novel methods: K-LRT, a likelihood ratio test-based method, and a Joint Clustering and Classification (JCC) method which identifies hidden patient clusters and adapts classifiers to each cluster. We develop theoretical out-of-sample guarantees for the latter method. We validate our algorithms on large datasets from the Boston Medical Center, the largest safety-net hospital system in New England.