Arkesh Das | Portfolio

Projects

MSU Curriculum Maps (Data Science Capstone)

Pipeline for transforming university curriculum data into prerequisite networks for structural analysis.

Python, Julia, CurricularAnalytics

Problem

University curricula are difficult to analyze due to inconsistent data formats and complex prerequisite structures, limiting data-informed curriculum design.

Approach

Built a reproducible pipeline to clean registrar and major requirement data, convert it into structured prerequisite graphs, and compute metrics such as blocking factor and delay factor.
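
As a rough illustration of the two metrics named above, the sketch below computes them on a toy prerequisite graph. It assumes the usual definitions: a course's blocking factor is the number of courses reachable from it, and its delay factor is the number of courses on the longest prerequisite chain passing through it. The course names and the `UNLOCKS` mapping are hypothetical, and this is plain Python, not the CurricularAnalytics toolbox or the project's own code.

```python
from functools import lru_cache

# Hypothetical toy curriculum: course -> courses it directly unlocks.
UNLOCKS = {
    "CALC1": ["CALC2", "STATS1"],
    "CALC2": ["CALC3"],
    "STATS1": ["STATS2"],
    "CALC3": ["STATS2"],
    "STATS2": [],
}

# Reverse view of the same graph: course -> its direct prerequisites.
PREREQS = {c: [p for p in UNLOCKS if c in UNLOCKS[p]] for c in UNLOCKS}


def blocking_factor(course):
    """Number of distinct courses reachable from `course`."""
    seen = set()
    stack = list(UNLOCKS[course])
    while stack:
        nxt = stack.pop()
        if nxt not in seen:
            seen.add(nxt)
            stack.extend(UNLOCKS[nxt])
    return len(seen)


@lru_cache(maxsize=None)
def longest_down(course):
    """Courses on the longest chain that starts at `course`, inclusive."""
    return 1 + max((longest_down(c) for c in UNLOCKS[course]), default=0)


@lru_cache(maxsize=None)
def longest_up(course):
    """Courses on the longest chain that ends at `course`, inclusive."""
    return 1 + max((longest_up(p) for p in PREREQS[course]), default=0)


def delay_factor(course):
    """Courses on the longest chain through `course` (counted once)."""
    return longest_up(course) + longest_down(course) - 1


for name in UNLOCKS:
    print(name, blocking_factor(name), delay_factor(name))
```

In this toy graph the entry course blocks every other course, which is the kind of bottleneck the metrics are meant to surface.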

Results & Impact

Enabled analysis of curriculum complexity and bottlenecks, providing a foundation for data-driven curriculum design and future institutional tools.

View Repo

Medical Admissions Gap Year Analysis

Analysis of AAMC data showing how gap years impact physician workforce timing and system efficiency.

Python, pandas, Jupyter

Problem

Gap years have become increasingly common in medical school admissions, but their broader impact on workforce timing and healthcare systems is not well quantified.

Approach

Analyzed national admissions and matriculation trends using AAMC data. Developed a “physician-years” framework to estimate how delays in training affect total workforce capacity over time.
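
One way to read the "physician-years" idea is as simple arithmetic: each matriculant contributes the years between finishing training and retiring, so a later start removes years from the total. The sketch below is a minimal version under that assumption. The function names, the default training length, the retirement age, and the cohort size are all hypothetical placeholders, not values from the AAMC data or from the project.

```python
def physician_years(cohort_size, matriculation_age,
                    training_years=7, retirement_age=65):
    """Total practice years for a cohort, under fixed-career assumptions.

    training_years and retirement_age are illustrative defaults only.
    """
    practice_years = retirement_age - (matriculation_age + training_years)
    return cohort_size * max(0, practice_years)


def lost_physician_years(cohort_size, baseline_age, delayed_age, **assumptions):
    """Practice years lost when a cohort matriculates later than baseline."""
    return (physician_years(cohort_size, baseline_age, **assumptions)
            - physician_years(cohort_size, delayed_age, **assumptions))


# Hypothetical cohort of 1,000 students matriculating at 24 instead of 22.
print(lost_physician_years(1000, baseline_age=22, delayed_age=24))
```

Under these assumptions the loss scales linearly with both cohort size and delay, which is why small shifts in average matriculation age add up across a national class.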

Results & Impact

Showed that small increases in matriculation age scale into significant losses in total physician-years. Provides a systems-level perspective on how admissions incentives shape healthcare supply.

View Repo

Measles Comeback, Vaccination Gaps and Outbreak Risk

Data analysis linking vaccination gaps and exemption rates to measles outbreak risk in the United States.

Python, pandas, matplotlib

Problem

Measles was declared eliminated in the U.S. in 2000, yet outbreaks have re-emerged. Understanding how vaccination coverage and exemption rates contribute to outbreak risk is critical for public health policy.

Approach

Integrated datasets from the CDC, WHO, Census, and vaccination surveys. Computed incidence metrics, normalized case counts, and analyzed trends across time and geography. Built visualizations comparing U.S., Europe, and global patterns.
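
Normalizing case counts usually means dividing by population so that regions of different sizes can be compared. The sketch below shows one common form, cases per million residents. The function name, the record layout, and every number in `records` are hypothetical and are not taken from the CDC, WHO, or Census data used in the project.

```python
def incidence_per_million(cases, population):
    """Cases per one million residents for a single region and year."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 1_000_000 / population


# Hypothetical records: (region, year, reported cases, population).
records = [
    ("Region A", 2019, 50, 2_000_000),
    ("Region B", 2019, 50, 20_000_000),
]

for region, year, cases, population in records:
    rate = incidence_per_million(cases, population)
    print(f"{region} {year}: {rate:.1f} cases per million")
```

The two made-up regions report the same raw count, but their rates differ tenfold, which is the distinction raw case totals hide.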

Results & Impact

Identified a strong relationship between exemption rates and outbreak probability, showing how localized drops in vaccination coverage can drive resurgence. Demonstrates how public health data can inform prevention strategies.

View Repo

NYC 311 Service Request Analysis (AWS Pipeline)

Cloud-based pipeline to model complaint resolution time and analyze service patterns across NYC agencies.

AWS (S3, Athena, SageMaker), SQL, Python, scikit-learn

Problem

City service requests vary widely in resolution time, but predicting delays at intake and understanding service patterns is challenging without scalable infrastructure.

Approach

Built a pipeline using S3 for storage, Athena for SQL-based feature engineering, and SageMaker for modeling. Restricted features to information available at complaint creation time and trained baseline regression models.
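
To make "features at complaint creation time" concrete, the sketch below derives a few fields from the intake timestamp and builds a naive reference predictor that returns the median resolution time per complaint type. The field names (`complaint_type`, `hours_to_close`), the sample rows, and the median predictor are assumptions for illustration. It runs locally in plain Python and stands in for, rather than reproduces, the Athena queries and SageMaker regression models.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def intake_features(created_at, complaint_type, agency):
    """Features that are known the moment a request is filed."""
    ts = datetime.fromisoformat(created_at)
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),          # Monday is 0
        "is_weekend": ts.weekday() >= 5,
        "complaint_type": complaint_type,
        "agency": agency,
    }


def fit_median_baseline(rows):
    """Return a predictor giving the median hours to close per type.

    Unseen complaint types fall back to the overall median.
    """
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["complaint_type"]].append(row["hours_to_close"])
    overall = median(row["hours_to_close"] for row in rows)
    table = {ctype: median(hours) for ctype, hours in by_type.items()}
    return lambda complaint_type: table.get(complaint_type, overall)


# Hypothetical training rows.
history = [
    {"complaint_type": "Noise", "hours_to_close": 2},
    {"complaint_type": "Noise", "hours_to_close": 4},
    {"complaint_type": "Noise", "hours_to_close": 6},
    {"complaint_type": "Pothole", "hours_to_close": 48},
    {"complaint_type": "Pothole", "hours_to_close": 72},
]
predict = fit_median_baseline(history)
print(intake_features("2024-01-06T14:30:00", "Noise", "NYPD"))
print(predict("Noise"), predict("Pothole"), predict("Graffiti"))
```

Keeping features limited to what is known at intake avoids leaking information, such as closure notes, that would not exist when a prediction is actually needed.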

Results & Impact

Improved prediction over baseline and demonstrated how cloud-based workflows enable scalable analysis of public service data. Highlights how structured pipelines support real-world decision-making.

View Repo

Curser: Stop Embarrassing Names Before They Launch

Interactive app that detects cross-language phonetic collisions to prevent unintentionally offensive or ambiguous naming.

Python, Whisper (speech-to-text), PanPhon (phonetic similarity), eSpeak (IPA conversion), ElevenLabs API, Streamlit

Problem

Naming products, brands, or ideas is difficult because abstract names can unintentionally resemble offensive or misleading words in other languages. This creates reputational risk, especially in global contexts, and is not easily detectable using standard tools.

Approach

Built a pipeline that converts spoken or typed input into IPA phonemes and compares them against a multilingual word database using phonetic distance metrics. Integrated Whisper for transcription, PanPhon for similarity scoring, and optional text-to-speech for validation. Deployed as an interactive Streamlit app supporting live audio and text input.
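
The comparison step can be pictured as a distance between two IPA strings, with close matches flagged for review. The sketch below uses a plain Levenshtein edit distance over IPA symbols, normalized by length, as a simplified stand-in for PanPhon's feature-based scoring. The function names, the threshold, and the lexicon entries are hypothetical, and the input is assumed to be already converted to IPA.

```python
def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, 1):
        curr = [i]
        for j, sym_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + (sym_a != sym_b),   # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to the range 0..1 by the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def flag_collisions(candidate_ipa, lexicon, threshold=0.34):
    """Return (distance, language, word) for entries within the threshold."""
    scored = [
        (normalized_distance(candidate_ipa, ipa), lang, word)
        for lang, word, ipa in lexicon
    ]
    return sorted(hit for hit in scored if hit[0] <= threshold)


# Hypothetical lexicon entries: (language code, word, IPA transcription).
lexicon = [
    ("de", "Mist", "mɪst"),
    ("xx", "placeholder", "kalo"),
]
print(flag_collisions("mist", lexicon))
```

A symbol-level edit distance treats every substitution as equally costly, whereas a feature-based metric can score near-identical vowels as closer than unrelated sounds. It also treats combining diacritics as separate symbols, so this sketch is only a coarse approximation.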

Results & Impact

Demonstrated how phonetic similarity can be used as a practical screening tool for naming decisions. The project also introduced reproducible environment setup, API integration with secure key management, and deployment of an interactive application accessible to non-technical users.

View Repo

Immune Response GWAS FDR Analysis

Statistical genetics project comparing false discovery rate correction methods in an immune response GWAS.

R, GWAS, Statistical testing, Data visualization

Problem

Genome-wide association studies test large numbers of genetic variants, creating a major multiple testing problem. Different false discovery rate methods can produce different sets of significant SNPs, which affects how researchers interpret genetic associations.

Approach

Applied multiple FDR correction methods, including Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), Benjamini-Krieger-Yekutieli (BKY), and q-value approaches, to immune response GWAS results from the Milieu Intérieur dataset. Compared how method choice changed the number and identity of significant SNP-level findings.
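
To show how method choice alone can change the result, the sketch below implements two of the step-up procedures, Benjamini-Hochberg and Benjamini-Yekutieli, in plain Python and applies both to the same p-values. The function name and the p-values are made up for illustration and do not come from the Milieu Intérieur results. The project itself was done in R, so this is a Python restatement rather than its code.

```python
def fdr_reject(pvalues, alpha=0.05, method="bh"):
    """Step-up FDR procedure; returns one reject flag per input p-value.

    method="bh": Benjamini-Hochberg.
    method="by": Benjamini-Yekutieli, which divides each threshold by the
    harmonic sum 1 + 1/2 + ... + 1/m to stay valid under dependence.
    """
    m = len(pvalues)
    if m == 0:
        return []
    if method == "by":
        correction = sum(1.0 / i for i in range(1, m + 1))
    elif method == "bh":
        correction = 1.0
    else:
        raise ValueError("method must be 'bh' or 'by'")

    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, idx in enumerate(order, 1):
        if pvalues[idx] <= rank * alpha / (m * correction):
            largest_rank = rank  # keep the largest rank that passes

    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five SNPs.
pvals = [0.5, 0.001, 0.045, 0.012, 0.025]
print("BH:", fdr_reject(pvals, method="bh"))
print("BY:", fdr_reject(pvals, method="by"))
```

On these made-up values BH keeps three SNPs and the more conservative BY keeps one, a small-scale version of the divergence the project examines.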

Results & Impact

Showed that FDR method choice can meaningfully affect which genetic associations appear significant. The project emphasizes that statistical correction is not just a technical step, but a decision that shapes biological interpretation.

View Repo

About

Clinical Research Healthcare Data Science Reproducible Workflows

I’m a senior at Michigan State University studying Data Science and Biotechnology, and I will begin medical school at the University of Toledo College of Medicine and Life Sciences in 2026 through the MedStart program.

My work focuses on using data to understand systems, especially where those systems impact people. I am interested in clinical research, EMR-based analysis, and how data can improve decision-making in healthcare settings.

Across my projects, I’ve explored public health trends, medical admissions systems, and service infrastructure, with an emphasis on building reproducible workflows and communicating results clearly.

Long term, I want to work at the intersection of medicine and data, contributing to clinical research that is both technically rigorous and directly impactful.

Writing

June 30, 2026

Love is Dead (in Medicine)

There is no such thing anymore as falling in love with neurosurgery.

Read Essay

January 2026

Rising Matriculation Age: The Cost of Delay

Research paper examining how medical school admissions incentives normalize gap years and how delayed matriculation affects physician workforce timing.

Epithelial Histology

This is another post I wrote my senior year of high school. I was taking an anatomy and physiology class at the time. Disclaimer: if you don't want to see pictures of my cut-up finger, don't click on the link below.

Words Are Like Ice Cream Flavors

This was a blog post that I wrote during my senior year of high school for a creative writing class. I've been told that I have a way with words, but the words I choose are not always the ones I should be using. I know so many words that it takes a long time to decide which ones to use, much like standing in the Dairy Store and holding up the line because there are so many options. I edited this text slightly so that it doesn't sound like it's written by a high schooler, but the original version is still live at the link below.

Contact