Optimizing Transformer Models for Production
Deep dive into quantization, pruning, and distillation techniques for deploying large language models efficiently.
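As a quick taste of one technique the post covers, here is a minimal post-training dynamic quantization sketch in PyTorch. The two-layer feed-forward stand-in and the int8 target are illustrative assumptions, not the post's exact setup:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a transformer block's feed-forward layers;
# any nn.Module containing nn.Linear submodules works the same way.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

# Post-training dynamic quantization: weights are stored as int8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)  # torch.Size([1, 768])
```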
Senior Machine Learning Engineer at Nomad Health, helping bring travel clinicians closer to the work they love and that we all depend on. Previously worked at Chegg (2018-2024) on question-answering systems and at WriteLab (2014-2018), using NLP and deep learning to help people write better.
Selected work from recent years
Built a scalable PyTorch distributed training system that cut training time by 60% across multi-GPU clusters (see the DistributedDataParallel sketch after this list).
Developed a low-latency ML system serving 100M+ daily predictions with <50ms p99 latency.
Designed an end-to-end object detection system achieving 94% mAP on a custom dataset.
Fine-tuned transformer models for multi-label text classification, reaching a 91% F1 score.
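The distributed-training entry above is PyTorch-based; a minimal DistributedDataParallel sketch looks like the following. The toy model, random data, and gloo backend are illustrative assumptions, not the actual system:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK/WORLD_SIZE; fall back to a single process
    # so the sketch also runs standalone.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # DDP wraps the model; gradients are all-reduced across ranks
    # during backward, so each process keeps identical weights.
    model = DDP(nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(3):  # toy loop on random data
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with torchrun (e.g. `torchrun --nproc_per_node=4 train.py`), each process handles its own shard of the data while gradient averaging keeps the replicas in sync.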
Thoughts on machine learning and engineering
Best practices for detecting data drift and model degradation, and for maintaining ML systems in production (a drift-check sketch follows this list).
Bridging the gap between research prototypes and production ML systems at scale.
A comprehensive guide to self-attention, multi-head attention, and their applications in modern architectures (an attention sketch follows this list).
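One common drift check in the territory of the monitoring post is a per-feature two-sample Kolmogorov-Smirnov test. A minimal SciPy sketch, where the synthetic data and the 0.01 threshold are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)   # feature values seen at training time
production = rng.normal(0.3, 1.0, size=5000)  # same feature, shifted in production

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# production distribution no longer matches the training reference.
result = ks_2samp(reference, production)
if result.pvalue < 0.01:  # illustrative alerting threshold
    print(f"Drift detected: KS={result.statistic:.3f}, p={result.pvalue:.2e}")
```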
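The attention guide's core building block is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal PyTorch sketch, with toy shapes and masking convention as assumptions:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # similarity of each query to each key
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block masked positions
    return F.softmax(scores, dim=-1) @ v  # attention weights applied to values

# Toy shapes: batch=2, seq_len=5, d_k=64
q = k = v = torch.randn(2, 5, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])
```

Multi-head attention runs several of these in parallel on learned projections of Q, K, and V, then concatenates the results.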
Always open to interesting conversations and opportunities
Senior Machine Learning Engineer at Nomad Health. Open to consulting, speaking engagements, and opportunities in ML engineering, NLP, and generative AI.