Hey! I'm Anushka, nice to meet you!

I’m a data engineer with 3+ years of experience turning messy data into meaningful insights. I’ve worked across healthcare, fintech, and telecom—building scalable pipelines, migrating legacy systems to modern cloud platforms like AWS and Fabric, and designing clean, reliable data models that power hundreds of dashboards. I love working with Python, PySpark, Kafka, and dbt, and I’m big on data quality—shoutout to Great Expectations for helping me keep things clean!

Outside of coding, I enjoy writing about data workflows on Medium and experimenting with mini projects (like my PDF summarizer with ChatGPT!). Whether it’s building real-time systems with Kinesis or optimizing ETL pipelines, I’m always up for solving hard data problems and making life easier for analysts, engineers, and stakeholders.

Personal Projects

Interactive GenAI PDF Summarization with ChatGPT [Code]

  • Built GenAI PDF-GPT, a web app using OpenAI’s ChatGPT API & Retrieval-Augmented Generation (RAG) for document analysis
  • Leveraged ChatGPT’s summarization capabilities & LangChain to extract insights from PDFs & designed a chat interface for users (see the sketch below)
  • Used: Python, OpenAI API, LangChain, RAG, Flask
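
A rough sketch of how that retrieval flow can be wired, assuming LangChain’s community PDF loader, an in-memory FAISS index, and the langchain-openai bindings; the function names, model choice, and file paths here are illustrative, not the app’s actual code:

```python
# Minimal RAG sketch (assumes langchain, langchain-community, langchain-openai,
# faiss-cpu, pypdf, and an OPENAI_API_KEY in the environment).
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

def build_index(pdf_path: str) -> FAISS:
    # Load the PDF and split it into overlapping chunks for retrieval.
    pages = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    return FAISS.from_documents(splitter.split_documents(pages), OpenAIEmbeddings())

def answer_question(index: FAISS, question: str) -> str:
    # Retrieve the most relevant chunks and ground the model's answer in them.
    context = "\n\n".join(doc.page_content for doc in index.similarity_search(question, k=4))
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # placeholder model name
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content

if __name__ == "__main__":
    index = build_index("sample.pdf")  # placeholder path
    print(answer_question(index, "Summarize this document in three bullet points."))
```

In the Flask app, the chat route would just call answer_question once per user turn and return the reply to the interface.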

Real-time Customer Sentiment Analysis [Code]

  • Built a real-time customer feedback analysis tool using Kafka, Spark & MLlib, with 90% accuracy in sentiment classification (see the streaming sketch below)
  • Stored processed data in Delta Lake for efficient querying & analysis, enabling real-time monitoring of sentiment trends
  • Used: Python, Kafka, Spark, MLlib
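
A condensed sketch of the streaming path, assuming Spark Structured Streaming with the Kafka connector, a pre-trained MLlib pipeline saved to disk, and the Delta Lake package; broker, topic, and path names are placeholders:

```python
# Kafka -> Spark Structured Streaming -> MLlib scoring -> Delta Lake (sketch).
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("feedback-sentiment").getOrCreate()

# Read raw feedback events from a Kafka topic as strings.
feedback = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "customer-feedback")
    .load()
    .select(F.col("value").cast("string").alias("text"))
)

# Score each message with a pre-trained MLlib pipeline
# (e.g. tokenizer -> TF-IDF -> logistic regression) loaded from disk.
model = PipelineModel.load("models/sentiment_pipeline")
scored = model.transform(feedback).select("text", "prediction")

# Append the scored stream to a Delta table that the trend dashboards query.
query = (
    scored.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/sentiment")
    .start("/tmp/delta/sentiment_scores")
)
query.awaitTermination()
```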

Predicting Chess Games [Code]

  • Designed an automated ETL pipeline using Airflow to pull 3M rows of chess game data from the Chess.com API, staging the raw data in GCP Cloud Storage buckets (see the DAG sketch below)
  • Moved the staged data from the buckets into MongoDB clusters, then connected it to a Databricks cluster and used Spark MLlib to build classification models for game outcomes, surpassing a traditional Elo benchmark’s accuracy by 2.11%
  • Used: Airflow, Spark, GCP, Databricks, MongoDB
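
A simplified sketch of what such an ingestion DAG can look like with Airflow 2.x’s TaskFlow API; the schedule, player, and bucket name are placeholders rather than the project’s real configuration:

```python
# Monthly pull of Chess.com game archives into a GCS staging bucket (sketch).
import json
from datetime import datetime

import requests
from airflow.decorators import dag, task
from google.cloud import storage

@dag(schedule="@monthly", start_date=datetime(2024, 1, 1), catchup=False)
def chess_games_ingest():

    @task
    def fetch_monthly_games(username: str, ds: str = None) -> str:
        # Chess.com exposes each player's monthly game archive as public JSON.
        year, month = ds[:4], ds[5:7]
        url = f"https://api.chess.com/pub/player/{username}/games/{year}/{month}"
        resp = requests.get(url, headers={"User-Agent": "chess-etl"}, timeout=30)
        resp.raise_for_status()
        return json.dumps(resp.json().get("games", []))

    @task
    def load_to_gcs(games_json: str, ds: str = None) -> None:
        # Stage the raw payload in GCS before the MongoDB / Databricks steps.
        bucket = storage.Client().bucket("chess-etl-raw")  # placeholder bucket
        bucket.blob(f"games/{ds}.json").upload_from_string(games_json)

    load_to_gcs(fetch_monthly_games("hikaru"))  # placeholder player

chess_games_ingest()
```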

Multimodal Resume Analysis [Code]

  • Developed an advanced resume analyzer capable of extracting and analyzing text, images, and audio data from resumes and multimedia presentations
  • Engineered a TensorFlow-based pipeline to preprocess multimodal data for real-time classification, processing up to 500 resumes/minute (see the sketch below)
  • Used: Python, TensorFlow, PyTorch
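
A toy sketch of the two-branch idea (audio omitted for brevity), assuming Keras’ TextVectorization on the text side and a small CNN standing in for the image side; the shapes, vocabulary size, and layer choices are illustrative only:

```python
import tensorflow as tf

def load_pair(text, image_path, label):
    # Decode and normalize the page image; the text column stays a UTF-8 string.
    # (Mapped over a tf.data.Dataset of (text, image_path, label) rows at ingest time.)
    image = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    return (text, image), label

# The vocabulary would be adapted on the real resume corpus; a tiny dummy corpus
# keeps this sketch self-contained.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=256)
vectorizer.adapt(["experienced data engineer", "built streaming pipelines with spark"])

text_in = tf.keras.Input(shape=(1,), dtype=tf.string)
image_in = tf.keras.Input(shape=(224, 224, 3))

# Text branch: vectorize -> embed -> pool; image branch: small CNN -> pool.
t = tf.keras.layers.GlobalAveragePooling1D()(
    tf.keras.layers.Embedding(20000, 64)(vectorizer(text_in)))
i = tf.keras.layers.GlobalAveragePooling2D()(
    tf.keras.layers.Conv2D(32, 3, activation="relu")(image_in))

out = tf.keras.layers.Dense(1, activation="sigmoid")(tf.keras.layers.concatenate([t, i]))
model = tf.keras.Model([text_in, image_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```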

Blogs

  • Building Scalable ETL Pipelines: Lessons from Meta for Growing Businesses [Medium link]

  • Introduction to real-time data pipelines: Kafka & Spark [Medium link]

  • Behind the Scenes at Lyft: How Airflow and Flyte Power Data Workflows [Medium link]

  • Follow me for more on Medium!

Publications

  • Applications of IoT in E-toilet Management System, International Journal of Emerging Technology and Research, ISSN: 2349-5162 [Journal link]

Professional Experience

Data Engineer (Data Platform Team) @ Sanford Health

  • Built PySpark notebooks to unify data from 7+ legacy systems into a Fabric Lakehouse, centralizing reporting & eliminating data silos
  • Developed 50+ standardized schemas on Fabric, ensuring data consistency & supporting scalable analytics infrastructure
  • Led data modelling to build semantic models linking tables with business logic, powering 300+ Power BI reports company-wide
  • Built automated data pipelines with scheduled triggers, increasing data refresh rate by 6x & enabling near real-time reporting
  • Built a PySpark & Great Expectations data quality framework that kept error rates below 0.001%, assuring data integrity during migration (see the sketch below)
  • Resolved 100+ ServiceNow tickets for custom reporting, improving productivity & reducing time-to-insight for business users
  • Supported migration of data from 7+ legacy systems to a Redshift data warehouse, reducing annual licensing costs while improving data access speed by 3x through centralized storage
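
A condensed sketch of the quality-gate pattern behind those bullets, using Great Expectations’ older SparkDFDataset wrapper; the paths, columns, and table names are illustrative, not the actual schemas:

```python
# Read a legacy extract, standardize it, and gate the Lakehouse write on data quality.
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.appName("legacy-unify").getOrCreate()

# Align one legacy source to the standardized schema (illustrative columns).
raw = spark.read.parquet("Files/legacy/system_a/patients")
patients = raw.selectExpr("id AS patient_id", "CAST(dob AS date) AS birth_date", "facility")

# A handful of expectations act as the gate before anything lands downstream.
checks = SparkDFDataset(patients)
checks.expect_column_values_to_not_be_null("patient_id")
checks.expect_column_values_to_be_unique("patient_id")
checks.expect_column_values_to_not_be_null("birth_date")

result = checks.validate()
if result.success:
    # Only validated batches reach the table that the semantic models read from.
    patients.write.mode("append").saveAsTable("silver.patients")
else:
    raise ValueError(f"Data quality gate failed: {result.statistics}")
```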

Student Research Assistant @ University of Massachusetts, Amherst

  • Developed monitoring application using Python & Airflow for Agriculture School file tracking, improving overall data visibility
  • Automated archival of legacy files in HANA transport folder using Airflow, optimizing storage costs & improving performance

Data Engineer @ AK Fiserv

  • Set up 12+ CI/CD workflows via GitHub Actions for deployment, reducing config errors by 30% & accelerating product delivery
  • Built PySpark & Great Expectations framework validating 1M+ daily records with 99.9% accuracy, reducing audit risk for reporting
  • Configured AWS security infrastructure (IAM, S3 policies, KMS), strengthening financial data protection & reducing breach risk (sketch below)
  • Created 8+ compliance monitoring dashboards with audit trails, enabling early risk detection & faster regulatory response
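
A minimal sketch of the kind of S3 hardening baseline that work involved, via boto3; the bucket name and KMS key ARN are placeholders, and the real setup lived in reviewed infrastructure config rather than ad-hoc scripts:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "financial-data-prod"  # placeholder
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE"  # placeholder

# Require SSE-KMS for every object written to the bucket.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": KMS_KEY_ARN,
            },
            "BucketKeyEnabled": True,
        }]
    },
)

# Block all public access at the bucket level.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```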

Data Engineer (GET) @ Vodafone

  • Extended ETL/ELT pipelines (Python, dbt, Teradata) to integrate 20+ retail data sources, enabling customer LTV & churn analytics
  • Built 30+ reusable dbt models & CI/CD pipelines, enabling 200+ users to generate insights 3x faster through intuitive datasets
  • Automated 50+ schema drift & lineage checks using Great Expectations, ensuring zero production incidents over 3+ months

Data Science Intern @ Iha Consulting

  • Engineered a deep learning model for brain tumor segmentation using ResNet-40, VGG-16, etc., achieving 98.3% accuracy

Machine Learning Intern @ Yibe

  • Developed an image captioning model using encoder-decoder architecture and transfer learning in TensorFlow, achieving a BLEU-1 score of 0.66 on the Flickr30K dataset by integrating InceptionV3 for precise image feature encoding
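
A compressed sketch of that encoder-decoder wiring: a frozen InceptionV3 encodes the image, an embedding + LSTM handles the partial caption, and the merged features predict the next word. Vocabulary size and sequence length here are illustrative, not the actual Flickr30K training setup:

```python
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 8000, 34, 256  # illustrative values

# Image encoder: pretrained InceptionV3 without its classification head, frozen.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg", weights="imagenet")
cnn.trainable = False
image_in = tf.keras.Input(shape=(299, 299, 3))
img_feat = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(cnn(image_in))

# Caption decoder: embed the partial caption, run an LSTM, merge with image features.
caption_in = tf.keras.Input(shape=(MAX_LEN,))
seq = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
seq = tf.keras.layers.LSTM(256)(seq)

merged = tf.keras.layers.add([img_feat, seq])
next_word = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(
    tf.keras.layers.Dense(256, activation="relu")(merged))

model = tf.keras.Model([image_in, caption_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```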