Hey! I'm Anushka, nice to meet you!

I’m a data engineer with 3+ years of experience turning messy data into meaningful insights. I’ve worked across healthcare, fintech, and telecom—building scalable pipelines, migrating legacy systems to modern cloud platforms like AWS and Fabric, and designing clean, reliable data models that power hundreds of dashboards. I love working with Python, PySpark, Kafka, and dbt, and I’m big on data quality—shoutout to Great Expectations for helping me keep things clean!

Outside of coding, I enjoy writing about data workflows on Medium and experimenting with mini projects (like my PDF summarizer with ChatGPT!). Whether it’s building real-time systems with Kinesis or optimizing ETL pipelines, I’m always up for solving hard data problems and making life easier for analysts, engineers, and stakeholders.

Personal Projects

Interactive GenAI PDF Summarization with ChatGPT [Code]

  • Built GenAI PDF-GPT, a web app using OpenAI’s ChatGPT API & Retrieval-Augmented Generation (RAG) for document analysis
  • Leveraged ChatGPT’s summarization capabilities & LangChain to extract insights from PDFs & designed a chat interface for users (see the sketch below)
  • Used: Python, OpenAI API, LangChain, RAG, Flask
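
A rough sketch of how that retrieval flow can be wired, assuming LangChain’s community PDF loader, an in-memory FAISS index, and the langchain-openai bindings; the function names, model choice, and file paths here are illustrative, not the app’s actual code:

```python
# Minimal RAG sketch (assumes langchain, langchain-community, langchain-openai,
# faiss-cpu, pypdf, and an OPENAI_API_KEY in the environment).
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

def build_index(pdf_path: str) -> FAISS:
    # Load the PDF and split it into overlapping chunks for retrieval.
    pages = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    return FAISS.from_documents(splitter.split_documents(pages), OpenAIEmbeddings())

def answer_question(index: FAISS, question: str) -> str:
    # Retrieve the most relevant chunks and ground the model's answer in them.
    context = "\n\n".join(doc.page_content for doc in index.similarity_search(question, k=4))
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # placeholder model name
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content

if __name__ == "__main__":
    index = build_index("sample.pdf")  # placeholder path
    print(answer_question(index, "Summarize this document in three bullet points."))
```

In the Flask app, the chat route would just call answer_question once per user turn and return the reply to the interface.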

Real-time Customer Sentiment Analysis [Code]

  • Built a real-time customer feedback analysis tool using Kafka, Spark & MLlib, with 90% accuracy in sentiment classification (see the streaming sketch below)
  • Stored processed data in Delta Lake for efficient querying & analysis, enabling real-time monitoring of sentiment trends
  • Used: Python, Kafka, Spark, MLlib
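
A condensed sketch of the streaming path, assuming Spark Structured Streaming with the Kafka connector, a pre-trained MLlib pipeline saved to disk, and the Delta Lake package; broker, topic, and path names are placeholders:

```python
# Kafka -> Spark Structured Streaming -> MLlib scoring -> Delta Lake (sketch).
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("feedback-sentiment").getOrCreate()

# Read raw feedback events from a Kafka topic as strings.
feedback = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "customer-feedback")
    .load()
    .select(F.col("value").cast("string").alias("text"))
)

# Score each message with a pre-trained MLlib pipeline
# (e.g. tokenizer -> TF-IDF -> logistic regression) loaded from disk.
model = PipelineModel.load("models/sentiment_pipeline")
scored = model.transform(feedback).select("text", "prediction")

# Append the scored stream to a Delta table that the trend dashboards query.
query = (
    scored.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/sentiment")
    .start("/tmp/delta/sentiment_scores")
)
query.awaitTermination()
```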

Predicting Chess Games [Code]

  • Designed an automated ETL pipeline using Airflow to pull 3M rows of chess game data from the Chess.com API, staging the raw data in GCP Cloud Storage buckets (see the DAG sketch below)
  • Moved the staged data from the buckets into MongoDB clusters, then connected it to a Databricks cluster and used Spark MLlib to build classification models for game outcomes, surpassing a traditional Elo benchmark’s accuracy by 2.11%
  • Used: Airflow, Spark, GCP, Databricks, MongoDB
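
A simplified sketch of what such an ingestion DAG can look like with Airflow 2.x’s TaskFlow API; the schedule, player, and bucket name are placeholders rather than the project’s real configuration:

```python
# Monthly pull of Chess.com game archives into a GCS staging bucket (sketch).
import json
from datetime import datetime

import requests
from airflow.decorators import dag, task
from google.cloud import storage

@dag(schedule="@monthly", start_date=datetime(2024, 1, 1), catchup=False)
def chess_games_ingest():

    @task
    def fetch_monthly_games(username: str, ds: str = None) -> str:
        # Chess.com exposes each player's monthly game archive as public JSON.
        year, month = ds[:4], ds[5:7]
        url = f"https://api.chess.com/pub/player/{username}/games/{year}/{month}"
        resp = requests.get(url, headers={"User-Agent": "chess-etl"}, timeout=30)
        resp.raise_for_status()
        return json.dumps(resp.json().get("games", []))

    @task
    def load_to_gcs(games_json: str, ds: str = None) -> None:
        # Stage the raw payload in GCS before the MongoDB / Databricks steps.
        bucket = storage.Client().bucket("chess-etl-raw")  # placeholder bucket
        bucket.blob(f"games/{ds}.json").upload_from_string(games_json)

    load_to_gcs(fetch_monthly_games("hikaru"))  # placeholder player

chess_games_ingest()
```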

Multimodal Resume Analysis [Code]

  • Developed an advanced resume analyzer capable of extracting and analyzing text, images, and audio data from resumes and multimedia presentations
  • Engineered a TensorFlow-based pipeline to preprocess multimodal data for real-time classification, processing up to 500 resumes/minute (see the sketch below)
  • Used: Python, TensorFlow, PyTorch
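
A toy sketch of the two-branch idea (audio omitted for brevity), assuming Keras’ TextVectorization on the text side and a small CNN standing in for the image side; the shapes, vocabulary size, and layer choices are illustrative only:

```python
import tensorflow as tf

def load_pair(text, image_path, label):
    # Decode and normalize the page image; the text column stays a UTF-8 string.
    # (Mapped over a tf.data.Dataset of (text, image_path, label) rows at ingest time.)
    image = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    return (text, image), label

# The vocabulary would be adapted on the real resume corpus; a tiny dummy corpus
# keeps this sketch self-contained.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=256)
vectorizer.adapt(["experienced data engineer", "built streaming pipelines with spark"])

text_in = tf.keras.Input(shape=(1,), dtype=tf.string)
image_in = tf.keras.Input(shape=(224, 224, 3))

# Text branch: vectorize -> embed -> pool; image branch: small CNN -> pool.
t = tf.keras.layers.GlobalAveragePooling1D()(
    tf.keras.layers.Embedding(20000, 64)(vectorizer(text_in)))
i = tf.keras.layers.GlobalAveragePooling2D()(
    tf.keras.layers.Conv2D(32, 3, activation="relu")(image_in))

out = tf.keras.layers.Dense(1, activation="sigmoid")(tf.keras.layers.concatenate([t, i]))
model = tf.keras.Model([text_in, image_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```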

Blogs

  • Building Scalable ETL Pipelines: Lessons from Meta for Growing Businesses [Medium link]

  • Introduction to real-time data pipelines: Kafka & Spark [Medium link]

  • Behind the Scenes at Lyft: How Airflow and Flyte Power Data Workflows [Medium link]

  • Follow me for more on Medium!

Publications

  • Applications of IoT in E-toilet Management System, International Journal of Emerging Technology and Research, ISSN: 2349-5162 [Journal link]

Professional Experience

Data Engineer (Data Platform Team) @ Sanford Health

  • Built PySpark notebooks to unify data from 7+ legacy systems into a Fabric Lakehouse, centralizing reporting & eliminating data silos
  • Developed 50+ standardized schemas on Fabric, ensuring data consistency & supporting scalable analytics infrastructure
  • Led data modelling to build semantic models linking tables with business logic, powering 300+ Power BI reports company-wide
  • Built automated data pipelines with scheduled triggers, increasing data refresh rate by 6x & enabling near real-time reporting
  • Built a PySpark & Great Expectations data quality framework that kept error rates below 0.001%, assuring data integrity during migration (see the sketch below)
  • Resolved 100+ ServiceNow tickets for custom reporting, improving productivity & reducing time-to-insight for business users
  • Supported migration of data from 7+ legacy systems to a Redshift data warehouse, reducing annual licensing costs while improving data access speed by 3x through centralized storage
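
A condensed sketch of the quality-gate pattern behind those bullets, using Great Expectations’ older SparkDFDataset wrapper; the paths, columns, and table names are illustrative, not the actual schemas:

```python
# Read a legacy extract, standardize it, and gate the Lakehouse write on data quality.
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.appName("legacy-unify").getOrCreate()

# Align one legacy source to the standardized schema (illustrative columns).
raw = spark.read.parquet("Files/legacy/system_a/patients")
patients = raw.selectExpr("id AS patient_id", "CAST(dob AS date) AS birth_date", "facility")

# A handful of expectations act as the gate before anything lands downstream.
checks = SparkDFDataset(patients)
checks.expect_column_values_to_not_be_null("patient_id")
checks.expect_column_values_to_be_unique("patient_id")
checks.expect_column_values_to_not_be_null("birth_date")

result = checks.validate()
if result.success:
    # Only validated batches reach the table that the semantic models read from.
    patients.write.mode("append").saveAsTable("silver.patients")
else:
    raise ValueError(f"Data quality gate failed: {result.statistics}")
```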

Student Research Assistant @ University of Massachusetts, Amherst

  • Developed monitoring application using Python & Airflow for Agriculture School file tracking, improving overall data visibility
  • Automated archival of legacy files in HANA transport folder using Airflow, optimizing storage costs & improving performance

Data Engineer @ AK Fiserv

  • Set up 12+ CI/CD workflows via GitHub Actions for deployment, reducing config errors by 30% & accelerating product delivery
  • Built PySpark & Great Expectations framework validating 1M+ daily records with 99.9% accuracy, reducing audit risk for reporting
  • Configured AWS security infrastructure (IAM, S3 policies, KMS), strengthening financial data protection & reducing breach risk (sketch below)
  • Created 8+ compliance monitoring dashboards with audit trails, enabling early risk detection & faster regulatory response
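
A minimal sketch of the kind of S3 hardening baseline that work involved, via boto3; the bucket name and KMS key ARN are placeholders, and the real setup lived in reviewed infrastructure config rather than ad-hoc scripts:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "financial-data-prod"  # placeholder
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE"  # placeholder

# Require SSE-KMS for every object written to the bucket.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": KMS_KEY_ARN,
            },
            "BucketKeyEnabled": True,
        }]
    },
)

# Block all public access at the bucket level.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```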

Data Engineer (GET) @ Vodafone

  • Extended ETL/ELT pipelines (Python, dbt, Teradata) to integrate 20+ retail data sources, enabling customer LTV & churn analytics
  • Built 30+ reusable dbt models & CI/CD pipelines, enabling 200+ users to generate insights 3x faster through intuitive datasets
  • Automated 50+ schema drift & lineage checks using Great Expectations, ensuring zero production incidents over 3+ months

Data Science Intern @ Iha Consulting

  • Engineered a deep learning model for brain tumor segmentation using ResNet-40, VGG-16, etc., achieving 98.3% accuracy

Machine Learning Intern @ Yibe

  • Developed an image captioning model using encoder-decoder architecture and transfer learning in TensorFlow, achieving a BLEU-1 score of 0.66 on the Flickr30K dataset by integrating InceptionV3 for precise image feature encoding
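
A compressed sketch of that encoder-decoder wiring: a frozen InceptionV3 encodes the image, an embedding + LSTM handles the partial caption, and the merged features predict the next word. Vocabulary size and sequence length here are illustrative, not the actual Flickr30K training setup:

```python
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 8000, 34, 256  # illustrative values

# Image encoder: pretrained InceptionV3 without its classification head, frozen.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg", weights="imagenet")
cnn.trainable = False
image_in = tf.keras.Input(shape=(299, 299, 3))
img_feat = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(cnn(image_in))

# Caption decoder: embed the partial caption, run an LSTM, merge with image features.
caption_in = tf.keras.Input(shape=(MAX_LEN,))
seq = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
seq = tf.keras.layers.LSTM(256)(seq)

merged = tf.keras.layers.add([img_feat, seq])
next_word = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(
    tf.keras.layers.Dense(256, activation="relu")(merged))

model = tf.keras.Model([image_in, caption_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```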