MLOps London October: Testing and Quality in Data Science Projects

Ed Shee presenting at the MLOps London event

At the October MLOps London meetup, Philip Henry and Chris Monit gave an insightful talk on best practices for bringing automated testing and quality assurance into data science projects. Their aim was to bridge the gap between academia and industry to unlock efficiency and impact.

Philip Henry and Chris Monit presenting at the MLOps London event

Automated Testing

Traditionally, data science has not employed the rigorous testing strategies that are commonplace in software engineering. However, as data scientists tackle more complex analytical pipelines and production deployments, automated testing can improve quality and productivity. Philip and Chris shared learnings from decades of experience on bringing automated testing to data science projects.

They noted that while academia focuses on iteration and reliability, industry prioritizes handling rapidly changing requirements. Automated regression testing provides several benefits:

  • Exercising production code paths without manual inspection
  • Consistently passing on each run
  • Asserting expected results

This protects against regressions as implementations evolve. A recent case study showed how England’s NHS built a healthcare decision support tool using logistic regression within a complex analytical pipeline. Even in sophisticated analyses like this, manually inspecting outputs is still common practice. Yet data frames are just data structures, and programmers have been testing data structures for years.
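To make this concrete, here is a minimal sketch (not code from the talk; the `add_bmi` transformation and its column names are hypothetical) of a regression test that exercises a pandas pipeline and asserts on its output frame just like any other data structure:

```python
import pandas as pd
import pandas.testing as pdt

def add_bmi(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical production transformation: derive BMI from weight and height.
    out = df.copy()
    out["bmi"] = out["weight_kg"] / out["height_m"] ** 2
    return out

def test_add_bmi():
    # Exercises the real code path, passes consistently on every run,
    # and asserts the expected result, with no manual inspection needed.
    input_df = pd.DataFrame({"weight_kg": [80.0], "height_m": [2.0]})
    expected = input_df.assign(bmi=[20.0])
    pdt.assert_frame_equal(add_bmi(input_df), expected)
```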

Synthetic Data

Philip and Chris advocated that data scientists should embrace automated testing with synthetic data to reduce bugs and enable quick fixes. 

Synthetic data provides many advantages: 

  • Forcing a deeper understanding of the data in order to model it realistically
  • Full control over contents and edge cases
  • No privacy concerns
  • Ability to test locally without large compute 

The main disadvantage is the upfront investment to build suitable synthetic data generators.

Effective synthetic datasets resemble production data’s structure and statistical properties while allowing arbitrary quirks to be injected as needed. Data can be shared across many tests to amortize the effort. Examples included testing pipeline logic by generating biased data and asserting expected coefficients, or testing geospatial mappings by creating synthetic locations and validating correct transformations. Deterministic synthetic data simplifies troubleshooting because tests pass or fail consistently, lowering cognitive load. Non-deterministic approaches catch corner cases better but require careful tuning of distributions.
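For instance, here is a minimal sketch of the coefficient-recovery idea (assuming NumPy and scikit-learn; this is an illustration, not the speakers’ own code): generate data from known coefficients, fit a model, and assert the coefficients are approximately recovered. The fixed seed keeps the test deterministic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_recovers_known_coefficients():
    # Deterministic synthetic data: a fixed seed means the test
    # passes or fails consistently from run to run.
    rng = np.random.default_rng(seed=42)
    true_coefs = np.array([2.0, -3.0])
    X = rng.normal(size=(5000, 2))
    # Labels sampled from a logistic model with the injected coefficients.
    prob = 1.0 / (1.0 + np.exp(-(X @ true_coefs)))
    y = rng.uniform(size=5000) < prob

    model = LogisticRegression(C=1e6).fit(X, y)  # large C: weak regularization

    # The fitted coefficients should land close to the ones we injected.
    np.testing.assert_allclose(model.coef_.ravel(), true_coefs, atol=0.3)
```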

Data scientists can borrow much from software engineering’s testing strategies to create reliable, well-tested analytical code and ensure data science quality. Embracing test-driven development will help translate models and analyses into robust production systems that perform as expected. A warm thank you to Ed Shee and Seldon for organizing this inspiring meetup, and appreciation to our hosts Rise (Barclays) for the great venue, food and drinks!
