At October's MLOps London meetup, Philip Henry and Chris Monit gave an insightful talk on best practices for bringing automated testing and quality assurance into data science projects. Their aim is to bridge the gap between academia and industry and so unlock efficiency and impact.
Traditionally, data science has not employed the rigorous testing strategies that are commonplace in software engineering. But as data scientists tackle more complex analytical pipelines and production deployments, automated testing can improve both quality and productivity. Philip and Chris shared lessons from decades of experience bringing automated testing to data science projects.
They noted that while academia focuses on iteration and reliability, the industry prioritizes handling rapidly changing requirements. Automated regression testing provides several benefits:
- Exercising production code paths without manual inspection
- Consistently passing on each run
- Asserting expected results
This protects against regressions as implementations evolve. They cited a recent case study in which England's NHS built a healthcare decision-support tool using logistic regression atop complex pipelines. Even with sophisticated analysis like this, it is still common to verify outputs by direct inspection; yet data frames are just data structures, and programmers have been testing data structures for years.
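To make that concrete, here is a minimal sketch of a regression test over a data frame transformation. The `drop_invalid_ages` function and its schema are invented for illustration (not from the talk), and pandas is assumed to be available:

```python
# Hypothetical example: a regression test for a dataframe-cleaning step.
import pandas as pd

def drop_invalid_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows whose age is missing or outside a plausible range."""
    return df[df["age"].between(0, 120)].reset_index(drop=True)

def test_drop_invalid_ages():
    raw = pd.DataFrame({"age": [34, -1, 250, 58]})
    cleaned = drop_invalid_ages(raw)
    expected = pd.DataFrame({"age": [34, 58]})
    # Fails loudly on any regression, instead of relying on eyeballing output.
    pd.testing.assert_frame_equal(cleaned, expected)

test_drop_invalid_ages()
```

A test like this exercises the production code path, passes consistently on every run, and asserts the expected result — exactly the three properties listed above.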
Philip and Chris advocated that data scientists should embrace automated testing with synthetic data to reduce bugs and enable quick fixes.
Synthetic data provides many advantages:
- Forcing a deeper understanding of the data, since modeling it realistically demands one
- Full control over contents and edge cases
- No privacy concerns
- Ability to test locally without large compute
The main disadvantage is the upfront investment to build suitable synthetic data generators.
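As a sketch of what such a generator might look like, using only the standard library — the patient-record schema and the 20% rate below are invented for illustration, not taken from the talk:

```python
# A minimal, deterministic synthetic data generator (illustrative schema).
import random

def make_patients(n: int, seed: int = 42) -> list[dict]:
    """Generate n synthetic patient records; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [
        {
            "age": rng.randint(0, 100),
            "smoker": rng.random() < 0.2,  # inject a known 20% smoking rate
        }
        for _ in range(n)
    ]

records = make_patients(1000)
```

Because the generator is seeded, every test run sees identical data, it needs no access to production systems or large compute, and edge cases (an age of 0, an all-smoker cohort) can be injected on demand.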
Effective synthetic datasets resemble production data's structure and statistical properties while allowing arbitrary quirks to be injected as needed, and a single dataset can be shared across many tests to amortize the effort of building it. Examples included testing pipeline logic by generating deliberately biased data and asserting the expected coefficients, and testing geospatial mappings by creating synthetic locations and validating the transformations. Deterministic synthetic data simplifies troubleshooting: tests pass or fail consistently, keeping cognitive load low. Non-deterministic approaches catch more corner cases but require carefully tuned distributions.
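The biased-data idea can be sketched as follows. The single-feature setup, the injected coefficient of 2.0, and the tiny gradient-descent fitter are all assumptions made to keep the example self-contained; a real project would fit with its usual modeling library:

```python
# Sketch: generate deliberately biased synthetic data, then assert that the
# fitted logistic-regression coefficient recovers the injected effect.
import numpy as np

rng = np.random.default_rng(0)  # seeded, so the test is deterministic
x = rng.normal(size=1000)
# Inject a known positive effect: higher x -> higher probability of label 1.
p = 1.0 / (1.0 + np.exp(-(2.0 * x)))
y = rng.random(1000) < p

def fit_logistic(x, y, lr=0.1, steps=500):
    """Fit a one-feature logistic regression by plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-(w * x + b)))
        w -= lr * np.mean((pred - y) * x)
        b -= lr * np.mean(pred - y)
    return w, b

w, b = fit_logistic(x, y)
assert w > 0, "coefficient should recover the injected positive bias"
```

Because the bias was injected on purpose, the assertion checks something the test author knows to be true of the data, rather than a number copied from a previous run.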
Data scientists can borrow a great deal from software engineering's testing strategies to create reliable, well-tested analytical code and ensure data science quality. Embracing test-driven development makes it possible to translate models and analyses into robust production systems that perform as expected. A warm thank you to Ed Shee and Seldon for organizing this inspiring meetup, and our appreciation to our hosts Rise (Barclays) for the great venue, food and drinks!
Atharva Ganesh Lad is a Data Scientist at Coefficient and an enthusiastic professional with expertise in data analysis, technical content writing, and social media management. He has a strong aptitude for delivering data-driven insights and dedicates his time to exploring AI, ML, and other emerging technologies. Atharva is passionate about making a positive impact in the world of startups, and is always up for a game of badminton or a refreshing swim.