Here at Seldon our primary focus is accelerating the adoption of machine learning in production, so much of our public engineering work is in the ML and data science space. However, in building our enterprise product Seldon Deploy, which takes the best of our open-source offerings and builds new enterprise-ready features on top, we also face plenty of more traditional software engineering problems.
Ensuring our products are easy to use, secure, and able to scale to the demands our customers expect brings a whole host of challenges. Today we're going to discuss a recent one and how we solved it!
What’s the problem?
Recently we built a new feature into Seldon Deploy: a model metadata store, which keeps track of your models and their metadata, such as creation time, model version, key/value tags, metrics, and prediction schema. We wanted to make this model store searchable in the frontend. Perhaps a straightforward task at first glance, but a bit more complicated when you look deeper.
When working on a problem like this it’s useful to step back and think from first principles. What do we really want to achieve? What should this look like to an end user?
In an ideal world, our users would just write something like "give me all classification models trained on the iris dataset with training accuracy at least 90%". Ideally, we could also apply the same functionality to other areas of the product we want to filter, like our monitoring data and future use cases.
Implementing something like this directly would be a tricky problem involving natural language processing, and maybe we'll get there one day. But in the interest of not over-engineering, we wanted to make the syntax a bit stricter. This also helps ensure the results are reproducible.
Thinking in this way has taught us a few things:
- We want a simple text query as close to natural language as possible
- We still want it to be powerful and be able to query complex data types
- We want to re-use the same query language in multiple product areas so users don’t need to learn a new syntax for each feature
- The query language needs to support multiple "storage backends", like SQL, Elasticsearch Query DSL, etc.
How could we solve it?
As with all engineering problems, the first step is to see if someone else has already solved it for you! Open source libraries can solve a lot of the problems you come across as an engineer, and there is no point reinventing the wheel. In this case, though, we couldn't find an existing library that satisfied all of the requirements above.
Another idea would be to expose a SQL query directly in the frontend and run it on our backend, but hopefully anyone familiar with Bobby Tables immediately recoiled at that!
That left us with only one option: building something ourselves. So we set about creating a custom query language with all the functionality described above.
Rob Pike, one of the creators of Go, gave a talk a while back about building a custom lexer ("Lexical Scanning in Go"), and this proved to be a useful foundation. On top of it we built a custom parser for our query language. Building something ourselves has the following benefits:
- We understand the code implicitly
- It’s easy to extend without getting external agreement
- It’s fun to write code!
We don’t intend to dive too deep into the implementation, but the basic whiteboard architecture of Goven is as follows:
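To give a feel for that lexer-then-parser shape without reproducing Goven's source, here is a toy version of the pipeline in Go. Everything below is illustrative: the token kinds, function names, and the tiny one-operator-per-clause grammar are ours for the sketch, not Goven's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Token kinds for a tiny query language of the form: field=value and field>=value ...
type tokenKind int

const (
	tokenField tokenKind = iota
	tokenOp
	tokenValue
	tokenAnd
)

type token struct {
	kind tokenKind
	text string
}

// lex splits a query like `task="classification" and version>=2` into tokens.
// This follows the spirit of Rob Pike's lexer talk in miniature; a real lexer
// would also report errors (e.g. unterminated quotes).
func lex(query string) []token {
	var tokens []token
	i := 0
	for i < len(query) {
		switch c := query[i]; {
		case unicode.IsSpace(rune(c)):
			i++
		case c == '"': // quoted value
			j := strings.IndexByte(query[i+1:], '"')
			tokens = append(tokens, token{tokenValue, query[i+1 : i+1+j]})
			i += j + 2
		case strings.ContainsRune("=<>!", rune(c)): // operator, possibly two chars
			j := i
			for j < len(query) && strings.ContainsRune("=<>!", rune(query[j])) {
				j++
			}
			tokens = append(tokens, token{tokenOp, query[i:j]})
			i = j
		default: // bare word: field name, "and", or an unquoted value
			j := i
			for j < len(query) && !unicode.IsSpace(rune(query[j])) && !strings.ContainsRune(`"=<>!`, rune(query[j])) {
				j++
			}
			word := query[i:j]
			if strings.EqualFold(word, "and") {
				tokens = append(tokens, token{tokenAnd, word})
			} else if len(tokens) > 0 && tokens[len(tokens)-1].kind == tokenOp {
				tokens = append(tokens, token{tokenValue, word})
			} else {
				tokens = append(tokens, token{tokenField, word})
			}
			i = j
		}
	}
	return tokens
}

// toSQL walks the token stream and emits a parameterized WHERE clause plus
// arguments, keeping user-supplied values out of the SQL text entirely.
func toSQL(tokens []token) (string, []any) {
	var sb strings.Builder
	var args []any
	for _, t := range tokens {
		switch t.kind {
		case tokenField:
			sb.WriteString(t.text)
		case tokenOp:
			sb.WriteString(" " + t.text + " ")
		case tokenValue:
			sb.WriteString("?")
			args = append(args, t.text)
		case tokenAnd:
			sb.WriteString(" AND ")
		}
	}
	return sb.String(), args
}

func main() {
	where, args := toSQL(lex(`task="classification" and training_accuracy>=0.9`))
	fmt.Println(where, args)
}
```

The key design point survives even in the toy: the parser never copies user input into the SQL string, only `?` placeholders, with the values carried separately.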
The natural language query we suggested above, "give me all classification models trained on the iris dataset with training accuracy at least 90%", in Goven becomes:

task="classification" and tags[training_set]="iris" and metrics[training_accuracy]>=0.9
How could you use Goven?
Imagine you're a small technology company trying to build a *hopefully faster* version of a bug tracking system like JIRA, and you have a database schema for your tickets.
Now you want to make your tickets searchable, and you want to let users make powerful queries across things like who created a ticket, when it was created, its status, and so on.
With Goven all you’d have to do is add a few lines of code to create an adaptor for your Ticket database schema, and then extend your FetchTickets API to accept a search string that your frontend can provide. It’s as simple as that!
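The wiring looks roughly like this. Note that Goven's real adaptor API differs; `queryToWhere` below is a self-contained stand-in that handles only a single equality clause, just to show the shape of the `FetchTickets` change:

```go
package main

import (
	"fmt"
	"strings"
)

// queryToWhere stands in for a Goven-style adaptor: it translates a query
// string into a parameterized SQL fragment. The real library does full
// lexing/parsing; this stub handles only `field="value"`.
func queryToWhere(q string) (string, []any) {
	parts := strings.SplitN(q, "=", 2)
	field := strings.TrimSpace(parts[0])
	value := strings.Trim(strings.TrimSpace(parts[1]), `"`)
	return field + " = ?", []any{value}
}

// FetchTickets shows the API change: the frontend passes a raw search
// string, and the backend turns it into safe, parameterized SQL.
func FetchTickets(search string) (string, []any) {
	where, args := queryToWhere(search)
	return "SELECT * FROM tickets WHERE " + where, args
}

func main() {
	sql, args := FetchTickets(`status="open"`)
	fmt.Println(sql, args)
}
```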
It’s also secure since we utilise the interpolation features of Gorm to prevent any SQL injection attacks. It’s easily extensible to support more advanced queries specific to your own schema using the custom matcher functionality.
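The principle behind that safety can be shown with a toy contrast (standard placeholder style, not Gorm's actual internals):

```go
package main

import "fmt"

// naiveQuery splices user input straight into the SQL text: the Bobby
// Tables problem.
func naiveQuery(creator string) string {
	return fmt.Sprintf("SELECT * FROM tickets WHERE creator = '%s'", creator)
}

// safeQuery keeps the SQL text fixed and passes the value separately as a
// bind parameter, which is the principle Gorm's interpolation relies on.
func safeQuery(creator string) (string, []any) {
	return "SELECT * FROM tickets WHERE creator = ?", []any{creator}
}

func main() {
	malicious := `'; DROP TABLE tickets; --`
	fmt.Println(naiveQuery(malicious)) // attack payload ends up inside the SQL text
	sql, args := safeQuery(malicious)
	fmt.Println(sql, args) // the SQL text never changes, whatever the input
}
```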
We've called it Goven because, just like a real-life oven, it takes something raw (your database struct + a query input) and bakes something specific (the schema-specific parser + SQL output). We call the adaptors "recipes" that Goven can make. Contrived, we know, but every self-respecting Go library has to start with Go and have some metaphorical interpretation!
Become a contributor!
You can find the Goven library here. It's still at an early stage, and we will gladly accept contributions to improve it!
Seldon is also currently building a new Cloud team to take our products to the next level by hosting them on public cloud infrastructure. If you found this article interesting, or have experience building SaaS products or working with Golang/Kubernetes, we'd love to hear from you. You can find our current job openings here.
Max is a software engineer at Seldon, primarily focused on the development of infrastructure and features for Deploy Advanced. He has a particular interest in cloud-native technology, the intersection of data science and engineering, and how best to deliver value to customers. Prior to Seldon, he worked at Improbable, a startup centred on building the Metaverse. Max has a Masters degree from the University of Cambridge and the Massachusetts Institute of Technology.