The LLM Module - Live Walkthrough and QnA

On-Demand Webinar

The LLM Module – Live Walkthrough and QnA

First Recorded: February 13, 2024

Speaker

Josh Goldstein

Solutions Engineer

Transcript

Hi, everyone. Good morning. My name is Josh Goldstein. I am a senior solutions architect at Seldon with the, as the and as Graham was mentioning, the been part of the go to market team for the building and launching our LLM Module.

So today, what I’m gonna do is I’m gonna go over a little bit about what Seldon is, the challenges of deploying LLMs, and then we’re gonna go through a demo of showing what a local LLM being deployed and what an API based LLM would be deployed.

So who’s Seldon? Who we are? We’re building the global infrastructure for ML operations, helping improve businesses, enterprises, accelerate the adoption of ML by getting their models into production faster in a reduced risk way with the abilities to manage and monitor the performance.

Where Seldon fits in to the MLOps lifecycle, Seldon, we are we are focused and best of breed on that serving aspect. So once that model is done, once that model is built, you’re good to go. That’s where Seldon would take over. You would put your model into production. We are Kubernetes based. And then you would have a suite of tools that will allow you to manage not only infrastructure and performance metrics, but also data science quality metrics, drift, outlier, explainability.

Obviously, we’re all still waiting for the LLM paper on explainability. So, all, all of that jazz, as well.

Seldon’s global impact. We’ve been around for about eight to ten years thus far. We’ve had we have a lot of models in production over the years, eight and a half thousand stars on GitHub across our suite of product products. And through working with our customers, we’ve been able to determine that it’s been as up to as much as eighty five percent, efficiency gains in productivity and getting those models to production faster. We’ve had some customers that have gone from months to weeks and then even more further going down from weeks to days, and then some of them even got down to minutes sometimes.

So what do people actually struggle with with operationalizing models?

Well, first, there’s the DevOps and engineer who has to manage all of the infrastructure, is not very in tune with machine learning models, the constructs of the building of those, so it causes bit of causes a bit of more, learning curve for those DevOps engineers.

When it gets to that ML engineer when it gets the actual ML engineer, you know, they have the trouble of getting all the monitoring in place, trying to understand, look at the performance, and see really what’s going on and get that greater insight into the model. And then from that data science perspective with our monitoring capabilities, being able to expose and enhance and, squash that lack of that lack of mistrust by including trust for the data scientists with visualizations and the ability to understand why their models are doing what they’re doing in production.

Challenges then I’m gonna go into a upload a couple of challenges on LLMs. I’ll go into a couple of specifics about the Seldon technology, then we’ll go into the LLM module, and then we’ll go into the demo.

So LLMs, the use cases, very standard. Pretty much every enterprise I have spoken to in the past year and a half, conversational support, starting with those rag apps, those customer support, those Copilot tools, even those things like the Shopify tool that helps you build your website and tells you exactly specifically what to do, the training bots. As you can see, we have a couple. We were trying out a company called Capa AI at one point that was looking at our search for our documentation.

The next set of use cases would be content creation. You know, a lot in financial services, and highly regulated industries, there’s a lot of stress testing being done. So being able to generate synthetic data is a huge advantage because the the synthetic data is of better quality, and there have been struggles in the past to make sure that you’re building good quality data.

Then the other fun ones, image generation, video generation, music generation, as well to help, you know, in various different industries as well.

Then finally, we get to the the real fun ones, text to speech, speech to text, enabling that Siri, that that car when you’re talking to the car, doing the summarization, and then some of even going into the health care space where they can even start these LLMs are off are actually helping with drug discovery and understanding different patient records and being able to help get that drug to to the market faster.

So three overarching challenges when deploying and serving LLMs, is deployment and serving, building the application, and having the ability to con to construct the components for that application and then the ongoing management of that application.

So now we’re gonna take a step back, and we’re gonna see how we deploy models with Seldon because the cool thing about Seldon is it is the same the way you deploy traditional models is the same way you deploy LLM models as well. Keeping that standardization, which is one of the key value points of Seldon around being able to not interrupt the workflow of a developer, an ML engineer, a data scientist, but because you want to add an LLM. It’s, oh, we’ve been using Seldon. We see this. Here’s the LLM module. This is how we deploy the models. Oh, that looks exact that looks the same to deploying those traditional models we deployed six months back.

So Seldon Core is an orchestration framework for deploying and managing models on top of Kubernetes.

We have, two different versions of Seldon Core, one being Seldon Core V1, which is the first the first version that we created, and then Seldon Core V2. The reason we built Seldon Core V2 is as that data centric AI notion and jargon came right before chat GPT, being able to have a use case is no longer just a single model.

It’s a it’s a sequential set of models. I’ve talked to a couple of insurance companies that have these assessment models that take maybe six or seven steps before they can actually spit out the answer to the business question.

So we’ve we what we’ve been able to what we did was we split a lot of the the pieces out to give you that flexibility to build things like inference graph pipe complex inference graphs, pipelines with data controls, being able to configure, various different components that should you say you need a different version of scikit learn or XGBoost, your TensorFlow, being able to being able to revolutionize and have that in an easy to consume manner, as well as experiments, adding on to the AB testing, the mirror testing, and being able to do that for not just one a a singular model, but other models as well, but multiple models as long as it adds up to a hundred.

So oops. Sorry. We we support under the covers, ML Server, which is our own inference server, as well as NVIDIA Triton, to produce these basically configurations, not code. So all you would do for these types of prepackaged runtimes you see on the screen, MLflow, Light g x g boost, TensorFlow, this is all configuration as opposed to having to write any code to deploy these models once those models are sitting in artifact storage up in a a blob storage area.

The Seldon stack. So we do have also an enterprise product known as Seldon Enterprise Platfom that is also Kubernetes based. And what you can see here is that architecture of the overall ecosystem, so some of our open source dependencies, but as well as you can see in coming into the enterprise red square, it’s red rectangle itself, you can see that it starts at that core v two level where you have that pipeline capability as well as another infrastructure savings, feature that called multimodel serving. But then you can see you can intertwine various the various inference servers having ML Server be a preprocessor. You could have Triton be a preprocessor and then ML Server running the model, and being able to do those configurations.

When it comes to LLMs, this is the very same paradigm of that preprocessor chaining and then having that configuration.

Then within the Enterprise Platfom, you also have the role based access control, the audit logs, the lineage. You have the u visualizations and alerting capabilities on performance infrastructure, performance metrics, as well as those data science monitoring of drift outliers and.

A quick little workflow of how you would deploy your models. You come in, you have your experimentation, Then you would have your CICD pipelines doing stress testing. Maybe, that is how the going through the securities, the approval button. And then once that model is approved, then it’s time to be used. It’s fully SDK and API enabled to put it into the Kubernetes cluster where the model will be running in Seldon Enterprise Platfom or Seldon Core. With Enterprise Platfom, you do get these GitOps and inference data and metadata capabilities that give you a three sixty view of everything that’s going on down to the request level, as well as a model catalog that allows you to build that contextualized model three sixty, not only on the training side, but also at production side as well.

Models are defined in a simple as I was describing in a simple, configuration file. As you can see, I have my storage URI, my requirements. This would be an x two boost model, and my name. So just having this simplistic way to deploy these models. And the the paradigm here is you deploy models first, and then you deploy your inference graph. And what that enables is the ability to reuse models. So, like, for instance, I have an OpenAI run I have an OpenAI model running, but I’m using it across about three different use cases without having to deploy that model three times.

Pipelines is that notion of those inference graphs. They’re naturally asynchronous.

They have lineage and audit. You have multiple data control flows around them to build various use cases should you need to go down conditional routing, should you need to do running two models in parallel before you hit that.

What you’re seeing here is just a very simple showing the parallelization of ability to run various models is an IoT predictive maintenance model, but it also has its explainer and its outlier, and it’s giving me my output. So I can get that full picture of everything from that inference graph in my response when my when the consumer of that model gets that response back. All the way on the right, you do have that drift detector.

That does run. As you can see, it runs in a batch of five. So every five requests, you will have and if the drift is occurring, you will get alerted on it.

Experiments enable you to run AB tests, run mirrors, run shadows as well. And as you can see, it’s it’s not just a fifty fifty split. So if you’re in this candidates, you could do iris one, two, three, four, should you so please as long as it adds up to a hundred or there could be some issues.

So one of the other key features in Seldon is multimodal serving. This is super important because what it does is it consolidates the resources that you’re using in your Kubernetes clusters.

Now with using LLMs and GPUs, you are starting to come into the challenges of having node availability.

And sometimes, you know, if you’re not using your use case, you could potentially lose that that GPU node or depending on the region of in the cloud that you’re using, those they would run out of GPUs.

So what we do is we have a concept of multimodal serving, which we which what we use is we consolidate these resources. So, basically, if you see here, each of these models’ servers is considered a model. So each of these servers each of these models is a pod or a managed or a managed black box serving mechanism for these models.

Sometimes these models are not big enough in to to fill the entire infrastructure that’s been dedicated to it, and you’re leaving a whole bunch of infrastructure on the table. With multimodel serving, you consolidate those models. They’re all running on the same on the same pod, the same infrastructure, and then you only have that sliver of of need of unused resources, saving you a whole bunch of costs because, like I said, if they’re GPUs, they’re I know at minimum about eight dollars an hour, and they only go up from there.

Alright. Cool. So the LLM module, what Seldon offers.

So going back to those challenges of deploying and serving, building applications, and ongoing management, what we’ve noticed is there are, as you could see, the model landscape has been growing for about a year and a half now between model size, performance, adding complicated workflows into the mix. The differences, the variations, there’s API based models. There’s hosted on prem. Have you’re even a lot of people are fine tuning models, taking those Hugging Face models and bringing them down. Also with those some of the large model with the large language model deployment strategies, you have the different types of distributed GPU processing in order to you know, sometimes you do not get the luxury of having a very massive GPU, but a set of smaller GPUs. So enabling customers to run that same LLM on that distributed manner as well.

And that is why we these we decided to build the LLM module with an OpenAI with the OpenAI feature, the ability to run models with deep DeepSpeed, VLLM, and the transformers library. The reason we did this is because instead of going specifically for a model, it’s using the common frameworks that are in the market today, like VLLM, that enable things like that distributed GPU. And what we’ve done is taken the abstraction layer to make it more simplistic to configure in order for those for those models.

Building applications. So with Seldon, because of some of the out of the box features already, we have that complex pipelines or inference graphs as we were looking at above. You can see you have the flexibility to start building those various Chaney components, q and a, RAG apps. And then we also introduced, a memory component, so an out of the box memory component that you don’t have to set up, you configure, and you’re good to go, to enable that real chat experience so then you can inject that chat history back into that next request for your chatbot.

With retrieval augment augmented generation, we have the ability in Seldon with ML Server to write custom components. And what that does is it, right now, makes us, vector database agnostic. So you have the ability to connect to any vector database that you so please, using these custom components, out of the box. And they’re very simplistic. It’s it’s a very fun fun piece of the technology, as I like to say.

So just a bit of an overview of just some of the of the components. We have the serving aspect over here where we support DeepSpeed VLLM and transformers and the OpenAI runtime, which also supports the Azure OpenAI service as well because most enterprises are not going directly to OpenAI.

And then on the application building side, we have the ability to create pipelines. You can set, prompt, basic prompt templating.

You have that memory component out of the box. And then with Seldon’s flex the given the flexibility of Seldon, you also have the ability to write those custom components to enable anything like calling out to a vector database, doing some data aggregation, maybe even calling out to a third party. If it’s in an ecommerce section, you got a ecommerce use case, you got you’ll call out to the KYC system to bring on more information about the customer before sending it through the LLM. So having that flexibility within the app within Seldon really enables you to do that. And it also scales each piece also scales individually, which enables you to have a greater aspect of that ongoing management.

So having the ongoing management with the Enterprise Platfom, you get full model log you get full logging, versioning. You have your role based access control as well, alerting. And what all this culminates into is staying in compliance. You know, with the LLMs coming out with the EU AI Act, the US executive order coming into play, and then with the EU a with the EU AI Act clearly defining the various use cases and level of scrutiny and regulation that they’re going to have, having ongoing management is super critical for the future of running LLMs or, as we’ve all been learning over the past two or three weeks, SLMs, small language models.

So what’s inside?

Easy deployments.

Access to the Azure OpenAI and OpenAI, API endpoints and having that deep speed VLLM and transformers to run any of those open source models, pretty much anything from Hugging Face. Now we do get a lot of questions as to why we’re tapping into the OpenAI service itself, and that’s because it goes back to the standardization aspects of the product and having the ability to build these inference graphs. You don’t wanna go manage one little piece of your use case somewhere else.

If you can just do it right within that pipeline, and let’s say, in the future, when you’re building this use case, you decide to move away from OpenAI and use a fine tuned or open source model, because those models are have already been deployed and configured and settled, it’s more swapping a model out as opposed to having to redo your whole inference pipe, inference graph, and rewriting the sum of the code.

You also get those that memory component, which enables chat chat capabilities, so being able to combine the answers, put memory, keep track of memory with session IDs as well.

It really enables a lot of that those features, and then we’ll see what the future brings with some of even as chatbots and rags expand.

Also, it’s because of that because of our complex pipelines in core v two that you have the ability to enable all of these things.

And finally, one AI ecosystem.

So with the enterprise product having a unified model management, role based access control, visualizations for monitoring performance, as well as alerting and logging of all the requests. You get that ecosystem of management to help as these regulations are coming down. So you’re already future proofed and prepared for everything that that’s gonna come at you.

Finally, some of the benefits.

It is performant.

It is easy and flexible with those three back ends. It is tech stack agnostic. Seldon is a cloud agnostic platform that can be run on Kubernetes clusters across OpenShift, GCP, AWS, as well as Azure.

You have the custom components as well. And because we’re you’re just injecting the LLM module into the entire Seldon Seldon ecosystem, you’re already set up for success with the debugging, the logging, all of those monitoring, disaster recovery reliability, and the beautiful word of scaling.

So once again so now we will come in, and I just realized there’s one sign. Okay.

There we go.

So one last example down below is this use case before we go into the demo.

And in this use case, what we are seeing is it’s not a chatbot. Give it a second to load. Apologies. What we’re not seeing, but another aspect of a use case that you can do by combining those traditional models as well as those LLMs with Seldon. So going back to that IoT initial pipeline, taking it a complete step further because we wanna understand in English or, as I like to say, in people who, for people who are not as in tuned with understanding the statistics and outputs of drift detectors and outlier detectors and explainers, what you can do is build pipelines like this to take all of that information from the traditional models and then throw it into an LLM to give give that business summary back for maybe a business executive or a product manager that is trying to understand why that IoT use case is not doing what it’s supposed to do.

So with that being said, let’s come over into a bit of a demo on how we would do this. Let me pull this down quickly.

So currently, right now, I have Seldon running, on a Kubernetes cluster in GCP, and I have the I have an OpenAI runtime running. I have a local model running, and then I have an entire RAG running.

So some of the key pieces of these are in the settings files. So we have in a directory structure, and one of the things about these model settings files is the fact that you’re not going to be pulling from Hugging Face directly, because this is how you configure down to that intricate level how many distributed GPUs you wanna run across, which run time do you wanna use. And so you would create a file like this that says local runtime, the URI, which would be the location of the weight of that model. And in the cases when you’re doing this with on prem models with this with the storage being in the blob storage sitting right next, this model settings would sit in the same directory as the the cloned or fine tuned LLM that you’ve created.

You can set your versions. I can set my back end.

Oh, one sec.

There we go.

Cool.

I will take a step thank you, Pavel. I will take a quick step back.

So just to refresh, this is what a configuration and how you would set up a local model using cell a local or fine tuned open source model that’s sitting behind your firewalls than deploying it with Seldon.

So this is a what we have is the model settings file, which is basically where you can tweak and fine tune the parameters as well as give it all the information it needs to do things like distribute across multiple GPUs, use CPU, use depending on if you wanna use the transformers back end or the deep speed back end. I know I’ve been reading a bunch of stuff that depending on the back end, there are some quality there are some quality differences that have been starting to be noticed.

So one of the the things you would do is is this model settings file will sit right next to the weights or of that fine tuned model because then you would set this URI, and you would set it basically to just dot slash. And this will know that that that LLM is sitting in the same directory as this file.

I would then set my back end so I can use, like we were saying, VLLM, deep speed, or the transformers library.

I can then set my devices. So if I wanted to use GPUs, I set this to CUDA.

And then I can set a TP parallelization to, say, I wanna use three distributed GPUs.

Finally coming down, a lot of these LLMs, especially the ones on Hugging Face, have various tweaks of their prompts and the input formats for their prompts. So we have the ability to bring in those prompt those prompt templates, which are which are in the code bases of these LLMs. So what it also enables is some fun stuff as well. So I can come in.

I have the message content, some of the out of the box standard things they need, but then I’m also being able to give system messages, user messages. Yes. This local LLM may have, used a swear word the other day, so I, as you can see, I injected the, I injected a a little prompt to say, please don’t do that. So, yeah, having these full capabilities.

And this is not only just for the locals, but you can also do this for the open for the OpenAIs as OpenAI runtimes as well. So if we came down into that OpenAI runtime, you can see that I am using very similar type file. I’m just setting my provider ID to OpenAI. I’m setting that prompt template to true, and then I can access the OpenAI different model endpoints.

So if I wanted to change this to four o, or I wanted to use instruct as opposed to completions, I can.

This also supports Dolly. I have used Dolly before.

But then, also, please, if you note the the the one of the the Nuance differences is that setting up of that where that template is. So you would set that in the URI field this time. And then very similarly, these are, you would have that that prompt template, and then setting the context, the user role, the chat history, or array questions and answers.

So once that all set once that’s all set up, we would come in and we would register the models. So very similar to what we saw earlier, we have in this is a rack. So I have a few pieces of this puzzle. I have my chat memory. I have my database lookup. I have my memory, stages within my pipeline, and then I actually have the call to the LLM itself.

So I have all of these various components deployed deployed in, visualized in one file. So having leveraging that once again, out of the box, just setting up that memory just with a simple saying of the requirements of memory.

Then with the embeds, being able to use embedding models as well. So with the reg, you have to first embed embed the query, to search against the database.

And then with those custom components like we were discussing, you have the ability to go out and make a call to any type of third party database. What this what these custom components are is a pipe is a Python class, and that Python class has two methods. And then you’re off to the races with being able to access using any Python libraries that you would that you would like because you have with the flexibility of Seldon the ability to deploy to compile those dependencies into a container that can be used in a very reusable fashion.

And then finally, I have my l my LLMs being deployed. This is the location of where that local model would be sitting next to its model settings file, and then I would apply these. So then that means I can apply these first.

Now in here, we do have two different LLMs. So this is my GPT LLM using the OpenAI runtime, and this is my local LLM, which is GPT Neo two fifty.

As this is a demo box, I was not I I do not for capacity reasons, I found a smaller a smaller model. But please take note that if we look at some of the componentry like brag combine answer, you can use these things in a reusable manner. So let’s take a look at a pipeline configuration quickly, and you can see it is called the rag, seldon, core v t pipeline. I have that rag combined question name, which was which was, as you can see in the models, registered only once.

And then I have all of the different inputs and and specificities you would do to con to configure the ability to say, map things to like we saw in those prompt templates. So I could put any variables I really want in there, and as you can see within the pipeline features, I can map them.

So if we look at the rag combined question, embed VDB, and and what these names are is the corresponding to those models you had previously registered because this enables that reusability.

So this one is using the local the local GPT.

So as you can see, I have my josh test local, which was that local model. But then if we take a look over at the OpenAI pipeline, you might notice that it’s about the same.

All the models still using the embed, the VDB, the combined questions.

However, the only key difference here is I’m using this model instead, which then enables the swap between so as I was saying, should you start with a with a use case with an OpenAI API based model, but then choose down the road, you wanna swap that out for a local LLM.

You can see how just by applying this manifest again and registering that new model, you’ll be able to comp with zero downtime, roll out that new use case with that updated model.

Alright.

And finally, let’s talk to chat let’s talk to some models.

So as you can see, I have a little chat interface that, has been written. And, let’s say hello.

Can you tell me about seven four e two?

And what we’ll see give the this is a local the local model, so we will see some interesting things, but it it will return. So you could see that it has responded.

So let’s just quickly exit, and then we’ll open the OpenAI model.

One last second.

And then I can once again say hello. Tell me about Seldon four two.

And the model will then once again return, Seldon Core two runs on Kafka. Can you give this to me in bullet points?

So now as you can see, it’s leveraging that memory, and it was able to remember to put in those that we were talking about selling core v two. This also is is going out and is we have a vector database in the background running, and it’s pulling back content from that and injecting it into the prompts for that RAG, RAG experience.

Alright. And that about wraps up the tour. Would love to open it up to questions now. Please feel free. Happy to answer anything that you’d like.

Hey, Josh. Thanks very much for, for going over that. Appreciate it. What we’ll do is we’ll, we’ll go and see if there’s any, questions out there. There was a couple delivered to me, actually around, deploying that LMM there in that.

If you were to is there is there a sort of limit on the size of the LMM that you’re that you’re putting out there? Obviously, that was a very basic model, but there was an idea around the size and limitations of size that were going through the the module there.

Yeah. For sure. So limitation of size when it comes to deploying models with Seldon is kind of an interesting concept because it really does tie depend on how much horse power you’re able to, requisition from the infrastructure teams. So there’s no real size limit. It’s more, do you have the CPUs or GPU capacity that will be able to hold that model in memory and run it? And that’s why there’s also all of these concepts of distributed GPU processing for LLMs because sometimes you’re not just gonna get a big massive GPU. You might get a couple of smaller GPUs that and that’s the only thing that’s available to you, but you still you still have the capabilities with that distributed manner to run it, but then also you would have that same capability if you just did have that big massive GPU box as well.

Cool. Thank you. There was another question here around, infrastructure, actually. Is this just for the cloud, or is this on prem only?

So with Seldon, we are cloud we are cloud agnostic, Kubernetes agnostic. So all the three major cloud, Kubernetes services as well as we have a lot of our customers are actually on prem on prem data centers with using Red Hat OpenShift.

Cool. Thank you. I’ve actually just put out a poll there. So for those who are, who are with us, maybe, I’ve also launched a poll so you can, let us know as well. When when do you think you will be deploying your first LLM?

Is that over a year? Is it the next twelve months? Is it six months, or have you already done it? Let’s, let’s get an idea of, of of who’s out there and and where you got where you’ve come from and what you’ve already done.

Okay. Cool. I did have, there’s no another question.

Just going through. This is from, Pava, actually. Most configuration was done using, you know, Kubernetes, YAML files. Is it possible to achieve the same using Python?

Yes. So with the Enterprise Platfom, the the, one of the other features that comes is a full API and SDK.

So you have basically anything that you would see in the Enterprise Platfom, you can integrate. So these, the configuration files would look more JSON files and dictionary files as opposed to YAML manifests, and then you would use the SDK library to run those. That’s the common pattern we see with, customers and how they integrate into their CICD pipeline is, and that’s why they find the value of the Enterprise Platfom is because of that SDK capability.

Cool.

Alright. Thanks. So, yeah, with the that that poll, that I just put out there is actually it’s it’s very interesting. So currently, no one online has actually deployed, not a great surprise, I’d say, given where we are at the moment.

But there’s but but a vast majority actually looking to it in the next six months and then the next twelve months as well. So it doesn’t look like it’s that far out either. What’s Josh, tell us what you’re kind of hearing in in the market at the moment, regarding people. Oh, there there we go.

Someone late to the game thought I’m not having this. We have already, launched an LLM and deployed one, so I’m gonna put it in there. Yeah. Well done to you.

Very good. Very good.

So, yeah, in terms of that, what do you see as the the sort of, like, the the market? How how’s how are things going, at the moment? What are you sort of hearing from from some of the customers we’re speaking to?

Yeah. For sure. Thanks, Graham. And so, yeah, basically, as you know, it’s been about a year and a half as many of that. Yes. It’s been about a year and a half.

In the beginning, last year was definitely a lot of r and d, a lot of just trying to wrap your head around it.

I spoke to a bunch of financial services where we were able to successfully do a couple of POCs, a chatbot.

Actually, we had one that was converting old mainframe code into Python code to do a little bit of modernization.

As the year went on, there were you started to see, especially with most of the larger larger enterprises, they were able to successfully get, more of a generalized chatbot into play, enabling employees to use things like chat GPT to help with their work.

In the past six months, this is definitely the paradigm is shifting. People have been, you know, playing for about a year, testing, understanding, and we’re starting to see some of the first real live use cases go into production. Lot of them being summarization and rack chat bots, because that seems to be where everyone is starting. And then as they’re doing that, they’re expanding out into get it will get into things like generation of tech not generation text. That’s what no one’s do.

More media generation and more I was talking to, another financial services. You know, agents is gonna be a big thing. You know, people there’s these students, you know, not students, MBA grads that are getting paid hundreds of thousands of dollars a year at these financial services firms, and they’re spending three hours in the afternoon writing reports that with an with an LLM type application could do all of that task automatically so that the MBA students can then go focus on money making activities.

Cool. Thank you. We’ve got a a few more questions coming in. So, there’s a couple on the q and a, which I’ll I’ll launch in a moment. But, just a sort of quick fire, I know this is probably a multimodal serving piece here, but, what’s the carbon footprint of the LLM you’ve just deployed? A huge, topic, at the moment, the environmental, factors of LLMs.

Very, very small. It was a two hundred and fifty mill million parameter, model given, I’m using my personal demo cluster, and we did not so, Yeah. The capacity is and, you know, we don’t trying to get GPUs on GCP sometimes is a bit of a a bit of a hassle.

But yeah. So this the carbon footprint of the one I’m running right now is very, very small.

You know, but that is a big thing in the industry now is you’re starting to look at things like how much water are you consuming, how much electricity are you consuming, battery companies coming out to help power these data centers, reduce that and reduce that carbon footprint. So, like, with from a from a multi from a Seldon perspective, you know, we do have that concept of multi model serving, which does consolidate the resources, which then gives less nodes, which then means less infrastructure in the data center, which would then lead to a lower carbon footprint.

Cool.

Now I’m gonna go into one here. There’s a it’s it’s kind of been asked in a couple of places.

So well, maybe.

Maybe. Let’s just see. But what is the advantage of server and over deployment directly in AWS or Azure? And then I think to follow-up on that, because as a Google Gemini anthropic call, is is a chance to include those in there. So you may wanna kind of maybe try and wrap that up into one one base if you can.

Yeah. So I’ve been doing a lot of research on, you know, the the the big cloud providers, how they’re doing their LLM, ML ops, LLM ops type scenarios.

What we are finding is, you know, it is it is that standard when you’re deploying on on the cloud providers. It is some it is black box managed services. You have to pick from a finite list of of nodes and, yes. And they you can they do have the aspects to it, installing Kubernetes. However, there’s a lot of lift needed to complete and connect them together.

And then in terms of the anthropics and the Geminis of the world, as Graham had mentioned, we we have, just recently launched this a few weeks ago, so we’re ever developing the pro the road map and taking into account things like all of the popular ones in like, Anthropic and Gemini as well.

Cool. Thank you. I’ll add this one to the stage as well. So, just to clarify, is the LLM module available on the basic core product? Maybe a quick, explanation on the difference between what we have as core, the way that licensed is licensed, and then what Core+ is and and how that kinda pieces together. But we’ll go we can go through that.

You want me to go through that?

Yeah. Yeah. Sorry.

Alright. So yeah. So there’s there’s a few tiers of core. As as as, if you’ve not known, we had a we, made a licensing change a couple of months back, and we offer a few tiers of our products. So that basic core product is more just a release to allow you to use Seldon in production.

You get community support, as one of the offerings. The LLM offering is not available for that basic core pack package.

When we introduce what’s known as Core+ or Core+ plus, that’s where you start that’s where you would get into you would get an account manager, access to the support portals, access to what we have known as IQ tokens, as well as enable you to have that ability to purchase the LLM module as an add on as part of the package.

And then in enterprise, that’s the same thing. Enterprise, you get the full UI interface, APIs, SDKs, and also the option to buy the LLM module as well.

Nice. Thanks.

We could probably spend a lot of time on that one, but, it’s, a a decent quick summary. Thanks, Josh. And then, I think just, as we got a couple more minutes left, there’s a a nice question here I think would be would be worthwhile, going through as well. So Jacob has asked, if for a beginner, what tools exist to help me create my own LMM? I mean, I guess it kinda depends a little bit on what you’re trying to create and what that’s kind of for. If you pop that into the chat for a bit more context, Josh can start answering some things. But then, yeah, Jake Jake, if you can add, a little bit more context on what it is that you’re trying to do, what the use case is, perhaps that’s gonna help you with the answer.

Yeah.

So I know, you know, as as Seldon, we’ve we do not focus on the creating of the LLMs. However, I have looked into it a bunch. And, yeah, like Graham was saying, it’s it’s a lot about the it’s a lot about the use case. It’s a lot about there’s a lot of research that goes into, you know, if you’re gonna if you’re going to build a I I’ve only heard about a few people. I had a friend at one of the major newspapers in the country, and they said they built their own LLM. And as I was discussing it with him more and more, most of what you’re seeing is you’re not building LLMs from scratch. You’re fine tuning a model or so for instance, using, like, a mixture and then fine tuning it and then quantizing it, which what that does is the quantization brings it smaller, so it reduces that carbon footprint as well.

So there’s a few key steps when you’re building out an LLM or an LLM use case. Yes. If if there are ways to build your exact own LLM, and then there are also ways as, you know, the the parameter size is very, very large. And then when there’s also then you go into that concept of quantization where you will quantize the model to bring it down in size. You do get a little bit loss of quality, but then what that does is it helps from the infrastructure perspective of being able to run that model in a more efficient manner with that distributed GPU and, you know, reducing that carbon footprint.

Did that help, Jacob?

I’ve it was a pretty good answer to me. I don’t know whether Jacob’s still there or not, because we’ve been, our our q and a has been a little a little while. So, oh, there we go. He’s a math grad and has been using LMS and objective detection models over the last few days, and we’re working on training an AI model with a, for a data project I’ll be part of.

So, yes, he has said that that was helpful. Cool. Thank you, Josh. Appreciate you jumping on, to do our webinar today and to take us through Oh, well.

Deployment.

I really enjoyed it, and and it looks like a few of the others did. We had a lot of great questions there and and audience interaction and participation as well. So, we’ll have more on that in the future. A lot of our webinars are moving this way.

I hope you’ve enjoyed using the using the new sort of platform that we’re using here as well. It’s a a little bit more attractive, and we wanna kinda bring in that life into some of our webinars as well. So, stay tuned for the next few. We’ve actually got one coming up in early June as well.

And so Ramon, who’s our, developer advocate, and you may have seen on some of our communities and and our Slack channels, certainly. He’s gonna be running an L and M kind of workshop, as well. So we’ve got more L and M content coming up, so look out for that as well. You’ll get links to that fairly soon.

Like I said, I think it’s on June the third. So it hasn’t been announced yet.

Quick private announcement to everyone here, I guess. But, yeah, thanks very much to everyone for joining. I hope you enjoyed it.

Thank you, Josh, again, for for for coming along and and presenting and doing the demo today. Session is recorded. You’ll get a link to that, fairly quickly after after the webinar.

Take care. Have a good evening or a good afternoon or enjoy the rest of the day wherever you are in the world.

On-Demand Webinar