You have positioned your service as "AI-driven" by integrating large language models. Your website homepage proudly showcases the revolutionary impact of your AI-driven services through interactive demos and case studies. This is also the first mark your company has left in the global GenAI field.
Your small but loyal user base is enjoying an improved customer experience, and you can see potential for future growth. But as the month enters its third week, you receive a surprising email from OpenAI. Just a week ago you were still talking to customers to assess product-market fit (PMF); now thousands of users are flocking to your site (anything can go viral on social media these days) and crashing your AI-driven service.
As a result, your once-reliable service not only frustrates existing users but also turns away new ones.
A quick and obvious solution is to restore service immediately by increasing the usage limit.
However, this temporary fix brings with it a sense of unease. You can't help but feel locked into a single vendor, with limited control over your own AI and its associated costs.
"Should I do it myself?" you ask yourself.
You already know that open-source large language models (LLMs) have become a reality. On platforms like Hugging Face, thousands of models are available off the shelf, opening the door to running natural language processing on your own terms.
However, the most powerful LLMs you will encounter have billions of parameters, run into hundreds of gigabytes, and require significant effort to scale. In a real-time system that requires low latency, you can't simply plug them into your application as you can with traditional models.
While you may be confident in your team's ability to build the necessary infrastructure, the real concern is the cost implications of this transformation, including:
The cost of fine-tuning
Let's run some numbers using LLaMA 2.
If you consult your machine learning (ML) engineers, they will probably tell you that LLaMA 2 is an open-source LLM that looks like a good choice, because on most tasks it performs about as well as the GPT-3 model you currently use.
You will also find that the model comes in three sizes (7 billion, 13 billion, and 70 billion parameters), and you decide to use the largest, the 70-billion-parameter model, to stay competitive with the OpenAI model you are currently using.
LLaMA 2 was trained in bfloat16, so each parameter consumes 2 bytes. That puts the size of the 70B model at 140 GB.
If that sounds like a lot of model to fine-tune, don't worry. With LoRA, you don't need to fine-tune the entire model before deployment.
In fact, you may only need to fine-tune about 0.1% of the total parameters, roughly 70 million, which consume about 0.14 GB in bfloat16.
Impressive, right?
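If you want to sanity-check those numbers yourself, here is a minimal back-of-the-envelope sketch in Python, assuming the figures above (70B parameters, 2 bytes per bfloat16 parameter, and LoRA training about 0.1% of the weights):

```python
# Back-of-the-envelope sizing, using the assumptions above:
# 70B parameters, 2 bytes per parameter (bfloat16),
# and LoRA training roughly 0.1% of the weights.
total_params = 70e9
bytes_per_param = 2                        # bfloat16
model_size_gb = total_params * bytes_per_param / 1e9
print(f"Full model size: ~{model_size_gb:.0f} GB")                       # ~140 GB

lora_fraction = 0.001                      # ~0.1% trainable parameters
lora_params = total_params * lora_fraction
lora_size_gb = lora_params * bytes_per_param / 1e9
print(f"LoRA params: ~{lora_params/1e6:.0f}M (~{lora_size_gb:.2f} GB)")  # ~70M, ~0.14 GB
```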
To accommodate the memory overhead during fine-tuning (backpropagation, stored activations, the dataset itself), a good rule of thumb is to budget roughly 5 times the memory consumed by the trainable parameters.
Let's break it down in detail:
When using LoRA, the weights of the LLaMA 2 70B model are frozen, so they add no training overhead beyond the weights themselves → memory requirement = 140 GB.
However, to train the LoRA layers, we need to budget 0.14 GB * 5 = 0.7 GB.
This results in a total memory requirement of approximately 141 GB during fine-tuning.
Assuming you don't currently have any training infrastructure, let's say you go with AWS. Based on AWS EC2 on-demand pricing, compute at this scale runs about $2.80 per hour, so fine-tuning costs roughly $67 per day. That's not a huge expense, since fine-tuning doesn't run for many days.
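Put together, a rough sketch of the fine-tuning budget (reusing the 140 GB of frozen weights, the 0.14 GB of LoRA weights with a 5x overhead factor, and the quoted ~$2.80/hour rate, all of which are the article's assumptions rather than live prices) might look like this:

```python
# Fine-tuning budget sketch, reusing the numbers above: frozen base
# weights (140 GB), LoRA layers (0.14 GB) with a ~5x overhead factor,
# and the quoted ~$2.80/hour on-demand rate (illustrative, not a live quote).
base_weights_gb = 140.0
lora_weights_gb = 0.14
overhead_factor = 5                        # backprop, activations, dataset

total_memory_gb = base_weights_gb + lora_weights_gb * overhead_factor
print(f"Fine-tuning memory: ~{total_memory_gb:.0f} GB")   # ~141 GB

hourly_rate_usd = 2.80
daily_cost_usd = hourly_rate_usd * 24
print(f"Fine-tuning cost: ~${daily_cost_usd:.0f}/day")    # ~$67/day
```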
AI is the opposite of a restaurant: the main cost is in serving, not preparation
When deploying, you need to keep the model weights in memory at all times:
The model weights alone consume 140 GB of memory. You can skip gradient computation at inference time, but it is still wise to provision about 1.5x that amount, roughly 210 GB, to absorb any unexpected overhead.
Again based on AWS EC2 on-demand pricing, GPU compute costs approximately $3.70 per hour, which works out to approximately $90 per day to keep the model in production memory and respond to incoming requests.
This equates to about $2,700 per month.
Another thing to consider is that unexpected failures happen all the time. If you have no backup mechanism, your users simply stop receiving model predictions. To prevent this, you need to keep a redundant second instance running in case requests to the first one fail.
So this brings your cost to $180 per day, or $5,400 per month, which is already close to what you currently pay OpenAI.
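A quick sketch of the serving side, under the same assumptions (1.5x memory headroom, the rounded ~$90/day-per-replica figure, and a second replica for redundancy):

```python
# Serving-cost sketch: the full model kept resident with ~1.5x headroom,
# plus a second replica for redundancy. Figures follow the text's rounding
# of AWS on-demand GPU pricing (~$3.70/hour, ~$90/day per replica); they
# are illustrative, not live price quotes.
model_gb = 140.0
headroom = 1.5
print(f"Memory per replica: ~{model_gb * headroom:.0f} GB")   # ~210 GB

daily_per_replica_usd = 90.0   # ~$3.70/hour * 24 hours, rounded as in the text
replicas = 2                   # primary + redundant backup
daily_total = daily_per_replica_usd * replicas                # $180/day
monthly_total = daily_total * 30                              # $5,400/month
print(f"Daily serving cost: ~${daily_total:.0f}")
print(f"Monthly serving cost: ~${monthly_total:,.0f}")
```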
If you stick with OpenAI instead, here is how many words you would need to process per day to match the fine-tuning and serving costs of LLaMA 2 above.
According to OpenAI's pricing, fine-tuning GPT-3.5 Turbo costs $0.0080 per 1,000 tokens.
Assuming an average of about two tokens per word, to match the fine-tuning cost of the open-source LLaMA 2 70B model ($67 per day) you would need to feed the OpenAI model approximately 4.15 million words.
An A4 page typically holds about 300 words, which means you could feed the model roughly 14,000 pages of data before matching the open-source fine-tuning cost, and that is a huge amount of data.
You probably don't have that much fine-tuning data, so in practice the cost of fine-tuning with OpenAI stays lower.
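For completeness, here is the same break-even arithmetic written out; the results land close to the figures above, with small differences due to rounding:

```python
# Break-even against the ~$67/day LLaMA 2 fine-tuning estimate, using the
# text's assumptions: $0.008 per 1K training tokens, ~2 tokens per word,
# and ~300 words per A4 page.
daily_budget_usd = 67.0
price_per_1k_tokens_usd = 0.008

tokens = daily_budget_usd / price_per_1k_tokens_usd * 1_000   # ~8.4M tokens
words = tokens / 2                                             # ~4.2M words
pages = words / 300                                            # ~14,000 pages
print(f"~{words/1e6:.1f}M words (~{pages:,.0f} A4 pages) of fine-tuning data")
```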
Another point worth noting: OpenAI's fine-tuning cost depends on the amount of data, not on training time. That is not the case for open-source models, where the cost depends on both the amount of data and how long you occupy AWS compute resources.
As for serving costs, OpenAI's pricing page lists a fine-tuned GPT-3.5 Turbo at $0.003 per 1,000 input tokens and $0.006 per 1,000 output tokens.
We assume an average of $0.004 per 1000 tokens. To reach the cost of $180 per day, we need to process approximately 22.2 million words per day through the API.
This equates to over 74,000 pages of data, with 300 words per page.
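And the equivalent serving break-even, again under the text's assumptions (a blended $0.004 per 1,000 tokens, roughly two tokens per word, 300 words per page); rounding explains the small gap from the figures above:

```python
# Break-even against the ~$180/day self-hosting estimate, using the text's
# assumptions: a blended ~$0.004 per 1K tokens, ~2 tokens per word,
# and ~300 words per page.
daily_budget_usd = 180.0
blended_price_per_1k_usd = 0.004

tokens = daily_budget_usd / blended_price_per_1k_usd * 1_000   # ~45M tokens
words = tokens / 2                                              # ~22.5M words
pages = words / 300                                             # ~75,000 pages
print(f"~{words/1e6:.1f}M words (~{pages:,.0f} pages) per day through the API")
```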
However, the benefit is that you don’t need to ensure the model is running 24/7 as OpenAI offers pay-per-use pricing.
If your model is never used, you pay nothing.
At first, moving to self-hosted AI may seem like a tempting endeavor. But beware of the hidden costs and headaches that come with it.
Aside from the occasional sleepless night where you wonder why your AI-driven service is down, almost all of the difficulties of managing LLMs in production systems disappear if you use a third-party provider.
Especially when "AI" is not your product itself, but just one of the things your product relies on.
For a large enterprise, an annual cost of ownership of roughly $65,000 may be a drop in the bucket, but for most businesses it is a figure that cannot be ignored.
Additionally, don't forget other expenses such as talent and maintenance, which can easily push the total cost into the $200,000 to $250,000 per year range.
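As a rough sketch only, with the talent and maintenance figures being illustrative placeholders rather than real quotes, the annual total cost of ownership might be estimated like this:

```python
# Rough annual total-cost-of-ownership sketch: infrastructure from the
# ~$180/day serving estimate, plus a placeholder range for talent and
# maintenance. The staffing figures are illustrative assumptions only.
infra_annual_usd = 180 * 365                       # ~$65,700/year just to serve
talent_and_maintenance_usd = (135_000, 185_000)    # hypothetical range

low = infra_annual_usd + talent_and_maintenance_usd[0]
high = infra_annual_usd + talent_and_maintenance_usd[1]
print(f"Infrastructure: ~${infra_annual_usd:,.0f}/year")
print(f"Estimated total: ~${low:,.0f} to ${high:,.0f}/year")
```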
Of course, owning the model from the start has its benefits, such as keeping control over your data and how it is used.
But for self-hosting to be worthwhile, you need user request volume well above the break-even point of roughly 22.2 million words per day, along with the resources to manage both the talent and the infrastructure.
For most use cases, it may not be financially worthwhile to have a model instead of using an API.