“Bigger is always better” — this principle is deeply rooted in the AI world. Every month, larger models are created, with more and more parameters. Companies are even building $10 billion AI data centers for them. But is it the only direction to go?
At NeurIPS 2024, Ilya Sutskever, one of OpenAI’s co-founders, shared an idea: “Pre-training as we know it will unquestionably end”. It seems the era of scaling is coming to a close, which means it’s time to focus on improving current approaches and algorithms.
One of the most promising areas is the use of small language models (SLMs) with up to 10B parameters. This approach is really starting to take off in the industry. For example, Clem Delangue, CEO of Hugging Face, predicts that up to 99% of use cases could be addressed using SLMs. A similar trend is evident in YC's latest Requests for Startups:
Giant generic models with a lot of parameters are very impressive. But they are also very costly and often come with latency and privacy challenges.
In my last article “You don’t need hosted LLMs, do you?”, I wondered if you need self-hosted models. Now I take it a step further and ask the question: do you need LLMs at all?
In this article, I’ll discuss why small models may be the solution your business needs. We’ll talk about how they can reduce costs, improve accuracy, and maintain control of your data. And of course, we’ll have an honest discussion about their limitations.
The economics of LLMs is probably one of the most painful topics for businesses. However, the issue is much broader than API bills: it includes expensive hardware, infrastructure costs, energy consumption, and environmental consequences.
Yes, large language models are impressive in their capabilities, but they are also very expensive to maintain. Have you noticed how subscription prices for LLM-based applications keep rising? For example, OpenAI's recent announcement of a $200/month Pro plan is a signal that costs are growing, and competitors will likely move toward similar price levels.
The Moxie robot story is a good illustration. Embodied built an $800 companion robot for kids that relied on the OpenAI API. Despite the product's success (kids were sending 500–1000 messages a day!), the company is shutting down due to the high operational costs of the API. Now thousands of robots will become useless, and kids will lose their friend.
One approach is to fine-tune a specialized small language model for your specific domain. It won't solve "all the problems of the world", but it will handle the one task it is assigned, such as analyzing client documentation or generating specific reports. At the same time, an SLM is cheaper to maintain, consumes fewer resources, requires less data, and can run on far more modest hardware, even a smartphone.
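To make that concrete, here is a minimal sketch of what such domain fine-tuning might look like with the Hugging Face transformers and peft libraries. The base model, dataset file, and hyperparameters are illustrative assumptions, not a prescription:

```python
# A minimal LoRA fine-tuning sketch for a small model.
# Assumptions: the model name, dataset file, and hyperparameters are
# illustrative placeholders; swap in your own domain data
# (e.g., client documentation as a JSONL of {"text": ...} records).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "Qwen/Qwen2-7B"  # any ~7B base model works similarly
tokenizer = AutoTokenizer.from_pretrained(base)
# In practice you'd likely load the base in 4-bit (QLoRA) to fit
# a single consumer GPU; kept full-precision here for brevity.
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA trains only a small adapter on top of frozen weights,
# which is what keeps fine-tuning affordable.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

data = load_dataset("json", data_files="domain_corpus.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="slm-domain",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        learning_rate=2e-4,
    ),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The point of the adapter approach is that the trainable parameter count drops by orders of magnitude, so the hardware bill drops with it.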
And finally, let's not forget about the environment. In the article Carbon Emissions and Large Neural Network Training, I found a statistic that amazed me: training GPT-3 with its 175 billion parameters consumed as much electricity as the average American home does in 120 years. It also produced 502 tons of CO₂, comparable to the annual operation of more than a hundred gasoline cars. And that's not counting inference costs. By comparison, deploying a smaller 7B model requires roughly 5% of the energy a larger model consumes. And what about the latest o3 release?
💡 Hint: don't chase the hype. Before tackling a task, estimate the cost of using APIs versus your own servers, think about how such a system will scale, and ask whether using an LLM is justified at all.
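The estimate can be a genuine back-of-envelope calculation. Every number in the sketch below is a made-up assumption for illustration, not a quote from any provider; plug in your own traffic and rates:

```python
# Back-of-envelope cost comparison: hosted LLM API vs. a self-hosted SLM.
# All numbers are illustrative assumptions; replace them with your own.
requests_per_day = 50_000
tokens_per_request = 1_500          # assumed prompt + completion
api_price_per_1k_tokens = 0.01      # assumed blended $/1K tokens

api_monthly = (
    requests_per_day * 30 * tokens_per_request / 1_000 * api_price_per_1k_tokens
)

gpu_hourly = 1.20                   # assumed rate for a single-GPU instance
gpus_needed = 2                     # assumed to cover peak load
selfhost_monthly = gpu_hourly * 24 * 30 * gpus_needed

print(f"API:       ${api_monthly:,.0f}/month")    # -> $22,500/month
print(f"Self-host: ${selfhost_monthly:,.0f}/month")  # -> $1,728/month
```

At low volumes the API usually wins; the crossover point depends entirely on your traffic, and the comparison is worth redoing every time prices or volumes change.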
Now that we’ve covered the economics, let’s talk about quality. Naturally, very few people would want to compromise on solution accuracy just to save costs. But even here, SLMs have something to offer.
Many studies show that for highly specialized tasks, small models can not only compete with large LLMs, but often outperform them. Let’s look at a few illustrative examples:
- Medicine: The Diabetica-7B model (based on Qwen2-7B) achieved 87.2% accuracy on diabetes-related tests, while GPT-4 scored 79.17% and Claude 3.5 scored 80.13%. Yet Diabetica-7B is dozens of times smaller than GPT-4 and can run locally on a consumer GPU.
- Legal Sector: An SLM with just 0.2B parameters achieves 77.2% accuracy in contract analysis (GPT-4: about 82.4%). Moreover, for tasks like identifying "unfair" terms in user agreements, the SLM even outperforms GPT-3.5 and GPT-4 on the F1 metric.
- Mathematical Tasks: Research by Google DeepMind shows that training the small Gemma2-9B model on data generated by another small model yields better results than training on data from the larger Gemma2-27B. Smaller models tend to focus on the specifics of a task without trying to "shine with all their knowledge", a trait often seen in larger models.
- Content Moderation: LLaMA 3.1 8B outperformed GPT-3.5 in accuracy (by 11.5%) and recall (by 25.7%) when moderating content across 15 popular subreddits. This was achieved even with 4-bit quantization, which further shrinks the model (see the loading sketch right after this list).
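For reference, 4-bit loading is essentially a one-flag change in the Hugging Face stack. A minimal sketch, assuming bitsandbytes is installed; the model name is just an example:

```python
# Loading a small model in 4-bit via bitsandbytes: weights shrink ~4x
# versus fp16, often with little quality loss on narrow tasks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example; any causal LM works
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store in 4-bit, compute in bf16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)
```

An 8B model that needs ~16 GB in fp16 fits in roughly 5-6 GB this way, which is what puts it within reach of a single consumer GPU.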
I'll go a step further: even classic NLP approaches often work surprisingly well. Here's a personal case: I'm working on a product for psychological support where we process over a thousand user messages every day. Users write in a chat and get a response, and each message is first classified into one of four categories.
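A routing classifier like that doesn't need a transformer at all. Below is a sketch of the classic approach (TF-IDF features plus logistic regression); the category names and training examples are hypothetical placeholders I invented for illustration, not the ones from my project:

```python
# A classic-NLP routing classifier: TF-IDF features + logistic regression.
# Category names and training examples are hypothetical toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "how do I cope with anxiety before exams",
    "thank you, that really helped me",
    "I feel like I can't go on anymore",
    "just checking if this chat works",
]
# Placeholder labels; a real system would use your own four categories.
train_labels = ["question", "gratitude", "crisis", "other"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

# With a real training set this routes new messages in microseconds on a CPU.
print(clf.predict(["thank you so much"]))  # likely ['gratitude']
```

A pipeline like this trains in seconds, runs without a GPU, and is trivial to audit; only the messages it can't handle confidently need a heavier model.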