Demystifying large language models (LLMs)

Beatriz De miguel pérez
Last updated on May 24, 2024
6 min


It is clear that Machine Learning is here to stay. And on the back of all the hype, it seems that every company is willing to add any functionality that has something to do with Artificial Intelligence (AI). But where do we start? Is that hype justified? What is really new?

Machine Learning is nothing new; the recent revolution came with the famous transformers and our beloved LLMs (large language models). But what are these LLMs? Are they as powerful as they seem? Can they really solve any type of problem? And do they apply everywhere?

Let's try to answer these questions.

What is a large language model (LLM)?

Everyone has heard of ChatGPT. And yes, it’s easy to get lost in the continuous release of new versions and models. Developers no longer need to start from scratch. Now, we have easy access to pre-trained deep learning models, ready for deployment.

In the end, they are simply language models with billions of parameters, trained on large amounts of text. It's worth noting that not every language model is "large": many useful models are far smaller and do not meet that criterion. When we refer to LLMs, we mean models with billions of parameters.

These Machine Learning models, like all NLP models, transform natural language into vectors. Transformers, a specific type of neural network, stack many layers and rely on a mechanism called attention to relate each word to the words around it, an approach that has shown remarkable success. In simple terms, they convert words and sentences into numerical representations that machines can process.
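To make "turning text into vectors" concrete, here is a minimal sketch. It is an assumption-heavy toy: it counts words (a bag-of-words vector) instead of using the learned representations a transformer produces, and the function names and sample sentences are invented for illustration.

```python
import math
from collections import Counter

def build_vocabulary(sentences):
    """Map every distinct lowercase word to a fixed position in the vector."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {word: i for i, word in enumerate(words)}

def vectorize(sentence, vocabulary):
    """Turn a sentence into a list of word counts, one slot per vocabulary word."""
    counts = Counter(sentence.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

def cosine_similarity(a, b):
    """Score how closely two vectors point in the same direction (0.0 = no overlap)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical sample sentences, not real call data.
sentences = [
    "the call quality was great",
    "the call quality was terrible",
    "please send me the invoice",
]
vocab = build_vocabulary(sentences)
vectors = [vectorize(s, vocab) for s in sentences]

print(cosine_similarity(vectors[0], vectors[1]))  # two sentences about call quality
print(cosine_similarity(vectors[0], vectors[2]))  # call quality vs. invoice request
```

The first pair scores higher than the second because the sentences share more words. Note that "great" and "terrible" still end up with very similar vectors here: word counts ignore meaning, which is exactly the gap that learned representations are meant to close.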

What kind of problems can LLMs solve?

LLMs are highly useful, but they don't solve everything. Because they work with natural human language, they are mainly used to analyze or generate text. This field is known as Natural Language Processing (NLP). They do not just recognize words; they also capture the meaning and context of those words.

We need to know what the basic NLP tasks are. Those tasks can be divided into several groups. Two big groups are as follows.

  • The traditional classification and regression approaches (finding the right class for a text, among a closed list of classes; or finding its right numerical score)

  • The new generative AI approach (that models any problem as the completion of a text: answering a question or completing a sequence)

So OpenAI's ChatGPT is a specific type of generative language model (included in the second group) that focuses on generating human-like text responses based on the patterns it has learned from large amounts of text data.

For example, language models can be used for tasks such as sentiment analysis, content generation, identifying toxic language, or handling enterprise Personally Identifiable Information (PII).
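The two groups of tasks described above can be contrasted with a small sketch. Everything here is illustrative: the label list, the prompt wording, and the function names are assumptions, and no real model is called.

```python
LABELS = ("negative", "neutral", "positive")

def classify(scores):
    """Classification framing: pick one label from a closed list by highest score."""
    return max(LABELS, key=lambda label: scores[label])

def build_prompt(text):
    """Generative framing: express the same task as a text to be completed."""
    return (
        "Classify the sentiment of the sentence as negative, neutral or positive.\n"
        f"Sentence: {text}\n"
        "Sentiment:"
    )

def parse_completion(completion):
    """Map free-form generated text back onto the closed label list."""
    stripped = completion.strip().lower()
    first_word = stripped.split()[0].strip(".,!") if stripped else ""
    return first_word if first_word in LABELS else None

# A classifier returns one score per label (hypothetical numbers).
print(classify({"negative": 0.1, "neutral": 0.2, "positive": 0.7}))

# A generative model returns free text that still has to be interpreted.
print(build_prompt("The agent solved my issue quickly"))
print(parse_completion(" Positive."))
print(parse_completion("I think it is fine"))
```

The design point is that a classifier can only answer with one of its labels, while a generative model can answer with anything, so the generative route needs a parsing step and a plan for answers that fit no label.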

Where can I find these models and how do I choose between them?

There are many proprietary or paid LLM services, such as ChatGPT, Mistral, and Cohere, that generate text given a prompt. However, many companies also publish pre-trained language models through open-source libraries and model hubs, with Hugging Face being a key player.

What are the important things I need to consider to choose the right model?

When choosing a Machine Learning model, you need to consider the following. 

  • Accuracy: how accurate are the model's responses for your task?

  • Scalability and reliability: consider how your traffic will scale and any third-party quotas you need to operate within to ensure the model can adapt to your needs.

  • Speed of development: how easy is the model to use and deploy? What infrastructure is required, and how easy is it to test?

  • Cost: big models can be very expensive (especially if they are so big that they require heavy computing or expensive GPU machines), so be sure to do the math to avoid a surprise bill.
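As a minimal sketch of "doing the math" on cost, the snippet below compares a pay-per-token API with a self-hosted model. The function names are hypothetical and every price and traffic figure is a placeholder, not a real vendor rate; plug in your own numbers.

```python
def monthly_api_cost(requests_per_day, tokens_per_request, price_per_1k_tokens, days=30):
    """Estimate the monthly bill for a model billed per 1,000 tokens."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1000 * price_per_1k_tokens

def monthly_hosting_cost(hourly_instance_price, instances=1, hours_per_day=24, days=30):
    """Estimate the monthly bill for machines that run a self-hosted model."""
    return hourly_instance_price * instances * hours_per_day * days

# Placeholder inputs: 10,000 requests a day, 500 tokens each, 0.01 per 1k tokens.
api_bill = monthly_api_cost(10_000, 500, 0.01)

# Placeholder input: one GPU instance at 1.50 per hour, running all month.
hosting_bill = monthly_hosting_cost(1.50)

print(f"API estimate: {api_bill:.2f}")
print(f"Self-hosting estimate: {hosting_bill:.2f}")
```

Even a rough estimate like this shows how quickly traffic growth changes the picture: the API bill scales with every request, while the hosting bill scales with the number of machines.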

Where to begin: what would be my first model candidates?

Firstly, we need to know the kind of task we want to solve. Is it a classification problem? Or do I want to generate text? Let's see some examples.

For instance, let's say you want to classify the sentiment of a sentence (neutral, positive or negative). A common response is, "Just use GPT-4." It's often assumed that GPT-4 is the magic solution for every problem, as if simply plugging it in will effortlessly solve any challenge. In practice, a smaller model may work just as well while being cheaper and faster.

This is what we strive for at Aircall. Our data consists primarily of call transcripts, which differ significantly in context from the generic text data, such as Amazon reviews, that many models are trained on. Judging those models only by their results on generic data is akin to comparing apples to oranges. Therefore, it's crucial to test with our own annotated benchmarks derived from call data.

There are specific models for sentiment analysis: models with a few hundred million parameters, which are smaller but still accurate for that task. Our journey at Aircall involved exhaustive assessments of sentiment analysis using GPT, smaller open-source models, and AWS solutions, all evaluated on our proprietary data. These tests reflected the reality: trusting the leading models on reputation alone isn't enough. Measuring accuracy on our own data was crucial for filtering out models with poor sentiment classification performance.
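A benchmark of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not Aircall's actual evaluation code: the annotated examples are invented, and the two "models" are trivial stand-in functions where real candidates (an API call, an open-source model) would be plugged in.

```python
def accuracy(predict, benchmark):
    """Share of annotated examples for which the model returns the expected label."""
    correct = sum(1 for text, label in benchmark if predict(text) == label)
    return correct / len(benchmark)

def rank_models(candidates, benchmark):
    """Score every candidate on the same benchmark, best first."""
    scores = {name: accuracy(predict, benchmark) for name, predict in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical annotated examples standing in for labeled call transcripts.
benchmark = [
    ("thanks, that solved my problem", "positive"),
    ("i have been waiting for an hour", "negative"),
    ("can you repeat the order number", "neutral"),
    ("this is the worst service ever", "negative"),
]

def keyword_model(text):
    """Stand-in candidate: a naive keyword rule."""
    if any(word in text for word in ("worst", "waiting")):
        return "negative"
    if "thanks" in text:
        return "positive"
    return "neutral"

def always_neutral(text):
    """Stand-in candidate: a baseline that ignores its input."""
    return "neutral"

candidates = {"keyword_baseline": keyword_model, "always_neutral": always_neutral}
ranking = rank_models(candidates, benchmark)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because every candidate is just a function from text to label, the same benchmark can compare very different options on equal terms.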

Let's demystify the notion that GPT-4 is the sole solution and empower ourselves with a broader toolkit for AI development. It’s important to acknowledge that the bigger model is not always the better option. What qualifies as "best" varies depending on the task at hand. In many cases, a model as powerful as GPT-4 might be excessive, the equivalent of using a sledgehammer to crack a nut.

Unlike dedicated sentiment analysis models, which you can fine-tune freely on your own data, a hosted general-purpose model like ChatGPT offers far less room for task-specific fine-tuning. This means you can't tailor its performance to the nuances of sentiment analysis in the same way you could with a specialized model. With GPT you also face a prompt engineering challenge and limits on prompt size.

In the world of Artificial Intelligence, it's not just about size. It's crucial to look beyond model size and prioritize quality, relevance, and a strategic assessment of the alternatives. Companies need to adjust quickly to this changing perspective to stay competitive and keep their operations future-proof.

As AI models evolve, their success may depend less on size and more on how well the data behind them is curated and refined. More important than model selection alone is the careful curation of data and the ongoing monitoring and improvement of data quality throughout its lifecycle. Our main aim is to adapt solutions to our specific context, which is why benchmarks based on our own data are so important for getting the best results.

Let's consider the second example: generating a summary of a provided text. Here, the first step would be to consult the leaderboard for summarization models on Hugging Face. After that, it's crucial to establish a benchmark using your own data and assess which model yields the most accurate results.

To summarize, AI developers should consider the following when investigating models:

  • Identify key criteria for your specific use case. Are you looking for the best task performance, the most efficient model, or how adaptable it is to different uses?

  • Utilize filtering options in the leaderboard to pinpoint models excelling in relevant benchmarks. 

  • Weigh the trade-offs between model size and precision based on deployment context, choosing between faster, smaller models and larger, more capable ones. 

  • Prioritize models fine-tuned on datasets similar to your application's domain for optimal performance, or consider fine-tuning them yourself.
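The checklist above can be turned into a simple shortlist rule. The sketch below is one possible way to do it, assuming you have already measured each candidate on your own benchmark; the class, the model names, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float      # measured on your own benchmark
    latency_ms: float    # typical response time
    monthly_cost: float  # estimated bill at your traffic

def pick_model(candidates, min_accuracy, max_latency_ms):
    """Keep candidates that meet the requirements, then choose the cheapest one."""
    eligible = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms
    ]
    return min(eligible, key=lambda c: c.monthly_cost) if eligible else None

# Hypothetical measurements, for illustration only.
candidates = [
    Candidate("large_general_model", accuracy=0.91, latency_ms=900, monthly_cost=1500.0),
    Candidate("small_finetuned_model", accuracy=0.89, latency_ms=60, monthly_cost=200.0),
    Candidate("tiny_model", accuracy=0.70, latency_ms=20, monthly_cost=50.0),
]

choice = pick_model(candidates, min_accuracy=0.85, max_latency_ms=1000)
print(choice.name if choice else "no candidate meets the requirements")
```

Setting the requirements first and optimizing cost second keeps the decision tied to the use case instead of to model size.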

Can I use LLMs in bigger contexts?

Let's think bigger. Let's say we need to build analytics on top of text. The short answer is that you will need something more to achieve that. Models like GPT are designed primarily to generate outputs from relatively short inputs, not to process large volumes of data at once. As you might know, the context you can give to those models is limited.
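One common workaround for a limited context is to split long text into pieces that fit. The sketch below is a simplified illustration: it counts whitespace-separated words as a rough stand-in for tokens (real models count tokens with their own tokenizer), and the function name and parameters are assumptions.

```python
def split_into_chunks(text, max_words, overlap=0):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears with some of its surrounding context.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be between 0 and max_words - 1")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Hypothetical transcript of ten words, split into chunks of four with one shared word.
transcript = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
for chunk in split_into_chunks(transcript, max_words=4, overlap=1):
    print(chunk)
```

Chunking alone only makes each piece fit; it does not tell you which pieces matter for a given question, which is the gap that retrieval addresses.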

When it comes to building robust text analytics tools that handle large volumes of data, Retrieval-Augmented Generation (RAG) appears as the next evolutionary step. RAG is not a model in itself but an approach that combines an LLM with a retrieval step.

RAG stores the data in a vector database, retrieves only the passages most relevant to a given question, and passes them to the LLM as context. This makes it possible to integrate and analyze big datasets that go beyond the context limits of an LLM on its own. It marks a move towards stronger deep learning tools necessary for managing broader contexts in AI applications.
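The retrieval half of RAG can be sketched as follows. This is a toy under loud assumptions: `TinyVectorStore` is a hypothetical in-memory stand-in for a real vector database, the "embedding" is a plain word count instead of a learned vector, the notes are invented, and the final call to an LLM is left out.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use a learned embedding model."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=2):
        """Return the k stored texts most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(self.items, key=lambda item: similarity(query_vector, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def build_rag_prompt(question, store, k=2):
    """Put only the retrieved passages into the prompt, not the whole dataset."""
    context = "\n".join(f"- {passage}" for passage in store.search(question, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}\nAnswer:"

store = TinyVectorStore()
store.add("customer asked for a refund on the last invoice")
store.add("caller reported poor audio quality during the call")
store.add("customer wants to upgrade the plan")

print(build_rag_prompt("refund invoice", store, k=1))
```

The prompt stays small no matter how many passages the store holds, because only the top-ranked ones are included.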

However, exploring the complexities of RAG and its potential could be another full article. Stay tuned for an in-depth exploration of how this powerful tool is shaping the future of AI-driven analytics.


Published on May 23, 2024.
