Most enterprises should not spend time or money finetuning a chat model. And yet, 72% of enterprises that are customizing LLMs are doing so with finetuning.1
But there are better ways to spend time and budget. The three highest ROI enterprise use cases for finetuning transformer models are:
- Text classification
- Named entity recognition (NER)
- Information retrieval (via embedding-based nearest neighbors search)
We’ll explore these use cases by precisely defining the different types of transformer architectures, mapping them to enterprise use cases, and explaining the role of finetuning in each. We’ve operationalized each of these use cases (including chat completion) at BAM, and in this piece we’ll cover two of them in detail:
- Finetuning an encoder-only LLM for NER to help a portfolio company detect mentions of specific locations in unstructured text datasets, an exciting new feature now part of their core product.
- Finetuning our own embeddings model to boost information retrieval performance in our RAG applications, leading to better relevancy and recall (and yes, better chatbots).
First, some background...
Staying current in machine learning is challenging. As technologists and investors, how can we quickly grok techniques that were announced last month? When diving into a new paper, how do we determine the level of abstraction that we can tolerate? Start by asking two important questions:
- What use case was this invented for? This is very likely to inform the technique’s inner workings.
- What use cases has this been repurposed for? This helps us solidify our understanding, start to think creatively, and explore the increments and perturbations that have been built on top of the original technique.
Let’s apply this framework to transformers. My goal here is to help time-constrained readers demystify the LLM black box and understand finetuning via a new mental model.
Why was the transformer architecture invented?2
The original transformer model from Attention Is All You Need is an encoder-decoder architecture applied to machine translation tasks (most prominently English-to-German and English-to-French). I repeat: encoder-decoder transformers were first invented for machine translation. So why is this important to remember?
Machine translation is clearly a sequence-to-sequence task (to generate the first word in the target language, the model needs to fully encode and “understand” the entirety of the original sentence). It’s easy to form an intuition around the encoder block being responsible for generating a latent representation of the entire original text, and the decoder block being responsible for autoregressively generating the target language representation one token at a time thereafter. This intuition can help us all remember that there are different types of transformer architectures that are better suited for different types of language-related tasks. Not all transformer-based3 LLMs are for chatting or code completion.
How has the transformer been repurposed?
- For other sequence-to-sequence tasks: e.g. summarization. Much like machine translation, summarization requires the entirety of the original text to be processed before any word of the output (the summary) can be generated. Thus, encoder-decoder transformers (like T5) are a natural fit.
- For “next token prediction” tasks: e.g. chat completion (chatbots!). This informs the invention of the decoder-only architecture, which makes sense for a host of reasons (simplicity, efficiency via state preservation, direct objective alignment, and more). GPT-4 falls into this category.
- For generating pretrained embeddings:
- These can form a basis for task-specific finetuning with as little as a single additional layer in the neural network: e.g. in text classification, question answering, entailment, NER (enter the encoder-only architectures like BERT and RoBERTa – but note that embeddings can also come “for cheap” with initialization via decoder-only model weights plus contrastive training against pair data; this is what OpenAI does). A minimal sketch of this single-extra-layer pattern follows this list.
- These also form a fundamental building block in semantic information retrieval (a popular realization of this in chat completion is retrieval-augmented generation, or RAG).
- For tasks outside of natural language: computer vision, multimodal modeling, and even time series forecasting (with limited success4).
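To make the pretrained-embeddings bullet concrete, here is a minimal sketch (assuming roberta-base and a hypothetical three-label classification task, neither of which is prescribed by anything above) of an encoder-only model whose pooled output feeds a single additional linear layer:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")  # encoder-only LLM

num_labels = 3  # hypothetical label count for a text classification task
classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)  # the "one extra layer"

inputs = tokenizer("Transformers are not just for chatbots.", return_tensors="pt")
with torch.no_grad():
    hidden_states = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden)
cls_embedding = hidden_states[:, 0, :]  # embedding of the sequence-start token
logits = classifier(cls_embedding)      # (batch, num_labels)
```

Finetuning then means training the new head (and usually the encoder weights too) on labeled examples; the same pattern with a per-token head gives NER.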
Don’t flop your FLOPS5
So let’s talk about the finetuning buzz. MosaicML has orchestrated pretraining and/or finetuning for over 100,000 large language models to date.6 OpenAI, AWS, Anyscale, Scale AI, Together.ai, and more have entered the finetuning-chat-completion-as-a-service space. We can debate the ROI we’ll see from finetuning LLMs for chat, but we can’t doubt that enterprises are putting their money where their mouth is on the subject.
There are absolutely situations in which finetuning a chat completion model makes sense. OpenAI does a good job of laying these reasons out, the most notable of which are (1) altering the style or tone of a model, and (2) improving task-specific reliability.
Still, most enterprises should not spend time or money finetuning chat completion models, and most of them certainly shouldn’t be in the business of training language models from scratch. Why?
- Finetuning is not always effective at incorporating new information into a chat model, which is often the game that enterprises need to play in order to realize value.
- We still lack reliable rules of thumb to guide whether finetuning a chat model is likely to lift your system’s performance. You can’t really look before you leap.
- If you do leap, it’s really hard to evaluate whether your finetuned chat model is better than your alternatives.7
- Even if your model does (provably) perform well, the performance lift may not generate enough value to justify the high (and recurring) development cost.
- We haven’t even talked about data quality.
For enterprises with the desire and chops to finetune a model, there are superior ways to realize value from their generative AI initiatives:
- Finetune pretrained encoder-only models for:
- NER
- Text Classification
- Finetune embeddings for better information retrieval.
Since the architectural approaches for NER and text classification are quite similar, we’ve picked just one to explore further.
Finetuning for Named Entity Recognition (NER)
At BAM Elevate – the growth equity investment team at BAM – we look for ways to deploy value-additive data science and machine learning inside of our portfolio companies by leveraging the research expertise, compute infrastructure, and data platform that many startups (rightfully) haven’t developed internally. One such opportunity arose when a portfolio company identified a product enhancement that would require a high-recall model for extracting mentions of granular locations in unstructured text.
On the surface, location resolution is an NER task that seems like it should be a “solved problem.” But off-the-shelf models on Hugging Face (usually finetuned BERT models or variants thereof) suffered from three types of limitations:
- the models were too general (e.g. they were trained to extract locations, people, and organizations all at once) and thus not performant,
- the models resolved locations at too coarse a granularity (often resolving to the Wikidata location database, which is much less rich than something like Geonames or Overture), and
- the models were built for small text snippets rather than large ones (like research reports or news articles).
Thus, we built a chain of models which would help solve our portfolio company’s problem end-to-end:
- Location detection: a finetuned RoBERTa model with a token classification head (think of this as adding one more layer to the pretrained encoder-only LLM’s neural network to adapt an embedding model into an NER model)
- Candidate generation: a retrieval model for identifying location database entry candidates from the previous step’s detected locations
- Location disambiguation: a neural network for re-ranking the candidates from the previous step
To do so, we built a custom dataset of representative, long-form unstructured text with precisely annotated location mentions. We then designed the model architecture, implemented the finetuning and re-ranking routines, validated the model, and built an inference pipeline. We achieved greater than a 30% lift in the system’s F1 score relative to state-of-the-art location NER models against our holdout dataset. We used PyTorch, Hugging Face Trainer, and NVIDIA GPUs to implement the training routines, OpenSearch for candidate generation, and both AWS SageMaker and Databricks for inference. As an added bonus, we were able to 3x the inference speed in our system versus state-of-the-art benchmarks.
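To make the token classification head concrete, here is a minimal, hypothetical sketch of the first step in the chain above: roberta-base with a per-token classification layer producing BIO-style location tags. The label set, example sentence, and dummy labels are illustrative assumptions, not our production configuration.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-LOC", "I-LOC"]  # hypothetical BIO tag set for location mentions
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# One example; in practice these come from the custom annotated long-form dataset.
text = "The new facility opened outside Austin, Texas last year."
enc = tokenizer(text, return_tensors="pt")
token_labels = torch.zeros_like(enc["input_ids"])  # dummy all-"O" labels, just to show the shapes

out = model(**enc, labels=token_labels)
out.loss.backward()      # a real finetuning step would follow (optimizer.step(), etc.)
print(out.logits.shape)  # (batch, seq_len, num_labels)
```

In practice, the Hugging Face Trainer (or a plain PyTorch loop) runs steps like this over the full annotated dataset, with real per-token labels aligned to the tokenizer’s subwords.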
Finetuning for Information Retrieval
BAM has dozens of business processes that involve some form of information retrieval. Improving the quality of our information retrieval systems has helped us both (1) do new things and (2) do the old things better, faster, and cheaper.
In 2023, enterprises scrambled to hack together retrieval-augmented generation systems to incorporate proprietary or timely data into their LLM workflows. It was a way to quickly solve the frozen model problem, avoid sensitive data leakage, and ground models in truth (e.g. by incorporating citations into retrieved answers). But the journey from minimum viable hack to enterprise-ready system requires many difficult problems to be solved at the same time. We needed to:
- Identify, curate, and clean high-quality unstructured datasets
- Build systems to parse, chunk, and generate metadata against these datasets in pseudo-real time (a toy chunking sketch follows this list)
- Implement those ever-pesky experimentation, evaluation, and observability frameworks
- Tweak all sorts of knobs and levers: prompts, retrieval hyperparameters, post-retrieval re-ranking, metadata filtering, and the aforementioned chunking strategies.
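For illustration only, here is a toy character-window chunker (an assumption, not our production parsing pipeline) showing the kind of chunk-plus-metadata records a retrieval index ingests:

```python
from typing import Iterator

def chunk_document(text: str, doc_id: str, chunk_size: int = 1000,
                   overlap: int = 200) -> Iterator[dict]:
    """Yield overlapping character-window chunks with minimal metadata attached."""
    step = chunk_size - overlap
    for i, start in enumerate(range(0, max(len(text) - overlap, 1), step)):
        yield {
            "doc_id": doc_id,
            "chunk_index": i,
            "text": text[start:start + chunk_size],
        }

chunks = list(chunk_document("..." * 2000, doc_id="report-123"))  # hypothetical document
```

Real pipelines layer on format-aware parsing, semantic or sentence-boundary chunking, and richer metadata, but the record shape stays roughly the same.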
Through this process, it became clear that the highest ROI use of our research cycles would lie in finetuning the embeddings we use to retrieve the most relevant chunks of textual information for a given query. This was true for three reasons:
- Out-of-the-box embedding models, including OpenAI’s, performed poorly (as measured by recall@k) in our information retrieval use cases. The usual advanced RAG approaches helped, but we could do more.
- We needed to build golden copy query-chunk pair datasets to test our systems on realistic, domain-specific use cases anyway. These are the very same datasets that we can use to finetune embeddings via contrastive learning (see the sketch after this list).
- In the “build vs. buy vs. wait” debate, the systems we could justify building were those that require domain-specific expertise to build and are orthogonal to the (significant and frequent) industry-wide advancements coming down the pike. Finetuned embeddings check both those boxes.
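As an illustration of the second point (contrastive finetuning against query-chunk pairs), here is a minimal sketch using the sentence-transformers library; the base model, the toy pairs, and the hyperparameters are assumptions, not our production setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical starting checkpoint

# Golden copy (query, relevant chunk) pairs; in practice these come from the
# evaluation datasets described above.
pairs = [
    InputExample(texts=["What drove Q3 gross margin expansion?",
                        "Gross margin expanded in Q3, driven by supplier renegotiations."]),
    InputExample(texts=["Which regions saw headcount growth?",
                        "Hiring was concentrated in the EMEA and LATAM sales organizations."]),
]

train_dataloader = DataLoader(pairs, shuffle=True, batch_size=2)
# In-batch negatives: every other chunk in the batch is treated as a negative for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```

The same golden copy pairs can then drive a recall@k evaluation (for example, with sentence-transformers’ InformationRetrievalEvaluator) to confirm that the finetuned model actually beats the off-the-shelf baseline.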
Like our contributions to open source, we hope sharing these learnings will provoke discussion and healthy adoption of enterprise-caliber machine learning approaches. If you’re a technology leader facing similar technical challenges, OR a founder building bold new solutions in these areas, we’d love to hear from you.
Authors
Nickhil Nabar is the Head of Data Science at BAM Elevate.
Esther Shinn is a Principal on the BAM Elevate investment team.
Special thanks to Evelyn Wang, Charlie Flanagan, Peter Anderson, Mano Janardhanan, Neil Cho, and Jason He for their contributions to this piece.
Disclosure
The information presented is furnished on a confidential basis at the specific request of recipient. You agree not to reproduce, redistribute or disclose this material, in whole or in part, to any other person or entity except to your tax advisors or as otherwise required by law. The blog post reflects information, data and opinions as of the date presented or stated herein. The opinions and commentary are subject to change without prior notice or any notice whatsoever. The blog post is not intended to be investment advice and should not be used as such. Any investment ideas and examples discussed may or may not have been profitable and are a general representation of some of the information utilized in making investment decisions. There is no guarantee that any of the Funds managed by BAM will continue to hold any of the ideas currently in the portfolio. Likewise, there is no guarantee that these ideas will be profitable. In fact, private company investments have a high degree of risk, including the risk of significant loss. BAM makes no warranty or representation that the content of the blog post is accurate, complete or without error. The links provided are for articles not prepared by BAM and we are not responsible for the content.
An investment in private funds is speculative and involves a high degree of risk. These risks, and other important risks, are described in detail in a confidential private placement memorandum (“PPM”) or the limited partnership agreement (“LPA”). Recipient understands and acknowledges that as of the date of this presentation, neither the PPM nor the LPA is available. Prospective investors are strongly urged to review the PPM and LPA, and consult with their professional advisors, prior to investing in order to gain a better understanding of these risks. This is not an offer or solicitation with respect to the purchase of any security interest.
BAM is registered as an investment adviser with the US SEC and also registered as a commodity pool operator with the US CFTC. BAM’s registration status does not indicate BAM has received any sort of special status from either regulator. Further, neither regulator opines on the merits of BAM or the funds managed by BAM. Pursuant to an exemption from the Commodity Futures Trading Commission in connection with pools whose participants are limited to qualified eligible persons, an offering memorandum for this pool is not required to be, and has not been, filed with the Commission. The Commodity Futures Trading Commission does not pass upon the merits of participating in a pool or upon the adequacy or accuracy of an offering memorandum. Consequently, the Commodity Futures Trading Commission has not reviewed or approved this offering or any offering memorandum for this pool.
Footnotes
1. According to a16z.
2. A better answer might be that the architecture enables parallelizable training in a way that previous SOTA architectures (RNNs) fundamentally do not. The importance of this cannot be overstated. Still, we’re choosing to focus our exposition on the differences amongst transformer architectures since it will help readers understand the different use cases for finetuning.
3. I should acknowledge that encoder-decoder architectures aren’t transformer-specific (we’ve seen them in RNNs and CNNs). The transformer’s novelty lies in the combination of self-attention, multi-headed attention, and positional encoding, not its composition of encoder and/or decoder blocks.
4. Our friend Alex Izydorczyk of Cybersyn explores this topic here.
5. Floating Point Operations Per Second (i.e. a unit of compute).
6. Naveen Rao drops this figure in a great conversation with the a16z team here.
7. A RAG system, an open-sourced model, a model parked behind an API, a function calling system…