
AI Pricing: Will the popularity of RAGs change how we price AI?

Steven Forth is CEO of Ibbaka. See his Skill Profile on Ibbaka Talio.

As Generative AI applications enter commercial use we are seeing more and more approaches to customizing and operating applications built on Large Language Models. One of the challenges to the standard model, of which OpenAI's GPT series is the best-known example, is the time and investment it takes to train these models. One result is that the models quickly get out of date, and there is an ongoing training and updating cost. This is a severe limitation for many commercial uses.

Ibbaka recently built a generative AI application for evaluating pricing changes. We had originally planned to take an open-source LLM, train it on proprietary data, and then tune the hyperparameters and construct a series of prompts. This turned out to be a bad design for this purpose. It would take too much work and computational effort to keep the model updated, and there were concerns about data security. What is the solution?

Ibbaka is not the only company with this problem, and fortunately, there is a standard solution: Retrieval Augmented Generation, or RAG. It was introduced in the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Patrick Lewis et al. Nvidia has an excellent explanation.

Nvidia: What Is Retrieval-Augmented Generation, aka RAG?
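
Before going further, a minimal sketch may help make the pattern concrete: retrieve relevant documents, prepend them to the prompt, then generate. This is illustrative only; the toy keyword-overlap retriever and the generate() stub below stand in for a real vector store and LLM call, and every name in it is hypothetical.

```python
# Minimal RAG sketch (illustrative): retrieve relevant documents,
# inject them into the prompt, then generate. A real system would use
# embeddings and a vector store; here a toy keyword-overlap scorer
# stands in for retrieval, and generate() is a stub for an LLM call.

KNOWLEDGE_BASE = [
    "Value-based pricing sets price from differentiated value to the customer.",
    "Cost-plus pricing adds a fixed margin to the cost of goods sold.",
    "Usage-based pricing scales price with a consumption metric such as queries.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    """Stub for an LLM call (hosted or open source)."""
    return f"[model answer based on a prompt of {len(prompt)} characters]"

def rag_answer(query: str) -> str:
    context = "\n".join(retrieve(query, KNOWLEDGE_BASE))
    # The retrieved context is injected at query time, which is what
    # keeps RAG answers current without retraining the model.
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

print(rag_answer("How should usage-based pricing scale?"))
```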

RAGs are one of a set of generative AI design and implementation patterns that are coming into focus. Each of these patterns has implications for pricing. Debmalya Biswas has begun to document these in a series of articles that you can find in his post, Generative AI architectural patterns. The patterns he covers are:

  • Black Box LLM APIs (Is this an anti-pattern?)

  • Enterprise Apps in LLM App Store (Not sure this is an architectural pattern, more of a commercial pattern.)

  • Fine Tuning Domain Specific SLMs (This is the pattern Ibbaka started with.)

  • Retrieval Augmented Generation or RAG (What this post is about.)

  • Multi Agent LLM Orchestration (This is a pattern I am eager to explore.)

Many business applications are going to use RAG architectures. What does this mean for pricing?

Cost and Pricing of Retrieval Augmented Generation

There are several properties of RAGs that determine their cost and will likely impact how solutions built on RAGs are priced. RAG-based solutions generally have lower training costs (of course, they are generally leveraging an existing model, often an open-source one such as Meta's Llama or Mistral 7B) but higher operating costs.

The reason they have higher operating costs is that the inputs are much larger than a typical prompt, and more processing is needed to generate the output.
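
A back-of-envelope calculation shows why. With per-token input pricing, padding each query with retrieved context multiplies the variable cost per query. The token counts and per-token rates below are illustrative assumptions, not actual vendor prices.

```python
# Back-of-envelope cost comparison (all numbers are illustrative
# assumptions, not real vendor prices).

INPUT_PRICE_PER_1K_TOKENS = 0.01   # assumed $ per 1,000 input tokens
OUTPUT_PRICE_PER_1K_TOKENS = 0.03  # assumed $ per 1,000 output tokens

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return ((input_tokens / 1000) * INPUT_PRICE_PER_1K_TOKENS
            + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K_TOKENS)

plain_prompt = query_cost(input_tokens=300, output_tokens=500)
# A RAG query carries the same question plus several retrieved passages.
rag_prompt = query_cost(input_tokens=300 + 4000, output_tokens=500)

print(f"plain prompt: ${plain_prompt:.4f} per query")  # $0.0180
print(f"RAG prompt:   ${rag_prompt:.4f} per query")    # $0.0580
```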

RAG architectures need a larger context window than other design patterns. The context limits enforced by OpenAI and other companies using the Black Box LLM API pattern are meant to control costs, but they make it hard to build a RAG on these systems.

Costs are an important consideration in pricing generative AI applications (see AI Pricing: Operating Costs will play a big role in pricing AI functionality). They need to be part of the overall pricing system, and variable costs will be higher with RAGs than with, say, the Fine Tuning Domain Specific SLMs pattern. But costs should not define pricing. Pricing should be defined by differentiated value.

So what differentiated value do RAGs enable?

  1. They are current - fresh data is retrieved at query time rather than baked in at training time

  2. They are relevant - the retrieved context is tailored to a specific situation

  3. They are personalized - one can use one’s own situation as context

So applications based on the RAG architecture need to have value drivers that take advantage of currency, relevance, and personalization.

Note that the Fine Tuning Domain Specific SLMs pattern also does a good job, maybe even a better job, on relevance. I am using this pattern in some personal work, as I want to train a very specific model and am willing to invest in retraining it over time.
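
To see currency and personalization in one place, here is a hedged sketch of a per-user context store: each user's facts are appended as they arrive (currency) and only that user's entries are retrieved (personalization). All names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Illustrative per-user context store: new facts can be appended at any
# time (currency) and retrieval is scoped to one user (personalization).

class UserContextStore:
    def __init__(self) -> None:
        self._entries: dict[str, list[tuple[datetime, str]]] = defaultdict(list)

    def add(self, user_id: str, fact: str) -> None:
        # No retraining: an added fact immediately affects the next query.
        self._entries[user_id].append((datetime.now(timezone.utc), fact))

    def recent_context(self, user_id: str, k: int = 3) -> str:
        latest = sorted(self._entries[user_id], reverse=True)[:k]
        return "\n".join(fact for _, fact in latest)

store = UserContextStore()
store.add("acme", "Acme sells B2B SaaS with three pricing tiers.")
store.add("acme", "Acme raised its mid-tier price 8% last quarter.")
print(store.recent_context("acme"))
```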

When currency is a critical value driver, some form of transactional pricing generally makes sense. So RAG applications are likely to have at least one pricing factor based on the number of queries.

Let’s assume that more context gives better answers. For well-designed RAGs this will often, but not always, be the case. So one way to align price with value will be to scale the price to the size of the context.

This is starting to look like the pricing of the Black Box LLMs, which generally charge for input and output tokens. There is a difference here, though. One can give more structure to the contexts used as inputs in RAGs (part of the art here is structuring these very large prompts). The type of context used can be factored into the pricing metric. Understanding how prompt structure (and size) generates value will be the key to pricing RAGs. Model size and properties will be relatively less important.

RAGs can be very personal, with each person maintaining a history of their own context. Many RAG applications will also have a per-user price.
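
Pulling these threads together, here is a hedged sketch of what a RAG pricing metric might look like: a per-user fee, a per-query fee, and a context charge that scales with size and is weighted by context type. Every rate, weight, and context-type name below is a made-up assumption for illustration, not a recommendation of specific price points.

```python
# Illustrative RAG pricing metric (all rates and weights are assumptions).
# Price = per-user fee + per-query fee + a context charge weighted by the
# structure/type of the context, reflecting value rather than raw cost.

PER_USER_FEE = 20.00        # assumed $ per user per month
PER_QUERY_FEE = 0.02        # assumed $ per query
CONTEXT_FEE_PER_1K = 0.005  # assumed $ per 1,000 context tokens

# More structured, higher-value context types carry a higher weight.
CONTEXT_WEIGHTS = {"raw_text": 1.0, "structured_records": 1.5, "user_history": 2.0}

def monthly_price(users: int, queries: int, context_tokens_by_type: dict[str, int]) -> float:
    context_charge = sum(
        (tokens / 1000) * CONTEXT_FEE_PER_1K * CONTEXT_WEIGHTS[ctype]
        for ctype, tokens in context_tokens_by_type.items()
    )
    return users * PER_USER_FEE + queries * PER_QUERY_FEE + context_charge

# Example: 10 users, 5,000 queries, and a mix of context types.
usage = {"raw_text": 2_000_000, "structured_records": 500_000, "user_history": 250_000}
print(f"${monthly_price(10, 5_000, usage):,.2f} per month")  # $316.25
```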

RAG pricing guidelines

  • Price the input based on the structure and size of the context - connect the input to value, not just the output

  • Place less weight on pricing the model itself

  • Look for uses that prioritize currency and relevance

  • Include a ‘per user’ metric when users want personalized outputs that improve over time

RAGs are just one of the generative AI design patterns. Pricing and design patterns work together, and over the next few years the emergence and adoption of these patterns will drive pricing innovation.

Read other posts on pricing design