Four Ways Enterprises Deploy LLMs


With the rapid pace of LLM innovation, enterprises are actively exploring use cases and deploying their first generative AI applications into production. As deployments have begun in earnest, enterprises have converged on four LLM deployment methods, depending on their talent, tooling, and capital investment.

These deployment approaches will keep evolving as new optimizations and tooling launch regularly. The goal here is to walk through these approaches and examine the decisions behind design choices.

The Four Approaches

There are four different approaches enterprises take to jumpstart their LLM journey. These range from easy and cheap to difficult and expensive. Organizations should assess their AI maturity, model selection (open versus closed), available data, use cases, and investment resources when choosing the approach that works for their strategy.

1. Prompt Engineering with Context

Many enterprises begin their LLM journey here since it is the most cost-effective and time-efficient approach. This involves directly calling third-party AI providers like OpenAI, Anthropic, or Cohere with a prompt.

Given that these are generalized LLMs, they may not respond to questions unless framed in specific ways. Building effective prompts, called "prompt engineering," involves creative writing skills and multiple iterations to get the best response.

The prompt can include examples to guide the LLM. These examples, included before the prompt itself, are called "context." "One-shot" and "few-shot" prompting refers to introducing one or several examples in the context.
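The idea can be made concrete with a minimal sketch of few-shot prompt assembly. The helper name, the Q/A template, and the sentiment examples below are illustrative assumptions, not any particular provider's API:

```python
# Minimal sketch of few-shot prompting: labeled examples (the "context")
# are prepended to the user's question before the call to the provider.
# The template and example pairs here are hypothetical.

def build_few_shot_prompt(examples, question):
    """Join (question, answer) example pairs, then append the real question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("Classify the sentiment: 'The rollout went smoothly.'", "positive"),
    ("Classify the sentiment: 'Support never responded.'", "negative"),
]
prompt = build_few_shot_prompt(
    examples, "Classify the sentiment: 'Setup took five minutes.'"
)
# The assembled string is what gets sent to the provider's completion endpoint.
```

With two examples this is few-shot prompting; with one it would be one-shot.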

Since it is as easy as calling an API, this is the most common way for enterprises to start their LLM journey, and it may well be sufficient for organizations without deep AI expertise or resources. The approach works well for generalized natural-language use cases but can get expensive as traffic to third-party providers grows.

Best for: Organizations starting their LLM journey, generalized use cases, quick experimentation.

Trade-offs: Costs scale with usage, limited customization, dependent on provider capabilities.

2. Retrieval Augmented Generation (RAG)

Foundation models are trained with general domain corpora, making them less effective at generating domain-specific responses. Enterprises often want to deploy LLMs on their own data to unlock domain-specific use cases: customer chatbots on documentation, internal chatbots on IT instructions, or responses using non-public information.

However, there may be too little data (hundreds or a few thousand examples) to justify fine-tuning a model, let alone training a new one.

RAG augments prompts by using external data in the form of documents or document chunks, passed as context so the LLM can respond with that information. Before data gets passed as context, it needs to be retrieved from an internal store. Both the prompt and documents are converted into embeddings, and similarity scores determine what to retrieve. Vector databases and LLM metadata tooling support this approach.
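The retrieval step can be sketched end to end with a toy example. The bag-of-words "embedding" below is only a stand-in to show the control flow; real deployments use a learned embedding model and a vector database, and the documents and question are invented:

```python
# Toy RAG retrieval: "embed" the query and documents, score by cosine
# similarity, and pass the best match as context in the prompt.
import math
import re
from collections import Counter

def embed(text):
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents):
    """Return the document most similar to the query."""
    q = embed(query)
    return max(documents, key=lambda d: cosine(q, embed(d)))

docs = [
    "To reset your VPN password, open the IT portal and choose Reset.",
    "Expense reports are due on the last business day of the month.",
]
question = "How do I reset my VPN password?"
context = retrieve(question, docs)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In production the `retrieve` step is typically a top-k nearest-neighbor query against a vector database rather than a scan over all documents.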

In addition to saving the time and effort of fine-tuning, this technique reduces hallucinations because the relevant data is passed in the prompt rather than drawn from the LLM's internal knowledge. However, knowledge retrieval is not bulletproof: correctness depends heavily on the quality of the stored information and the retrieval techniques used.

Another consideration: sending proprietary data to a third-party provider increases privacy risk, since foundation models can memorize data passed to them. Stuffing documents into the prompt also lengthens it, raising cost and latency.
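The cost effect of longer prompts is easy to estimate back-of-envelope. The per-token price and traffic figures below are hypothetical, chosen only to show how the arithmetic scales:

```python
# Rough cost impact of passing retrieved documents in every prompt.
# The price per 1K input tokens is a made-up placeholder; check your
# provider's actual pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical USD

def monthly_prompt_cost(tokens_per_request, requests_per_month):
    return tokens_per_request / 1000 * PRICE_PER_1K_INPUT_TOKENS * requests_per_month

bare = monthly_prompt_cost(200, 100_000)        # question alone
with_docs = monthly_prompt_cost(3_200, 100_000) # plus ~3K tokens of retrieved docs
# with_docs is 16x bare at these assumed sizes.
```

Latency grows with prompt length for the same reason, since the model must process every context token before generating.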

Best for: Domain-specific use cases with moderate data volumes, applications requiring current information, privacy-sensitive deployments where data stays internal.

Trade-offs: Retrieval quality affects output quality, increased latency, privacy considerations with external providers.

3. Fine-Tuned Model

While prompt engineering and RAG work for some use cases, their shortcomings become apparent as data volume and use case criticality increase. Fine-tuning an LLM offers better ROI when you have larger datasets.

When you fine-tune, the LLM absorbs your dataset knowledge into the model itself, updating its weights. Once fine-tuned, you no longer need to send examples or additional information in the prompt context. This approach:

  • Lowers per-inference costs
  • Reduces privacy risks
  • Avoids token size constraints
  • Provides better latency
  • Delivers higher response quality with better generalization

Fine-tuning provides good value with larger instruction sets (typically tens of thousands of examples), but it can be resource-intensive and time-consuming, and you need to compile a dataset in the right format. Cloud services are making it easier to fine-tune LLMs without managing infrastructure.
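Getting the dataset into "the right format" often means one JSON record per line (JSONL). The prompt/completion schema and the example pairs below are illustrative; the exact field names vary by provider and framework, so check your platform's documentation:

```python
# Sketch of preparing a fine-tuning dataset in a common JSONL layout:
# one {"prompt": ..., "completion": ...} record per line. The schema
# and examples are illustrative, not a specific provider's format.
import json

raw_pairs = [
    ("Summarize: Q3 revenue rose 12% on strong cloud demand.",
     "Revenue grew 12% in Q3, driven by cloud."),
    ("Summarize: The outage lasted 40 minutes and affected EU users.",
     "A 40-minute outage impacted EU users."),
]

dataset = "\n".join(
    json.dumps({"prompt": p, "completion": c}) for p, c in raw_pairs
)

# Each line round-trips as valid JSON, which is what upload tooling checks.
records = [json.loads(line) for line in dataset.splitlines()]
```

At fine-tuning scale there would be tens of thousands of such records rather than two.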

Best for: Organizations with substantial domain-specific data, high-volume production use cases, applications where response quality is critical.

Trade-offs: Requires significant data preparation, ongoing maintenance as requirements change, initial investment in training.

4. Trained Model

If you have a domain-specific use case and a large amount of domain-centric data, training an LLM from scratch can provide the highest quality results. This approach is by far the most difficult and expensive to adopt.

The complexity is substantial. Financial models, for example, have been trained on forty years of financial language data for a total dataset of 700 billion tokens.

Enterprises need to be aware of costs related to training from scratch. Large amounts of compute add up quickly. Depending on training requirements, costs can range from hundreds of thousands to several million dollars. However, training costs are coming down rapidly with more efficient architectures and optimization techniques.

Best for: Organizations with unique domain requirements, massive proprietary datasets, and resources to invest in differentiated capabilities.

Trade-offs: Highest cost, longest time to deployment, requires specialized expertise, ongoing maintenance burden.

Choosing Your Approach

The right approach depends on several factors:

AI Maturity: Organizations new to AI should start with prompt engineering to learn before investing heavily.

Data Availability: RAG works with hundreds of documents. Fine-tuning needs tens of thousands of examples. Training from scratch requires billions of tokens.

Use Case Criticality: Higher-stakes applications justify greater investment in quality.

Resource Constraints: Time, budget, and expertise all factor into what is realistic.

Privacy Requirements: Some approaches expose more data to external providers than others.
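These factors can be folded into a rule-of-thumb helper. The thresholds below mirror the rough figures in this article (billions of tokens for training, tens of thousands of examples for fine-tuning) and are illustrative rather than prescriptive:

```python
# Hypothetical decision helper mapping the factors above to a starting
# approach. Thresholds echo the article's rough figures and should be
# treated as rules of thumb, not hard cutoffs.

def suggest_approach(labeled_examples: int, domain_tokens: int,
                     needs_private_data: bool) -> str:
    if domain_tokens >= 1_000_000_000:   # billions of domain tokens
        return "train from scratch"
    if labeled_examples >= 10_000:       # tens of thousands of examples
        return "fine-tune"
    if needs_private_data:               # internal docs, moderate volume
        return "RAG"
    return "prompt engineering"

# e.g. a few hundred internal documents and no large labeled set:
choice = suggest_approach(500, 0, needs_private_data=True)
```

In practice these factors interact (a high-stakes use case may justify fine-tuning even with less data), so the helper is a starting point for discussion, not a policy.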

The Common Thread

Regardless of which approach you choose, certain requirements remain constant:

Monitoring: All LLM deployments need observability to detect drift, hallucinations, and safety issues.

Testing: Pre-deployment evaluation should cover robustness, bias, and adversarial scenarios.

Governance: Clear policies for data handling, access control, and incident response.

Supervision: Human oversight mechanisms for high-stakes decisions.

As LLMOps infrastructure evolves with more advanced tools and methods, enterprises will adopt deployment options that yield higher quality at more economical cost with faster time to market. The fundamentals of testing, monitoring, and governance will remain the foundation that makes all of it work.
