Caching in LLM-Based Applications

What is Caching?

Caching is a technique used to store frequently accessed data in a temporary storage area, enabling faster retrieval and reducing the need for repetitive processing.

Caching can significantly enhance the performance and cost-efficiency of your applications. In the context of LLM (Large Language Model)-based applications, you can implement two primary types of caching: standard caching and semantic caching.

Standard Caching

Standard caching involves saving prompts and their responses in a database or in-memory store. This approach is straightforward: when a user asks a question like “What is the capital of India?”, the model responds with “The capital of India is New Delhi.” Since this information is unlikely to change, it is an ideal candidate for caching.

However, a limitation of standard caching is that it matches prompts exactly, so rephrasings of the same question are treated as separate requests. For instance, “What is the capital of India?” and “Can you name the capital of India?” would each be sent to the model, even though they have the same answer. This is where semantic caching becomes useful.
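To make the exact-match behavior concrete, here is a minimal, illustrative sketch of a standard cache in plain Python. The dictionary, the cached_ask helper, and the llm_call parameter are hypothetical names for illustration, not part of LangChain:

import hashlib

# Hypothetical in-memory exact-match cache: prompt hash -> response
_cache = {}

def cached_ask(prompt, llm_call):
    # Key on a hash of the exact prompt string
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]          # cache hit: no model call
    response = llm_call(prompt)     # cache miss: call the model
    _cache[key] = response
    return response

# "What is the capital of India?" and "Can you name the capital of India?"
# hash to different keys, so both trigger a model call -- exactly the
# limitation that semantic caching addresses.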

Below is a sample implementation of standard caching in LangChain, using its in-memory cache.

import langchain
from langchain_openai import ChatOpenAI
from google.colab import userdata
from langchain.cache import InMemoryCache

# Create the chat model and point LangChain at an in-memory prompt/response cache
chat = ChatOpenAI(model="gpt-4o", api_key=userdata.get('OPENAI_API_KEY'))
langchain.llm_cache = InMemoryCache()

%%time
chat.invoke("What is capital of India?")
CPU times: user 32.8 ms, sys: 1.51 ms, total: 34.3 ms
Wall time: 931 ms
AIMessage(content='The capital of India is New Delhi.', response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 13, 'total_tokens': 21}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_3e7d703517', 'finish_reason': 'stop', 'logprobs': None}, id='run-efc09a3e-659a-4dd7-b45d-aeecd756dfe8-0', usage_metadata={'input_tokens': 13, 'output_tokens': 8, 'total_tokens': 21})

%%time
chat.invoke("What is capital of India?")
CPU times: user 2.31 ms, sys: 27 µs, total: 2.34 ms
Wall time: 2.37 ms
AIMessage(content='The capital of India is New Delhi.', response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 13, 'total_tokens': 21}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_3e7d703517', 'finish_reason': 'stop', 'logprobs': None}, id='run-efc09a3e-659a-4dd7-b45d-aeecd756dfe8-0', usage_metadata={'input_tokens': 13, 'output_tokens': 8, 'total_tokens': 21})

%%time
chat.invoke("Can you name the capital of India?")
CPU times: user 23.6 ms, sys: 993 µs, total: 24.6 ms
Wall time: 841 ms
AIMessage(content='Yes, the capital of India is New Delhi.', response_metadata={'token_usage': {'completion_tokens': 10, 'prompt_tokens': 15, 'total_tokens': 25}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_3e7d703517', 'finish_reason': 'stop', 'logprobs': None}, id='run-c800ff2e-bdad-40da-ba16-3f0c9603b894-0', usage_metadata={'input_tokens': 15, 'output_tokens': 10, 'total_tokens': 25})

In the above example, the first query is processed by the model and the response is cached. The second query retrieves the cached response, significantly reducing processing time. The third query, despite having the same answer, is processed separately due to the difference in the prompt, highlighting a limitation of standard caching. Let us see how semantic caching can help here.
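Note that InMemoryCache only lives for the lifetime of the process. If you want cached responses to survive restarts, LangChain also provides a SQLite-backed standard cache; a minimal sketch follows, where the database path is just an example choice:

from langchain.globals import set_llm_cache
from langchain_community.cache import SQLiteCache

# Persist exact-match prompt/response pairs to a local SQLite file
# (".langchain.db" is an arbitrary example path)
set_llm_cache(SQLiteCache(database_path=".langchain.db"))

# Subsequent chat.invoke(...) calls with identical prompts are served
# from the SQLite cache, even across process restarts.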

Semantic Caching

Semantic caching enhances standard caching by performing a similarity search between new and cached prompts. If the similarity score is high, the cached response is returned, reducing redundant processing. This approach ensures that even if the prompts are phrased differently but convey the same meaning, the cached response can still be utilized effectively.
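Conceptually, a semantic cache stores an embedding of each cached prompt and compares new prompts against those embeddings. The following sketch illustrates that lookup logic, assuming some embed() function that returns a vector; the threshold, names, and structure are placeholders for illustration, not GPTCache's actual internals:

import numpy as np

SIMILARITY_THRESHOLD = 0.9   # assumed cut-off; tune for your use case
_semantic_cache = []         # list of (embedding, response) pairs

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(prompt, embed, llm_call):
    query_vec = embed(prompt)
    # Return the cached response of the most similar prompt, if close enough
    for cached_vec, cached_response in _semantic_cache:
        if cosine_similarity(query_vec, cached_vec) >= SIMILARITY_THRESHOLD:
            return cached_response
    # Otherwise call the model and cache the result with its embedding
    response = llm_call(prompt)
    _semantic_cache.append((query_vec, response))
    return response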

Below is an implementation of semantic caching in LangChain using GPTCache.

import hashlib
from langchain.globals import set_llm_cache

from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain_community.cache import GPTCache

from langchain_openai import OpenAI


def get_hashed_name(name):
    # Hash the model name so each LLM gets its own cache directory
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str):
    # Initialize a similarity-based (semantic) cache for this model
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")


set_llm_cache(GPTCache(init_gptcache))
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", n=2, best_of=2, api_key=userdata.get('OPENAI_API_KEY'))

%%time
llm.invoke("Tell me a joke")
CPU times: user 5.63 s, sys: 1.54 s, total: 7.17 s
Wall time: 35.1 s
'\nWhy did the tomato turn red?\n\nBecause it saw the salad dressing!'

%%time
llm.invoke("Tell me joke")
CPU times: user 1.43 s, sys: 0 ns, total: 1.43 s
Wall time: 3.52 s
'\nWhy did the tomato turn red?\n\nBecause it saw the salad dressing!'

In the above example, the first invocation (“Tell me a joke”) is processed by the model and the response is cached. The second invocation (“Tell me joke”) is phrased slightly differently, but because it is semantically similar, the cached response is returned, reducing processing time and cost.

By understanding and implementing these caching strategies, you can make your LLM-based applications more efficient, responsive, and cost-effective. Whether you choose standard caching for its simplicity or semantic caching for its advanced capabilities, both methods provide substantial benefits to the performance of your applications.

When implementing caching, consider factors such as the volatility of the information being cached, the desired trade-off between speed and accuracy, and the storage requirements for your application.
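For example, if cached answers can go stale (stock prices, weather, news), you may want entries to expire. Below is a minimal, illustrative TTL wrapper around an exact-match cache; the names and the one-hour TTL are arbitrary choices for the sketch, not a library API:

import time

TTL_SECONDS = 3600          # assumed freshness window: 1 hour
_ttl_cache = {}             # prompt -> (timestamp, response)

def cached_with_ttl(prompt, llm_call):
    entry = _ttl_cache.get(prompt)
    if entry is not None:
        cached_at, response = entry
        if time.time() - cached_at < TTL_SECONDS:
            return response          # still fresh: serve from cache
        del _ttl_cache[prompt]       # stale: evict and re-ask the model
    response = llm_call(prompt)
    _ttl_cache[prompt] = (time.time(), response)
    return response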

You can check out the full code implementation here.