Beyond Prompting: Unlocking the Power of Caching in LLM Workflows
Designing a semantic caching layer for optimized model responses
In our previous posts—here and here, we explored how LLMs and Retrieval-Augmented Generation (RAG) pipelines help transform large volumes of raw data into actionable insights. These approaches allow teams to extract knowledge from unstructured sources, automate triage workflows, and predict operational reliability with remarkable efficiency.
However, as we scale these LLM-driven systems, new challenges emerge around latency, consistency, and, most importantly, cost.
Latency: LLM calls can introduce noticeable delays, affecting user experience or system responsiveness.
Consistency: Even small rewordings of a prompt can lead to slightly different outputs, which is undesirable in domains like knowledge bases, support, or compliance. Repeated identical inputs can also produce varying outputs that are hard to control, even with tuning of seed, temperature, and similar parameters.
Cost: While throwing compute and network at the problem can seemingly help with latency, the costs add up. Input/output tokens in high-volume scenarios can also contribute significantly to cost spikes.
Caching as a technology has been around for quite some time, and in the context of LLMs can be an essential tool to address the challenges above. By intelligently storing and retrieving responses, we can:
Reduce redundant LLM calls, translating to real cost savings.
Serve results faster, improving latency and user experience.
Maintain consistency, ensuring that repeated or semantically similar queries return predictable results.
LLM caching is not a new concept and has in fact been used by LLM providers for some time now. For example, OpenAI offers prompt caching at the service level, which can speed up repeated requests. Yet such provider-side caching is typically limited to exact or near-exact matches and may not provide fine-grained control over freshness, user-level personalization, or semantic similarity. A custom semantic caching layer addresses these gaps, giving users precise control over TTL, refresh policies, stale handling, and multi-tenant considerations, practices that have been a staple of the CDN world for many years.
With this foundation, we will explore semantic caching for LLMs, its architecture, features, and best practices—bringing efficiency and reliability to the next generation of LLM-powered applications.
1. Traditional Caching vs LLM Caching
CDNs and web caches typically rely on exact-match keys: the same URL or API request returns a cached response. In addition, there is a well-established protocol for what can safely be cached by middleware proxies and client devices, as dictated by content owners in their responses to client requests. By virtue of this, the correctness and validity of cached assets is guaranteed to a great extent. While effective for static web assets, this approach struggles with LLMs:
User prompts often vary slightly in wording but expect the same response.
There is no clear indication of what content should or can be cached safely.
LLM caching must therefore account for the semantic similarity of prompts. Instead of looking for strict matches, the cache should match semantically similar queries, allowing results to be reused for prompts that fall within a similarity threshold. We will also need mechanisms for maintaining the freshness and validity of responses, while optionally allowing stale-but-useful content when configured to do so.
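As a rough illustration of what "within a semantic threshold" means, the sketch below compares two prompt embeddings by cosine similarity. The is_semantic_match helper and the 0.85 threshold are illustrative assumptions; in the actual cache this comparison is delegated to RediSearch's vector index rather than computed by hand.

import numpy as np

def is_semantic_match(a: np.ndarray, b: np.ndarray, threshold: float = 0.85) -> bool:
    # Cosine similarity of two prompt embeddings: 1.0 = identical direction, 0.0 = unrelated
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim >= threshold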
2. Redis and RediSearch for Semantic Caching
Redis, combined with RediSearch, provides a natural solution for several of the requirements of a semantic cache. While Redis is well known for fast in-memory lookups, RediSearch extends Redis with advanced search and indexing capabilities. A few other attributes make it desirable for our usage:
Vector indexing for semantic similarity: RediSearch supports HNSW (Hierarchical Navigable Small World) indices and various distance metrics (COSINE, L2 - Euclidean distance, IP - Inner Product), making it possible to find cached responses that are semantically close to a given prompt. Redis also supports the FLAT index for smaller datasets and when accuracy is preferred over search latency. See links in references section for additional details.
Flexible storage and TTLs: Redis stores embeddings, metadata, and responses efficiently. TTLs and eviction policies keep the cache healthy and memory-efficient. Application-level TTLs (as we will see below) can enhance what Redis offers out of the box.
Filtering and query richness: Beyond similarity, queries can filter by user group, creation time, or custom attributes, supporting multi-tenant caching and fine-grained freshness control, though for the purpose of this article, we will try to implement some of these at the application level as well.
The schema that we will use for building the rest of the caching library is shown below:
schema = ( f"FT.CREATE {self.index_name} ON HASH PREFIX 1 doc: " f"SCHEMA prompt TEXT response TEXT created_at NUMERIC ttl NUMERIC " f"embedding VECTOR HNSW 6 TYPE FLOAT32 DIM {self.vector_dim} DISTANCE_METRIC {self.distance_metric}" )
Briefly, the string above creates a RediSearch full-text + vector search index:
FT.CREATE {self.index_name} creates a new index with the name given by self.index_name.
ON HASH PREFIX 1 doc: tells Redis to build the index on HASH keys that start with the prefix doc: (e.g., doc:123 would be indexed, but user:123 would not).
SCHEMA defines the fields:
prompt TEXT → full-text search enabled on the prompt field.
response TEXT → full-text search enabled on the response field.
created_at NUMERIC → numeric field, useful for filtering/sorting by time.
ttl NUMERIC → numeric field to track expiration logic at the application level.
embedding VECTOR HNSW 6 TYPE FLOAT32 DIM {self.vector_dim} DISTANCE_METRIC {self.distance_metric} defines a vector field called embedding:
VECTOR HNSW 6 → use HNSW graph indexing with M=6 (connectivity parameter; controls graph density).
TYPE FLOAT32 → each dimension in the vector is stored as a 32-bit float.
DIM {self.vector_dim} → the dimension size of the embeddings (e.g., 1024 for the mxbai-embed-large model that we will be using for embeddings).
DISTANCE_METRIC {self.distance_metric} → the similarity metric, typically COSINE, L2, or IP.
So this schema enables semantic + text + numeric filtering in a single index. Shortly, we will see how a vector search can be executed on the index above.
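For completeness, the index itself can be created by sending this string to Redis. The fragment below is a minimal sketch using redis-py's execute_command, assuming the cache object holds a Redis client at self.r; the exact error handling in the repository may differ.

import redis

try:
    # redis-py splits the first argument on whitespace, so the FT.CREATE
    # string built above (schema) can be sent as a single command string.
    self.r.execute_command(schema)
except redis.exceptions.ResponseError as e:
    # Typically raised when the index already exists (e.g., on restart); safe to skip then.
    if "Index already exists" not in str(e):
        raise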
3. Useful features of the Caching layer
While Redis is the caching workhorse for our application, there are a few essential features that can (should?) be built around it, not only to add value but also to ensure correctness. In our reference implementation, we do this by building a layer around Redis (which we will rather unimaginatively call the Caching layer) that applications interact with. Also, while Redis is the only backing store in this implementation, it is very much possible to make Redis one of many plugins that the Caching layer talks to on the backend, while exposing a plugin-agnostic interface to the application.
(a) Multi-tiered cache lookup
While our goal is to build a semantic cache, it is also useful to handle exact repeats of a query, from the same or different users, as a first cache level that enables a very fast lookup.
Prompt Hash Lookup (Exact Match)
First, the system attempts a fast in-memory hash-based lookup using a stable document ID derived from the prompt (and optionally the user/group) as part of the key.

def _hash_lookup(self, doc_id: str) -> Optional[Dict[str, Any]]:
    doc = self.r.hgetall(doc_id)
    if doc:
        return {
            "response": doc[b'response'].decode(),
            "created_at": float(doc[b'created_at']),
            "ttl": float(doc[b'ttl']),
            "doc_id": doc_id
        }
    return None
Semantic Lookup (Approximate Match)
If the hash lookup fails or the prompt is slightly different, the cache performs a semantic search over the vector embeddings stored in RediSearch. Here’s how it works:
- Convert the prompt into an embedding vector using your embed_fn.
- Query the Redis vector index (HNSW) to find the top k closest embeddings.
- Compute a similarity score and return the response if it exceeds a threshold.

def _semantic_lookup(self, prompt: str, k: int = 1, threshold: float = 0.85) -> Optional[Dict[str, Any]]:
    embedding = self.embed_fn(prompt).tobytes()
    q = f"*=>[KNN {k} @embedding $vec AS score]"
    query = Query(q).dialect(2)
    params = {"vec": embedding}
    results = self.r.ft(self.index_name).search(query, query_params=params)
    if results.docs:
        top_doc = results.docs[0]
        sim_score = 1 - float(top_doc.score)
        if sim_score >= threshold:
            return {
                "response": top_doc.response,
                "created_at": float(top_doc.created_at),
                "ttl": float(top_doc.ttl) if hasattr(top_doc, "ttl") else self.ttl_seconds,
                "doc_id": top_doc.id
            }
    return None
This two-level approach ensures that exact matches are served immediately, while slightly different prompts still benefit from cached knowledge through semantic similarity. It significantly reduces redundant calls to the LLM and provides fuzzy caching for natural language inputs.
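Putting the two levels together, the lookup order can be sketched as follows. The _lookup wrapper shown here is a hypothetical name used for illustration, built on the _doc_id, _hash_lookup, and _semantic_lookup helpers discussed in this article.

from typing import Any, Dict, Optional

def _lookup(self, prompt: str, threshold: float = 0.85,
            user_group: Optional[str] = None) -> Optional[Dict[str, Any]]:
    doc_id = self._doc_id(prompt, user_group)
    # Level 1: exact match on the hashed prompt (a single Redis HGETALL)
    result = self._hash_lookup(doc_id)
    if result:
        return result
    # Level 2: approximate match via the HNSW vector index
    return self._semantic_lookup(prompt, k=1, threshold=threshold)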
(b) Soft Purging and TTL-Based Refresh
After retrieving an entry from the two-level cache, the application determines if it is stale based on the TTL provided at storage time:
The TTL is application-defined, meaning for each request we can:
Enable or disable TTL.
Set a custom TTL depending on the expected freshness requirements.
The TTL forms the basis of cache validity, allowing the stale-while-revalidate (SWR, see below) or refresh logic to decide whether a response can be served or needs updating.
age = time.time() - result["created_at"]
ttl = ttl_override if ttl_override is not None else result["ttl"]
if age < ttl:
    # cache is fresh
    return {"response": result["response"], "status": CacheStatus.HIT}
Redis also provides TTL enforcement at the key level. If desired, the cache entry can be automatically evicted (deleted) after a certain duration by calling expire() on the key; while this can be set to optimize for storage, the application-layer TTL helps with some of the other features below.

self.r.expire(doc_id, ttl_to_store)  # key will be deleted automatically
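For reference, the write path could look roughly like the sketch below, storing the application-level created_at/ttl fields alongside the embedding and optionally setting a Redis-level expiry. The method name _store mirrors the helper referenced later in this article, but the body here is an illustrative sketch rather than the repository's exact code, and the longer Redis-level expiry is an assumed design choice to leave room for serving stale data.

def _store(self, prompt: str, response: str,
           ttl_override: Optional[int] = None,
           doc_id: Optional[str] = None) -> None:
    doc_id = doc_id or self._doc_id(prompt)
    ttl_to_store = ttl_override if ttl_override is not None else self.ttl_seconds
    self.r.hset(doc_id, mapping={
        "prompt": prompt,
        "response": response,
        "created_at": time.time(),
        "ttl": ttl_to_store,
        "embedding": self.embed_fn(prompt).tobytes(),
    })
    # Optional hard expiry at the Redis level; keeping it longer than the
    # application TTL (assumed multiplier) lets SWR still serve stale entries.
    self.r.expire(doc_id, int(ttl_to_store) * 10)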
(c) Stale-While-Revalidate (SWR)
Stale-While-Revalidate (SWR) is a caching strategy that allows an application to serve expired (stale) data immediately while refreshing the cache asynchronously in the background. This is particularly useful for LLM applications, where calls are expensive and potentially slow.
How It Works in Our Semantic Cache
When a cached entry is found, the system checks its age against its TTL.
If the cache is still fresh, it is returned immediately.
If the cache has expired and the SWR mode is enabled:
Return the stale response immediately to the user.
Start an asynchronous refresh to fetch a fresh response from the LLM and update the cache in the background.
In our implementation this is done via the following snippet:
if age > ttl and allow_stale:
    # Serve stale response immediately
    if doc_id not in self._inflight:
        self._inflight[doc_id] = threading.Event()
        # Refresh cache asynchronously
        self._executor.submit(self._refresh_async, prompt, ttl_override, doc_id)
    return {"response": result["response"], "status": CacheStatus.STALE}
There are obviously some caveats and considerations that need to be taken into account for serving stale data:
With stale data, users may receive slightly outdated responses. While acceptable for many use cases, this may not be suitable for time-sensitive queries, so the caching library provides an application-level flag to enable it on a per-request basis.
TTL Tuning: The effectiveness of SWR depends on the TTL configuration. Too short a TTL may trigger frequent background refreshes, increasing load; too long a TTL may serve stale responses for extended periods. The TTL therefore needs to be set based on the application's tolerance for stale responses. It is of course possible to punch a hole through the cache and retrieve data from the LLM directly if the staleness has exceeded a certain threshold, as sketched below.
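A minimal sketch of such a punch-through, assuming a hypothetical max_stale_seconds knob that is not part of the current reference implementation:

# Hypothetical guard: beyond a maximum staleness, stop serving stale data and
# refresh synchronously instead (max_stale_seconds is an assumed, application-chosen value)
if age > ttl + max_stale_seconds:
    response = self._llm_call(prompt)                    # blocking call to the LLM
    self._store(prompt, response, ttl_override, doc_id)  # overwrite the stale entry
    return {"response": response, "status": CacheStatus.MISS}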
(d) Herd Locking (Preventing the thundering herd problem)
Herd locking is a technique used to prevent a cache stampede, which occurs when multiple requests simultaneously miss the cache and all trigger expensive backend calls (in this case, an LLM request). Without herd locking, high-frequency prompts could cause redundant LLM calls, wasting tokens, increasing latency, and spiking costs. Here are the gory implementation details:
When a cache miss or expired entry occurs, the system checks if a fetch is already in progress for that prompt (tracked in the _inflight dictionary).
If no fetch is in flight, the requesting thread initiates the LLM call and marks it as “inflight” by creating a threading.Event.
If a fetch/refresh is already inflight, subsequent threads wait for the ongoing LLM call to finish instead of calling the LLM themselves.
Once the first LLM call completes, all waiting threads use the newly cached response.
The following snippet shows how this is achieved:
def _fetch_with_herd_control(self, prompt: str, threshold: float, ttl_override: Optional[int], doc_id: str) -> str:
    with self._lock:
        if doc_id not in self._inflight:
            # First thread triggers the LLM call
            self._inflight[doc_id] = threading.Event()
            first = True
        else:
            # Other threads will wait
            first = False
    if first:
        response = self._llm_call(prompt)
        self._store(prompt, response, ttl_override, doc_id)
        # Notify waiting threads
        self._inflight[doc_id].set()
        with self._lock:
            del self._inflight[doc_id]
        return response
    else:
        # Wait for the first thread to complete, with timeout
        event = self._inflight[doc_id]
        if event.wait(timeout=self.herd_wait_seconds):
            refreshed = self._hash_lookup(doc_id)
            if refreshed:
                return refreshed["response"]
        # Timeout fallback: call LLM directly
        response = self._llm_call(prompt)
        self._store(prompt, response, ttl_override)
        return response
As always, there are some caveats and considerations to keep in mind:
Timeout Handling: Threads waiting for an inflight request may hit a timeout; in such cases, they fall back to calling the LLM directly.
Thread Safety: Proper locking is essential to avoid race conditions when multiple threads update _inflight.
Multiprocessing: The current implementation works with threads; extending it to multiple processes requires shared memory or Redis-based coordination, as sketched below.
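One way to extend this coordination across processes is to move the inflight marker into Redis itself. The sketch below uses SET with NX and EX to implement a cross-process lock; the lock key naming, token handling, and TTL are assumptions rather than part of the current implementation.

import redis

def acquire_refresh_lock(r: redis.Redis, doc_id: str, token: str, ttl: int = 30) -> bool:
    # SET key value NX EX ttl succeeds only for the first caller, giving a
    # cross-process "inflight" marker that expires automatically if the owner dies.
    return bool(r.set(f"lock:{doc_id}", token, nx=True, ex=ttl))

def release_refresh_lock(r: redis.Redis, doc_id: str, token: str) -> None:
    # Only the lock owner (identified by its unique token, e.g., a UUID) should
    # release it; a small Lua script would make this check-and-delete atomic.
    if r.get(f"lock:{doc_id}") == token.encode():
        r.delete(f"lock:{doc_id}")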
(e) User/group-specific caching
In multi-tenant applications or even in services where different users or groups may receive personalized responses, it’s important to segregate cache entries by user or group. This ensures that sensitive or user-specific data is not accidentally shared across tenants or even users.
In our implementation:
Each cache entry’s document ID can include an optional user_group prefix.
The SHA1 hash of the prompt combined with the user/group ensures unique cache keys per tenant, while still supporting semantic similarity within that group.
def _doc_id(self, prompt: str, user_group: Optional[str] = None) -> str:
    base = prompt if user_group is None else f"{user_group}:{prompt}"
    return "doc:" + hashlib.sha1(base.encode("utf-8")).hexdigest()
During lookup, the cache checks for entries within the user’s namespace first.
Semantic similarity searches can also be filtered by user/group metadata if needed, as sketched below.
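A hypothetical variant of _semantic_lookup that restricts the KNN search to a tenant is shown below. It assumes a user_group TAG field has been added to the FT.CREATE schema above, which the current schema does not include.

from redis.commands.search.query import Query

def _semantic_lookup_for_group(self, prompt: str, user_group: str,
                               k: int = 1, threshold: float = 0.85):
    embedding = self.embed_fn(prompt).tobytes()
    # Hybrid query: TAG pre-filter on user_group, then KNN over the filtered set
    q = f"(@user_group:{{{user_group}}})=>[KNN {k} @embedding $vec AS score]"
    query = Query(q).dialect(2)
    results = self.r.ft(self.index_name).search(query, query_params={"vec": embedding})
    if results.docs:
        top_doc = results.docs[0]
        if 1 - float(top_doc.score) >= threshold:
            return top_doc.response
    return None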
There are scenarios where a common caching strategy is sufficient, and also simpler and more efficient. For example, if the application generates identical responses for all users (e.g., FAQs or general-knowledge prompts), a shared cache without user-specific keys reduces memory usage and improves cache hit rates.
4. Hooking it all up together
For our application, we will reuse one of the RAG use cases that we addressed in an earlier sandbox: a RAG application that
a. transcribes YouTube videos,
b. chunks up the transcripts and creates embeddings,
c. stores the embeddings in a Weaviate database, and
d. uses an LLM to work on the context/prompt and generate responses.
While the application itself is not extremely interesting for the purpose of this article, the outputs most certainly are. We will work with a few variations of the prompts, as shown below, to see how our caching layer performs.
We instantiate the Semantic cache using:
cache = SemanticCache(embed_fn=ollama_embed, llm_fn=generate_content, ttl_seconds=30, herd_wait_seconds=5)
where embed_fn can be replaced by any embedding function of the application’s choice, and llm_fn can be any LLM of choice as well. As an example, we use the following embedding function and LLM:
# Ollama embeddings - for storage in cache
def ollama_embed(text: str) -> np.ndarray:
    emb = ollama.embeddings(model="mxbai-embed-large", prompt=text)["embedding"]
    return np.array(emb, dtype=np.float32)

def generate_content(prompt_template):
    output = ollama.generate(
        model="llama3.2",
        prompt=prompt_template,
    )
    return output['response']
and then trigger the prompts using:
prompt = "what are the advantages of SDWAN over MPLS"
context_data , metadata = retrieve_doc(collection, prompt)
logger.debug(f'Retrieved data {context_data}')
if context_data == None:
logger.warning('Could not get enough context from the video')
else:
#Generate content using retrieved data:
#prompt = "Give a executive summary of the advantages of SDWAN over MPLS"
#prompt = "Summarize the advantages of SDWAN over MPLS"
#prompt = "What are the advantages of SDWAN over MPLS"
#prompt ="Why should I use SDWAN over MPLS"
#prompt = "Summarize SDWAN in 3 sentences"
prompt = "What are the disadvantages of MPLS compared to SDWAN"
prompt_template ="Generate a response to the following prompt using the context data: " + context_data + \
f" Respond to this prompt: \"{prompt}\""
t = time.time()
response = cache.get_response(prompt_template, threshold=0.8, allow_stale=True, ttl_override=60)
delta = time.time() - t
logger.info(f'Got response in {delta} seconds')
5. Example in Action
The first prompt we try is:
prompt = "Give a executive summary of the advantages of SDWAN over MPLS"
This is a cache MISS, and we get the response in 16.61 seconds:
INFO:semcache:No results found for prompt with threshold 0.8
INFO:semcache:Cache MISS for prompt
INFO:ollama_weaviate_grounded:Got response in 16.61280393600464 seconds
INFO:ollama_weaviate_grounded:Response: {'response': "Here is an executive summary of the advantages of SD-WAN over MPLS:\n\n**Executive Summary**\n\nSD-WAN (Software-Defined Wide Area Networking) offers several advantages over traditional MPLS (Multiprotocol Label Switching) networks. By leveraging public internet connections and software-defined networking, SD-WAN provides a more agile, flexible, and cost-effective solution for wide area networking.\n\n**Key Advantages of SD-WAN over MPLS:**\n\n1. **Cost Savings**: SD-WAN eliminates the need for expensive MPLS connections, allowing organizations to reduce their network costs.\n2. **Flexibility and Scalability**: SD-WAN allows for easy addition or removal of sites as needed, without requiring costly changes to the underlying infrastructure.\n3. **Improved Agility**: SD-WAN enables rapid deployment of new applications and services, reducing the time it takes to respond to changing business needs.\n4. **Reduced Complexity**: By leveraging public internet connections, SD-WAN eliminates the need for complex MPLS circuit management, reducing operational complexity.\n5. **Increased Security**: SD-WAN provides robust security features, including encryption and firewalls, ensuring that sensitive data is protected both in transit and at rest.\n6. **Easier Management**: SD-WAN's centralized management platform enables easy monitoring, troubleshooting, and optimization of network performance.\n\nBy leveraging the benefits of SD-WAN, organizations can improve their agility, reduce costs, and enhance their overall networking experience.", 'status': <CacheStatus.MISS: 'MISS'>}
We retry the exact same prompt, and this time get a cache HIT response in 0.0012 seconds as a result of the prompt hash match:
INFO:semcache:Cache age: 41.52454876899719, ttl: 60, doc_id: doc:ed4db40c878d54e6750f2111ec8ea5c3d081aa64
INFO:ollama_weaviate_grounded:Got response in 0.0012137889862060547 seconds
INFO:ollama_weaviate_grounded:Response: {'response': "Here is an executive summary of the advantages of SD-WAN over MPLS:\n\n**Executive Summary**\n\nSD-WAN (Software-Defined Wide Area Networking) offers several advantages over traditional MPLS (Multiprotocol Label Switching) networks. By leveraging public internet connections and software-defined networking, SD-WAN provides a more agile, flexible, and cost-effective solution for wide area networking.\n\n**Key Advantages of SD-WAN over MPLS:**\n\n1. **Cost Savings**: SD-WAN eliminates the need for expensive MPLS connections, allowing organizations to reduce their network costs.\n2. **Flexibility and Scalability**: SD-WAN allows for easy addition or removal of sites as needed, without requiring costly changes to the underlying infrastructure.\n3. **Improved Agility**: SD-WAN enables rapid deployment of new applications and services, reducing the time it takes to respond to changing business needs.\n4. **Reduced Complexity**: By leveraging public internet connections, SD-WAN eliminates the need for complex MPLS circuit management, reducing operational complexity.\n5. **Increased Security**: SD-WAN provides robust security features, including encryption and firewalls, ensuring that sensitive data is protected both in transit and at rest.\n6. **Easier Management**: SD-WAN's centralized management platform enables easy monitoring, troubleshooting, and optimization of network performance.\n\nBy leveraging the benefits of SD-WAN, organizations can improve their agility, reduce costs, and enhance their overall networking experience.", 'status': <CacheStatus.HIT: 'HIT'>}
The next attempt is made after the cached entry (60-second TTL) has expired. Since SWR is enabled, we serve the stale asset in 0.004 seconds and then fetch a fresh response in the background:
INFO:semcache:SWR mode invoked for prompt, age: 79.61280488967896, ttl: 60, doc_id: doc:ed4db40c878d54e6750f2111ec8ea5c3d081aa64 , fetching asset in the background
INFO:ollama_weaviate_grounded:Got response in 0.004001140594482422 seconds
INFO:ollama_weaviate_grounded:Response: {'response': "Here is an executive summary of the advantages of SD-WAN over MPLS:\n\n**Executive Summary**\n\nSD-WAN (Software-Defined Wide Area Networking) offers several advantages over traditional MPLS (Multiprotocol Label Switching) networks. By leveraging public internet connections and software-defined networking, SD-WAN provides a more agile, flexible, and cost-effective solution for wide area networking.\n\n**Key Advantages of SD-WAN over MPLS:**\n\n1. **Cost Savings**: SD-WAN eliminates the need for expensive MPLS connections, allowing organizations to reduce their network costs.\n2. **Flexibility and Scalability**: SD-WAN allows for easy addition or removal of sites as needed, without requiring costly changes to the underlying infrastructure.\n3. **Improved Agility**: SD-WAN enables rapid deployment of new applications and services, reducing the time it takes to respond to changing business needs.\n4. **Reduced Complexity**: By leveraging public internet connections, SD-WAN eliminates the need for complex MPLS circuit management, reducing operational complexity.\n5. **Increased Security**: SD-WAN provides robust security features, including encryption and firewalls, ensuring that sensitive data is protected both in transit and at rest.\n6. **Easier Management**: SD-WAN's centralized management platform enables easy monitoring, troubleshooting, and optimization of network performance.\n\nBy leveraging the benefits of SD-WAN, organizations can improve their agility, reduce costs, and enhance their overall networking experience.", 'status': <CacheStatus.STALE: 'STALE'>}
The next prompt we try is a variation on the first one, but well within the semantic threshold (0.8) set in the request. This is a cache HIT, and the asset is retrieved in 0.15 seconds:
prompt ="Why should I use SDWAN over MPLS"
INFO:semcache:Cache age: 30.159759044647217, ttl: 60, doc_id: doc:ed4db40c878d54e6750f2111ec8ea5c3d081aa64
INFO:ollama_weaviate_grounded:Got response in 0.15671300888061523 seconds
INFO:ollama_weaviate_grounded:Response: {'response': "Here's an executive summary of the advantages of SD-WAN over MPLS:\n\n**SD-WAN Advantages Over MPLS**\n\nAt Arg, we've seen firsthand the benefits of Software-Defined Wide Area Networking (SD-WAN) in enhancing network performance and security. In contrast to MPLS (Multiprotocol Label Switching), which requires a single provider for each remote location, SD-WAN offers several key advantages:\n\n1. **Agility**: With SD-WAN, you can quickly deploy broadband circuits at new locations, reducing downtime and improving response times.\n2. **Flexibility**: SD-WAN allows you to choose from various internet service providers (ISPs), enabling you to select the most cost-effective option for each location.\n3. **Cost Savings**: By using lower-cost broadband connections, SD-WAN can reduce overall network costs compared to MPLS.\n4. **Security**: SD-WAN provides robust security features, including encryption and firewalls, ensuring that your data is protected as it traverses the internet.\n5. **Scalability**: SD-WAN can easily scale with your organization's growth, allowing you to add new locations and applications without significant network upgrades.\n6. **Single Control Plane**: With SD-WAN, all network services are managed through a single interface, simplifying network management and reducing the risk of configuration errors.\n\nBy leveraging these advantages, organizations can improve their network performance, reduce costs, and enhance security, making SD-WAN an attractive alternative to traditional MPLS solutions.", 'status': <CacheStatus.HIT: 'HIT'>}
In summary, we have:
Cache MISS: 16.61 seconds
Cache HIT: 0.0012 seconds (exact prompt match)
Cache STALE: 0.004 seconds
Cache HIT: 0.15 seconds (semantic prompt match)
6. Show me the code
https://github.com/badrinatarajan/semcache
7. Experiments, Conclusions and more
For this article, we first used a local install of redis-stack (which includes RediSearch) and then moved to a Docker-based image; both were pretty easy to set up and get started with.
Currently, the reference implementation uses a DIM of 1024, as that is the dimension produced by the mxbai-embed-large model we use for embeddings. It should be possible to determine this dynamically so that any other embedding function can work seamlessly with the code, as sketched below.
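A possible way to avoid hard-coding the dimension is to probe the embedding function once at startup and derive DIM from the returned vector. The infer_vector_dim helper below is hypothetical and not part of the current repository; its result could then be interpolated into the FT.CREATE schema's DIM field when the cache is constructed.

def infer_vector_dim(embed_fn) -> int:
    # Run the embedding function once on a probe string and read off the vector length
    return int(embed_fn("dimension probe").shape[0])

vector_dim = infer_vector_dim(ollama_embed)  # 1024 for mxbai-embed-large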
The RediSearch query needed a few tries before we landed on the correct one; setting the query dialect (the .dialect(2) call above) turned out to be critical, which is worth checking if you keep running into cache MISSes on semantic searches.
We used a local LLM running on a Mac Pro for our tests; cache MISSes varied between 11 and 16 seconds, while HITs were consistently in the sub-second range.
As mentioned in the article, this is a reference implementation; a few other improvements are planned to make it production-ready, including but not limited to:
a. multiprocessing support
b. multi-level storage (in-memory, backed by disk storage)
c. eviction and explicit purge handling