Supercharging LLMs with Retrieval Augmented Generation
A Hands-on guide with Vertex AI and Weaviate
Generative AI models like ChatGPT are impressive, but they have one major limitation: their responses are limited to the data they were trained on. This holds true even for the multimodal models that can now understand and generate content across multiple data types: text, images, audio, and video. Retrieval-Augmented Generation (RAG) addresses this limitation by effectively blending the power of search with generation.
In this post, we’ll focus exclusively on the text-based use case, a domain where RAG is particularly powerful and genuinely production-ready. But first, a brief overview to level set.
1. RAG - When guessing isn’t good enough
Large Language Models (LLMs) like Gemini or GPT are trained on massive datasets. But once deployed, their knowledge is frozen at the time of training: they don’t know about new facts, can't access proprietary documents, and may hallucinate in the absence of information.
Retrieval-Augmented Generation (RAG) solves this by connecting an LLM to external data sources. The model can retrieve relevant context from a knowledge base and then generate accurate, grounded responses using that context. The use cases that RAG addresses include, but are not limited to:
Internal documentation search
Legal or medical question answering
Personalized chatbot assistants
Enterprise knowledge agents (viz., enterprise search)
More generally, RAG pipelines offer the most value in the following scenarios:
Information Retrieval:
When the task involves retrieving and synthesizing information from a large corpus of documents (e.g., answering questions using knowledge from a database or the internet).
Contextual Responses:
For generating responses that require accessing up-to-date or domain-specific knowledge, such as providing technical support, summarizing articles, or generating answers based on current events.
Enhancing Language Models:
For improving the performance of a language model by providing it with relevant context retrieved from a knowledge base (e.g., ChatGPT retrieving specific documents to answer queries more accurately).
Hybrid Approaches:
Combining retrieval with generation to handle tasks where the model needs to both generate new text and refer to specific pieces of information, like personalized email writing or detailed technical explanations.
While it might seem (and is in fact true) that RAG is a way to improve the performance of an LLM, it is worthwhile to distinguish between RAG and fine-tuning. RAG augments the model with relevant data retrieved at query time, whereas fine-tuning expands or updates the LLM's knowledge and weights through additional training. As a result, the cost and complexity of setting up a RAG pipeline are low compared with the more compute-intensive retraining infrastructure.
2. So what does setting up a RAG pipeline involve?
While some LLM providers (and DB providers) expose APIs to cobble together a RAG pipeline from scratch, a few expose a one-stop-shop “RAG engine” type of service, though users need to carefully consider the tradeoffs of using such a service versus integrating best-of-breed components to set up a pipeline that addresses their unique needs.
For the purposes of this article, we will build a pipeline using Google’s Vertex AI RAG Engine along with Weaviate’s vector DB. While Weaviate offers a cloud-hosted version with a trial period, it also offers a locally managed open-source version, albeit with limited (or next to no) UX/UI capabilities. The choice of these tools as building blocks is not as important to this post as some of the underlying principles. In subsequent posts we will experiment with other tools and their capabilities as well.
A typical RAG pipeline includes some or all of the following steps.
Chunk the data
Embed each chunk
Store embeddings in a vector database
Retrieve relevant chunks at query time
Generate a response using the retrieved context
Let's look at each of these steps:
Chunking the data - Preprocess before indexing
Before any data can be semantically searched or embedded, it needs to be split into chunks—smaller, self-contained units of meaning.
Why chunking matters:
Improves search precision: Matching a paragraph is better than retrieving an entire book.
Respects token limits: Embedding models and LLMs have max token limits.
Faster processing: Smaller text blocks are quicker to embed and retrieve.
Our pipeline uses a preprocessed chunks.json file, where each entry represents a chunk of a document. (See the links and references below for how the chunks are created.)
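As a rough illustration of what that preprocessing can look like (the actual chunking script is linked in the references; the file names, chunk size, and JSON fields below are placeholders, not the repo's exact format), a simple paragraph-based splitter could produce a chunks.json along these lines:

import json

def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    # Greedily pack paragraphs into chunks of at most max_chars characters.
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Placeholder input file; the referenced script chunks Wikipedia articles.
with open("article.txt", encoding="utf-8") as f:
    chunks = chunk_text(f.read())

with open("chunks.json", "w", encoding="utf-8") as f:
    json.dump([{"chunkId": i, "chunkData": c} for i, c in enumerate(chunks)], f, indent=2)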
Embedding and Indexing - Turn Text into Searchable Vectors
Once we have clean, coherent chunks of text, we need to convert them into vectors—high-dimensional numerical representations that capture the meaning of the text. This process is called embedding.
To elaborate a bit, embeddings are numeric arrays (like [0.21, -0.53, 0.87, ...]) that represent the semantic content of text. Two chunks with similar meaning will have vectors that are close together in this vector space, even if they don't share the same words. For example, "The president signed a bill" and "A new law was enacted by the leader" might be mapped to vectors that are very close in the embedding space, even though they have few words in common.
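To make "close together" concrete, here is a small illustrative sketch that computes cosine similarity (the closeness measure most vector databases use) between toy embeddings. The vectors are made up for illustration, not real model outputs:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; real models produce hundreds of dimensions.
bill_signed = np.array([0.21, -0.53, 0.87, 0.10])
law_enacted = np.array([0.19, -0.48, 0.91, 0.05])
weather_report = np.array([-0.70, 0.22, -0.10, 0.64])

print(cosine_similarity(bill_signed, law_enacted))    # high: similar meaning
print(cosine_similarity(bill_signed, weather_report)) # low: unrelated meaning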
In our example, we create embeddings using Google’s text-embedding-005 model on Vertex AI:

embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-005"
)
Each text chunk from our data file is passed through this model to generate its vector. These vectors can then be stored in a database for indexing and searching. (Note that for the purposes of indexing, the Vertex AI RAG Engine creates an index called a “corpus”.)
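For completeness, the rag snippets in this post assume the Vertex AI SDK has been initialized and the preview rag module imported, roughly as follows (the project ID and region are placeholders; see the repo's README for the exact setup):

import vertexai
from vertexai.preview import rag
from vertexai.preview.generative_models import GenerativeModel, Tool

# Placeholder project and region; replace with your own values.
vertexai.init(project="your-gcp-project-id", location="us-central1")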
Vector Database - Store the generated embeddings
Traditional keyword search engines (like Elasticsearch) look for exact or partial matches. But for open-ended queries, we need to find semantically similar text; this is where vector search excels, and that is where a DB like Weaviate comes in.
Weaviate indexes each chunk's embedding and allows us to perform nearest-neighbor searches during retrieval. These semantic matches are more accurate and flexible than keyword hits. (In fact, Weaviate supports vector similarity search, keyword search, or a mix of both, known as hybrid search, though for this article our focus is on similarity search. Also, although we have seemingly lumped vector and similarity search into the same bucket, there are slight nuances; for folks interested in knowing more, please refer to the links at the end of this post.)
So, we use a vector DB to:
Store each text chunk and its embedding
Query similar embeddings during retrieval
vector_db = rag.Weaviate(
    weaviate_http_endpoint=WEAVIATE_HTTP_ENDPOINT,
    collection_name=COLLECTION_NAME,
    api_key=SM_WEAVIATE_API_KEY_RESOURCE,
)
Then we create the actual corpus that links Vertex AI's embedding and retrieval engine with Weaviate:
rag_corpus = rag.create_corpus(
    display_name=RAG_CORPUS_DISPLAY_NAME,
    embedding_model_config=embedding_model_config,
    vector_db=vector_db,
)
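Once the corpus exists, documents can be imported into it so the RAG Engine can embed and index them in Weaviate. The sketch below follows the import pattern from the Vertex AI RAG quickstart rather than the repo's chunks.json flow; the Cloud Storage path and chunking parameters are placeholders:

# Illustrative import step (not the repo's exact flow): load source files into
# the corpus; the RAG Engine chunks, embeds, and indexes them via Weaviate.
import_response = rag.import_files(
    rag_corpus.name,
    paths=["gs://your-bucket/source-docs/"],  # placeholder GCS path
    chunk_size=512,      # assumed chunk size in tokens
    chunk_overlap=100,   # assumed overlap between consecutive chunks
)
print(import_response.imported_rag_files_count)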
Retrieval - Bring Context to the model
Once the data chunks are embedded and stored in a vector database, the next crucial step is to retrieve the most relevant data when a user asks a question.
This is the retrieval phase: it acts as the intelligent middle layer between the user's query and the generative model. It fetches relevant text, which in turn becomes context for the next step in the process. This ensures that the model responds with grounded, factually supported answers.
(Note that in our current example, we make use of the Tool class provided as part of Vertex AI, which uses the corpus created in the previous step. In the next step we create a model instance using this tool so the LLM can do its work. In a subsequent article, we will see how a user query can be used in the retrieval phase to get relevant context, which can then be used by an LLM in the generation phase.)

rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag_resource],
            similarity_top_k=10,
            vector_distance_threshold=0.8,
        )
    )
)
In the current example, we pass two parameters, similarity_top_k and vector_distance_threshold, which instruct the retriever to (a) fetch the top 10 most semantically similar chunks and (b) limit how far apart a chunk can be in the vector space. We will see a bit later how the results vary as the distance threshold is varied.
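For experimenting with these parameters, the retriever can also be queried directly, without the generation step, to inspect which chunks come back at a given threshold. A minimal sketch, following the retrieval pattern in the Vertex AI RAG quickstart (the query text is only an example):

# Retrieve chunks from the corpus directly to see what context the LLM would receive.
retrieval_response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
    text="When was the first electronic general-purpose computer announced?",
    similarity_top_k=10,
    vector_distance_threshold=0.8,  # vary this to see how the retrieved set changes
)
print(retrieval_response)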
Generation - Produce a human-like answer
And finally, the generation step is where the magic of large language models kicks in. The model takes the user’s question and the retrieved chunks, and synthesizes a coherent, informative response. So the model does not just “look up” an answer; rather, it writes one based on the evidence gathered in the previous step.
rag_model = GenerativeModel("gemini-2.0-flash", tools=[rag_retrieval_tool])
response = rag_model.generate_content("Summarize the history of computing in 3 points")
This produces a response like:
- The history of computing extends beyond modern computing technology to include methods for pen, paper, or chalk and slate. Early tools include the Sumerian abacus (c. 2700–2300 BC) and the Antikythera mechanism (c. 100 BC).
- The first programmable, fully automatic computing machine was the Z3, built in 1941. The ENIAC, the first electronic general-purpose computer, was announced in 1946. The MOSFET, invented in 1959, enabled the creation of high-density integrated circuit chips.
- Supercomputing began with the Control Data Corporation (CDC) 6600 in 1964. The 1980s and 1990s saw advancements in microprocessors and the rise of personal computers.
To summarize, this is the flow used to produce the desired output: chunk the data, embed each chunk, store the embeddings in the vector database, retrieve the relevant chunks at query time, and generate a grounded response with the LLM.
3. Experiments and more...
The setup was a bit laborious (especially since we opted to use another DB instead of the built-in RagManagedDb), but the setup instructions (linked in the references below) ought to be enough to get one started, and of course the link to the GitHub codebase should help folks that are interested in playing with this combination a bit more.
There is also a decent comparison of the pros and cons of using the default DB versus integrating with a third-party DB, linked in the references below.
With this combination of Vertex AI and Weaviate Cloud, we were able to generate coherent responses to our queries. Tweaking the vector_distance_threshold produced responses that were more verbose or more terse depending on how high or low the threshold was set, respectively.
In our tests, a lower vector distance threshold (say 0.1 or 0.2) usually resulted in an answer indicating that the provided context did not contain sufficient data. However, through multiple iterations, we found that the responses were not always consistent:
a. Running the same experiment multiple times with the same threshold would either produce a coherent response or result in an “insufficient data” response.
b. A lower threshold would occasionally provide a better response, while a higher threshold would result in an “insufficient data” response.
While Weaviate has a rich set of data types that can be used for its schema, the integration with Vertex mandates a specific set of properties (fileId, corpusId, chunkId, chunkDataType, chunkData, fileOriginalUri) for this integration. Adding more properties does not seem to affect the results in any significant way. However, using a different LLM (or even trying out hybrid searches using just Weaviate) might provide more interesting results that make better use of the underlying capabilities of the DB.
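For reference, here is a minimal sketch of creating a Weaviate collection with those mandated properties, assuming the Weaviate Python client v4 and a Weaviate Cloud instance. The collection name, environment variables, and property data types are assumptions; consult the repo's README and the Vertex documentation for the exact schema requirements:

import os
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.config import Property, DataType

# Placeholder environment variables for the cluster URL and API key.
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
)

# Properties used by the Vertex AI RAG Engine integration; TEXT data types are assumed.
client.collections.create(
    name="RagChunks",  # should match the COLLECTION_NAME used when configuring the corpus
    properties=[
        Property(name="fileId", data_type=DataType.TEXT),
        Property(name="corpusId", data_type=DataType.TEXT),
        Property(name="chunkId", data_type=DataType.TEXT),
        Property(name="chunkDataType", data_type=DataType.TEXT),
        Property(name="chunkData", data_type=DataType.TEXT),
        Property(name="fileOriginalUri", data_type=DataType.TEXT),
    ],
)
client.close()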
4. Conclusions and what’s next
This combination of the Vertex AI RAG Engine and Weaviate Cloud is a good starting point for experimenting with the capabilities and possibilities of RAG. There are a few caveats though:
Exposure of proprietary data to an LLM is always a gray area that prevents many potential customers from taking advantage of the full benefits of RAG. Indeed, a large number of enterprises build a firewall around what can be exposed to LLMs, thereby either restricting their employees to data from the public domain or, worse, letting the LLM hallucinate responses.
The proliferation of open-source models with performance matching or exceeding that of proprietary models opens up the possibility of using those models to build a theoretically better pipeline, while potentially addressing the privacy concerns that come with proprietary models. In fact, building such a pipeline will be the focus of our next article, along with analyzing its performance and sharing some of the lessons learnt.
5. Show me the code
https://github.com/badrinatarajan/ai-sandbox
Follow the README in the project to set up the environment and execute the code.
6. Links and References
Vertex AI RAG quickstart: https://cloud.google.com/vertex-ai/generative-ai/docs/samples/generativeaionvertexai-rag-quickstart
DB comparisons: https://cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/vector-db-choices
Vector search explained: https://weaviate.io/blog/vector-search-explained
Chunking code: https://github.com/badrinatarajan/vector-databases-in-practice-deep-dive-4513162/blob/main/04_04_1_chunk_wiki_articles.py