From Videos to Insight: RAG with Open LLMs
Building a Private RAG pipeline for YouTube content - using Ollama and Deepseek
In our last post, we introduced RAG - Retrieval Augmented Generation - as a way to enhance the capability of an LLM by connecting it to a knowledge base, so the model can retrieve relevant context and then generate more accurate, grounded responses using that context. While the possibilities of such a system for solving the knowledge/information problem are exciting and seem endless, there are a few concerns holding back widespread adoption of RAG. A major one is data privacy (or the lack thereof) when exposing proprietary data to LLMs. In addition, there is the restriction of working with closed APIs, and the limited control different LLMs expose for tweaking parameters to customize results. Finally, there is an associated cost factor in working with some of the well-known and popular LLMs.
While the technology is still evolving, in this post we will explore alternatives that address some of these concerns.
Open-source LLMs are a great choice to consider when dealing with some of the issues of their proprietary counterparts. An open-source LLM is a large-scale generative model (usually trained on massive text corpora) whose weights, architecture, and often training data are made openly available. Anyone can download, run, fine-tune, or audit these models - often without sending any data to a third party.
Popular examples include:
Mistral — lightweight, fast, and versatile
LLaMA 2 / LLaMA 3 — from Meta, optimized for research and production
DeepSeek — powerful general-purpose model with high accuracy
Phi (technically an SLM), OpenHermes, Zephyr, and many others, optimized for tasks like running on small devices, coding, chat, or summarization
These models are often released under permissive licenses (Apache 2.0, MIT, etc.) that allow commercial and private usage. In addition, they provide:
Data Privacy: Your data is never expected to leave your machine.
Auditability: You can inspect how embeddings and answers are generated.
Cost Control: No API bills; just local (or cloud) compute.
Customizability: Swap models, tweak hyperparameters, and change behaviors easily.
Of course, typical concerns with OSS - performance, and to some extent support and maintenance - are still present. On the performance aspect specifically, some of the open-source LLMs are pretty close to their proprietary counterparts. While doing research for this post, we found that key benchmarks - Massive Multitask Language Understanding (MMLU), Grade School Math 8K (GSM8K), and HumanEval (for code generation and correctness) - place open-source LLMs within a stone's throw of the rest of the pack, though mileage varies depending on which provider is publishing the numbers and the method used to compile the results (one-shot vs. 5-shot, etc.), often tilting the scales in their favor!
While getting an open-source setup up and running can seem like a daunting task, the evolution of OSS LLMs has also produced orchestration platforms that make it ridiculously easy not only to run, but also to switch between LLMs, providing fertile ground for experimentation and subsequent production use.
Enter Ollama.
Ollama: Open-Source LLMs, Locally and Easily
Ollama is a developer-first platform that abstracts away the complexity of hardware configuration, memory tuning, and model serving, so users can focus on building with the models, not wrangling them. In addition to a single-command install and model-pull workflow that lets these models run on GPUs (with a CPU fallback), the platform also exposes a unified, consistent API that works across models. And yes, once the models are pulled, they can be run on air-gapped systems, so the users' data never leaves the machine.
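As a quick illustration of that workflow, here is a minimal sketch using Ollama's Python client (this assumes Ollama itself is already installed and its server is running locally; the model names are the ones we use later in this post, and the test prompt is just illustrative):

import ollama

# Pull models locally, equivalent to "ollama pull <model>" on the CLI
ollama.pull("deepseek-r1")         # generation model
ollama.pull("mxbai-embed-large")   # embedding model

# The same generate() call works no matter which pulled model sits behind it
print(ollama.generate(model="deepseek-r1", prompt="Say hello in one sentence.")["response"])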
So, with that introduction, let's get to building our video-based RAG system using Ollama, backed by our trusted vector DB (Weaviate) from the last post. Also, for this post we will use an open-source, locally hosted version of Weaviate, making this a completely local system built from open-source components.
Video-based RAG - our current use case
Most RAG systems work with static documents like PDFs or websites. However, there is a world of valuable content in YouTube videos, with video content often being more engaging than a static document; the problem is in elegantly and efficiently searching for specific content within a video and generating insights. Note: the pipeline we intend to build currently processes only the linguistic modality from video transcripts and does not yet perform true multimodal fusion of visual and audio signals - more on that in a later experiment!
For this pipeline, we will use the following stack:
YouTube transcript loaders - we will borrow these from another framework, LangChain, for its rich document-loader libraries, but we will not use LangChain itself to build the RAG
A local Weaviate vector database
Open-source models served via Ollama - mxbai-embed-large for embeddings and DeepSeek for generation
and create a system that can answer deep questions based on video content, like:
“How does SD-WAN compare to MPLS in modern networks?”
This is powerful not just for learning but also for research, enterprise training, and content summarization. The choice of DeepSeek as the LLM was arbitrary; in a few basic experiments it seemed to perform well compared to some other open-source LLMs. As a general theme, though, the purpose of this post is not to instrument and compare the different LLMs, but rather to explore the capabilities and learn as we build a RAG system.
Let’s get on without much further ado.
Step 1: Load and Chunk the YouTube Transcript
We use LangChain's YoutubeLoader to pass in a YouTube link, fetch the transcript, and split it into ~1-minute chunks. All of this is enabled by the following code:
from langchain_community.document_loaders import YoutubeLoader
from langchain_community.document_loaders.youtube import TranscriptFormat

# Fetch the YouTube transcript and split it into ~1-minute chunks
loader = YoutubeLoader.from_youtube_url(
    link,
    add_video_info=False,
    transcript_format=TranscriptFormat.CHUNKS,
    chunk_size_seconds=60,
)
l = loader.load_and_split()
Fetching the transcript itself is quick and completes within a few seconds. The available transcript formats are:
(a) text - the transcript is returned as plain text, devoid of metadata such as timestamps;
(b) chunks - the transcript is returned as pre-chunked segments with metadata such as start_seconds, start_timestamp and source (which we will make use of later) - perfect for a RAG use case;
(c) lines - the transcript is split by line, with spoken sentences treated as separate documents.
A sample transcript chunk produces an artifact similar to the following listing, with four components: page_content, source metadata (the YouTube link for accessing the content chunk), start_seconds metadata (the chunk's position in the video) and start_timestamp metadata:
page_content: pretty traditional we're going to put a couple remotes and a data center that these remotes need to connect to now in the early days this might have been a private line or otherwise known as a point to point that could be um evpl which is uh ethernet virtual private line service could be dark fiber uh could just be in in the old days a
metadata: {'source': 'https://www.youtube.com/watch?v=dq-qA4vEpN0&t=30s', 'start_seconds': 60, 'start_timestamp': '00:01:00'}
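As a quick sanity check, the chunk text and metadata can be read straight off each returned LangChain Document; a minimal sketch, using the list l returned by the loader above:

for doc in l:
    # Each chunk is a Document: text in page_content, video references in metadata
    print(doc.metadata["start_timestamp"], doc.metadata["source"])
    print(doc.page_content[:80])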
Step 2: Generate Embeddings
After loading and chunking the transcript in Step 1, we move to the next critical stage in any Retrieval-Augmented Generation (RAG) pipeline: embedding. As we discussed in the previous post, this is the process of converting raw text into a dense numerical representation that captures its semantic meaning—so we can later search and retrieve relevant content using vector similarity.
This is done by the following snippet:
import ollama

response = ollama.embeddings(model="mxbai-embed-large", prompt=pc)
We use the 'mxbai-embed-large' model to embed each chunk of the transcript. The 'prompt' parameter name is a bit misleading; it simply refers to the chunk of text (page_content) being passed to the model to generate the embedding.
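Putting Steps 1 and 2 together, here is a minimal sketch that embeds every chunk and keeps each vector alongside its metadata for the next step (embedded_chunks is just an illustrative name; l is the chunk list from Step 1):

import ollama

embedded_chunks = []
for doc in l:
    # One embedding per ~1-minute transcript chunk
    response = ollama.embeddings(model="mxbai-embed-large", prompt=doc.page_content)
    embedded_chunks.append({
        "text": doc.page_content,
        "vector": response["embedding"],
        # Carry the video metadata forward so it can be stored in Weaviate in Step 3
        "source": doc.metadata["source"],
        "start_seconds": doc.metadata["start_seconds"],
        "start_timestamp": doc.metadata["start_timestamp"],
    })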
Step 3: Store Embeddings in Weaviate
Now that we have the embeddings generated, we need to store them in our vector database. The first step is the creation of a "collection" - basically a schema, as shown below.
collection = client.collections.create(
    name=COLLECTION_NAME,
    properties=[
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="source", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="start_seconds", data_type=wvc.config.DataType.INT),
        wvc.config.Property(name="start_timestamp", data_type=wvc.config.DataType.TEXT),
    ],
)
Note that we pass source, start_seconds and start_timestamp as additional fields when creating the schema; however, embeddings are generated only for the transcript text of each video chunk (page_content), and those vectors are what drive the similarity search. The other metadata fields are returned in the search results to ground the answer, i.e., to provide video references (and can alternatively be used for filtering, if desired).
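The collection-creation call above assumes a connected client, the wvc alias, and a COLLECTION_NAME constant. A minimal sketch of that setup, assuming a default local Weaviate install and an arbitrary collection name:

import weaviate
import weaviate.classes as wvc

# Arbitrary name for this example; any valid collection name works
COLLECTION_NAME = "VideoTranscripts"

# Connect to the locally hosted Weaviate instance on its default ports
client = weaviate.connect_to_local()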
And finally, we store each chunk according to the schema above using the following snippet:
batch.add_object(
    properties={
        "text": pc,
        "source": source,
        "start_seconds": start_seconds,
        "start_timestamp": start_timestamp,
    },
    vector=response["embedding"],
)
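For completeness, here is a minimal sketch of the surrounding loop, assuming the embedded_chunks list from the Step 2 sketch and the collection created above; Weaviate's dynamic batching context manager takes care of flushing the objects to the database:

with collection.batch.dynamic() as batch:
    for chunk in embedded_chunks:
        batch.add_object(
            properties={
                "text": chunk["text"],
                "source": chunk["source"],
                "start_seconds": chunk["start_seconds"],
                "start_timestamp": chunk["start_timestamp"],
            },
            # Attach the precomputed mxbai-embed-large vector to the object
            vector=chunk["vector"],
        )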
Step 4: Retrieve Context for a Prompt
At this point, our transcript has been embedded and stored in Weaviate. The next goal is to search for the most relevant transcript chunks given a user query—like:
“What is SDWAN?”
We achieve this by embedding the user query into the same vector space as the transcript chunks:
response = ollama.embeddings(
    model="mxbai-embed-large",
    prompt=prompt,  # the user's question
)
Weaviate then compares the query vector to all stored vectors in the collection using vector similarity:
results = collection.query.near_vector(near_vector=response["embedding"], limit=10, distance=0.7)
This returns the top N transcript chunks that are most semantically similar to the question. Each match includes:
text: the actual transcript chunk
start_timestamp, start_seconds: let you link back to the exact point in the video
source: the video URL
distance: (optional) how close the match is
The top chunks are concatenated and become the grounding context data for the next step.
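A minimal sketch of that concatenation (context_data and sources are illustrative names; the sources tuples feed the attribution shown at the end of this post):

# Join the retrieved chunk texts into a single grounding context string
context_data = " ".join(o.properties["text"] for o in results.objects)

# Keep (source, timestamp, seconds) tuples so the answer can point back to the video
sources = [
    (o.properties["source"], o.properties["start_timestamp"], o.properties["start_seconds"])
    for o in results.objects
]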
Step 5: Generate an Answer with DeepSeek - finally
Now that we have assembled the relevant transcript snippets as context_data, it's time to generate a natural-language answer grounded in that context. This is where prompt engineering comes into play.
We construct a prompt like this:
prompt = "Compare advantages of SD-WAN over MPLS"
prompt_template = f"Using this data: \"{context_data}\". Respond to this prompt: \"{prompt}\""
We then send it to the DeepSeek model served locally by Ollama:
response = ollama.generate(
    model="deepseek-r1",
    prompt=prompt_template,
)
The model responds with a grounded, coherent answer using the retrieved transcript data. Because the generation is conditioned only on the retrieved context, this approach:
Reduces hallucination
Keeps answers concise and relevant
Supports attribution to the original video (via timestamps)
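A minimal sketch of surfacing the result, assuming the sources list built in Step 4 (the generated text sits in the response's "response" field):

# Print the grounded answer followed by the video references used as context
print(response["response"])
print("Sources:", sources)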
Example in Action
In summary, we transcribed the YouTube video, chunked and generated context data using the question “What is SD-WAN” , and then asked the LLM to “Compare the advantages of SD-WAN over MPLS”. The system we just built:
Retrieves video chunks where this topic is mentioned.
Synthesizes a coherent response using DeepSeek.
Returns grounded text and metadata showing where the info came from.
Here is a sample response:
Okay, let's compare the advantages of SD-WAN highlighted in this transcript over traditional MPLS-based Wide Area Networking.
Based on the provided text ("...sd-wan comes in sd-wan stands for software-defined wide area networking it combines all of our requirements into a simple and affordable platform..." etc.), here are the key advantages:
1. **Cost Efficiency:**
* **SD-WAN Advantage:** Described as "simple and affordable." It allows businesses to potentially reduce costs by migrating less critical traffic off expensive MPLS circuits, especially for cloud access.
* **MPLS Comparison:** Often considered a premium service with dedicated connectivity being more costly.
2. **Flexibility in Connectivity:**
* **SD-WAN Advantage:** Enables the use of different types of internet connections (cable, fiber, DSL) at various locations ("you can use the most cost-effective provider"). It supports diverse options like business broadband or even wireless (4G/5G) as a failover or supplement.
* **MPLS Comparison:** Traditionally relies on a single contracted service provider and often requires specific types of circuits (e.g., dedicated fiber, high-capacity lines). Less flexible in leveraging cheaper internet alternatives directly for non-critical traffic.
3. **Operational Simplicity & Centralized Management:**
* **SD-WAN Advantage:** "Aggregates all these services into a single device" and provides "a single control plane where you can orchestrate the entire network through one user interface."
* **MPLS Comparison:** Often requires separate hardware appliances for routing, firewalling (especially if central firewalls are used), VPNs/Security. Configuration and management might be more distributed or complex.
4. **Performance Optimization on Internet Links:**
* **SD-WAN Advantage:** Can manage QoS effectively even over the public internet ("you can manage those very successfully going over the public internet"). It uses intelligence to decide how to route traffic for best performance.
* **MPLS Comparison:** While MPLS offers high, guaranteed bandwidth and low latency on dedicated links, SD-WAN's approach focuses on optimizing and ensuring quality *on potentially variable* internet connections.
5. **Support for Cloud-Native Applications:**
* **SD-Wan Advantage:** Directly addresses the need to connect securely and efficiently to cloud services (Azure, AWS, Office365, Salesforce). It provides a pathway ("slowly migrate applications") without necessarily requiring complex point-to-point MPLS circuits.
* **MPLS Comparison:** Connecting branch offices or remote users directly to multiple cloud services via standard MPLS might require more complex routing configurations and potentially dedicated tunnels for each secure connection (e.g., IPsec VPNs).
**In summary:**
The transcript positions SD-WAN as a modern, consolidated solution that offers significant cost savings by optimizing the use of diverse internet connections. It simplifies network management through centralized control and provides features to manage performance effectively on these less controlled links. While acknowledging MPLS can still be used alongside SD-WAN (especially for critical applications or point-to-point needs), it frames SD-WAN as a more flexible, affordable, and integrated approach to handling the changing demands of cloud connectivity compared to traditional single-vendor, often more expensive MPLS setups.
Sources: [('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=540s', '00:09:00', 540), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=0s', '00:00:00', 0), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=780s', '00:13:00', 780), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=840s', '00:14:00', 840), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=300s', '00:05:00', 300), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=900s', '00:15:00', 900), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=600s', '00:10:00', 600), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=660s', '00:11:00', 660), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=720s', '00:12:00', 720), ('https://www.youtube.com/watch?v=dq-qA4vEpN0&t=240s', '00:04:00', 240)]
Show me the code
https://github.com/badrinatarajan/ai-sandbox2
Experiments, Conclusions and more
Getting Ollama up and running was a breeze. We also had the chance to experiment with both DeepSeek and Llama 3.2 and observe the quality of answers from each. In general, DeepSeek's responses seemed less prone to hallucination and a bit more structured; this held both for the RAG case and for general questions thrown at the LLM (such as comparing RAG vs. fine-tuning). The key takeaway was that running a model like DeepSeek locally was possible with low effort.
DeepSeek's 'thinking' capability gives useful insight into the reasoning behind why the model arrives at a particular response. However, the latency and compute utilization during response generation were quite noticeable, considering this was running on an M3 processor.
Getting a local Weaviate instance up and running was also quite easy, though, as mentioned in a previous post, the lack of a UI for interacting with the local instance (as opposed to the cloud-hosted instance) leaves much to be desired.
A natural follow-up to this experiment/post would be to build a RAG system that maintains memory and context across the various steps of the process and makes decisions based on observations across those steps - in other words, an "agent" capable of retrieving, generating, refining and iterating while remaining stateful: an Agentic RAG pipeline. We will attempt to do this in our next post.
Links and references:
Install a local Weaviate instance: https://docs.weaviate.io/weaviate/quickstart/local
Ollama models : https://ollama.com/search
Get up and running with Ollama: https://github.com/ollama/ollama
Langchain document loaders : https://python.langchain.com/api_reference/community/document_loaders.html
YouTube loader: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.youtube.YoutubeLoader.html