Taming the SRE Firehose: An Agentic RAG Workflow for Incident Triaging
Building a LangChain-powered assistant that uses JIRA and Slack signals to do root cause analysis on field issues
In the last two posts, we looked at how RAG (Retrieval Augmented Generation) provides a way to enhance the capabilities of LLMs, and how enriching content using a knowledge base yields more meaningful and powerful LLM outputs. We also looked at how the open-source LLM ecosystem has been evolving to address the privacy, cost and control concerns that are a staple of enterprise data.
However, the magic often stops at retrieval: the system remains a great, but static, Q&A tool. If the capabilities of RAG can be extended to evolve it into a reasoning system, the benefits and applicability of such a system increase by an order of magnitude. This is where the magic of Agents, as applied to the LLM world, starts.
So, what are Agents anyway?
At their very core, agents are orchestrators - autonomous workers that decide what actions to take, in what sequence, and how to react based on intermediate results. Agents are goal-oriented and can plan, remember and use external tools to complete tasks. While this description is not exactly crisp or scientific, it captures what an agent is truly capable of.
To provide a more contextual example, an agent can:
Interpret a user request (e.g., “Why did the latency spike in the us-west region?”)
Decompose it into subtasks (e.g., check alerts, correlate deployments, check logs)
Invoke external tools (e.g., APIs, ticketing systems, databases)
Reason iteratively, using the retrieved data to form a final answer - or, better still, take an automated action.
An Agentic RAG system integrates the broader capabilities of an LLM agent with a RAG workflow. In doing so, it evolves a plain RAG Q&A tool into a more useful reasoning system:
Need to analyze a ticket? The agent can retrieve similar past issues.
Want to explain a spike? The agent can correlate Grafana alerts with recent deploys and Slack chatter.
Triaging a customer issue? The agent can parse logs, summarize causes, and even suggest JIRA resolutions.
Beyond the obvious benefit of a context-aware, grounded and precise response, coupling an Agent with RAG also enables multi-hop reasoning, where agents handle follow-ups or combine contextual evidence, and dynamic tool usage, where agents decide whether to call APIs, language tools or search.
Relevance to the Platform Engineering discipline
There is a reason why the examples and explanations above have an SRE/DevOps (or more generically, a Platform Engineering) flavor to them. In today’s cloud-native landscape, Platform Engineers operate at the intersection of development velocity and production reliability. There is a wealth of data available as part of the tech stack - logs, alerts, JIRA, Slack, Grafana dashboards, incident reports, databases and such. Sifting through this firehose of data to understand, triage and resolve incidents is mostly manual, reactive and time-consuming. The technical knowledge is siloed, and in a lot of cases an issue reported from the field needs to be connected back to an SME in the development organization - after it has been bounced back and forth between different development teams.
In essence, this is a knowledge problem, with the following attributes:
An information and data corpus that is vast, disjointed and proprietary, both structured and unstructured, but most of which can be distilled into a natural language representation.
A somewhat vague notion of an issue or incident (usually something like “customer is experiencing slowness”), where the same symptom can stem from different causes.
Triaging and resolving the issue requires human context; in the meantime, a lot of critical time is wasted by hapless TAC Engineering lobbing the issue to the first development team that appears in the alphabetically ordered dropdown list of orgs in JIRA.
As a slight segue, a clear distinction needs to be made at this point between what can (and must) simply be addressed with automation, and what should not be incorrectly framed as a problem for those tools to solve.
a. Plain old CI/CD can, and must, just be automated - and I mean right from the point a PR is raised to the point the code hits production. Sadly, the more common experience is that this process is saddled with unnecessary embellishments, manual interventions, spreadsheets, and go/no-go meetings at various levels of the org.
b. Traditional SRE automation is often rigid and rule-based - “If CPU utilization > 80%, then alert”. These workflows are excellent for predictable, repeatable tasks.
c. Field issues, as we discussed above, are messy and nuanced.
Agents, by the very definition we provided above, introduce the flexibility and reasoning needed to deal with the messy and the nuanced.
To reiterate:
They can infer connections ("Latency spike coincides with a deploy")
They can ask follow-up questions ("Is this affecting other regions?")
They can summarize and take actions across systems dynamically
While automation is great for execution, agents are great for decision and execution.
So, how do we actually go about building such Agentic systems in practice? That’s where LangChain comes in.
Building Agentic RAG workflows
LangChain is a powerful open-source framework for building applications with language models. It shines in scenarios that require chaining, tool use, and reasoning — making it a perfect match for agent-powered RAG systems in platform engineering.
LangChain is well suited to building Agentic RAG pipelines thanks to the following capabilities:
Composable architecture: Swap tools and retrievers in and out easily using API calls, as we will see in the code below.
Tool integrations: Support for popular DevOps tools, with easy custom integrations.
Agent orchestration: ReAct or function calling.
The ReAct framework combines reasoning (thinking) and action (calling tools) in a step-by-step loop. The LLM “thinks aloud,” decides on the next action (e.g., calling a tool), sees the result, and continues reasoning. For example, the LLM outputs intermediate steps such as:
Thought: I need to check recent alerts in Slack.
Action: SlackTool
Action Input: "west coast outage 2025-06-13"
The agent then calls the SlackTool, feeds back the result, and continues reasoning.
Function calling: Allows LLMs to select and call external functions by schema, without verbose “thinking” steps.
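To make this concrete, a function-calling tool is typically advertised to the model as a JSON schema describing its name and arguments. Here is a minimal, illustrative sketch (the exact schema format varies by model and provider, and the function name here is hypothetical):

slack_tool_schema = {
    "name": "query_slack_alerts",  # hypothetical function name
    "description": "Search indexed Slack alerts for a query string",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search text, e.g. an alert name or a date",
            },
        },
        "required": ["query"],
    },
}

The model responds with the chosen function name and structured arguments, which the application then executes on its behalf.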
LangChain also integrates smoothly with Ollama (which we touched upon in our previous post), allowing developers to run powerful open-source LLMs like Gemma or DeepSeek locally.
With LangChain, a typical Agentic RAG workflow would involve:
Integrating tools like JIRA, Slack and SQL that provide the corpus of information, along with vector databases like Weaviate to store the chunked data
Defining retrievers to fetch relevant chunks of data from internal sources
Configuring agents to reason through tasks and use tools as needed
Composing end-to-end pipelines that perform triage, diagnosis, and even action
Let’s walk through a more concrete example with relevant code.
Triaging Field incidents with LangChain, Ollama and Weaviate
Assume we have to do an RCA of a customer-reported incident of “slowness in application access” on a certain date in the past. Also assume that relevant signals are spread across structured systems like Jira (engineering status) and unstructured messaging platforms like Slack (alerting, SRE channel). Traditional log scraping or dashboard monitoring can miss cross-source correlation and waste precious triage time. We will solve this problem using LangChain + Ollama (Gemma) for reasoning, and Weaviate as a vector DB to persist and semantically retrieve prior Slack alerts and Jira issues. We will then construct an agent with access to two tools:
SlackTool: queries semantically indexed Slack alerts
JiraTool: queries semantically indexed Jira tickets
LangChain’s zero-shot ReAct agent decides how to combine these tools to answer a complex triage question like the one we have above.
Key workflow steps
Step 1: Custom integrations with JIRA and Slack to access documents
In our simulated example, we created a “DEVOPS” JIRA project populated with a typical set of Ops tasks and items. We also populated a Slack channel, #sre-alerts, with alerts carrying a timestamp, a raise/clear condition, an alert name, a tenant ID and a description.
We also created connectors to JIRA and Slack that use the APIs exposed by each system to retrieve the relevant documents:
def get_docs_from_jira() -> list:
    j, rc = jira_connector.jira_connect(JIRA_URL, JIRA_USERNAME, JIRA_TOKEN)
    if not rc.success:
        return []
    issues = jira_connector.jira_search_issues_with_jql(j, "project=" + JIRA_PROJECT, maxResults=100)
    print(f'Retrieved {len(issues)} issues from jira')
    # parse_jira_issues is a stand-in for the repo's parsing helper
    # (analogous to parse_slack_msgs below) that turns issues into documents
    return parse_jira_issues(issues)
and
def get_docs_from_slack() -> list:
    slack_client = slack_connector.get_slack_client(SLACK_TOKEN)
    data, rc = slack_connector.get_messages(slack_client, SLACK_CHANNEL_ID)
    if not rc.success:
        return []
    docs = parse_slack_msgs(data)
    return docs
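For completeness, parse_slack_msgs converts the raw Slack payload into LangChain Document objects. A minimal sketch of what such a parser might look like (the repo’s actual implementation may differ; “text” and “ts” are the standard message fields returned by the Slack conversations API):

from langchain_core.documents import Document

def parse_slack_msgs(data) -> list:
    # Wrap each message's text in a Document, keeping the Slack
    # timestamp as metadata for later correlation.
    return [Document(page_content=msg.get("text", ""),
                     metadata={"ts": msg.get("ts", "")})
            for msg in data]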
Step 2: Vector Embeddings and Storage with Weaviate
Once the docs are retrieved from Slack and Jira, they are chunked using RecursiveCharacterTextSplitter and embedded with mxbai-embed-large via OllamaEmbeddings.
embeddings = OllamaEmbeddings(model="mxbai-embed-large")
db = WeaviateVectorStore(client=client,
                         index_name=collection_name,
                         text_key=key,
                         embedding=embeddings)
db.add_documents(docs)
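The client argument above is the Weaviate connection handle. A minimal sketch of obtaining one for a locally running instance, assuming the v4 weaviate-client package (which the langchain-weaviate integration expects):

import weaviate

# Connect to a local Weaviate instance on the default HTTP/gRPC ports.
client = weaviate.connect_to_local()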
Both Slack and Jira have their own collections in Weaviate:
DevOpsSlackCollection
DevOpsJiraCollection
Each Slack document includes:
Timestamp (timestamp of the event)
Alert (alert name)
Description (free-form text)
Tenant ID (that the alert was raised for)
Status (RAISE/CLEAR)
Each Jira document includes:
Content (Issue summary and description)
Issue ID (JIRA ID)
Assignee (if available)
Resolution status
Resolution date (date the item was resolved on). Note: JIRA automatically assigns a resolution date when an issue transitions to the resolved/done state, so for this project we created a custom field. In a real-world scenario, you should be able to use JIRA’s built-in resolution date field, accessible via the APIs provided in the code (see the sketch below).
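For reference, if the connector is built on the standard jira Python package, the built-in resolution date is exposed directly on each issue; a quick sketch under that assumption:

# With the `jira` package, each issue exposes the built-in field directly:
for issue in issues:
    print(issue.key, issue.fields.resolutiondate)  # None if unresolved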
Once we have defined the collections, we then add the documents into the collections using the following snippets:
slack_key = 'description'
slack_db = add_or_retrieve_from_collection(weaviate_client, slack_collection, SLACK_COLLECTION_NAME, slack_key)
and
jira_key = 'content'
jira_db = add_or_retrieve_from_collection(weaviate_client, jira_collection, JIRA_COLLECTION_NAME, jira_key)
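add_or_retrieve_from_collection is a small helper from the accompanying repo. Conceptually, it wraps the named Weaviate collection in a WeaviateVectorStore and indexes the documents; a rough sketch under that assumption (parameter roles are inferred from the call sites above, so the repo’s actual logic may differ):

def add_or_retrieve_from_collection(client, docs, collection_name, key):
    # Build (or reattach to) the vector store backing this collection...
    db = WeaviateVectorStore(client=client,
                             index_name=collection_name,
                             text_key=key,
                             embedding=embeddings)
    # ...and index the documents if there are any to add.
    if docs:
        db.add_documents(docs)
    return db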
Step 3: Create retrievers
LangChain provides a “retriever” interface via the as_retriever() method, which converts a vector store into something that is accessible by an agent’s components, such as tools or chains.
For example, the following two snippets:
slack_retriever = slack_db.as_retriever(search_kwargs={"k":10})
jira_retriever = jira_db.as_retriever(search_kwargs={"k":10})
are used to wrap the vector store with a retriever class that exposes a uniform API - specifically, in our case, get_relevant_documents(), as shown below. This abstraction ensures that users do not have to deal with each vector store’s specific query interface; retrievers are expected to implement the uniform methods.
docs = slack_retriever.get_relevant_documents(query)
and
docs = jira_retriever.get_relevant_documents(query)
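One caveat: get_relevant_documents() was deprecated in langchain-core 0.1.46 in favor of the uniform invoke() method (the deprecation warning even shows up in the agent trace below), so newer code would read:

docs = slack_retriever.invoke(query)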
Step 4: Create tools
In the LangChain context, a tool is a callable (a function or wrapper) that an agent can access while solving a task. It includes:
A name (e.g., "SlackTool")
A description (used by the agent to decide when to use it)
A function (like query_slack_tool) that implements the actual behavior
In our example, we create two tools as follows:
tools = [
    Tool.from_function(name="SlackTool", func=query_slack_tool, description="Use this tool to query slack messages"),
    Tool.from_function(name="JiraTool", func=query_jira_tool, description="Use this tool to query Jira tickets")
]
where the functions query_slack_tool and query_jira_tool use the retrievers that we created in the previous step:
def query_slack_tool(query: str) -> str:
    docs = slack_retriever.get_relevant_documents(query)
    return "\n".join([doc.page_content for doc in docs])
and
def query_jira_tool(query: str) -> str:
    docs = jira_retriever.get_relevant_documents(query)
    return "\n".join([doc.page_content for doc in docs])
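As an aside, LangChain also provides a @tool decorator that builds an equivalent tool from a function’s name, signature and docstring (the docstring doubles as the description); a sketch of the Slack tool in that style:

from langchain_core.tools import tool

@tool
def slack_tool(query: str) -> str:
    """Use this tool to query slack messages."""
    docs = slack_retriever.invoke(query)
    return "\n".join(doc.page_content for doc in docs)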
Step 5: Hook it all up using the agent
We use LangChain’s initialize_agent with the two tools from the previous step and a Gemma3:12b LLM hosted via Ollama.
model = OllamaLLM(model="gemma3:12b", temperature=0.4)
agent = initialize_agent(tools, model,
                         agent="zero-shot-react-description",
                         verbose=True,
                         handle_parsing_errors=True)
response = agent.run("Use Slack and Jira to determine why customers in the west coast reported application access issues on 2025-06-13")
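The agent string above is the value behind LangChain’s AgentType enum, so the same setup can be written more explicitly as:

from langchain.agents import initialize_agent, AgentType

agent = initialize_agent(tools, model,
                         agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True,
                         handle_parsing_errors=True)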
The agent can now:
Reason about when to query Slack or Jira
Combine both sources
Synthesize a cause or hypothesis
As we discussed earlier, this agent type uses the ReAct (Reasoning + Acting) framework without needing custom examples (i.e., zero-shot). It relies on the description of each tool to decide which one to invoke based on the query. It alternates between thought steps and tool use until it has enough information to respond. This makes it ideal for dynamic environments where new tools may be introduced or where the user queries vary widely.
Example in Action
For the prompt that we provided:
Use Slack and Jira to determine why customers in the west coast reported application access issues on 2025-06-13
this is the output of the agent:
> Entering new AgentExecutor chain...
Okay, I need to investigate the application access issues reported by West Coast customers on 2025-06-13. I'm going to start by checking Slack for any relevant discussions. Then, I'll check Jira to see if there were any related tickets.
Action: SlackTool
Action Input: "application access west coast 2025-06-13"/Users/badri/work/src/ai-sandbox3/rag_agent.py:263: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 1.0. Use :meth:`~invoke` instead.
docs = slack_retriever.get_relevant_documents(query)
Observation: 2025-06-26T05:35:44Z CLEAR SECURITY_BREACH Origin IP access attempt blocked or resolved, Suspicious Activity Cleared
2025-06-16T17:35:28Z RAISE BANDWIDTH_UTILIZATION Bandwidth Utilization near critical level, > 97% for edge region us-east-1a
2025-06-26T04:12:08Z RAISE SECURITY_BREACH Suspicious Origin IP Access Attempt detected for domain <http://protected.example.com|protected.example.com>
2025-06-13T18:42:19Z RAISE CPU_UTIL_HIGH CPU util on edge node us-west wk-us-west1-a-fyxz breached 85%
2025-06-23T16:45:12Z RAISE LATENCY_SLOW High Average Response Time, Average > 3500ms for requests to <cdn.example.com>
2025-06-14T11:32:49Z CLEAR BANDWIDTH_THROTTLE Bandwidth Usage dropped below Tier Limit, Current usage <= 78 Mbps
2025-06-16T19:38:29Z CLEAR BANDWIDTH_UTILIZATION Bandwidth usage dropped below 95%, Utilization normal for region eu-central-1b
2025-06-06T11:37:15Z CLEAR HTTP_5XX_ERR_RATE_LIMIT_EXCEEDED Rate of 5xx errors reduced below threshold, Error Rate Normalized
2025-06-09T14:38:01Z RAISE CDS_DEPLOYMENT_ERR CDN Origin Error - Certificate (SSL) for origin server expired or invalid
2025-06-08T13:58:41Z RAISE HTTP_5XX_ERR_HIGH High Server Error Rate (5xx), Origin server experiencing > 18 errors/sec
Thought:Okay, the Slack messages indicate a few potential issues. Specifically, there's a message from 2025-06-13T18:42:19Z indicating "CPU util on edge node us-west wk-us-west1-a-fyxz breached 85%". This is a strong candidate for the root cause of the West Coast access issues. I'm going to check Jira to see if there's a ticket related to this CPU utilization spike.
Action: JiraTool
Action Input: "CPU utilization us-west 2025-06-13"
Observation: Update load balancer configuration on prod for Zeus Adjust weights to improve traffic distribution. Update on us-west , monitor for 7 days and roll out to all prod. issue_resolution date: 2025-06-13T06:00:00.000-0700
Cleanup log files on zeus VMs Cleanup old log files. Add a script to do this periodically issue_resolution date: 2025-06-13T22:30:00.000-0700
Fix log line variable Fix log line variable causing a BQ issue issue_resolution date: 2025-06-12T00:30:00.000-0700
Add script to monitor disk IO stats on Atlas VMs Add script to monitor disk IO/stats and sync to BQ issue_resolution date: NA
Update Atlas DB schema for Hermes Optimize database queries for faster response times. issue_resolution date: 2025-06-26T01:00:00.000-0700
Rollout cert manager updates for lab tenants Rollout cert manager updates to lab tenants issue_resolution date: 2025-06-26T01:30:00.000-0700
Deploy new logging framework to Atlas Implement ELK stack for better log analysis. issue_resolution date: NA
Rollout cert manager updates for staging and prod Cert manager updates for staging and prod issue_resolution date: 2025-06-26T02:00:00.000-0700
Deploy Zeus monitoring tools to Atlas Install Grafana dashboards for Atlas performance metrics. issue_resolution date: NA
Update Atlas logging configuration Centralize log files using Splunk. issue_resolution date: NA
Thought:The Jira tool results show that there was work done on 2025-06-13 related to updating the load balancer configuration on prod for Zeus, which included adjustments to improve traffic distribution on the us-west region. This aligns with the Slack message indicating high CPU utilization on an edge node in the us-west region. It appears the load balancer adjustments were an attempt to address the CPU utilization issue.
Final Answer: Based on Slack and Jira data, customers in the west coast reported application access issues on 2025-06-13 due to high CPU utilization on an edge node (wk-us-west1-a-fyxz) in the us-west region. Jira tickets indicate that load balancer configurations for Zeus were updated on the same day to improve traffic distribution, likely in response to this CPU utilization spike.
> Finished chain.
The agent was able to correlate the customer issues back to a CPU spike and, in turn, a load balancer configuration update. There were other incidents and tickets around the same day; however, the agent reasoned that the root cause lay with the spike and the configuration change.
Show me the code:
https://github.com/badrinatarajan/ai-sandbox3
Experiments, Conclusions and More:
Getting the LangChain libraries and hooking everything up with Ollama and Gemma3 was relatively painless. Switching the LLM between DeepSeek/Gemma/Llama 3.2 was also a single-line change in the code.
There is quite a bit of variation that can be controlled using the parameters passed to OllamaLLM - specifically, experimenting with higher temperature values produced more interesting and varied responses (to the point of hallucinating), while lower values produced more grounded responses. This was true across the different models.
Retrieving the top 10 documents (the parameter ‘k’ passed to the as_retriever() function) seemed to be optimal. Passing a larger number caused some iterations of DeepSeek to tailspin and never complete the reasoning, while a smaller number missed the mark (in terms of finding the root cause) in some cases.
While it was my intent to switch to a hybrid search (vector + text based) to see if the responses and reasoning might improve, there were compatibility issues between the vector database and LangChain in the APIs supporting this combination, so this effort was deferred.
While the response attributed the load balancer weights update to be a consequence of the CPU spike, in reality it is possible that the CPU spike was the result of updating the load balancer weights - in fact, this is very much a real-world scenario. The point, however, is that the Agentic workflow was able to correlate events and propose a causal chain.
Links and References:
LangChain and Weaviate integration: https://python.langchain.com/docs/integrations/vectorstores/weaviate/
JIRA APIs: https://developer.atlassian.com/cloud/jira/platform/rest/v3/intro/#version
Slack APIs: https://api.slack.com/apis
Working with Slack apps (creating, publishing, webhooks): https://api.slack.com/messaging/webhooks