Agentic Ops for Predictive Reliability
Prophet forecasting + LLM insights for faster, smarter anomaly detection
In the last few posts, we explored RAG and saw how adding a localized knowledge base component was not just useful, but essential to harness the power of LLMs. We also looked at how agentic workflows take it a step further by adding reasoning capabilities to the system.
Progressing along similar lines, in this post we'll look at how agentic toolchains can add value by combining LLMs with domain-specific tools, delivering results that neither could achieve alone. Continuing our previous thread on Site Reliability Engineering (SRE) workflows, we will build an agentic flow to address SRE tasks.
Before we jump into the details, it is worth explaining why the SRE domain is a good use case for such an agentic toolchain.
1. Understanding the Nature of SRE Data
Site Reliability Engineering (SRE) teams monitor an enormous variety of operational time series data.
Typical telemetry includes, but is not limited to, traffic volumes, performance stats, and reliability and infrastructure metrics.
These metrics almost always exhibit:
Seasonality — predictable repeating patterns (e.g., daily traffic peaks, weekly cycles).
Trends — gradual changes over time (e.g., traffic growth, slow performance degradation).
Event Effects — deployments, maintenance windows, outages, or marketing campaigns.
Noise — short-term variability from natural randomness in the system.
Typically, organizations start out with simple approaches: static thresholds (“alert if latency > 200ms”) or simple moving averages. Very soon they are faced with false positives that increase alert fatigue, which leads to dampening the alerts, which in turn causes missed criticalities, which prompts lowering the threshold, which causes more alert fatigue… you get the idea. Eventually SRE teams arrive at ad-hoc anomaly detection mechanisms, which also need to be tuned regularly. In general:
Static Thresholds:
Ignore seasonality — trigger false positives during predictable high-load periods (or even troughs), a frequent cause of alert fatigue and missed alerts
Miss slow degradation that never crosses the fixed threshold, which over time is a ticking time bomb
Simple Moving Averages or Rolling Std. Dev.:
Smooth out short-term noise, but lag behind when trends shift quickly.
Assume data is stationary — struggle with multi-seasonal patterns.
Manual SQL or Dashboard Inspection:
Requires human intuition and constant attention; good when you have sharp engineers, but still cannot scale well
Not scalable for thousands of metrics across regions and services.
Ad-hoc Anomaly Detection Scripts (e.g., z-score):
Sensitive to outliers and parameter tuning.
Can’t handle complex seasonality without extensive preprocessing.
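To make that last point concrete, here is a minimal, hypothetical rolling z-score detector (not taken from the repo linked later in this post). On strongly seasonal data it will flag every predictable peak and trough unless the window and threshold are constantly retuned:
import pandas as pd

def zscore_anomalies(series: pd.Series, window: int = 60, threshold: float = 3.0) -> pd.Series:
    # Flag points more than `threshold` rolling standard deviations away from the rolling mean
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    return (series - mean).abs() > threshold * std
Even in this toy version, window and threshold need per-metric tuning, and a perfectly normal Monday-morning traffic spike still looks like a three-sigma event.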
2. When forecasting is all you need
The good news is that this is not a new problem (or even one that is unique to the SRE domain). Business and operational forecasting is a well-studied math problem, with numerous time-tested models that, when applied correctly, provide rich insights and, when paired with some automation, handle exactly the kind of complex, seasonal, noisy data that SRE teams see every day.
For the purpose of this article, we have chosen Prophet, an open source model from Meta that shines when data has seasonal trends, is noisy, and is prone to holiday effects. The associated paper is linked at the end of the article; the gist is that Prophet models data by breaking it into four components:
y(t) = g(t) + s(t) + h(t) + ε_t
Where:
g(t) — Trend: long-term growth or decline (piecewise linear or logistic growth).
s(t) — Seasonality: repeating daily, weekly, or yearly patterns (via Fourier series).
h(t) — Holiday/Event effects: one-off events causing spikes or dips.
ε_t — Noise: unpredictable randomness.
The assumption here is that the data is a mix of steady change, repeating rhythms, special events, and randomness, and the modeling pulls those apart so users can see what's actually unusual. For example, if request latency spikes every Monday at 9 AM due to user traffic, Prophet learns that this is normal and won't raise a false alarm. But if latency suddenly jumps on a Thursday afternoon with no known reason, it flags it as an anomaly. Similarly, the model accounts for weekend traffic behavior (which might peak for retail traffic and might dip for enterprise SaaS traffic) and accordingly predicts normals and flags anomalies.
I need to emphasize that this decomposition approach is critical for SRE work because it filters out the expected and highlights the unexpected, allowing teams to respond faster to genuine operational issues without being buried in alert noise.
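To see this decomposition directly, Prophet exposes each component as a column of the forecast dataframe. Here is a minimal sketch, assuming a dataframe df with the 'ds'/'y' columns Prophet expects (the fitting code for our actual dataset comes later in section 3.1):
from prophet import Prophet

m = Prophet(weekly_seasonality=True, daily_seasonality=True)
m.fit(df)
forecast = m.predict(df)   # predict over the training window itself

# g(t), s(t) and the combined prediction y(t) show up as separate columns
print(forecast[['ds', 'trend', 'weekly', 'daily', 'yhat']].tail())
m.plot_components(forecast)   # one panel per fitted component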
There are a few other models/mechanisms worth mentioning here; they might warrant separate posts or experiments to compare and contrast with the relatively simpler, but just as effective, easy to use and faster Prophet model:
ARIMA (Auto Regressive Integrated Moving Average): another statistical tool that predicts future outcomes based on historical data. As the acronym suggests, the model has an AR (autoregression) component that captures the relationship between a data point and its previous values, an Integration component that differences the data to make it stationary (raw data with trends or seasonality typically isn't, so this component accounts for that variance), and an MA (moving average) component that looks at past forecast errors to improve future predictions. A brief sketch follows after this list.
LSTM (Long Short Term Memory): LSTMs are a type of recurrent neural network designed to remember information over longer periods, and are (were?) all the rage in NLP, at least prior to Transformers. They operate by selectively adding or removing information, regulating the flow of data through the network. They can be effective for our use case, since they can retain patterns over long intervals such as days or weeks and filter out irrelevant noise; however, they come with significant computational costs that might require GPUs. A brief sketch also follows below.
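Neither of these is used in the rest of this post, but for orientation, here are two hypothetical minimal sketches: an ARIMA fit with statsmodels, and a small Keras LSTM trained on sliding windows. The variable series (a 1-D numeric series of a metric at regular intervals) and all hyperparameters are assumptions.
from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q): AR lags, differencing steps (the "I"), MA lags
arima = ARIMA(series, order=(2, 1, 2)).fit()
fc = arima.get_forecast(steps=30)
expected = fc.predicted_mean        # point forecast
bounds = fc.conf_int(alpha=0.2)     # 80% prediction interval, analogous to Prophet's yhat_lower/yhat_upper
And the LSTM sketch:
import numpy as np
from tensorflow import keras

WINDOW = 60  # look back 60 samples to predict the next one

def make_windows(values, window=WINDOW):
    # Build (samples, window, 1) inputs and next-step targets from a 1-D array
    X, y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        y.append(values[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)

X, y = make_windows(np.asarray(series, dtype="float32"))
lstm = keras.Sequential([
    keras.Input(shape=(WINDOW, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
lstm.compile(optimizer="adam", loss="mse")
lstm.fit(X, y, epochs=5, batch_size=64)
next_value = lstm.predict(X[-1:])  # one-step-ahead forecast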
One question to ask ourselves is whether the LLM itself can be used for such forecasting. While LLMs are great at a few tasks, including reasoning, it is critical to note that:
LLMs are trained for pattern recognition in text, not for precise numeric regression.
They lack the statistical rigor and uncertainty quantification needed for robust anomaly detection.
Asking an LLM to “forecast” without a numerical model risks producing outputs that look plausible but are statistically meaningless — the classic “confident hallucination.”
They can definitely augment, analyze, “think” and reason out solutions once the quantification has been completed by tools that are great at their jobs; the old adage of not bringing a sword to a gunfight applies.
3. A deeper dive, and building blocks
As mentioned earlier, an impactful agentic toolchain combines the power of domain-specific tools with the capabilities of an LLM. For our SRE use case, the first building block (tool) we consider is an anomaly detection tool based on the Prophet forecasting model.
3.1 Anomaly detection tool
While the tool itself should be (and is) agnostic to the type of data it works on, for this exercise we will use a rather mundane network traffic data set; it can be replaced by any other infrastructure, load, or performance metric that we are interested in detecting anomalies on.
Network data for a cloud based tenant will typically include bytes_in, bytes_out, request_duration, response_status, and the timestamps the data is recorded at, along with the cloud deployment region. A typical plot over a period of several months shows spikes during daytime traffic and troughs over weekends and holidays.
This data, however, is prone to move outside of normal variance on account of managed and unmanaged factors, such as software deployments and infrastructure failures. Any decent set of SRE automation tools would have synthetic checks (both post-deployment and steady-state) that capture outright failures (say, traffic dropping completely), or could monitor such synthetic test services for dips, but it is hard to account for various scenarios:
Deployment happens on a weekend when traffic is already low, so the confidence level of such metrics might be low as well
Synthetic tests do not always capture the nature of live traffic that could have different flow paths based on the different customer profiles/configurations that might be in play
The traffic comes back but with a degradation (i.e., not a complete outage), or worse, there is a slow degradation over a period of days
Unknown network effects from infrastructure, or even the trending nature of the traffic, result in preconfigured thresholds being breached or rendered ineffective
Any modeling needs to take such vagaries of traffic into account. With Prophet, this typically happens in three steps (a data-preparation sketch follows the list):
Train Prophet on historical data.
Predict the expected value range (with confidence intervals).
Flag anomalies if real values fall outside that range.
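The snippet below assumes a handful of prepared inputs (df_pre, df_agg, public_holidays, weekends, deployment_time). Here is a hypothetical sketch of how they might be built from the CSV used later in this post; the column names and holiday dates are assumptions based on the dataset description above:
import pandas as pd
from prophet import Prophet

df = pd.read_csv('network_traffic_data_mbytes_per_region_with_traffic_drop.csv',
                 parse_dates=['timestamp'])
deployment_time = pd.Timestamp('2025-07-29 12:00:00')

# aggregate bytes_out per minute for one region
df_agg = (df[df['region'] == 'us-west1']
          .groupby(pd.Grouper(key='timestamp', freq='min'))['bytes_out']
          .sum()
          .reset_index())

# Prophet expects 'ds'/'y' columns; train only on pre-deployment data
df_pre = (df_agg[df_agg['timestamp'] < deployment_time]
          .rename(columns={'timestamp': 'ds', 'bytes_out': 'y'}))

# holiday dataframes need 'holiday' and 'ds' columns
public_holidays = pd.DataFrame({'holiday': 'public_holiday',
                                'ds': pd.to_datetime(['2025-05-26', '2025-07-04'])})
days = pd.date_range(df['timestamp'].min(), df['timestamp'].max(), freq='D')
weekends = pd.DataFrame({'holiday': 'weekend', 'ds': days[days.weekday >= 5]})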
# consider holiday effects
all_holidays = pd.concat([public_holidays, weekends], ignore_index=True)

# train on historical (pre-deployment) data
model = Prophet(holidays=all_holidays)
model.fit(df_pre)

# forecast 60 minutes past the training data (we compare the first 30 minutes of actuals below)
future = model.make_future_dataframe(periods=60, freq='min')
forecast = model.predict(future)

# actual values for the 30 minutes after deployment
df_agg.set_index('timestamp', inplace=True)
df_post = df_agg.loc[
    (df_agg.index >= deployment_time) &
    (df_agg.index < deployment_time + pd.Timedelta(minutes=30))
].copy()

# combine the forecasted results with the actuals
forecast.set_index('ds', inplace=True)
result = forecast[['yhat', 'yhat_lower', 'yhat_upper']].join(df_post, how='inner')

# flag anomalies - here, if actuals fall below the lower bound of the uncertainty interval
result['anomaly'] = result['bytes_out'] < result['yhat_lower']
In the snippet above, we flag an anomaly if the actuals are less than the lower bound of the uncertainty interval. In general, yhat, yhat_lower and yhat_upper are the three main values Prophet returns for each forecast point: the prediction itself and the bounds of its uncertainty interval.
The uncertainty interval itself is tunable as a part of the initialization.
#Train on historical data
model = Prophet(interval_width=0.8) #default = 0.8
model.fit(df_pre)
A narrower interval flags anomalies more aggressively, while a wider interval only flags larger deviations, reducing noise but possibly missing smaller degradations.
For example, using the network data above, simulating a partial drop in traffic with the default interval width flagged the anomalies immediately, while a larger width allowed for more leeway before the anomaly was detected.
3.2 Service metadata database tool
The second tool in our chain is a database tool. This is just a connector to a database (we use sqlite3, but replacing it with any similar DB is just a matter of rewriting the APIs that do standard CRUD operations on the database). The point of this tool is to provide additional customized data to help the toolchain reason about and answer system-level questions.
For example, a sample DB could have a schema with the following fields, which indicate when a service was onboarded and if the service is currently enabled.
connection.execute(text("""
    CREATE TABLE IF NOT EXISTS service_config_state (
        service VARCHAR(255) NOT NULL PRIMARY KEY,
        region VARCHAR(255) NOT NULL,
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
        state VARCHAR(50) NOT NULL DEFAULT 'enabled')
"""))
Retrieving the rows after such an insert produces output like:
Retrieving data from sqlite:///service_db.sqlite...
('service1', 'us-west1', '2025-02-14 10:00:00', 'enabled')
('service2', 'us-west1', '2025-07-15 1:00:00', 'enabled')
('service3', 'us-east1', '2025-07-29 17:00:00', 'enabled')
('service4', 'us-south1', '2025-08-02 08:00:00', 'enabled')
('service5', 'us-west1', '2025-08-03 17:00:00', 'enabled')
A key point: when performing the actual query as part of the toolchain, we can ask the LLM to generate the SQL query from natural language prompts. In the snippet below, we pass the path of the DB (also as part of the prompt), retrieve the table schema in code, and then feed the schema to the LLM to generate the actual SQL query.
from sqlalchemy import create_engine, text
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

db_path = db_path.strip("'")
engine = create_engine(f'sqlite:///{db_path}')
#print(f'db_path {db_path} , engine {engine}')

table_name = "service_config_state"
table_info = get_table_schema(engine, table_name)
#print(f"Table schema for {table_name}:\n{table_info}")

# Define the prompt template for generating SQL queries
prompt_template = PromptTemplate(
    input_variables=["timestamp", "table_info", "region"],
    template="""
You are a SQL expert. Given the table schema:
{table_info}
write a SQL query that returns all services where region = '{region}' and created_at < '{timestamp}'.
Only return the query — no explanation.
SQL:"""
)
sql_chain = LLMChain(llm=llm, prompt=prompt_template)

results = []
# for each anomaly timestamp obtained from the previous tool, generate the SQL query
for ts in timestamps:
    sql_query = sql_chain.run({"timestamp": ts, "table_info": table_info, "region": region})
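The generated query can then be executed against the same engine and the rows accumulated, roughly like this (a hypothetical continuation of the loop above; the actual tool in the repo may differ):
    # hypothetical continuation, inside the 'for ts in timestamps:' loop above
    with engine.connect() as conn:
        rows = conn.execute(text(sql_query.strip())).fetchall()
    results.append({"timestamp": ts, "services": [row[0] for row in rows]})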
3.3 Hooking it all up in an agentic workflow
A sample workflow for our SRE use case looks like this:
Anomaly Detection — Prophet analyzes SRE metric data (e.g., from CSV or BQ) and outputs anomaly timestamps per metric and per region.
Impact Assessment — The agent uses NL-to-SQL to query the config/state database for affected services at those times.
Summary + Action — LLM composes a structured incident summary or answers a question and optionally triggers escalation workflows (Slack/Jira).
For this last act, we will use our time-tested LangChain toolchain to orchestrate the tools and then add the LLM layer using Ollama. Continuing our thread of using open LLMs, we will use Gemma3 to tie it all together.
from langchain_community.llms import Ollama  # or langchain.llms, depending on the LangChain version
from langchain.agents import AgentType, Tool, initialize_agent

# ------------------------
# Set up the LLM
# ------------------------
llm = Ollama(model="gemma3:12b", temperature=0)

# ------------------------
# Define Tools for the Agent
# ------------------------
tools = [
    Tool(
        name="detect_anomalies",
        func=detect_anomalies_tool,
        description="Use this tool to detect anomaly timestamps from a network CSV file. Input: path to the CSV file"
    ),
    Tool(
        name="get_affected_services",
        func=get_services_by_region_tool,
        description="Use this to find enabled services at one or more timestamps. Input: 'db_path=<db>, timestamp=<timestamp>, region=<region>'"
    )
]

# ------------------------
# Initialize LangChain Agent
# ------------------------
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# ------------------------
# Agent Call
# ------------------------
response = agent.run("""
1. Use the detect_anomalies tool to find anomaly timestamps per region from 'network_traffic_data_mbytes_per_region_with_traffic_drop.csv'.
2. For each timestamp, use the get_affected_services tool to get services created before that timestamp from the database located at 'service_db.sqlite'.
""")
Note that the agent uses ZERO_SHOT_REACT_DESCRIPTION, similar to our previous experiments. With this mode, no examples are needed to tell the LLM what to do; the agent is expected to perform the task using the instructions provided as part of the agent.run() call. The ReAct framework is a reasoning-and-acting loop where the LLM alternates between:
Thinking (internal reasoning)
Acting (calling tools)
Observing (receiving results)
Responding
4. Example in Action
For the prompt we provided:
1. Use the detect_anomalies tool to find anomaly timestamps per region from 'network_traffic_data_mbytes_per_region_with_traffic_drop.csv'.
2. For each timestamp, use the get_affected_services tool to get services created before that timestamp from the database located at 'service_db.sqlite'.
Here is a sample response:
> Entering new AgentExecutor chain...
Okay, let's start by using the `detect_anomalies` tool to find the anomaly timestamps.
Action: detect_anomalies
Action Input: 'network_traffic_data_mbytes_per_region_with_traffic_drop.csv'Detecting anomalies in network_traffic_data_mbytes_per_region_with_traffic_drop.csv at time 2025-07-29 12:00:00...
20:08:02 - cmdstanpy - INFO - Chain [1] start processing
20:08:38 - cmdstanpy - INFO - Chain [1] done processing
Prophet model found {'us-west1': [Timestamp('2025-07-29 12:04:00'), Timestamp('2025-07-29 12:05:00'), Timestamp('2025-07-29 12:06:00'), Timestamp('2025-07-29 12:07:00'), Timestamp('2025-07-29 12:08:00'), Timestamp('2025-07-29 12:09:00'), Timestamp('2025-07-29 12:10:00'), Timestamp('2025-07-29 12:11:00'), Timestamp('2025-07-29 12:12:00'), Timestamp('2025-07-29 12:13:00'), Timestamp('2025-07-29 12:14:00'), Timestamp('2025-07-29 12:15:00'), Timestamp('2025-07-29 12:16:00'), Timestamp('2025-07-29 12:17:00'), Timestamp('2025-07-29 12:18:00'), Timestamp('2025-07-29 12:19:00'), Timestamp('2025-07-29 12:20:00'), Timestamp('2025-07-29 12:21:00'), Timestamp('2025-07-29 12:22:00'), Timestamp('2025-07-29 12:23:00'), Timestamp('2025-07-29 12:24:00'), Timestamp('2025-07-29 12:25:00'), Timestamp('2025-07-29 12:26:00'), Timestamp('2025-07-29 12:27:00'), Timestamp('2025-07-29 12:28:00'), Timestamp('2025-07-29 12:29:00')]} with anomalies.
Observation: {'us-west1': [Timestamp('2025-07-29 12:04:00'), Timestamp('2025-07-29 12:05:00'), Timestamp('2025-07-29 12:06:00')]}
Thought:Now I have the anomaly timestamps for us-west1. I need to use the `get_affected_services` tool for each of these timestamps.
Action: get_affected_services
Action Input: 'db_path=service_db.sqlite, timestamp=2025-07-29 12:04:00, region=us-west1'Parsed query db_path=service_db.sqlite, timestamp=2025-07-29 12:04:00, region=us-west1 args into {'db_path': 'service_db.sqlite', 'timestamp': '2025-07-29 12:04:00', 'region': 'us-west1'}
Tables in Engine(sqlite:///service_db.sqlite):
Table: service_config_state
/Users/badri/work/src/ai-sandbox4/anomaly_agent.py:132: LangChainDeprecationWarning: The class `LLMChain` was deprecated in LangChain 0.1.17 and will be removed in 1.0. Use :meth:`~RunnableSequence, e.g., `prompt | llm`` instead.
sql_chain = LLMChain(llm=llm, prompt=prompt_template)
Executing SQL query for timestamp 2025-07-29 12:04:00:
SELECT service
FROM service_config_state
WHERE region = 'us-west1' AND created_at < '2025-07-29 12:04:00';
Observation: Affected services at 2025-07-29 12:04:00: service1, service2
Thought:Now I have the affected services for the first timestamp. Let's do the next one.
Action: get_affected_services
Action Input: 'db_path=service_db.sqlite, timestamp=2025-07-29 12:05:00, region=us-west1'Parsed query db_path=service_db.sqlite, timestamp=2025-07-29 12:05:00, region=us-west1 args into {'db_path': 'service_db.sqlite', 'timestamp': '2025-07-29 12:05:00', 'region': 'us-west1'}
Tables in Engine(sqlite:///service_db.sqlite):
Table: service_config_state
Executing SQL query for timestamp 2025-07-29 12:05:00:
SELECT service
FROM service_config_state
WHERE region = 'us-west1' AND created_at < '2025-07-29 12:05:00';
Observation: Affected services at 2025-07-29 12:05:00: service1, service2
Thought:Now I have the affected services for the second timestamp. Let's do the last one.
Action: get_affected_services
Action Input: 'db_path=service_db.sqlite, timestamp=2025-07-29 12:06:00, region=us-west1'Parsed query db_path=service_db.sqlite, timestamp=2025-07-29 12:06:00, region=us-west1 args into {'db_path': 'service_db.sqlite', 'timestamp': '2025-07-29 12:06:00', 'region': 'us-west1'}
Tables in Engine(sqlite:///service_db.sqlite):
Table: service_config_state
Executing SQL query for timestamp 2025-07-29 12:06:00:
SELECT service
FROM service_config_state
WHERE region = 'us-west1' AND created_at < '2025-07-29 12:06:00';
Observation: Affected services at 2025-07-29 12:06:00: service1, service2
Thought:I have now retrieved the affected services for all anomaly timestamps.
Final Answer: Anomalies detected at us-west1: 2025-07-29 12:04:00, 2025-07-29 12:05:00, 2025-07-29 12:06:00. Affected services at these timestamps: service1, service2
> Finished chain.
From our service database, we can see that only the two services in that region that had been created before the anomaly timestamps (service1 and service2) appear in the Final Answer output.
5. Show me the code:
https://github.com/badrinatarajan/ai-sandbox4
6. Experiments, Conclusions and More:
Getting the Prophet library and friends installed was a breeze. A caveat, though, is that the project may not be in active development, so no big changes or upgrades should be expected. That said, the library handles several months of data easily, with minimal and expected latencies when running locally.
Getting the prompts to work as desired required some tuning though. Note that the database to use, CSV files, etc. were passed in the prompts (as opposed to being hardcoded). Llama3.2 and Gemma3 were able to generate the SQL queries quite easily, though the queries themselves were relatively simple. More complex joins might need a bit of work, or the queries might need to be provided statically instead of being generated dynamically.
The LangChain toolchain works predominantly with strings as inputs to the tools. In our case (depending on the LLM), it took a bit of work converting the dict to strings. LangChain apparently can work with structured tools that accept non-string inputs, but this was not tested here.
DeepSeek still seems to have issues working reliably with the LangChain toolchain. This might just be because of the thinking mode, where the LLM generates </think> tags that are not expected; there did not seem to be a way to turn this off via APIs or even prompts. This needs to be investigated.