Human-Friendly Observability with Generative AI
Generative AI Enhancing Observability
Preface
Imagine building a complex web application made up of many microservices working together. It is a pure SaaS-native application with many elements that can go wrong. To provide 99.99% uptime & high availability, all the components/services should be up & running at any given point in time.
Sounds Challenging?
It is indeed. Now let’s put ourselves in the shoes of a user who is trying to edit some data in a table shown in a widget on the screen, & the edit fails. There are two questions to answer here.
- How does the application operations team know that something is going wrong & take a preventive/proactive step to correct the state of the system?
- Now that the error has happened, how is the user made aware of what went wrong and why & what can the user do next to recover from that error?
This problem is not new & we have been using a lot of sophisticated mechanisms to answer both the questions above. In this article let us see how generative AI can help us answer them better.
Observability & AI — Benefits
Observability has evolved into a critical aspect of modern application architectures, demanding advanced tools and methodologies to ensure efficient monitoring and troubleshooting. Traditional monitoring tools often struggle to keep up with the complexity of distributed systems, leading to a need for innovative solutions.
Dynamic Baseline Establishment
Establishing a baseline for normal system behavior is crucial for effective observability. Generative AI can dynamically adapt to changes in application behavior and adjust baselines accordingly. This adaptability is essential in dynamic environments where traditional static baselines may not accurately represent the system’s normal state. By continuously learning and updating baselines, Generative AI ensures that observability tools remain effective in the face of evolving application architectures.
Example:
Consider a web application that experiences traffic variations throughout the day. Traditional baselines might fail to adapt to these changes. Generative AI continually learns from the application’s behavior, adjusting baselines dynamically to accommodate fluctuations and ensuring accurate anomaly detection.
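One simple way to picture a baseline that moves with the traffic is an exponentially weighted moving average. The sketch below is a plain statistical stand-in, not a generative model, and the class name, smoothing factor, warm-up length and tolerance are all illustrative assumptions rather than values from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveBaseline:
    """Baseline that drifts with the metric (exponentially weighted mean/variance)."""

    alpha: float = 0.2      # assumed smoothing factor: higher adapts faster
    tolerance: float = 3.0  # assumed band width, in standard deviations
    warmup: int = 5         # assumed number of samples before flagging anything
    mean: Optional[float] = None
    variance: float = 0.0
    count: int = 0

    def is_anomalous(self, value: float) -> bool:
        """Compare a value with the band learned so far."""
        if self.mean is None or self.count < self.warmup:
            return False  # not enough history to judge
        band = self.tolerance * self.variance ** 0.5
        return abs(value - self.mean) > band

    def update(self, value: float) -> None:
        """Fold a new observation into the running mean and variance."""
        self.count += 1
        if self.mean is None:
            self.mean = value
            return
        delta = value - self.mean
        self.mean += self.alpha * delta
        self.variance = (1 - self.alpha) * (self.variance + self.alpha * delta ** 2)


baseline = AdaptiveBaseline()
for requests_per_minute in [100, 102, 98, 101, 99, 100]:  # made-up traffic samples
    baseline.update(requests_per_minute)
```

On this state, `baseline.is_anomalous(150)` flags the value while a reading close to 100 does not. Because every `update` shifts the mean, a sustained rise in traffic gradually becomes the new normal instead of alerting forever.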
Automated Anomaly Detection
Generative AI excels at pattern recognition and anomaly detection. By training models on historical data and expected behaviors, AI algorithms can automatically identify deviations from normal patterns. In the context of observability, this means the ability to detect anomalies in application metrics, logs, and traces in real time. This automated anomaly detection reduces the time it takes to identify and respond to issues, improving overall system reliability.
Example:
Consider an e-commerce platform experiencing a sudden surge in traffic during a flash sale. Generative AI, trained on historical data, can identify this unusual spike in user activity as an anomaly. The system generates alerts in real time, allowing the operations team to investigate and scale resources accordingly.
Let’s take a hypothetical interaction:
Support Engineer > I got an alert for the spike in incoming traffic.
How many active sessions have we had in the past two hours?
Gen AI > There is an uptick in the incoming traffic.
The average number of users active during 10 PM - 12 AM on the application
is 12k. In the last two hours, during the same time window,
18k users had active sessions.
Support Engineer > Add another node to serve the traffic from the US
region & register it in the load balancer.
Gen AI > Done.
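The comparison behind the bot's answer can be as small as a z-score against the usual figure for that time window. In the sketch below the 12k average and the 18k current reading come from the interaction above, while the individual history samples, the function names and the threshold of 3 are assumptions made for illustration. It is ordinary statistics; the generative layer would sit on top and phrase the result for the engineer.

```python
import math
from statistics import mean, stdev


def session_anomaly(history, current, threshold=3.0):
    """Score the current active-session count against past readings of the same window."""
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation; needs at least two readings
    if spread == 0:
        # Flat history: any difference at all is infinitely many deviations away.
        z_score = 0.0 if current == baseline else math.copysign(math.inf, current - baseline)
    else:
        z_score = (current - baseline) / spread
    return {
        "baseline": baseline,
        "current": current,
        "z_score": z_score,
        "is_anomaly": abs(z_score) > threshold,
    }


def describe(result):
    """Render the score as a sentence a support engineer can read."""
    change = (result["current"] - result["baseline"]) / result["baseline"] * 100
    verdict = "unusual" if result["is_anomaly"] else "within the normal range"
    return (
        f"Average for this window is {result['baseline']:,.0f} active sessions; "
        f"currently {result['current']:,.0f} ({change:+.0f}%), which is {verdict}."
    )


# Hypothetical readings for the same window on previous days (they average 12k).
previous_days = [11_800, 12_100, 12_300, 11_900, 11_900]
report = session_anomaly(previous_days, current=18_000)
print(describe(report))
```

Keeping the scoring separate from the wording means the numbers can be checked on their own, whatever model or template ends up writing the sentence.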
Natural Language Processing for Human-Centric Observability
Generative AI, equipped with Natural Language Processing (NLP), can facilitate human-centric observability. By transforming raw data into human-readable insights, NLP-powered AI systems make it easier for developers and system administrators to interpret complex metrics and logs. This enhanced accessibility accelerates issue resolution and fosters collaboration among cross-functional teams.
For example, when developers need to quickly understand the impact of an API change, NLP-equipped Generative AI can turn the raw logs and metrics from before and after the change into human-readable insights. Developers can easily grasp the context, accelerating the debugging process.
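In practice, the hand-off to a language model often amounts to packaging the relevant records and the question into a prompt. A minimal sketch of that packaging step follows; the record fields, the wording of the instructions and the function name are assumptions, and the call to an actual model is deliberately left out because it depends on the provider.

```python
import json


def build_prompt(question, records, max_records=20):
    """Assemble a plain-text prompt from a question and raw log/metric records."""
    lines = [
        "You are assisting an engineer. Answer using only the records below.",
        "If the records do not contain the answer, say so.",
        "",
        "Records (one JSON object per line):",
    ]
    # Keep only the most recent records so the prompt stays within a model's context limit.
    for record in records[-max_records:]:
        lines.append(json.dumps(record, sort_keys=True))
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Hypothetical records from around an API change.
records = [
    {"service": "orders-api", "metric": "p95_latency_ms", "before": 120, "after": 340},
    {"service": "orders-api", "level": "ERROR", "message": "schema validation failed"},
]
prompt = build_prompt("What changed after the latest API release?", records)
```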
Contextual Log Analysis
Traditional log analysis tools often generate overwhelming amounts of data, making it challenging to identify relevant information during troubleshooting. Generative AI can help by providing contextual analysis of logs, extracting meaningful patterns, and correlating events across multiple logs. This contextual understanding allows for quicker root cause analysis, reducing downtime and improving the efficiency of incident response teams.
There could be two consumers of log analysis.
- End-user.
- Technical user.
Example:
In a microservices architecture, logs from different services can be overwhelming. Generative AI, using contextual analysis, identifies correlations between logs, helping teams quickly pinpoint the root cause of issues. For instance, a sudden increase in error logs might be linked to a specific service or component failure.
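Correlation across services usually hinges on a shared identifier. The sketch below groups records by a trace ID and treats the earliest error in each trace as the likely origin; that is a heuristic, and the field names, service names and sample records are made up for illustration.

```python
from collections import defaultdict


def group_by_trace(records):
    """Bucket log records by trace ID, each bucket ordered by timestamp."""
    traces = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    for entries in traces.values():
        entries.sort(key=lambda r: r["ts"])
    return dict(traces)


def likely_origin(entries):
    """Return the earliest ERROR record in a trace, or None if there is none."""
    return next((r for r in entries if r["level"] == "ERROR"), None)


# Hypothetical records from three services; "ts" is seconds since the request started.
logs = [
    {"trace_id": "t-42", "ts": 0.31, "service": "ui", "level": "ERROR", "message": "save failed"},
    {"trace_id": "t-42", "ts": 0.05, "service": "data-service", "level": "INFO", "message": "save requested"},
    {"trace_id": "t-42", "ts": 0.27, "service": "data-service", "level": "ERROR", "message": "connection pool exhausted"},
    {"trace_id": "t-77", "ts": 0.02, "service": "auth", "level": "INFO", "message": "token ok"},
]

for trace_id, entries in group_by_trace(logs).items():
    origin = likely_origin(entries)
    if origin:
        print(f"{trace_id}: first error in {origin['service']}: {origin['message']}")
```

The error the user sees in the UI is the last one in the chain here; ordering by time within a trace is what surfaces the earlier failure in the backing service.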
Let’s take a hypothetical interaction:
Technical User
Support Engineer > User with id "abc123" says that at 2:30 PM PST
he was not able to save data into the component "Data Table".
What went wrong?
Gen AI > After analyzing the logs for user with id "abc123" around
2:30 PM PST, it looks like Operational Data Management Service was
failing with the error: <Error description>.
After looking at the linked logs & trace IDs, it seems the
database connection pool was exhausted.
Information Source: "Error Code: DB_POOL_123".
The proposed resolution is to increase the pool size.
End User
User > I am not able to save data into the component "Data Table".
What is wrong?
Gen AI > The error code is DB_POOL_123 from Service A.
There is a recovery mechanism associated with the error code,
do you want to run the recovery mechanism?
User > Yes please.
Gen AI > It is done, please try again.
User > I can use the widget now.
AI Readiness
All the human-friendly interactions with a bot shown above, & features like auto-healing/auto-scaling, require changes in the way services are designed & in the way we log errors & context in the messages we push to log servers from the UI & services.
- There should be a single thread tying a whole transaction together, e.g. a traceID.
- The services should be AI-ready, and there should be a health check & scaling/healing mechanism for each service which can be invoked in a controlled way.
- The more you help AI, the better it can help you. The richer & better structured our logs are, the more meaningful & human-friendly the answers the LLM can give the bot users. Error codes should be standardized, and each error code should have a recovery, cause & origin associated with it.
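The points above can be sketched as a small error catalog plus a structured log line that carries the trace ID. The code DB_POOL_123, its cause and its proposed resolution are taken from the hypothetical interaction earlier in this article; the class, field and function names are illustrative assumptions, not a standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ErrorSpec:
    """One standardized error code with its origin, cause and recovery."""

    code: str
    origin: str
    cause: str
    recovery: str


# Illustrative catalog; a real one would live in shared configuration.
CATALOG = {
    "DB_POOL_123": ErrorSpec(
        code="DB_POOL_123",
        origin="Operational Data Management Service",
        cause="Database connection pool exhausted",
        recovery="Increase the pool size",
    ),
}


def error_log_line(trace_id, user_id, code):
    """Build one JSON log line that ties the error to a transaction and a user."""
    spec = CATALOG[code]  # raises KeyError for codes missing from the catalog
    return json.dumps({"trace_id": trace_id, "user_id": user_id, **asdict(spec)})


print(error_log_line("trace-001", "abc123", "DB_POOL_123"))
```

With every error logged this way, a bot answering "what went wrong?" only has to look up fields that are already there, rather than infer cause and recovery from free-form messages.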
Conclusion
Generative AI offers a multifaceted approach to improving observability, as illustrated by the examples and block diagrams. Automated anomaly detection, predictive analysis, contextual log analysis, dynamic baseline establishment, and natural language processing collectively empower technical architects to build resilient and efficient systems. The integration of Generative AI into observability practices not only streamlines troubleshooting but also fosters a proactive approach to managing application performance, ensuring the reliability and optimal functioning of modern applications in dynamic environments. As technology advances, the marriage of Generative AI and observability is poised to redefine how we monitor and maintain complex systems.
While the outcomes are promising, there are always data security & privacy concerns that go hand in hand with LLMs & Generative AI.