- ADELE: Anomaly Detection from Event Log Empiricism
Sudden slowdowns or shutdowns of an enterprise application can affect a large population of users. System administrators and analysts spend a considerable amount of time dealing with functional and performance bugs. These problems are particularly hard to detect and diagnose in most computer systems because a huge amount of system-generated supportability data (counters, logs, etc.) needs to be analyzed, and most often there is no clear or obvious root cause. Timely identification of a significant change in application behavior is very important to prevent negative impact on the service. In this work, we present ADELE, an empirical, data-driven methodology for early detection of anomalies in data storage systems. ADELE learns from a system’s own history to establish a baseline of normal behavior and gives accurate indications of the time period when something is amiss. Validation on more than 4800 actual support cases shows a ∼83% true positive rate and a ∼12% false positive rate in identifying periods when the machine is not performing normally. We also establish the existence of “problem signatures”, which help map customer problems to issues already seen in the field. ADELE’s capability to predict early paves the way for online failure prediction for customer systems.
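The idea of learning a baseline from a system's own history can be illustrated with a small sketch. This is not ADELE's actual model: it assumes hypothetical per-period event counts, a simple z-score rule, and an arbitrary threshold of 3 standard deviations, all chosen only for illustration.

```python
from statistics import mean, stdev

def flag_anomalous_periods(history, recent, threshold=3.0):
    """Return indices of periods in `recent` whose event count deviates
    from the historical baseline by more than `threshold` standard
    deviations. The z-score rule is an illustrative assumption."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of the baseline
    if sigma == 0:
        # Perfectly flat history: any departure from it counts as anomalous.
        return [i for i, x in enumerate(recent) if x != mu]
    return [i for i, x in enumerate(recent) if abs(x - mu) / sigma > threshold]

# Hypothetical event counts per period (made-up numbers).
history = [100, 102, 98, 101, 99, 100, 103, 97]
recent = [101, 99, 180, 100]

anomalies = flag_anomalous_periods(history, recent)
print(anomalies)
```

With these made-up counts, only the third recent period (index 2) lies far outside the baseline. A real deployment would need many event types and a more careful model of normal behavior.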
- RCA: Root Cause Analysis From Event Log
Anomaly detection from noisy system logs is a very challenging task. Even when an alert is triggered or anomalous behavior is detected, finding the actual root cause in a large-scale system is more difficult and time-consuming, for three reasons. First, modules of various types depend on one another functionally, and these dependencies follow the transitive closure of the call chain. Because of this complexity, a support or monitoring team has to maintain deep domain knowledge of almost every module and its semantics. When the system evolves quickly through rapid deployment of new features, it is very hard for the team to keep such knowledge up to date. In addition, each module has its own dedicated support team, and coordination between the teams of different modules is missing. Second, the total number of modules is large, and each module can generate hundreds of metrics. Third, in such a large, complex, evolving system, prior knowledge of the dependency tree between modules is not available. Typically, the person concerned has to go through different log files and event data to perform Root Cause Analysis (RCA) of an issue that occurred in the system. The complexity and size of the logs often make it difficult for human operators and administrators to track the problem and perform root cause analysis. A big challenge is to provide the tools and techniques that let operators focus their attention on specific subsystems (modules), thus reducing the complexity of the diagnostic process. Automatic failure diagnosis and root cause analysis support mechanisms can potentially narrow down, or even localize, faults within a very short time, which helps preserve both system availability and customer satisfaction. Because it is very hard to pinpoint the exact root-cause module in such a large-scale complex system, we propose a methodology that generates a set of modules likely to be responsible for the problem.
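Producing a candidate set of modules, rather than a single culprit, can be sketched as a ranking problem. The example below is not the proposed methodology: it assumes one hypothetical metric per module, scores each module by how far its current value deviates from its own history, and returns the top `top_k` modules. The module names and numbers are invented for illustration.

```python
from statistics import mean, stdev

def candidate_root_cause_modules(history, current, top_k=2):
    """Rank modules by the deviation of their current metric from their own
    history and return the `top_k` most deviant ones as candidates.
    The scoring rule is an illustrative assumption."""
    scores = {}
    for module, past in history.items():
        sigma = stdev(past)
        deviation = abs(current[module] - mean(past))
        if sigma > 0:
            scores[module] = deviation / sigma
        else:
            # Flat history: any change is treated as maximally suspicious.
            scores[module] = float("inf") if deviation > 0 else 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Hypothetical per-module metric histories and current readings.
history = {
    "storage": [10, 12, 11, 9, 8],
    "network": [5, 6, 5, 4, 5],
    "scheduler": [50, 52, 49, 51, 48],
}
current = {"storage": 40, "network": 6, "scheduler": 58}

candidates = candidate_root_cause_modules(history, current)
print(candidates)
```

Returning a short ranked list lets an operator focus on a few subsystems first. A fuller approach would also have to account for dependencies between modules, which this sketch ignores.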
Members & Collaborations:
- Bivas Mitra
- Niloy Ganguly
- Jayanta Basak (NetApp)
- Ajay Bakshi (NetApp)