Production computing infrastructures, particularly multi-tenant cloud infrastructures, have become increasingly complex and require constant monitoring and maintenance. Cloud service providers face both high operating costs and steep penalties for service downtime. Existing monitoring tools continuously collect large amounts of metric and log data but still fail to answer the key operational questions of when and why a cloud infrastructure experiences a problem. In this talk, I will present a set of automatic system anomaly prediction and diagnosis techniques based on unsupervised online learning methods. Our techniques can raise advance alerts before an anomaly affects the system and provide important clues about why the anomaly occurs. Furthermore, we can extract causal relationships among a large number of system components and anomalous events to facilitate automatic root cause analysis. We have tested our techniques on hundreds of real system failures across more than a dozen production systems. Our results show that we can increase anomaly detection coverage by 50%, reduce false alarm rates by more than 95%, and cut root cause analysis time from days to minutes.
11:40 am - 11:50 am Arrival and socializing
11:50 am - 12:00 pm Opening
12:00 pm - 12:50 pm Helen Gu, "AIOps for Intelligent Cloud Operations"
12:50 pm - 1:10 pm Q&A
Speaker: Helen Gu, InsightFinder Inc.
Please register here
Webinar ID: 839 8556 8755