High service availability is crucial for cloud systems. A typical cloud system uses a large number of physical hard disk drives and solid-state drives, and disk errors are among the most important causes of service unavailability. Disk errors (such as reallocated sector errors and long access latency) can be seen as a form of gray failure: subtle failures that are hard to detect even when applications are afflicted by them. In this talk, we introduce an approach that predicts disk errors proactively, before they cause severe damage to the cloud system. The ability to predict faulty disks enables live migration of existing virtual machines and allocation of new virtual machines to healthy disks, thereby improving service availability. To build an accurate online prediction model, we use both disk-level sensor (SMART) data and system-level signals. We develop a cost-sensitive, ranking-based machine learning model that learns the characteristics of past faulty disks and ranks disks by their error-proneness in the near future. We evaluate our approach using real-world data collected from a production cloud system.
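As a rough illustration of the cost-sensitive ranking idea, the sketch below ranks disks by a predicted error-proneness score and picks how many top-ranked disks to flag so that the total misclassification cost on labeled history is lowest. It is a minimal sketch under stated assumptions, not the model presented in the talk: the function name `choose_cutoff`, the disk names, the scores, and the cost values `cost_fp` and `cost_fn` are all hypothetical, and the scores stand in for the output of some trained ranking model.

```python
from typing import List, Tuple

# Each record: (disk name, predicted error-proneness score, was the disk actually faulty?)
# All names, scores, and labels here are made up for illustration.
ScoredDisk = Tuple[str, float, bool]


def choose_cutoff(scored: List[ScoredDisk], cost_fp: float, cost_fn: float) -> Tuple[int, float]:
    """Return (r, cost): flagging the top-r ranked disks gives the lowest total cost.

    cost_fp is the assumed cost of flagging a healthy disk (e.g. a needless migration);
    cost_fn is the assumed cost of missing a faulty disk.
    """
    # Rank disks from most to least error-prone.
    ranked = sorted(scored, key=lambda d: d[1], reverse=True)
    total_faulty = sum(1 for _, _, faulty in ranked if faulty)

    # Baseline: flag nothing, so every faulty disk is missed.
    best_r, best_cost = 0, cost_fn * total_faulty
    true_pos = 0
    false_pos = 0
    for r, (_, _, faulty) in enumerate(ranked, start=1):
        if faulty:
            true_pos += 1
        else:
            false_pos += 1
        cost = cost_fp * false_pos + cost_fn * (total_faulty - true_pos)
        if cost < best_cost:
            best_r, best_cost = r, cost
    return best_r, best_cost


history = [
    ("disk-a", 0.91, True),
    ("disk-b", 0.85, False),
    ("disk-c", 0.80, True),
    ("disk-d", 0.40, False),
    ("disk-e", 0.35, True),
    ("disk-f", 0.10, False),
]
r, cost = choose_cutoff(history, cost_fp=1.0, cost_fn=3.0)
flagged = [name for name, _, _ in sorted(history, key=lambda d: d[1], reverse=True)[:r]]
print(r, cost, flagged)
```

Setting `cost_fn` higher than `cost_fp` pushes the cutoff deeper into the ranking, trading extra migrations of healthy disks for fewer missed failures; the right ratio depends on the operator's actual costs.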
Senior Data Scientist
Youjiang Wu is a senior data scientist at Microsoft Azure. His focus area is AIOps, i.e., using big data analytics, machine learning, and other AI technologies to build and operate cloud services efficiently and effectively. The long-term vision is to build services that are self-monitoring, self-healing, and self-managing with little human intervention. He currently leads the effort to build a cloud prediction platform for various AIOps scenarios in Azure. Before joining Microsoft, he worked at Amazon, using big data analytics and machine learning to optimize inventory management and supply chain operations. He holds two patents, one on predicting hardware failure and the other on managing expiring inventory. He studied at Beijing Institute of Technology and the University of Washington.