Tuesday, October 31, 2006

Paper: "Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management"

Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management
Chad Verbowski, Emre Kıcıman, Arunvijay Kumar, and Brad Daniels, Microsoft Research; Shan Lu, University of Illinois at Urbana-Champaign; Juhan Lee, Microsoft MSN; Yi-Min Wang, Microsoft Research; Roussi Roussev, Florida Institute of Technology

Abstract

Mismanagement of the persistent state of a system—all the executable files, configuration settings and other data that govern how a system functions—causes reliability problems, security vulnerabilities, and drives up operation costs. Recent research traces persistent state interactions—how state is read, modified, etc.—to help troubleshooting, change management and malware mitigation, but has been limited by the difficulty of collecting, storing, and analyzing the 10s to 100s of millions of daily events that occur on a single machine, much less the 1000s or more machines in many computing environments.

We present the Flight Data Recorder (FDR) that enables always-on tracing, storage and analysis of persistent state interactions. FDR uses a domain-specific log format, tailored to observed file system workloads and common systems management queries. Our lossless log format compresses logs to only 0.5–0.9 bytes per interaction. In this log format, 1000 machine-days of logs—over 25 billion events—can be analyzed in less than 30 minutes. We report on our deployment of FDR to 207 production machines at MSN, and show that a single centralized collection machine can potentially scale to collecting and analyzing the complete records of persistent state interactions from 4000+ machines. Furthermore, our tracing technology is shipping as part of the Windows Vista OS.

1 Comments:

Blogger agmiklas said...

Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management
Chad Verbowski, Emre Kıcıman, Arunvijay Kumar, and Brad Daniels, Microsoft Research; Shan Lu, University of Illinois at Urbana-Champaign; Juhan Lee, Microsoft MSN; Yi-Min Wang, Microsoft Research; Roussi Roussev, Florida Institute of Technology


Chad presented the Flight Data Recorder (FDR), a new tool that will ship with Windows Vista and that logs all changes to a system's persistent state for later analysis. Various system management tasks that have until now been something of a black art essentially reduce to queries over the logs gathered by the FDR. As a motivating example, Chad told of a server at Microsoft that would exhibit extremely poor performance every few weeks. A system administrator eventually determined that the system's page file was being inappropriately shrunk, but he was unable to determine what was shrinking it. The best he could do was to send an e-mail to the other admins asking them to make sure they weren't resizing the file. After the FDR had run on the server for a few weeks, however, its logs quickly pinpointed the offending script.

Current system management tools fall into roughly three categories. Some use a similar logging approach, but activate only on demand because of space constraints. Such tools are of limited use when trying to determine why a particular piece of configuration data changed, since the change has usually already happened by the time tracing is switched on. Another class of tools uses signatures to look for known-bad configurations. However, creating signatures general enough to be useful across a wide variety of machines is a time-consuming process. Finally, manifest-based approaches require that applications provide the system with a list of all their configuration dependencies. However, Chad pointed out that it is difficult to generate manifests for third-party and legacy applications.

The proposed approach simply logs all changes to the system's persistent state. The main contribution of the work is its novel method of encoding the activity logs, which requires on average just 0.5-0.9 bytes per event. Since typical systems generate on the order of 10M events per day, the resulting logs are small enough to be sent over the network, archived, correlated with logs from other systems, and queried quickly. Because of the design of the data file format, common queries can be executed against a day's worth of stored data in as little as three seconds. Finally, Chad mentioned that it should be possible to serve as many as 5000 systems running the FDR with a single archive server.
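As a rough check on these figures, the arithmetic can be sketched in Python. The per-event size, the daily event count, and the 5000-machine fleet size are the ones quoted above; the function name and the use of decimal megabytes are assumptions made for illustration, and nothing here is a measurement.

```python
# Back-of-envelope estimate of compressed FDR log volume.
# Inputs are the figures quoted in the talk summary; nothing here is measured.

BYTES_PER_EVENT_RANGE = (0.5, 0.9)   # average compressed bytes per logged event
EVENTS_PER_MACHINE_DAY = 10_000_000  # "on the order of 10M events per day"


def log_megabytes(machines: int, bytes_per_event: float,
                  events_per_day: int = EVENTS_PER_MACHINE_DAY) -> float:
    """Compressed log volume per day, in megabytes (10**6 bytes)."""
    return machines * events_per_day * bytes_per_event / 1_000_000


# Volume for a single machine at the low and high ends of the range.
low, high = (log_megabytes(1, b) for b in BYTES_PER_EVENT_RANGE)
print(f"one machine: about {low:.0f} to {high:.0f} MB per day")

# Volume arriving at one archive server that collects from 5000 machines.
fleet_low, fleet_high = (log_megabytes(5000, b) for b in BYTES_PER_EVENT_RANGE)
print(f"5000 machines: about {fleet_low / 1000:.0f} to "
      f"{fleet_high / 1000:.0f} GB per day")
```

A machine that generates more than 10M events per day would scale these numbers up proportionally.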

The authors see the FDR as being useful not only for system management, but also for ensuring that various security and management policies are being followed. For example, the logs captured by the FDR can be used to determine how often a locked-down production server is modified without proper approval. The FDR can also be used to assist in locating system "extensibility points": configuration settings that control the loading of extra system services or plug-ins. This has important implications for detecting and removing malware. In summary, the FDR has made it possible to know about all persistent state changes on a system.
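The lockdown-policy check described above could look something like the following sketch. The `Change` record, the idea of "approved change windows", and the sample data are all hypothetical; the summary does not say how the FDR or its users actually represent approvals.

```python
from typing import List, NamedTuple, Tuple


class Change(NamedTuple):
    """Hypothetical record of one persistent-state modification."""
    timestamp: float   # seconds since some epoch
    process: str       # program that made the change
    path: str          # file or setting that was modified


def unapproved_changes(changes: List[Change],
                       approved_windows: List[Tuple[float, float]]) -> List[Change]:
    """Return the changes whose timestamp falls outside every approved window."""
    def is_approved(ts: float) -> bool:
        return any(start <= ts <= end for start, end in approved_windows)

    return [c for c in changes if not is_approved(c.timestamp)]


# Hypothetical sample: one approved maintenance window and three changes.
WINDOWS = [(1000.0, 2000.0)]
CHANGES = [
    Change(900.0, "tweak.cmd", r"C:\app\config.ini"),
    Change(1500.0, "setup.exe", r"C:\app\app.dll"),
    Change(2500.0, "cleanup.cmd", r"C:\pagefile.sys"),
]

violations = unapproved_changes(CHANGES, WINDOWS)
for v in violations:
    print(f"{v.timestamp:.0f}: {v.process} modified {v.path}")
```

Counting the entries in `violations` per server and per week would give the "how often" figure mentioned above.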

Q: Can the FDR be used to predict how a system will respond to a configuration change? (Brad Flight, MSF ? *NOTE* I'm unable to find this individual on the attendee list. The name may be incorrect.)

A: We can search for another system that already has the configuration change, but is otherwise similar to the machine in question. If we can find such a system, we can use it to approximate how this machine would behave with the change.


Q: What is the right way to query the logs generated by the FDR? Should we be using SQL, or perhaps a customized query engine with a programmatic API? (George Candea, EPFL)

A: Right now, we just expose the raw tables as they are in the log file. Any special-built query engine should be optimized to handle queries of the form "what files have been changed since time T", since the most common requests seem to be those that look for modified configuration entries.
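To make the "what files have been changed since time T" query concrete, here is a minimal sketch over an in-memory list of events. The `Event` fields, the operation names, and the sample data are hypothetical; the summary does not describe the FDR's actual tables, and a real engine would query the compressed log format rather than Python objects.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    """Hypothetical record of one persistent-state interaction."""
    timestamp: float   # seconds since some epoch
    process: str       # program that performed the operation
    operation: str     # e.g. "read", "write", "delete"
    path: str          # file or setting that was touched


# Operations treated as modifications (an assumption for this sketch).
MODIFYING_OPS = {"write", "delete", "rename"}


def changed_since(events: Iterable[Event], t: float) -> Dict[str, Event]:
    """Map each path modified at or after time t to its most recent change."""
    latest: Dict[str, Event] = {}
    for e in events:
        if e.operation not in MODIFYING_OPS or e.timestamp < t:
            continue
        prev = latest.get(e.path)
        if prev is None or e.timestamp > prev.timestamp:
            latest[e.path] = e
    return latest


# Hypothetical sample log.
SAMPLE = [
    Event(100.0, "backup.exe", "read", r"C:\data\report.doc"),
    Event(200.0, "cleanup.cmd", "write", r"C:\pagefile.sys"),
    Event(300.0, "setup.exe", "write", r"C:\app\config.ini"),
    Event(400.0, "cleanup.cmd", "write", r"C:\pagefile.sys"),
]

result = changed_since(SAMPLE, 250.0)
for path, event in sorted(result.items()):
    print(f"{path} last changed at {event.timestamp:.0f} by {event.process}")
```

Keeping the responsible process alongside each path is what would let an administrator answer the page-file question from the talk: not only that the file changed, but which script changed it.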


Q: Other types of events might be worth logging. For example, did you consider logging all socket activity?

A: We did think about logging IPC activities, but the system doesn't currently implement this feature.

[NOTE: This talk summary has not been checked by the paper authors.]

9:03 PM  
