2008-06-22

Automated Debugging (5): reproducing problems - better than what I had thought

1. Why reproduce?

* Observe the problem. Without being able to reproduce the problem, one cannot observe it or find any new facts.
* Check for success. Without reproducing the problem, how do you know that it is actually fixed?

2. Reproducing is Tough

* Reproducing is one of the toughest problems in debugging.
* One must recreate the environment in which the problem occurred.
* One must also recreate the problem history: the steps that led to the problem.

3. Reproducing the Environment

* Iterative Reproduction
* Start with your environment
* While the problem is not reproduced, adapt more and more circumstances from the user’s environment
* Iteration ends when problem is reproduced (or when environments are “identical”)
* Side effect: Learn about failure-inducing circumstances
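
The iteration above can be sketched in a few lines of Python. This is only an illustration: environments are modeled as plain dictionaries, and `fails` is a hypothetical callback that runs the program in a given environment and reports whether the problem shows up.

```python
from typing import Callable, Dict, List, Tuple


def reproduce_iteratively(
    own_env: Dict[str, str],
    user_env: Dict[str, str],
    fails: Callable[[Dict[str, str]], bool],
) -> Tuple[bool, List[str]]:
    """Adopt one user setting at a time until the failure shows up."""
    env = dict(own_env)
    adopted: List[str] = []
    if fails(env):
        return True, adopted
    for key, value in user_env.items():
        if env.get(key) == value:
            continue  # already identical in this respect
        env[key] = value
        adopted.append(key)
        if fails(env):
            return True, adopted
    # Environments are now "identical" and the problem still did not occur.
    return False, adopted


# Hypothetical example: the failure only occurs with the user's locale.
own = {"os": "linux", "locale": "en_US", "lib": "1.2"}
user = {"os": "linux", "locale": "tr_TR", "lib": "1.3"}
reproduced, adopted = reproduce_iteratively(
    own, user, lambda env: env["locale"] == "tr_TR"
)
print(reproduced, adopted)  # True ['locale']
```

The list of adopted settings is the side effect mentioned above: it narrows down which circumstances induce the failure.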

4. Reproducing Execution

* Basic idea: Any execution is determined by the input (in a general sense)
* Reproducing input → reproducing execution!

4.1 Data

* Easy to transfer and replicate
* Caveat #1: Get all the data you need
* Caveat #2: Get only the data you need
* Caveat #3: Privacy issues

4.2 User interaction

* Record and replay
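
A minimal sketch of record and replay for user interaction, assuming the program receives its input as a stream of events. The `Editor`, `Recorder`, and `replay` names are made up for this example; a real GUI toolkit would need its own hooks.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, str]  # (kind, payload), e.g. ("key", "a")


class Editor:
    """Toy program under test: its state depends only on the events it receives."""

    def __init__(self) -> None:
        self.text = ""

    def handle(self, event: Event) -> None:
        kind, payload = event
        if kind == "key":
            self.text += payload
        elif kind == "backspace":
            self.text = self.text[:-1]


class Recorder:
    """Sits between the user and the program and logs every event."""

    def __init__(self, target: Callable[[Event], None]) -> None:
        self.target = target
        self.log: List[Event] = []

    def handle(self, event: Event) -> None:
        self.log.append(event)
        self.target(event)


def replay(log: List[Event], target: Callable[[Event], None]) -> None:
    """Feed the recorded events to a fresh instance of the program."""
    for event in log:
        target(event)


original = Editor()
recorder = Recorder(original.handle)
for event in [("key", "h"), ("key", "u"), ("backspace", ""), ("key", "i")]:
    recorder.handle(event)

fresh = Editor()
replay(recorder.log, fresh.handle)
print(original.text, fresh.text)  # hi hi
```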

4.3 Communication

* General idea: record and replay, as with user interaction
* Recording all communication has a bad impact on performance
* Alternative #1: Only record since the last checkpoint (= reproducible state)
* Alternative #2: Only record the “last” transaction
ZW comments: not sure what the alternatives mean.

4.4 Randomness

* Program behaves differently in every run
* Based on a random number generator
* Pseudo-random: save the seed (and make it configurable). The same applies to time of day
* True random: record + replay the sequence
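
As a sketch of the "save the seed and make it configurable" idea, using Python's standard `random` module. The `make_rng` helper is a name invented for this example; printing the seed stands in for writing it to a log file.

```python
import os
import random
from typing import Optional, Tuple


def make_rng(seed: Optional[int] = None) -> Tuple[random.Random, int]:
    """Create a generator from a given seed, or pick a fresh seed and report it."""
    if seed is None:
        seed = int.from_bytes(os.urandom(4), "big")
    print(f"random seed: {seed}")  # save this with the bug report
    return random.Random(seed), seed


# First run: the seed is chosen freshly, but it is logged.
rng, used_seed = make_rng()
first_run = [rng.randint(1, 6) for _ in range(5)]

# Reproduction run: the logged seed is passed back in.
rng_again, _ = make_rng(used_seed)
second_run = [rng_again.randint(1, 6) for _ in range(5)]

print(first_run == second_run)  # True
```

Time of day can be handled the same way: pass the clock value into the code as a parameter instead of reading it directly, so a recorded value can be substituted later.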

4.5 Operating System

* The OS handles entire interaction between program and environment
* Recording and replaying OS interaction thus makes entire program run reproducible
* Trace the program and replay the trace

Trace Challenges
* Tracing creates lots of data
* Example: a web server with 10 requests/sec and a trace of 10 KB/request produces about 8 GB/day
* All of this must be replayed to reproduce the failure (alternative: checkpoints)
* Huge performance penalty!
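
The arithmetic behind the web server example, assuming that "10 k" means 10 kilobytes of trace per request and using decimal units:

```python
requests_per_second = 10
trace_bytes_per_request = 10 * 1000  # assumption: "10 k" = 10 kilobytes
seconds_per_day = 24 * 60 * 60

bytes_per_day = requests_per_second * trace_bytes_per_request * seconds_per_day
print(bytes_per_day / 1e9)  # 8.64 (GB per day)
```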

4.6 Scheduling

* Thread changes are induced by a scheduler
* It suffices to record the schedule (i.e. the moments in time at which thread switches occur) and to replay it
* Requires deterministic input replay
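
A toy sketch of recording and replaying a schedule. Real thread schedulers are not controllable like this; here generators stand in for threads, each `yield` marks a point where a switch may occur, and all names are made up for the example.

```python
import random
from typing import Callable, Dict, Iterator, List


def worker(name: str, shared: List[str], steps: int) -> Iterator[None]:
    for i in range(steps):
        shared.append(f"{name}{i}")  # visible effect of this thread
        yield  # a thread switch may occur here


def run(threads: Dict[str, Iterator[None]],
        pick: Callable[[List[str]], str]) -> List[str]:
    """Run all threads to completion; return the schedule that was used."""
    schedule: List[str] = []
    while threads:
        name = pick(sorted(threads))  # choose the next thread to run
        schedule.append(name)
        try:
            next(threads[name])
        except StopIteration:
            del threads[name]
    return schedule


def make_threads(shared: List[str]) -> Dict[str, Iterator[None]]:
    return {"A": worker("A", shared, 3), "B": worker("B", shared, 3)}


# Record: a random scheduler decides, and its decisions are logged.
first: List[str] = []
schedule = run(make_threads(first), random.Random().choice)

# Replay: the logged decisions are fed back in the same order.
second: List[str] = []
decisions = iter(schedule)
run(make_threads(second), lambda runnable: next(decisions))

print(first == second)  # True
```

The replay only works because the workers are otherwise deterministic, which is the "requires deterministic input replay" condition above.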

Constructive Solutions
* Lock resource before writing
* Check resource update time before writing
* ... or any other synchronization mechanism

4.7 Physical Influences

Rare and hard to reproduce
* Static electricity
* Alpha particles (not cosmic rays)
* Quantum effects
* Humidity
* Mechanical failures + real bugs

4.8 Debugging Tools

* Heisenbug: Code fails outside debugger only
* Bohr Bug, Mandelbug, Schrödinbug

Bohr Bug = Repeatable under well-defined conditions
Heisenbug = Changes when observed
Mandelbug = Causes are complex and chaotic; appears non-deterministic, but isn’t
Schrödinbug = Never should have worked, and promptly fails as soon as one realizes this

5. Isolating Units

* Capture + replay unit instead of program
* Needs a unit control layer to monitor input

Examples:
* Databases. Replay only the interaction with the database.
* Compilers. Record + replay intermediate data structures rather than the entire front-end.
* Networking. Record + replay communication calls.

More Interaction
* Variables (hard to detect)
* Other units (break dependency if needed)
* Time (record + replay, too)
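
One way to picture the unit control layer, using the database example: a wrapper records every query and its result, and a replay layer later answers the same queries from the log, without the real database. This is a sketch under assumptions; `PriceDatabase`, `RecordingLayer`, `ReplayLayer`, and `order_total` are invented for illustration.

```python
from typing import Dict, List, Tuple


class PriceDatabase:
    """Stand-in for a real database."""

    def __init__(self, prices: Dict[str, int]) -> None:
        self.prices = prices

    def query(self, item: str) -> int:
        return self.prices[item]


class RecordingLayer:
    """Control layer: forwards queries and logs each (query, result) pair."""

    def __init__(self, real: PriceDatabase) -> None:
        self.real = real
        self.log: List[Tuple[str, int]] = []

    def query(self, item: str) -> int:
        result = self.real.query(item)
        self.log.append((item, result))
        return result


class ReplayLayer:
    """Answers queries from the log; complains if the interaction differs."""

    def __init__(self, log: List[Tuple[str, int]]) -> None:
        self.log = list(log)

    def query(self, item: str) -> int:
        expected, result = self.log.pop(0)
        if item != expected:
            raise AssertionError(f"expected query {expected!r}, got {item!r}")
        return result


def order_total(db, items: List[str]) -> int:
    """The code being debugged; it only talks to the database via query()."""
    return sum(db.query(item) for item in items)


recording = RecordingLayer(PriceDatabase({"apple": 3, "pear": 5}))
total = order_total(recording, ["apple", "pear", "apple"])

replayed = order_total(ReplayLayer(recording.log), ["apple", "pear", "apple"])
print(total, replayed)  # 11 11
```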

