Replay Debugging for the Datacenter

Author: Gautam Deepak Altekar
Total Pages: 194
Release: 2011

Debugging large-scale, data-intensive, distributed applications running in a datacenter ("datacenter applications") is complex and time-consuming. The key obstacle is non-deterministic failures--hard-to-reproduce program misbehaviors that are immune to traditional cyclic debugging techniques. Datacenter applications are rife with such failures because they operate in highly non-deterministic environments: a typical setup employs thousands of nodes, spread across multiple datacenters, to process terabytes of data per day. In these environments, existing methods for debugging non-deterministic failures are of limited use. They either incur excessive production overheads or don't scale to multi-node, terabyte-scale processing. To help remedy the situation, we have built a new deterministic replay tool. Our tool, called DCR, enables the reproduction and debugging of non-deterministic failures in production datacenter runs. The key observation behind DCR is that debugging does not always require a precise replica of the original datacenter run. Instead, it often suffices to produce some run that exhibits the original behavior of the control-plane--the most error-prone component of datacenter applications. DCR leverages this observation to relax the determinism guarantees offered by the system, and consequently, to address key requirements of production datacenter applications: lightweight recording of long-running programs, causally consistent replay of large-scale clusters, and out-of-the-box operation with existing, real-world applications running on commodity multiprocessors.
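The record/replay split that DCR relies on can be illustrated with a minimal sketch. All names below (ControlPlaneRecorder, intercept, and so on) are hypothetical illustrations, not DCR's actual API: during recording, non-deterministic inputs reaching the control plane are logged; during replay, the log is substituted for the live inputs, so the control plane retraces the recorded path.

```python
import random

class ControlPlaneRecorder:
    """Record non-deterministic control-plane inputs during a live run."""
    def __init__(self):
        self.log = []

    def intercept(self, source, value):
        # Log each non-deterministic input as it arrives, then pass it through.
        self.log.append({"source": source, "value": value})
        return value

class ControlPlaneReplayer:
    """Replay a run by substituting logged inputs for live ones."""
    def __init__(self, log):
        self._entries = iter(log)

    def intercept(self, source, value=None):
        # Ignore the live value; return what was recorded at this point.
        entry = next(self._entries)
        assert entry["source"] == source, "replay diverged from the recording"
        return entry["value"]

def control_plane(io, seed_source):
    # A toy "control plane" whose decision depends on a non-deterministic input.
    decision = io.intercept("scheduler", seed_source())
    return "rebalance" if decision > 0.5 else "steady"

rec = ControlPlaneRecorder()
first = control_plane(rec, random.random)

rep = ControlPlaneReplayer(rec.log)
second = control_plane(rep, random.random)
assert first == second  # the replayed run makes the same control-plane decision
```

Note that only the control plane's inputs are logged; the data plane is free to diverge, which is what keeps recording lightweight.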

Handbook of Software Fault Localization

Author: W. Eric Wong
Publisher: John Wiley & Sons
Total Pages: 614
Release: 2023-04-21
Genre: Computers
ISBN: 1119291828

A comprehensive analysis of fault localization techniques and strategies. In Handbook of Software Fault Localization: Foundations and Advances, distinguished computer scientists Prof. W. Eric Wong and Prof. T.H. Tse deliver a robust treatment of up-to-date techniques, tools, and essential issues in software fault localization. The authors offer collective discussions of fault localization strategies with an emphasis on the most important features of each approach. The book also explores critical aspects of software fault localization, such as multiple bugs, successful and failed test cases, coincidental correctness, faults introduced by missing code, the combination of several fault localization techniques, ties within fault localization rankings, concurrency bugs, spreadsheet fault localization, and theoretical studies on fault localization. Readers will benefit from the authors' straightforward discussions of how to apply cost-effective techniques to a variety of specific environments common in the real world. They will also enjoy the in-depth explorations of recent research directions on this topic.
Handbook of Software Fault Localization also includes:
- A thorough introduction to the concepts of software testing and debugging, their importance, typical challenges, and the consequences of poor efforts
- Comprehensive explorations of traditional fault localization techniques, including program logging, assertions, and breakpoints
- Practical discussions of slicing-based, program spectrum-based, and statistics-based techniques
- In-depth examinations of machine learning-, data mining-, and model-based techniques for software fault localization

Perfect for researchers, professors, and students studying and working in the field, Handbook of Software Fault Localization: Foundations and Advances is also an indispensable resource for software engineers, managers, and software project decision makers responsible for schedule and budget control.
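As a concrete instance of the program spectrum-based techniques listed above, the widely used Ochiai metric ranks statements by how strongly their execution correlates with failing tests. The sketch below is a minimal illustration under assumed inputs (per-test coverage sets and pass/fail results), not code from the handbook:

```python
import math

def ochiai(coverage, results):
    """Rank statements by Ochiai suspiciousness.

    coverage: {test_name: set of statements the test executed}
    results:  {test_name: True if the test passed, False if it failed}
    Returns [(statement, score)] sorted from most to least suspicious.
    """
    failed = [t for t in results if not results[t]]
    statements = set().union(*coverage.values())
    scores = {}
    for s in statements:
        ef = sum(1 for t in failed if s in coverage[t])                   # failing tests covering s
        ep = sum(1 for t in results if results[t] and s in coverage[t])   # passing tests covering s
        denom = math.sqrt(len(failed) * (ef + ep))
        scores[s] = ef / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy spectrum: s3 is executed by the failing test t2 and only one passing test,
# so it ranks above s1 (covered by everything) and s2 (never covered on failure).
coverage = {
    "t1": {"s1", "s2", "s3"},
    "t2": {"s1", "s3"},
    "t3": {"s1", "s2"},
}
results = {"t1": True, "t2": False, "t3": True}
ranking = ochiai(coverage, results)
assert ranking[0][0] == "s3"
```

The same scaffolding accommodates other suspiciousness formulas (e.g. Tarantula) by swapping the score expression.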

Debugging parallel programs with instant replay

Author: Thomas J. LeBlanc
Total Pages: 22
Release: 1986
Genre: Debugging in computer science

The debugging cycle is the most common methodology for finding and correcting errors in sequential programs. Cyclic debugging is effective because sequential programs are usually deterministic. Debugging parallel programs is considerably more difficult because successive executions of the same program often do not produce the same results. In this paper we present a general solution for reproducing the execution behavior of parallel programs, termed Instant Replay. During program execution we save the relative order of significant events as they occur, not the data associated with such events. As a result, our approach requires less time and space to save the information needed for program replay than other methods. Our technique is not dependent on any particular form of interprocess communication. It provides for replay of an entire program, rather than individual processes in isolation. No centralized bottlenecks are introduced and there is no need for synchronized clocks or a globally-consistent logical time. We describe a prototype implementation of Instant Replay on the BBN Butterfly Parallel Processor, and discuss how it can be incorporated into the debugging cycle for parallel programs.
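The core idea, logging the relative order of accesses to shared objects (via version numbers) rather than the data itself, can be shown in a toy, single-threaded simulation. The names and the random shuffle below are illustrative stand-ins for true scheduling nondeterminism, not the paper's implementation:

```python
import random

class Versioned:
    """A shared object carrying a version number alongside its value."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0

def record_run(ops):
    """Execute one operation per simulated process in a random interleaving,
    logging only (pid, version) pairs: the relative order of significant
    events, not the data they touched."""
    obj, log = Versioned(), []
    schedule = list(ops)
    random.shuffle(schedule)  # stand-in for scheduling nondeterminism
    for pid, fn in schedule:
        obj.value = fn(obj.value)
        obj.version += 1
        log.append((pid, obj.version))
    return obj.value, log

def replay_run(ops, log):
    """Re-execute the same operations, forced into the logged order."""
    obj, by_pid = Versioned(), dict(ops)
    for pid, _version in log:
        obj.value = by_pid[pid](obj.value)
        obj.version += 1
    return obj.value

# Interleaving matters: (0 + 3) * 2 == 6, but 0 * 2 + 3 == 3.
ops = [(1, lambda v: v + 3), (2, lambda v: v * 2)]
result, log = record_run(ops)
assert replay_run(ops, log) == result  # replay reproduces the recorded outcome
```

Because only version numbers are logged, the recording cost is independent of how much data the operations move, which is the source of the space savings the abstract claims.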

Newt

Author: Soumyarupa De
Total Pages: 78
Release: 2012
ISBN: 9781267478122

Data-intensive scalable computing (DISC) systems facilitate large-scale analytics to mine "big data" for useful information. However, understanding and debugging these systems and analytics is a fundamental challenge to their continued use. This thesis presents Newt, a scalable architecture for capturing fine-grain lineage from DISC systems and using this information to analyze and debug analytics. Newt provides a unique instrumentation API, which actively extracts fine-grain lineage across complex, non-relational analytics. Newt combines this API with a scalable architecture for storing lineage to accommodate the high throughputs of DISC systems. This architecture enables efficient dataflow tracing queries across thousands of operators found in modern data analytics. Newt extends tracing with replay, enabling users to perform step-wise debugging or regenerate lost outputs at a fraction of the cost to execute the entire analytics. Newt further facilitates replay for re-executing analytics without bad inputs to produce error-free outputs. Finally, Newt also enables retrospective lineage analysis, which we use to identify errors in the dataflow using outlier detection techniques. We illustrate the flexibility of Newt's capture API by instrumenting two DISC systems: Apache Hadoop and Hyracks. This API incurs 10-51% time overhead and 30-120% space overhead on workloads consisting of relational and non-relational operators, including a Hadoop-based de novo genomic assembler. Newt can also accurately replay selected outputs, which can reduce the time to recreate errors during debugging. We show that it incurs 0.3% of the original runtime when replaying individual outputs in a WordCount workload. Finally, this work shows the effectiveness of Newt's debugging methodology by pinpointing faulty operators in a dataflow.
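The flavor of Newt's instrumentation approach, wrapping dataflow operators so that every output record logs which input records produced it, and then answering backward tracing queries over that log, can be sketched as follows. The helper names (`traced`, `backward_trace`) are hypothetical and handle only record-at-a-time (map-like) operators, not aggregations:

```python
def traced(name, fn, lineage):
    """Wrap an operator so each produced record logs (operator, input, output)."""
    def wrapper(records):
        out = []
        for rec in records:
            for produced in fn(rec):
                lineage.append((name, rec, produced))
                out.append(produced)
        return out
    return wrapper

def backward_trace(output, lineage):
    """Walk the lineage log backward to find every record that contributed
    to `output`, across all instrumented operators."""
    frontier, contributors = {output}, set()
    for _name, src, dst in reversed(lineage):
        if dst in frontier:
            frontier.add(src)
            contributors.add(src)
    return contributors

# A two-stage toy pipeline: tokenize lines, then uppercase each token.
lineage = []
tokenize = traced("tokenize", lambda line: line.split(), lineage)
upper = traced("upper", lambda word: [word.upper()], lineage)

out = upper(tokenize(["big data"]))
contributors = backward_trace("BIG", lineage)
assert "big data" in contributors  # the original line is traced back from "BIG"
```

A trace like this is also what makes selective replay possible: re-running only the contributing inputs through the pipeline regenerates one output without re-executing the entire job.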