Replay Debugging for the Datacenter

Author: Gautam Deepak Altekar
Total Pages: 194
Release: 2011

Debugging large-scale, data-intensive, distributed applications running in a datacenter ("datacenter applications") is complex and time-consuming. The key obstacle is non-deterministic failures: hard-to-reproduce program misbehaviors that are immune to traditional cyclic debugging techniques. Datacenter applications are rife with such failures because they operate in highly non-deterministic environments: a typical setup employs thousands of nodes, spread across multiple datacenters, to process terabytes of data per day. In these environments, existing methods for debugging non-deterministic failures are of limited use. They either incur excessive production overheads or don't scale to multi-node, terabyte-scale processing. To help remedy the situation, we have built a new deterministic replay tool. Our tool, called DCR, enables the reproduction and debugging of non-deterministic failures in production datacenter runs. The key observation behind DCR is that debugging does not always require a precise replica of the original datacenter run. Instead, it often suffices to produce some run that exhibits the original behavior of the control plane, the most error-prone component of datacenter applications. DCR leverages this observation to relax the determinism guarantees offered by the system and, consequently, to address key requirements of production datacenter applications: lightweight recording of long-running programs, causally consistent replay of large-scale clusters, and out-of-the-box operation with existing, real-world applications running on commodity multiprocessors.
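
The general record/replay idea behind deterministic replay tools can be illustrated with a minimal Python sketch. It is not DCR's implementation: the names Recorder, Replayer, nondet, and handle_request are hypothetical, and the sketch assumes every source of non-determinism is routed through a single nondet call. During recording, each non-deterministic result is logged in order; during replay, the logged values are returned in place of real calls, so the same control decisions are reproduced.

```python
import random
import time


class Recorder:
    """Runs non-deterministic calls for real and logs each result in order."""

    def __init__(self):
        self.log = []

    def nondet(self, fn, *args):
        value = fn(*args)
        self.log.append(value)
        return value


class Replayer:
    """Returns logged results in recorded order instead of calling fn."""

    def __init__(self, log):
        self._values = iter(log)

    def nondet(self, fn, *args):
        return next(self._values)


def handle_request(env):
    """Toy control logic whose outcome depends on non-deterministic inputs."""
    started = env.nondet(time.time)
    replica = env.nondet(random.randint, 0, 2)
    return f"routed to replica {replica} at {started:.3f}"


recorder = Recorder()
original = handle_request(recorder)                 # production run
replayed = handle_request(Replayer(recorder.log))   # debugging run
assert original == replayed
```

Logging only the inputs that drive control decisions, rather than all data, is one way to keep a recording small. Which inputs a real tool would need to capture is an assumption of this sketch.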

Leveraging Distributed Tracing and Container Cloning for Replay Debugging of Microservices

Author: Mihir Mathur
Total Pages: 69
Release: 2020

Microservice architectures have gained prominence in recent years for building large-scale industrial distributed systems. However, microservice architectures make replay debugging, a powerful technique for finding root causes of faults, very challenging because of polyglot services (written in several languages), the large accumulated state of services, and the tight latency limits imposed by long hop-chains. This work attempts to provide a framework for enabling replay debugging in production microservice applications. We study 25 real-world faults in microservice systems collected from diverse sources, categorize these faults by fault symptoms, and create 15 application-agnostic mutation operators for microservices. We then propose a language-agnostic replay debugging framework for microservice applications that uses a distributed tracing system to record network requests and enables replay of those requests on cloned service containers running in a debug environment. A key component of this framework is an anomaly detector that uses span-level and container-level monitoring to detect the fault symptoms found in our study and localizes faults to the trace level so that faulty traces can be easily replayed to find the root cause. An open-source microservices application, injected successively with the mutation operators, is used for an evaluation showing that our framework is up to an order of magnitude lighter-weight than language-specific recording tools such as Chrome DevTools or VisualVM and can help in finding the root causes of 9 out of 15 mutations at a line or function level.
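
The record-then-replay flow described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the framework from the thesis: TraceRecorder, replay, and cart_total are hypothetical names, plain functions stand in for services and their cloned containers, and a dictionary stands in for the distributed tracing backend.

```python
from collections import defaultdict
from copy import deepcopy


class TraceRecorder:
    """Stores each request under its trace ID as it reaches a service."""

    def __init__(self):
        self.traces = defaultdict(list)

    def wrap(self, service_name, handler):
        def recorded(trace_id, request):
            # Record before handling so failing requests are captured too.
            self.traces[trace_id].append((service_name, deepcopy(request)))
            return handler(request)
        return recorded


def replay(trace, clones):
    """Re-issues every recorded request of one trace against cloned handlers."""
    results = []
    for service_name, request in trace:
        try:
            results.append((service_name, clones[service_name](request)))
        except Exception as exc:  # keep the fault for inspection
            results.append((service_name, exc))
    return results


def cart_total(request):
    # Hypothetical service with a seeded fault: fails on an empty cart.
    return sum(request["prices"]) / len(request["prices"])


recorder = TraceRecorder()
cart = recorder.wrap("cart", cart_total)
cart("trace-1", {"prices": [4.0, 6.0]})
try:
    cart("trace-2", {"prices": []})
except ZeroDivisionError:
    pass  # the production failure; its request is already recorded

# Replay only the faulty trace against the clone in a debug environment.
outcome = replay(recorder.traces["trace-2"], {"cart": cart_total})
```

Because only requests are recorded at the network boundary, the recorder needs no knowledge of the language each service is written in, which is the property the abstract calls language-agnostic.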

Programming Environments for Massively Parallel Distributed Systems

Author: Karsten M. Decker
Publisher: Birkhäuser
Total Pages: 417
Release: 2013-04-17
Genre: Computers
ISBN: 3034885342

Massively Parallel Systems (MPSs), with their promise of scalable computation and storage, are becoming increasingly important for high-performance computing. Their growing acceptance in academia is clearly apparent. In industry, however, their usage remains low. Programming MPSs is still the main obstacle, and solving this software problem is sometimes referred to as one of the most challenging tasks of the 1990s. The 1994 working conference on "Programming Environments for Massively Parallel Systems" was the latest event of the working group WG 10.3 of the International Federation for Information Processing (IFIP) in this field. It succeeded the 1992 conference in Edinburgh on "Programming Environments for Parallel Computing". The research and development work discussed at the conference addresses the entire spectrum of software problems: virtual machines that are less cumbersome to program, more convenient programming models, advanced programming languages, and especially more sophisticated programming tools, as well as algorithms and applications.

Distributed and Parallel Systems

Author: Péter Kacsuk
Publisher: Springer Science & Business Media
Total Pages: 240
Release: 2012-12-06
Genre: Computers
ISBN: 1461544890

Distributed and Parallel Systems: From Instruction Parallelism to Cluster Computing is the proceedings of the third Austrian-Hungarian Workshop on Distributed and Parallel Systems organized jointly by the Austrian Computer Society and the MTA SZTAKI Computer and Automation Research Institute. This book contains 18 full papers and 12 short papers from 14 countries around the world, including Japan, Korea and Brazil. The paper sessions cover a broad range of research topics in the area of parallel and distributed systems, including software development environments, performance evaluation, architectures, languages, algorithms, web and cluster computing. This volume will be useful to researchers and scholars interested in all areas related to parallel and distributed computing systems.

Distributed Computer Systems

Author: H. S. M. Zedan
Publisher: Butterworth-Heinemann
Total Pages: 320
Release: 2014-05-12
Genre: Computers
ISBN: 1483192326

Distributed Computer Systems: Theory and Practice is a collection of papers dealing with the design and implementation of operating systems, including distributed systems such as Amoeba, Argus, Andrew, and Grapevine. One paper discusses the concepts and notations for concurrent programming, particularly language notation and synchronization methods, and compares three classes of languages. Another paper explains load balancing, or load redistribution, as a way to improve system performance, covering both static and adaptive load balancing. To locate and fix errors, the user can choose from various debugging approaches that do not significantly disturb program behavior. The example debuggers pertain to the Ada and occam programming languages. Another paper describes the architecture of a real-time distributed database system used for computer network management, monitoring integration, and the administration and control of both local area and wide area communications networks. The book can prove helpful to programmers, computer engineers, computer technicians, and computer instructors dealing with many aspects of computers, such as programming, hardware interfaces, networking, engineering, or design.

Parallel and Distributed Processing and Applications

Author: Minyi Guo
Publisher: Springer
Total Pages: 971
Release: 2006-11-19
Genre: Computers
ISBN: 3540680705

This book constitutes the refereed proceedings of the 4th International Symposium on Parallel and Distributed Processing and Applications, ISPA 2006, held in Sorrento, Italy, in November 2006. The 79 revised full papers, presented together with five keynote speeches, cover architectures, networks, languages, algorithms, middleware, cooperative computing, software, and applications.