You are here
ARCHITECTURAL SUPPORT FOR IMPROVING SYSTEMHARDWARE/SOFTWARE RELIABILITY
- Date Issued:
- 2010
- Abstract/Description:
- It is a great challenge to build reliable computer systems with unreliable hardware and buggy software. On one hand, software bugs account for as much as 40% of system failures and incur high cost, an estimate of $59.5B a year, on the US economy. On the other hand, under the current trends of technology scaling, transient faults (also known as soft errors) in the underlying hardware are predicted to grow at least in proportion to the number of devices being integrated, which further exacerbates the problem of system reliability. We propose several methods to improve system reliability both in terms of detecting and correcting soft-errors as well as facilitating software debugging. In our first approach, we detect instruction-level anomalies during program execution. The anomalies can be used to detect and repair soft-errors, or can be reported to the programmer to aid software debugging. In our second approach, we improve anomaly detection for software debugging by detecting different types of anomalies as well as by removing false-positives. While the anomalies reported by our first two methods are helpful in debugging single-threaded programs, they do not address concurrency bugs in multi-threaded programs. In our third approach, we propose a new debugging primitive which exposes the non-deterministic behavior of parallel programs and facilitates the debugging process. Our idea is to generate a time-ordered trace of events such as function calls/returns and memory accesses in different threads. In our experience, exposing the time-ordered event information to the programmer is highly beneficial for reasoning about the root causes of concurrency bugs.
Title: | ARCHITECTURAL SUPPORT FOR IMPROVING SYSTEMHARDWARE/SOFTWARE RELIABILITY. |
29 views
9 downloads |
---|---|---|
Name(s): |
Dimitrov, Martin, Author Zhou, Huiyang, Committee Chair University of Central Florida, Degree Grantor |
|
Type of Resource: | text | |
Date Issued: | 2010 | |
Publisher: | University of Central Florida | |
Language(s): | English | |
Abstract/Description: | It is a great challenge to build reliable computer systems with unreliable hardware and buggy software. On one hand, software bugs account for as much as 40% of system failures and incur high cost, an estimate of $59.5B a year, on the US economy. On the other hand, under the current trends of technology scaling, transient faults (also known as soft errors) in the underlying hardware are predicted to grow at least in proportion to the number of devices being integrated, which further exacerbates the problem of system reliability. We propose several methods to improve system reliability both in terms of detecting and correcting soft-errors as well as facilitating software debugging. In our first approach, we detect instruction-level anomalies during program execution. The anomalies can be used to detect and repair soft-errors, or can be reported to the programmer to aid software debugging. In our second approach, we improve anomaly detection for software debugging by detecting different types of anomalies as well as by removing false-positives. While the anomalies reported by our first two methods are helpful in debugging single-threaded programs, they do not address concurrency bugs in multi-threaded programs. In our third approach, we propose a new debugging primitive which exposes the non-deterministic behavior of parallel programs and facilitates the debugging process. Our idea is to generate a time-ordered trace of events such as function calls/returns and memory accesses in different threads. In our experience, exposing the time-ordered event information to the programmer is highly beneficial for reasoning about the root causes of concurrency bugs. | |
Identifier: | CFE0002975 (IID), ucf:47941 (fedora) | |
Note(s): |
2010-05-01 Ph.D. Engineering and Computer Science, School of Electrical Engineering and Computer Science Doctorate This record was generated from author submitted information. |
|
Subject(s): |
debugging reliability computer architecture |
|
Persistent Link to This Record: | http://purl.flvc.org/ucf/fd/CFE0002975 | |
Restrictions on Access: | public | |
Host Institution: | UCF |