© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/DSD.2019.00080

# Design of SRAM-based Low-Cost SEU Monitor for Self-Adaptive Multiprocessing Systems

Junchao Chen\*, Marko Andjelkovic\*, Aleksandar Simevski\*, Yuanqing Li\*, Patryk Skoncej\*, Milos Krstic\*†

\*IHP-Leibniz-Institut für innovative Mikroelektronik, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany

<sup>†</sup>University of Potsdam, August-Bebel-Str. 89, 14482 Potsdam, Germany

{chen,andjelkovic,simevski,krstic}@ihp-microelectronics.com

Abstract—Cosmic radiation phenomena such as Solar Particle Events cause high radiation flux lasting from hours to days, thus increasing the probability of Single-Event Upsets (SEUs) by several orders of magnitude. In space applications it is therefore necessary to monitor the SEU rate in order to ensure timely detection of high radiation levels and efficient protection of radiation-sensitive circuits. This work proposes an approach combining the SEU monitoring and data storage functions in the same on-chip Static Random Access Memory (SRAM) module, with negligible cost and overheads compared to traditional stand-alone SEU monitors. Furthermore, it also enables the detection of permanent faults in SRAM. The proposed monitor is intended to be further integrated into a highly dependable and self-adaptive multiprocessing platform in which it will drive the selection of the multiprocessor operating modes. Thus, a dynamic trade-off between reliability, performance and power consumption in real-time can be achieved.

*Keywords*—SRAM, SEU monitor, multiprocessing system, reliability

### I. INTRODUCTION

Radiation-induced effects, in particular Single Event Upsets (SEUs), are one of the major concerns in the design of modern nano-scale CMOS integrated circuits for space applications [1]. An SEU is a transient fault in a storage component caused by an energetic particle (e.g. neutron, proton, heavy ion or alpha particle) that passes through the sensitive region within an off-state transistor. The passage of an energetic particle results in charge deposition. The primary condition for an SEU occurrence is that this deposited charge exceeds the critical charge of the element. Generally, SEUs may be caused either when the particle strikes a memory element directly, or when a particle-induced glitch in combinational logic, known as a Single Event Transient (SET), propagates through and is captured by a memory element. As a result of SEUs, a malfunction or complete failure of an electronic system may occur. Thus, efficient protection against SEUs is essential in space missions.

The Galactic Cosmic Rays (GCRs), Solar Particle Events (SPEs) and trapped particles in planetary magnetospheres are the primary origins of particles capable of inducing SEUs [2]. These particles can affect electronic devices directly at high altitudes, or indirectly by interacting with the atmosphere. These phenomena, particularly SPEs, can increase the particle flux by several orders of magnitude, thereby increasing the probability of SEUs. According to the data obtained from several space missions [3]–[5], the SEU rate for different Static Random Access Memories (SRAMs) under background conditions is around $10^{-8}$ upsets/bit/day, while the SEU rate can rise to $10^{-5}$ upsets/bit/day or even higher during an SPE. Since the high-level radiation can last for hours or even days [6], it is vital to monitor the SEU rate in real time in order to detect high radiation levels early and subsequently apply appropriate hardening measures.

SEU rate monitoring can be accomplished by using specialized SEU monitors. The state-of-the-art approaches typically employ stand-alone SEU monitors which are realized as separate functional elements (either discrete or integrated). The two most common solutions are based on radiation-sensitive elements such as memory-based monitors [7]–[13] or pixel detectors [14], [15]. However, these traditional SEU monitors have common shortcomings: 1) stand-alone monitors are often not realized in the same technology as the target system, thus making the data processing more challenging, and 2) the use of stand-alone monitors often increases the overall cost, area and power consumption.

To overcome these limitations, this paper proposes an approach that integrates the SEU monitoring and data storage functions in the same chip. In this way, the cost and area/power overheads can be significantly reduced compared to the solutions based on stand-alone SEU monitors. The proposed SEU monitor is embedded into a radiation-hardened on-chip SRAM for space applications with a scrubbing mechanism based on Single-Error Correction and Double-Error Detection (SEC-DED). The idea is to utilize these mechanisms for the purposes of SEU monitoring. Moreover, permanent faults in the SRAM can also be detected. In order to facilitate the application of self-adaptive fault-tolerance mechanisms and achieve the best possible performance, the proposed monitor is intended to be integrated into a multiprocessing system [16]. The use of the proposed monitor can enable the dynamic self-adaptive selection of the operating modes for a multiprocessing system, providing optimal system reliability under variable radiation conditions in real-time.

The rest of the paper is structured as follows. Section II gives a brief description of the space radiation environment. Section III reviews the related work. The architecture and operation of the proposed SEU monitor are described in Section IV. Analysis of results is given in Section V. Section VI explains the application of the proposed monitor in the multiprocessing system. The conclusion and main direction for future work are outlined in Section VII.

#### **II. SPACE RADIATION ENVIRONMENT**

Space radiation can be classified into two main groups: particles trapped by planetary magnetospheres in radiation belts, and transient radiation particles [17]. The planetary magnetic fields can trap charged particles such as protons, electrons and heavy ions. In the case of Earth, the Van Allen belt is the primary source of trapped charged particles, and the particularly critical region is the South Atlantic Anomaly over the South American continent. The transient radiation fields consist of GCRs and SPEs, and are mainly composed of heavy ions, protons and alpha particles. GCRs are composed of high-energy charged particles that originate outside of the solar system and are modulated by the 11-year solar activity cycle. SPEs are categorized into solar flares and Coronal Mass Ejections (CMEs). The solar flares are mainly electron-rich and can last for hours, while the CMEs are proton-rich and can last for several days. Besides that, cosmic rays and solar particles can hit the top of the atmosphere, and then attenuate to form protons, electrons, heavy ions, neutrons, muons, and pions [2]. The most important products of the attenuation process are neutrons, which are also a common source of SEUs.

TABLE I UPSET RATES DURING LARGE SPEs [18]

| Date           | Background (upsets bit$^{-1}$ day$^{-1}$) | Worst five minutes (upsets bit$^{-1}$ day$^{-1}$) | Worst day (upsets bit$^{-1}$ day$^{-1}$) | Worst week (upsets bit$^{-1}$ day$^{-1}$) |
|----------------|-------------------------------------------|---------------------------------------------------|------------------------------------------|-------------------------------------------|
| April 15, 2001 | $3.7 \times 10^{-8}$                      | $3.8 \times 10^{-5}$                              | $6.1 \times 10^{-7}$                     | $1.3 \times 10^{-7}$                      |
| Nov. 5, 2001   | $3.8 \times 10^{-8}$                      | $2.5 \times 10^{-5}$                              | $7.4 \times 10^{-7}$                     | $2.1 \times 10^{-7}$                      |
| Oct. 28, 2003  | $4.4 \times 10^{-8}$                      | $2.5 \times 10^{-5}$                              | $6.1 \times 10^{-7}$                     | $2.1 \times 10^{-7}$                      |
| Jan. 20, 2005  | $8.1 \times 10^{-8}$                      | $2.4 \times 10^{-5}$                              | $6.5 \times 10^{-7}$                     | $2.3 \times 10^{-7}$                      |

SPEs can dominate the radiation environment, causing high fluxes of high-energy particles [6]. SPEs can rise to a peak flux over several tens of minutes or hours, and then decay over hours or days. The peak flux of SPEs may be two to five orders of magnitude higher than the background levels. SPEs occur on average 20 times per year, and a few of them are strong enough to cause hazards in electronic circuits and systems. As an example, Table I shows the upset rates measured for a $4k \times 32$ bit 0.25-µm CMOS SRAM on a geostationary orbit satellite during several large SPEs [18]. These results show that during the worst irradiation conditions, a rate of up to 5 upsets per day has been recorded. Considering that the analyzed SRAM has a relatively small capacity, it can be expected that more upsets would occur in larger memories. Therefore, a proper SEU monitor is needed to detect early the changes in radiation levels caused by SPEs, triggering an alarm for activating the appropriate protection measures [19].
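The figure of about 5 upsets per day can be reproduced by multiplying the per-bit rate from Table I by the memory size. The short sketch below does this arithmetic; the function name and the 20-Mbit comparison are illustrative assumptions, and scaling the same per-bit rate to a different memory ignores differences in technology and hardening.

```python
# Expected upsets per day = (upsets per bit per day) x (number of bits).
# The rates are taken from Table I; everything else is illustrative.

def upsets_per_day(rate_per_bit_day: float, bits: int) -> float:
    """Scale a per-bit upset rate to a whole memory."""
    return rate_per_bit_day * bits

SRAM_BITS = 4 * 1024 * 32              # the 4k x 32 bit SRAM of Table I

peak = upsets_per_day(3.8e-5, SRAM_BITS)        # worst five minutes, April 15, 2001
background = upsets_per_day(3.7e-8, SRAM_BITS)  # background rate, same event

# Hypothetical: a 20-Mbit memory with the same per-bit sensitivity.
LARGE_BITS = 20 * 1024 * 1024
peak_large = upsets_per_day(3.8e-5, LARGE_BITS)

print(f"4k x 32 SRAM, peak:       {peak:.2f} upsets/day")
print(f"4k x 32 SRAM, background: {background:.4f} upsets/day")
print(f"20-Mbit memory, peak:     {peak_large:.0f} upsets/day")
```

Under these assumptions the peak rate differs from the background rate by roughly three orders of magnitude, which is the change an on-chip monitor would need to detect.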



### III. RELATED WORK

During the past few decades, a lot of work has been done on the analysis of SEU effects in integrated circuits. Many reports are focused on the detection and monitoring of SEUs. Due to the relatively low cost, high sensitivity to radiation and the possibility of implementation in different technologies, SRAMs are widely used as SEU monitors [9]. The SEU rate monitoring with SRAMs is based on counting the bit flips in the elementary SRAM cells, where the number of bit flips per unit time represents the SEU rate. The typical elementary SRAM cell is a six-transistor (6T) cell depicted in Fig. 1. The memory element of the cell is a latch implemented by two cross-coupled inverters (transistors Mp1/Mn1 and Mp2/Mn2). The other two NMOS transistors (Mpg1 and Mpg2) are necessary for controlling the read and write operations. The most radiation-sensitive nodes within the SRAM cell are the QC and QT nodes. In comparison to other logic gates, SRAM cells usually exhibit higher sensitivity to radiation and are thus suitable as radiation monitors. In general, the overall sensitivity of the SRAM module is determined by the number of cells, i.e., by the total memory capacity.

Various solutions based on either custom-designed or commercial memories have been investigated as SEU monitors. Harboe-Sørensen et al. [7] devised a simple and reliable SRAM-based beam monitoring system which could be used at any accelerator and as support for beam calibrations. Barak et al. [8] used a commercial SRAM as an ionising radiation monitor, confirming its applicability in satellites. Prinzie et al. [9] proposed an integrated SRAM radiation monitor designed in a 180 nm process, with the possibility to control the sensitivity by adjusting the supply voltage. Tsiligiannis et al. [10] investigated commercial 90 nm SRAMs as radiation monitors for the mixed-field radiation environment of a CERN particle accelerator. Spiezia et al. [11] used several large and sensitive commercial SRAMs as a part of the radiation level monitoring system in the Large Hadron Collider (LHC). In order to increase the sensitivity, Tang et al. [12] modified the traditional 6T-cell SRAM structure into a compact 2T monitor and confirmed with accelerated irradiation tests that the implementation of the proposed structure in a 65 nm process has a higher sensitivity to radiation strikes compared to conventional SRAM structures. A Block RAM (BRAM)-based SEU monitor inside an FPGA was proposed by Glein et al. [13]. In the proposed approach, custom wrappers for BRAMs were introduced, monitoring the upset rate in the used (utilized by the user) and unused areas of the BRAMs. The simulation results showed that up to thousands of SEUs could be detected each day during the peak flux of SPEs.

Besides the memory-based SEU monitors, another approach is based on pixel detectors. In contrast to memory-based monitors which can provide only the SEU rate, the pixel detectors can also measure the particle energy. Chapman et al. [15] detected the SEUs by capturing the deposited charge in a CMOS active pixel sensor of a digital camera. Based on [14], Havranek et al. proposed a monolithic pixelated detector for measuring the cosmic radiation and SEU detection.

The state-of-the-art solutions for SEU rate monitoring have important shortcomings in terms of high cost and area/power overheads. These shortcomings result from the fact that state-of-the-art solutions are implemented as stand-alone functional units. To overcome these limitations, we propose a concept of an SEU monitor integrated within the target chip. Our SEU monitor is also based on SRAM, but instead of performing only the SEU monitoring function, the SRAM module is also used as a standard data storage unit within the target chip. Besides, the proposed monitor can also detect permanent faults in SRAM. The comparison details of the related designs and the proposed monitor are explained in Section V-D.

## IV. IMPLEMENTATION OF THE SEU MONITOR

The proposed SEU monitor is intended to be integrated into SRAM blocks which contain Error Detection And Correction (EDAC) and scrubbing mechanisms. These mechanisms are widely used in mission-critical applications, such as space and aviation [20]. Fig. 2 shows the block diagram of the 20-Mbit SRAM chip containing the proposed SEU monitor. The chip is essentially a Synchronous SRAM (SSRAM) consisting of five 512k x 8-bit asynchronous SRAM blocks, a control unit, a scrubbing module and an EDAC module. However, one of the memory blocks is used only internally for storing the 7-bit EDAC syndrome computed on each 32-bit write to the other four memory blocks. Thus, the user effectively sees a 16-Mbit device organized as 512k x 32-bit, which is consistent with the 19-bit address. The memory blocks are based on the conventional 6T memory cell shown in Fig. 1. Each read, write or scrubbing cycle uses the EDAC module and involves access to 32 bits selected by a 19-bit address. The EDAC and the scrubbing modules are employed to protect the memory cells against SEUs and detect single/double bit errors as well as permanent faults in each memory word. Three 8-bit SEU counters are integrated into the control unit to count single/double bit errors and permanent faults individually. Besides, a register file is used to record the faults in order to avoid duplicate counting of the double and permanent faults.

## A. EDAC Module

Fig. 2. 20-Mbit SRAM chip with SEU monitor

In order to detect and correct SEUs in the SRAM, a built-in EDAC module using the (39,32) HSIAO SEC-DED code [21] is deployed to protect the SRAM contents. EDAC can improve the upset rate of the SRAM by several orders of magnitude. Thus, a reliable memory device with very high density is provided. On each 32-bit data write, the EDAC module calculates a 7-bit parity syndrome and stores it in the special (internal-only) 4-Mbit memory block (see Fig. 2). On each 32-bit data read, the 32-bit data and its corresponding syndrome are read and decoded. During read and scrubbing, the EDAC module can detect single and double bit errors. When an error is detected, the corresponding error signal and data address are sent to the control unit, whose control bits direct the next actions (e.g., raise the error signal on the output pin, or rewrite the data with corrected bits in case of a single-bit error).
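To illustrate how a (39,32) SEC-DED code of this kind behaves, the following is a minimal software sketch of an odd-weight-column (Hsiao-type) code. The column selection is an assumption made for illustration (the first 32 weight-3 patterns for data bits, unit vectors for check bits) and is not the parity-check matrix of the chip; the function names are hypothetical.

```python
from itertools import combinations

K, R = 32, 7   # data bits and check bits of a (39,32) code

# Assumed parity-check columns: distinct, odd-weight 7-bit patterns.
DATA_COLS = [sum(1 << i for i in c) for c in combinations(range(R), 3)][:K]
CHECK_COLS = [1 << i for i in range(R)]
COLS = DATA_COLS + CHECK_COLS          # column j belongs to word bit j
DATA_MASK = (1 << K) - 1

def syndrome(word: int) -> int:
    """XOR of the columns selected by the set bits of a 39-bit word."""
    s = 0
    for j, col in enumerate(COLS):
        if (word >> j) & 1:
            s ^= col
    return s

def encode(data: int) -> int:
    """Append 7 check bits so that the stored word has a zero syndrome."""
    return data | (syndrome(data) << K)

def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    s = syndrome(word)
    if s == 0:
        return word & DATA_MASK, "ok"
    if bin(s).count("1") % 2 == 1 and s in COLS:
        word ^= 1 << COLS.index(s)     # flip the bit the syndrome points to
        return word & DATA_MASK, "corrected"
    return word & DATA_MASK, "uncorrectable"   # e.g. a double-bit error

stored = encode(0xDEADBEEF)
print(decode(stored))                        # clean word
print(decode(stored ^ (1 << 5)))             # single-bit upset
print(decode(stored ^ (1 << 5) ^ (1 << 20))) # double-bit upset
```

Because every column has odd weight, any two-bit error produces a non-zero even-weight syndrome, which is what allows double-bit errors to be told apart from single-bit ones.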

## B. Scrubbing Module

The primary role of the scrubbing module is to avoid accumulation of radiation-induced soft errors. In our case it is further used to drive the SEU monitor and provide additional information. In the SSRAM of Fig. 2, the scrubbing module periodically reads memory words when the chip is idle. It automatically increments the scrubbing address after completing the current scrubbing cycle. The address runs from zero to the last address, $2^{19} - 1$, after which it starts again from zero. In the case of a single-bit error, the module corrects the error by using the EDAC module and performs a write-back of the corrected data at the same address. The scrubbing procedure is entirely autonomous and transparent for the user, which means that the user can access the SSRAM even if the scrubbing procedure is in progress. The scrubbing rate, which is set by the delay between accesses to consecutive memory words, can be configured by the user by writing to an internal control register, and this delay is a minimum of four clock cycles. Therefore, the minimum time for scrubbing all the memory words is 42 ms when the working frequency is 50 MHz, which is the test frequency for this chip.
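As a quick check of the quoted scrubbing time, the minimum duration of one full pass follows from the number of words, the minimum of four clock cycles per word and the 50 MHz clock. The symbols $N_{\text{words}}$, $n_{\text{cyc}}$ and $f_{\text{clk}}$ are introduced here only for illustration:

$$T_{\text{scrub,min}} = \frac{N_{\text{words}} \cdot n_{\text{cyc}}}{f_{\text{clk}}} = \frac{2^{19} \times 4}{50 \times 10^{6}\ \text{Hz}} \approx 41.9\ \text{ms}$$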

#### C. SEU Monitor

The proposed SEU monitor is integrated into the control unit to perform the error counting. The basic function of the control unit in Fig. 2 is to provide synchronous access to the 16-Mbit SRAM modules and to the internal registers (which reside in the control unit). There are several control and status registers which direct the behavior of the chip, some of them explained in Subsections IV-A and IV-B.

Fig. 3. Detection flowchart of the SEU monitor.

The SEU monitor simply piggybacks on the EDAC and scrubbing mechanisms. In order for the SEU monitor to work, scrubbing has to be in operation. When a single, double or permanent fault is detected, the corresponding one of the three error counters is incremented. If a counter overflows, it starts counting again from zero, and a corresponding overflow bit is set in the status register. However, according to empirical expectations based on the event counts from existing missions (see Table I), with timely scrubbing, rewriting and counter reset, the monitor is expected to operate normally even during large SPE peak fluxes.
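The counter behaviour described above (wrap-around to zero plus an overflow bit in the status register) can be modelled in a few lines. This is a behavioural sketch only; the class name `SeuCounter` and its methods are hypothetical and do not correspond to the register interface of the chip.

```python
class SeuCounter:
    """Behavioural model of an 8-bit error counter with an overflow flag."""

    def __init__(self, width: int = 8):
        self.modulus = 1 << width   # 256 for an 8-bit counter
        self.value = 0
        self.overflow = False       # models the overflow bit in the status register

    def increment(self) -> None:
        self.value = (self.value + 1) % self.modulus
        if self.value == 0:         # wrapped around from the maximum count
            self.overflow = True

    def reset(self) -> None:
        """Models a user-initiated reset through the control register."""
        self.value = 0
        self.overflow = False

counter = SeuCounter()
for _ in range(300):                # more events than an 8-bit counter can hold
    counter.increment()
print(counter.value, counter.overflow)
```

Reading the counter and resetting it before 256 events have accumulated keeps the count unambiguous, which is why timely reset matters during high-flux periods.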

A $32 \times 21$-bit address register file is used to log erroneous addresses in order to avoid counting the same errors multiple times and to detect permanent faults. A single 21-bit entry consists of a *valid entry* bit, a 19-bit *address*, and an *error type* bit which differentiates between double-bit errors and permanent faults. Up to 32 erroneous addresses can thus be recorded simultaneously. If the register file overflows, the oldest record is automatically discarded and a corresponding overflow bit is set in the status register. Moreover, the valid entry bit is reset if a double-bit error address is rewritten by the user.
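A small behavioural model may help to picture the register file. The bit ordering inside the 21-bit entry (valid bit in the top position, error type in the bottom position), the encoding of the error type and all names below are assumptions for illustration; the source only states which fields exist.

```python
from collections import deque

ADDR_BITS = 19
DOUBLE, PERMANENT = 0, 1            # assumed encoding of the error type bit

def pack_entry(valid: int, address: int, error_type: int) -> int:
    """Assumed layout: [20] valid | [19:1] address | [0] error type."""
    assert 0 <= address < (1 << ADDR_BITS)
    return (valid << 20) | (address << 1) | error_type

def unpack_entry(entry: int):
    return (entry >> 20) & 1, (entry >> 1) & ((1 << ADDR_BITS) - 1), entry & 1

class AddressRegisterFile:
    def __init__(self, depth: int = 32):
        self.entries = deque(maxlen=depth)  # a full deque drops its oldest item
        self.overflow = False               # models the status-register bit

    def log(self, address: int, error_type: int) -> None:
        if len(self.entries) == self.entries.maxlen:
            self.overflow = True
        self.entries.append(pack_entry(1, address, error_type))

    def contains(self, address: int) -> bool:
        return any(v and a == address
                   for v, a, _ in map(unpack_entry, self.entries))

    def user_write(self, address: int) -> None:
        """A user rewrite clears the valid bit of a logged double-bit error."""
        for i, entry in enumerate(self.entries):
            v, a, t = unpack_entry(entry)
            if v and a == address and t == DOUBLE:
                self.entries[i] = pack_entry(0, a, t)

rf = AddressRegisterFile()
rf.log(0x1234, DOUBLE)
rf.log(0x0042, PERMANENT)
rf.user_write(0x1234)               # double-bit error overwritten by the user
print(rf.contains(0x1234), rf.contains(0x0042))
```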

The detection flowchart of the proposed monitor is shown in Fig. 3. Upon receipt of the chip idle signal, the scrubbing procedure resumes from the address at which the previous procedure ended and checks each 39-bit memory word (32-bit data and its 7-bit HSIAO syndrome). If no error is detected in the current memory word, or the corresponding address has already been logged in the address register file, the error detection proceeds to the next address. Conversely, if a new error is found, the current memory word is re-scrubbed immediately. If no error is found in the second scrubbing round, the EDAC has corrected the error in the previous scrubbing round, which identifies it as a single-bit error. Double-bit errors and permanent faults, by contrast, cannot be corrected by the EDAC, and the 'error type' bit is set accordingly. Furthermore, the corresponding error address of the double/permanent fault is logged in the address register file. Without this log, duplicate counting of the same double-bit errors and permanent faults could not be avoided, and the corresponding counters would quickly overflow.
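The decision flow described above can be sketched as a toy simulation. Two points are assumptions rather than statements from the text: that a single-bit error which persists after the write-back is classified as a permanent fault, and that a double-bit indication in the second round is classified as a double-bit error. `FakeSram`, `monitor_step` and the fault model are hypothetical stand-ins for the SRAM, EDAC and control unit.

```python
class FakeSram:
    """Toy stand-in for SRAM + EDAC: tracks soft flips and stuck bits per address."""

    def __init__(self):
        self.soft = {}       # address -> number of soft bit flips in the word
        self.stuck = set()   # addresses containing one stuck-at (permanent) bit

    def scrub(self, addr: int) -> int:
        """Read a word through the EDAC; returns 0 (clean), 1 (single), 2 (double)."""
        errors = self.soft.get(addr, 0) + (1 if addr in self.stuck else 0)
        if errors == 1 and addr not in self.stuck:
            del self.soft[addr]          # write-back repairs a soft single-bit error
        return min(errors, 2)

def monitor_step(sram, addr, counters, logged):
    """One scrubbing cycle of the monitor for a single address."""
    if sram.scrub(addr) == 0 or addr in logged:
        return                           # clean word or already logged: move on
    second = sram.scrub(addr)            # immediate re-scrub of the same word
    if second == 0:
        counters["single"] += 1          # corrected in the first round
    elif second == 2:
        counters["double"] += 1
        logged[addr] = "double"
    else:
        counters["permanent"] += 1       # assumed: persistent single-bit error
        logged[addr] = "permanent"

sram = FakeSram()
sram.soft = {5: 1, 9: 2}                 # one single-bit and one double-bit upset
sram.stuck = {12}                        # one permanent fault
counters = {"single": 0, "double": 0, "permanent": 0}
logged = {}
for _ in range(2):                       # two full scrubbing passes
    for addr in range(16):
        monitor_step(sram, addr, counters, logged)
print(counters, logged)
```

The second pass leaves the counters unchanged, since the uncorrectable addresses are already logged and the corrected word is clean.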

One aspect to be considered is the following. If the SSRAM is constantly being accessed by the user (without idle cycles between read/write operations), the entire SSRAM cannot be scrubbed in time, since the scrubbing operation is designed to be transparent to the user. The SEU monitor is also not active in such a case. Therefore, regularly scrubbing all memory words (e.g., once an hour) is a possible solution. This ensures overall detection of potential SEUs and avoids the accumulation of soft errors.

The user can read the SEU counters, the address register file and status registers as well as write and read the control registers at any time. By writing the corresponding bits in the control registers, the user can also reset the SEU counters and all 'valid entry' bits of the address register file. Since the counters and all other registers can also be affected by radiation particles, Triple-Modular Redundant (TMR) flip-flops are used in order to enhance their robustness against SEUs [22].

## V. ANALYSIS OF RESULTS

#### A. SEU Monitor Performance Analysis

1) Soft errors: The main objective of the proposed monitor is to detect and count SEUs in the SSRAM. A set of simulations was performed considering different numbers of faults in each of the 39-bit memory words, and evaluating the effectiveness of the proposed monitor. The test procedure includes injecting a large quantity of bit flips, which correspond to SEUs, into random SRAM cells. All single and double bit errors are detected and counted during scrubbing. However, because of the limitations of the HSIAO SEC-DED code, multiple bit errors (MBEs) cannot be correctly detected. An odd number of bit flips will be detected as a single-bit error, and most even numbers of bit flips will be treated as double-bit errors.

Besides the HSIAO SEC-DED code, many other coding mechanisms can also be employed to detect and correct certain multiple bit errors, such as SEC-DED-DAEC (Double Adjacent Error Correction) [23], SEC-DAEC-TAEC (Triple Adjacent Error Correction) [23], 3-bit burst ECC [24], etc. However, the area and timing overheads of these mechanisms are higher than those of the HSIAO SEC-DED code, and the hardware implementation of error detection and correction is also more complicated. The HSIAO code provides fast and simple encoding/decoding with low hardware overhead. As the probability of adjacent double bit errors is much higher than that of other MBEs, the HSIAO SEC-DED code is a suitable choice in our case. Moreover, the occurrence probability of an uncorrectable MBE and the accumulation of transient faults can be significantly decreased if the entire memory is scrubbed regularly.

2) Hard errors: There are several ways to detect permanent faults in the SSRAM relying on the mechanisms implemented in the chip. One of them is by reporting a single bit error by raising the error output pin. The control unit has a field in the status register which indicates the erroneous bit position. By writing and reading data patterns at that address, the software can determine the presence of permanent faults. However, a more sophisticated approach is the one described in Subsection IV-C, which was also verified by a set of simulations.

#### B. SEU Sensitivity of an SRAM Cell

Since the SRAM is intended to operate as a radiation monitor within a host chip designed with standard logic gates, it is essential to evaluate the SEU sensitivity of the SRAM cell with respect to the SEU sensitivity of standard flip-flops and the SET sensitivity of standard combinational gates. It is also important to evaluate the SEU sensitivity of the SRAM in terms of supply voltage, since low-power (e.g., voltage-scaling) techniques are also frequently applied in space applications.

The SET/SEU sensitivity has been evaluated in terms of critical charge, which was estimated through the standard current injection approach in SPICE simulations, i.e., by injecting a double-exponential current pulse [25] in the circuit nodes. The timing parameters of the double-exponential current pulse were kept constant (rise time = 10 ps and fall time = 100 ps) while the injected charge was varied during the simulations to obtain the critical charge values.
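For readers unfamiliar with the injection model, a commonly used form of the double-exponential pulse is $I(t) = \frac{Q}{\tau_f - \tau_r}\left(e^{-t/\tau_f} - e^{-t/\tau_r}\right)$, whose time integral equals the injected charge $Q$. The sketch below evaluates this form numerically; treating the quoted rise and fall times as the time constants $\tau_r$ and $\tau_f$ is an assumption, as are the function names, and the exact parametrisation used in the SPICE setup is not stated in the text.

```python
import math

TAU_R = 10e-12    # assumed rise time constant (s)
TAU_F = 100e-12   # assumed fall time constant (s)

def pulse_current(t: float, charge: float,
                  tau_r: float = TAU_R, tau_f: float = TAU_F) -> float:
    """Double-exponential current (A) at time t (s) for a total charge (C)."""
    if t < 0:
        return 0.0
    return charge / (tau_f - tau_r) * (math.exp(-t / tau_f) - math.exp(-t / tau_r))

def injected_charge(charge: float, t_end: float = 2e-9, steps: int = 20000) -> float:
    """Trapezoidal integral of the pulse over [0, t_end]; approaches `charge`."""
    dt = t_end / steps
    total = 0.0
    for i in range(steps):
        total += 0.5 * (pulse_current(i * dt, charge)
                        + pulse_current((i + 1) * dt, charge)) * dt
    return total

def peak_time(tau_r: float = TAU_R, tau_f: float = TAU_F) -> float:
    """Time at which the pulse reaches its maximum (from dI/dt = 0)."""
    return math.log(tau_f / tau_r) * tau_f * tau_r / (tau_f - tau_r)

q = 9.7e-15  # example charge in coulombs (9.7 fC)
print(f"integrated charge: {injected_charge(q) * 1e15:.3f} fC")
print(f"peak at {peak_time() * 1e12:.1f} ps, "
      f"I_peak = {pulse_current(peak_time(), q) * 1e6:.1f} uA")
```

Sweeping the charge parameter while keeping both time constants fixed mirrors the procedure described above, in which the smallest charge that flips the stored value is taken as the critical charge.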

Fig. 4 depicts the variation of the critical charge of an SRAM cell in terms of supply voltage, for the cases when a logic '0' and a logic '1' are stored in the cell. It can be observed that the critical charge depends on the stored value, and it decreases as the supply voltage is reduced. The reduction of supply voltage leads to a decrease of the driving strength of the transistors, consequently reducing the transistors' capability to dissipate the induced charge.

Fig. 4. Critical charge of SRAM cell in terms of supply voltage

The critical charge values for the analyzed SRAM and the most common standard gates in the IHP 130 nm CMOS library, obtained for the nominal supply voltage of 1.2 V, are presented in Table II. Since the critical charge for combinational gates depends on the input logic levels, only the lowest critical charge value for each gate, obtained by injecting the current pulse at the gate's output, is presented. As can be seen, for the logic '1' stored in the SRAM cell, the critical charge of the SRAM cell is lower than that of all investigated standard cells. On the other hand, when the logic '0' is stored, the critical charge of the SRAM cell is slightly higher than that of the NOR, XOR and XNOR gates. However, since a charge higher than the critical charge is required to cause a SET capable of propagating through the combinational circuit, it is clear that the SRAM cell is more sensitive to particle strikes than all investigated logic cells. This indicates that the analyzed SRAM cell can be utilized as a radiation monitor within a system designed in the investigated 130 nm CMOS technology. Moreover, this analysis relates to the regular standard cell library. In the radiation-hardened library, both standard cells and SRAMs are additionally hardened. Nevertheless, due to the cost of hardening, the critical charge of the SRAM cannot be increased substantially.

TABLE II CRITICAL CHARGE FOR SRAM AND DIFFERENT STANDARD CELLS

| Element                 | Critical charge (fC) for SET or SEU |
|-------------------------|------------------------------------|
| SRAM (stored logic '1') | 9.7                                |
| SRAM (stored logic '0') | 13.9                               |
| D flip-flop             | 24.9                               |
| INV                     | 20.1                               |
| NAND                    | 22.2                               |
| AND                     | 19.6                               |
| NOR                     | 12.8                               |
| OR                      | 19.3                               |
| XOR                     | 12.9                               |
| XNOR                    | 13.6                               |

#### C. Synthesis Results

Since the SEU monitor is implemented on top of a standard radiation-hardened 20-Mbit SRAM with EDAC and scrubbing, it is essential to investigate the introduced overheads of power and area. The following synthesis results assume the IHP 130 nm standard CMOS library with a supply voltage of 1.2 V, and a nominal operating frequency of 50 MHz.

The total area of the chip is $14\ mm^2$, and the power consumption is 384 mW. The main contributors to these figures are the five asynchronous SRAM blocks, while the contribution of the entire "digital" logic (control unit, SEU monitor, EDAC and scrubbing module) is only $0.0957\ mm^2$, i.e., the introduced area overhead is less than 1%. Similarly, the estimated power consumption of this logic is only 0.211 mW, i.e., the induced power overhead is less than 0.1%. The area and power overheads resulting from the additional modules are thus negligible.
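The two overhead bounds can be verified directly from the quoted figures. The snippet below is only a worked check of that arithmetic; the helper name `share_percent` is hypothetical.

```python
def share_percent(part: float, whole: float) -> float:
    """Contribution of `part` to `whole`, in percent."""
    return 100.0 * part / whole

CHIP_AREA_MM2 = 14.0       # total chip area
LOGIC_AREA_MM2 = 0.0957    # control unit, SEU monitor, EDAC and scrubbing module
CHIP_POWER_MW = 384.0      # total chip power
LOGIC_POWER_MW = 0.211     # estimated power of the digital logic

area_overhead = share_percent(LOGIC_AREA_MM2, CHIP_AREA_MM2)
power_overhead = share_percent(LOGIC_POWER_MW, CHIP_POWER_MW)

print(f"area overhead:  {area_overhead:.2f} %")
print(f"power overhead: {power_overhead:.3f} %")
```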

However, it is also instructive to compare only the "digital" parts of the chip with and without SEU Monitor. Tables III and IV show such a comparison. The main parts of the SEU Monitor are the three 8-bit rad-hard counters and the  $32 \times 21$ -bit address register file. The number of flip-flops in the proposed design is 947, while the original design without SEU Monitor has 253 flip-flops. Therefore, these additional flip-flops are actually the main contributors to the power and area overhead.

TABLE III Area comparison (in $\mu m^2$) between radiation-hardened controllers with and without SEU Monitor

|                        | Without monitor | With monitor |
|------------------------|-----------------|--------------|
| Combinational area     | 11298           | 67344        |
| Non-combinational area | 7408            | 28395        |
| Total area             | 18706           | 95739        |

TABLE IV Power comparison (in mW) between radiation-hardened controllers with and without SEU Monitor

|                         | Without monitor | With monitor |
|-------------------------|-----------------|--------------|
| Combinational logic     | 0.022           | 0.091        |
| Non-combinational logic | 0.032           | 0.120        |
| Total power consumption | 0.054           | 0.211        |

### D. Evaluation and Comparison

The proposed monitor has been compared with the other memory-based monitors presented in Section III, and the results are summarized in Table V. It can be observed that the existing SRAM-based monitors [7]-[12] use the "set/check test pattern" approach to detect SEUs, i.e., a set of known patterns is written, read and compared in the memory array. Therefore, these SRAM-based monitors cannot be reused as storage elements in a computing system, and most of them need an extra custom PCB to implement the detection function. For the BRAM-based monitor in FPGA [13], although the EDAC and scrubbing functions are utilized to achieve the SEU detection and storage function, the overhead due to additional components is larger than for our design. Namely, 298 custom BRAM wrappers are deployed for the BRAM monitor, and the resource overhead of the monitor is 4.9% of the FPGA resources [13]. Furthermore, none of the related designs [7]-[13] has the ability to detect permanent faults in memory arrays.



Fig. 5. Decision tree for determining the operation mode

Therefore, the proposed SEU monitor design provides a substantial advancement over the previous SRAM-based designs by combining the SEU detection and storage function, and also reduces the resource overhead compared to the BRAM-based design. Moreover, the proposed design supports the detection of permanent faults in memory arrays, which is to the best of our knowledge not feasible with any of the reported designs. The proposed design can be efficiently implemented with a negligible power/area overhead.

#### VI. APPLICATION OF THE PROPOSED DESIGN

The proposed SEU monitor is intended to be used as an integral part of a multiprocessing system in order to achieve dynamic self-adaptive properties of the system, enabling an adaptive trade-off between reliability, power consumption, and performance. Multiprocessing systems have an inherent hardware redundancy and are convenient for the deployment of reconfigurable/dynamic mechanisms, such as core-level N-Modular Redundancy (NMR), dynamic task scheduling, adaptive voltage scaling, etc. The proposed SEU monitor can be used to determine the Soft Error Rate (SER) and predict a potential SPE in such a system, and to provide information to the mechanisms for dynamic reconfigurability and self-adaptation, i.e., to determine the optimal operating modes under the given reliability requirements.

TABLE V COMPARISON OF THE PROPOSED DESIGN TO OTHER MEMORY-BASED SEU MONITORS

|                               | Harboe-Sørensen et al. [7]     | Barak et al. [8]                  | Prinzie et al. [9]      | Tsiligiannis et al. [10] | Spiezia et al. [11]                 | Tang et al. [12]        | Glein et al. <sup>2</sup> [13] | Proposed Design      |
|-------------------------------|--------------------------------|-----------------------------------|-------------------------|--------------------------|-------------------------------------|-------------------------|--------------------------------|----------------------|
| Type of SEU Monitor           | Commercial SRAM Atmel AT60142F | Commercial SRAM Intersil HM 65162 | Custom-designed SRAM    | Commercial SRAM          | Commercial SRAM Cypress CY62157EV30 | Custom-designed SRAM    | BRAM                           | Custom-designed SRAM |
| Implementation                | SRAM chip on PCB               | 6 SRAM chips on PCB               | ASIC                    | 3 SRAM chips on PCB      | 4 SRAM chips on PCB                 | ASIC                    | FPGA                           | ASIC                 |
| Capacity                      | 4 Mbit                         | $6 \times 16$ kbit                | 20 kbit                 | $3 \times 32$ Mbit       | $4 \times 8$ Mbit                   | 64 kbit                 | 10.6 Mbit                      | 20 Mbit              |
| Data Storage                  | No                             | No                                | No                      | No                       | No                                  | No                      | Yes                            | Yes                  |
| Detection of Permanent Faults | No                             | No                                | No                      | No                       | No                                  | No                      | No                             | Yes                  |
| Detection Approach            | Set/Check Test Patterns        | Set/Check Test Patterns           | Set/Check Test Patterns | Set/Check Test Patterns  | Set/Check Test Patterns             | Set/Check Test Patterns | EDAC+Scrubbing                 | EDAC+Scrubbing       |
| Supply Voltage (V)            | 3.3                            | 5                                 | 0~1.8                   | N/A                      | 2.2~3.6                             | 0.75~1.5                | 1.2~3.3                        | 1.2                  |
| Frequency (MHz)               | N/A                            | 4.5                               | N/A                     | 50                       | N/A                                 | N/A                     | 260                            | 50                   |
| Technology (CMOS)             | 250 nm                         | N/A                               | 180 nm                  | 90 nm                    | N/A                                 | 65 nm                   | 65 nm                          | 130 nm               |

The decision tree for determining the optimal operation mode for the self-adaptive platform is shown in Fig. 5. The reliability requirements of the system are based on the Safety Integrity Level (SIL) defined by the IEC 61508 standard, which is commonly adopted by systems with high reliability requirements such as those in space applications [26]. The standard defines four SILs, with SIL 4 being the most dependable and SIL 1 the least. The relationship between the SERs and the configuration modes under the constraint of the SILs can be determined by static analysis. Four reliability tables, one per SIL, can be formed to relate the SERs to the operating modes under the respective reliability requirements. The system can then launch a specific operating mode within a certain SER range in order to satisfy the SIL demand. Based on the real-time SER information coming from the proposed monitor and the SIL required by the user or task, the operating mode can be determined and launched according to these tables. Moreover, the onset of an SPE can be predicted by evaluating the Mean Time To Upset (MTTU) of the monitor. In [19], the authors propose a method to predict SPEs with fairly high accuracy. Such a prediction allows the system to respond appropriately in advance of the predicted large particle fluxes.
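As a minimal illustration of the quantities involved, the following Python sketch derives the SER (in upsets per bit per day, the unit used in Table VI) and the MTTU from an upset count accumulated over an observation window. The function names, the example count of 5 upsets in one day, and the interpretation of 20 Mbit as $20 \cdot 2^{20}$ bits are assumptions made for illustration only. The sketch does not reproduce the SPE prediction method of [19].

```python
def soft_error_rate(upsets: int, bits: int, days: float) -> float:
    """Return the SER in upsets per bit per day."""
    if bits <= 0 or days <= 0:
        raise ValueError("bits and days must be positive")
    return upsets / (bits * days)


def mean_time_to_upset(upsets: int, hours: float) -> float:
    """Return the MTTU in hours; infinite if no upset was observed."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return float("inf") if upsets == 0 else hours / upsets


# Assumption: 1 Mbit = 2**20 bits, applied to the 20-Mbit monitor.
MONITOR_BITS = 20 * 2**20

# Hypothetical reading: 5 upsets counted during one day of scrubbing.
ser = soft_error_rate(upsets=5, bits=MONITOR_BITS, days=1.0)
mttu = mean_time_to_upset(upsets=5, hours=24.0)
```

A shrinking MTTU between successive observation windows is the kind of trend that an SPE predictor could evaluate.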

TABLE VI EXAMPLE OF NMR OPERATION MODES UNDER DIFFERENT SERS AND THE DURATION TIME OF THESE SERS IN A ONE YEAR AVERAGE

| SER ($\mathrm{upsets} \cdot \mathrm{bit}^{-1} \cdot \mathrm{day}^{-1}$) | Operating Mode                | Duration Time / Year (hours) |
|---------------------------------------------------------------------------|-------------------------------|------------------------------|
| $< 10^{-8}$                                                               | High-Performance or De-Stress | 5460                         |
| $10^{-8} - 10^{-7}$                                                       | DMR                           | 3120                         |
| $10^{-7} - 10^{-6}$                                                       | TMR                           | 162                          |
| $> 10^{-6}$                                                               | QMR                           | 18                           |
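The lookup implied by Table VI can be sketched in a few lines of Python. This is only an illustration of a single reliability table: the names `SER_BOUNDS`, `MODES` and `select_mode` are hypothetical. Table VI does not say which mode applies when the SER falls exactly on a range boundary, so the sketch assumes that a boundary value is assigned to the more redundant mode.

```python
import bisect

# Upper SER bounds (upsets per bit per day) separating the rows of Table VI.
SER_BOUNDS = [1e-8, 1e-7, 1e-6]
MODES = ["High-Performance or De-Stress", "DMR", "TMR", "QMR"]


def select_mode(ser: float) -> str:
    """Map a measured SER to an operating mode according to Table VI.

    Assumption: a value exactly on a boundary selects the more
    redundant (more conservative) mode.
    """
    if ser < 0:
        raise ValueError("SER must be non-negative")
    # bisect_right counts how many bounds are <= ser, which is the row index.
    return MODES[bisect.bisect_right(SER_BOUNDS, ser)]
```

For instance, `select_mode(5e-8)` yields DMR. A complete implementation would hold one such table per SIL and pick the table according to the SIL required by the running task.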

As a case study, the proposed monitor could be integrated into a 4-core multiprocessing system, detailed in [16]. This 4-core multiprocessing system has three operating modes: 1) in de-stress (and power-saving) mode, three of the cores are powered off, while only one core is actively executing instructions; 2) in fault-tolerant mode, two, three or all four cores simultaneously execute the same tasks in a Dual, Triple or Quadruple Modular Redundant (DMR, TMR, or QMR) fashion, respectively, in order to increase the error resilience; 3) in high-performance mode, all cores execute different tasks. The objective of switching between different operating modes is to dynamically improve the reliability or enhance the performance by adjusting the "redundant" and "power off" status of the processing cores. Regarding the transient faults induced by radiation particles, the DMR enables detection of one core error output, TMR can mask one core error, and QMR has the ability to mask up to two core errors simultaneously.
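The detection and masking capabilities of the fault-tolerant modes can be illustrated with a simple plurality voter. The Python sketch below is not the voting logic of the platform in [16]; the function `vote` and its return convention are assumptions chosen for illustration. Under plurality voting, QMR masks two simultaneous core errors only if the two faulty outputs differ from each other, which this sketch assumes.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], bool]:
    """Plurality vote over the outputs of the active cores.

    Returns (voted_value, error_detected). voted_value is None when no
    unique plurality winner exists, i.e. the error is detected but
    cannot be masked.
    """
    if not outputs:
        raise ValueError("at least one core output is required")
    ranked = Counter(outputs).most_common()
    error_detected = len(ranked) > 1
    top_value, top_count = ranked[0]
    if error_detected and ranked[1][1] == top_count:
        return None, True  # tie: detection only
    return top_value, error_detected
```

With two cores (DMR), a mismatch such as `vote([1, 2])` is detected but not masked. With three cores (TMR), `vote([7, 7, 3])` masks the single faulty output. With four cores (QMR), `vote([7, 7, 3, 9])` masks two differing faulty outputs.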

TABLE VII POWER COMPARISON OF DIFFERENT OPERATION MODES OF A 4-CORE MULTIPROCESSOR IN ONE YEAR

| Operating Mode               | Power Consumption |
|------------------------------|-------------------|
| Self-adaptive mode switching | $12258\rho$       |
| De-Stress                    | $8760\rho$        |
| DMR                          | $17520\rho$       |
| TMR                          | $26280\rho$       |
| QMR                          | $35040\rho$       |
| High-Performance             | $35040\rho$       |

By integrating the proposed monitor into the 4-core multiprocessing system, self-adaptive mode switching can be achieved by autonomously configuring the least amount of redundancy required by the current SER. For example, Table VI shows the relation between the SERs and the operating modes of the 4-core multiprocessing system, as well as the average duration of the corresponding SERs in one year. These durations are obtained by merging the SERs under differing solar conditions into a one-year average [13]. The SER classification in Table VI is derived from published empirical results (see Table I); accurate SER data from the analysis of the SIL requirements will be presented in our future work. Moreover, besides triggering the on-demand NMR formation of the cores, the self-adaptive mode switching can also effectively reduce the total system power consumption. Let the parameter $\rho$ denote the power consumption of one core per hour. Assuming one active core in the lowest SER range (de-stress mode), the system power consumption in one year is $P_{year} = \rho \cdot 5460 + 2\rho \cdot 3120 + 3\rho \cdot 162 + 4\rho \cdot 18 = 12258\rho$. The power consumption of the different operation modes in one year is compared in Table VII. The comparison shows that the power consumption of the proposed self-adaptive mode switching approach is even lower than that of the static DMR mode ($17520\rho$). Furthermore, if the predicted SPE shows a high occurrence probability, the TMR or QMR mode can be activated in advance.
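The figures in Table VII follow directly from the durations in Table VI and the number of active cores per mode. The short Python sketch below reproduces them in units of $\rho$. The dictionary and function names are hypothetical, and the lowest SER range is assumed to run in de-stress mode with one active core, as in the expression for $P_{year}$.

```python
# Hours per year spent in each SER range (Table VI); the lowest range
# is assumed to run in de-stress mode with a single active core.
HOURS_PER_YEAR = {"De-Stress": 5460, "DMR": 3120, "TMR": 162, "QMR": 18}

# Number of powered cores in each operating mode of the 4-core system.
ACTIVE_CORES = {
    "De-Stress": 1,
    "DMR": 2,
    "TMR": 3,
    "QMR": 4,
    "High-Performance": 4,
}


def adaptive_consumption() -> int:
    """Yearly consumption (in units of rho) with self-adaptive switching."""
    return sum(ACTIVE_CORES[mode] * hours for mode, hours in HOURS_PER_YEAR.items())


def static_consumption(mode: str) -> int:
    """Yearly consumption (in units of rho) when one mode runs all year."""
    return ACTIVE_CORES[mode] * sum(HOURS_PER_YEAR.values())


adaptive = adaptive_consumption()
static = {mode: static_consumption(mode) for mode in ACTIVE_CORES}
```

Under these assumptions, the self-adaptive value is about 30% below that of the static DMR mode.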

#### VII. CONCLUSION AND FURTHER WORK

An on-chip low-cost SEU monitor based on a standard 20-Mbit SRAM for space applications has been presented. The proposed design extends the functionality of an SRAM by combining the existing EDAC and scrubbing modules with three rad-hard SEU counters and an address register file. Single and double bit errors as well as permanent faults in each 39-bit memory word can be detected and counted during the scrubbing procedure. SPICE simulations confirmed that the SRAM cells are very sensitive to particle strikes; therefore, the use of SRAM as an SEU monitor is a valid choice. The synthesis results show that the area and power consumption overheads are negligible compared to the 20-Mbit SRAM. The use of the SEU monitor is foreseen in a multiprocessing system with reconfigurable/dynamic mechanisms. The optimal operating mode of the multiprocessing system can be dynamically determined according to the SERs from the proposed monitor, thereby realizing the best trade-off between reliability, performance and power consumption at run-time.

The future work will mainly focus on three directions. Firstly, a prototype chip will be manufactured in IHP 130 nm technology, and the performance of the developed monitor will be validated under high-energy irradiation. Secondly, the modification of the algorithm in the EDAC module is planned to allow detection of MBEs with low area and timing overheads. Finally, the adaptive operating mode switching for the multiprocessing system will be implemented and verified.

## ACKNOWLEDGMENTS

This work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 722325. Furthermore, this work has also received funding from the EU Eurostars programme under the grant agreement number E!10049 (EuroSRAM4Space).

#### REFERENCES

- [1] E. Dubrova, Fault-tolerant design. New York, NY: Springer, 2013.
- [2] J. L. Barth et al., "Space, atmospheric, and terrestrial radiation environments," in IEEE Transactions on Nuclear Science, vol. 50, no. 3, June 2003, pp. 466-482.
- [3] R. Harboe-Sørensen et al., "Observation and analysis of Single Event Effects on-board the SOHO satellite," RADECS 2001. 2001 6th European Conference on Radiation and Its Effects on Components and Systems, Grenoble, France, 2001, pp. 37-43.
- [4] K. H. Yearby et al., "Single-event upsets in the cluster and double star digital wave processor instruments," in Space Weather, vol. 12, no. 1, Jan. 2014, pp. 24-28.

- [5] D. L. Hansen et al., "Correlation of Prediction to On-Orbit SEU Performance for a Commercial 0.25-μm CMOS SRAM," in IEEE Transactions on Nuclear Science, vol. 54, no. 6, Dec. 2007, pp. 2525-2533.
- [6] G. R. Heckman et al., "Prediction and evaluation of solar particle events based on precursor information," in Advances in Space Research, vol. 12, no. 2-3, 1992, pp. 313-320.
- [7] R. Harboe-Sørensen et al., "Design, Testing and Calibration of a 'Reference SEU Monitor' System," 2005 8th European Conference on Radiation and Its Effects on Components and Systems, Cap d'Agde, 2005, pp. B3-1-B3-7.
- [8] J. Barak et al., "Detecting heavy ions and protons in space: single-events monitor," Eighteenth Convention of Electrical and Electronics Engineers in Israel, Tel Aviv, Israel, 1995, pp. 5.5.1/1-5.5.1/3.
- [9] J. Prinzie et al., "An SRAM-Based Radiation Monitor With Dynamic Voltage Control in 0.18 um CMOS Technology," in IEEE Transactions on Nuclear Science, vol. 66, no. 1, Jan. 2019, pp. 282-289.
- [10] G. Tsiligiannis et al., "An SRAM Based Monitor for Mixed-Field Radiation Environments," in IEEE Transactions on Nuclear Science, vol. 61, no. 4, Aug. 2014, pp. 1663-1670.
- [11] G. Spiezia et al., "A New RadMon Version for the LHC and its Injection Lines," in IEEE Transactions on Nuclear Science, vol. 61, no. 6, Dec. 2014, pp. 3424-3431.
- [12] Q. Tang et al., "A compact high-sensitivity 2-transistor radiation sensor array," 2017 IEEE International Reliability Physics Symposium (IRPS), Monterey, CA, 2017, pp. SE-7.1-SE-7.4.
- [13] R. Glein et al., "BRAM implementation of a single-event upset sensor for adaptive single-event effect mitigation in reconfigurable FPGAs," 2017 NASA/ESA Conference on Adaptive Hardware and Systems, Pasadena, CA, 2017, pp. 1-8.
- [14] M. Havranek et al., "MAPS sensor for radiation imaging designed in 180nm SoI CMOS technology," Journal of Instrumentation, 13 C06004, 06, 2018.
- [15] G. H. Chapman et al., "Single Event Upsets and Hot Pixels in digital imagers," 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, Amherst, MA, 2015, pp. 41-46.
- [16] A. Simevski, "Architectural framework for dynamically adaptable multiprocessors regarding aging, fault tolerance, performance and power consumption", PhD dissertation, BTU Cottbus-Senftenberg, 2014.
- [17] S. Bourdarie et al., "The Near-Earth Space Radiation Environment," in IEEE Transactions on Nuclear Science, vol. 55, no. 4, Aug. 2008, pp. 1810-1832.
- [18] D. L. Hansen et al., "Correlation of Prediction to On-Orbit SEU Performance for a Commercial 0.25-µm CMOS SRAM," in IEEE Transactions on Nuclear Science, vol. 54, no. 6, Dec. 2007, pp. 2525-2533.
- [19] R. Glein et al., "Detection of solar particle events inside FPGAs," 2016 16th European Conference on Radiation and Its Effects on Components and Systems, Bremen, 2016, pp. 1-5.
- [20] C. Hafer et al., "Next generation radiation-hardened SRAM for space applications," 2006 IEEE Aerospace Conference, Big Sky, MT, 2006, 8 pp.
- [21] M. Y. Hsiao, "A Class of Optimal Minimum Odd-weight-column SEC-DED Codes," in IBM Journal of Research and Development, vol. 14, no. 4, July 1970, pp. 395-401.
- [22] V. Petrovic and M. Krstic. "Design flow for radhard TMR Flip-Flops," Design and Diagnostics of Electronic Circuits and Systems, IEEE International Symposium on, 2015, pp. 203.
- [23] L. Saiz-Adalid et al., "MCU Tolerance in SRAMs Through Low-Redundancy Triple Adjacent Error Correction," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 10, Oct. 2015, pp. 2332-2336.
- [24] X. She et al., "SEU Tolerant Memory Using Error Correction Code," in IEEE Transactions on Nuclear Science, vol. 59, no. 1, Feb. 2012, pp. 205-210.
- [25] F. Wrobel et al., "Comparison of the transient current shapes obtained with the diffusion model and the double exponential law - Impact on the SER," 2013 14th European Conference on Radiation and Its Effects on Components and Systems, Oxford, 2013, pp. 1-4.
- [26] Functional Safety of electrical / electronic / programmable electronic safety-related systems (IEC 61508), International Electrotechnical Commission, 2005.