# **Resiliency Challenges in Future Communications Infrastructure**

Hang Nguyen, Intel Corporation, May 14, 2014

Acknowledgements: Pranav Mehta and Shivani Sud of Intel



## Outline

- NFVI and desired characteristics
- The resiliency challenge
- Faults, effects, and measures
- Resiliency framework for NFV
- Summary



### **Infrastructure Challenges**

#### New kinds of consumers

5 Billion people will be directly touched by connectivity in 2015

#### New kinds of connections

29X is the amount that mobile data connectivity will grow between 2010 and 2015

#### New kinds of devices

55% of respondents say their next phone purchase will be a smartphone

#### New ecosystems

\$46 Billion in revenue will be generated worldwide by consumer mobile apps in 2015

Source: Yankee Group\*, 2012

\*Other names and brands may be claimed as the property of others.

Drive more Demanding Performance





### **Transforming the Network with SDN and NFV**




## **NFV End User Value Proposition**

- Lower TCO
- On-Demand Service
- Rapid Service Innovation



### **NFV Infrastructure Attributes**

- Reliability
- Availability
- Manageability
- Security
- Performance

### RAS $\rightarrow$ a main theme



# **Increasingly Common Faults**



### Faults can be very costly

## **Moore's Law**

Number of transistors doubles every ~2 years

[Figure: transistor count over time, beginning with the 4004 (1971) and 8008]

#### **Scaling Trends**



Transistor dimensions scale to improve performance, reduce power and reduce cost per transistor



#### IEEE Communications Quality and Reliability Workshop, 2014


## **Unprecedented Integration**





- Moore's Law enables unprecedented levels of integration
- Heterogeneous system integration of Cores, Graphics, Media, IOs, memory technologies, etc. to satisfy USERS' experiences and reduce OPERATORS' expenditures

#### Heterogeneous System Integration further drives the Resiliency Challenge



### **Potential Fault Sources**





# **Types of Faults**


| Faults                        | Type                                              | Example                                                                 |
|-------------------------------|---------------------------------------------------|-------------------------------------------------------------------------|
| Permanent faults              | Stuck-at-0 / stuck-at-1                           | Opens, shorts, power supply or fan shutdown                             |
| Gradual faults                | Spatial: variations                               | Fast and slow cores                                                     |
|                               | Temporal: temperature effects                     | Change in frequency with temperature                                    |
| Aging faults                  | Degradation (slow, gradual, temporal)             | Loss of frequency over time, erratic bits in memory                     |
| Intermittent/transient faults | Soft errors (radiation induced)<br>Voltage droops | Flipped bit causes data corruption or loss of control; not reproducible |

Faults cause errors (data & control):

| Error                  | Detection / effect                      |
|------------------------|-----------------------------------------|
| Datapath/array errors  | Detected/corrected by parity/ECC        |
| Control errors         | Control lost (blue screen, system hang) |
| Silent data corruption | Not detectable                          |
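To illustrate what "detected/corrected by ECC" means for a datapath or array error, here is a minimal sketch of a Hamming(7,4) single-error-correcting code. It is a textbook code chosen for brevity, not the ECC scheme used in any particular product; the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome).

    A non-zero syndrome is the 1-based position of the single bit that
    was flipped; that bit is corrected before the data is extracted.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the flipped bit
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # inject a single-bit upset at position 5
    print(hamming74_decode(word))
```

Note that this code handles only one flipped bit per codeword. Two flipped bits produce a non-zero syndrome that points at the wrong position, so the decoder "corrects" a healthy bit and returns wrong data without any indication, which is one way silent data corruption can arise.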

### **Sources of Variations**









**Temp Variation & Hot spots** 

# Variability and Degradation





- Smaller transistors
  - Higher $\sigma$ in $V_t$
  - ~10 mV in $\sigma(V_t)$ per generation
- Transistor aging
  - Degrades drive current with time
  - Results in performance loss over time



## **NTV (Near-Threshold Voltage) for Energy Efficiency**

When designed to voltage scale



### **NTV and Variability**



#### Variability becomes worse at NTV



### Interconnect scaling: E-field increases





Decreasing line pitch increases E field resulting in lower reliability

40nm pitch & 100B+ interconnects

- E field increases, and so do the number of interconnects
- Cu migration and dielectric failures, especially with ultra-low-k ILD, and linear defects are concerns
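As a rough worked example of why a tighter pitch raises the field, treat two adjacent lines as a parallel-plate structure with dielectric spacing $s$ and voltage difference $V$ between them. The numbers are illustrative assumptions, not data from the talk: $V = 1$ V, and a 40 nm pitch split evenly between line width and spacing, so $s = 20$ nm.

$$
E \approx \frac{V}{s} = \frac{1\ \text{V}}{20\ \text{nm}} = 0.05\ \text{V/nm} = 0.5\ \text{MV/cm}
$$

Under the same assumptions, halving the spacing to $s = 10$ nm without lowering the voltage doubles the field to about $1$ MV/cm, which is the stress that drives the dielectric reliability concern.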

## Soft Errors: Cache cell



Decreasing cell to cell distance increases probability of multi-bit upset

#### Soft error rate (SER) per bit is decreasing, but the number of memory bits can double



# Soft Errors: Latch and System



- SER per latch bit is decreasing, but the number of latches doubles
- SER for cache remains ~constant
- But SER for chip logic continues to increase because SER/latch is not decreasing fast enough
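The arithmetic behind this trend can be sketched as follows: total SER is the per-bit SER times the number of bits, so if the bit count doubles each generation, the total stays flat only when the per-bit SER halves. The scaling factors below (0.5 for cache bits, 0.7 for latches) and the starting values are hypothetical, chosen only to show the shape of the trend.

```python
def project_total_ser(ser_per_bit, num_bits, per_bit_scaling, generations):
    """Project total SER (per-bit SER x bit count) over process generations.

    Assumes the bit count doubles every generation and the per-bit SER is
    multiplied by `per_bit_scaling` every generation. Units are arbitrary.
    """
    totals = []
    for _ in range(generations + 1):
        totals.append(ser_per_bit * num_bits)
        ser_per_bit *= per_bit_scaling
        num_bits *= 2
    return totals


if __name__ == "__main__":
    # Hypothetical: cache per-bit SER halves each generation.
    print("cache:", project_total_ser(1.0, 1000, 0.5, 3))
    # Hypothetical: latch per-bit SER drops only 30% each generation.
    print("logic:", project_total_ser(1.0, 1000, 0.7, 3))
```

With a per-bit factor of 0.5 the total stays constant, while any factor above 0.5 makes the total grow every generation (by 1.4x per generation for the assumed 0.7).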



### **Road to Unreliability?**

Pessimistic speculation, please do not use as data



### Will this happen?



# Faults, Effects, and Measures

| Type of Fault           | Effect                                               | Measure                                                  |
|-------------------------|------------------------------------------------------|----------------------------------------------------------|
| Permanent faults        | Fan, power supply, shorts and opens                  | Sensors for detection<br>Recover, reconfigure            |
| Gradual spatial faults  | Variations in frequency of cores                     | Screening, configuration                                 |
| Gradual temporal faults | Temperature increase<br>causing frequency loss       | Detect and correct, proactively reconfigure              |
| Intermittent faults     | Data corruption by noise or soft error, control loss | Diagnose, retry, recover                                 |
| Slow degradation        | Frequency loss<br>Erratic bits in memory             | Proactive measure,<br>testing, decommission<br>faulty HW |
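The "diagnose, retry, recover" measure for intermittent faults can be sketched in software as below. This is a minimal illustration under stated assumptions, not a method from the talk: `run_with_retry`, `PersistentFaultError`, and the `is_valid` check (standing in for a hardware error signal or a checksum) are all hypothetical names.

```python
class PersistentFaultError(RuntimeError):
    """Raised when retries are exhausted and recovery must take over."""


def run_with_retry(operation, is_valid, max_retries=3):
    """Run `operation` until `is_valid(result)` holds or retries run out.

    A transient fault (for example a soft error) is not reproducible, so a
    retry is expected to succeed. A fault that survives every retry is
    treated as persistent and escalated to the recovery path.
    Returns (result, attempts_used).
    """
    for attempt in range(1, max_retries + 1):
        result = operation()
        if is_valid(result):
            return result, attempt
    raise PersistentFaultError(
        f"operation failed {max_retries} times; escalate to recover/reconfigure"
    )


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_operation():
        # Hypothetical operation that is corrupted on its first call only.
        state["calls"] += 1
        return 42 if state["calls"] > 1 else -1

    print(run_with_retry(flaky_operation, lambda r: r == 42))
```

The design choice here is to keep the retry cheap and local, and to reserve the more expensive recover or reconfigure action for faults that do not clear.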

### Resiliency best implemented as SW and HW Co-design



### **Resiliency Framework for NFV**



- Failure detection & recovery at lower layers, to contain fault propagation to upper layers and reduce system overhead
- Predictive Failure Analysis, to catch failures before they occur and allow the system to take actions to provide HA
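One simple way to approximate predictive failure analysis is to watch the rate of corrected errors and flag a component before an uncorrectable failure occurs. The sketch below uses a sliding time window with a fixed threshold; the class name, window length, and threshold are hypothetical, and a real platform would take its error events from hardware telemetry rather than from explicit calls.

```python
from collections import deque


class HealthMonitor:
    """Flag a component when corrected errors cluster in a time window."""

    def __init__(self, window_s=60.0, threshold=3):
        self.window_s = window_s    # length of the sliding window, seconds
        self.threshold = threshold  # errors in the window that trigger a flag
        self._events = deque()      # timestamps of recent corrected errors

    def record_corrected_error(self, timestamp_s):
        """Record one corrected error; return True if action is advised."""
        self._events.append(timestamp_s)
        # Drop events that have fallen out of the window.
        while self._events and timestamp_s - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold


if __name__ == "__main__":
    monitor = HealthMonitor(window_s=10.0, threshold=3)
    for t in (0.0, 20.0, 22.0, 25.0):
        print(t, monitor.record_corrected_error(t))
```

When the monitor returns `True`, an upper layer could migrate workloads away from the component or schedule it for testing, which matches the proactive measures listed in the faults table.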

# Summary

- Integration and NFV consolidation drive resiliency challenges
- Must understand and characterize faults
- SW and HW co-design is the most effective way to achieve a resilient system
- Detect errors in HW, diagnose & correct via SW
- Solutions should be autonomous
- Solutions must incur low cost, performance and power impact



System 'health' monitoring and failure prediction form the fundamental resiliency toolkit for NFV



