### Tiny-Tail Flash

# Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs

**Shiqin Yan**, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman\*, Andrew Chien, and Haryadi S. Gunawi









### Why SSDs don't perform

From their earliest days, people have reported that SSDs were not providing the performance they expected. As SSDs age, for instance, they get slower. Here's why.

Google: Taming The Long Latency Tail -When More Machines Equals Worse Results

### Why it's hard to meet SLAs with SSDs

https://storagemojo.com/2015/06/03/why-its-hard-to-meet-slas-with-ssds/



### Why SSDs don't perform

From their earliest days, people have reported that SSDs were not providing the performance they expected. As SSDs age, for instance, they get slower. Here's why.

Google: Taming The Long Latency Tail -When More Machines Equals Worse Results

"if your read is stuck behind an erase you may have wait **IOs of milliseconds**. That's a **IOOx** increase in latency variance"

### Why it's hard to meet SLAs with SSDs











































































A GC moves tens of valid pages!





A GC moves tens of valid pages!

which makes channel/chips busy for tens of ms!





A GC moves tens of valid pages!

which makes channel/chips busy for tens of ms!





A GC moves tens of valid pages!

which makes channel/chips busy for tens of ms!





Tail-tolerant techniques in distributed/storage systems:

Leverage redundancy to cut tail!



Tail-tolerant techniques in distributed/storage systems:

Leverage redundancy to cut tail!

Full Stripe Read

**RAID:** 











Tail-tolerant techniques in distributed/storage systems:

Leverage redundancy to cut tail!



**RAID:** 



Tail-tolerant techniques in distributed/storage systems:

Leverage redundancy to cut tail!



**RAID:** 





Error rate increases → RAIN (Redundant Array of Independent NAND)



Error rate increases -> RAIN (Redundant Array of Independent NAND)

Similarly, we leverage RAIN to cut "tails"!

Full Stripe Read

A B C





Error rate increases → RAIN (Redundant Array of Independent NAND)

Similarly, we leverage RAIN to cut "tails"!



SSD:



Error rate increases → RAIN (Redundant Array of Independent NAND)

Similarly, we leverage RAIN to cut "tails"!



SSD:

New techniques:

Current SSD

**RAIN** 

technology: (Parity-based Redundancy)

I. Plane-Blocking GC

New techniques:

Current SSD

**RAIN** 

technology: (Parity-based Redundancy)

New techniques:

- I. Plane-Blocking GC
- II. GC-Tolerant Read

Current SSD

**RAIN** 

technology:

(Parity-based Redundancy)

New techniques:

- I. Plane-Blocking GC
- II. GC-Tolerant Read
- III. Rotating GC

Current SSD

**RAIN** 

technology: (Parity-based Redundancy)

New techniques:

- I. Plane-Blocking GC
- II. GC-Tolerant Read
- III. Rotating GC
- IV. GC-Tolerant Flush

Current SSD

**RAIN** 

technology: (Parity-based Redundancy)



# Results 100%



























## **Outline**

- □ Introduction
- **□**Background
- ☐ Tiny-Tail Flash Design
- □ Evaluation, limitations, conclusion





















1. read to controller (check with ECC)





- 1. read to controller (check with ECC)
- 2. write to another block





for  $(1 \dots \# \text{ of valid pages})$ :

- 1. read to controller (check with ECC)
- 2. write to another block





for  $(1 \dots \# \text{ of valid pages})$ :

- 1. read to controller (check with ECC)
- 2. write to another block





## for (1 ... # of valid pages):

- 1. read to controller (check with ECC)
- 2. write to another block









3. Erase the old block













## **Outline**

- □ Introduction
- ■Background
- ☐ Tiny-Tail Flash Design
  - Plane-Blocking GC
  - GC-Tolerant Read
  - Rotating GC
  - GC-Tolerant Flush
- □ Evaluation, limitations, conclusion



# **Plane**Blocking







#### **Base GC Logic:**

for (every valid page)

1. flash read+write (over channel)

2. wait

#### **Plane Blocking**





#### **Base GC Logic:** for (every valid page) 1. flash **Plane Blocking** (over c **GC** Logic: 2. wait for (every valid page) 1 flash read+write (inside plane)

#### **Plane Blocking**





## Base GC Logic: Plane Blocking

for (every valid page)

1. flash r (over c 2. wait

# Plane Blocking GC Logic:

for (every valid page)

- 1 flash read+write (inside plane)
- 2 serve other user I/Os





#### **Base GC Logic:**

#### **Plane Blocking**

for (every valid page)

1. flash i (over o

# Plane Blocking GC Logic:

for (every valid page)

- 1 flash read+write (inside plane)
- 2 serve other user I/Os

#### **SSD Controller**





#### **Base GC Logic:**

#### **Plane Blocking**

for (every valid page)

1. flash (over 2. wait

# Plane Blocking GC Logic:

for (every valid page)

- 1 flash read+write (inside plane)
- 2 serve other user I/Os

### **SSD Controller**



#### **Overlap**

- 1 intra-plane copyback with
- 2 channel usage for other non-GCing planes











- **Issue 1:** No ECC check for garbage-collected pages
  - (will discuss later)



- ☐ Issue 1: No ECC check for garbage-collected pages
  - (will discuss later)
- ☐ Issue 2:





- ☐ Issue 1: No ECC check for garbage-collected pages
  - (will discuss later)

#### ☐ Issue 2:





### **Outline**

- □ Introduction
- ■Background
- ☐ Tiny-Tail Flash Design
  - Plane-Blocking GC
  - RAIN + GC-Tolerant Read
  - Rotating GC
  - GC-Tolerant Flush
- □ Evaluation, limitations, conclusion













LPN (Logical Page #)





LPN (Logical Page #)

Static mapping: LPN0 → C[0]PG[0] LPN1 → C[1]PG[0]





LPN (Logical Page #)

Static mapping: LPN0  $\rightarrow$  C[0]PG[0] LPN1  $\rightarrow$  C[1]PG[0]

LPN 0, 1, 2  $\rightarrow$  **P**<sub>0,1,2</sub>





Full Stripe Read



















Issue: **partial** stripe read





Issue: **partial** stripe read





Issue: **partial** stripe read



Must generate extra

N-1 reads!



Issue: **partial** stripe read



Must generate extra

N-1 reads!



Issue: **partial** stripe read



Must generate extra

N-1 reads!

Add **contention** to other N -1 channels and planes



Issue: **partial** stripe read



Must generate extra

N-1 reads!

Add **contention** to other N -1 channels and planes

Convert to full stripe if:  $T_{extra-reads} < T_{GC}$ 









Full-stripe read 0 1 2





One parity  $\rightarrow$  cut one tail **Can't cut two tails!** 

Full-stripe read 0 1 2





One parity  $\rightarrow$  cut one tail **Can't cut two tails!** 

Full-stripe read 0 1 2





One parity  $\rightarrow$  cut one tail **Can't cut two tails!** 





One parity  $\rightarrow$  cut one tail **Can't cut two tails!** 





One parity  $\rightarrow$  cut one tail **Can't cut two tails!** 





## **Outline**

- □ Introduction
- ■Background
- ☐ Tiny-Tail Flash Design
  - Plane-Blocking GC
  - GC-Tolerant Read
  - Rotating GC
  - GC-Tolerant Flush
- □ Evaluation, limitations, conclusion









#### **Rotating!**



# **Rotating GC:**



#### **Rotating!**



# **Rotating GC:**



#### **Rotating!**



# **Rotating GC:**



Anytime, **at most 1** plane per plane group can perform GC

**Concurrent GCs** in **different** PGs are permitted.













## **Outline**

## ☐ Tiny-Tail Flash Design

- Plane-Blocking GC
- GC-Tolerant Read
- Rotating GC
- GC-Tolerant Flush (in paper)
- **□**Evaluation
- ☐ Limitations
- □ conclusion



# **Implementation**

- **SSDsim** (~2500 LOC)
  - Device simulator
- □ **VSSIM** (~900 LOC)
  - QEMU/KVM-based
  - Run Linux and applications
- OpenSSD
  - Many limitations of the simple programming model
- □ Future: ttFlash on OpenChannel SSD



## **Evaluation**

- □ Simulator: **SSDsim** (verified against hardware)
- Workload: 6 real-world traces from Microsoft Windows
- □ Settings and SSD parameters:
  - SSD size: 256GB, plane group width = 8 planes (1 parity, 7 data)

| Sizes           |             | Latencies           |                      |
|-----------------|-------------|---------------------|----------------------|
| SSD Capacity    | 256 GB      | Page Read           | $40 \mu \mathrm{s}$  |
| #Channels       | 8           | (flash-to-register) |                      |
| #Planes/channel | 8           | Page Write          | $800 \mu \mathrm{s}$ |
| Plane size      | 4 GB        | (register-to-flash) |                      |
| #Planes/chip    | <b>**</b> 1 | Page data transfer  | $100 \mu \mathrm{s}$ |
| #Blocks/plane   | 4096        | (via channel)       |                      |
| #Pages/block    | 256         | Block erase         | 2 ms                 |
| Page size       | 4 KB        |                     |                      |





































#### Evaluated on 6 windows workload traces with various characteristics





## **Other Evaluations**



## **Other Evaluations**

- Filebench on VSSIM+ttFlash
  - ttFlash achieves better average latency than base case





## **Other Evaluations**

- Filebench on VSSIM+ttFlash
  - ttFlash achieves better average latency than base case

- Vs. Preemptive GC
  - ttFlash is more stable than semi-preemptive GC
    - (If no idle time, preemptive GC will create GC backlogs, creating latency spikes)





- ttFlash depends on RAIN
  - I parity for N parallel pages/channels
  - We set N = 8, so we lose one channel out of 8 channels.
  - Average latencies are 1.09 1.33x slower than NoGC, No-RAIN case

- ttFlash depends on RAIN
  - I parity for N parallel pages/channels
  - We set N = 8, so we lose one channel out of 8 channels.
  - Average latencies are 1.09 1.33x slower than NoGC, No-RAIN case
- $\square$  RAID  $\rightarrow$  more writes (P/E cycles)
  - ttFlash increases P/E cycles by 15 18% for most of workloads
  - Incur > 53% P/E cycles for TPCC, MSN (random write)

- ttFlash depends on RAIN
  - I parity for N parallel pages/channels
  - We set N = 8, so we lose one channel out of 8 channels.
  - Average latencies are 1.09 1.33x slower than NoGC, No-RAIN case
- $\square$  RAID  $\rightarrow$  more writes (P/E cycles)
  - ttFlash increases P/E cycles by 15 18% for most of workloads
  - Incur > 53% P/E cycles for TPCC, MSN (random write)
- □ ECC is **not** checked during GC
  - Suggest background scrubbing (read is fast & not as urgent as GC)
  - Important note: in ttFlash, foreground/user reads are still ECC checked



### **Tails under Write Bursts**

ttFlash 55MB/s Latency CDF w/ Write Bursts



Under write burst and at high watermark, ttFlash must dynamitcally disable Rotating GC to ensure there is always enough number of free pages.











technology: Powerful Controller

**RAIN** (parity-based redundancy)

Capacitor-backed RAM

#### New techniques:

- I. Plane-Blocking GC
- II. GC-Tolerant Read
- III. Rotating GC
- IV. GC-Tolerant Flush



technology: Powerful Controller

**RAIN** (parity-based redundancy)

Capacitor-backed RAM



#### New techniques:

- Plane-Blocking GC
- II. GC-Tolerant Read
- III. Rotating GC
- IV. GC-Tolerant Flush

technology: Powerful Controller

**RAIN** (parity-based redundancy)
Capacitor-backed RAM

ttFlash GC-induced long tail (Percentile) Overall results achieved: Between 99 - 99.99th percentiles: ttFlash I-3x slower than NoGC

Base 5-138x slower than NoGC

# Thank you! Questions?



