# IODA: A Host/Device Co-Design for Strong Predictability Contract on Modern Flash Storage

Huaicheng Li<sup>1,2</sup>, Martin L. Putra<sup>1</sup>, Ronald Shi<sup>1</sup>, Xing Lin<sup>3</sup>, Gregory R. Ganger<sup>2</sup>, Haryadi S. Gunawi<sup>1</sup>

The 28th ACM Symposium on Operating Systems Principles (SOSP'21)

<sup>1</sup>University of Chicago, <sup>2</sup>Carnegie Mellon University, <sup>3</sup>NetApp




"Small but powerful"

## "Attack of GC" – Unpredictability in SSDs



## "The Tail Menace" in Flash Arrays





[Figure: a flash array of four SSDs, SSD0 to SSD3]




## A slow SSD makes the entire flash array slow!
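The claim above can be illustrated with a little arithmetic. This is a minimal sketch, not taken from the slides: the latency numbers are made up, and the probability formula assumes each SSD is busy independently with the same probability, which is a simplifying assumption.

```python
def full_stripe_latency_us(per_ssd_latency_us):
    """A full-stripe read finishes only when its slowest sub-read finishes."""
    return max(per_ssd_latency_us)


def prob_stripe_hits_busy_ssd(p_busy, n_ssds):
    """Chance that at least one of n_ssds is busy, assuming independence."""
    return 1.0 - (1.0 - p_busy) ** n_ssds


# Hypothetical numbers: three idle SSDs and one SSD (SSD2) stalled by GC.
latencies_us = {"SSD0": 100, "SSD1": 100, "SSD2": 5000, "SSD3": 100}
stripe_us = full_stripe_latency_us(latencies_us.values())

# If each SSD is busy 1% of the time, a 4-SSD stripe is affected more often.
p_any_busy = prob_stripe_hits_busy_ssd(0.01, 4)
```

Under these assumptions the stripe latency equals that of the single slow device, and widening the array raises the chance that some device is busy.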



## "A New Hope" – NVMe Predictable Latency Mode

NVMe Predictable Latency Mode (**PLM**)

A major leap

- Predictable/Busy Time Window (TW)
- Device status query & toggling

But insufficient

- Coarse-grained device-level predictability
- "Soft-contract" breaking predictability
- Requiring complex status tracking
- ...



How to leverage NVMe PLM and enhance it for predictable latencies?

#### The IODA Story

- Goal: a tail-free flash array system on top of a slightly extended PLM interface
- Design principles:
  - Simple policies for efficiency
  - Minimal changes for easy deployment
- IODA approach/techniques:
  - Per-I/O latency predictability
  - Busy Remaining Time (BRT) exposure
  - Time Window (TW) formulation
  - An end-to-end design exploiting the above extensions



- Background & Motivation
- IODA Overview
- IODA Design
  - Predictable latency flagged I/Os
  - Busy remaining time
  - Time window formulation
  - Relaxed TW for better write amplification
- Evaluation
- Summary

#### Leverage Redundancy for Performance

- Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs
- Trimming the Tail for Deterministic Read Performance in SSDs
- Latency Reduction and Load Balancing in Coded Storage Systems
- RAIL: Predictable, Low Tail Latency for NVMe
- MittOS: Supporting Millisecond Tail Tolerance with Fast Rejecting SLO-Aware OS Interface

#### An old, effective idea; Yet, challenging for PLM

When to issue the parity reads?

1. Wait for a timeout: picking the best threshold is tricky.
2. Always proactive (always send full-stripe reads): the increased load makes this inefficient.
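Both options rest on the same mechanism: a chunk that is slow to read can be rebuilt from the rest of the stripe. The following is a minimal sketch of that reconstruction, assuming a single-parity (RAID-5 style) stripe with XOR parity; the helper names are hypothetical and are not taken from the IODA code.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def make_parity(data_chunks):
    """Parity chunk of a stripe is the XOR of its data chunks."""
    return xor_chunks(data_chunks)


def reconstruct(surviving_chunks, parity):
    """Rebuild the one missing data chunk from the survivors plus parity."""
    return xor_chunks(list(surviving_chunks) + [parity])


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)

# Pretend the SSD holding data[1] is slow: rebuild it from the other chunks.
rebuilt = reconstruct([data[0], data[2]], parity)
```

The reconstruction itself is cheap; the hard part the slide points at is deciding when to pay for the extra reads.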

#### Semantic gap between the Host and SSD to communicate the "busyness"



## IOD<sub>1</sub>: Predictable Latency Flagged I/Os



"Fail-if-Slow": the SSD should fast-fail an I/O if it contends with GC

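A minimal sketch of how a host read path could use such a fast-fail contract, assuming a single-parity stripe. `SimSSD`, `FAST_FAIL`, and `read_chunk` are hypothetical names for illustration; they do not mirror the NVMe fields or the kernel changes in IODA.

```python
from dataclasses import dataclass
from functools import reduce

FAST_FAIL = object()  # hypothetical "failed fast" completion status


@dataclass
class SimSSD:
    chunk: bytes           # the chunk this device holds for one stripe
    gc_busy: bool = False  # True while the device is garbage collecting

    def read(self, predictable: bool):
        # A flagged read fails immediately instead of queueing behind GC.
        if predictable and self.gc_busy:
            return FAST_FAIL
        return self.chunk  # an unflagged read would wait, then return data


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def read_chunk(stripe, target):
    """Read one chunk; on fast-fail, rebuild it from the rest of the stripe."""
    result = stripe[target].read(predictable=True)
    if result is not FAST_FAIL:
        return result, "direct"
    # Reconstruction reads are left unflagged here to keep the sketch short.
    others = [ssd.read(predictable=False)
              for i, ssd in enumerate(stripe) if i != target]
    return xor_chunks(others), "reconstructed"


data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = [SimSSD(c) for c in data] + [SimSSD(xor_chunks(data))]  # last = parity
stripe[1].gc_busy = True  # SSD1 is garbage collecting
```

With this setup, `read_chunk(stripe, 1)` takes the reconstruction path while `read_chunk(stripe, 0)` is served directly, so the host never needs to guess a timeout.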










#### The Effectiveness of "Fail-if-Slow" Interface



#### A Case Against Proactive Reconstruction



Semantic Gap: the host doesn't know how long SSD "busyness" will last



End up waiting for the busiest SSD

## Busy Remaining Time (BRT) Exposure







Piggybacking BRT to reconstruct data from less busy SSDs
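One way to read this slide: when more SSDs fast-fail than the parity can cover, the host has to wait for some of them, and the piggybacked BRT tells it which wait is shortest. The sketch below is an interpretation under that assumption; `plan_degraded_read` and the BRT values are hypothetical.

```python
def plan_degraded_read(brt_us_by_ssd, parity_count=1):
    """Split fast-failed SSDs into those to wait for and those to skip.

    brt_us_by_ssd maps each fast-failed SSD to its reported busy remaining
    time in microseconds. Up to parity_count chunks can be rebuilt, so the
    busiest SSDs are skipped and the least busy ones are re-read.
    """
    by_brt = sorted(brt_us_by_ssd, key=brt_us_by_ssd.get)
    n_wait = max(0, len(by_brt) - parity_count)
    return by_brt[:n_wait], by_brt[n_wait:]


# Two SSDs fast-failed on a single-parity stripe: only one can be skipped.
wait_for, skip = plan_degraded_read({"SSD1": 300, "SSD3": 4000})
extra_wait_us = max((300 if s == "SSD1" else 4000) for s in wait_for)
```

Without BRT the host could not tell the two devices apart and might end up waiting on the busier one.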



#### The Effectiveness of "BRT" Interface



## **IODA Busy Latency Windows**







TW Coordination: SSDs take turns to perform GCs
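A minimal sketch of such turn-taking, assuming fixed-length windows handed out round-robin so that at most one SSD is in its busy window at any time (which a single-parity stripe can tolerate). The window length and function names are hypothetical; the slides do not specify this exact schedule.

```python
def busy_ssd(now_us, tw_us, n_ssds):
    """Index of the only SSD allowed to be in its busy window at now_us."""
    return (now_us // tw_us) % n_ssds


def may_run_gc(ssd_id, now_us, tw_us, n_ssds):
    """An SSD may run GC only during its own busy window."""
    return busy_ssd(now_us, tw_us, n_ssds) == ssd_id


TW_US = 100_000  # hypothetical window length (100 ms)
N_SSDS = 4

# Which SSD owns each of the first eight windows.
schedule = [busy_ssd(t, TW_US, N_SSDS) for t in range(0, 8 * TW_US, TW_US)]
```

Here `schedule` cycles through the SSDs in order, and every other SSD stays in its predictable window while one device cleans.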



#### IODA Time Window (TW) Formulation



TW Upper Bound



#### More in the paper!

- IODA TW analysis
  - 6 SSD models
  - Relaxed TW
  - TW vs. WAF tradeoffs
- Implementation
  - Platforms: FEMU + OpenChannel-SSD
  - Kernel: Linux Software-RAID + NVMe
- More evaluation results
  - 9 datacenter block traces + 21 real applications
  - IODA vs. 7 state-of-the-art approaches
  - IODA on OpenChannel-SSD
  - IODA throughput and write latency
  - ...

#### IODA: A Host/Device Co-Design for Strong Predictability Contract on Modern Flash Storage

Huaicheng Li (University of Chicago and Carnegie Mellon University), Martin L. Putra (University of Chicago), Ronald Shi (University of Chicago), Xing Lin (NetApp), Gregory R. Ganger (Carnegie Mellon University), Haryadi S. Gunawi (University of Chicago)

#### Abstract

Predictable latency on flash storage is a long-pursuit goal, yet, unpredictability stays due to the unavoidable disturbance from many well-known SSD internal activities. To combat this issue, the recent NVMe IO Determinism (IOD) interface advocates host-level controls to SSD internal management tasks. While promising, challenges remain on how to exploit it for truly predictable performance.

We present IODA, an I/O deterministic flash array design built on top of small but powerful extensions to the IOD interface for easy deployment. IODA exploits data redundancy in the context of IOD for a strong latency predictability contract. In IODA, SSDs are expected to quickly fail an I/O on purpose to allow predictable I/Os through proactive data reconstruction. In the case of concurrent internal operations, IODA introduces busy remaining time exposure and predictable-latency-window formulation to guarantee predictable data reconstructions. Overall, IODA only adds 5 new fields to the NVMe interface and a small modification in the flash firmware, while keeping most of the complexity in the host OS. Our evaluation shows that IODA improves the 95-99.99th latencies by up to 75×. IODA is also the nearest to the ideal, no disturbance case compared to 7 state-of-the-art preemption, suspension, GC coordination, partitioning, tiny-tail flash controller, prediction, and proactive approaches.

#### CCS Concepts

• Computer systems organization → Firmware; Embedded hardware; Embedded software; • Information systems → Flash memory; • Hardware → Emerging interfaces.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SOSP '21, October 26-29, 2021, Virtual Event, Germany

© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ACM ISBN 978-1-4503-8709-5/21/10...\$15.00

https://doi.org/10.1145/3477132.3483573

#### Keywords

Software/Hardware Co-Design, Predictable Latency, NVMe I/O Determinism, SSD, Flash Storage

#### ACM Reference Format

Huaicheng Li, Martin L. Putra, Ronald Shi, Xing Lin, Gregory R. Ganger, and Haryadi S. Gunawi. 2021. IODA: A Host/Device Co-Design for Strong Predictability Contract on Modern Flash Storage. In ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP '21), October 26-29, 2021, Virtual Event, Germany. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3477132.3483573

#### 1 Introduction

Flash arrays are popular storage choices in data centers and they must address users' craving for low and predictable latencies [1-3]. Thus, many recent SSD products are released and evaluated not just on the average speed but the percentile latencies as well [4-7]. These all paint the reality that customers would like SSDs with deterministic latencies.

Deterministic latency, however, is hard to achieve because SSD performance is inherently non-deterministic due to the internal management activities such as the garbage collection (GC) process, wear leveling, and internal buffer flush [8–10]. These activities will inevitably trigger many background I/Os and disturb user requests. Notably, GC is a necessary path to overcome NAND Flash's inability for in-place overwrites. It involves time-consuming data movement to reclaim space [...]. For illustration, the figure on the right shows the gap between the "Base" (with GC) and the "Ideal" (no GC) cases. Modern SSDs often resort to large over-provisioning space (e.g., up to 50% of the SSD's raw NAND capacity) [11] [...]

To tame the SSD performance challenges, there have been many efforts to evolve the device interfaces [15-17]. The Stor

#### **IODA Stack and Evaluation Setup**



#### **IODA** Evaluation





IODA results: 95th to 99.99th percentile latencies

Up to 75x improvement over Base



#### IODA Throughput



IODA doesn't sacrifice the array's aggregate bandwidth

#### **IODA Takeaways**

- A Co-Design Approach for Performance Predictability
  - Proactive reconstruction via fast-fail interface
  - BRT for improved latencies
  - TW formulation to program the window length
  - Cross-device synchronization

## Thank you!

I'm on the job market.

IODA: https://github.com/huaicheng/IODA