Solid State Drive (SSD) Health Monitoring

https://www.vita.virginia.gov/media/vitavirginiagov/it-governance/ea/pdf/Boot-via-SAN__vs__Boot-via-DASD__Technical-Brief.pdf
Image Credit: Virginia Information Technologies Agency

Posted on January 30, 2019 | Completed on January 1, 2019 | By: Scott E. Armistead

What technologies can autonomously assess the health of solid-state drives (SSDs) and predict potential failures?

 

The Defense Systems Information Analysis Center (DSIAC) received a technical inquiry requesting information on technologies that can, without operator action, assess the health of solid-state drives (SSDs) to measure performance and predict potential failures.

DSIAC staff reviewed information found using the Defense Technical Information Center Research and Engineering Gateway and open sources on Self-Monitoring, Analysis, and Reporting Technology (SMART) embedded in modern computer storage media devices and their control electronics.  DSIAC also reviewed information on SSD reliability and software utilities provided by original equipment manufacturers (OEMs) or second parties for SSD health monitoring. This information was provided to the requester.

For normal consumer and many industry applications, DSIAC found that both OEM and second-party health-monitoring software can be used to continually monitor SSDs and automatically send notifications about SSD issues.  The tools generally provide a conservative notification that allows the user to replace the drive prior to losing data integrity. However, for critical applications, these tools are of limited use in predicting the date or time an SSD will actually fail. Additionally, other factors, such as the following, bring into question the accuracy/reliability of the reported information on SSD issues, especially when using second-party utilities:

  • Loose industry standards for determining SSD reliability and reporting of SMART attribute data.
  • Discrepancies in how different manufacturers define their reported SMART attributes.
  • The failure of some manufacturers to fully disclose what their SMART attributes are or specifics in data they report.
  • The possible lack of error logs for assessing past performance.
  • The possible lack of environmental sensor data.

Some of these factors could be mitigated by matching a manufacturer’s SSD to the health-monitoring tools they developed to monitor it.

When considering employment of SSDs in critical Department of Defense applications, DSIAC found that health-monitoring tools likely do not provide the necessary risk mitigation.  The unexpected and unpredictable manner in which many SSDs fail may necessitate additional measures, like more frequent backups, scheduled early drive replacement, built-in redundancy, and/or use of drive array configurations that can automatically rebuild a failed drive.

 


1.0  Introduction

The inquirer requested information on technologies that can, without operator action, assess the health of SSDs to measure performance and predict potential failures.  The inquirer specified that the technology must accomplish these tasks without disrupting the data on the drive or causing additional damage.  The technology must also provide the option for an intuitive user interface to display the results of the assessment.

The Defense Systems Information Analysis Center (DSIAC) staff conducted research using the Defense Technical Information Center (DTIC) Research and Engineering (R&E) Gateway and open sources to find information on the following:

  • SSD reliability.
  • Self-Monitoring, Analysis, and Reporting Technology (SMART) attributes.
  • SSD manufacturer drive health-monitoring tools.
  • Second-party drive health-monitoring tools.
  • Industry standards related to drive health monitoring and reporting.

DSIAC staff also reviewed the results of previous testing and analysis related to SSD reliability. The results of the research are consolidated in this response report.

 


2.0  SMART for Computer Storage Media Health Monitoring

2.1  A Brief History of SMART

Predictive failure technology for computer storage media was introduced into computer systems in the early 1990s; however, this technology initially provided a binary result of either the device being functional or likely to fail soon, which was insufficient for predictive analytics [1].

In the mid-1990s, Compaq developed the Intellisafe utility, which could measure a disk’s health parameters and values and transfer them to the operating system and user-space monitoring software.  Although this utility provided more advanced health-monitoring capabilities, the system was not standardized, and each disk manufacturer independently decided which parameters would be provided for monitoring and what thresholds would be used for reporting analytics [1].

In 1995, Compaq, with support from most of the hard disk drive (HDD) manufacturers, submitted Intellisafe to the small form factor (SFF) committee for standardization, and it was adopted under the name of SMART (or S.M.A.R.T.).  In 1995, Compaq also placed Intellisafe in the public domain [1].

SMART technology is now included in all computer HDDs, SSDs, and embedded MultiMediaCard drives.  Its primary function is to detect and report various indicators of drive reliability with the intent of anticipating imminent hardware failures [1].

Each drive manufacturer normally defines their own set of SMART reported attributes and establishes threshold values that attributes should not exceed during normal operations.  Each attribute has a raw value, whose meaning is entirely up to the drive manufacturer (but often corresponds to counts or a physical unit, such as degrees Celsius or seconds), a normalized value, which ranges from 1 to 253 (1 represents the worst case and 253 represents the best), and a worst value, which represents the lowest recorded normalized value.  The initial default value of attributes is 100 but can vary among manufacturers.  Associated with these attributes may be a Threshold Exceeds Condition (TEC), which is an estimated date when a critical drive statistic attribute will reach its threshold value.  When drive health-monitoring software reports a “Nearest TEC,” it should be regarded as a “Failure date.”  To predict the date, the drive tracks the rate at which the attribute changes [1].
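The attribute fields and the TEC estimate described above can be illustrated with a minimal sketch. The class name, field names, and the linear extrapolation are assumptions made for illustration only; the source does not say how any particular drive or tool computes its TEC date, and the comparison rule (flag when the normalized value falls to or below the threshold) is a commonly used convention rather than something every manufacturer is confirmed to follow.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class SmartAttribute:
    """One SMART attribute (hypothetical field names, for illustration)."""
    name: str
    raw: int         # meaning is defined by the manufacturer
    normalized: int  # 1 (worst) to 253 (best)
    worst: int       # lowest normalized value recorded so far
    threshold: int   # manufacturer-set limit

    def threshold_exceeded(self) -> bool:
        # Assumed convention: lower normalized values are worse, and the
        # attribute is flagged once it falls to or below its threshold.
        return self.normalized <= self.threshold


def estimate_tec(first_day: date, first_value: int,
                 last_day: date, last_value: int,
                 threshold: int) -> Optional[date]:
    """Linearly extrapolate the date the normalized value reaches threshold."""
    elapsed_days = (last_day - first_day).days
    drop = first_value - last_value
    if last_value <= threshold:
        return last_day  # already at or below the threshold
    if elapsed_days <= 0 or drop <= 0:
        return None  # no measurable decline, so no projected date
    rate_per_day = drop / elapsed_days
    days_left = (last_value - threshold) / rate_per_day
    return last_day + timedelta(days=round(days_left))


# A value that fell from 100 to 90 over 100 days, with a threshold of 10,
# is projected to reach the threshold 800 days after the last sample.
projected = estimate_tec(date(2019, 1, 1), 100, date(2019, 4, 11), 90, 10)
```

A straight-line projection like this is only as good as the assumption that the attribute keeps changing at the observed rate, which is one reason a reported TEC date is better read as a conservative warning than as a precise failure date.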

2.2  Parallel Advanced Technology Attachment (PATA)

The technical documentation for SMART is in the PATA standard [1]; it uses the underlying Advanced Technology Attachment (ATA) and ATA Packet Interface (ATAPI) standards.  PATA provides an interface standard for the connection of storage devices such as HDDs, optical disc drives, and SSDs in computers.  The standard is maintained by the X3/InterNational Committee for Information Technology Standards (INCITS) committee [2].

In late 2016, INCITS began to standardize the descriptions of SMART attributes and produce a report, INCITS/TR-54-201x, to be registered with the American National Standards Institute (ANSI).  As of September 2019, the report was approximately 10% complete; the following is an excerpt from the scope summary [3]:

SMART (Self-Monitoring Analysis and Reporting Technology) has been in the industry for 20+ years and has recently become obsolete in ACS-4.  SMART is capable of reporting information about the storage device’s condition through attributes.  These attributes have been vendor specific since the creation of the capability.  During the last 20 years, many publications have been created that document these attributes with conflicting definitions. This has lead [led] to diverging implementation of these attributes.  There are many interested parties attending T13 that can agree on the meaning of some of these attributes. This technical report is intended to document the attributes where the committee can reach agreement.

It should be noted that with the introduction of the Serial ATA (SATA) interface in 2003, the use of PATA has significantly declined, and some motherboard chipset manufacturers have removed support for PATA.  Since late 2013, no HDDs with the PATA interface have been produced [2].  Common modern SSD interface types (e.g., SATA, Serial Attached Small Computer System Interface [SCSI][SAS], ATA/Integrated Drive Electronics, Peripheral Component Interconnect Express [PCIe], Universal Serial Bus [USB], etc.) do report SMART attribute type data; however, they do not report exactly the same data that would be relevant to an HDD.  Using health-monitoring software designed for HDDs may incorrectly report the status of SSDs due to the missing data; therefore, it is important to use health-monitoring tools specifically designed for SSDs [4].

 


3.0  Using SMART for Predicting Drive Failure

Continuous monitoring of SMART data is possible and can indicate imminent SSD failure.  Software on the host system can automatically notify the user so that action can be taken to prevent data loss; however, using SMART data to predict “exactly” when SSD failure will occur is more problematic due to the following [1–4]:

  • Available SMART data may not correlate directly to SSD reliability or failure.
  • SMART attribute data are not standardized.
  • Some drive manufacturers intentionally leave attributes undocumented and consider the information proprietary.
  • Error logs for assessing past performance may not be maintained.
  • The meaning and interpretation of attributes vary among drive manufacturers, and environmental sensor data (e.g., temperature) may not be available.
  • Some SMART-enabled motherboards and related software may not communicate at all with certain SMART-capable drives (e.g., external USB and FireWire connected drives).
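As a rough illustration of the host-side monitoring described above, the sketch below scans a SMART attribute report and lists the attributes that have reached their thresholds. The JSON layout (an `ata_smart_attributes` object holding a `table` of entries with `id`, `name`, `value`, and `thresh`) is an assumption modeled on the JSON output of the smartmontools `smartctl` utility; other tools, drives, and interfaces may report different structures, and the sample report is invented for the example.

```python
import json
from typing import Dict, List

# Invented sample report; the layout is an assumption (see lead-in).
SAMPLE_REPORT = """
{
  "ata_smart_attributes": {
    "table": [
      {"id": 5, "name": "Reallocated_Sector_Ct", "value": 100, "worst": 100, "thresh": 10},
      {"id": 177, "name": "Wear_Leveling_Count", "value": 4, "worst": 4, "thresh": 5},
      {"id": 194, "name": "Temperature_Celsius", "value": 64, "worst": 52, "thresh": 0}
    ]
  }
}
"""


def attributes_at_or_below_threshold(report_json: str) -> List[Dict]:
    """Return the attributes whose normalized value has reached its threshold."""
    report = json.loads(report_json)
    table = report.get("ata_smart_attributes", {}).get("table", [])
    flagged = []
    for attr in table:
        value = attr.get("value")
        thresh = attr.get("thresh", 0)
        if value is None:
            continue  # attribute entry carries no normalized value
        # A threshold of 0 is treated here as "informational only".
        if thresh > 0 and value <= thresh:
            flagged.append({"id": attr.get("id"), "name": attr.get("name"),
                            "value": value, "thresh": thresh})
    return flagged


flagged_names = [a["name"] for a in attributes_at_or_below_threshold(SAMPLE_REPORT)]
```

A scheduler could call a function like this periodically and send a notification whenever the returned list is not empty. It cannot compensate for the limitations listed above, since an attribute that is undocumented or missing from the report is simply never evaluated.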

Additionally, one tool may report the drive as failing, while another reports it as healthy depending on how they interpret reported SMART attributes.  This can, and does, lead to unexpected and unpredicted SSD failures [1–4].
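The disagreement between tools can be shown with a small hypothetical case. Both rules below are illustrative assumptions, not the logic of any real product: one trusts the manufacturer's threshold on the normalized value, and the other treats any nonzero raw count as a problem.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """A single attribute reading (hypothetical names and values)."""
    name: str
    raw: int
    normalized: int
    threshold: int


def vendor_threshold_verdict(reading: Reading) -> str:
    # Policy A: rely on the manufacturer's normalized value and threshold.
    return "failing" if reading.normalized <= reading.threshold else "healthy"


def raw_count_verdict(reading: Reading, max_raw: int = 0) -> str:
    # Policy B: interpret the raw value directly as an error count.
    return "failing" if reading.raw > max_raw else "healthy"


reading = Reading("reallocated_sectors", raw=12, normalized=98, threshold=10)
verdict_a = vendor_threshold_verdict(reading)  # "healthy"
verdict_b = raw_count_verdict(reading)         # "failing"
```

The same reading yields opposite verdicts, and neither policy is obviously wrong without knowing how the manufacturer defines the raw value.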

Therefore, SMART attribute data may work best when drives from a specific manufacturer are paired with software tools designed specifically to monitor the health of that manufacturer’s SSDs [4].  Even so, only some drive manufacturers and drive health utilities provide continuous monitoring and/or predictive failure analytics; this situation is further complicated by the fact that there are nearly 100 SSD manufacturers [5].

An example of a tool that is specifically designed to monitor SSD health is Innodisk’s [6] iSmart Diagnostic and Monitoring System, which is offered alongside Innodisk’s solid-state storage solutions (e.g., industrial SSDs, network attached storage [NAS] SSD arrays, Disk on Module flash memory devices, etc.). iSmart software can be used with many brands of SSDs and can produce analytics data based on SMART attributes, such as graphs of drive wear leveling (a technique used in SSDs to prolong service life) or expected drive life span and end-of-life date (see Figure 1).  However, available SMART attribute-related data are more extensive for Innodisk products than other brands, and the life span graph is only available for Innodisk products [7].

 

Figure 1:  (Left) iSmart Drive Wear Leveling Graph; (Right) iSmart Life Span Graph [8].

 

The author’s review of open-source material and consumer ratings indicates that second-party software developers have managed to reverse engineer access to and much of the meaning behind the SMART data from many original equipment manufacturer (OEM) SSD devices.  From this reverse engineering, these developers have produced algorithms and health-monitoring tools that can be used across a broad spectrum of OEM SSD products.  Based on this review of open-source material and consumer ratings, it is the author’s opinion that OEM SSD products have performed well and would be deemed suitable for consumer and industry noncritical applications.

However, for the use of SSDs in critical Department of Defense (DoD) applications, the health-monitoring tools alone would not likely provide the necessary risk mitigation and would require additional measures such as more frequent backups, scheduled early drive replacement, built-in redundancy, use of drive array configurations that can automatically rebuild a failed drive, etc.

A detailed discussion of issues related to using SMART attributes [1] and a list of SSD manufacturers [5] can be found in Wikipedia.

 


4.0  Flash Memory Device Reliability

Though industry has useful theories and models of how SSDs “should” perform, there is little openly available, long-term performance data.  Manufacturers usually provide laboratory-based performance (e.g., accelerated life testing) data for a family of their drives in terms of quantities related to averaged failures over a group of drives for a given time, averaged quantity of data that can be written to a drive, or averaged number of times a drive or memory cell can be written to.  In the article “SSD Reliability:  Is Your SSD Less Reliable Than A Hard Drive?” five common forms of SSD reliability ratings provided by drive manufacturers are discussed [9]:

  • Terabytes written (TBW). Terabytes of data that can be written over the lifetime of the SSD. TBW is one of the more prevalent forms of SSD reliability ratings, but the least useful, as most drives would fail of old age before reaching this number.
  • Programmed and erased cycles (P/E). The number of P/E cycles that drive memory cells can process in their lifetime (the number varies, as there are different types of memory cells).
  • Gigabytes (GB) per day. How many GBs of data are being saved/overwritten per day.
  • Drive writes per day (DWPD). How many times you can rewrite the entire drive per day.
  • Mean time between failures. Predicted time elapsed between inherent failures during normal operation.
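The ratings above are related by simple arithmetic, sketched below. The formulas assume decimal units (1 TB = 1,000 GB) and a rating period expressed in years; the drive figures in the example are hypothetical and are not taken from any manufacturer's data sheet.

```python
DAYS_PER_YEAR = 365
GB_PER_TB = 1000  # decimal units, as an assumption


def dwpd_from_tbw(tbw_tb: float, capacity_gb: float, rating_years: float) -> float:
    """Drive writes per day implied by a TBW rating over the rating period."""
    return (tbw_tb * GB_PER_TB) / (capacity_gb * DAYS_PER_YEAR * rating_years)


def tbw_from_dwpd(dwpd: float, capacity_gb: float, rating_years: float) -> float:
    """TBW implied by a DWPD rating over the rating period."""
    return dwpd * capacity_gb * DAYS_PER_YEAR * rating_years / GB_PER_TB


def years_to_reach_tbw(tbw_tb: float, gb_written_per_day: float) -> float:
    """Years of writing at a steady GB/day rate before the TBW rating is reached."""
    return (tbw_tb * GB_PER_TB) / gb_written_per_day / DAYS_PER_YEAR


# Hypothetical 250 GB drive rated for 150 TBW over 5 years.
dwpd = dwpd_from_tbw(150, 250, 5)            # about 0.33 drive writes per day
years = years_to_reach_tbw(150, 20)          # about 20.5 years at 20 GB/day
```

The second result shows why a TBW rating alone says little about service life for light workloads: at 20 GB per day, this hypothetical drive would take roughly two decades to reach its rating.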

Though DSIAC could find these types of ratings for groups of SSD types, DSIAC found little valuable statistical data on operational (i.e., in the field), long-term performance of specific SSDs.  DSIAC also found little statistical data on the ability of health-monitoring software/hardware to maximize drive life or accurately predict the date or time of a specific SSD failure.

Of note is the 2007 study by Google, “Failure Trends in a Large Disk Drive Population,” which analyzes over 100,000 consumer-grade HDDs and the use of SMART attributes in predicting drive failures. The study found that 56% of the HDDs failed without recording any count in the “four strong” SMART warnings, and 36% failed without recording any SMART error at all.  The following is an excerpt from the study [10]:

Our analysis identifies several parameters from the drive’s self-monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.

4.1  Studies on Long-Term SSD Reliability

DSIAC found three studies published between 2014 and 2016 that provide some insight into long-term SSD reliability.

“Flash Reliability in Production:  The Expected and The Unexpected” [11]

In 2016, the University of Toronto and Google jointly published a paper on reliability of different data server flash-based storage technologies (multilevel cell [MLC]–3,000 to 10,000 write cycle rating; enterprise MLC [eMLC]–20,000 to 30,000 write cycle rating; and single-level cell [SLC]–100,000 write cycle rating).  The following is an excerpt of the study, titled “Flash Reliability in Production:  The Expected and The Unexpected” [11]:

As solid-state drives based on flash technology are becoming a staple for persistent data storage in data centers, it is important to understand their reliability characteristics.  While there is a large body of work based on experiments with individual flash chips in a controlled lab environment under synthetic workloads, there is a dearth of information on their behavior in the field.  This paper provides a large-scale field study covering many millions of drive days, ten different drive models, different flash technologies (MLC, eMLC, SLC) over 6 years of production use in Google’s data centers.  We study a wide range of reliability characteristics and come to a number of unexpected conclusions.  For example, raw bit error rates (RBER) grow at a much slower rate with wear-out than the exponential rate commonly assumed and, more importantly, they are not predictive of uncorrectable errors or other error modes.  The widely used metric UBER (uncorrectable bit error rate) is not a meaningful metric, since we see no correlation between the number of reads and the number of uncorrectable errors.  We see no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes.  Comparing with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field, however, they have a higher rate of uncorrectable errors.

The study covers millions of drive days over a 6-year period, 10 different drive models (including enterprise and consumer models), and three different types of flash memory (i.e., MLC, eMLC, and SLC).  Results indicated that the physical age of the SSD, rather than the amount or frequency of data written, is the prime determiner in probability of data retention errors [11].

In one review of this study, author M. Crider (2017) notes that “SSD drives were replaced at Google data centers far less often than conventional hard drives, at about a one to four ratio,” and concludes that “in a high-stress, fast-read environment, SSDs will last longer than hard drives, but be more susceptible to non-catastrophic data errors” [12].

In another review of the study, author R. Harris draws the following conclusions [13]:

  • Ignore Uncorrectable Bit Error Rate (UBER) specs because UBER is a meaningless metric.
  • The good news is that the Raw Bit Error Rate (RBER) increases slower than expected from wearout, and it is not correlated with UBER or other failures.
  • High-end SLC drives are no more reliable than MLC drives.
  • SSDs fail at a lower rate than disks, but the UBER rate is higher (the SSD is less likely to fail during its normal life, but more likely to lose data; therefore, routinely backing up is even more important for SSDs).
  • SSD age, not usage, correlates with error rates and affects reliability (i.e., older drives are more prone to total failure regardless of TBW or DWPD).
  • Bad blocks in new SSDs are common, and drives with a large number of bad blocks are much more likely to lose hundreds of other blocks. This loss of blocks is most likely due to die or chip failure.
  • Approximately 30–80% of SSDs develop at least one bad block, and 2–7% develop at least one bad chip in their first 4 years of deployment.

“A Large-Scale Study of Flash Memory Failures in the Field” [14]

In 2015, Facebook, Inc. and Carnegie Mellon University published the study “A Large-Scale Study of Flash Memory Failures in the Field,” which focused on the use of SSDs as a high-performance alternative to HDD to store persistent data.  The following is an excerpt of the abstract from the study [14]:

This paper presents the first large-scale study of flash-based SSD reliability in the field. We analyze data collected across a majority of flash-based solid-state drives at Facebook data centers over nearly four years and many millions of operational hours in order to understand failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power.

Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, this paper is the first to make several major observations…

In Section 8 of the study, the authors present a summary of their five key observations [14]:

Observation 1:  We observe that SSDs go through several distinct failure periods – early detection, early failure, usable life, and wearout – during their lifecycle, corresponding to the amount of data written to flash chips.

Due to pools of flash blocks with different reliability characteristics, failure rate in a population does not monotonically increase with respect to amount of data written to flash chips. This is unlike the failure rate trends seen in raw flash chips.

We suggest that techniques should be designed to help reduce or tolerate errors throughout SSD lifecycle. For example, additional error correction at the beginning of an SSD’s life could help reduce the failure rates we see during the early detection period.

Observation 2:  We find that the effect of read disturbance errors is not a predominant source of errors in the SSDs we examine.

While prior work has shown that such errors can occur under certain access patterns in controlled environments [5, 32, 6, 8], we do not observe this effect across the SSDs we examine. This corroborates prior work which showed that the effect of retention errors in flash cells dominate error rate compared to read disturbance [32, 6]. It may be beneficial to perform a more detailed study of the effect of these types of errors in flash-based SSDs used in servers.

Observation 3:  Sparse data layout across an SSD’s physical address space (e.g., non-contiguously allocated data) leads to high SSD failure rates; dense data layout (e.g., contiguous data) can also negatively impact reliability under certain conditions, likely due to adversarial access patterns.

Further research into flash write coalescing policies with information from the system level may help improve SSD reliability. For example, information about write access patterns from the operating system could potentially inform SSD controllers of non-contiguous data that is accessed very frequently, which may be one type of access pattern that adversely affects SSD reliability and is a candidate for storing in a separate write buffer.

Observation 4:  Higher temperatures lead to increased failure rates but do so most noticeably for SSDs that do not employ throttling techniques.

In general, we find techniques like throttling, which may be employed to reduce SSD temperature, to be effective at reducing the failure rate of SSDs. We also find that SSD temperature is correlated with the power used to transmit data across the PCIe bus, which can potentially be used as a proxy for temperature in the absence of SSD temperature sensors.

Observation 5:  The amount of data reported to be written by the system software can overstate the amount of data actually written to flash chips, due to system-level buffering and wear reduction techniques.

Techniques that simply reduce the rate of software-level writes may not reduce the failure rate of SSDs. Studies seeking to model the effects of reducing software-level writes on flash reliability should also consider how other aspects of SSD operation, such as system-level buffering and SSD controller wear leveling, affect the actual amount of data written to SSDs.
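The gap described in Observation 5 can be expressed as a ratio between two counters. The sketch below is illustrative only and assumes both byte counts are available, which, as noted earlier in this report, is not guaranteed for every drive; the function name and example figures are hypothetical.

```python
def flash_to_host_write_ratio(host_bytes_written: int,
                              flash_bytes_written: int) -> float:
    """Ratio of data actually written to flash to data the software reported.

    A ratio below 1.0 means buffering or wear reduction absorbed some of the
    software-level writes; a ratio above 1.0 means the controller wrote more
    to flash than the host requested.
    """
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


# Hypothetical counters: software reports 10 TB, flash chips received 6 TB.
ratio = flash_to_host_write_ratio(10 * 10**12, 6 * 10**12)
```

Tracking this ratio over time, rather than software-level writes alone, is one way to reflect the study's caution that reducing software-level writes may not reduce wear on the flash itself.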

“The SSD Endurance Experiment:  Casualties on the Way to a Petabyte” [15]

In the 2014 study, “The SSD Endurance Experiment:  Casualties on the Way to a Petabyte,” The Tech Report conducted testing and follow-up analysis on longevity among major brands of 250 GB SSDs by attempting to continuously write 1 petabyte (PB) of data to them.  All drives exceeded their specifications and were fully functional at over 700 TBW.  Three of the six tested drives failed before 1 PB of data could be written, but at least two exceeded that mark [15].

Key findings and conclusions noted within the study by author G. Gasior suggest the following [15]:

  • The manufacturer’s provided TBW data specification is likely very conservative.
  • The MLC NOT-AND (NAND) flash memory-based SSDs reliably detected bad memory cells and replaced those with “spare” ones to maintain user-accessible storage capacity and data integrity even if cell failures incapacitated some of the NAND.
  • The drive health-monitoring software used, Intel’s SSD Toolbox and HD Sentinel, provided warnings based on SMART data long before errors occurred. As a result, there was plenty of time to replace the drive and maintain data integrity.

Key findings noted in the study also suggest that larger SSDs should have a greater durability, as identified in the following excerpt from “How Long Do Solid-State Drives Really Last?” [12]:

Larger capacity SSDs, due to having more available sectors and more “room” to use before failing, should last longer in a predictable manner. For example, if a 250GB Samsung 840 MLC drive failed at 900 TBW, it would be reasonable to expect a 1TB drive to last for considerably longer…

In addition, it does not appear that present industry standards and processes for measuring the health data reported from the drive and/or drive controller system are sufficient to ensure a solution to SSD durability in the very near term.  For example, to facilitate SSD adoption and alleviate product quality and reliability concerns, the JEDEC Solid State Technology Association (formerly the Joint Electron Device Engineering Council), an ANSI-accredited institution that publishes open standards for the microelectronics industry, has published standards for SSD reliability testing and ratings [16].  The document, “Solid State Drive (SSD) Requirements and Endurance Test Method,” defines conditions of use (e.g., application class:  client or enterprise) and corresponding endurance verification requirements. It also establishes an SSD endurance rating (in terms of TBW) to allow a standard comparison based on application class [17].  The document “Solid-State Drive (SSD) Endurance Workloads” describes the standard workload to be used when performing testing [18].

Although such an endurance rating can be useful as a measure of projected absolute lifetime for a manufacturer’s “family” of a given drive type, it is of limited use for determining when to replace an individual drive in critical applications.  Additionally, it is not always clear if a given SSD manufacturer is using the same metrics and workloads as another to test for longevity, making even comparison of one manufacturer’s products to another’s difficult [9].

 


5.0  OEM Drive Health-Monitoring Tools

DSIAC staff research suggests that both OEM and second-party SSD health-monitoring tools are effective in conservatively notifying the user of drive issues and degradation. However, SSD life span and similar dates derived from SMART data are generally just a conservative guide for replacing the SSD to help ensure data integrity and do not indicate an expected actual drive failure date. To reliably monitor and predict the failure of an SSD, the SSD type should be selected to address the target operating environment and mission criticality factors for the desired tasks and configuration and then paired with health-monitoring software designed by the SSD’s manufacturer. This task is simpler and more likely to succeed when the drive OEMs are consulted on the requirements and one of their health-monitoring tools is matched to one of their drives. This method will allow better interrogation, interpretation, and analysis of the available SMART attribute parameters. Sections 5.1 through 5.7 provide information on a few popular/large SSD manufacturer health-monitoring tools.

5.1  Dell SupportAssist

Dell SupportAssist proactively checks the health of a system’s hardware and software.  When an issue is detected, the necessary system state information can be automatically sent to Dell for troubleshooting.  Health monitoring, including predictive analytics, is provided for solid-state and hard disk drives, batteries, and fans.  SupportAssist is preinstalled on most new Dell devices running the Windows operating system (OS) and can be found in the Start menu under All Programs in the Dell or Alienware folder [19].

5.2  Intel SSD Toolbox

Intel SSD Toolbox is drive management software that allows the user to monitor drive health, estimated drive life remaining, and SMART attributes.  It can run quick and full diagnostic scans to test the read and write functionality of an Intel SSD, optimize the performance of an Intel SSD using Trim functionality, and update the firmware on supported Intel SSDs.  Users can check and tune their system settings for optimal Intel SSD performance, power efficiency, and endurance.  This tool also supports a Secure Erase of secondary Intel SSDs [20].

5.3  Samsung Magician

Samsung enterprise SSDs, such as the Samsung 860DCT and the SM863a, come with a SMART monitoring toolkit (i.e., Samsung Magician software for enterprise SSDs).  The software collects the necessary data and can simplify the calculations by collecting extra data not otherwise available.  The collected attributes work with the Magician software and an analyzer function to predict drive life span based on a load recorded over a specific time, without the need to work through the formulas manually.  For administrators trying to predict how long SSDs will last for a given application, the combination of Samsung Magician with Samsung enterprise SSDs will greatly simplify the process of characterizing loads and better predict how long Samsung SSDs will last [21].
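The general idea of predicting life span from a load recorded over a specific time can be sketched as follows. This is not Samsung's method, which the source does not describe; it is a minimal illustration that assumes a rated TBW figure and two readings of total data written, with all names and numbers hypothetical.

```python
from typing import Optional


def projected_days_remaining(rated_tbw_tb: float,
                             written_start_tb: float,
                             written_end_tb: float,
                             window_days: float) -> Optional[float]:
    """Project days until the rated TBW is reached at the observed write rate."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    rate_tb_per_day = (written_end_tb - written_start_tb) / window_days
    if rate_tb_per_day <= 0:
        return None  # no writes observed in the window, so no projection
    remaining_tb = max(rated_tbw_tb - written_end_tb, 0.0)
    return remaining_tb / rate_tb_per_day


# Hypothetical drive rated for 600 TBW; total writes rose from 100 TB to
# 102 TB over a 10-day recording window (0.2 TB per day).
days_left = projected_days_remaining(600, 100, 102, 10)  # about 2,490 days
```

The projection is only valid while the recorded window is representative of the future workload, so a longer or repeated recording window gives a more defensible estimate.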

Samsung Magician software features simple graphical indicators that show the SSD health status and total bytes written.  It also includes tools that help to optimize the SSD and ensure the system is always running on par with the expected benchmark.  The tools can be used to optimize Samsung SSDs with three different profiles (i.e., maximum performance, maximum capacity, and maximum reliability) and provide detailed descriptions of each OS setting.  The updated benchmarking feature lets users test SSDs to compare their performance and speed through checks of parameters, such as sequential and random read/write speeds.  Other options can check the total bytes written to help assess the overall health and estimated remaining life span of the SSD.  The user can check SATA and Advanced Host Controller Interface (AHCI) compatibility and status.  The system compatibility check ensures that there is no conflict between the computer system/software and the SSD.  Secure erase allows users to wipe the SSD securely to avoid exposing any sensitive data.  Samsung Magician is only available for systems using the Windows OS [21].

5.4  Seagate System Monitor and SeaTools

Seagate System Monitor is a Windows OS automated tool generally intended for use in continuous monitoring of information technology server systems.  It provides health-related information to include system time running, health status, and operating temperature for the hard drives, as well as automated alerts of issues. Users can also view the SMART status [22].

Seagate SeaTools is a diagnostic application that performs several basic tests when initiated by the user to help determine the condition and health of both internal and external disk drives.  It can test all types of internal drives, including SCSI, PATA, SATA, etc.  It can also test external drives (i.e., USB or FireWire), and testing does include a SMART check [23].

5.5  Toshiba SSD Utility

Toshiba SSD Utility is a graphical user interface (GUI)-based tool for managing Toshiba OCZ SSDs.  The dashboard provides a real-time overview of system status, capacity, interface, health, etc.  In addition, the SSD Utility tool can be used to keep SSD firmware updated, identify the amount of life left in an SSD, and set the correct modes to achieve the best performance of the SSD.  Its SSD tuner feature lets users tune the SSD for long-term life and determine whether the SSD is connected to suitable ports [24].

5.6  Transcend SSD Scope Pro

Developed for use with Transcend SSD products, the SSD Scope Pro helps users monitor and manage SSD status via an intuitive interface.  It offers various useful features, including drive information and SMART status monitoring, diagnostic scan, secure erase, health indication, system clone, and remote monitoring [25].

5.7  Windows Data Lifeguard Diagnostics

The Windows version of the Data Lifeguard Diagnostics utility can perform drive identification, diagnostics, and repairs on a Western Digital FireWire, Enhanced Integrated Drive Electronics, SATA, or USB drive.  In addition, it can provide the drive’s serial and model numbers.  This utility is not compatible with the Mac OS; the drive needs to be connected to a system running the Windows OS to run this utility [26].

 


6.0  Second-Party Drive Health-Monitoring Tools

DSIAC staff searched open-source review sites for recommended software tools (other than OEM-provided tools) for SSD health monitoring and summarized some of the most highly recommended results.  A comparison of some additional SMART tools can be found in the Wikipedia article “Comparison of S.M.A.R.T. Tools” [27].

6.1  CrystalDiskInfo

CrystalDiskInfo helps users monitor SSD health status and temperature.  The tool can be used to check users’ SSD and other hard disk (HD) types.  Once installed, it can monitor system HD performance in real-time while users work on their system, check a disk’s read and write speed, and project information about the users’ SSD.  This software can show users the error rates of the disk, including “read error rate.”  The performance measuring scales (e.g., seek time performance, throughput performance, etc.) and total power-on time can be viewed in real time [28].

6.2  DriveDx

DriveDx analyzes the current state of a drive using all the drive health indicators that are most likely to indicate a potential drive issue (e.g., SSD wear out/write endurance, input/output errors, pending sectors, reallocated bad sectors, etc.).  DriveDx runs in the background and periodically performs checks to determine the health of users’ SSD or HDD. When any issue or problem is found, it alerts the user immediately [29].

DriveDx calculates various ratings of the current status of key drive characteristics (percent values), including the drive health rating, drive performance rating, and SSD lifetime left indicator (in the case of an SSD).  These features provide users with a more complete understanding of the current state of their drive.  DriveDx acts as an “early warning system” for pending drive problems.  As a result, users have more chances to save critical data before any data loss occurs.  Unlike most other tools, DriveDx detects not only “OK/Verified” and “Failed” drive health states, but also the “Failing (Pre-fail)” drive state.  DriveDx features a special multitier warning system that will inform the user about deviations from the normal state of drive attributes.  It constantly monitors each SMART attribute (and its change dynamics) and starts continuously warning the user as the drive degrades.  In the initial stages of drive degradation, users will receive notifications of the “Warning” type, then “Failing” (which means that this drive parameter is in a prefailure state), and then “Failed” [29].
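The multitier idea described above can be illustrated with a minimal Python sketch.  This is not DriveDx’s actual algorithm, which is not described in the source; the class, function names, and margin values below are hypothetical and chosen only for illustration.  The one assumption taken from common SMART practice is that an attribute is considered failed once its normalized value drops to or below the vendor threshold.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int      # current normalized value (higher is healthier)
    threshold: int  # vendor failure threshold for the normalized value


# Ordered from healthiest to worst so states can be compared by position.
STATES = ["OK", "Warning", "Failing", "Failed"]


def classify(attr: SmartAttribute, warn_margin: int = 20, failing_margin: int = 5) -> str:
    """Map one attribute to a state; both margins are hypothetical values."""
    headroom = attr.value - attr.threshold
    if headroom <= 0:
        return "Failed"
    if headroom <= failing_margin:
        return "Failing"
    if headroom <= warn_margin:
        return "Warning"
    return "OK"


def drive_state(attrs) -> str:
    """Report the worst state found across all attributes of a drive."""
    return max((classify(a) for a in attrs), key=STATES.index, default="OK")
```

For example, a drive whose attributes individually classify as “OK” and “Failing” would be reported as “Failing,” because the sketch always surfaces the worst attribute state.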

6.3  HDDLife and HDDLife Pro

HDDLife provides automated, continual monitoring of drive health.  The professional version of HDDLife, HDDLife Pro, can read SSD SMART attribute data and allows users to see the health and resources of their disks, providing them with time to move data long before the end of the SSD life span.  The flash cells of these disks have a limited life cycle; therefore, SSDs allow only a limited number of writes before the drive fails.  Even with wear leveling and the extra safety features provided by today’s smart SSD controllers, it is still valuable to know how much of the drive’s rated life span is used and how much is still available.  The software offers a highly customizable list of warnings.  When a hard drive’s reliability degrades to a certain level, HDDLife will promptly display a warning message over the network or via email.  The software also supports external USB drives [30].

6.4  HD Sentinel

The following is an excerpt from “Hard Disk Sentinel” [31]:

Hard Disk Sentinel (HDSentinel) is a multi-OS SSD and HDD monitoring and analysis software.  Its goal is to find, test, diagnose, and repair HDD problems, and report and display SSD and HDD health, performance, degradations, and failures.  HD Sentinel gives complete textual descriptions and tips, and it displays/reports the most comprehensive information about the hard disks and SSDs inside the computer and in external enclosures (USB hard disks/e-SATA hard disks).  Many different alerts and report options are available to ensure maximum safety of your valuable data.

No need to use separate tools to verify internal hard disks, external hard disks, SSDs, hybrid disk drives (SSHD), disks in RAID arrays, and Network Attached Storage (NAS) drives as these are all included in a single software.  In addition, HD Sentinel Pro detects and displays status and S.M.A.R.T. information about LTO tape drives and appropriate industrial (micro) SD cards too.

HD Sentinel monitors HDD status including health, temperature, and all S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) values for all hard disks.  Also, it measures the disk transfer speed in real time which can be used as a benchmark or to detect possible hard disk failures and performance degradations.

6.5  Smartmontools

The following is an excerpt from the Wikipedia page, “Smartmontools” [32]:

Smartmontools (S.M.A.R.T. Monitoring Tools) is a set of utility programs (smartctl and smartd) to control and monitor computer storage systems using the Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) system built into most modern (P)ATA, Serial ATA, SCSI/SAS and NVMe hard drives.

Smartmontools displays early warning signs of hard drive problems detected by S.M.A.R.T., often giving notice of impending failure while it is still possible to back up data.

From late 2010 ATA Error Recovery Control configuration has been supported by Smartmontools, allowing it to configure many desktop- and laptop-class hard drives for use in a RAID array and vice versa.

Most Linux distributions provide the smartmontools package.

The Smartmontools package contains two utility programs (i.e., smartctl and smartd) to control and monitor your hard disk.  The tools offer the real-time monitoring of a user’s storage drives and can analyze and warn about potential disk degradation and failure.  Smartmontools supports ATA, ATAPI, SATA-3 to -8 disks, and SCSI disks and tape devices.  This disk tool can run on Mac OS X, Linux, FreeBSD, NetBSD, OpenBSD, Solaris, OS/2, Cygwin, QNX, eComStation, and Windows, and can be run from a live compact disc.  There is also a GUI, GSmartControl, available for smartctl [33].
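As an illustration of how smartctl output might be consumed by an automated monitor, the following Python sketch runs smartctl with JSON output and flags ATA attributes whose normalized value has reached the vendor threshold.  It assumes smartmontools 7.0 or later (which added the `--json` option) and assumes the JSON keys `smart_status.passed` and `ata_smart_attributes.table`; both should be checked against the installed version.  The function names are hypothetical, and NVMe drives report health through a different structure that this sketch does not handle.

```python
import json
import subprocess


def read_smart_json(device: str) -> dict:
    """Run smartctl for one device and return its parsed JSON report.

    Usually requires administrator rights. smartctl's exit status is a
    bitmask that can be nonzero even on success, so it is not checked here.
    """
    proc = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout)


def summarize(report: dict) -> dict:
    """Extract the overall health flag and any attributes at or below threshold."""
    passed = report.get("smart_status", {}).get("passed")
    table = report.get("ata_smart_attributes", {}).get("table", [])
    at_risk = [
        row["name"]
        for row in table
        # A threshold of 0 means the vendor defines no failure limit.
        if row.get("thresh", 0) > 0 and row.get("value", 0) <= row["thresh"]
    ]
    return {"passed": passed, "at_risk": at_risk}
```

A caller could run `summarize(read_smart_json("/dev/sda"))` on a schedule, where the device path is only an example.  As noted elsewhere in this report, a passing result does not guarantee that the drive will not fail.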

6.6  SSD Life

SSD Life, from BinarySense Inc., is a dedicated SSD tool that attempts to measure and predict an SSD’s life span using a BinarySense-developed algorithm, giving users opportunities to back up their data before their SSD fails.  SSD Life can display the disk data in real time to inform users about any critical defects.  It has been tested for compatibility with most of the SSDs in use, and it works with SSDs from most manufacturers, such as Kingston and OCZ, as well as the Apple MacBook Air built-in SSD [34].

6.7  SsdReady

SsdReady, by CEZEO Software Ltd., was developed to predict how long an SSD will last.  Once installed, the tool runs in the background to track writes and the total daily usage of an SSD and provides an estimate of how long the SSD will last, which gives users time to prepare and purchase a new SSD.  In addition, this SSD tool can provide optimization feedback to extend SSD life if it finds too many disk writes.  The paid version allows users to see more data than the free version and gives immediate feedback.  The vendor recommends letting the paid version collect data for 1 week.  The program displays rough write data for the day, the approximate life of the average SSD using the data collected up to that point, and other drive status information [35].
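The general approach of projecting remaining life from tracked daily writes can be sketched in a few lines of Python.  This is not the vendor’s algorithm, which is not described in the source; it is a simple linear projection that assumes the drive has a manufacturer-rated write endurance (in terabytes written) and that future usage resembles the sampled days.  The function and parameter names are hypothetical.

```python
def estimate_days_remaining(rated_tbw: float, written_tb: float, daily_writes_gb: list) -> float:
    """Linearly project the days until the rated write endurance is reached.

    rated_tbw        assumed manufacturer endurance rating, in terabytes written
    written_tb       terabytes already written to the drive
    daily_writes_gb  observed host writes per day, in gigabytes
    """
    if not daily_writes_gb:
        raise ValueError("need at least one day of write samples")
    avg_gb_per_day = sum(daily_writes_gb) / len(daily_writes_gb)
    if avg_gb_per_day <= 0:
        return float("inf")  # no writes observed, so no wear-based limit
    # Decimal units (1 TB = 1000 GB); a drive past its rating has 0 days left.
    remaining_gb = max(rated_tbw - written_tb, 0.0) * 1000.0
    return remaining_gb / avg_gb_per_day
```

For example, a drive rated for 150 terabytes written that has already absorbed 50 terabytes and averages 20 gigabytes of writes per day over a sampled week would be projected to reach its rating in 5,000 days.  Such a projection addresses only write wear; it says nothing about failures from other causes.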

 


7.0  Additional Resources

The following articles and reports provide insight into the use of SMART attributes and software tools to monitor disk drives and issues of concern in understanding reliability of SSDs:

  • “Streamline Your SSD Health Assessment with SMART Attributes” by L. Harbaugh [36].
  • “How to Check Your Hard Drive’s Health” by W. Gordon [37].
  • “How to Check SSD Health with 6 Free Tools 2019” [38].
  • “Best 7 Free Tools to Check SSD Health and Monitor Performance” by S. Kelly [39].
  • “Buying a Solid-State Drive: 20 Terms You Need to Know” by J. Burek [40].

 


8.0  DSIAC’s Conclusions and Recommendations

8.1  Conclusions

Current SSD health-monitoring tools are not sufficient by themselves to ensure data integrity in critical DoD applications. In such applications, they cannot be wholly relied upon to report impending drive failures, nor can they be relied upon to accurately predict the failure date/time of an individual drive. There are two main issues that contribute to their inherent unreliability: First, there are no standards for policy and requirements related to SSD reliability testing, defining and implementing SMART attributes, and reporting of information on SMART attributes. Second, the complexity of the SSDs and insufficient manufacturing and quality control processes allow drives to be produced with defects that can propagate. SSDs can and do fail without any SMART attribute warning, and while some policy and standardization changes are being developed, they are only a partial solution.

8.2  Recommendations

DSIAC recommends using a manufacturer-developed health-monitoring tool that has been paired with an SSD from the same manufacturer, in conjunction with other risk mitigation measures, such as scheduled early drive replacement, frequent backups, redundant storage, and/or drive array configurations that can automatically rebuild corrupted or failed drives.

 


References

[1] Wikipedia. “S.M.A.R.T.” https://en.wikipedia.org/wiki/S.M.A.R.T, accessed 22 January 2019.

[2] Wikipedia. “Parallel ATA.” https://en.wikipedia.org/wiki/Parallel_ATA, accessed 22 January 2019.

[3] INCITS. “INCITS/TR-54-201x:  Information Technology – SMART Attribute Description, a Technical Report Prepared by INCITS and to be Registered with ANSI.” https://standards.incits.org/apps/group_public/project/details.php?project_id=1651, accessed 23 January 2019.

[4] Pitter, A. “Can You Use SMART Tools With SSDs?” betanews, https://betanews.com/2016/01/06/can-you-use-smart-tools-with-ssds/, 2016.

[5] Wikipedia. “List of Solid-State Drive Manufacturers.” https://en.wikipedia.org/wiki/List_of_solid-state_drive_manufacturers, accessed 22 January 2019.

[6] Innodisk. https://www.innodisk.com/, accessed 22 January 2019.

[7] Memory Depot. “iSmart Diagnostic and Monitoring System.” http://www.memorydepot.com/ssd/technology_ismart.html, accessed 22 January 2019.

[8] Innodisk. “Innodisk iSMART Windows 4.0.X User Guide.” https://www.simms.co.uk/wp-content/uploads/2015/03/InnoDisk_iSMART_Windows_GUI_4_0_X.pdf, accessed 22 January 2019.

[9] Padilla, J. A. “SSD Reliability:  Is Your SSD Less Reliable Than a Hard Drive?” WePC, https://www.wepc.com/tips/ssd-reliability/, 26 October 2018.

[10] Pinheiro, E., W.-D. Weber, and L. A. Barroso. “Failure Trends in a Large Disk Drive Population.” 5th USENIX Conference on File and Storage Technologies (FAST 2007), pp. 17–29, https://ai.google/research/pubs/pub32774, February 2007.

[11] Schroeder, B., R. Lagisetty, and A. Merchant. “Flash Reliability in Production:  The Expected and the Unexpected.” Proceedings of the 14th USENIX Conference on File and Storage Technologies, Santa Clara, CA, https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf, 2016.

[12] Crider, M. “How Long Do Solid State Drives Really Last?” How-To Geek, https://www.howtogeek.com/322856/how-long-do-solid-state-drives-really-last/, September 2017.

[13] Harris, R. “SSD Reliability in the Real World:  Google’s Experience.” ZDNet, https://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/, February 2016.

[14] Meza, J., W. Wu, S. Kumar, and O. Mutlu. “A Large-Scale Study of Flash Memory Failures in the Field.” 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’15), pp. 177–190, New York, NY, https://users.ece.cmu.edu/~omutlu/pub/flash-memory-failures-in-the-field-at-facebook_sigmetrics15.pdf, June 2015.

[15] Gasior, G. “The SSD Endurance Experiment:  Casualties on the Way to a Petabyte.” The Tech Report, https://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte, June 2014.

[16] JEDEC. “Solid State Drives.” https://www.jedec.org/standards-documents/focus/flash/solid-state-drives, accessed 23 January 2019.

[17] JEDEC. “Solid State Drive (SSD) Requirements and Endurance Test Method.” JESD218B.01,  https://www.jedec.org/document_search?search_api_views_fulltext=jesd218b.01, June 2016.

[18] JEDEC. “Solid-State Drive (SSD) Endurance Workloads.” JESD219A, https://www.jedec.org/document_search?search_api_views_fulltext=jesd219a, July 2012.

[19] Dell. “Dell SupportAssist for PCs and Tablets.” https://www.dell.com/support/contents/us/en/04/article/product-support/self-support-knowledgebase/software-and-downloads/supportassist, accessed 23 January 2019.

[20] Intel. “Intel Solid State Drive Toolbox.” https://downloadcenter.intel.com/download/28447/Intel-Solid-State-Drive-Toolbox, accessed 23 January 2019.

[21] Samsung. “Samsung Magician.” https://www.samsung.com/semiconductor/minisite/ssd/download/tools/, accessed 23 January 2019.

[22] Seagate. “Using Seagate System Monitor.” https://www.seagate.com/manuals/network-storage/business-storage-wss-2012/using-seagate-system-monitor/, accessed 23 January 2019.

[23] The Windows Club. “Seagate SeaTools:  A Hard Disk Diagnostic Tool for Windows.” https://www.thewindowsclub.com/seagate-seatools-hard-disk-diagnostic-tool-windows, accessed 23 January 2019.

[24] Toshiba Memory Corporation. “SSD Utility:  SSD Management Software.” https://ssd.toshiba-memory.com/en-amer/download/ssd-utility, accessed 23 January 2019.

[25] Transcend Information, Inc. “SSD Scope Pro:  Introduction.” https://us.transcend-info.com/Embedded/Essay-20, accessed 23 January 2019.

[26] Western Digital Support. “Welcome to WD Support:  Testing a Drive for Problems Using Data Lifeguard Diagnostics for Windows.” https://support.wdc.com/knowledgebase/answer.aspx?ID=940, accessed 23 January 2019.

[27] Wikipedia. “Comparison of S.M.A.R.T. Tools.” https://en.wikipedia.org/wiki/Comparison_of_S.M.A.R.T._Tools, accessed 23 January 2019.

[28] Crystal Dew World. “Quick Download:  CrystalDiskInfo.” https://crystalmark.info/en/software/crystaldiskinfo/, accessed 23 January 2019.

[29] Binary Fruit. “DriveDX.” https://binaryfruit.com/drivedx, accessed 23 January 2019.

[30] BinarySense Inc. “HDDlife.” http://hddlife.com/index.html, accessed 23 January 2019.

[31] Hard Disk Sentinel. https://www.hdsentinel.com/, accessed 23 January 2019.

[32] Wikipedia. “Smartmontools.” https://en.wikipedia.org/wiki/Smartmontools, accessed 23 January 2019.

[33] Smartmontools. https://www.smartmontools.org/, accessed 23 January 2019.

[34] SSD Life. “SSD Reliability Analysis.” http://ssd-life.com/, accessed 23 January 2019.

[35] Ssd Ready! “SsdReady Helps You Predict Lifetime of Solid-State Drive in Your Computer!” CEZEO Software, Ltd., http://www.ssdready.com/, accessed 23 January 2019.

[36] Harbaugh, L. “Streamline Your SSD Health Assessment With SMART Attributes.” Insights, https://insights.samsung.com/2018/07/10/streamline-your-ssd-health-assessment-with-smart-attributes/, 10 July 2018.

[37] Gordon, W. “How to Check Your Hard Drive’s Health.” PC Mag, https://www.pcmag.com/feature/360684/how-to-check-your-hard-drive-s-health, 27 April 2018.

[38] Jihosoft. “How to Check SSD Health With 6 Free Tools 2019,” https://www.jihosoft.com/tips/how-to-check-ssd-health.html, accessed 23 January 2019.

[39] Kelly, S. “Best 7 Free Tools to Check SSD Health and Monitor Performance.” Mashtips, https://mashtips.com/ssd-health-test-and-performance-monitor-tools/, 7 September 2018.

[40] Burek, J. “Buying a Solid-State Drive:  20 Terms You Need to Know.” PC Mag, https://www.pcmag.com/article/360358/buying-a-solid-state-drive-20-terms-you-need-to-know, 3 May 2018.
