S.M.A.R.T. monitoring

Overview

The purpose of monitoring S.M.A.R.T. attributes is to give system administrators and users enough information to predict an imminent hard drive failure in time to back up critical data and replace the drive.

Mechanical failure is responsible for approximately 60 percent of all hard drive failures (1). Although a total hard drive failure can mean the loss of valuable data, most of these events can be predicted in advance, with a reliability of up to 70 percent.

Argus Monitor checks the S.M.A.R.T. attributes of your hard drives at user-defined intervals (by default, once every 5 minutes). Monitoring is based on the values of the so-called 'critical S.M.A.R.T. attributes'.

The monitoring process is subdivided into the following three distinct categories:

Error (standard hard drive failure warnings)
Caution (extended hard drive failure warnings)
Information (additional information for expert users)

Each of these three warning categories can be configured individually using the extended S.M.A.R.T. warning configuration options, available via the Argus Monitor settings dialog and described in the Configuration section. The following image gives an overview of the S.M.A.R.T. monitoring options available to users of Argus Monitor, providing the best hard drive failure prediction possible.

S.M.A.R.T. Monitoring categories and flow chart

Argus Monitor S.M.A.R.T. Monitoring

Description of the available categories

Category Error

A S.M.A.R.T. error in the context of the standard failure prediction means that a critical S.M.A.R.T. attribute has reached its vendor-specific threshold. Values and thresholds are both normalized.

In the image “Example showing all three available S.M.A.R.T. events” the critical attribute 5 “Reallocated Sector Count” has reached its threshold of 140. According to the S.M.A.R.T. specifications, this hard drive is either currently failing or about to fail within the next 24 hours.

Example showing all three available S.M.A.R.T. events
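The 'Error' rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Argus Monitor's actual implementation: the SmartAttribute class and the is_error function are hypothetical names, and the sketch assumes the common convention that normalized values count down toward the threshold, so that "reached" means the value is less than or equal to the threshold.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One row of a drive's S.M.A.R.T. table (hypothetical structure)."""
    attr_id: int
    name: str
    value: int      # normalized current value
    threshold: int  # vendor-specific normalized threshold


def is_error(attr: SmartAttribute) -> bool:
    """Return True when the normalized value has reached its threshold.

    Assumption: normalized values count down toward the threshold,
    so 'reached' means value <= threshold.
    """
    return attr.value <= attr.threshold


# Attribute 5 from the example: its value has reached the threshold of 140.
reallocated = SmartAttribute(5, "Reallocated Sector Count", value=140, threshold=140)
# Made-up values for an attribute that is still well above its threshold.
spin_up = SmartAttribute(3, "Spin Up Time", value=200, threshold=21)

print(is_error(reallocated))  # True
print(is_error(spin_up))      # False
```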

Category Caution

The extended hard drive failure prediction monitors raw data values of certain critical S.M.A.R.T. attributes. Those values are vendor-specific and not normalized. Studies done by several online data storage providers (e.g. Google (2) and Backblaze (3)) show that heuristic algorithms based on raw data values of critical attributes 5, 187, 196, 197 and 198 are capable of predicting hard drive failures in advance much more reliably than the standard threshold check.

As a result of the pre-failure heuristic algorithm, the attribute 196 “Reallocated Event Count” in the above image is marked with the word 'Caution'. This in itself is NOT a hard drive failure. Nevertheless, the user should pay closer attention to this hard drive and is advised to back up all important data stored on this device. Further information about this issue can be found at the end of this page.
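As a rough illustration of such a raw-value heuristic, the sketch below flags any of the five attributes whose raw value is non-zero. The rule "non-zero raw value means Caution" is an assumption made for this example; the heuristic Argus Monitor actually uses is not documented here and may differ, for instance per vendor. The names CAUTION_ATTRIBUTE_IDS and caution_attributes are hypothetical.

```python
# Attribute IDs named in the text as the basis for the extended prediction.
CAUTION_ATTRIBUTE_IDS = frozenset({5, 187, 196, 197, 198})


def caution_attributes(raw_values: dict[int, int]) -> list[int]:
    """Return the watched attribute IDs whose raw value is non-zero.

    Assumption: a non-zero raw count is treated as a pre-failure hint;
    a real heuristic may use different or vendor-specific rules.
    """
    return sorted(
        attr_id
        for attr_id, raw in raw_values.items()
        if attr_id in CAUTION_ATTRIBUTE_IDS and raw > 0
    )


# Made-up raw readings keyed by attribute ID. Attribute 9 is ignored
# because it is not one of the five watched attributes.
readings = {5: 0, 9: 12000, 196: 3, 197: 0, 198: 1}
print(caution_attributes(readings))  # [196, 198]
```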

Category Information

Changes in the values of critical S.M.A.R.T. attributes fall into this category. An example of such an event is shown in the image: the value of attribute 198 “Offline Uncorrectable” has changed, and therefore the word 'Change' is displayed in the table of attributes.

This category is intended for expert users only. A mere change of a critical S.M.A.R.T. attribute is not considered a hard drive failure as long as the vendor-specific threshold has not been reached (see above). The event merely indicates that one of the values has changed.
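Detecting such a change amounts to comparing two consecutive readings of the attribute table. The following sketch assumes each reading is a plain mapping from attribute ID to value; the function name changed_attributes and the watched parameter (the set of attributes for which the warning is enabled) are hypothetical.

```python
def changed_attributes(
    previous: dict[int, int],
    current: dict[int, int],
    watched: set[int],
) -> dict[int, tuple[int, int]]:
    """Map each watched attribute ID to (old, new) if its value changed."""
    changes = {}
    for attr_id in sorted(watched):
        # Only compare attributes that are present in both readings.
        if attr_id in previous and attr_id in current:
            old, new = previous[attr_id], current[attr_id]
            if old != new:
                changes[attr_id] = (old, new)
    return changes


# Made-up readings from two consecutive polls.
before = {3: 98, 198: 0}
after = {3: 97, 198: 2}

# Only attribute 198 is watched, so the change in attribute 3 is not reported.
print(changed_attributes(before, after, watched={198}))  # {198: (0, 2)}
```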

For some of the critical values, such as attribute 3 “Spin Up Time”, this happens regularly and can be considered normal. For those attributes even expert users should, under normal circumstances, turn this warning off. For an SSD that has an attribute like “SSD Life Left”, this mechanism can be used to observe how quickly flash writes decrease the value. This might be useful for estimating the remaining lifetime of the SSD according to the vendor specifications.
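One simple way to turn two such observations into a lifetime estimate is linear extrapolation. The sketch below is only an illustration under two assumptions that are not taken from any vendor specification: the attribute value decreases at a constant rate, and a value of 0 marks the end of the rated lifetime. The function name estimate_days_remaining is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# (time of reading, normalized "SSD Life Left" value)
Reading = Tuple[datetime, int]


def estimate_days_remaining(earlier: Reading, later: Reading) -> Optional[float]:
    """Linearly extrapolate the number of days until the value reaches 0.

    Assumptions: wear is constant over time and 0 means end of rated life.
    Returns None if the value has not decreased between the two readings.
    """
    (t0, v0), (t1, v1) = earlier, later
    elapsed_days = (t1 - t0).total_seconds() / 86400
    consumed = v0 - v1
    if elapsed_days <= 0 or consumed <= 0:
        return None  # no measurable wear yet, so no estimate
    return v1 * elapsed_days / consumed


# Made-up example: the value dropped from 100 to 98 within 60 days.
start = datetime(2023, 1, 1)
print(estimate_days_remaining((start, 100), (start + timedelta(days=60), 98)))  # 2940.0
```

A real drive rarely wears at a constant rate, so an estimate like this is best recomputed whenever the value changes.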

Configuration

The options for the extended S.M.A.R.T. configuration can be found under Settings/S.M.A.R.T./Configuration. This will open a new dialog where all the S.M.A.R.T. warning configuration options can be defined.

By default, all hard drives are configured with the same settings. If you prefer, you can configure each hard drive separately. For removable drives, the drive you want to configure has to be present in the system when you open the dialog, but the configuration is saved and applied even after you remove and reattach the drive later. Whether to configure all drives independently or to use one configuration for all drives can be selected in the top right of the configuration dialog.

S.M.A.R.T. Monitoring configuration settings

Configuration of S.M.A.R.T. warnings

For each of the three available categories you can specify whether warnings are enabled or disabled in general. For the categories 'Caution' and 'Information' you can also select which attributes should be taken into consideration. During installation, Argus Monitor will pre-configure the warnings based on our recommendations. If you have changed some values and want to revert to this default setting, use the 'Default' button in the configuration dialog.

In the lower part of the configuration dialog you can specify the action to be taken once one of the configured S.M.A.R.T. checks issues a warning. You can select a message box that remains on the desktop until you dismiss it by clicking OK. In addition, a separate notification window is available that informs you of events but fades out automatically after a few seconds. Other options include playing a sound, logging the event to the Argus Monitor event log file, executing an external program, or sending an email.

Further explanations for category 'Caution'

Quote from a study done by Google (2)

“Work at Google on over 100,000 drives over a 9-month period found correlations between certain SMART information and actual failure rates. In the 60 days following the first off-line scan uncorrectable error on a drive (SMART attribute 0xC6 or 198), the drive was, on average, 39 times more likely to fail than it would have been if no such error occurred. First errors in reallocations, offline reallocations (SMART attributes 0xC4 and 0x05 or 196 and 5) and probational counts (SMART attribute 0xC5 or 197) were also strongly correlated to higher probabilities of failure. Conversely, little correlation was found for increased temperature and no correlation for usage level. However, the research showed that a large proportion (56%) of the failed drives failed without recording any count in the 'four strong S.M.A.R.T. warnings' identified as scan errors, reallocation count, offline reallocation and probational count. Further, 36% of drives failed without recording any S.M.A.R.T. error at all (except temperature), meaning that S.M.A.R.T. data alone was of limited usefulness in anticipating failures.”

Quote from a study done by Backblaze (3)

“There are over 70 SMART statistics available, but we use only 5. To give some insight into the analysis we’ve done, we’ll look at three different SMART statistics here. The first one, SMART 187, we already use to decide when to replace a drive, it’s really a test of the analysis. The other two are SMART stats we don’t use right now, but have potentially interesting correlations with failure.

SMART 187: Reported_Uncorrect – Backblaze uses this one.

Number 187 reports the number of reads that could not be corrected using hardware ECC. Drives with 0 uncorrectable errors hardly ever fail. This is one of the SMART stats we use to determine hard drive failure; once SMART 187 goes above 0, we schedule the drive for replacement.

This first chart shows the failure rates by number of errors. Because this is one of the attributes we use to decide whether a drive has failed, there has to be a strong correlation.”

SMART attribute number 187 reports the number of reads that could not be corrected using hardware ECC

Argus Monitor developers' advice

S.M.A.R.T. is not able to predict EVERY hard drive failure. For example, if a drive's electronics fail, S.M.A.R.T. cannot, by design, be effective. Nevertheless, failures that are directly linked to errors of the storage medium itself (the magnetic disk or the flash memory of an SSD) can be predicted in advance with relatively high reliability.

By monitoring the five statistically most significant critical S.M.A.R.T. values, Argus Monitor can reliably predict most hard drive and SSD failures in time for the user to back up important data before the drive fails completely.

(1) Statement on enhanced smart attributes by Seagate Technology, Inc.

(2) Failure Trends in a Large Disk Drive Population by Google Inc.

(3) Hard Drive SMART Stats by Backblaze