{"id":11540,"date":"2019-07-25T12:16:40","date_gmt":"2019-07-25T12:16:40","guid":{"rendered":"https:\/\/powerm.ma\/?p=11540"},"modified":"2019-07-25T12:20:30","modified_gmt":"2019-07-25T12:20:30","slug":"lessons-learned1","status":"publish","type":"post","link":"https:\/\/powerm.ma\/lessons-learned1\/","title":{"rendered":"Continuous operations : lessons learned from a simultaneous multiple disk failure issue on a virtualizing RAID computer data storage system\u00a0"},"content":{"rendered":"
The customer experienced an outage in June 2019, triggered by a disk-related issue on an IBM Storwize V7000 Gen2: an mdisk went offline with 3 failed drives and a potential loss of data.
Problem timeline
Losing 3 drives in a RAID5 + hot spare configuration looks like a Final Destination movie scene 😀

The disks that failed with "Drive reporting too many medium errors" all belong to a particular drive type (HUC156060CSS200, 600 GB 15K).
Overview of Hitachi King Cobra F drives (HUC156060CSS200)

The Ultrastar C15K600 is the world's fastest hard disk drive: a 15K RPM, 2.5-inch small-form-factor drive ideally suited for mission-critical data centers and high-performance computing environments.

Its best-in-class performance is achieved through several innovations, including media-caching technology, which provides a large caching mechanism for incoming data and significantly enhances write performance.

The C15K600 is HGST's first hard drive to use an industry-leading 12 Gb/s Serial-Attached SCSI (SAS) interface, enabling very high transfer rates between host and drive and supporting the performance and reliability needed by the most demanding enterprise workloads, such as online transaction processing (OLTP), big data analytics, multi-user applications and data warehousing.

The drive is used in EMC VMAX, EMC VNX, EMC VNX2, HPE 3PAR and IBM Storwize systems.
Talking Math: Mean Time To Data Loss (MTTDL)

Mean Time To Data Loss (MTTDL) is one of the standard reliability metrics in storage systems. MTTDL is a simple formula that can be used to compare the reliability of small disk arrays and to perform comparative trending analyses.

In the storage reliability community, MTTDL is calculated using continuous-time Markov chains (a.k.a. Markov models).

The last time I heard about Markov chains was back in 2001, when I was studying queueing network systems. One of the more intriguing courses turns out to be handy for diagnosing one of the most common issues in today's storage infrastructure. This is not a math article, so let's keep it simple: we only need two pieces of information, the probability of a read error during rebuild and the probability of data loss over time.

For our array, the probability of a read error during rebuild is 0.001055 and the probability of data loss over 5 years is 0.000538 😕 So what really happened?
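For intuition, here is a minimal Python sketch of the back-of-the-envelope version of these two numbers. Every parameter (array width, per-drive MTTF, rebuild time, unrecoverable-error rate) is an illustrative assumption, not an input of the vendor's Markov model, so the outputs will not match the figures above exactly:

```python
import math

# Illustrative assumptions (NOT the vendor's model inputs):
N = 7                  # drives in the RAID5 array, including parity
DRIVE_BYTES = 600e9    # 600 GB per drive
URE = 1e-16            # unrecoverable read error rate, per bit
MTTF_H = 1.6e6         # per-drive mean time to failure, hours
MTTR_H = 24.0          # rebuild (repair) time, hours

# 1) Probability of hitting at least one unrecoverable read error while
#    reading the N-1 surviving drives during a rebuild.
bits_read = (N - 1) * DRIVE_BYTES * 8
# 1 - (1 - URE)^bits_read, computed in a numerically stable way:
p_ure_rebuild = -math.expm1(bits_read * math.log1p(-URE))

# 2) Closed-form RAID5 MTTDL: data is lost when a second drive fails
#    before the first rebuild completes (the absorbing state of the
#    Markov model mentioned above).
mttdl_h = MTTF_H**2 / (N * (N - 1) * MTTR_H)

# Probability of data loss over 5 years, modeling loss as a Poisson event.
hours_5y = 5 * 8760
p_loss_5y = -math.expm1(-hours_5y / mttdl_h)

print(f"P(read error during rebuild): {p_ure_rebuild:.6f}")
print(f"MTTDL: {mttdl_h / 8760:,.0f} years")
print(f"P(data loss over 5 years): {p_loss_5y:.6f}")
```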
IBM Technology Support Services and EMC Technical Advisory knowledge base

Doing some research in the valuable EMC and IBM support knowledge bases, we found clues about what could have taken those drives offline:

EMC TA 195555 (VNX, VNXe, Symmetrix VMAX, CLARiiON CX4 Series): Dell EMC has determined that certain 600GB 15K RPM Serial Attached SCSI (SAS) and Fibre Channel (FC) disk drives may experience increased replacement rates when the drives have remained idle for extended periods of time, or when unused space is allocated, which may lead to data unavailability.

IBM notes: some Hitachi King Cobra F drives used in the IBM Storwize V7000 Gen2 can exhibit a higher failure rate and/or a very high number of medium errors across many drives, which raises the risk of data loss. A large proportion of power-on idle hours further aggravates the issue: a drive that has accumulated many power-on idle cycles is much more likely to be exposed. A typical error pattern occurs when a drive performs little I/O for long periods and then suddenly performs heavy I/O.

Regarding our customer's drives, IBM L3 Support confirmed the issue and recommended upgrading the drive firmware to J2GF after converting the array to RAID6.
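The RAID6 recommendation makes sense in MTTDL terms: with dual parity, data loss requires a third failure before a double rebuild completes. A quick comparison using the standard closed-form approximations (same illustrative parameters as the sketch above; note these formulas assume independent failures, which the correlated medium-error behavior described in these advisories notably violates):

```python
# Assumed, illustrative parameters (same as the MTTDL sketch above):
MTTF = 1.6e6   # per-drive mean time to failure, hours
MTTR = 24.0    # rebuild time, hours
N = 7          # drives in the array

# RAID5 tolerates 1 concurrent failure; RAID6 tolerates 2.
mttdl_raid5 = MTTF**2 / (N * (N - 1) * MTTR)
mttdl_raid6 = MTTF**3 / (N * (N - 1) * (N - 2) * MTTR**2)

print(f"RAID5 MTTDL: {mttdl_raid5 / 8760:,.0f} years")
print(f"RAID6 MTTDL: {mttdl_raid6 / 8760:,.0f} years")
print(f"Improvement: ~{mttdl_raid6 / mttdl_raid5:,.0f}x")
```

The firmware upgrade presumably addresses the failure mode itself, while the extra parity stripe restores margin even when failures are not independent.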
The recovery process

Now that we have a fairly clear picture of the root cause of the three drive failures on our V7000, let's see how we were able to recover the data.

Hard drive recovery is the process of recovering data and restoring a hard drive to its last known good configuration after a system or hard drive crash, corruption or damage.

Recovering data from physically damaged hardware can involve multiple techniques. Some damage can be repaired by replacing parts in the hard disk; this alone may make the disk usable, but logical damage may remain. A specialized disk-imaging procedure (sketched below) is then used to recover every readable bit from the surface. Once this image is acquired and saved on a reliable medium, it can be safely analyzed for logical damage, often allowing much of the original file system to be reconstructed.
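To illustrate what such an imaging pass does conceptually (open-source tools like GNU ddrescue work this way; specialist labs such as Ontrack use dedicated hardware imagers), here is a minimal Python sketch. The device paths, block size and map format are all hypothetical:

```python
import os

BLOCK = 512 * 1024  # read granularity; real imagers shrink this around bad areas

def image_device(src_path, dst_path, map_path):
    """Copy every readable block from src to dst; log unreadable spans."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o600)
    size = os.lseek(src, 0, os.SEEK_END)  # total device size in bytes
    with open(map_path, "w") as bad_map:
        offset = 0
        while offset < size:
            length = min(BLOCK, size - offset)
            os.lseek(src, offset, os.SEEK_SET)
            try:
                data = os.read(src, length)   # raises OSError (EIO) on bad media
                os.lseek(dst, offset, os.SEEK_SET)
                os.write(dst, data)
            except OSError:
                # Unreadable span: record it and move on; a later pass can
                # retry these regions with smaller reads.
                bad_map.write(f"{offset} {length}\n")
            offset += length
    os.close(src)
    os.close(dst)

# Hypothetical usage (placeholder paths):
# image_device("/dev/sdb", "/recovery/disk_id2.img", "/recovery/disk_id2.map")
```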
That said, we hired a specialized recovery company (Ontrack) to repair disk ID2. Given the urgency of the situation, we sent the disk securely on a 3-hour flight to Paris.

48 hours later, Ontrack informed us that they had been able to copy 99.9% of the sectors from disk ID2 to a new disk, ID5, provided by IBM.

Back in the customer's Casablanca datacenter, we needed to insert disk ID5 into the V7000, but we had two challenges.

Following an excellent commitment and valuable involvement from IBM Systems, TSS, L3 Support and the development team, IBM provided us with a crafted iFix, pre-tested under the same conditions as those of the PowerM customer (a V7000 Gen2 with Cobra drives and 7.6 firmware).
10 Lessons learned
Deeply understand Continuous Availability vs. HA and DR

High availability (HA) and disaster recovery (DR) are relatively mature approaches that are typically well understood, even by non-technical people. Continuous availability is not as mature and is often confused with HA or DR. People still think in terms of "how many 9s" they can achieve (99.999% uptime, for example), but that is an HA topic.

Continuous Availability (CA) = High Availability (HA) + Continuous Operations (CO)

To justify the cost of continuous availability, a measurable impact on the business must be calculated. But often, as soon as an outage is past, or if a data center has been fairly stable, the business side forgets the critical need for continuous availability until the next outage occurs.
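As a closing illustration of why "how many 9s" is an HA metric rather than a continuity strategy, the arithmetic is simple, yet the absolute difference between three and five nines is striking:

```python
# Allowed downtime per year for a given number of nines of availability.
HOURS_PER_YEAR = 365.25 * 24

for nines in (2, 3, 4, 5):
    availability = 1 - 10.0 ** -nines
    downtime_minutes = (1 - availability) * HOURS_PER_YEAR * 60
    print(f"{availability * 100:.3f}% uptime -> {downtime_minutes:8.1f} min/year of downtime")
```

Five nines allows roughly 5 minutes of downtime per year; none of those minutes says anything about keeping operations running through planned maintenance, which is where continuous operations come in.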