Three Key Requirements of a Sound Disaster Recovery Strategy - March 2011
Disaster Recovery: Best Practices

Contents
Abstract
Three DR Requirements
Mitigating Risk
  Mitigating Risk: Disk for Backup, Tape for DR
  Disk for Backup, Tape for DR (and maybe more)
Ensuring Long-term Affordability
  TCO and Tape
Testing, Testing, Testing!
Conclusion

BlueScale, TranScale, Spectra, and the Spectra Logic are registered trademarks of Spectra Logic Corporation. All rights reserved worldwide. All other trademarks and registered trademarks are the property of their respective owners. All library features and specifications listed in this white paper are subject to change at any time without notice. Copyright © 2011 by Spectra Logic Corporation. All rights reserved.
Abstract

For short- and long-term survival, organizations must try to protect their data against every disaster, large or small, that can disrupt daily operations. Disasters range from power outages, employee theft, and viruses and malware to a site shutdown caused by a natural disaster. Whatever the magnitude, any of these disrupts an organization’s normal flow of activity [1], so recovery from a disaster should be a top priority.

A strong disaster recovery (DR) plan must address risk mitigation and affordability, and must include a test/drill component. The DR plan must also include tape in the mix along with disk, because of tape’s unmatched advantages in portability (easy to move away from the primary site to ensure availability), invulnerability to malware and viruses (all data backed up prior to the malicious code remains free of corruption), and, of course, affordability. Tape is the ultimate insurance.

Disaster Recovery Strategy
1) Risk mitigation, so data can survive any disaster
2) Affordability over the long term
3) Testing, Testing, Testing!

Three DR Requirements

A strong DR plan is a necessity: study after study confirms a horrifyingly high mortality rate for organizations caught without a DR plan when a disaster strikes. An example: 50% of organizations without a data protection strategy [2] never even re-opened after a strike by a tornado in the Midwest. Another example: a shocking 43% of businesses never re-opened following a significant data loss due to disaster. Of that figure, 80% failed in a year and 93% within five years [3]. This underscores the common-sense notion that a good disaster recovery plan is the best insurance a company can take out against major and minor threats to data.

Every company has its own disaster recovery requirements according to staff size, IT budget, existing backup/DR equipment, and data priorities.
Some commonalities, however, are found across all plans: the need to protect data from the effects of myriad disasters, the element of affordability, and the need to test and improve DR strategies regularly.

[1] “Server Virtualization Part 5: Disaster Recovery.” http://www.bitpipe.com/detail/RES/1259882237_596.html?tbaction=play&titleId=51904036001
[2] Maltby, Emily. “Readying for the Worst.” The Wall Street Journal. 9 September 2009.
[3] “Importance of Succession Planning: Continuity Disaster Recovery.” Phoenix Blogs. May 2007. http://continuitydisasterrecovery.phoenix-blogs.com/importance-of-succession-planning/ http://www.bizjournals.com/cincinnati/stories/2004/08/09/focus5.html
Mitigating Risk

The first task in DR planning is to ensure that data is protected in the event of significant disruption: essentially, an insurance policy against catastrophe. The best disaster recovery strategies uniformly include storing data off-site, with at least one copy on tape.

The most commonly used data storage options are tape and disk, typically combined in a data protection architecture. More recently, cloud storage has begun to gain a foothold in the market. The cloud is widely considered not quite ready for prime time because of prohibitive expense [4], security concerns, and an organization’s lack of control over its own data. For example, trusting data to the cloud means that an organization believes the cloud provider has a competent disaster recovery plan, will not arbitrarily go out of business, and will not gate access to data if a payment issue arises. Such issues may well arise if disaster strikes either the cloud company or the organization whose data it stores.

Mitigating Risk: Disk for Backup, Tape for DR

Disk is a very effective tool for backup, especially when coupled with methods that create additional levels of data protection, such as RAID. Organizations are increasingly establishing strategies that include off-site disk storage through remote replication and other technologies. Disk combined with tape serves as an effective backup and archiving architecture, with tape as a key element in archiving and disaster recovery.

Only 3% of all disasters [5] are significant natural disasters. Although these may get the most press, they are not the most pressing threats to data. The most common threat to data is hardware malfunction, followed by human error, software corruption, and computer viruses. Disk is vulnerable to some degree [6], as shown in long-term studies [7] of disk drive failures.
[4] Amazon S3 Data Pricing. http://aws.amazon.com/s3/#pricing. Accessed February 2011. For the least expensive and most reliable Amazon S3 storage, organizations pay $0.055 per GB. For firms that require over 5,000 TB of storage (5 PB), the total expense comes to $275,000 per month. While most organizations with this kind of data probably won’t be using the cloud only, this figure at least illustrates the radical expense of cloud storage.
[5] Gallant Data Recovery Services. “Statistics About Leading Causes of Data Loss.” Displayed April 2011. http://www.gallantent.com/solutions.htm
[6] Schroeder, Bianca and Garth Gibson. “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?” USENIX Conference on File and Storage Technologies, 2007. Pinheiro, Eduardo, Wolf-Dietrich Weber and Luiz André Barroso. “Failure Trends in a Large Disk Drive Population.” Google. Henson, Valerie. “Opinion: Real-world Disk Failure Rates offer Surprises.” Computerworld, 2007. http://www.computerworld.com/s/article/9025380/Opinion_Real_world_disk_failure_rates_offer_surprises
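The monthly figure in note 4 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the flat $0.055 per GB monthly rate quoted in the note and decimal units (1 TB = 1,000 GB), and the function name is hypothetical rather than part of any Amazon tooling.

```python
# Assumptions taken from note 4: a flat monthly rate of $0.055 per GB,
# with decimal units (1 TB = 1,000 GB). Real pricing is tiered and changes
# over time, so treat this as a back-of-the-envelope check only.
PRICE_PER_GB_MONTH = 0.055
GB_PER_TB = 1_000


def monthly_storage_cost(terabytes, price_per_gb=PRICE_PER_GB_MONTH):
    """Return the monthly cost in dollars for storing `terabytes` of data."""
    return terabytes * GB_PER_TB * price_per_gb


# 5,000 TB (5 PB) is the capacity used in note 4.
cost = monthly_storage_cost(5_000)
print(f"${cost:,.0f} per month")
```

Under these assumptions the result matches the $275,000 per month stated in the note; changing the rate or the capacity shows how quickly the bill scales with data volume.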
Protecting data from logical corruption is one of the primary uses of tape in disaster recovery, and something that disk, despite its many strengths as a data backup medium, has never really solved [8]. If malicious software or a virus hits a disk system, it can spread from the initial disk to the backup disk and to the RAID-protected disk that serves as the backup for the backup. A disk-only backup strategy can therefore actually worsen the very disaster it is supposed to protect against. The combination of disk and tape for disaster recovery is the strongest defense against risk.

Online providers clearly trust tape, as illustrated by Google’s recent Gmail issue. On February 28, 2011, Google posted on its Gmail platform an apology to the 0.02% of users (an estimated 150,000) who could not access their email [9]. The culprit? A software bug that affected email in the disk arrays and disk backups across data centers. All copies of the email were unavailable, save those written to tape. The announcement states, “To protect your information from these unusual bugs, we also back it up to tape. Since the tapes are offline, they’re protected from such software bugs.” [10] This succinctly illustrates the importance of using tape for DR.

As for hardware malfunction, despite many claims by disk companies, modern tape libraries with widely available technology such as LTO are orders of magnitude more reliable than enterprise disk: disk’s error rate is about 41,000 times that of tape, making tape the most reliable storage solution for data protection needs [11]. Data written incorrectly is worthless.

Disk for Backup, Tape for DR (and maybe more)

Firms increasingly use tape to create full copies of data as disaster recovery insurance and for active archiving, and use disk for incremental backups.
Along with backing up daily data changes, disk can also be set up to employ any of the disk-based snapshot or mirroring technologies that support more granular backups, for example every hour. Incremental and point-in-time backups during the day minimize the potential data loss from a disaster occurring between backup windows.

[7] Schroeder, Bianca and Garth Gibson. “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?” USENIX Conference on File and Storage Technologies, 2007. Pinheiro, Eduardo, Wolf-Dietrich Weber and Luiz André Barroso. “Failure Trends in a Large Disk Drive Population.” Google. Henson, Valerie. “Opinion: Real-world Disk Failure Rates offer Surprises.” Computerworld, 2007. http://www.computerworld.com/s/article/9025380/Opinion_Real_world_disk_failure_rates_offer_surprises
[8] Hill, David G. Data Protection: Governance, Risk Management, and Compliance. Boca Raton: CRC Press. 2009. Page 53.
[9] Gustin, Sam. “GFail: Google ‘Very Sorry’ After the Cloud Eats Thousands of Gmail Accounts.” Wired Epicenter. 2/28/2011. http://www.wired.com/epicenter/2011/02/gmail-fail/
[10] Treynor, Ben. “Gmail back soon for everyone.” Post on www.gmail.com. 28 February 2011. http://gmailblog.blogspot.com/2011/02/gmail-back-soon-for-everyone.html
[11] Newman, Henry. “Why Enterprise Tape Can’t Get No Respect.” Enterprise Storage Forum. June 17, 2010. http://www.enterprisestorageforum.com/continuity/features/article.php/3888366/Why-Enterprise-Tape-Cant-Get-No-Respect.htm. The actual bit error rate for LTO tape is 1 bit in every 10^17 bits, while disk errs 512 bytes per every 10^16 bits of data transferred.
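The 41,000x reliability figure cited for tape appears to follow directly from the error rates in note 11. The sketch below reproduces that arithmetic. It assumes the exponents in the note read as 10^17 and 10^16, and it treats the 512 bytes as one 4,096-bit sector lost per disk error; the variable names are illustrative only.

```python
from fractions import Fraction

# Figures as read from note 11 (assumed exponents: 10^17 for tape, 10^16 for disk).
TAPE_BAD_BITS = 1            # LTO tape: 1 bad bit ...
TAPE_PER_BITS = 10**17       # ... per 10^17 bits
DISK_BAD_BITS = 512 * 8      # disk: one 512-byte sector = 4,096 bad bits ...
DISK_PER_BITS = 10**16       # ... per 10^16 bits transferred

# Exact rational arithmetic avoids floating-point noise at these magnitudes.
tape_rate = Fraction(TAPE_BAD_BITS, TAPE_PER_BITS)
disk_rate = Fraction(DISK_BAD_BITS, DISK_PER_BITS)

ratio = disk_rate / tape_rate
print(f"Disk error rate is {int(ratio):,} times the tape error rate")
```

Under this reading the ratio comes to 40,960, which rounds to the "about 41,000 times" quoted in the text. The comparison counts a whole lost sector against a single bad bit, so it is a statement about data affected per error rather than about how often errors occur.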
Some data will most likely be lost in the event of a disaster, assuming the failure occurs somewhere between backup windows. Disk backups can easily be automated to run more frequently. These more frequent disk backups remain at risk, depending on the timing and nature of the data loss, but they may be more than worth the investment, depending on the data being protected. The importance of using tape (along with disk) for DR is emphasized in the Google example: the tape was unaffected by the bug that led to the email deletion.

Ensuring Long-term Affordability

Disaster recovery preparation and plan implementation must not break the bank, given that this is precisely what the plan is trying to prevent. Rather, a strong disaster recovery plan needs to be feasible for an organization over the long term. Recent laws like Sarbanes-Oxley make it necessary to keep data for five to seven years or longer, and that data is subject to audit at any time; this should factor into a strong DR plan as one of the unforeseen potential costs of a disaster. For example, Lagasse, Inc., a wholesale distribution company headquartered in New Orleans with over 1,000 associates nationwide, relied on a well-tested DR plan that proved its value following Hurricane Katrina. Lagasse had no downtime during the hurricane, and then spent two months after its initial disaster recovery confirming compliance with Sarbanes-Oxley [12]. Tape served Lagasse well in the compliance tasks that followed the disaster.

TCO and Tape

For affordable, long-term storage, tape has long been the industry standard. Recent analyses by the Clipper Group [13] show that disk storage costs anywhere from 5x to 290x as much per TB as tape storage, for both initial and long-term costs. Because disk is online, it consumes power 24 hours a day, 7 days a week.
For organizations that want to move data off-site using disk, other, less obvious costs are incurred, such as the purchase of WAN bandwidth for longer-distance data moves. These networks can cost anywhere from $100 to $1,500 per MB/s of capacity; one analyst notes that organizations can spend more than the rest of their DR budget combined simply on network maintenance [14].

Cloud storage claims to be incredibly cost-effective, but this is not yet proven. “Cloud services are charged on a per MB, GB or TB usage basis, which can make predictable budgeting a challenge. One blue chip company that recently considered moving to cloud for data replication estimated that it would cost them, over a period of three years, $55,000 more [emphasis added] when compared with running a comparable in house system.” [15] Other questions remain about using cloud services for disaster recovery, not least how to determine the resiliency of the cloud service’s own DR plan.

Testing, Testing, Testing!

In the event of a disaster, the unavailable data must be restored within a reasonable time frame if the disaster recovery effort is to be successful [16]. Bringing data back to restore core business functions requires a tested DR plan. A good restore plan takes into account the nature of the data to be restored; if it is mission-critical data, restore it as quickly as possible. Establish time frames that reflect data importance: restore mission-critical data within a few hours and less important data as a secondary priority (for example, within a 24- or 36-hour time frame). Restore the least used data whenever it is needed, or at least after the most important data is restored and the organization is up and running.

The numbers suggest that a large share of organizations, half or more, have either untested DR plans or no DR plans at all: “of the 50% or so of companies that build continuity plans, fewer than 50% actually test their plans, which is like having no plan at all.” [17] This point of view is confirmed by a Symantec survey that shows small to medium-sized organizations at the greatest risk of poor DR planning. Only half even have plans, and of those “only 28 percent have actually tested their recovery plans, which is a critical component of actually being prepared for a potential disaster.” [18]

Experts agree that testing your disaster recovery plan is key to a successful disaster response. Philip Jan Rothstein, FBCI, president of the Rothstein Associates management consulting firm, believes, “An unexercised contingency plan is often worse than no plan at all.” [19]

Conclusion

For businesses, the keys to surviving a disaster are:
- a strong storage architecture that uses disk and tape appropriately, with tape for disaster recovery
- a custom disaster recovery plan
- regular tests of the DR plan

Make sure you can restore your data before you are under tremendous pressure to do so following a disaster. When recovery is required, the plan must be manageable; otherwise it will be overwhelming, which increases the risk of failure.

The DR plan is a lifeline for an organization, given the frightening percentage of businesses without plans that fail after a disaster of almost any significance. Tape’s role in DR is central, because tape serves as an affordable catastrophic insurance policy for all data, especially for the data stored on disk.

[12] “Katrina Recovery.” Lagasse, Inc. PowerPoint presentation. Available on the internet at: http://www.slideshare.net/mlancas/katrina-recovery-final
[13] Jelitto, Jens, Mark Lantz et al. “Magnetic tape storage advances and the growth of archival data.” Proceedings of the First International Workshop on Standards and Technologies in Multimedia Archives and Records (STAR), Lausanne, 2010. http://mmspl.epfl.ch/webdav/site/mmspl/shared/star2010/ppt/star2010_jelitto.pdf
[14] Aaron, Jeff. “WAN Requirements for Successful Disaster Recovery.” Continuity Central. October 2006. http://www.continuitycentral.com/feature0403.htm
[15] Worms, Phil. “Why Cloud Computing Must Be Included in Disaster Recovery Planning.” Cloud Computing, January 26, 2011. http://cloudcomputing.sys-con.com/node/1689928
[16] “IT Disaster Recovery Plan: How Important Is it For You?” IT Outsourcing Adviser. http://www.it-outsourcing-adviser.com/it-disaster-recovery-plan.html
[17] Toigo, Jon. Disaster Planning Organization Main Page. Accessed April 2011. http://www.drplanning.org/portal/#
[18] Sachoff, Mike. “Half Of Small Businesses Lack A Disaster Recovery Plan.” Small Business Newz, Jan. 12, 2011. http://www.smallbusinessnewz.com/topnews/2011/01/12/half-of-small-businesses-lack-a-disaster-recovery-plan
[19] Harvey, Cynthia. “Identifying Weak Points In A Disaster Recovery Plan.” Processor, Vol. 33 Issue 3, February 11, 2011. http://www.processor.com/editorial/
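The tiered restore time frames described under Testing, Testing, Testing! lend themselves to a simple drill checklist. The sketch below is one hypothetical way to express it: the tier names, the 4-hour and 36-hour targets, and the data set names are illustrative assumptions chosen to match the text's "a few hours" and "24- or 36-hour" guidance, not values from the paper or from any backup product.

```python
from dataclasses import dataclass

# Hypothetical tiers and targets (hours). None means "restore on demand".
TIER_TARGET_HOURS = {"mission_critical": 4, "secondary": 36, "on_demand": None}
TIER_ORDER = ["mission_critical", "secondary", "on_demand"]


@dataclass
class DataSet:
    name: str
    tier: str


def restore_order(datasets):
    """Sort data sets so the most urgent tier is restored first."""
    return sorted(datasets, key=lambda d: TIER_ORDER.index(d.tier))


def missed_targets(drill_results):
    """Given (DataSet, measured_restore_hours) pairs from a DR drill,
    return the names of data sets that exceeded their tier's target."""
    missed = []
    for dataset, hours in drill_results:
        target = TIER_TARGET_HOURS[dataset.tier]
        if target is not None and hours > target:
            missed.append(dataset.name)
    return missed


# Example drill with made-up data sets and timings.
orders = DataSet("orders_db", "mission_critical")
shares = DataSet("file_shares", "secondary")
archive = DataSet("old_projects", "on_demand")

print([d.name for d in restore_order([archive, shares, orders])])
print(missed_targets([(orders, 6.0), (shares, 20.0), (archive, 72.0)]))
```

Recording measured restore times against explicit targets during each drill turns "test the plan" into a pass/fail check, and a missed target points to the tier whose procedure or media needs attention before a real disaster.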