Maintaining Mission Critical Systems in a 24/7 Environment. Peter M. Curtis
occur. Employing new methods of distributing critical power, understanding capital constraints, and developing processes that minimize human error are some key factors in improving recovery time in the event critical systems are impacted by base‐building failures.
Infrastructure reliability can be enhanced by conducting a formal Risk Management Assessment (RMA) and gap analysis, and by following the guidelines of the Critical Area Program (CAP). The RMA and the CAP are used in other industries and have been customized specifically for the needs of data center environments. The RMA is an exercise that produces a system of detailed, documented processes, procedures, checks, and balances designed to minimize operator and service provider errors. The CAP ensures that only trained and qualified people are authorized to access critical sites. These programs, coupled with Probability Risk Assessment (PRA), address the hazards to data center uptime. The PRA looks at the probability of failure of each type of electrical power equipment. A PRA can be used to predict availability, the number of failures per year, and annual downtime. The PRA, RMA, and CAP are facilitating agents when assessing each of the steps listed below.
Engineering and design
Project management
Testing and commissioning
Documentation
Education and training
Operation and maintenance
Employee certification
Risk indicators related to ignoring facility process management
Standards and benchmarking
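The three PRA outputs named above (availability, failures per year, and annual downtime) can be illustrated with a minimal sketch. It assumes the common steady-state model in which each piece of equipment is described by a mean time between failures (MTBF) and a mean time to repair (MTTR), and it assumes components in series fail independently. The function names and the sample figures are hypothetical and are not taken from the text; a real PRA uses equipment-specific failure data.

```python
import math

HOURS_PER_YEAR = 8760.0  # 365 days x 24 hours


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the equipment is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def failures_per_year(mtbf_hours, mttr_hours):
    """Expected failures per year, one per (MTBF + MTTR) cycle."""
    return HOURS_PER_YEAR / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


def series_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures (an assumption of this sketch).
    """
    return math.prod(component_availabilities)


# Hypothetical equipment: fails on average every 8750 h, takes 10 h to repair.
example_availability = availability(8750.0, 10.0)
example_failures = failures_per_year(8750.0, 10.0)
example_downtime = annual_downtime_hours(8750.0, 10.0)
print(f"availability={example_availability:.6f}")
print(f"failures/year={example_failures:.2f}, downtime={example_downtime:.2f} h/year")
```

Chaining several such components with `series_availability` shows why each added single point of failure lowers the availability of the whole power path.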
Industry regulations and policies are more stringent than ever. They are heavily influenced by Basel II, the Sarbanes‐Oxley Act (SOX), NFPA 1600, and the U.S. Securities and Exchange Commission (SEC). Basel II recommends “three pillars” (risk appraisal and control, supervision of the assets, and monitoring of the financial market) to bring stability to the financial system and other critical industries. Basel II implementation involves identifying operational risk and then allocating adequate capital to cover potential loss. In response to major corporate scandals, SOX came into force in 2002 with the following provisions: financial statements published by issuers are required to be accurate (Sec 401); issuers are required to publish information on their internal controls in their annual reports (Sec 404); issuers are required to disclose to the public, on an urgent basis, information on material changes in their financial condition or operations (Sec 409); and penalties of fines and/or imprisonment are imposed for noncompliance (Sec 802). The purpose of the NFPA 1600 standard is to help the disaster management, emergency management, and business continuity communities cope with critical events. Keeping up with the rapid changes in technology has been a longstanding priority. The constant dilemma of meeting the required changes within an already constrained budget can become a limiting factor in achieving optimum reliability.
1.2.1 Levels of Risk
Risk can be described as the worst possible scenario that might occur while performing a task within the facility. Risk assessment reflects how much we know, or can predict, about unforeseen circumstances. Risk management is essential to the facility/IT team: having the proper change management process in place for planned events, and event response procedures in place, can ultimately reduce downtime. Reducing the frequency of events and understanding their impact is the key to proper Critical Environment Management. Table 1.1 shows the three typical levels of impact (high, medium, and low) that can result from an event occurrence.
Table 1.1 Levels of Risk Impact to Facilities
Risk Impact | Effects of System Failure |
---|---|
High | It will cause an immediate interruption to the clients' critical operations, such as: Activity requiring a planned major utility service outage, or temporary elimination in system redundancy of the critical environment. Activity that would disrupt critical production operations. Activity that would likely result in an unplanned outage or disruption of operations, if unsuccessful. |
Medium | There is time to recover without impacting the clients' critical operations, including any: Activity requiring a planned service outage that does not affect systems, but may impact non‐critical operations. Activity that involves a significant reduction in system redundancy. Activity that is not likely to result in an unplanned outage to the critical environment or disruption of operations, if unsuccessful. |
Low | It will not interrupt operations and will have minimum potential of affecting the clients' critical operations, including: Activity involving systems directly supporting operations but the execution of which will be transparent to operations. Activity that cannot result in an unplanned outage of the critical environment or impact operations, if unsuccessful. |
None | Activity not associated with the critical environment. |
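The criteria in Table 1.1 can be read as a simple decision procedure, checked from the most severe level downward. The sketch below is one possible encoding, not a method prescribed by the text: the `Activity` fields, the `OutageLikelihood` values, and the `classify` function are hypothetical names introduced for illustration, and each field paraphrases one criterion from the table.

```python
from dataclasses import dataclass
from enum import Enum


class RiskImpact(Enum):
    NONE = "None"
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


class OutageLikelihood(Enum):
    """Chance of an unplanned outage if the activity is unsuccessful."""
    IMPOSSIBLE = "cannot result in an unplanned outage"
    UNLIKELY = "not likely to result in an unplanned outage"
    LIKELY = "would likely result in an unplanned outage"


@dataclass
class Activity:
    # Hypothetical fields, each paraphrasing one criterion in Table 1.1.
    in_critical_environment: bool
    major_outage_or_redundancy_eliminated: bool = False
    disrupts_critical_production: bool = False
    planned_noncritical_outage: bool = False
    significant_redundancy_reduction: bool = False
    outage_if_unsuccessful: OutageLikelihood = OutageLikelihood.IMPOSSIBLE


def classify(activity):
    """Return the risk impact level, testing the most severe criteria first."""
    if not activity.in_critical_environment:
        return RiskImpact.NONE
    if (activity.major_outage_or_redundancy_eliminated
            or activity.disrupts_critical_production
            or activity.outage_if_unsuccessful is OutageLikelihood.LIKELY):
        return RiskImpact.HIGH
    if (activity.planned_noncritical_outage
            or activity.significant_redundancy_reduction
            or activity.outage_if_unsuccessful is OutageLikelihood.UNLIKELY):
        return RiskImpact.MEDIUM
    # In the critical environment, but transparent to operations.
    return RiskImpact.LOW


# Hypothetical example: maintenance that significantly reduces redundancy.
maintenance = Activity(in_critical_environment=True,
                       significant_redundancy_reduction=True)
print(classify(maintenance).value)
```

Testing the High criteria first means an activity that matches rows at more than one level is always assigned the most severe of them.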
1.3 Capital Costs versus Operation Costs
Businesses rest at the mercy of the mission critical facilities sustaining them. Each year billions of capital dollars are spent on the electrical and mechanical infrastructure that supports IT around the globe. It is important to keep in mind that downtime can cost companies millions of dollars per hour or more. An estimated 94% of all businesses that suffer a large data loss go out of business within two years regardless of the size of the business. The daily operations of our economic system and our way of life depend on critical infrastructure being available 100% of the time with no exceptions.
Critical industries operate continuously, 24 hours a day, 365 days a year. Because conducting daily operations necessitates the use of new technology, more and more applications are packed into servers, and more servers are being packed into a single cabinet. The growing number of servers operating 24/7 increases the need for power, cooling, and airflow. When a disaster causes the facility to experience lengthy downtime, a prepared organization is able to quickly resume normal business operations by using a predetermined recovery strategy. Strategy selection involves focusing on key risk areas and selecting a strategy for each one. Also, in an effort to boost reliability and security, the potential impacts and probabilities of these risks, as well as the costs to prevent or mitigate damages and the time to recover, should be established.
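One way to bring together the quantities named above (probability, impact, mitigation cost, and time to recover) is an expected-loss comparison for each risk area. The sketch below is an illustration under stated assumptions, not a method given in the text: it assumes impact equals recovery time multiplied by an hourly downtime cost, that at most one event of a given type occurs per year, and every name and figure in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RiskArea:
    # All fields and values are hypothetical, for illustration only.
    name: str
    annual_probability: float        # chance of the event in a given year
    recovery_hours: float            # time to recover once the event occurs
    downtime_cost_per_hour: float    # business cost of each hour of downtime
    annual_mitigation_cost: float    # yearly cost of the mitigation measure
    mitigated_probability: float     # chance of the event with mitigation


def event_impact(risk):
    """Cost of a single event: recovery time multiplied by hourly cost."""
    return risk.recovery_hours * risk.downtime_cost_per_hour


def expected_annual_loss(risk, probability):
    """Expected yearly loss at the given event probability."""
    return probability * event_impact(risk)


def net_mitigation_benefit(risk):
    """Reduction in expected loss minus the yearly cost of mitigating."""
    before = expected_annual_loss(risk, risk.annual_probability)
    after = expected_annual_loss(risk, risk.mitigated_probability)
    return (before - after) - risk.annual_mitigation_cost


# Hypothetical risk area with made-up figures.
utility_loss = RiskArea(
    name="extended utility outage",
    annual_probability=0.10,
    recovery_hours=8.0,
    downtime_cost_per_hour=1_000_000.0,
    annual_mitigation_cost=200_000.0,
    mitigated_probability=0.02,
)
benefit = net_mitigation_benefit(utility_loss)
print(f"{utility_loss.name}: net annual benefit of mitigation = {benefit:,.0f}")
```

A positive result suggests the mitigation pays for itself on expected-value terms. The calculation ignores safety, regulatory, and reputational factors, which the strategy decision also has to weigh.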
Many organizations associate disaster recovery and business continuity only with IT and communication functions and miss other critical areas that can seriously impact their business. Within these areas may be a multitude of critical systems that require maintenance, the development of procedures, and appropriate documentation. Some of these systems are listed later in Table 1.3.
One major area that necessitates strategy development is the banking and financial services industry. The absence of a strategy that guarantees recovery has an impact on employees, facilities, power, customer service, billing, and customer and public relations. All areas require a clear, well‐thought‐out strategy based on recovery time objectives, cost, profitability impact, and safety. The strategic decision is based on factors such as the following:
The maximum allowable delay time prior to the initiation of the recovery process.
The time frame required to execute the recovery process once it begins.
The minimum computer configurations required to process critical applications.
The minimum communication device and backup circuits required for critical applications.
The minimum space requirements for essential staff members and equipment.
The