Maintaining Mission Critical Systems in a 24/7 Environment. Peter M. Curtis
that are motivated to plug into the Information Age require reliability and flexibility, regardless of whether the companies are large Fortune 500 corporations or small companies serving global customers. This is the reality of conducting business today. Whatever type of business they are in, many organizations have realized that a 24/7 operation is imperative. An hour of downtime can wreak havoc on project schedules or cause the loss of critical information, resulting in hours lost re‐keying electronic data, not to mention the potential for losing millions of dollars.
Twenty‐five years ago, the facilities manager (FM) was responsible for the integrity of the building. As long as the electrical equipment worked 95% of the time, the FM was doing a good job. When there was a problem with downtime, it was usually a computer fault. As technology improved on both the hardware and software fronts, information technology (IT) departments began to design their hardware and software systems with redundancy, including dual‐corded equipment (either an A or a B power source can fully carry the IT equipment load). As a result of IT’s efforts, computer systems have become so reliable that they are down only during scheduled upgrades.
Today the major reasons for downtime are human error or utility failures: poor power quality, power distribution failures, incorrect switching of equipment or accidental emergency power off (EPO) initiation, and environmental system failures (although that percentage remains small). When a problem does occur, the facilities manager is usually the one in the hot seat. Problems are not limited to power quality; they also arise when staff have not been properly trained for certain situations. Further complicating matters, recruiting qualified inside staff and outside consultants can be difficult, as facilities management, protection equipment manufacturers, and consulting firms are all competing for the same talent pool to support the mission critical industry. The stark increase in data center construction around the world has only exacerbated the situation.
Minimizing unplanned downtime reduces risk, but unfortunately, the most common approach is reactive: spending time and resources to repair a faulty piece of equipment after it has failed. Strategic planning can identify internal risks and provide a prioritized plan for reliability improvements. Moreover, only when both ends fully understand the potential risk of outages, including recovery time, can they fund and implement an effective plan. Because the costs associated with reliability enhancement are significant, sound decisions can be made only by quantifying the performance benefits and weighing the options against their respective risks.
Planning and careful implementation will minimize disruptions while making the business case to fund capital improvements and maintenance strategies. When the business case for additional redundancies, consultants, and ongoing training reaches the boardroom, the entire organization can be galvanized to prevent catastrophic data losses, damage to capital equipment, and even threats to life safety.
Figure 3.1 “Seven steps” is a continuous cycle of evaluation, implementation, preparation, and maintenance
(Source: Courtesy of PMC Group One, LLC)
Table 3.1 Law of Nines
% Uptime/Reliability Level | Downtime Per Year |
---|---|
99% | 87.6 hours |
99.9% | 8.76 hours |
99.99% | 52 minutes |
99.999% | 5.25 minutes |
99.9999% | 32 seconds |
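The figures in Table 3.1 follow from simple arithmetic: the fraction of the year a system is allowed to be down, multiplied by the hours in a year. The following is a minimal sketch of that calculation, assuming a 365‐day year (8,760 hours); the function names are illustrative and do not come from the book.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours per year a system may be down at the given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def format_downtime(hours: float) -> str:
    """Express a duration in the largest unit that keeps the value >= 1."""
    if hours >= 1:
        return f"{hours:.2f} hours"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.2f} minutes"
    return f"{minutes * 60:.1f} seconds"


# Reproduce the rows of the "Law of Nines" table.
for level in (99.0, 99.9, 99.99, 99.999, 99.9999):
    print(f"{level}% uptime -> {format_downtime(downtime_hours_per_year(level))}")
```

At 99.99% this works out to about 52.6 minutes per year, and at 99.9999% to about 31.5 seconds, which agrees with the rounded values in the table.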
3.2 Companies’ Expectations: Risk Tolerance and Reliability
In order to design a building with the appropriate level of reliability, a company must first assess the cost of downtime and determine its associated risk tolerance. Because recovery time is now a significant component of downtime, downtime can no longer be equated to simple power availability, measured in terms of one nine (90%) or six nines (99.9999%). Today, recovery time is typically many times longer than the outage itself, since operations have become much more complex. Restoration of a shutdown IT infrastructure backbone must be carried out in a specific sequence so that IT equipment can be restored with limited communication conflicts and brought back online speedily. Simply turning IT equipment back on does not work with today’s complex IT systems. Is a 32‐second outage really only 32 seconds? Is it perhaps 2 hours or 2 days? The real question is: How long does it take to fully recover from the 32‐second outage and return to normal operational status? Although measuring in terms of nines has its limitations, it remains a useful measurement that we need to identify. For a 24/7 facility:
In new 24/7 facilities, it is imperative not only to design and integrate the most reliable systems, but also to keep them simple. When there is a problem, the facilities manager is under enormous pressure to isolate the faulty system without disrupting any critical electrical loads and does not have the luxury of time for complex switching procedures during a critical event. An overly complex system can be a quick recipe for failure via human error if key personnel who understand the system functionality are unavailable. When designing a critical facility, it is important that the building design does not outsmart the facilities manager. Companies can also maximize profits and minimize cost by using the simplest design approach possible or by integrating automatic recovery or “self‐healing” automatic controls to recover from a failure. One prevalent example is the current use of Static Transfer Switches (STSs), discussed in a later chapter. The STS automatically switches power sources to critical equipment within milliseconds.
In older buildings, facility engineers and senior management need to evaluate the cost of operating with obsolete electrical distribution systems and the associated risk of an outage. Where a high potential for losses exists, senior management can financially justify serious capital expenditures to upgrade the electrical distribution system. The cost of downtime across a spectrum of industries has exploded in recent years, as businesses have become completely computer‐dependent and systems have become increasingly complex (Table 3.2).
Table 3.2 The Cost of Downtime
(Source: Data from Information Technology Intelligence Consulting).
Industry | Average Cost per Hour in 2017 |
---|---|
Energy | $22,321,000 |
Brokerage | $9,300,000 |
Media | $9,000,000 |
Manufacturing | $8,500,000 |
Health Care | $6,900,000 |
Retail | $6,600,000 |
Telecommunications | $4,800,000 |
Credit Card Operations | $3,100,000 |
Human Life | “Priceless” |
* Prepared by a disaster‐planning consultant of Contingency Planning Research
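The hourly figures in Table 3.2 can be combined with the earlier point that recovery time, not just the outage itself, determines real downtime. The sketch below is a rough illustration only. It assumes that losses accrue at the tabulated hourly rate throughout both the outage and the recovery period, which is a simplification not stated in the text. The function names and the 2‐hour recovery figure (one of the possibilities the text raises for a 32‐second outage) are illustrative.

```python
def effective_downtime_hours(outage_seconds: float, recovery_hours: float) -> float:
    """Total time out of normal operation: the outage plus the recovery that follows."""
    return outage_seconds / 3600 + recovery_hours


def outage_cost(cost_per_hour: float, outage_seconds: float, recovery_hours: float) -> float:
    """Estimated loss, assuming cost accrues linearly at the hourly rate (an assumption)."""
    return cost_per_hour * effective_downtime_hours(outage_seconds, recovery_hours)


RETAIL_COST_PER_HOUR = 6_600_000  # Retail row of Table 3.2

# A 32-second outage counted on its own, versus the same outage plus 2 hours of recovery.
outage_only = outage_cost(RETAIL_COST_PER_HOUR, outage_seconds=32, recovery_hours=0)
with_recovery = outage_cost(RETAIL_COST_PER_HOUR, outage_seconds=32, recovery_hours=2)

print(f"Outage alone:        ${outage_only:,.0f}")
print(f"Outage plus recovery: ${with_recovery:,.0f}")
```

Under these assumptions, the 32‐second outage by itself would cost roughly $58,700, while the same outage followed by a 2‐hour recovery would cost roughly $13.3 million. The gap shows why a nines figure based on power availability alone understates the exposure.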
Imagine that you are the manager