ZERO OUTAGE THE ZERO OUTAGE PRINCIPLE AS A REQUIREMENT FOR DIGITAL TRANSFORMATION - T-Systems
WHITE PAPER ZERO OUTAGE

CONTENT
NO DIGITAL TRANSFORMATION WITHOUT FAIL-SAFE IT
QUALITY AS THE MOST IMPORTANT DECISION-MAKING CRITERION
ZERO OUTAGE: THE PATH TO IT DEFECTS
THE 3-P PRINCIPLE: PEOPLE, PROCESSES, PLATFORMS
• PEOPLE: THE CRITICAL HUMAN FACTOR
• PROCESSES: THE COMPANY’S FRAMEWORK
• PLATFORMS: A COMPANY’S FOUNDATION
ZERO OUTAGE IN PRACTICE
NO DIGITAL TRANSFORMATION WITHOUT FAIL-SAFE IT

Reliable information and communication technology (ICT) is the basis for successful digital transformation, both in terms of internal IT operations and ICT services purchased from a service provider. Companies’ commercial activities and their entire existence today depend on it.

Businesses which fail to build permanently fail-safe ICT run the risk of having major problems. The market research company Gartner predicted as early as 2013 that a quarter of all businesses would disappear from the market if they were unable to meet the quality requirements for digital transformation, so-called “digital incompetence”.

But ensuring quality in ICT is an intricate management task. Countless components need to work seamlessly together at all times so that areas like production or sales can operate smoothly. This requires clear standards: for the processes, technical platforms and when training personnel (the “3-P principle”). These standards must not only be introduced and implemented, but also consistently maintained. In addition to these standards, constant staff vigilance and a sense of urgency are also of crucial importance, because human error remains the most common cause of disruptions and outages. This can only be helped by a holistic approach, which systematically raises staff awareness about quality, and ensures every staff member feels committed to a Zero Defect culture.

[Figure: Zero Error Principle]

The focus on quality must not be limited to the business’s four walls, because businesses of every size and industry work together across sectors. This means there are increasingly more gateways and touchpoints. Unless every participating organization has and maintains the same high understanding of quality, there will be a risk of defective products and outages.

The smooth cooperation required can only exist if there is a common quality standard. As such, the ICT industry needs an ecosystem committed to the Zero Outage principle, one which follows common rules for quality management: goals which can be optimally pursued with Zero Outage. T-Systems used Zero Outage as early as 2011 to introduce a complete quality-assurance program for ICT services. The aim: to minimize downtime and thus maximize its customers’ business activities in the digital age.

The following white paper provides an overview of Zero Outage.
QUALITY AS THE MOST IMPORTANT DECISION-MAKING CRITERION

There are a number of studies which illustrate the great importance of quality in services, particularly in this digital age. For example, two thirds of the businesses surveyed by the consultancy firm PwC in 2015 state that, at 84 percent, quality is the most important criterion when choosing a service provider. This puts quality well ahead of financial considerations (58 percent). The Information Services Group (ISG) also found that IT quality plays a role “very frequently” to “always” in companies’ decision-making. General performance, in the sense of stable processes and sustainable services, is an especially important factor here.

EVERY OUTAGE COSTS MONEY

Increasing digitalization is putting more pressure on IT departments in companies across all industries. Telecommunications, rescue service systems, postal logistics, transport companies, trade, the entire finance sector and much more are today dependent on problem-free IT. The more platforms and processes are interlinked, the more dependencies exist and thus the more likely it is for incidents to occur. These incidents, even the tiniest ones, can have serious effects, including the complete outage of critical business services.

Every outage costs money: European companies with 50+ staff alone lose more than 37 million man-hours per year as a result of IT outages and data recovery. In many sectors, even a brief outage of the IT systems today causes major financial losses for the affected businesses and establishments. Apple’s App Store was unavailable for eleven hours in 2015 due to technical problems, forcing the company to absorb 2.2 million dollars in losses per hour. And impacts of this scale are not an exception, they are the norm.
Meeting the high quality requirements for IT in the modern business world thus takes a strategy which minimizes the number of incidents, while also rectifying any disruptions as quickly as possible. This success strategy has a name: Zero Outage.

[Figure: European companies with over 50 employees lose more than 37 million man-hours due to IT downtime and data recovery every year.]
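The per-hour loss quoted above for the App Store outage can be turned into a total with simple arithmetic. The following is a minimal sketch that uses only the two figures from the text (eleven hours, 2.2 million dollars per hour); the function name `outage_cost` is hypothetical, and a constant loss rate over the whole outage is an assumption.

```python
def outage_cost(hours_down: int, loss_per_hour: int) -> int:
    """Total loss for an outage, assuming a constant loss per hour."""
    if hours_down < 0 or loss_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    return hours_down * loss_per_hour


if __name__ == "__main__":
    # Figures from the text: eleven hours at 2.2 million dollars per hour.
    total = outage_cost(hours_down=11, loss_per_hour=2_200_000)
    print(f"Estimated total loss: {total:,} dollars")
```

Under that assumption, the outage adds up to 24.2 million dollars.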
ZERO OUTAGE: THE PATH TO IT DEFECTS

Zero Outage is the term for how an organization behaves in terms of systematically and efficiently handling quality-related tasks, with the aim of continuously increasing quality. Zero Outage thus affects telecommunications and IT operations, services, projects, the optimization of customer interfaces, and the involvement of further ICT suppliers. It is important to note here that Zero Outage also refers to the behavior of an organization’s entire workforce, from top management to entry-level staff.

The Zero Outage program covers measures across all levels: from state-of-the-art platforms, to smooth, standardized processes with short repair times, to specially trained staff. This is because stable, reliable ICT can only be achieved through optimum interaction between humans and technology.

[Figure: Zero Outage covers the optimization of the customer interface, the operation of telecommunications and IT, the implementation of projects, the delivery of services, and the integration of ICT suppliers.]

The most important principle of Zero Outage is always that of comprehensive, proactive risk management. It operates under the motto of “prevention, not reaction”. It is not about being the fastest to put out the fire in the worst-case scenario, but rather about foreseeing risks, developing a plan B and C in advance, and thus preventing the fire from starting in the first place. Great importance is thus placed on comprehensive quality assurance right from the planning phase for changes or projects, as well as on a generally high degree of standardization for processes and technology.

Zero Outage includes specific rules and behavioral guidelines for various incidents, such as defective system components, network, power or VoIP outages, and even incidents that arise while implementing a change. Active risk management serves as the basis for all Zero Outage initiatives: Every single risk cluster is monitored for risks, e.g. incidents, and the measures taken are constantly optimized and further developed. In this way, Zero Outage has managed to achieve 99.999 percent availability in ICT, corresponding to an outage time of just a few minutes a year.

STANDARDIZATION

Clearly defined standards for platforms, processes and personnel are pre-requisites for maximum availability and reliability. Standardization reduces complexity, and is crucial in preventing or quickly rectifying incidents. At the same time, the Zero Outage strategy also focuses intensively on operational problems in order to ensure the right conclusions can be drawn. This is the only way improvement initiatives can be started. Even in project management and software engineering, the parties that are in charge all work in accordance with clearly defined, tried-and-tested processes and standards which describe the results prepared by the various project roles during specific phases and stages.

INCIDENT MANAGEMENT

Incident management constitutes a major part of Zero Outage. Standardized, comprehensive incident management repairs an acute error as quickly as possible by achieving maximum professionalization through repeated solution processes. Incident management includes a clearly defined communication chain and various escalation levels, as well as a general manager-on-duty service, known as the “red telephone”. Similar to a standby service, dedicated representatives from the senior or top management, along with a special team, can be contacted 24/7 about critical incidents. The manager on duty is directly involved, and coordinates all problem-solving processes as the main contact. Around 140 managers work as managers on duty at T-Systems worldwide, taking turns to bear responsibility in times of crisis.

ECOSYSTEM

In order to ensure top quality and reliability end-to-end at all levels, T-Systems works with partners and suppliers upholding the same high quality standards. After all, they are an integral part of the process chain, both in terms of providing solutions and services, and in emergencies. To enable any incidents to be rectified as quickly as possible, error sources to be clearly established, and a final solution to be found, it is vital to directly involve the respective supplier. In 2013, T-Systems thus expanded the existing Zero Outage program to include partners and suppliers. Around 30 top global suppliers and over 60 access providers are already Zero Outage-certified. Every year, over 500 unannounced emergency simulations (“fire drills”) ensure the agreed quality is upheld end-to-end, both by T-Systems and the suppliers.
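To make the 99.999 percent availability figure mentioned in this section concrete, the following minimal sketch converts an availability percentage into the downtime it permits per year. It assumes a 365-day year and that availability is measured over the whole year; the function name `annual_downtime_minutes` is hypothetical and not part of any Zero Outage tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year (in minutes) implied by an availability level."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for level in (99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(level)
        print(f"{level} % availability -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999 percent works out to roughly 5.3 minutes of downtime per year, which is consistent with the "few minutes a year" stated in the text.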
THE 3-P PRINCIPLE: PEOPLE, PROCESSES, PLATFORMS

As previously mentioned, optimum functioning between personnel, processes and platforms is essential for guaranteeing 99.999 percent availability.

[Figure: the 3-P principle: People, Processes, Platforms]

PEOPLE: THE CRITICAL HUMAN FACTOR

The human factor plays a central role when it comes to incidents in critical system operations. Take air traffic control as an example: Human error is the main cause of 60 percent of plane crashes. In IT system operation, the human-error percentage is much higher, at over 80 percent. Critical systems can today be secured to an extent that renders outages extremely unlikely, but only as long as humans do not make any big mistakes.

If an incident occurs and is caused by human error, this is often due to the following:
• The people involved may be operating using different terms, behavioral patterns and priorities
• The people involved are not properly trained (lack of expertise and certifications)
• A sense of urgency is lacking in critical situations
• Errors when implementing changes and solving problems (no dual-control principle)
• Middle and senior management do not have the details of operational matters
• Shifting of responsibility back and forth (“incident ping-pong”)

Gaining control over the aforementioned problems requires a holistic approach which goes hand in hand with the culture embodied by an entire company.

CREATING A ZERO OUTAGE CULTURE

In order to successfully and sustainably incorporate the quality mindset into a company’s culture, it must become the focus of all values. This affects all areas. The Human Resources department plays a key role in firmly establishing the quality approach. If quality is already well and truly integrated as an important criterion in the recruitment process, and also influences salary models, career planning and employee performance appraisals, the corresponding standards and values permeate through all of the organization’s departments.
Creating a Zero Outage culture at a company involves a number of factors. The following measures have, however, been identified as key tools in ensuring the success and permanent establishment of this culture:

• “Practice what you preach”: If a manager themself embodies values and standards and acts as a role model, their staff are also more likely to adopt these and associate them with the positive example. Such behavior is particularly important when an organization wants to gear itself around high quality standards.

• The company has defined Zero Outage as a top priority for now and years to come. There is a mission supported by everyone, and a strategy pursued by everyone.

• There are clear, measurable KPIs and a plan of what is to be achieved over the next twelve months, as well as a long-term strategic objective. Overarching goals, such as reducing major incidents by X percent, figure in the top management’s and senior management’s personal target agreements.

• Successfully changing an entire culture requires not just one single area to commit to Zero Outage, but everyone, including those expected to resist it.

• A weekly slot is reserved for the quality update at management board meetings: The quality manager reports on what currently are the most important quality KPIs, the highlights and lowlights of the past week, and the status of important improvement programs. If necessary, decisions are also made directly at this time. Action is then taken and monitored to see whether improvements have been made, with a follow-up for the next week.

• Managers on duty are also appointed from the operational IT divisions and from among all managers. The managers are also trained and are contactable at night and at weekends according to a rotating schedule, to assist with incidents or, for example, a change weekend. This also allows there to be a contact from the management who is available every day and night to help with escalating the solution.

• In the event of a major incident, the manager on duty is the first person to enter the telephone conference and push for the cause to be investigated until the error is rectified. They embody the sense of urgency and act as a role model for all staff involved.

• The heads of department and team leaders in middle and lower management are also responsible for quality in daily business. They have a checklist which helps them during daily quality checks with the team and gives them a guideline as to the quality benchmark.

• Feedback culture: It is important to involve the key players from the operational areas when further developing process standards and policies. An opportunity for direct feedback should also be provided whenever possible. This may be in the form of a Q&A session held after the staff call, an anonymous feedback survey, or on-site breakfasts, where staff can discuss the quality strategy in a relaxed, casual atmosphere.
QUALITY ACADEMY

Only staff who constantly hone their skills can lead the organization to success, which is why T-Systems established the so-called Quality Academy as a standardized training and certification platform in 2013. It serves as a think tank for company-wide knowledge transfer across all quality-related process and IT training courses. Over 20,000 T-Systems staff and almost 100 top partners and access providers are now certified, ensuring a standardized understanding of quality and solution expertise at all levels. All of the Quality Academy services are geared around certain professional careers. Staff in the operations division, for example, are trained in the dual-control principle when implementing changes, while project managers increase their know-how relating to quality gates and touchpoints in projects. Each employee can utilize specially configured training modules for their specific career and role and can easily navigate between individual sections within the so-called playlist. This enables faster completion of the training, as well as content tailored specifically to the respective target audiences. Staff with little involvement in a process or subject area do not need to complete extensive training, but instead only the content relevant to their tasks. These methods also make it easy to combine topics relevant across the board into training modules, thereby helping process and tool aspects to be learned simultaneously.

Attractive new formats like simulations, mobile training courses or game-based learning provide variety as a change from the monotony of normal, web-based training courses or recorded instructions with supporting PowerPoint slides. The “flight simulator” is one example which has proven useful: This online training course allows various scenarios and problems related to daily work to be simulated on the desktop and addressed, so as to prepare staff for real operations and thus prevent human error.

CERTIFICATIONS

Quality Academy certification plays a key role in keeping the workforce’s knowledge of Zero Outage verifiable and up to date. On the one hand, certification is an important means of proving the employee’s knowledge, and on the other, it is also a good indicator for managers, giving them an overview of the team’s knowledge level. From the overall perspective of a global quality organization, certification is essential for imparting knowledge across the board and rolling out new standards.

A certificate expires after 18 months and must be renewed, thereby ensuring a continuous focus on up-to-date content. It also facilitates the onboarding process for new staff, who can undertake the training courses at any time and get certified in a similar way to a “driver’s license”, giving an important feeling of achievement and also being an ideal way of consolidating the quality mindset and relevant know-how early on.

PROCESSES: THE COMPANY’S FRAMEWORK

A modern, efficient business model is based on countless processes which ensure, at the various levels of a company, that operations run correctly. A company’s fundamental processes depend on IT and telecommunications in virtually every industry, which is why high process quality end-to-end is essential in making ICT environments as fail-safe as possible: if the ICT quality is not right, a single process error can block or even suspend a business’s entire process. An operational fault in an ERP system, for example, is difficult to compensate for through manual replacement methods and processes. The consequence: suspension of business operations within a few hours.

PROBLEM AREAS AND SOLUTIONS

The causes of process disruptions are extremely diverse, and may lie at various levels and departments of the organization:
• Different adaptations of the existing standards, such as ITIL, COBIT, PMI
• Highly complex, impractical, scientific process descriptions
• No documentation of responsibilities or a general lack of end-to-end responsibilities
• Processes do not fit together because departments operate separately from one another
• The alert chain for incidents starts too late or does not function consistently; precious time is lost
• No focus on sustainable problem-solving or investigating causes; the organization persists with workarounds, which constantly create new errors

However, the ICT systems in many companies do not always display the necessary quality. Coupled with this is the fact that, when a company grows, so does the number of internal processes, making it increasingly difficult to coordinate them all. While standards like ISO 27000 or the IT Infrastructure Library (ITIL) have significantly increased the degree of industrialization in the IT world, they are yet to satisfactorily ensure the reliability and stability of IT systems. These standards only describe what quality is, not how it is achieved. Too many different approaches are taken here, and IT outages continue to occur too frequently. Standardization must thus start right at the beginning of an IT project. Only then is it possible to achieve high process quality.
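The 18-month certificate validity described under CERTIFICATIONS can be checked mechanically. The following is a minimal sketch of such a check; the function names (`add_months`, `expiry_date`, `needs_renewal`) are hypothetical and do not describe the Quality Academy's actual tooling, and clamping to the last day of a shorter month is an assumption about how "18 months" is counted.

```python
import calendar
from datetime import date

CERTIFICATE_VALIDITY_MONTHS = 18  # validity period stated in the text


def add_months(start: date, months: int) -> date:
    """Return the date `months` later, clamped to the last day of the target month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def expiry_date(issued_on: date) -> date:
    """Date on which a certificate issued on `issued_on` expires."""
    return add_months(issued_on, CERTIFICATE_VALIDITY_MONTHS)


def needs_renewal(issued_on: date, today: date) -> bool:
    """True once the certificate has reached its expiry date."""
    return today >= expiry_date(issued_on)


if __name__ == "__main__":
    issued = date(2019, 5, 15)
    print("Expires on:", expiry_date(issued))
    print("Needs renewal today:", needs_renewal(issued, date.today()))
```

A manager's overview of the team's knowledge level could be built on the same check by applying `needs_renewal` to each team member's issue date.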
CLEAR DESCRIPTION AND DOCUMENTATION OF PROCESSES

Inadequate process descriptions are a typical fundamental problem affecting process quality: Highly complex, impractical essays often result in IT staff not knowing, for example, what to do in the case of a fault. Such instructions are counterproductive and result in errors and conflicts. Simple, easily comprehensible process descriptions are thus needed in order to ensure smooth everyday operations and prevent ICT outages from happening in the first place. This also requires a clear assignment of roles within the company.

DISTRIBUTION OF EXPERTISE AND RESPONSIBILITY

Other frequent problems include breaks in, or ambiguities regarding, responsibility when company or even just departmental boundaries are crossed. Overlaps or inconsistencies in collaboration can result in each party relying on the other, with no one ultimately feeling properly responsible.

Efficient crisis management is required if a fault occurs. Insufficient alert chains and cumbersome delegation of tasks and responsibilities unnecessarily prolong the repair process, and therefore also the risk for the company’s business activities. Roles and processes must therefore be clearly defined and consistently followed. This also includes a strict dual-control principle, which guarantees error prevention and quality checks in critical matters. Along with unclear competence areas, a lack of end-to-end responsibility also impacts negatively on IT quality. Every process disruption is not only a potential error source, but also makes it harder to see the whole picture. Silo mentalities and isolated sub-processes inevitably lead to an overall concept relevant to the company’s success being left by the wayside. It is thus imperative to establish consistent processes which view IT challenges holistically. Regular training courses ensure the necessary process compliance in emergencies. If the individual process stages have been practiced, a strict procedure will be successful even under intense pressure.

CONFIGURATION MANAGEMENT

Configuration management is a good basis for improving processes. An orderly Configuration Management Database (CMDB) is a useful indicator of whether processes function correctly and are applied in a disciplined manner in areas such as change management, patch and release management, and monitoring. The aim of configuration management is to document compliance with a configuration unit’s physical and functional requirements, and create full transparency in relation to this, with a view to ensuring every party or department involved with the configuration unit uses the right and appropriate information.

SYSTEMATIC PREVENTION

There will always be another outage: Anyone who is serious about quality management and wants to improve and standardize their process quality over the long term is reliant on clean, tidy documentation. Only through consistent documentation can existing processes be permanently optimized, and error sources minimized. The Zero Outage strategy covers the following fundamental points in order to make the company’s process landscape more reliable:
• Simple process descriptions
• Clearly define responsibilities and processes
• Regularly simulate emergencies
• Consistently document and analyze incidents and faults

High process quality is essential for stable, highly available ICT. Clear rules and structures, and consistent implementation thereof, are required in order for the IT’s creative potential to freely develop in the interests of the company’s objectives, based on the motto: Only those who control their processes are not controlled by their processes.
PLATFORMS: A COMPANY’S FOUNDATION

Standardized, high-performance, and, most importantly, highly available platforms are pre-requisites for a Zero Outage philosophy. The platforms must, however, also always comply with the latest technological standards and have multiple back-ups. Experience has shown that while technical faults are only the cause of incidents in protected systems in exceptional cases, they should definitely not be neglected.

POSSIBLE ERROR SOURCES

Despite all precautions, technical faults cannot be discounted at a platform level. The most common causes of these are:
• Redundant components mutually disrupting one another
• Defective firmware
• Outdated hardware
• Defective monitoring
• Excessive complexity as a result of different technologies and versions

Faults and outages mean extra work, delays and, in the worst-case scenario, even downtime. That is why it is also important to learn from previous faults and prevent new ones. This requires having as broad a picture as possible of all systems used by the customer, meaning that not only are the systems operated by the IT service provider taken into account, but also the customer and the systems it operates itself. A “critical landscape” should always be created, containing all existing components (i.e. systems, applications and interfaces) from the customer’s end to the supplier’s end. Clearly defined standards form the basis for maximum availability and reliability. Standardization reduces complexity, which is in turn crucial for preventing or quickly rectifying faults. Fewer spare parts and experts with specialized knowledge are needed, and there are fewer unknown reactions during changes.

TECHNICAL BASES

Zero Outage is based on duplicated data-center technologies. All data and systems must be available in two identical but physically separate data centers. If one data center has an outage, the other takes over. In the event of hardware defects, a second built-in component supplies the power. The same applies for storage to prevent hard-drive defects. Further redundancy at a higher layer ensures additional minimization of the residual risk. This may be a second, active, complete server in a second data center, for example. This is the only way of being able to offer customers 99.999 percent availability.

SECURITY

Protecting against outages through redundancy and uniform standards is just one side of the coin. Safeguarding these systems accordingly is also the order of the day. Companies of all sizes and industries are faced with ever increasing security requirements. Any identified security loopholes must be immediately closed, and unknown ones sought as a preventive measure. Waiting for loopholes to become apparent is grossly negligent, because it gives many hackers and copycats more chance to take advantage of them.

“Mobile working” is another main gateway for hackers and malware: An appropriate infrastructure for accessing the company’s network externally should be established beforehand, enabling secure access to the company infrastructure over public networks. Similarly, there need to be clear rules as to which company applications can be accessed externally and which cannot. It is also mandatory to clarify with one’s customers, in advance, which access rules exist for their systems and applications.

CHANGE MANAGEMENT

Change management plays an important role, because defective changes are today the most common cause of faults. Companies have invested huge sums in tools and training courses. Yet, despite these efforts and countless books on the topic, most studies show that between 60 and 70 percent of all change projects at companies still fail. Changes are, however, becoming necessary more and more frequently and at shorter intervals, as rapid technical advancements at the customer’s end are constantly generating new requirements for speed and memory. Cloud, IoT, Big Data etc. all enable new business models, which in turn need new systems and platforms. The requirements for mobile usage have also increased dramatically.

Changes in hardware are often extensive. The most common requirements are:
• Existing systems need to be serviced
• Old or defective hardware must be replaced
• Firmware must be updated, and security loopholes closed
• New systems must be integrated into the IT landscape

All these points must be planned and coordinated intensively, and there needs to be a risk assessment of the consequences of the changes. If a system which was functioning flawlessly yesterday stops working today, there are essentially only three possible causes: changes in use (increased access, including hacker attacks or viruses), physical defects or, and this is by far the most common, something has been changed in the system or its configuration.

Zero Outage change management is about consistently minimizing risk when implementing changes and keeping impacts as small as possible. Every permanent change to a customer’s IT landscape is assessed and checked, based on the same criteria. This requires a quality assurance system applied consistently across the company’s entire organizational structure. Successful changes result in change models or templates, which are used to perform similar changes in future. If change models are highly standardized and can be applied globally, procedural optimizations can also quickly be made accessible to all teams around the world.

As such, the Zero Outage approach is increasingly being incorporated into change plans. The expected result of every step in implementing changes is already noted in the change plan. When performing the change, every step is followed by a dual-control check to see whether this result was achieved.

If an error does occur, a detailed examination is conducted. One common technical cause is a hardware, software or configuration error. This involves getting to the bottom of often complicated situations. As complex environments almost always also involve suppliers with their components, these suppliers are also incorporated and asked to contribute to the analysis. In the case of technical defects, it is also important to ask why the redundancy, which is virtually always set up for potentially critical services, did not work, or why a failover scenario did not take effect.
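The change procedure described above (an expected result noted for every step, followed by a dual-control check after each step) can be sketched as a small data structure. This is an illustrative sketch only: `ChangeStep` and `run_change` are hypothetical names, the reviewer's check is reduced to a simple comparison of observed and expected results, and stopping at the first deviation is an assumption about how the detailed examination would be triggered.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ChangeStep:
    description: str
    expected_result: str         # written into the change plan before execution
    execute: Callable[[], str]   # performs the step and returns the observed result


def run_change(
    steps: List[ChangeStep], implementer: str, reviewer: str
) -> List[Tuple[str, bool]]:
    """Run steps in order; stop at the first step whose result is not confirmed."""
    if implementer == reviewer:
        raise ValueError("dual control needs two different people")
    log: List[Tuple[str, bool]] = []
    for step in steps:
        observed = step.execute()                     # carried out by the implementer
        confirmed = observed == step.expected_result  # stands in for the reviewer's check
        log.append((step.description, confirmed))
        if not confirmed:
            break  # halt the change so the deviation can be examined in detail
    return log


if __name__ == "__main__":
    plan = [
        ChangeStep("Update firmware", "version 2.1", lambda: "version 2.1"),
        ChangeStep("Restart service", "running", lambda: "stopped"),
        ChangeStep("Re-enable monitoring", "enabled", lambda: "enabled"),
    ]
    for description, confirmed in run_change(plan, "implementer", "reviewer"):
        print(description, "confirmed" if confirmed else "NOT confirmed")
```

In this sketch, a plan that ran through without deviations could be stored and reused as a change model for similar changes, in the spirit of the templates mentioned in the text.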
ZERO OUTAGE IN PRACTICE

Zero Outage by T-Systems is a comprehensive program required for establishing maximum quality in ICT, which is in turn a pre-requisite for stepping into the digital age and being able to operate successfully in it. In other words, no reliable ICT means no digital transformation.

Zero Outage covers a number of technical and procedural measures, all of which serve the purpose of ensuring quality. They also always go hand in hand with the human factor, because staff at the service provider’s end, and of course also at the customers’ and suppliers’ ends, are ultimately the ones who embody the notion of quality and apply it on a daily basis.

Our staff are the backbone of our Zero Outage program, because they embody our strategy, and are the best, most credible representatives of quality, both internally and externally. They promote the issue with their expertise and precision, so that we can also keep ensuring high quality.

May 2019

Contact
T-Systems International GmbH
Global Delivery Excellence
Doris Reitter
Fasanenweg 5
70771 Leinfelden-Echterdingen, Germany

Publisher
T-Systems International GmbH
Hahnstraße 43d
60528 Frankfurt, Germany
http://www.t-systems.de