Performance management at Danske Bank - For Internal Use 10.6.2014

Page created by Jaime Griffith
 

Performance management at Danske Bank

10.6.2014

Disclaimer

           •The information given can be based on FACTS
           •The information given may NOT apply to your company

           •Don’t hesitate to ask questions ☺


Agenda

                   The Situation
                   The Mainframe Environment
                   Evolution in online transactions
                   Governance & Process & Tools & Competence
                   Optimising to be attractive as a platform
                   Next step in Performance Management - SPSS


About Danske Bank & the Mainframe environment

 Danske Bank
   Medium-sized bank with approx. 7 million customers
   Commercial, retail and investment bank
   Established 1871
   Branch offices in Northern Europe: DK, SE,
   No, Fi, UK, IR, Li, Lt, Es...

 Mainframe Environment
    34000 MIPS
    3 //Sysplexes (G1-3)
           G1 Production (G3 availability centre)
           G2 Development
    Cics, DB2 and MQ based online environment
           100 million transactions per day
           14,000 transaction codes
           3000+ transactions per second (peak)
    100,000 production batch jobs per day
           Concurrent batch/online
    1500+ developers
           1000+ in DK
           500+ in IN
    Programming languages
           PL/I (primary)
           Cobol
           C, Asm, Java and EGL

 [Chart: Evolution of transactions/MIPS from 2001-2013]


The Situation

Competitive Situation
 Mainframe
OR
 Windows

Mainframe capabilities
  Run without problems
  Provide cost reductions
  Provide new technology

Strengths
  Discipline
  Competence
  Resolve
  Cheapest platform

                            Mainframe is running 70 – 80 % of the production


About the Mainframe environment – Design points
     The original design criteria:
             •One Bank
             •One System
             •One Infrastructure

     Cloned setup
               Cloned systems
               Cloned infrastructure
               Cloned applications

     Workload Distribution
               Sysplex Distributor
               Via shared MQ
               Cisco Load balancers (being replaced)
     Cics-regions
               App. 150 Cics regions in production
                     All Cics 5.1
     Data
               DB2 is the only data storage
               Enables batch and online concurrency (with the right policies)
                     It comes with a cost
     Re-use (save development and infrastructure cost)
               Components
                    Same components are used in online and batch
                    Components are called via a home-grown SOA-like
                    infrastructure (from 199n)
               High resilience risk when key components are changed
                      Some components are used in more than 10,000 transactions
                      Component hierarchy > 40 levels
                      Regulatory requirements for many countries are embedded in the
                      same programs
                      Code and test complexity is impacted

     [Diagram: Cics regions connected via shared MQ]


Resilience, Availability & Performance

 Resilience & Availability
  2 Sites
        1 Production Sysplex
        1 Production Availability Sysplex
        1 Development Sysplex
        Misc. Sysplexes for insurance & sandbox
  Availability
        When the Production Sysplex is taken “out” we
        switch to the Availability Sysplex
        Availability Sysplex is kept current by Qrep
  Performance
        DB relies on response times < 0,25 seconds to
        be able to serve customers at peak – bottleneck:
              Cics storage = 31 bit
              Cics setup (not cloud – yet ☺)

  [Diagram: two sites, one Z196 each, 5 km apart; GDPS1 – G1 (systems M1nn), GDPS3 – G3 (systems M3nn), GDPS2 – G2 (systems MVnn)]
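
  A rough way to see why response time and Cics storage are linked is Little's law: tasks in flight = arrival rate x response time. The sketch below uses the 3000 transactions per second peak and the 0,25 second target quoted in these slides. The 2 MB per task and the fixed cost per task are illustrative assumptions, not measurements from Danske Bank.

```python
def tasks_in_flight(arrivals_per_second: float, response_time_s: float) -> float:
    """Little's law: average number of tasks active at the same time."""
    return arrivals_per_second * response_time_s


def storage_in_use_mb(arrivals_per_second: float, response_time_s: float,
                      mb_per_task: float) -> float:
    """Virtual storage held by in-flight tasks, assuming a fixed cost per task."""
    return tasks_in_flight(arrivals_per_second, response_time_s) * mb_per_task


if __name__ == "__main__":
    peak_tps = 3000  # peak rate quoted in the slides
    for rt in (0.25, 0.5, 1.0):
        n = tasks_in_flight(peak_tps, rt)
        # mb_per_task=2 is an assumed figure for illustration only
        mb = storage_in_use_mb(peak_tps, rt, mb_per_task=2)
        print(f"response {rt:.2f}s -> {n:.0f} tasks in flight, about {mb:.0f} MB")
```

  Doubling the response time doubles the number of tasks holding storage at the same moment, which is why slow transactions show up as storage exhaustion before anything else.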


About the Mainframe environment

     Key issues with Danske Bank setup

       Failure in components
               Failure in 1 key component can bring the online system down
               Failures are typically
                      Bad performance (CPU)
                      Bad response time

       The effect of bad performance
               Virtual Storage exhaustion in Cics
                       Transactions are too slow
               CPU starvation
                       The CPU cost of running the transactions increases with
                       concurrency (up to 50%)

       The Protection mechanisms
               Workload balancing in the network
               Overload Control
                     Inspection of transaction queues
               Workload balancing in shared MQ
               WLM Health indicator

       [Diagram: Workload balancing and Overload Control in front of Cics regions connected via shared MQ]


Evolution in Danske Bank online transactions
Transaction Types    Description
3270                 Green on black
Simple Web           Simple HTML
Integrated Product   Rich transaction spanning several APPS
Self Service         One "screen" - one confirmation

Problem:
   •We can’t buy more virtual storage
   •More Cics regions will not fix the problem
   •Usage of 64 bit is the best option



  Evolution in online transactions

Something would have to be done:
   •Change to another Platform, Language, Container
        •Expensive
        •Risk profile (it might take too long)
   •Optimize the existing applications in place
        •Lower cost
        •Lower risk

[Chart: Mill. transactions/day]

Cics is not the ’end-point’


Danske Bank – Changing the organisation

Management
    Leadership (Management also wanted savings ☺)
    •Incentives
    •Reporting
           •IT Cost dashboard
           •IT Cost Targets

       Prevention          Control                     Healing

•Education          •Screening                    •Resolve
•Self Control       •Tooling                      •Competence
•Tooling            •Management                   •Management

1500 developers
5 – 10 system programmers
”A few good men”


                   Danske Bank – Changes to the organisation

[Organisation chart: CIO (MIPS contract with CIO, $$); three Directors; Development manager, Project Manager, Solution Architect; System Manager, Technical Architect; Developer; Local DBA, Local Performance; Performance Center Of Excellence; Development Model, Process & Guidelines]


     Danske Bank – Changes to the development process

[Process diagram: CIO; Organisational Event; Checkpoint; Solution Performance consultation; Performance & Resilience Evaluation; Construction Review; Performance Analysis; DBA Review]

Performance Evaluation – disadvantages
      •3 mw – 3 mm for Performance Center of Excellence
      •Bottleneck for the project


     Danske Bank – Changes to the development process

What if the local DBA & Performance Expert could do the Performance Evaluation?

[Diagram: Development Model, Process & Guidelines; Threshold dependent application development; Local DBA; Local Performance; Tooling (real time SMF data with a ruler)]

Threshold dependent application development
  The concept

•       By categorising the cost- and robustness-driving factors we are able to determine:
    -       Relevance of optimization requirements
    -       Operational cost

        [Diagram: Usage intensity → Categorisation table → Performance requirements]

•       Performance requirements have been balanced according to
    -        Effectiveness
    -        Cost

        [Diagram: Effort, Effect]
Threshold dependent application development
  The concept – Performance requirements (Cics)

 Cics key values              Very high volume   High volume    Very frequent   Frequent     Rare           Seldom used
                              (TGV)              (inter city)   (regional)      (metro)      (Locomotive)
 Volume per hour              > 230400           > 36000        > 7200          > 1800       > 36

 Response time                0,1s               0,2s           0,5s            1s           5s             10s
 CPU time                     0,05s              0,1s           0,2s            0,5s         2,5s           5s
 #Change mode                 5                  20             50              500          1000           5000
 Storage consumption (EUDSA)  2 MB               2 MB           2 MB            4 MB         6 MB           10 MB
 #LINK                        5                  40             100             1.000        5.000          10.000
 #Dispatch                    10                 30             100             500          1.000          2.000
 #DB2 calls                   50                 100            300             1.000        5.000          10.000
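
A minimal sketch of how such a categorisation table could be applied to a measured transaction. The limits are copied from the table (decimal commas written as points, only three of the seven key values shown). The function names, the data layout and the strict "greater than" comparison on volume are assumptions for illustration, not the bank's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    name: str
    min_per_hour: int   # class applies when hourly volume exceeds this
    response_s: float
    cpu_s: float
    db2_calls: int


# Ordered from busiest to least busy class, values taken from the table.
CATEGORIES = [
    Limits("Very high volume", 230400, 0.1, 0.05, 50),
    Limits("High volume", 36000, 0.2, 0.1, 100),
    Limits("Very frequent", 7200, 0.5, 0.2, 300),
    Limits("Frequent", 1800, 1.0, 0.5, 1000),
    Limits("Rare", 36, 5.0, 2.5, 5000),
    Limits("Seldom used", -1, 10.0, 5.0, 10000),  # catches everything else
]


def categorise(per_hour: int) -> Limits:
    """Return the first class whose volume floor is exceeded."""
    for limits in CATEGORIES:
        if per_hour > limits.min_per_hour:
            return limits
    raise ValueError("volume must be non-negative")


def violations(per_hour, response_s, cpu_s, db2_calls):
    """List the performance requirements a measured transaction breaks."""
    lim = categorise(per_hour)
    checks = [
        ("response time", response_s, lim.response_s),
        ("CPU time", cpu_s, lim.cpu_s),
        ("#DB2 calls", db2_calls, lim.db2_calls),
    ]
    return [f"{what}: {got} > {allowed}"
            for what, got, allowed in checks if got > allowed]


if __name__ == "__main__":
    print(categorise(40000).name)
    print(violations(40000, response_s=0.15, cpu_s=0.12, db2_calls=80))
```

The point of the design is that the same transaction is judged more strictly the more often it runs, so optimisation effort goes where the volume is.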

                   Danske Bank – Changes to the process

[Process diagram: CIO; Organisational Event; Checkpoint; Solution Performance consultation; Performance & Resilience Evaluation; Construction Review; Performance Analysis; DBA Review; Critical Elements]

What if a ”small” change in a highly re-used component had a very bad effect?
Critical elements
•   The mainframe environment in Danske Bank has a high degree of REUSE
     -   Programs and modules are used by many different workloads
           •   Example: the Customer Information module is called from
                 -   Online
                       •   Customer portal
                       •   Netbank
                       •   Business online
                 -   Batch
                       •   Interest
                       •   Securities
                       •   Behaviour score
     -   Changing the Customer Information module can potentially affect all of
         Danske Bank production
           • Availability
           • Cost structure

•   Introduction of Critical Elements gave Danske Bank a pro-active warning
    and approval mechanism for vital programs and modules

•   The development areas are now asking to embed their own selection of
    important modules in “Critical Elements”
Critical elements at Danske Bank


•   Statistics from Cics and DB2 are
    correlated and prioritized with the
    following concerns in mind
      -   Usage across many development areas
      -   High usage

•   Experience based knowledge is
    added
      -   Commonly used modules like
            •   CPUSYST
            •   USCACHE

•   The list is re-generated every month

•   The list is incorporated into our
    Change Control System

•   The list is incorporated into
    Application Diagnostic System

[Diagram: Cics statistics, Experience and DB2 statistics feed the Selection engine, which produces the Critical elements list used by ADS and CCS]
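
A minimal sketch of what a selection engine along these lines might look like. CPUSYST and USCACHE come from the slide. The input layout, the two thresholds, the "either concern is enough" rule and the module name CUSTINFO are hypothetical, since the slides do not give the real selection rule.

```python
from collections import defaultdict

# Modules added from experience, regardless of statistics (names from the slide).
ALWAYS_CRITICAL = {"CPUSYST", "USCACHE"}


def select_critical(usage_rows, min_areas=3, min_calls=1_000_000):
    """usage_rows: iterable of (module, development_area, calls) tuples.

    Both thresholds are hypothetical placeholders.
    """
    areas = defaultdict(set)
    calls = defaultdict(int)
    for module, area, count in usage_rows:
        areas[module].add(area)
        calls[module] += count

    selected = set(ALWAYS_CRITICAL)
    for module in calls:
        # Either concern from the slide qualifies a module: wide use or high use.
        if len(areas[module]) >= min_areas or calls[module] >= min_calls:
            selected.add(module)
    # Most heavily used first, ties broken by name.
    return sorted(selected, key=lambda m: (-calls.get(m, 0), m))


if __name__ == "__main__":
    rows = [
        ("CUSTINFO", "cards", 900_000),   # hypothetical module and areas
        ("CUSTINFO", "loans", 200_000),
        ("CUSTINFO", "netbank", 50),
        ("LOCALMOD", "cards", 10),
    ]
    print(select_critical(rows))
```

Regenerating the list monthly, as the slide describes, would simply mean re-running the selection on the latest statistics.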
Critical Elements
  How does it work!

•   When the developer opens a Critical Element an information
    message is given by RDz:
      -   This is a Critical Element – please be aware......

•   The day after the Critical Element is moved to SYST the
    System Manager from the development area is presented
    with a report telling that a new version of a Critical Element
    has been moved to SYSTem Test

•   MAO is informed when the Critical Element is packaged for
    production implementation
      -   MAO will contact the local DBA/Performance expert in the
          development area to verify NFR quality
             •   Performance
             •   Resilience
      -   Typical questions could be
             •   Have you verified that the Critical Element is not abending
                 (FADUMP)?
             •   Have you verified performance before and after the Critical
                 Element has been modified?
             •   What are the changes being implemented?
      -   After verification the CCS change is approved by ITSM

[Diagram: ADS; CCS Syst; Local DBA; Local Performance; CCS Prod; MAO]

Maintaining and upgrading developer competence
Networks:

•Developer
•Technical Architects
•Solution Architects
•Local DBA
     •How to use DB2 correctly
     •5 all day meetings every year
     •Webcasts
     •Tests
•Local Performance experts
     •How to measure performance
     •Optimizing applications
     •Optimizing design
     •Using Tools
     •4 all day meetings every year
     •Bring your own problem
     •Test

[Diagram, same structure for DENMARK and INDIA: Performance Center Of Excellence; Local DBA, Local Performance; Development community]


                   Danske Bank – Changes to the organisation

[Organisation chart: CIO (MIPS contract with CIO, $$); three Directors; Development manager, Project Manager, Solution Architect; System Manager, Technical Architect; Developer; Local DBA, Local Performance; Performance Center Of Excellence; Development Model, Process & Guidelines. Annotations: Incentive; Rules; Tools; Competence; Guidance; Egg-slicer]


        Healing – Governance and Optimisation - Runtime
        •LoopDetector inside Cics
        •Automatic Operator
        •Shit Detector
        •ASU time
        •// TIME= limitations
        •Automated Threadsafe
        •OverLoad Control
        •Automated Test
        •Mandatory Re-compile
        •Mandatory Re-Bind
        •Commit Enforcement
        •No Batch in peak hour
        •Automatic APA reports


                           Optimisation - Runtime

Some of the GameChangers
1.    Performance Center of Competence
2.    Management backing
3.    Threadsafe
4.    DB2 10
5.    Re-Compile (Architecture level)
6.    Universal Caching System
7.    Consolidation
8.    APA
9.    HIS
10.   Contract


                                         Optimisation – Runtime
                                                  UCS

[Diagram: UCS on MVS1 and on MVS2. On each system: Application program; API Stub; Name/Token; PC Functions; Queue Dataspace; Cache Dataspaces; Garbage Collector; XCF synchronization. The two systems are connected via XCF.]


                            Optimisation – Runtime
                           DB2 10 and Consolidation

•      From 8 to 4 z/OS systems in //Sysplex
•      DB2 Group from 11 to 6
•      MQ Group from 14 to 4
•      Cics Group from 300+ to 150


Danske Bank – Optimizing the runtime
To keep costs low we have decided to:

        •Improve overall system performance through better balancing
                   •Detailed studies have shown great potential

        •Improve anomaly detection
                   •Using SPSS – IT analytics is helping to achieve a faster reaction
                   •Using SPSS we can balance our systems to be more efficient

        •SPSS and IT analytics are the enablers for Danske Bank


Danske Bank – Optimizing the runtime
To improve overall system performance we have used HIS –
•       In an uncontrolled environment we are able to achieve:

 Report for processor: 00                  Interval:       54.093953 microsecs.
    Cycles in interval:            235.755.942.040     88.817.162.070       37,67%
    Instructions executed:          33.690.832.484     13.145.421.063       39,01%
    Cycles per instruction:                   6,99               6,75
      Mips rate:                               745,06               771,55
      RNI report                    Total      Instruction         Data
        L1 miss per 100 instr.         5,48           3,28            2,19
        L1 miss p/100 problem          4,12           2,37            1,75
        - % From L2 (in CPU)          57,78%         69,04%         40,91%
        - % From L3 (on CHIP)         27,42%         23,80%         32,86%
        - % From L3 (on BOOK)          2,06%          0,00%           5,16%
        - % From L3 (off BOOK)         0,50%          0,00%           1,25%
        - % From L4 (on BOOK)          5,63%          3,89%           8,24%
        - % From L4 (off BOOK)         0,64%          0,54%           0,79%
        - % From Memory(local)         1,38%          0,60%           2,54%
        - % From Memory(remote)        4,55%          2,09%           8,22%
      DAT report                    Total      Instruction         Data
        TLB1 miss per 100 instr.       0,96           0,37            0,59
        - TLB1 cycles pr. miss       118,54         162,99          91,09
        - TLB1 CPU miss %             16,42%          8,62%           7,80%
        TLB2 miss per 100 instr.       0,49
        - PTE % miss of all TLB       51,46%
        - STE % miss of all TLB       13,06%
        - STE/LPS % of all TLB         0,48%


Danske Bank – Optimizing the runtime
To improve overall system performance we have used HIS –
•       In a controlled environment we are able to achieve a higher MIPS rate
 Report for processor: 01                  Interval:       99.230032 microsecs.
    Cycles in interval:            515.053.492.600    239.845.022.743       46,56%
    Instructions executed:         160.000.349.748    111.401.729.213       69,62%
    Cycles per instruction:                   3,21               2,15
      Mips rate:                             1.622,42            2.422,32
      RNI report                    Total      Instruction         Data
        L1 miss per 100 instr.         3,02           1,50            1,51
        L1 miss p/100 problem          1,06           0,24            0,82
        - % From L2 (in CPU)          66,09%         78,48%         53,76%
        - % From L3 (on CHIP)         25,48%         18,79%         32,13%
        - % From L3 (on BOOK)          1,96%          0,00%           3,91%
        - % From L3 (off BOOK)         0,80%          0,00%           1,61%
        - % From L4 (on BOOK)          4,01%          1,89%           6,12%
        - % From L4 (off BOOK)         0,33%          0,09%           0,58%
        - % From Memory(local)         0,27%          0,09%           0,45%
        - % From Memory(remote)        1,02%          0,63%           1,41%
      DAT report                    Total      Instruction         Data
        TLB1 miss per 100 instr.       0,40           0,14            0,26
        - TLB1 cycles pr. miss        68,88          86,74          59,07
        - TLB1 CPU miss %              8,70%          3,88%           4,82%
        TLB2 miss per 100 instr.       0,14
        - PTE % miss of all TLB       36,37%
        - STE % miss of all TLB        7,17%
        - STE/LPS % of all TLB         0,30%
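
The two HIS reports can be compared through their "Total" columns: cycles per instruction (CPI) is cycles divided by instructions, and the MIPS rate follows from the clock speed divided by CPI. The sketch below recomputes both from the figures printed in the reports. The 5200 MHz clock is an assumption based on the nominal Z196 frequency; it is not stated in the reports, so the recomputed MIPS rates are only approximately equal to the printed ones.

```python
CLOCK_MHZ = 5200  # assumption: nominal Z196 clock, not taken from the reports


def cycles_per_instruction(cycles: int, instructions: int) -> float:
    """Average number of CPU cycles spent per executed instruction."""
    return cycles / instructions


def mips_rate(cpi: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Million instructions per second implied by a CPI at a given clock."""
    return clock_mhz / cpi


# "Total" column of each report (thousands separators removed).
uncontrolled = cycles_per_instruction(235_755_942_040, 33_690_832_484)
controlled = cycles_per_instruction(515_053_492_600, 160_000_349_748)

if __name__ == "__main__":
    for label, cpi in (("uncontrolled", uncontrolled), ("controlled", controlled)):
        print(f"{label}: CPI {cpi:.4f}, approx. {mips_rate(cpi):.0f} MIPS")
    print(f"CPI ratio: {uncontrolled / controlled:.2f}")
```

The recomputed CPI values are close to the 6,99 and 3,21 printed in the reports. The lower L1 and TLB miss figures in the controlled report are consistent with the lower CPI.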


Danske Bank – Using Business Analytics for System Mgmt.
Pre-conditions:

•”Old school” Rules & Regulation will not do it alone
•It is an investment

        •Effort inside DB          3-4 my. in 2013
        •Consultancy from IBM      3-4 mm. in 2013
        •Software Cost
        •Servers
        •MIPS

•It will be a long term effort


Danske Bank – Using Business Analytics for System Mgmt.
Why is Danske Bank doing this?

•   We need to detect abnormalities in production much faster than today in:
       •Cics
             •To stop bad transaction behaviour
             •To prevent bad code from coming into production
             •To start or stop Cics regions
       •DB2
             •To alert the DBA when access paths are bad
       •Batch
             •To ensure batch jobs on the critical path will be finished in time
             •To prevent bad program/SQL changes from coming into production
       •STC
             •Detect abnormalities in consumption patterns for STCs:
                  •IP stack
                  •MQ
                  •AOC
       •Detect abnormalities in the APPLLOG


Danske Bank – Using Business Analytics for System Mgmt.
The Use Cases in 2013:

•Realtime SMF record scoring:
     •Cics transaction (SMF110)
     •DB2 Plan & Packages (SMF101)
     •SMF30 (STC & Batch)
           •Simulated SMF30
•Application Log
     •We have a consolidated application log across platforms, systems & applications

•Preventing NFR violations from coming into production


 Danske Bank – Using Business Analytics for System Mgmt.
 The modeling platform:

 [Diagram: SPSS Modeler client on Windows; IBM SPSS Collaboration Services (Windows); IBM SPSS Modeler Server (Windows); Production; TEST; Data Server; Filtering; Model calibration; New SMF data; Old SMF data]

Danske Bank – Using Business Analytics for System Mgmt.
Runtime Environment

How to ”score a model”

[Diagram: Live SMF; SMFSURF; DBLaunch; DB2; Score; SQL; Pack/Unpack; RDSMON; RDS control block]


Danske Bank – Using Business Analytics for System Mgmt.
Runtime Environment General Considerations:

•Scoring every SMF record is pointless
        •Scoring frequency will be based on a prepared usage pattern

SMFSURF Scoring Policy:

        Name   Freq       Latest SCORING
        ABC    1          GMT timestamp
        123    8
        XYZ    1500
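
A minimal sketch of a scoring policy of this kind. The names and frequencies come from the table. Reading Freq = N as "score every Nth record for that name", the default frequency for unknown names and the class itself are assumptions; the slides do not describe how SMFSURF implements this.

```python
from datetime import datetime, timezone


class ScoringPolicy:
    """Decide which SMF records to score, per transaction or job name.

    Assumption: Freq = N means "score every Nth record seen for that name".
    """

    def __init__(self, freq_by_name, default_freq=100):
        self.freq_by_name = dict(freq_by_name)
        self.default_freq = default_freq   # hypothetical fallback for unknown names
        self.seen = {}                     # records seen so far, per name
        self.latest_scoring = {}           # name -> UTC timestamp of last scoring

    def should_score(self, name, now=None):
        count = self.seen.get(name, 0) + 1
        self.seen[name] = count
        freq = self.freq_by_name.get(name, self.default_freq)
        if count % freq != 0:
            return False
        self.latest_scoring[name] = now or datetime.now(timezone.utc)
        return True


policy = ScoringPolicy({"ABC": 1, "123": 8, "XYZ": 1500})
scored = [policy.should_score("123") for _ in range(16)]

if __name__ == "__main__":
    print(scored.count(True), "of", len(scored), "records scored for name 123")
```

Keeping the latest scoring timestamp per name makes it possible to add a time-based rule later, for example to force a scoring when a rarely seen name has not been scored for a long time.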


Danske Bank – Using Business Analytics for System Mgmt.

•Why is it so difficult to get hold of LIVE SMF data?

•Why can’t SMFSURF ”subscribe” to SMF records of a particular type?

•Can SMF be warehoused in memory?
        •Maybe in flash?

[Diagram: SMF data; SMFSURF; DBLaunch; DB2; Score; SQL; Pack/Unpack]


Danske Bank – Using Business Analytics for System Mgmt.
An example:

[Charts: Normal peak day; April 2.]


Summary:
•You can achieve remarkable results in
performance and cost – but you
have to believe ☺
