Linux Systems Capacity Planning - Rodrigo Campos - @xinu USENIX LISA '11 - Boston, MA

Page created by Kirk Lucas

Uncategorized

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Linux Systems Capacity Planning - Rodrigo Campos - @xinu USENIX LISA '11 - Boston, MA

Linux Systems Capacity
Planning
Rodrigo Campos
camposr@gmail.com - @xinu
USENIX LISA ’11 - Boston, MA

Agenda

Where, what, why?
Performance monitoring
Capacity Planning
Putting it all together

Where, what, why ?

       75 million internet users
       1,419.6% growth (2000-2011)
       29% increase in unique IPv4 addresses (2010-2011)
       37% population penetration

Sources:
Internet World Stats - http://www.internetworldstats.com/stats15.htm
Akamai’s State of the Internet 2nd Quarter 2011 report - http://www.akamai.com/stateoftheinternet/

Where, what, why ?
High taxes
Shrinking budgets
High Infrastructure costs
Complicated (immature?) procurement processes
Lack of economically feasible hardware options
Lack of technically qualified professionals

Where, what, why ?
Do more with the same infrastructure
Move away from tactical fire fighting
While at it, handle:
  Unpredicted traffic spikes
  High demand events
  Organic growth

Performance Monitoring

Typical system performance metrics
  CPU usage
  IO rates
  Memory usage
  Network traffic

Performance Monitoring

Commonly used tools:
  Sysstat package - iostat, mpstat et al
  Bundled command line utilities - ps, top, uptime
  Time series charts (orcallator’s offspring)
    Many are based on RRD (cacti, torrus, ganglia,
    collectd)

Performance Monitoring

Time series performance data is useful for:
  Troubleshooting
  Simplistic forecasting
  Find trends and seasonal behavior

Performance Monitoring

Performance Monitoring

"Correlation does not imply causation"
Time series methods won’t help you much for:
  Create what-if scenarios
  Fully understand application behavior
  Identify non obvious bottlenecks

Monitoring vs. Modeling
    “The difference between performance
    modeling and performance monitoring is
    like the difference between weather
    prediction and simply watching a weather-
    vane twist in the wind”

Source: http://www,perfdynamics,com/Manifesto/gcaprules,html

Capacity Planning

Not exactly something new...
Can we apply the very same techniques to modern,
distributed systems ?
Should we ?

What’s in a queue ?

Agner Krarup Erlang
Invented the fields of traffic engineering and
queuing theory
1909 - Published “The theory of Probabilities
and Telephone Conversations”

What’s in a queue ?

Allan Scherr (1967) used the
machine repairman problem to
represent a timesharing system
with n terminals

What’s in a queue ?

Dr. Leonard Kleinrock
“Queueing Systems” (1975) - ISBN 0471491101
Created the basic principles of packet switching while
at MIT

What’s in a queue ?

                              (A)   λ               X   (C)
                                                S
     Open/Closed                        W
      Network                               R
A          Arrival Count
λ        Arrival Rate (A/T)
W      Time spent in Queue
R      Residence Time (W+S)
S          Service Time
X     System Throughput (C/T)
C     Completed tasks count

Service Time

Time spent in processing (S)
  Web server response time
  Total Query time
  Time spent in IO operation

System Throughput

Arrival rate (λ) and system throughput (X) are the same
in a steady queue system (i.e. stable queue size)
  Hits per second
  Queries per second
  IOPS

Utilization
 Utilization (ρ) is the amount of time that a queuing node
 (e.g. a server) is busy (B) during the measurement
 period (T)
 Pretty simple, but helps us to get processor share of an
 application using getrusage() output
 Important when you have multicore systems

                    ρ = B/T

Utilization

 CPU bound HPC application running in a two core
 virtualized system
 Every 10 seconds it prints resource utilization data to a
 log file

Utilization
(void)getrusage(RUSAGE_SELF, &ru);
(void)printRusage(&ru);
...
static void printRusage(struct rusage *ru)
{
    fprintf(stderr, "user time = %lf\n",
       (double)ru->ru_utime.tv_sec + (double)ru->ru_utime.tv_usec / 1000000);
    fprintf(stderr, "system time = %lf\n",
       (double)ru->ru_stime.tv_sec + (double)ru->ru_stime.tv_usec / 1000000);
} // end of printRusage

10 seconds wallclock time
377,632 jobs done
user time = 7.028439
system time = 0.008000

Utilization
                 We have 2 cores so we
                  can run 3 application
                instances in each server
                   (200/70.36) = 2.84
 ρ = B/T
 ρ = (7.028+0.008) / 10
 ρ = 70.36%

Little’s Law
 Named after MIT professor John Dutton Conant Little
 The long-term average number of customers in a
 stable system L is equal to the long-term average
 effective arrival rate, λ, multiplied by the average time a
 customer spends in the system, W; or expressed
 algebraically: L = λW
 You can use this to calculate the minimum
 amount of spare workers in any application

Little’s Law

 L = λW
                          tcpdump -vttttt
 λ = 120 hits/s
 W = Round-trip delay + service time
 W = 0.01594 + 0.07834 = 0.09428
 L = 120 * 0.09428 = 11,31

Utilization and Little’s Law

 By substitution, we can get the utilization by multiplying
 the arrival rate and the mean service time

                     ρ = λS

Putting it all together

 Applications write in a log file the service time and
 throughput for most operations
 For Apache:
   %D in mod_log_config (microseconds)
   “ExtendedStatus On” whenever it’s possible
 For nginx:
   $request_time in HttpLogModule (milliseconds)

Putting it all together

Putting it all together

 Generated with HPA: https://github.com/camposr/HTTP-Performance-Analyzer

Putting it all together

  A simple tag collection data store
  For each data operation:
    A 64 bit counter for the number of calls
    An average counter for the service time

Putting it all together
    Method          Call Count   Service Time (ms)

    dbConnect         1,876            11.2
   fetchDatum       19,987,182         12.4
    postDatum        1,285,765         98.4
   deleteDatum       312,873           31.1
    fetchKeys       27,334,983        278.3
  fetchCollection   34,873,194        211.9
 createCollection    118,853          219.4

Putting it all together
                                     Call Count x Service Time

                                                         fetchKeys
                        createCollection
    Service Time (ms)

                                                                 fetchCollection

                         deleteDatum

                        postDatum
                        dbConnect                 fetchDatum
                                            Call Count

Modeling

An abstraction of a complex system
Allows us to observe phenomena that can not be easily
replicated
“Models come from God, data comes from the devil” -
Neil Gunther, PhD.

Modeling
                     Clients

     Requests                           Replies

      Web Server   Application   Database

Modeling
                                  Clients

          Requests                                   Replies

  Cache              Web Server        Application   Database

Modeling

We’re using PDQ in order to model queue circuits
Freely available at:
  http://www.perfdynamics.com/Tools/PDQ.html
Pretty Damn Quick (PDQ) analytically solves queueing
network models of computer and manufacturing
systems, data networks, etc., written in conventional
programming languages.

Modeling

  CreateNode()        Define a queuing center

                    Define a traffic stream of an
  CreateOpen()
                            open circuit
                    Define a traffic stream of a
  CreateClosed()
                          closed circuit
                   Define the service demand for
   SetDemand()
                    each of the queuing centers

Modeling
$httpServiceTime = 0.00019;
$appServiceTime = 0.0012;
$dbServiceTime = 0.00099;
$arrivalRate = 18.762;

pdq::Init("Tag Service");

$pdq::nodes = pdq::CreateNode('HTTP Server',
$pdq::CEN, $pdq::FCFS);
$pdq::nodes = pdq::CreateNode('Application Server',
$pdq::CEN, $pdq::FCFS);
$pdq::nodes = pdq::CreateNode('Database Server',
$pdq::CEN, $pdq::FCFS);

Modeling
                    =======================================
                    ******   PDQ Model OUTPUTS      *******
                    =======================================

 Solution Method: CANON

                    ******   SYSTEM Performance     *******

 Metric                        Value    Unit
 ------                        -----    ----
 Workload: "Application"
 Number in system             1.3379    Requests
 Mean throughput             18.7620    Requests/Seconds
 Response time                0.0713    Seconds
 Stretch factor               1.5970

 Bounds Analysis:
 Max throughput              44.4160    Requests/Seconds
 Min response                 0.0447    Seconds

Systemwide*Requests*/*second*

                                             0"
                                                  10"
                                                          20"
                                                                     30"
                                                                                40"
                                                                                        50"
                                                                                              60"
                                  0.0
                                     009
                                        8"
                                  0.0
                                     010
                                        3"
                                  0.0
                                     010
                                        8"
                                  0.0
                                     011
                                        3"
                                  0.0
                                     011
                                        8"
                                  0.0
                                     012
                                        3"
                                  0.0
                                     012
                                        8"
                                  0.0
                                     013
                                        3"
                                  0.0
                                     013
                                                                                                                                                        Modeling

                                        8"
                                  0.0
                                     014
                                        3"
                                  0.0
                                     014
                                        8"
                                  0.0
                                     015
                                        3"
                                  0.0
                                     015
                                        8"
                                  0.0
                                     016
                                        3"
                                  0.0
                                     016
                                        8"
                                  0.0
                                     017
                                        3"
                                  0.0
                                     017
                                        8"
                                  0.0
                                     018
                                        3"
                                  0.0
                                     018
                                        8"
                                  0.0
                                     019
                                        3"
                                  0.0

Database*Service*7me*(seconds)*
                                     019
                                        8"
                                  0.0
                                     020
                                        3"
                                  0.0
                                     020
                                        8"
                                  0.0
                                     021
                                        3"
                                  0.0
                                     021
                                        8"
                                  0.0
                                     022
                                        3"
                                                                                                    System*Throughput*based*on*Database*Service*Time*

                                  0.0
                                     022
                                        8"
                                  0.0
                                     023
                                        3"
                                  0.0
                                     023
                                        8"
                                  0.0
                                     024
                                        3"
                                  0.0
                                     024
                                        8"
                                  0.0
                                     025
                                        3"

Modeling

Complete makeover of a web collaborative portal
Moving from a commercial-of-the-shelf platform to a
fully customized in-house solution
How high it will fly?

Modeling

Customer Behavior Model Graph (CBMG)
 Analyze user behavior using session logs
 Understand user activity and optimize hotspots
 Optimize application cache algorithms

Modeling
                                          0.08

                   Initial                                Create
                   Page            0.73                  New Topic
     User
     Login
                                          Active
                                          Topics
                               0.3
                                                                     0.8

      Control
                                                     0.6
      Panel
                  Private
                 Messages          0.2
                                                   Unanswer
                                                   ed Topics
        Answer
         Topic

                             0.1

        User
       Logout                                    Read
                                                 Topic

Modeling

Now we can mimic the user behavior in the newly
developed system
The application was instrumented so we know the
service time for every method
Each node in the CBMG is mapped to the application
methods it is related

References
Using a Queuing Model to Analyze the Performance of
Web Servers - Khaled M. ELLEITHY and Anantha
KOMARALINGAM
A capacity planning / queueing theory primer - Ethan
D. Bolker
Analyzing Computer System Performance with
Perl::PDQ - N. J. Gunther
Computer Measurement Group Public Proceedings

Questions answered here

         Thanks for attending !
Rodrigo Campos
camposr@gmail.com
http://twitter.com/xinu
http://capacitricks.posterous.com

You can also read

WRHHD405A Select and apply hair extensions

ANNUAL HOSPITALITY MEMBERSHIP 2018/19 SEASON - Arabian Travel Market

Quantum glasses - Frustration and collective behavior at T = 0 - Markus Müller (ICTP Trieste)

Power through your high school courseload with a responsive Chromebook - Principled Technologies

Term 3 Recreation Adults 18+ 2021 - Geelong - Register your expression of

Report on UBC Connect - EECE 418

Exposure Notifications - Frequently Asked Questions Preliminary - Subject to Modification and Extension - Google

Puskás Arena pitched for victory with d&b.

15.6 Planets Beyond the Solar System

STT Status Peter Wintz (IKP, FZJ) for the STT Group - TRK Session @ PANDA CM 17/3, Novosibirsk, Sep 2017 - GSI Indico

The Six Ideas Toronto Noise Mitigation Initiatives - Presented by NAV CANADA & GTAA - Toronto Pearson

RFI experience for the application of satellite technologies to ERTMS signalling system - Massimiliano Ciaffi Techinical Department Standard CCS

NEW PRODUCTS 2020 www.rose-systemtechnik.com - Phoenix Mecano Inc.

PUBLIC EMERGENCY NOTIFICATION AND INSTRUCTION - Alaska ...

Information required for the smooth functioning of the BCI Quebec Student Exchange Program - 2020-2021 December 18, 2019 - DRE

Information required for the smooth functioning of the BCI student exchange program - 2019-2020 December 14, 2018

Deploying DataStax Enterprise with Mesosphere Datacenter Operating System

COVID-19 Vaccine, mRNA (Moderna) - ASHP

2007 Saturn VUE Green Line Hybrid - Emergency Response Guide

ALERT TAX & EXCHANGE CONTROL - Cliffe ...

HazardWatch FM Approved Fire & Gas Detection System - Exact Oil & Gas Sdn Bhd

BLOCKCHAIN BASED SMART TRAFFIC SYSTEM TO ENHANCE TRAFFIC NEUTRALITY - IRJET

Sentinel RMS SDK v9.7.0 - RELEASE NOTES FOR LINUX (32-BIT AND 64-BIT)

Catalogue 2020 - SANGUIS COUNTING

Morpheus and ICE Deployment on Amazon Web Services (AWS) - Grass Valley

ATLAS LAr Calorimeter Smart RTM - John Hobbs, Dean Schamberger Gene Shafto, Vaneet Singh Stony Brook University Charlie Armijo, Kade Gigliotti ...

Engineering & Management tools - for digital substa ons

What's next Stevie Awards & Nuance earns Technology Partner of the Year

Inspections and Preventive Maintenance - Konecranes

Withholding Tax (WHT) Payments under the Inland Revenue Act No.24 of 2017 - Deloitte

Canadian Sport Institute Pacific and Cycling BC Athlete and Coach Nomination Criteria - Criteria Approved January 20, 2021

Hospitality (Customer Service & Leadership) + Communication + Grooming Skills (HCGS)

SERVICE & TROUBLESHOOTING MANUAL FOR - DISPLAY HOLD CABINETS DHC SERIES

FOSDEM 2021 Libre Endowment Fu

HOLIDAY 2020 CATALOG - Homestead Cards

Your WestJet RBC World Elite Mastercard - RBC Royal Bank

2021 WILA SALES KITŽ Company updates, sales materials, and service resources for WILA partners - Wila USA

Education Welfare Academy Partnership Guide 2020-2021 - February 2021

YOUR FAMILY-OWNED PARCEL SERVICE - Planzer Paket

Reliance Power Ltd Investor Presentation - Confidential

Automated turning at your finger tips - pioneering simplicity - Repose

Appendix 'A' GTR 2018 Timetable Consultation - Consultation Question Responses on behalf of East Herts Council

Welcome!!! First Steps 2020 at Anhalt University - Hochschule ...

2019 Residential and Commercial Solar Photovoltaic Program Kick-off

Guidance on complaints involving insurance issues - Published on website, January 2020 - Housing ...

Content Template Guidelines - Textlocal

PROCEEDINGS OF SPIE Design and space qualification of laser for laser altimeter

Incentive on Connections Engagement (ICE) - Quarter 1 Update - April-June 2021 - UK Power Networks

Intro to Splunk IT Service Intelligence - The Latest and greatest from the new release - Splunk Conf

Winter Services Training 2019 - theihe.org - Institute of Highway Engineers