
Performance Analysis: The USE Method

This document discusses performance analysis methodologies, primarily focusing on the USE Method for diagnosing system health in cloud computing environments. It illustrates real-world examples of performance issues, specifically network and memory problems, and emphasizes the need for quick assessments to avoid prolonged troubleshooting. It also outlines tools and techniques for workload characterization and drill-down analysis, advocating a question-oriented approach to gathering insights from system performance metrics.

In this document

Introduction to performance analysis presented by Brendan Gregg at FISL13, July 2012.

Brendan Gregg discusses his expertise in performance analysis and software development for performance tools.

Presentation of Joyent as a cloud computing provider offering SmartOS and virtualization solutions.

Outline of topics to be covered including problem examples, methodologies, and tools.

Description of a recent cloud performance issue related to database response time and suspected network drops.

Illustration of support levels in performance analysis and traditional vs new methods of approaching network drop issues.

Advantages of dynamic tracing over traditional network packet capture methods for real-time analysis.

Introduction to the USE method for quick system health checks, identifying memory and disk utilization issues.

Comparison of the USE method to other methodologies and its efficiency in problem resolution.

Clarification that performance methodology is a procedure and not a tool or product.

Overview of the evolution of performance analysis methods from the 90s to the present with open source dynamics.

Explanation of how the methodology serves both beginners and experts in performance analysis.

Suggested order for executing performance methodologies to address analysis effectively.

Questions to consider when determining performance issues during support procedures.

Explanation of the USE method focusing on resource utilization, saturation, and errors for system health.

Types of hardware resources involved in the USE method and importance of analyzing server functional diagrams.

Differentiation between I/O and capacity resources in performance analysis.

Types of software resources to evaluate during performance assessments.

Flow diagram for assessing resource health through the USE method.

Guidance on how to interpret utilization, saturation, and errors for effective analysis.

Details on easy and harder combinations of metrics for assessing resource performance.

Essential tools like CPU performance counters and dynamic tracing needed for thorough analysis.

Characterizing workloads to understand and improve performance by identifying load sources.

Method of deep analysis in software layers for understanding latency in performance issues.

Introduction to additional latency-based methodologies for specific database performance analysis.

Introduction to specific tools designed to implement the USE method effectively.

Detailed performance metrics for monitoring CPU utilization and saturation in illumos and Linux systems.

Introduction to Joyent's Cloud Analytics as a supportive tool for performance methodologies.

A look ahead at innovative methodologies being developed for complex performance issues.

Thank you slide with links to further resources and contact information for Brendan Gregg.

Performance Analysis: The USE Method
Brendan Gregg, Lead Performance Engineer, Joyent
brendan.gregg@joyent.com
FISL13, July 2012
whoami • I work at the top of the performance support chain • I also write open source performance tools out of necessity to solve issues • http://github.com/brendangregg • http://www.brendangregg.com/#software • And books (DTrace, Solaris Performance and Tools) • Was Brendan @ Sun Microsystems, Oracle, now Joyent
Joyent • Cloud computing provider • Cloud computing software • SmartOS • host OS, and guest via OS virtualization • Linux, Windows • guest via KVM
Agenda • Example Problem • Performance Methodology • Problem Statement • The USE Method • Workload Characterization • Drill-Down Analysis • Specific Tools
Example Problem • Recent cloud-based performance issue • Customer problem statement: • "Database response time sometimes take multiple seconds. Is the network dropping packets?" • Tested network using traceroute, which showed some packet drops
Example: Support Path • Support levels, bottom to top: Customer Issues -> 1st Level -> 2nd Level -> Top: Performance Analysis
Example: Support Path
• Customer: "network drops?"
• 1st Level: "ran traceroute, can’t reproduce"
• 2nd Level: "network looks ok, CPU also ok"
• Top (Performance Analysis): my turn
Example: Network Drops • Old fashioned: network packet capture (sniffing) • Performance overhead during capture (CPU, storage) and post-processing (wireshark) • Time consuming to analyze: not real-time
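For comparison, the old-fashioned workflow might look like the following. This is a minimal sketch only: the interface name, filter, and file path are placeholders, and illumos systems would traditionally use snoop rather than tcpdump.

# capture suspect traffic to a file (heavyweight: every matching packet is copied and stored)
tcpdump -i e1000g0 -w /var/tmp/drops.pcap 'tcp port 80'
# post-process offline later, eg, read the capture back or load it into wireshark
tcpdump -nr /var/tmp/drops.pcap | head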
Example: Network Drops • New: dynamic tracing • Efficient: only drop/retransmit paths traced • Context: kernel state readable • Real-time: analysis and summaries

# ./tcplistendrop.d
TIME                  SRC-IP          PORT          DST-IP           PORT
2012 Jan 19 01:22:49  10.17.210.103   25691   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.108   18423   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.116   38883   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.117   10739   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.112   27988   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.106   28824   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.12.143.16    65070   ->    192.192.240.212  80
[...]
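The output above is from Brendan's tcplistendrop.d script. As a much rougher sketch of the same idea, a DTrace one-liner could simply count listen drops system-wide; the mib provider probe name here is an assumption and depends on the illumos build:

# count TCP listen drops until Ctrl-C (aggregate printed on exit)
dtrace -n 'mib:::tcpListenDrop { @["listen drops"] = count(); }'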
Example: Methodology • Instead of network drop analysis, I began with the USE method to check system health
Example: Methodology • Instead of network drop analysis, I began with the USE method to check system health • In < 5 minutes, I found: • CPU: ok (light usage) • network: ok (light usage) • memory: available memory was exhausted, and the system was paging • disk: periodic bursts of 100% utilization • The method is simple, fast, directs further analysis
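As a sketch, the kind of quick checks behind those findings on an illumos system might be the following (column names follow the checklist slides later in this deck; thresholds are judgment calls, not hard rules):

# CPU and memory at a glance: "id" (CPU idle), "r" (run queue),
# "free" (free memory), "sr" (page scan rate; non-zero means memory pressure)
vmstat 1 5
# disk: look for devices pinned near 100% busy ("%b") or with queueing ("wait", "actv")
iostat -xnz 1 5
# network: per-link byte counters each second (the link name net0 is a placeholder)
dladm show-link -s -i 1 net0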
Example: Other Methodologies • Customer was surprised (are you sure?), so I used latency analysis to confirm. Details (if interesting): • memory: using both microstate accounting and dynamic tracing to confirm that anonymous page-ins were hurting the database; worst case app thread spent 97% of time waiting on disk (data faults). • disk: using dynamic tracing to confirm latency at the application / file system interface; included up to 1000ms fsync() calls. • Different methodology, smaller audience (expertise), more time (1 hour).
Example: Summary • What happened: • customer, 1st and 2nd level support spent much time chasing network packet drops. • What could have happened: • customer or 1st level follows the USE method and quickly discover memory and disk issues • memory: fixable by customer reconfig • disk: could go back to 1st or 2nd level support for confirmation • Faster resolution, frees time
Performance Methodology • Not a tool • Not a product • Is a procedure (documentation)
Performance Methodology • Not a tool -> but tools can be written to help • Not a product -> could be in monitoring solutions • Is a procedure (documentation)
Why Now: past • Performance analysis circa ‘90s, metric-orientated: • Vendor creates metrics and performance tools • Users develop methods to interpret metrics • Common method: "Tools Method" • List available performance tools • For each tool, list useful metrics • For each metric, determine interpretation • Problematic: vendors often don’t provide the best metrics; can be blind to issue types
Why Now: changes • Open Source • Dynamic Tracing • See anything, not just what the vendor gave you • Only practical on open source software • Hardest part is knowing what questions to ask
Why Now: present • Performance analysis now (post dynamic tracing), question-orientated: • Users pose questions • Check if vendor has provided metrics • Develop custom metrics using dynamic tracing • Methodologies pose the questions • What would previously be an academic exercise is now practical
Methodology Audience • Beginners: provides a starting point • Experts: provides a checklist/reminder
Performance Methodologies • Suggested order of execution: 1.Problem Statement 2.The USE Method 3.Workload Characterization 4.Drill-Down Analysis (Latency)
Problem Statement • Typical support procedure (1st Methodology): 1.What makes you think there is a problem? 2.Has this system ever performed well? 3.What changed? Software? Hardware? Load? 4.Can the performance degradation be expressed in terms of latency or run time? 5.Does the problem affect other people or applications? 6.What is the environment? What software and hardware is used? Versions? Configuration?
The USE Method • Quick System Health Check (2nd Methodology): • For every resource, check: • Utilization • Saturation • Errors
The USE Method • Quick System Health Check (2nd Methodology): • For every resource, check: • Utilization: time resource was busy, or degree used • Saturation: degree of queued extra work • Errors: any errors
The USE Method: Hardware Resources • CPUs • Main Memory • Network Interfaces • Storage Devices • Controllers • Interconnects
The USE Method: Hardware Resources • A great way to determine resources is to find (or draw) the server functional diagram • The hardware team at vendors should have these • Analyze every component in the data path
The USE Method: Functional Diagrams, Generic Example • (Diagram: CPU 1 and CPU 2 linked by a CPU interconnect, each with DRAM on a memory bus; an I/O bridge on the I/O bus connects to an I/O controller and a network controller with interfaces/ports; disks attach via an expander interconnect, ports, and transports.)
The USE Method: Resource Types • There are two different resource types, and each defines utilization differently: • I/O Resource: eg, network interface • utilization: time resource was busy. current IOPS / max or current throughput / max can be used in some cases • Capacity Resource: eg, main memory • utilization: space consumed • Storage devices act as both resource types
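A worked example (figures invented for illustration): a 1 Gbit/s network interface averaging 800 Mbit/s transmit is 800 / 1000 = 80% utilized as an I/O resource, while a server with 56 GB of its 64 GB of main memory in use is 56 / 64 = 87.5% utilized as a capacity resource.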
The USE Method: Software Resources • Mutex Locks • Thread Pools • Process/Thread Capacity • File Descriptor Capacity
The USE Method: Flow Diagram • For each resource: Choose Resource -> Errors Present? (Y: Problem Identified) -> N: High Utilization? (Y: Problem Identified) -> N: Saturation? (Y: Problem Identified; N: choose the next resource)
The USE Method: Interpretation • Utilization • 100% usually a bottleneck • 70%+ often a bottleneck for I/O resources, especially when high priority work cannot easily interrupt lower priority work (eg, disks) • Beware of time intervals. 60% utilized over 5 minutes may mean 100% utilized for 3 minutes then idle • Best examined per-device (unbalanced workloads)
The USE Method: Interpretation • Saturation • Any non-zero value adds latency • Errors • Should be obvious
The USE Method: Easy Combinations

Resource            Type         Metric
CPU                 utilization  CPU utilization
CPU                 saturation   run-queue length
Memory              utilization  available memory
Memory              saturation   paging or swapping
Network Interface   utilization  RX/TX tput / bandwidth
Storage Device I/O  utilization  device busy percent
Storage Device I/O  saturation   wait queue length
Storage Device I/O  errors       device errors
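As a rough sketch of how those cells map to commands on a Linux system (the full per-OS mappings are in the checklist slides near the end of this deck; exact column names vary with tool versions):

# CPU utilization ("id" = idle) and saturation (run queue "r")
vmstat 1 5
# memory utilization (free/available); saturation shows as swapping ("si"/"so" in vmstat)
free -m
# storage device utilization ("%util") and saturation (queue length, "avgqu-sz" or "aqu-sz")
iostat -x 1 5
# network interface RX/TX throughput, to compare against line rate
sar -n DEV 1 5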
The USE Method: Harder Combinations

Resource            Type         Metric
CPU                 errors       eg, correctable CPU cache ECC events
Network             saturation   "nocanputs", buffering
Storage Controller  utilization  active vs max controller IOPS and tput
CPU Interconnect    utilization  per port tput / max bandwidth
Mem. Interconnect   saturation   memory stall cycles
I/O Interconnect    saturation   bus throughput / max bandwidth
The USE Method: tools • To be thorough, you will need to use: • CPU performance counters • For bus and interconnect activity; eg, perf events, cpustat • Dynamic Tracing • For missing saturation and error metrics; eg, DTrace • Both can get tricky; tools can be developed to help • Please, no more top variants! ... unless it is interconnect-top or bus-top • I’ve written dozens of open source tools for both CPC and DTrace; much more can be done
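A hedged example of the CPU-performance-counter side on Linux (the event names are generic perf events; a given CPU may not support them or may name them differently):

# stall-related counters system-wide for 10 seconds (CPU/memory interconnect pressure)
perf stat -a -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend sleep 10
# last-level cache misses, a rough proxy for memory bus traffic
perf stat -a -e LLC-loads,LLC-load-misses sleep 10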
Workload Characterization • May use as a 3rd Methodology • Characterize workload by: • who is causing the load? PID, UID, IP addr, ... • why is the load called? code path • what is the load? IOPS, tput, type • how is the load changing over time? • Best performance wins are from eliminating unnecessary work • Identifies class of issues that are load-based, not architecture-based
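A sketch of the "who", "why", and "what" questions using DTrace one-liners for disk I/O (the io provider is assumed available; equivalent one-liners can be written with SystemTap or perf on Linux):

# who: processes issuing disk I/O
dtrace -n 'io:::start { @[execname, pid] = count(); }'
# why: kernel code paths leading to the I/O
dtrace -n 'io:::start { @[stack()] = count(); }'
# what: I/O size distribution
dtrace -n 'io:::start { @["bytes"] = quantize(args[0]->b_bcount); }'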
Drill-Down Analysis • May use as a 4th Methodology • Peel away software layers to drill down on the issue • Eg, software stack I/O latency analysis: Application System Call Interface File System Block Device Interface Storage Device Drivers Storage Devices
Drill-Down Analysis: Open Source • With Dynamic Tracing, all function entry & return points can be traced, with nanosecond timestamps. • One strategy is to measure latency pairs, to search for the source; eg, A->B & C->D:

static int
arc_cksum_equal(arc_buf_t *buf)
{                                                                      /* A */
        zio_cksum_t zc;
        int equal;

        mutex_enter(&buf->b_hdr->b_freeze_lock);                       /* C */
        fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
        equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);   /* D */
        mutex_exit(&buf->b_hdr->b_freeze_lock);

        return (equal);
}                                                                      /* B */
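A minimal sketch of measuring one such pair with DTrace fbt probes, here the A->B pair, ie, whole-function latency (the function name is taken from the example above; fbt probe availability depends on the kernel):

# latency distribution (ns) per arc_cksum_equal() call
dtrace -n '
fbt::arc_cksum_equal:entry { self->ts = timestamp; }
fbt::arc_cksum_equal:return /self->ts/ {
        @["ns"] = quantize(timestamp - self->ts);
        self->ts = 0;
}'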
Other Methodologies • Method R • A latency-based analysis approach for Oracle databases. See "Optimizing Oracle Performance" by Cary Millsap and Jeff Holt (2003) • Experimental approaches • Can be very useful: eg, validating network throughput using iperf
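For example, a quick experimental cross-check of network throughput with iperf (the hostname is a placeholder; flags are from the classic iperf2 tool):

# on the server
iperf -s
# on the client: 4 parallel TCP streams for 10 seconds
iperf -c server.example.com -P 4 -t 10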
Specific Tools for the USE Method
illumos-based • http://dtrace.org/blogs/brendan/2012/03/01/the-usemethod-solaris-performance-checklist/

Resource  Type         Metric
CPU       Utilization  per-cpu: mpstat 1, "idl"; system-wide: vmstat 1, "id"; per-process: prstat -c 1 ("CPU" == recent), prstat -mLc 1 ("USR" + "SYS"); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
CPU       Saturation   system-wide: uptime, load averages; vmstat 1, "r"; DTrace dispqlen.d (DTT) for a better "vmstat r"; per-process: prstat -mLc 1, "LAT"
CPU       Errors       fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling)
Memory    Saturation   system-wide: vmstat 1, "sr" (bad now), "w" (was very bad); vmstat -p 1, "api" (anon page-ins == pain), "apo"; per-process: prstat -mLc 1, "DFL"; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname
• ... etc for all combinations (would span a dozen slides)
Linux-based • http://dtrace.org/blogs/brendan/2012/03/07/the-usemethod-linux-performance-checklist/

Resource  Type         Metric
CPU       Utilization  per-cpu: mpstat -P ALL 1, "%idle"; sar -P ALL, "%idle"; system-wide: vmstat 1, "id"; sar -u, "%idle"; dstat -c, "idl"; per-process: top, "%CPU"; htop, "CPU%"; ps -o pcpu; pidstat 1, "%CPU"; per-kernel-thread: top/htop ("K" to toggle), where VIRT == 0 (heuristic). [1]
CPU       Saturation   system-wide: vmstat 1, "r" > CPU count [2]; sar -q, "runq-sz" > CPU count; dstat -p, "run" > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows "Average" and "Maximum" delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp "queued(us)" [3]
CPU       Errors       perf (LPE) if processor specific error events (CPC) are available; eg, AMD64's "04Ah Single-bit ECC Errors Recorded by Scrubber" [4]
• ... etc for all combinations (would span a dozen slides)
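As one concrete cell from that table, per-process scheduler (run-queue) delay can be read straight from /proc on kernels with schedstats enabled; a rough sketch:

# print pid and cumulative run-queue wait (field 2 of schedstat, in nanoseconds)
awk '{ split(FILENAME, p, "/"); print p[3], 2ドル }' /proc/[0-9]*/schedstat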
Products • Earlier I said methodologies could be supported by monitoring solutions • At Joyent we develop Cloud Analytics:
Future • Methodologies for advanced performance issues • I recently worked a complex KVM bandwidth issue where no current methodologies really worked • Innovative methods based on open source + dynamic tracing • Less performance mystery. Less guesswork. • Better use of resources (price/performance) • Easier for beginners to get started
Thank you • Resources: • http://dtrace.org/blogs/brendan • http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/ • http://dtrace.org/blogs/brendan/tag/usemethod/ • http://dtrace.org/blogs/brendan/2011/12/18/visualizing-deviceutilization/ - ideas if you are a monitoring solution developer • brendan@joyent.com

