
Performance Analysis: The USE Method

This document discusses performance analysis methodologies, primarily focusing on the USE Method for diagnosing system health in cloud computing environments. It illustrates real-world examples of performance issues, specifically network and memory problems, and emphasizes the need for quick assessments to avoid prolonged troubleshooting. It also outlines tools and techniques for workload characterization and drill-down analysis, advocating a question-oriented approach to gathering insights from system performance metrics.

In this document

Introduction to performance analysis presented by Brendan Gregg at FISL13, July 2012.

Brendan Gregg discusses his expertise in performance analysis and software development for performance tools.

Presentation of Joyent as a cloud computing provider offering SmartOS and virtualization solutions.

Outline of topics to be covered including problem examples, methodologies, and tools.

Description of a recent cloud performance issue related to database response time and suspected network drops.

Illustration of support levels in performance analysis and traditional vs new methods of approaching network drop issues.

Advantages of dynamic tracing over traditional network packet capture methods for real-time analysis.

Introduction to the USE method for quick system health checks, identifying memory and disk utilization issues.

Comparison of the USE method to other methodologies and its efficiency in problem resolution.

Clarification that performance methodology is a procedure and not a tool or product.

Overview of the evolution of performance analysis methods from the 90s to the present with open source dynamics.

Explanation of how the methodology serves both beginners and experts in performance analysis.

Suggested order for executing performance methodologies to address analysis effectively.

Questions to consider when determining performance issues during support procedures.

Explanation of the USE method focusing on resource utilization, saturation, and errors for system health.

Types of hardware resources involved in the USE method and importance of analyzing server functional diagrams.

Differentiation between I/O and capacity resources in performance analysis.

Types of software resources to evaluate during performance assessments.

Flow diagram for assessing resource health through the USE method.

Guidance on how to interpret utilization, saturation, and errors for effective analysis.

Details on easy and harder combinations of metrics for assessing resource performance.

Essential tools like CPU performance counters and dynamic tracing needed for thorough analysis.

Characterizing workloads to understand and improve performance by identifying load sources.

Method of deep analysis in software layers for understanding latency in performance issues.

Introduction to additional latency-based methodologies for specific database performance analysis.

Introduction to specific tools designed to implement the USE method effectively.

Detailed performance metrics for monitoring CPU utilization and saturation in illumos and Linux systems.

Introduction to Joyent's Cloud Analytics as a supportive tool for performance methodologies.

A look ahead at innovative methodologies being developed for complex performance issues.

Thank you slide with links to further resources and contact information for Brendan Gregg.

Performance Analysis: The USE Method
Brendan Gregg, Lead Performance Engineer, Joyent
brendan.gregg@joyent.com
FISL13, July 2012
whoami • I work at the top of the performance support chain • I also write open source performance tools out of necessity to solve issues • http://github.com/brendangregg • http://www.brendangregg.com/#software • And books (DTrace, Solaris Performance and Tools) • Was Brendan @ Sun Microsystems, Oracle, now Joyent
Joyent • Cloud computing provider • Cloud computing software • SmartOS • host OS, and guest via OS virtualization • Linux, Windows • guest via KVM
Agenda • Example Problem • Performance Methodology • Problem Statement • The USE Method • Workload Characterization • Drill-Down Analysis • Specific Tools
Example Problem • Recent cloud-based performance issue • Customer problem statement: • "Database response time sometimes take multiple seconds. Is the network dropping packets?" • Tested network using traceroute, which showed some packet drops
Example: Support Path • Support levels, bottom to top: Customer Issues -> 1st Level -> 2nd Level -> Top: Performance Analysis
Example: Support Path
• Customer: "network drops?"
• 1st Level: "ran traceroute, can’t reproduce"
• 2nd Level: "network looks ok, CPU also ok"
• Top (Performance Analysis): my turn
Example: Network Drops • Old fashioned: network packet capture (sniffing) • Performance overhead during capture (CPU, storage) and post-processing (wireshark) • Time consuming to analyze: not real-time
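For comparison, the old-fashioned workflow might look like the following. This is a minimal sketch only: the interface name, filter, and file path are placeholders, and illumos systems would traditionally use snoop rather than tcpdump.

# capture suspect traffic to a file (heavyweight: every matching packet is copied and stored)
tcpdump -i e1000g0 -w /var/tmp/drops.pcap 'tcp port 80'
# post-process offline later, eg, read the capture back or load it into wireshark
tcpdump -nr /var/tmp/drops.pcap | head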
Example: Network Drops • New: dynamic tracing • Efficient: only drop/retransmit paths traced • Context: kernel state readable • Real-time: analysis and summaries

# ./tcplistendrop.d
TIME                  SRC-IP          PORT          DST-IP           PORT
2012 Jan 19 01:22:49  10.17.210.103   25691   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.108   18423   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.116   38883   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.117   10739   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.112   27988   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.106   28824   ->    192.192.240.212  80
2012 Jan 19 01:22:49  10.12.143.16    65070   ->    192.192.240.212  80
[...]
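The output above is from Brendan's tcplistendrop.d script. As a much rougher sketch of the same idea, a DTrace one-liner could simply count listen drops system-wide; the mib provider probe name here is an assumption and depends on the illumos build:

# count TCP listen drops until Ctrl-C (aggregate printed on exit)
dtrace -n 'mib:::tcpListenDrop { @["listen drops"] = count(); }'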
Example: Methodology • Instead of network drop analysis, I began with the USE method to check system health
Example: Methodology • Instead of network drop analysis, I began with the USE method to check system health • In < 5 minutes, I found: • CPU: ok (light usage) • network: ok (light usage) • memory: available memory was exhausted, and the system was paging • disk: periodic bursts of 100% utilization • The method is simple, fast, directs further analysis
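As a sketch, the kind of quick checks behind those findings on an illumos system might be the following (column names follow the checklist slides later in this deck; thresholds are judgment calls, not hard rules):

# CPU and memory at a glance: "id" (CPU idle), "r" (run queue),
# "free" (free memory), "sr" (page scan rate; non-zero means memory pressure)
vmstat 1 5
# disk: look for devices pinned near 100% busy ("%b") or with queueing ("wait", "actv")
iostat -xnz 1 5
# network: per-link byte counters each second (the link name net0 is a placeholder)
dladm show-link -s -i 1 net0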
Example: Other Methodologies • Customer was surprised (are you sure?), so I used latency analysis to confirm. Details (if interesting): • memory: using both microstate accounting and dynamic tracing to confirm that anonymous page-ins were hurting the database; worst case app thread spent 97% of time waiting on disk (data faults). • disk: using dynamic tracing to confirm latency at the application / file system interface; included up to 1000ms fsync() calls. • Different methodology, smaller audience (expertise), more time (1 hour).
Example: Summary • What happened: • customer, 1st and 2nd level support spent much time chasing network packet drops. • What could have happened: • customer or 1st level follows the USE method and quickly discover memory and disk issues • memory: fixable by customer reconfig • disk: could go back to 1st or 2nd level support for confirmation • Faster resolution, frees time
Performance Methodology • Not a tool • Not a product • Is a procedure (documentation)
Performance Methodology • Not a tool -> but tools can be written to help • Not a product -> could be in monitoring solutions • Is a procedure (documentation)
Why Now: past • Performance analysis circa ‘90s, metric-orientated: • Vendor creates metrics and performance tools • Users develop methods to interpret metrics • Common method: "Tools Method" • List available performance tools • For each tool, list useful metrics • For each metric, determine interpretation • Problematic: vendors often don’t provide the best metrics; can be blind to issue types
Why Now: changes • Open Source • Dynamic Tracing • See anything, not just what the vendor gave you • Only practical on open source software • Hardest part is knowing what questions to ask
Why Now: present • Performance analysis now (post dynamic tracing), question-orientated: • Users pose questions • Check if vendor has provided metrics • Develop custom metrics using dynamic tracing • Methodologies pose the questions • What would previously be an academic exercise is now practical
Methodology Audience • Beginners: provides a starting point • Experts: provides a checklist/reminder
Performance Methodologies • Suggested order of execution: 1.Problem Statement 2.The USE Method 3.Workload Characterization 4.Drill-Down Analysis (Latency)
Problem Statement • Typical support procedure (1st Methodology): 1.What makes you think there is a problem? 2.Has this system ever performed well? 3.What changed? Software? Hardware? Load? 4.Can the performance degradation be expressed in terms of latency or run time? 5.Does the problem affect other people or applications? 6.What is the environment? What software and hardware is used? Versions? Configuration?
The USE Method • Quick System Health Check (2nd Methodology): • For every resource, check: • Utilization • Saturation • Errors
The USE Method • Quick System Health Check (2nd Methodology): • For every resource, check: • Utilization: time resource was busy, or degree used • Saturation: degree of queued extra work • Errors: any errors
The USE Method: Hardware Resources • CPUs • Main Memory • Network Interfaces • Storage Devices • Controllers • Interconnects
The USE Method: Hardware Resources • A great way to determine resources is to find (or draw) the server functional diagram • The hardware team at vendors should have these • Analyze every component in the data path
The USE Method: Functional Diagrams, Generic Example • (Diagram: CPU 1 and CPU 2 linked by a CPU interconnect, each with DRAM on a memory bus; an I/O bridge on the I/O bus connects to an I/O controller and a network controller with interfaces/ports; disks attach via an expander interconnect, ports, and transports.)
The USE Method: Resource Types • There are two different resource types, and each defines utilization differently: • I/O Resource: eg, network interface • utilization: time resource was busy. current IOPS / max or current throughput / max can be used in some cases • Capacity Resource: eg, main memory • utilization: space consumed • Storage devices act as both resource types
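A worked example (figures invented for illustration): a 1 Gbit/s network interface averaging 800 Mbit/s transmit is 800 / 1000 = 80% utilized as an I/O resource, while a server with 56 GB of its 64 GB of main memory in use is 56 / 64 = 87.5% utilized as a capacity resource.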
The USE Method: Software Resources • Mutex Locks • Thread Pools • Process/Thread Capacity • File Descriptor Capacity
The USE Method: Flow Diagram • For each resource: Choose Resource -> Errors Present? (Y: Problem Identified) -> N: High Utilization? (Y: Problem Identified) -> N: Saturation? (Y: Problem Identified; N: choose the next resource)
The USE Method: Interpretation • Utilization • 100% usually a bottleneck • 70%+ often a bottleneck for I/O resources, especially when high priority work cannot easily interrupt lower priority work (eg, disks) • Beware of time intervals. 60% utilized over 5 minutes may mean 100% utilized for 3 minutes then idle • Best examined per-device (unbalanced workloads)
The USE Method: Interpretation • Saturation • Any non-zero value adds latency • Errors • Should be obvious
The USE Method: Easy Combinations

Resource            Type         Metric
CPU                 utilization  CPU utilization
CPU                 saturation   run-queue length
Memory              utilization  available memory
Memory              saturation   paging or swapping
Network Interface   utilization  RX/TX tput / bandwidth
Storage Device I/O  utilization  device busy percent
Storage Device I/O  saturation   wait queue length
Storage Device I/O  errors       device errors
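As a rough sketch of how those cells map to commands on a Linux system (the full per-OS mappings are in the checklist slides near the end of this deck; exact column names vary with tool versions):

# CPU utilization ("id" = idle) and saturation (run queue "r")
vmstat 1 5
# memory utilization (free/available); saturation shows as swapping ("si"/"so" in vmstat)
free -m
# storage device utilization ("%util") and saturation (queue length, "avgqu-sz" or "aqu-sz")
iostat -x 1 5
# network interface RX/TX throughput, to compare against line rate
sar -n DEV 1 5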
The USE Method: Harder Combinations

Resource            Type         Metric
CPU                 errors       eg, correctable CPU cache ECC events
Network             saturation   "nocanputs", buffering
Storage Controller  utilization  active vs max controller IOPS and tput
CPU Interconnect    utilization  per port tput / max bandwidth
Mem. Interconnect   saturation   memory stall cycles
I/O Interconnect    saturation   bus throughput / max bandwidth
The USE Method: tools • To be thorough, you will need to use: • CPU performance counters • For bus and interconnect activity; eg, perf events, cpustat • Dynamic Tracing • For missing saturation and error metrics; eg, DTrace • Both can get tricky; tools can be developed to help • Please, no more top variants! ... unless it is interconnect-top or bus-top • I’ve written dozens of open source tools for both CPC and DTrace; much more can be done
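A hedged example of the CPU-performance-counter side on Linux (the event names are generic perf events; a given CPU may not support them or may name them differently):

# stall-related counters system-wide for 10 seconds (CPU/memory interconnect pressure)
perf stat -a -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend sleep 10
# last-level cache misses, a rough proxy for memory bus traffic
perf stat -a -e LLC-loads,LLC-load-misses sleep 10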
Workload Characterization • May use as a 3rd Methodology • Characterize workload by: • who is causing the load? PID, UID, IP addr, ... • why is the load called? code path • what is the load? IOPS, tput, type • how is the load changing over time? • Best performance wins are from eliminating unnecessary work • Identifies class of issues that are load-based, not architecture-based
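A sketch of the "who", "why", and "what" questions using DTrace one-liners for disk I/O (the io provider is assumed available; equivalent one-liners can be written with SystemTap or perf on Linux):

# who: processes issuing disk I/O
dtrace -n 'io:::start { @[execname, pid] = count(); }'
# why: kernel code paths leading to the I/O
dtrace -n 'io:::start { @[stack()] = count(); }'
# what: I/O size distribution
dtrace -n 'io:::start { @["bytes"] = quantize(args[0]->b_bcount); }'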
Drill-Down Analysis • May use as a 4th Methodology • Peel away software layers to drill down on the issue • Eg, software stack I/O latency analysis: Application System Call Interface File System Block Device Interface Storage Device Drivers Storage Devices
Drill-Down Analysis: Open Source • With Dynamic Tracing, all function entry & return points can be traced, with nanosecond timestamps. • One strategy is to measure latency pairs, to search for the source; eg, A->B & C->D:

static int
arc_cksum_equal(arc_buf_t *buf)
{                                                                      /* A */
        zio_cksum_t zc;
        int equal;

        mutex_enter(&buf->b_hdr->b_freeze_lock);                       /* C */
        fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
        equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);   /* D */
        mutex_exit(&buf->b_hdr->b_freeze_lock);

        return (equal);
}                                                                      /* B */
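A minimal sketch of measuring one such pair with DTrace fbt probes, here the A->B pair, ie, whole-function latency (the function name is taken from the example above; fbt probe availability depends on the kernel):

# latency distribution (ns) per arc_cksum_equal() call
dtrace -n '
fbt::arc_cksum_equal:entry { self->ts = timestamp; }
fbt::arc_cksum_equal:return /self->ts/ {
        @["ns"] = quantize(timestamp - self->ts);
        self->ts = 0;
}'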
Other Methodologies • Method R • A latency-based analysis approach for Oracle databases. See "Optimizing Oracle Performance" by Cary Millsap and Jeff Holt (2003) • Experimental approaches • Can be very useful: eg, validating network throughput using iperf
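For example, a quick experimental cross-check of network throughput with iperf (the hostname is a placeholder; flags are from the classic iperf2 tool):

# on the server
iperf -s
# on the client: 4 parallel TCP streams for 10 seconds
iperf -c server.example.com -P 4 -t 10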
Specific Tools for the USE Method
illumos-based • http://dtrace.org/blogs/brendan/2012/03/01/the-usemethod-solaris-performance-checklist/

Resource  Type         Metric
CPU       Utilization  per-cpu: mpstat 1, "idl"; system-wide: vmstat 1, "id"; per-process: prstat -c 1 ("CPU" == recent), prstat -mLc 1 ("USR" + "SYS"); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
CPU       Saturation   system-wide: uptime, load averages; vmstat 1, "r"; DTrace dispqlen.d (DTT) for a better "vmstat r"; per-process: prstat -mLc 1, "LAT"
CPU       Errors       fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling)
Memory    Saturation   system-wide: vmstat 1, "sr" (bad now), "w" (was very bad); vmstat -p 1, "api" (anon page-ins == pain), "apo"; per-process: prstat -mLc 1, "DFL"; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname
• ... etc for all combinations (would span a dozen slides)
Linux-based • http://dtrace.org/blogs/brendan/2012/03/07/the-usemethod-linux-performance-checklist/

Resource  Type         Metric
CPU       Utilization  per-cpu: mpstat -P ALL 1, "%idle"; sar -P ALL, "%idle"; system-wide: vmstat 1, "id"; sar -u, "%idle"; dstat -c, "idl"; per-process: top, "%CPU"; htop, "CPU%"; ps -o pcpu; pidstat 1, "%CPU"; per-kernel-thread: top/htop ("K" to toggle), where VIRT == 0 (heuristic). [1]
CPU       Saturation   system-wide: vmstat 1, "r" > CPU count [2]; sar -q, "runq-sz" > CPU count; dstat -p, "run" > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows "Average" and "Maximum" delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp "queued(us)" [3]
CPU       Errors       perf (LPE) if processor specific error events (CPC) are available; eg, AMD64's "04Ah Single-bit ECC Errors Recorded by Scrubber" [4]
• ... etc for all combinations (would span a dozen slides)
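As one concrete cell from that table, per-process scheduler (run-queue) delay can be read straight from /proc on kernels with schedstats enabled; a rough sketch:

# print pid and cumulative run-queue wait (field 2 of schedstat, in nanoseconds)
awk '{ split(FILENAME, p, "/"); print p[3], 2ドル }' /proc/[0-9]*/schedstat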
Products • Earlier I said methodologies could be supported by monitoring solutions • At Joyent we develop Cloud Analytics:
Future • Methodologies for advanced performance issues • I recently worked a complex KVM bandwidth issue where no current methodologies really worked • Innovative methods based on open source + dynamic tracing • Less performance mystery. Less guesswork. • Better use of resources (price/performance) • Easier for beginners to get started
Thank you • Resources: • http://dtrace.org/blogs/brendan • http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/ • http://dtrace.org/blogs/brendan/tag/usemethod/ • http://dtrace.org/blogs/brendan/2011/12/18/visualizing-deviceutilization/ - ideas if you are a monitoring solution developer • brendan@joyent.com

