How I Use Iostat and Vmstat for Performance Analysis
System Performance Analysis
As Andy Hunt states in his book Pragmatic Thinking and Learning, the most significant difference between a novice programmer and an expert is the ability to sense the context of a problem while solving it. A novice needs only a list of context-independent, step-by-step operations, while an expert has to know as many details as possible. Thus, as an expert in system performance analysis and tuning, you need tools such as iostat to get those details. And in order to become such an expert, you need a step-by-step guide.
Here is the guide.
One common question: why use two tools instead of one? vmstat and iostat are both powerful and overlap in places, e.g., CPU usage statistics. I chose the two simply because other people do so. More specifically, see the following vmstat sample:
```
# vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu-----
 r  b   swpd     free    buff   cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 22880024 1122008 6301332    0    0     0     5    1    0  0  0 100  0
 1  0      0 22879884 1122008 6301332    0    0     0     0    7  189  0  0 100  0
 0  0      0 22879884 1122008 6301332    0    0     0     4    5  154  0  0 100  0
```
and iostat sample:
```
# iostat sdb -xdk 1 3
Linux 2.6.26-2-amd64 (R900)     01/29/2014     _x86_64_

Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0.01    6.99  0.05  0.25   0.47  28.98   191.90     0.05 178.20   2.52   0.08

Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00

Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sdb        0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
```
We pass the “-d” option to iostat to suppress the CPU statistics, which vmstat already shows. As we can see, the output of vmstat covers OS status (r for runnable processes, b for blocked processes, in for interrupts, cs for context switches) as well as memory and CPU status. Meanwhile, the output of iostat focuses on disk devices. Note that the data of iostat comes from the block layer, which sits below the page cache layer.
In summary, we use vmstat to get OS, memory, and CPU status, and iostat to get disk status. Network is not considered here.
My Own Step-by-step List
Is the I/O heavy?
Check the sum of r/s and w/s: the larger the sum, the heavier the I/O. Also check %util: the higher, the heavier. If it is close to 100, the I/O is definitely significant. It should be noted that during writes, if the disk is the bottleneck (%util stays at 100% for a long time) but the applications keep writing, then once dirty pages exceed 30% of memory the system will block all write system calls, sync or async alike, and focus on flushing to the disk. Once this occurs, the entire system is slow as hell. Note that, in my opinion, the %util should be similar to …
How many processes are doing I/O concurrently?
Check b in the vmstat log. If the value is large, the concurrency is high.
Is the I/O sequential or random?
Check rrqm/s and r/s. If rrqm/s is large, there are many sequential reads (adjacent requests get merged); if r/s is large relative to rrqm/s, the reads are mostly random. The same goes for wrqm/s and w/s. Also check avgrq-sz: the larger, the more likely the I/O is sequential. It would be better still to get the distribution of I/O sizes.
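As a quick sanity check, avgrq-sz can be roughly reproduced from the other columns of the iostat sample shown earlier. The inputs below are the rounded values as printed, so the result only approximately matches the reported 191.90:

```python
# Rounded values from the first iostat sample above (device sdb).
r_s, w_s = 0.05, 0.25        # reads / writes per second
rkb_s, wkb_s = 0.47, 28.98   # kilobytes read / written per second

# avgrq-sz is the average request size in 512-byte sectors:
# total sectors per second divided by total requests per second.
sectors_per_s = (rkb_s + wkb_s) * 2   # 1 KB = 2 sectors
avgrq_sz = sectors_per_s / (r_s + w_s)

print(round(avgrq_sz, 1))  # ~196 sectors (~98 KB per request); iostat printed
                           # 191.90, the gap comes from rounding in the columns
```

Roughly 98 KB per request is large, which points toward sequential I/O on this device.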
Are the I/O requests bursting or balanced?
Check %util, svctm, and await. svctm is usually roughly a constant that depends on the device. If await is much larger than svctm, the queue is long and the I/O is heavy (recheck w/s and r/s to confirm). If, at the same time, %util is NOT large, the I/O is bursting. One rule of thumb says that if await is larger than 10ms, the latency is considered long. Note that if the data indicates bursting, it may be caused not by the application’s behavior but by the buffering mechanism of the OS.
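The first iostat sample from earlier is a concrete illustration of this pattern (keep in mind svctm is only an estimate and is deprecated in newer sysstat versions):

```python
# Rounded values from the first iostat sample for sdb.
await_ms = 178.20  # average total time per request: queueing + service
svctm_ms = 2.52    # average time the device itself spent per request

# The difference is the time a request spent waiting in the queue.
queue_wait_ms = await_ms - svctm_ms
print(round(queue_wait_ms, 2))  # 175.68 ms of queueing per request:
                                # await >> svctm, yet %util is only 0.08,
                                # so the I/O arrives in bursts
```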
What is the read/write ratio?
It is easy to get from r/s and w/s. Useful if the device performs differently for reads and writes.
How about latency and throughput?
Check svctm for latency, and rkB/s plus wkB/s for throughput. If the I/O is heavy but the throughput is low, it is likely that most of the I/O is random; recheck that. Bursts may also affect the latency.
Find the bottleneck
The bottleneck could be the device, the CPU, the I/O scheduler, the file system, the application, or something else. If %util approaches 100%, the disk is likely the bottleneck. If %util stays below 100% but await is far larger than svctm, bursting is likely, and usually the application is the one to blame. Similarly, if the I/O is mostly random, you should also check the application. I don’t think you can identify the OS itself as the bottleneck with iostat alone, since its data is collected below the I/O scheduler and file system layers: iostat works at the device level. To get more information, try strace as well.
Interpretation of Some Fields
There are a lot of articles and blog posts on this topic; see the references at the end. A few things need to be noted:
rrqm/s means how many read requests are merged per second. For example, if 100 read requests are merged down to 2 issued requests, then rrqm/s counts the 98 merged requests and r/s is 2.
%util can be calculated as (r/s + w/s) * svctm / 1000ms * 100.
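Plugging in the numbers from the first iostat sample shown earlier reproduces the reported %util:

```python
# Rounded values from the first iostat sample for sdb.
r_s, w_s = 0.05, 0.25  # requests per second
svctm = 2.52           # ms of device time per request

# Fraction of each 1000 ms the device was busy, expressed as a percent.
util = (r_s + w_s) * svctm / 1000 * 100
print(round(util, 2))  # 0.08, matching the %util column
```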
avgqu-sz: this one is a little bit tricky. Someone said there is a bug in calculating the queue size, like in here, such that the value is 10 times too large. Even with such an explanation, I myself cannot understand the meaning of the value. Should it be the average number of requests one must wait for? It seems not, since it is calculated as “total waiting time / 1000ms”, and what is that?
In How Linux iostat computes its results, the following is mentioned:
avgqu-sz is computed from the last field in the file – the one with “weighted” in its name – divided by the milliseconds elapsed. Hence the units cancel out and you just get the average number of operations in progress during the time period. The name (short for “average queue size”) is a little bit ambiguous. This value doesn’t show how many operations were queued but not yet being serviced – it shows how many were either in the queue waiting, or being serviced. The exact wording of the kernel documentation is “…as requests are given to appropriate struct request_queue and decremented as they finish.”
However, this explanation is still not easy to digest. At the least, it shows that avgqu-sz does NOT simply mean “average queue size”, which makes it genuinely ambiguous and hard to explain to others. So I just ignore it and suggest you do the same.
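That said, one way to make sense of the number is Little’s law: the average number of requests in the system equals the arrival rate times the average time each request spends there. This is my own reading, not an official definition, but the iostat sample from earlier is consistent with it:

```python
# Little's law: avg requests in the system =
#   arrival rate (requests/s) * avg time per request (s).
# Rounded values from the first iostat sample for sdb.
r_s, w_s = 0.05, 0.25
await_ms = 178.20

avgqu_sz = (r_s + w_s) * await_ms / 1000
print(round(avgqu_sz, 2))  # 0.05, matching the avgqu-sz column
```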
iostat is not a Panacea
… looking at how hard the disks are rattling, as we did above using iostat(1M), tells us very little about what the target application is actually experiencing. Application I/O can be inflated or deflated by the file system by the time it reaches disks, making difficult at best a direct correlation between disk and application I/O. Disk I/O also includes requests from other file system components, such as prefetch, the background flusher, on-disk layout metadata, and other users of the file system (other applications). Even if you find an issue at the disk level, it’s hard to tell how much that matters to the application in question.
More Details on Fields
A good reference explaining each field in iostat can be found here: Monitoring IO performance using iostat & pt-diskstats, from a MySQL conference.
```
[ben@lab ~]$ cat /proc/diskstats
   7    0 loop0 0 0 0 0 0 0 0 0 0 0 0
   7    1 loop1 0 0 0 0 0 0 0 0 0 0 0
   7    2 loop2 0 0 0 0 0 0 0 0 0 0 0
   7    3 loop3 0 0 0 0 0 0 0 0 0 0 0
   7    4 loop4 0 0 0 0 0 0 0 0 0 0 0
   7    5 loop5 0 0 0 0 0 0 0 0 0 0 0
   7    6 loop6 0 0 0 0 0 0 0 0 0 0 0
   7    7 loop7 0 0 0 0 0 0 0 0 0 0 0
   8    0 sda 44783 15470 2257302 1210711 85999 54224 1808924 6087675 0 1087763 7298349
   8    1 sda1 463 163 4176 6464 2 0 4 1 0 6215 6465
   8    2 sda2 267 31 2136 4146 0 0 0 0 0 4053 4146
   8    3 sda3 43885 15276 2249646 1197369 73520 54224 1808920 5575620 0 654552 6772954
  11    0 sr0 0 0 0 0 0 0 0 0 0 0 0
 253    0 dm0 42736 0 1796226 1391325 15414 0 187656 3199366 0 304001 4590697
 253    1 dm1 16476 0 449218 530482 113707 0 1572032 4549033 0 838217 5079524
 253    2 dm2 574 0 3410 15473 3747 0 49232 185560 0 61399 201034
```

After the major number, minor number, and device name, there are 11 fields:
- Field 1 – read_IOs: Total number of reads completed (requests)
- Field 2 – read_merges: Total number of reads merged (requests)
- Field 3 – read_sectors: Total number of sectors read (sectors)
- Field 4 – read_ticks: Total time spent reading (milliseconds)
- Field 5 – write_IOs: Total number of writes completed (requests)
- Field 6 – write_merges: Total number of writes merged (requests)
- Field 7 – write_sectors: Total number of sectors written (sectors)
- Field 8 – write_ticks: Total time spent writing (milliseconds)
- Field 9 – in_flight: The number of I/Os currently in flight. It does not include I/O requests that are in the queue but not yet issued to the device driver. (requests)
- Field 10 – io_ticks: This value counts the time during which the device has had I/O requests queued. (milliseconds)
- Field 11 – time_in_queue: The number of I/Os in progress (field 9) times the number of milliseconds spent doing I/O since the last update of this field. (milliseconds)
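To show how these fields line up, here is a minimal parser for one /proc/diskstats row. The function name is my own, and newer kernels append extra columns for discards and flushes, which this sketch ignores:

```python
# The 11 per-device counters, in the order listed above.
FIELD_NAMES = [
    "read_IOs", "read_merges", "read_sectors", "read_ticks",
    "write_IOs", "write_merges", "write_sectors", "write_ticks",
    "in_flight", "io_ticks", "time_in_queue",
]

def parse_diskstats_line(line):
    """Split one /proc/diskstats row into (device, {field: value})."""
    parts = line.split()
    # parts[0:2] are the major/minor numbers, parts[2] is the device name,
    # and the next 11 columns are the counters described above.
    device = parts[2]
    values = dict(zip(FIELD_NAMES, map(int, parts[3:14])))
    return device, values

# The sda row from the sample output above:
dev, stats = parse_diskstats_line(
    "8 0 sda 44783 15470 2257302 1210711 85999 54224 1808924 6087675 0 1087763 7298349"
)
print(dev, stats["read_IOs"], stats["io_ticks"])  # sda 44783 1087763
```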
How these fields are used for calculation:
- rrqm/s (requests) : delta[read_merges(f2)] / interval
- wrqm/s (requests) : delta[write_merges(f6)] / interval
- r/s (requests) : delta[read_IOs(f1)] / interval
- w/s (requests) : delta[write_IOs(f5)] / interval
- rkB/s (kB/mB) : (delta[read_sectors(f3)] / interval) / conversion factor
- wkB/s (kB/mB) : (delta[write_sectors(f7)] / interval) / conversion factor
- avgrqsz (sectors) : delta[read_sectors(f3) + write_sectors(f7)] / delta[read_IOs(f1) + write_IOs(f5)], or 0.0 if no IO
- avgqusz (requests) : (delta[time_in_queue(f11)] / interval) / 1000.0
- await (ms) : delta[read_ticks(f4) + write_ticks(f8)] / delta[read_IOs(f1) + write_IOs(f5)], or 0.0 if no IO
- r_await (ms) : delta[read_ticks(f4)] / delta[read_IOs(f1)], or 0.0 if no read IOs
- w_await (ms) : delta[write_ticks(f8)] / delta[write_IOs(f5)], or 0.0 if no write IOs
- svctm (ms) : delta[IO_ticks(f10)] / delta[read_IOs(f1) + write_IOs(f5)], or 0.0 if tput = 0
- %util (percent) : ((delta[IO_ticks(f10)] / interval) / 10) / devices
Note: HZ is 1000 on most systems. The svctm field will be removed in a future sysstat version.
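The formulas above can be sketched as follows. This is a simplified reimplementation, not sysstat’s actual code (in particular, svctm here is just io_ticks per completed request), and the two snapshots are made up for illustration:

```python
def iostat_metrics(prev, curr, interval_s):
    """Derive iostat-style metrics from two diskstats snapshots.

    prev/curr are dicts of the 11 diskstats counters; interval_s is the
    number of seconds between the snapshots.
    """
    d = {k: curr[k] - prev[k] for k in prev}  # per-field deltas
    ios = d["read_IOs"] + d["write_IOs"]
    return {
        "r/s": d["read_IOs"] / interval_s,
        "w/s": d["write_IOs"] / interval_s,
        "rkB/s": d["read_sectors"] / interval_s / 2,  # 2 sectors per KB
        "wkB/s": d["write_sectors"] / interval_s / 2,
        "avgrq-sz": (d["read_sectors"] + d["write_sectors"]) / ios if ios else 0.0,
        "avgqu-sz": d["time_in_queue"] / (interval_s * 1000),
        "await": (d["read_ticks"] + d["write_ticks"]) / ios if ios else 0.0,
        "svctm": d["io_ticks"] / ios if ios else 0.0,
        "%util": d["io_ticks"] / (interval_s * 1000) * 100,
    }

# Hypothetical snapshots one second apart: 100 reads of 8 sectors each,
# 200 ms of cumulative per-request time, 50 ms of device busy time.
prev = dict(read_IOs=0, read_merges=0, read_sectors=0, read_ticks=0,
            write_IOs=0, write_merges=0, write_sectors=0, write_ticks=0,
            in_flight=0, io_ticks=0, time_in_queue=0)
curr = dict(prev, read_IOs=100, read_sectors=800, read_ticks=200,
            io_ticks=50, time_in_queue=200)

m = iostat_metrics(prev, curr, interval_s=1)
print(m["r/s"], m["avgrq-sz"], m["await"], m["%util"])  # 100.0 8.0 2.0 5.0
```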
Other Reference on iostat
- Monitoring IO performance using iostat & pt-diskstats (PDF Slides)
- iostat plot, an example
- Getting The Hang Of IOPS v1.3
- Basic I/O Monitoring on Linux
- How Linux iostat computes its results
- Interpreting iostat Output
- iostat and disk utilization monitoring nirvana
- Calculate IOPS in a storage array
- When iostat Leads You Astray
- Analyzing disk I/O with blktrace and debugfs (in Chinese)
- A Linux write cache mystery