As a young teacher preparing a course for the first time, I learned from well-known professors at home and abroad: I went to their homepages for course materials, to see which topics they covered, how each topic was organized, how the topics related to each other, what questions they asked in class, what labs and homework they assigned, how they designed exams and what they tested, and so on.
In the next few years of teaching practice, I gradually ran into some new problems of my own.
How could these problems be solved? Unfortunately, this time Google could not find the answer. The only way was to keep exploring: try actively, collect feedback, revise, and try again... iterating over and over, improving bit by bit.
Imagine how wonderful it would be to discuss these problems face to face with the very best professors at home and abroad!
So when I learned that Robbert van Renesse of Cornell and Geoffrey M. Voelker of UCSD, two renowned professors in the computer systems community, had been invited by MSRA to China to hold a three-day workshop dedicated to computer systems teaching, I signed up without hesitation. This was my second time attending the workshop, and even so, what I gained still far exceeded my expectations.
The workshop works like this: every participant teaches one class session on an assigned topic, and then everyone comments on both the content and the delivery, improving it together. This is without doubt the most direct form of coaching, like watching a student write code and guiding them right in the middle of practice.
The teachers taught earnestly, and the “students” in the audience were even more earnest, constantly asking questions: “This concept has not been introduced before; if you use it to explain a new concept, how will students follow?” “This material could be organized better; here is how I usually teach it...” “This part is not accurate; Step 6 should come after Step 8...” The whole workshop proceeded in this atmosphere. Everyone was a listener, and everyone was also a speaker. Since the topics were familiar to all of us, hearing someone else present them in a different way often sparked new ideas, and the discussion of a single question could become very intense. Such discussion inspires far more thinking than any one person could manage alone.
The three days of discussion were intensive: 18 topics and 3 discussion sessions in total, all centered on operating system concepts and the practical problems encountered in teaching them. The lunches were great too; thanks again, MSRA!
After the workshop ended, I seized the chance to consult Prof. Chen Kang, who is “probably the person who explains PBFT most clearly in Chinese universities.” To teach PBFT clearly, he studied how several universities abroad present it, digested it all, wrote a 30-page illustrated lecture script, and discussed it with students repeatedly to make sure the presentation was clear and understandable. I could not help feeling how far my own teaching still has to go, and I look forward to the early completion of the book he has long been preparing.
Finally, I received the certificate for this workshop. Since the workshop was very successful, it will not only be held again next year but will also come in an improved form: topics will be discussed in more depth, reference teaching approaches for specific topics will be produced, and courses and labs will be arranged and coordinated more sensibly. It will be scheduled around the exam week at the end of the semester. I am really looking forward to it.
(This workshop was jointly organized by the Computer Science Teaching Steering Committee of the Ministry of Education and Microsoft Research Asia, with Turing Award winner Prof. John Hopcroft as general advisor, and Prof. Robbert van Renesse of Cornell and Prof. Geoffrey M. Voelker of UCSD as co-chairs.)
First, let’s see The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2004 Edition. This is POSIX.
DESCRIPTION
The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.
If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion.
RATIONALE
The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of “buffer cache”, “system crash”, “physical write”, and “non-volatile storage” are not defined here, the wording has to be more abstract.
If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. This could be valid in the case where the system cannot assure non-volatile storage under any circumstances or when the system is highly fault-tolerant and the functionality is not required. In the middle ground between these extremes, fsync() might or might not actually cause data to be written where it is safe from a power failure. The conformance document should identify at least that one configuration exists (and how to obtain that configuration) where this can be assured for at least some files that the user can select to use for critical data. It is not intended that an exhaustive list is required, but rather sufficient information is provided so that if critical data needs to be saved, the user can determine how the system is to be configured to allow the data to be written to non-volatile storage.
It is reasonable to assert that the key aspects of fsync() are unreasonable to test in a test suite. That does not make the function any less valuable, just more difficult to test. A formal conformance test should probably force a system crash (power shutdown) during the test for this condition, but it needs to be done in such a way that automated testing does not require this to be done except when a formal record of the results is being made. It would also not be unreasonable to omit testing for fsync(), allowing it to be treated as a quality-of-implementation issue.
Second, let’s see the man page of fsync from Ubuntu-8.04 (Hardy):
DESCRIPTION
fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) where that file resides. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
NOTES
Applications that access databases or log files often write a tiny data fragment (e.g., one line in a log file) and then call fsync() immediately in order to ensure that the written data is physically stored on the harddisk. Unfortunately, fsync() will always initiate two write operations: one for the newly written data and another one in order to update the modification time stored in the inode. If the modification time is not a part of the transaction concept fdatasync() can be used to avoid unnecessary inode disk write operations.
If the underlying hard disk has write caching enabled, then the data may not really be on permanent storage when fsync() / fdatasync() return.
When an ext2 file system is mounted with the sync option, directory entries are also implicitly synced by fsync().
On kernels before 2.4, fsync() on big files can be inefficient. An alternative might be to use the O_SYNC flag to open(2).
In Linux 2.2 and earlier, fdatasync() is equivalent to fsync(), and so has no performance advantage.
Third, let’s see the man page of fsync from Ubuntu 14.04:
DESCRIPTION
fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
NOTES
On some UNIX systems (but not Linux), fd must be a writable file descriptor.
In Linux 2.2 and earlier, fdatasync() is equivalent to fsync(), and so has no performance advantage.
The fsync() implementations in older kernels and lesser used filesystems does not know how to flush disk caches. In these cases disk caches need to be disabled using hdparm(8) or sdparm(8) to guarantee safe operation.
Sync is a file system operation that flushes data to disk for persistence (durability). There are many variants of sync(), including fsync, fdatasync, aio_fsync, sync_file_range, etc. These calls are similar but differ in subtle ways.
fsync() is easy to understand. Given a fd, it flushes all the dirty data of the file to disk, as well as the metadata in the inode. However, it does not flush the directory entry of the file. Thus, suppose you create a file (in some directory, of course), write some data, fsync() and close it, and then the machine crashes. After reboot, you may not find the file in the directory, because the directory’s data was never flushed. It is therefore recommended to flush a directory after creating or deleting a file in it. But a directory can only be opened in read-only mode, so can it be flushed... or not? The answer is: a directory can be flushed even though it is opened read-only. More details can be found in this article: Everything You Always Wanted to Know About Fsync().
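To make this concrete, here is a minimal C sketch (my own illustration, not from any of the sources above; the paths are made up) of the create-write-fsync pattern, including the extra fsync() on the parent directory:

/* create a file, make its data durable, then make its directory entry durable */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/dir/data.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *msg = "hello, durable world\n";
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }
    if (fsync(fd) < 0) { perror("fsync file"); return 1; }   /* data + inode */
    close(fd);

    /* a directory can be opened read-only and still be fsync()ed */
    int dirfd = open("/tmp/dir", O_RDONLY);
    if (dirfd < 0) { perror("open dir"); return 1; }
    if (fsync(dirfd) < 0) { perror("fsync dir"); return 1; } /* directory entry */
    close(dirfd);
    return 0;
}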
fdatasync() is similar to fsync(). The only difference is that it does not flush metadata to disk unless that metadata is necessary. Which metadata is necessary? The file size, for example. Which is not? mtime, atime, etc. Typically, using fdatasync improves performance, since it avoids at least one disk write to the inode, which usually incurs a long seek. IMHO, fdatasync can replace fsync in most cases.
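For the log-append case mentioned in the man page notes above, a small sketch (file name invented) where fdatasync() is enough, because only the data and the file size need to survive a crash:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* append one record and make it durable without forcing an mtime write */
static int append_log(int fd, const char *line)
{
    if (write(fd, line, strlen(line)) < 0)
        return -1;
    return fdatasync(fd);   /* fsync(fd) would also flush mtime/atime */
}

int main(void)
{
    int fd = open("app.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (append_log(fd, "txn 42 committed\n") < 0) { perror("append"); return 1; }
    close(fd);
    return 0;
}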
sync_file_range() can be used to flush a range of a file instead of the whole file. It was introduced in Linux 2.6.17, but it is actually a non-standard API. It takes three flags: SYNC_FILE_RANGE_WAIT_BEFORE, SYNC_FILE_RANGE_WRITE, and SYNC_FILE_RANGE_WAIT_AFTER. None of these operations writes out the file’s metadata (recall that fdatasync only selectively skips metadata). Note that the man page says: “This system call is extremely dangerous and should not be used in portable programs.” There is also little information about it on Google; I guess it is not widely used yet.
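For illustration, a minimal sketch (file name and offsets arbitrary) of a synchronous write-out of one file range using all three flags:

#define _GNU_SOURCE     /* sync_file_range() is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* wait for in-flight writeback of the range, start write-out of its
       dirty pages, then wait for that write-out to finish; this flushes
       neither metadata nor the disk's own write cache */
    if (sync_file_range(fd, 0, 4096,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) < 0)
        perror("sync_file_range");

    close(fd);
    return 0;
}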
aio_fsync() is part of the Linux AIO mechanism. It just puts a sync request into the queue and returns; it does not wait for I/O completion. Its arguments include an aiocb struct, whose aio_sigevent field indicates the desired type of asynchronous notification at completion. I guess it can be used to decouple ordering from durability: it only ensures ordering and does not need to flush the disk right away, so the flush operations can be batched for performance.
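A minimal polling sketch of aio_fsync() (my own example; SIGEV_NONE and the file name are arbitrary choices, and real code would usually ask for a signal or thread notification instead; compile with -lrt):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }
    write(fd, "x", 1);

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* we will just poll */

    /* queue the flush request and return immediately */
    if (aio_fsync(O_SYNC, &cb) < 0) { perror("aio_fsync"); return 1; }

    /* ... do other useful work while the flush proceeds ... */

    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);                           /* poll for completion */
    printf("flush finished, status %d\n", (int)aio_return(&cb));
    close(fd);
    return 0;
}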
The disk write buffer can violate the user’s assumptions about durability (e.g., an ATM transaction lost on power failure) and write order (e.g., a journal COMMIT reaching the disk before the transaction data has actually been written).
The disk write buffer is a small amount of memory embedded in a disk drive. Once the disk receives a write operation, it can signal “write done” to the invoker as soon as the data sits in the buffer rather than on the platter. Thus, there is a time window between the “claimed done” and the “actual done”. This lets the invoker move forward instead of waiting for a long time, since writing to the platter is much slower. Another benefit is that the disk can merge and reorder writes within the buffer to reduce unnecessary disk head movement and improve performance. Experiments show that in some scenarios the write buffer can improve disk write performance by an order of magnitude.
From the invoker’s perspective, enabling the write buffer brings two “features”, if not bugs.
Here comes the problem: these two features violate two assumptions of the invoker.
First, durability is not ensured. If power fails within the window, the data is lost even though the user thought it was safe. Consider this case: a user withdraws money from an ATM. The ATM writes the transaction to disk and then prints a receipt for the user. At that moment, the ATM crashes. When it reboots, it finds that the transaction never made it to disk. From the user’s perspective, however, the transaction is done.
Second, the order is not ensured. If an invoker writes “A”, “B”, and “C” in order, then after a power failure the disk may hold only “C” and lose “A” and “B”. This can be very harmful in the following case: in a journaling file system, a COMMIT record is written to disk if and only if the write transaction has been completed, which means the COMMIT must reach the platter after the data and metadata have. If the order is not guaranteed, the file system recovery procedure cannot work properly.
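As an illustration of this ordering requirement (a sketch only, not real file-system code; the record format is invented), fsync() can serve as the barrier between a journaled transaction body and its COMMIT record:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int jfd = open("journal.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (jfd < 0) { perror("open"); return 1; }

    /* 1. journal the transaction body ("A", "B", "C") */
    if (write(jfd, "A B C", 5) < 0) { perror("write body"); return 1; }

    /* 2. barrier: without this fsync (and with the disk cache enabled),
       the drive may let COMMIT reach the platter before the body,
       which would break crash recovery */
    if (fsync(jfd) < 0) { perror("fsync body"); return 1; }

    /* 3. only now write and flush the COMMIT record */
    if (write(jfd, " COMMIT", 7) < 0) { perror("write commit"); return 1; }
    if (fsync(jfd) < 0) { perror("fsync commit"); return 1; }

    close(jfd);
    return 0;
}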
That’s why someone might suggest disabling the disk write buffer (e.g., with hdparm -W 0 /dev/sda).
As stated by Andy Hunt in his book Pragmatic Thinking and Learning, the most significant difference between a newbie programmer and an expert is the ability to sense the context of a problem while solving it. A newbie needs only a list of context-independent, step-by-step operations, while an expert has to know as many details as possible. Thus, as an expert in system performance analysis and tuning, you need to use tools such as vmstat and iostat to get those details. And in order to become such an expert, you need a step-by-step guide.
Here is the guide.
A common question is: why use two tools instead of one? vmstat and iostat are both powerful, and they overlap in places, e.g., CPU usage statistics. I chose these two because that is what other people use. More specifically, see the following vmstat sample:
# vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 22880024 1122008 6301332 0 0 0 5 1 0 0 0 100 0
1 0 0 22879884 1122008 6301332 0 0 0 0 7 189 0 0 100 0
0 0 0 22879884 1122008 6301332 0 0 0 4 5 154 0 0 100 0
and iostat sample:
# iostat sdb -xdk 1 3
Linux 2.6.26-2-amd64 (R900) 01/29/2014 _x86_64_
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.01 6.99 0.05 0.25 0.47 28.98 191.90 0.05 178.20 2.52 0.08
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
We use the “-d” option in iostat to skip the CPU statistics, which are already shown by vmstat. As we can see, the output of vmstat covers OS status (r for running processes, b for blocked processes, in for interrupts, cs for context switches), as well as memory and CPU status.
Meanwhile, the output of iostat focuses on the disk device. Note that the data of iostat comes from the block layer, which sits below the page cache layer.
In summary, we use vmstat to get OS, memory, and CPU status, and iostat to get disk status. Network is not considered here.
Check the sum of w/s and r/s: the larger, the heavier the I/O.
Also check %util: the higher, the heavier. If it is close to 100, the I/O is definitely significant.
It should be noted that during writing, if the disk is the bottleneck (%util stays at 100% for a long time) but the applications keep writing, then once the dirty pages exceed about 30% of memory (the threshold is tunable via /proc/sys/vm/dirty_ratio), the system blocks all write system calls, sync and async alike, and focuses on flushing to the disk. Once this occurs, the entire system is slow as hell.
Note that, in my opinion, %util should be similar to wa in vmstat.
Check b in the vmstat log. If the value is large, the I/O concurrency is at a high level.
Check rrqm/s and r/s. If rrqm/s is large, many read requests are being merged, which indicates sequential reads; if r/s is large with few merges, the reads are mostly random. The same goes for wrqm/s and w/s on the write side.
Also check avgrq-sz: the larger, the more likely the I/O is sequential.
It would be better to get the distribution of different sizes of I/O.
Check await, svctm, and %util. svctm is usually a constant that depends on the device. If await is much larger than svctm, the queue is long and thus the I/O is heavy (recheck w/s and r/s to confirm). If, at the same time, %util is NOT large, the I/O is bursty.
One rule of thumb is that if await is larger than 10ms, the latency is considered long.
Note that if the data indicates that bursts do exist, they may be caused not by the application’s behavior but by the buffering mechanism of the OS.
This is easy to read off from w/s and r/s. It is useful if the device has different performance for reads and writes.
Check svctm for latency, and rkB/s and wkB/s for throughput.
If the I/O is heavy but the throughput is low, most of the I/O is likely random. Recheck that.
Bursts may also affect the latency.
The bottleneck could be the device, the CPU, the I/O scheduler, the file system, the application, or something else.
If %util is approaching 100%, the disk is likely to be the bottleneck.
If %util is below 100% but await is far larger than svctm, bursting is likely, and usually the application is the one to blame.
Similarly, if the I/O is mostly random, you should also check the application.
I don’t think you can identify the OS itself as the bottleneck from these numbers, since iostat gathers its data below the I/O scheduler and file system layers, at the device level.
To get more information, you can also try strace alongside (e.g., strace -c -p <pid> to count a process’s system calls).
There are a lot of articles and blog posts on this. Just check the following ones:
A few things need to be noted:
- rrqm/s means how many read requests are merged per second. For example, if 100 read requests are merged into 2 actual requests issued to the device, then rrqm/s counts the 98 merges and r/s is 2.
- %util can be calculated as (r/s + w/s) * svctm / 1000ms * 100%. For the first line of the iostat sample above: (0.05 + 0.25) * 2.52 / 1000 * 100 ≈ 0.08, which matches the reported %util.
- avgqu-sz: this one is a little bit tricky. Someone said there is a bug in calculating the queue size, as in here, making the value 10 times larger. Even given such an explanation, I myself cannot understand the meaning of the value. Should it be the average number of requests that one must wait behind? It seems not, since it is calculated as “total waiting time / 1000ms”; what is that?

In How Linux iostat computes its results, it is mentioned as follows:
avgqu-sz is computed from the last field in the file – the one that has “memory” – divided by the milliseconds elapsed. Hence the units cancel out and you just get the average number of operations in progress during the time period. The name (short for “average queue size”) is a little bit ambiguous. This value doesn’t show how many operations were queued but not yet being serviced – it shows how many were either in the queue waiting, or being serviced. The exact wording of the kernel documentation is “…as requests are given to appropriate struct request_queue and decremented as they finish.”
However, this explanation is also not easy to understand. At the least, it shows that avgqu-sz does NOT simply mean “average queue size”, which makes it really ambiguous and hard to explain to others. So I just ignore it and suggest you do the same.
From When iostat Leads You Astray:
… looking at how hard the disks are rattling, as we did above using iostat(1M), tells us very little about what the target application is actually experiencing. Application I/O can be inflated or deflated by the file system by the time it reaches disks, making difficult at best a direct correlation between disk and application I/O. Disk I/O also includes requests from other file system components, such as prefetch, the background flusher, on-disk layout metadata, and other users of the file system (other applications). Even if you find an issue at the disk level, it’s hard to tell how much that matters to the application in question.
A good reference explaining each field of iostat can be found here: Monitoring IO performance using iostat & pt-diskstats, from a MySQL conference.
[ben@lab ~]$ cat /proc/diskstats
7 0 loop0 0 0 0 0 0 0 0 0 0 0 0
7 1 loop1 0 0 0 0 0 0 0 0 0 0 0
7 2 loop2 0 0 0 0 0 0 0 0 0 0 0
7 3 loop3 0 0 0 0 0 0 0 0 0 0 0
7 4 loop4 0 0 0 0 0 0 0 0 0 0 0
7 5 loop5 0 0 0 0 0 0 0 0 0 0 0
7 6 loop6 0 0 0 0 0 0 0 0 0 0 0
7 7 loop7 0 0 0 0 0 0 0 0 0 0 0
8 0 sda 44783 15470 2257302 1210711 85999 54224 1808924 6087675 0 1087763 7298349
8 1 sda1 463 163 4176 6464 2 0 4 1 0 6215 6465
8 2 sda2 267 31 2136 4146 0 0 0 0 0 4053 4146
8 3 sda3 43885 15276 2249646 1197369 73520 54224 1808920 5575620 0 654552 6772954
11 0 sr0 0 0 0 0 0 0 0 0 0 0 0
253 0 dm0 42736 0 1796226 1391325 15414 0 187656 3199366 0 304001 4590697
253 1 dm1 16476 0 449218 530482 113707 0 1572032 4549033 0 838217 5079524
253 2 dm2 574 0 3410 15473 3747 0 49232 185560 0 61399 201034

There are 11 fields after the major number, minor number, and device name (see the kernel’s Documentation/iostats.txt):
1. reads completed
2. reads merged
3. sectors read
4. time spent reading (ms)
5. writes completed
6. writes merged
7. sectors written
8. time spent writing (ms)
9. I/Os currently in progress
10. time spent doing I/O (ms)
11. weighted time spent doing I/O (ms)
How these fields are used for calculation is illustrated by the sketch below. Note: HZ is 1000 on most systems, and the svctm field will be removed in a future sysstat version.
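As a rough illustration (not sysstat’s actual source), here is a C sketch that samples /proc/diskstats twice for one device and derives several iostat columns from the field deltas; the device name "sda" and the 1-second interval are arbitrary, and the 14-token line format of older (2.6-era) kernels is assumed:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct stats {
    unsigned long long rd_ios, rd_merges, rd_sec, rd_ticks;
    unsigned long long wr_ios, wr_merges, wr_sec, wr_ticks;
    unsigned long long in_flight, io_ticks, time_in_queue;
};

/* scan /proc/diskstats for one device (14 tokens per line on old kernels) */
static int read_stats(const char *dev, struct stats *s)
{
    unsigned major, minor;
    char name[32];
    FILE *f = fopen("/proc/diskstats", "r");
    if (!f) return -1;
    while (fscanf(f, "%u %u %31s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                  &major, &minor, name,
                  &s->rd_ios, &s->rd_merges, &s->rd_sec, &s->rd_ticks,
                  &s->wr_ios, &s->wr_merges, &s->wr_sec, &s->wr_ticks,
                  &s->in_flight, &s->io_ticks, &s->time_in_queue) == 14) {
        if (strcmp(name, dev) == 0) { fclose(f); return 0; }
    }
    fclose(f);
    return -1;
}

int main(void)
{
    struct stats a, b;
    double dt = 1.0;                                   /* interval in seconds */
    if (read_stats("sda", &a) < 0) return 1;
    sleep(1);
    if (read_stats("sda", &b) < 0) return 1;

    double rps  = (b.rd_ios - a.rd_ios) / dt;          /* r/s */
    double wps  = (b.wr_ios - a.wr_ios) / dt;          /* w/s */
    double rkbs = (b.rd_sec - a.rd_sec) / dt / 2.0;    /* rkB/s: 512 B sectors */
    double wkbs = (b.wr_sec - a.wr_sec) / dt / 2.0;    /* wkB/s */
    double util = (b.io_ticks - a.io_ticks) / (dt * 1000.0) * 100.0;   /* %util */
    double avgq = (b.time_in_queue - a.time_in_queue) / (dt * 1000.0); /* avgqu-sz */

    printf("r/s %.2f  w/s %.2f  rkB/s %.2f  wkB/s %.2f  %%util %.2f  avgqu-sz %.2f\n",
           rps, wps, rkbs, wkbs, util, avgq);
    return 0;
}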
TOCTTOR, short for Time of Coding To Time of Result, is too long in my workflow, which is really annoying and disruptive. I tried to shrink it, and I have now successfully brought it down to less than 30s on my 2010 MBP (with SSD). Here is how.
Yes, it is fast enough. VMware has done a great job on performance improvement over recent years. Using it significantly reduces the reboot time, from more than 60s to about 10s.
A piece of good news is that VMware now supports EPT emulation, so I can run Xen on VMware and run HVM guests on Xen. It also supports a virtual serial port, which is essential for Xen debugging.
I edit and compile the source code on another machine with powerful CPUs, and rsync the binaries to the VMware guest for testing. The compile time is short since I only modify a small part of Xen. Compiling and rsyncing take about another 10s.
Another benefit of using rsync is that you can keep reading and modifying the code while the test environment in VMware reboots. It is another kind of parallelization.
Since my test is issued through HVM, I have to create two HVM guests every time. The xl save/restore commands (e.g., xl save vm1 vm1.chkpt, then xl restore vm1.chkpt) save a lot of time: the HVM creation time shrinks from 30s to 5s.
The development environment is Debian 7. The VMware version is Workstation 9 on PC and Fusion 5 on Mac.
# apt-get build-dep linux
# apt-get install linux-source-3.2 # the code and patches are now in /usr/src
# make localmodconfig
# make menuconfig # add TAP/TUN and Xen device drivers
# make -j8
# make modules_install
# make install
# mkinitramfs 3.2.46-rt67 -o /boot/initrd.img-3.2.46-rt67
A common problem is that the X window system fails to run. A solution is to add GRUB_CMDLINE_LINUX="nopat" in /etc/default/grub, as shown later.
# apt-get build-dep xen
# apt-get install bridge-utils
# make xen -j8
# make tools -j8
# make install-xen
# make install-tools PYTHON_PREFIX_ARG=
# update-grub
# vi /etc/fstab # echo "xenfs /proc/xen xenfs defaults 0 0"
# vi ~/.bashrc # add "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64"
First, modify the GRUB_CMDLINE_XEN part.
# /etc/default/grub of domain-0
GRUB_DEFAULT=8
GRUB_TIMEOUT=2
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_XEN="loglvl=all guest_loglvl=all com1=115200,8n1 console=com1"
GRUB_CMDLINE_LINUX="nopat"
Second, add a serial device in VMware setting. It’s easy.
kernel = "hvmloader"
builder = 'hvm'
memory = 256
name = "vm1"
cpus = "1"
vif = [ 'bridge=xenbr0' ]
disk = [ 'file:/root/VMs/vm1.ubuntu-8.04.img,hda,w' ]
boot = "c"
sdl = 0
vnc = 1
vncpasswd = ''
stdvga = 0
serial = 'pty'
tsc_mode = 0
Add console=tty1 console=ttyS0 to the kernel line, and add the serial and terminal lines in grub.
# /boot/grub/menu.lst in HVM
default 0
serial --unit=0 --speed=115200
terminal --timeout=5 serial console
timeout 1
title Ubuntu 8.04.4 LTS, kernel 2.6.24-26-generic
root (hd0,0)
kernel /boot/vmlinuz-2.6.24-26-generic root=UUID=e60586d4-53b2-4392-9390-2c4a131c073d ro console=tty1 console=ttyS0
initrd /boot/initrd.img-2.6.24-26-generic
Note, the serial = 'pty' part in the HVM config file is also essential for the serial console.