read vs. mmap (or io vs. page faults)

Matthew Dillon dillon at apollo.backplane.com
Tue Jun 22 12:09:23 PDT 2004


 (current@ removed from the Cc, but I'm leaving this on questions@ since
 it contains some useful information).
:This is, sort of, self-perpetuating -- as long as mmap is slower/less
:reliable, applications will be hesitant to use it, thus there will be
:little incentive to improve it. :-(

 Well, again, this is an incorrect perception. Your use of mmap() to
 process huge linear data sets is not what mmap() is best at doing, on
 *any* operating system, and not what people use mmap() for most of the
 time. There are major hardware-related overheads to the use of mmap(),
 on *ANY* operating system, that cannot be circumvented. You have no
 choice but to allocate pages for the page table, to populate them
 with pte's, you must invalidate pages in the tlb whenever you modify
 a page table entry (e.g. the invlpg instruction on IA32, which on a P2
 is extremely expensive), and if you are processing huge data sets you
 also have to remove the page table entry from the page table when the
 underlying data page is reused due to the data set being larger than
 main memory. There are overheads related to each of these issues,
 overheads related to the algorithms the operating system *MUST* use to
 figure out which pages to remove (on the fly) when the data set does
 not fit in main memory, and overheads related to the heuristics
 the operating system employs to try to predict the memory usage pattern
 to perform some read-ahead.

 These are hardware and software issues that cannot simply be wished away.
 No matter how much you want the concept of memory mapping to be 'free',
 it isn't. Memory mapping and management are complex operations for
 any operating system, always have been, and always will be.
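 To make this concrete, here is a minimal C sketch of the two access
 patterns under discussion -- the file name, the trivial checksum
 standing in for real processing, and the 32K buffer size are all
 placeholder choices, and error handling is trimmed:

	/*
	 * Scan a file two ways: read() into a small reusable buffer
	 * (the copy stays hot in the L1/L2 caches), or mmap() the whole
	 * file (no copy, but every page touched can mean a page fault,
	 * a pte to populate, and a tlb operation, as described above).
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	static unsigned long sum_read(const char *path)
	{
		char buf[32768];	/* small buffer, stays cached */
		unsigned long sum = 0;
		ssize_t n, i;
		int fd = open(path, O_RDONLY);

		if (fd < 0) { perror("open"); exit(1); }
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			for (i = 0; i < n; ++i)
				sum += (unsigned char)buf[i];
		close(fd);
		return sum;
	}

	static unsigned long sum_mmap(const char *path)
	{
		struct stat st;
		unsigned long sum = 0;
		unsigned char *p;
		off_t i;
		int fd = open(path, O_RDONLY);

		if (fd < 0 || fstat(fd, &st) < 0) { perror("open"); exit(1); }
		p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) { perror("mmap"); exit(1); }
		for (i = 0; i < st.st_size; ++i)	/* faults as it goes */
			sum += p[i];
		munmap(p, st.st_size);
		close(fd);
		return sum;
	}

	int main(int argc, char **argv)
	{
		if (argc != 3) {
			fprintf(stderr, "usage: %s read|mmap file\n", argv[0]);
			return 1;
		}
		printf("sum: %lu\n", argv[1][0] == 'r' ?
		    sum_read(argv[2]) : sum_mmap(argv[2]));
		return 0;
	}

 Time both variants on a data set larger than main memory and the page
 fault and pte overheads described above show up directly in the system
 and wall-clock times.
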
:I'd rather call attention to my slower, CPU-bound boxes. On them, the
:total CPU time spent computing the md5 of a file is less with mmap -- by
:a noticeable margin. But because the CPU is underutilized, the elapsed
:"wall clock" time is higher.
:
:As far as the cache-usage statistics, having to do a cache-to-cache copy
:doubles the cache used, stealing it from other processes/kernel tasks.

 But it is also not relevant for this case because the L2 cache is
 typically much larger (128K-2MB) than the 8-32K you might use for
 your local buffer. What you are complaining about here is going
 to wind up being mere microseconds over a multi-minute run.

 It's really important, and I can't stress this enough, to not simply
 assume what the performance impact of a particular operation will be
 by the way it feels to you. Your assumptions are all skewed... you
 are assuming that copying is always bad (it isn't), that copying is
 always horrendously expensive (it isn't), that memory mapping is always
 cheap (it isn't cheap), and that a small bit of cache pollution will have
 a huge penalty in time (it doesn't necessarily, certainly not for a
 reasonably sized user buffer).
 I've already told you how to measure these things. Do me a favor and just
 run this dd on all of your FreeBSD boxes:

	dd if=/dev/zero of=/dev/null bs=32k count=8192

 The resulting bytes/sec that it reports is a good guesstimate of the
 cost of a memory copy (the actual copy rate will be faster since the
 times include the read and write system calls, but it's still a reasonable
 basis). So in the case of my absolute fastest machine
 (an AMD64 3200+ tweaked up a bit):

	268435456 bytes transferred in 0.058354 secs (4600128729 bytes/sec)

 That means, basically, that it costs 1 second of cpu to copy 4.6 GBytes
 of data. On my slowest box, a C3 VIA Samuel 2 cpu (roughly equivalent
 to a P2/400MHz):

	268435456 bytes transferred in 0.394222 secs (680924559 bytes/sec)

 So the cost is 1 second to copy 680 MBytes of data on my slowest box.
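 If you want the copy rate by itself, without the read() and write()
 system call overhead mixed in, a small C probe along these lines will
 isolate it -- a sketch only, with the 32K buffer mirroring the dd test
 above and an arbitrary pass count:

	/*
	 * Rough memcpy() bandwidth probe.  Not a calibrated benchmark;
	 * the buffer size and pass count are arbitrary choices.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <sys/time.h>

	#define BUFSIZE	32768
	#define PASSES	8192

	int main(void)
	{
		static char src[BUFSIZE], dst[BUFSIZE];
		struct timeval t0, t1;
		double secs;
		int i;

		memset(src, 1, sizeof(src));	/* fault the pages in first */
		memset(dst, 0, sizeof(dst));
		gettimeofday(&t0, NULL);
		for (i = 0; i < PASSES; ++i) {
			memcpy(dst, src, BUFSIZE);
			src[0] = dst[0] + 1;	/* keep the copy from being optimized away */
		}
		gettimeofday(&t1, NULL);
		secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
		printf("%.0f bytes/sec\n", (double)BUFSIZE * PASSES / secs);
		return 0;
	}
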
:Here, again, is from my first comparison on the P2 400MHz:
:
:	stdio: 56.837u 34.115s 2:06.61 71.8% 66+193k 11253+0io 3pf+0w
:	mmap: 72.463u 7.534s 2:34.62 51.7% 5+186k 105+0io 22328pf+0w

 Well, the cpu utilization is only 71.8% for the read case, so the box
 is obviously I/O bound already.
 The real question you should be asking is not why mmap is only using
 51.7% of the cpu, but why stdio is only using 71.8% of the cpu. If
 you want to make your processing program more efficient, 'fix' stdio
 first. You need to:
 (1) Figure out the rate at which your processing program reads data in
	the best case. You can do this by timing it on a data set that fits
	in memory (so no disk I/O is done). Note that it might be bursty,
	so the average rate alone does not precisely quantify the amount of
	buffering that will be needed.
 (2) If your hard drive is faster than the data rate, then determine if
	the overhead of doing double-buffering is worth keeping the
	processing program populated with data on demand. The overhead
	of doing double buffering is something akin to:
	dd if=file bs=1mb | dd bs=32k > /dev/null
 (3) Figure out how much buffering is required to keep the processing
	program supplied with data (achieving either 100% cpu utilization or
	100% I/O utilization).
	#!/bin/csh
	#
	dd if=file bs=1mb | dd bs=32k | your_processing_program
		 ^^^^^^ ^^^^^ try different buffer sizes to
					achieve 100% cpu utilization or
					100% I/O utilization on the drive.
	time ./scriptfile
 (4) If this overhead is small enough (less than the 37% of available cpu
	you have in the stdio case), then you can use it to front-end your
	processing script and achieve an improvement, despite the extra
	copying that it does.
	(Again, in my last email I gave you the 'dd' lines that you can use
	to determine exactly what the copying overhead for a dataset would be,
	and gave you live examples showing that, usually, it's quite small
	compared to the total run time of a typical processing program).
 Don't just assume that copying is bad, or that extra stages are bad, 
 because the reality is that they might not be in an I/O bound situation.
 You have to measure the actual overhead to see what the actual cost is.
 My backup script uses dd to double buffer for precisely this reason,
 though in my case I do it because 'dump' output is quite bursty and
 sometimes it blocks waiting for gzip when, really, it shouldn't have to.
 Here is a section out of my backup script:
	ssh $host -l operator $sshopts "dump ${level}auCbf 32 64 - $i" | \
		dd obs=1m | dd obs=1m | gzip -6 > $file.tmp
 I would never, ever expect the operating system to buffer that much 
 data ahead of a program, nor should the OS do that, so I do it myself.
 The cost is a pittance. I waste 1% of the cpu in order to gain about
 18% in real time by allowing dump to more fully utilize the disk it is
 dumping.
:Or is a P2 400MHz not modern? Maybe, but the very modern Sparcs, on which
:FreeBSD intends to run, are not much faster.

 A 400 MHz P2 is 1/3 as fast as the LOWEST END AMD XP cpu you can buy
 today, and 5-10 times slower than higher-end Intel and AMD cpus.
 I would say that that makes it 'not modern'.
 We aren't talking 15% here. We are talking 300%-1000%.
:= The mmap interface is not supposed to be more efficient, per se.
:= Why would it be?
:
:Puzzling question. Because the kernel is supplied with more information
:-- it knows that I only plan to _read_ from the memory (PROT_READ),
:the total size of what I plan to read (mmap's len and, optionally,
:madvise's len), and (optionally) that I plan to read sequentially
:(MADV_SEQUENTIAL).
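
 For reference, the setup being described boils down to a couple of
 calls. A minimal sketch, assuming fd and len come from an earlier
 open() and fstat(), with error handling omitted:

	#include <sys/types.h>
	#include <sys/mman.h>

	/*
	 * Read-only mapping plus a sequential-access hint, as described
	 * in the quoted text above.
	 */
	void *
	map_sequential(int fd, size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

		if (p != MAP_FAILED)
			madvise(p, len, MADV_SEQUENTIAL); /* will read front to back */
		return p;
	}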

 Well, this is not correct. The kernel has just as much information
 when you use read().
 Furthermore, you are making the assumption that the kernel should
 read-ahead an arbitrary amount of data. It could very well be that
 the burstiness of your processing program requires a megabyte or more
 worth of read-ahead to keep the cpu saturated.
 The kernel will never do this, because dedicating that much memory to
 a single I/O stream is virtually guaranteed to be detrimental to the
 rest of the system (everything else running on the system).
 The kernel will not do this, but you certainly can, either by
 double-buffering the stream or by following Julian's excellent suggestion
 to fork() a helper thread to read that far ahead.
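 A hedged sketch of that suggestion: a forked helper that simply races
 through the file ahead of the parent to populate the buffer cache. A
 real helper would pace itself relative to the consumer, and the file
 name, buffer size, and per-byte 'work' here are placeholders:

	/*
	 * fork() a read-ahead helper: the child streams through the file
	 * doing throwaway reads so the data is already in the buffer
	 * cache when the parent's (slower, bursty) processing pass gets
	 * to it.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		const char *path = argc > 1 ? argv[1] : "dataset";
		char buf[65536];
		unsigned long sum = 0;
		ssize_t n, i;
		pid_t pid;
		int fd;

		if ((pid = fork()) == 0) {
			/* child: throwaway sequential reads, cache warming only */
			if ((fd = open(path, O_RDONLY)) >= 0) {
				while (read(fd, buf, sizeof(buf)) > 0)
					;
				close(fd);
			}
			_exit(0);
		}

		/* parent: the real processing pass */
		if ((fd = open(path, O_RDONLY)) < 0) {
			perror("open");
			return 1;
		}
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			for (i = 0; i < n; ++i)
				sum += (unsigned char)buf[i];	/* stand-in for real work */
		close(fd);
		waitpid(pid, NULL, 0);
		printf("sum: %lu\n", sum);
		return 0;
	}
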
:Mmap also needs no CPU data cache to read. If the device is capable of
:writing to memory directly (DMA?), the CPU does not need to be involved
:at all, while with read the data still has to go from the DMA-filled
:kernel buffer to the application buffer -- there being two copies of it
:in cache instead of none (for just storing) or one (for processing).

 In most cases the CPU is not involved at all when you mmap() data until
 you access it via the mmap(). However, that does not mean that the memory
 subsystem is not involved. The CPU must still load the data you access
 into the L1/L2 caches from main memory when you access it, so the memory
 overhead is still there and still (typically) 5 times greater than the
 additional memory overhead required to do a buffer copy in the read()
 case. When you add in the overhead of processing the data, which is
 typically 10-50 times the cost of reading it in the first place, the
 'waste' from the extra buffer copy winds up being in the noise.

 So, as I said in my previous email, it comes down to how much it costs
 to do a local copy within the L2 cache (the read() case), versus how
 much extra overhead is involved in the mmap() case. And, as I stated
 previously, L1 and L2 cache bandwidth is so high these days that it
 really doesn't take all that much overhead to match (and then exceed)
 the time it takes to do the local copy.
:Also, in case of RAM shortage, mmap-ed pages can be just dropped, while
:the too large buffer needs to be written into swap.

 Huh? No, that isn't true. Your too-large buffer might still only be
 a megabyte, whereas your mmap()'d data might be a gigabyte. Since you
 are utilizing the buffer over and over again, its pages are NOT likely
 to ever be written to swap.
:And mmap requires no application buffers -- win, win, and win. Is there
:an inherent "lose" somewhere that I don't see? Like:

 Again, you aren't listening to what I said about how the L1/L2 cache
 works. You really have to listen. APPLICATION BUFFERS WHICH EASILY 
 FIT IN THE L2 CACHE COST VIRTUALLY NOTHING ON A MODERN CPU! I even
 gave you a 'dd' test you could perform on FreeBSD to measure the cost.
 It is almost impossible to beat 'virtually nothing'.
:A database, that returns results 15%, nay, even 5% faster is also a
:better database.
:...
:What are we arguing about? Who wouldn't take a 2.2GHz processor over a
:2GHz one -- other things being equal -- and they are?
:..
:	-mi

 Which is part of the problem. You are not taking into account cost
 considerations when you say that. You are paying a premium to buy
 a cpu that is only 15% faster. If it were free, or cost a pittance,
 I would take the 2.2GHz cpu. But it isn't free, and for a high-end cpu
 15% can be 400ドル (or more), which is why it generally isn't worth it for
 a mere 15%. The money can be spent on other things that are just
 as important: memory, another disk (double your disk throughput), a
 GigE network card, even a whole new machine so you now have two
 slightly slower machines (200%) rather than one slightly faster machine
 (115%).
					-Matt
					Matthew Dillon 
					<dillon at backplane.com>

