IIRC, AIUI, IMO, IIAC, AFAIK, etc.. etc.. to all of below.
On Wed, 14 Jun 2000, David Murphy wrote:
> Executive summary: buffering is one thing, filesystem organisation is
> You tell me - you introduced the term 'serial disk i/o' 8)
serial means more or less the same as sequential (to me). but to be
specific i mean:
serial I/O == raw I/O == character device I/O.
The I/O goes nearly straight from the device driver through an internal
I/O buffer to userspace memory and vice versa (it seems so on linux
anyway): no data buffering, no filesystem, no vfs (no vat...). Just a char
device that plugs you straight into the device driver.
> Direct I/O, i.e. I/O to a filesystem which bypasses the OS
> filesystem cache,
what do you mean by "filesystem cache"?
block/page/buffer cache? -> data cache.
an optimised list describing mappings between inodes (or vnodes),
directory entries, blocks, extents, etc.? -> metadata cache.
(dentries in linux *i think*.. maybe kate can enlighten me)
I've been arguing that the metadata cache is not the cause of
slowness; the db is probably in one or two gigantic files, so the
metadata cache has an easy task choosing what info to cache. The
only thing it can really do is try to keep the file laid out as
contiguously as possible.
The slowness is in the data cache. From its point of view, it
sees that within a range of blocks (range*blocksize >> allowed data
cache) the usage pattern is extremely complex (big database). In
order for the data cache to correctly predict that usage pattern it
would need unacceptably complex heuristics.. better, then, for the
data cache to get completely out of the way -> raw I/O.
conversely, for apps with the simplest usage, ie totally
sequential access such as reading a very large video file, there is
also no point for the data cache to be involved. The data cache
cannot magically increase the throughput of the medium. It can
perhaps reduce jitter, but it can't help adding latency. At worst it
has a significant effect on throughput.
So again, let the data cache get out of the way -> raw I/O. The app
(say a streaming video server) can do its own prefetching if needed
- therefore having ultimate control over latency - and the app has
the full throughput of that device available to it.
> It was developed because, while raw disks are the ultimate in
> performance, they are more work to administer than filesystems,
> for obvious reasons.
never having worked with raw I/O: in what way is it more difficult to
maintain? i would have thought easier. You just point oracle at a raw
I/O logical volume and forget about it until oracle starts telling
you that it's running short, at which point you either extend the LV
or give it a fresh LV.
(guff-o-meter reading off the dial here: paul knows nowt about real
life raw I/O or oracle.)
i can imagine programming an app to use raw I/O would be a
big/difficult job though.
> Sounds to me like a variation on a theme - instead of only caching
> certain transactions as in VxFS, limit the size of the cache. I'd
> guess that changing a filesystem would be considered more conservative
> than changing the paging stuff in the kernel.
that was indeed the reasoning donald gave for the patch. i think the
% was configurable too.
> Ah, but you don't, 'cos it uses extents, not indirect blocks and
> fragments and things, and extents can be big.
but inside the extent you must surely still use fragments/blocks,
with the same dereferencing overhead as always ('cept now the extent
is an extra layer)? there must be some layer of finer grained access
inside the extent, otherwise what happens when the VFS says "AcmeFS,
give me these blocks"? Does the FS say "uhmm.. here's a nice big
256MB blob of data"?
eg SGI XFS is extent based (and it also divides the volume into
"allocation groups"), but afaik it still uses more traditional
metadata such as superblocks, blocks, indirect blocks, fragments,
directories and inodes within each allocation group.
ext2 also has a more primitive type of extent iirc.
> The ones that I'm familiar with (VxFS, UFS on Solaris 2.6+) have a
> direct I/O option which accesses a file, or the entire filesystem,
> without using the filesystem cache.
presumably you mean data cache? (block/page/buffer depending on OS)
> It's still talking to a filesystem, but more directly.
not so sure. on XFS the GRIO stream has to be pre-allocated
(mkfs_xfs/xfs_growfs) and you can then attach a file to that stream.
So i strongly suspect that this is just standard character raw I/O
given the illusion of a nice friendly file interface by a very thin
veneer of code inside XFS.
ie, a hack to xfs to slap a file frontend on a raw I/O device.
> The point in that case being that the point they'd pointed it at was
> the wrong point.
you missed the point i was trying to point out to you. :)
> All the VxFS feature does, if you turn it on, is look at a request,
> and if it's smaller than X, bypass the filesystem cache. That's all.
if it's aimed at very large dbs: that's still a futile hack to try to
get the fs to second-guess an extremely complicated app - which it
won't get right, and it still won't perform like raw I/O.
if you can't do it properly then don't do it. And at least don't get
in the way.
Paul Jakma paul at clubi.ie
PGP5 key: http://www.clubi.ie/jakma/publickey.txt
A mathematician is a device for turning coffee into theorems.
-- P. Erdos