Quoting <Pine.LNX.4.21.0006140024490.1255-100000 at fogarty.jakma.org>
by Paul Jakma <paul at clubi.ie>:
Executive summary: buffering is one thing, filesystem organisation is
another.
> > If by 'serial disk i/o' you mean 'sequential disk i/o', then yes,
> yes i do. but what's the substantive difference between serial and
> sequential anyway?
You tell me - you introduced the term 'serial disk i/o' 8)
> > It's not the buffering, it's the filesystem -
> uhmm.. filesystem would have an effect, obviously. But that can't be
> it.
It's a part of it, I never said it was the whole story.
> The killer /must/ be block buffered I/O - if it wasn't then surely
> the solution would be for (eg) Oracle to just use block devices
> directly? eg tell it to use /dev/hdd - so that it would still be
> using block buffered I/O but without the FS overhead. But then why
> was raw I/O invented? could only be because the true overhead is in
> the OS block buffering...
Direct I/O, i.e. I/O to a filesystem which bypasses the OS filesystem
cache, is midway between raw I/O and cached filesystem I/O. It was
developed because, while raw disks are the ultimate in performance,
they are more work to administer than filesystems, for obvious
reasons. It's a tradeoff between performance and ease of maintenance.
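
To make "bypasses the OS filesystem cache" a bit more concrete, here is a rough sketch in Python. It uses the Linux-style O_DIRECT open flag as a stand-in, which is an assumption on my part: Solaris UFS and VxFS expose direct I/O through their own interfaces, not necessarily this one. The function name read_first_block and the 4096-byte block size are made up for illustration. Direct I/O generally wants an aligned buffer, hence the anonymous mmap, and the code falls back to an ordinary cached read if the filesystem refuses.

```python
import mmap
import os


def read_first_block(path, block_size=4096):
    """Read the first block of `path`, bypassing the page cache when possible.

    Returns (data, used_direct). Hypothetical helper, for illustration only.
    """
    # An anonymous mmap gives a page-aligned buffer, which direct I/O needs.
    buf = mmap.mmap(-1, block_size)
    try:
        # O_DIRECT is not defined on every platform; 0 means "no extra flag".
        direct_flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
        for flags in (direct_flags, os.O_RDONLY):
            try:
                fd = os.open(path, flags)
            except OSError:
                continue  # this filesystem rejected the flag, try buffered
            try:
                n = os.readv(fd, [buf])
                return bytes(buf[:n]), flags != os.O_RDONLY
            except OSError:
                continue  # direct read refused, fall back to a cached read
            finally:
                os.close(fd)
        raise OSError("could not read " + path)
    finally:
        buf.close()
```

Either way the caller is still naming a file in a filesystem, which is the whole point: you keep the administration model of files and lose only the cache.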
> incidentally, one way of optimising block I/O for large db
> performance is to get the OS to do minimal buffering for that fs. Eg
> donald becker had a patch where you could tell the kernel to only
> use 50% of the buffer cache for a particular fs.
Sounds to me like a variation on a theme - instead of only caching
certain transactions as in VxFS, limit the size of the cache. I'd
guess that changing a filesystem would be considered more conservative
than changing the paging stuff in the kernel.
> > With an extent-based filesystem,
> <unsure>aren't extents just a way to maintain groups of related
> blocks, to try keep these blocks in a relatively sequential order on
> disk?</unsure> extents are just another layer of indirection,
> because you will still have blocks, fragments, {double,triple}
> indirect blocks to dereference...
Ah, but you don't, 'cos it uses extents, not indirect blocks and
fragments and things, and extents can be big. You can create a 50GB
extent, allocate 2GB, and grow all the way to 50GB without
indirection. You will need indirection to get to another extent if you
grow beyond that, but that's still an awful lot more efficient than
UFS, assuming you could get a UFS that let you have 50GB files.
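
A back-of-the-envelope sketch of the difference, in Python. The numbers are assumptions chosen for illustration (8KB blocks, 12 direct pointers, 4-byte block pointers, a classic UFS-style inode), not the parameters of any particular UFS or VxFS release, and both function names are made up. The first function counts how many indirect blocks have to be read to find the data block at a given offset; the second maps a file block through a list of extents.

```python
def ufs_indirection_levels(offset, block_size=8192, direct_ptrs=12, ptr_size=4):
    """Indirect blocks to read before reaching the data block at `offset`.

    Assumes direct pointers followed by single, double and triple indirect
    blocks. The default sizes are illustrative, not taken from a real system.
    """
    ptrs_per_block = block_size // ptr_size
    block = offset // block_size
    if block < direct_ptrs:
        return 0
    block -= direct_ptrs
    span = ptrs_per_block  # blocks reachable at the current indirection level
    for level in (1, 2, 3):
        if block < span:
            return level
        block -= span
        span *= ptrs_per_block
    raise ValueError("offset is beyond the triple-indirect range")


def extent_lookup(extents, file_block):
    """Map a file block to a disk block via (file_start, disk_start, length) extents."""
    for file_start, disk_start, length in extents:
        if file_start <= file_block < file_start + length:
            return disk_start + (file_block - file_start)
    raise KeyError(file_block)


GB = 1 << 30
print(ufs_indirection_levels(0))            # 0: reached via a direct pointer
print(ufs_indirection_levels(2 * GB))       # 2: double indirect
print(ufs_indirection_levels(50 * GB - 1))  # 3: triple indirect

# One 50GB extent starting at (hypothetical) disk block 1000.
one_extent = [(0, 1000, 50 * GB // 8192)]
print(extent_lookup(one_extent, 50 * GB // 8192 - 1))  # 6554599
```

With these assumed sizes, the block-pointer scheme is already reading two indirect blocks at the 2GB mark and three by the end of a 50GB file, while the single extent resolves any block with one descriptor and some arithmetic.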
> urmm... even extent/higher tech FS's such as SGI XFS, DU AdvFS, (and
> i think Vxfs too) have a raw I/O interface.
The ones that I'm familiar with (VxFS, UFS on Solaris 2.6+) have a
direct I/O option which accesses a file, or the entire filesystem,
without using the filesystem cache. It's still talking to a
filesystem, but more directly.
> also, the application control thing: that's probably an IOCTL/open
> flag to tell the fs /NOT/ to buffer that device/file.
I don't know the detail of it.
> > This is why you should ask questions if someone tells you they're
> > running Oracle on UFS.
> or maybe they can't afford VxFS? :)
I should probably have said 'production Oracle instances' - the price
of VxFS is to the price of unlimited user Oracle as a bucket of water
is to an ocean. For development boxes, whatever comes with your Unix
will probably do.
> that's the kind of hackery that raw I/O avoids. Sticking loads of
> clever little algorithms into your FS to determine whether or not to
> buffer a /given/ read and if so, by how much, becomes pointless
> beyond a certain point.
The point in that case being that the point they'd pointed it at was
the wrong point.
8)
Just because a feature can be misconfigured doesn't mean it's silly.
> Or do you want your FS to have an intimate knowledge of how oracle
> works? Perhaps with a 100MB kernel table full of statistics on how
> different observed Oracles access the disk?
All the VxFS feature does, if you turn it on, is look at a request,
and if it's smaller than X, bypass the filesystem cache. That's all.
--
When asked if it is true that he uses his wheelchair as a weapon he will reply:
"That's a malicious rumour. I'll run over anyone who repeats it."
Stephen Hawking - [http://www.smh.com.au/news/0001/07/features/features1.html]
David Murphy - For PGP public key, send mail with Subject: send-pgp-key