AIUI == as i understand it.. yes. and it applies to this email too.
(in fact all my mails should really have big AIUI, IIRC, IMO,
etc.. disclaimers around them..) :)
On Tue, 13 Jun 2000, David Murphy wrote:
> If by 'serial disk i/o' you mean 'sequential disk i/o', then yes,
yes i do. but what's the substantive difference between serial and
sequential anyway?
> it is, and will be on any OS that uses disks. As I said
> yesterday, moving the disk heads is one of the slowest operations
> on a system - sequential reads need fewer, shorter seeks than
> random reads.
>
indeed. bear the above in mind and re-read what you say below about
filesystems. :)
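
A rough sketch of the seek argument, as a toy model only: it assumes the head starts at block 0 and that seek cost is proportional to the distance moved in blocks (real drives also have rotational latency, caching and so on). The function name `head_travel` is made up for this illustration.

```python
import random


def head_travel(block_order):
    """Total distance (in blocks) the head moves visiting blocks in this order.

    Toy model: the head starts at block 0 and cost is plain distance.
    """
    travel = 0
    position = 0
    for block in block_order:
        travel += abs(block - position)
        position = block
    return travel


blocks = list(range(1000))

# Sequential read: visit the blocks in on-disk order.
sequential = head_travel(blocks)

# Random read: the same blocks, visited in a shuffled order.
shuffled = blocks[:]
random.Random(42).shuffle(shuffled)
scattered = head_travel(shuffled)

print("sequential travel:", sequential)
print("random travel:    ", scattered)
```

In this model the sequential pass is the cheapest possible way to visit every block, so any other ordering of the same blocks costs more head movement.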
> It's not the buffering, it's the filesystem -
uhmm.. filesystem would have an effect, obviously. But that can't be
it. The killer /must/ be block buffered I/O - if it wasn't then
surely the solution would be for (eg) Oracle to just use block
devices directly? eg tell it to use /dev/hdd - so that it would still
be using block buffered I/O but without the FS overhead.
but then why was raw I/O invented? could only be because the true
overhead is in the OS block buffering...
incidentally, one way of optimising block I/O for large db
performance is to get the OS to do minimal buffering for that
fs. Eg donald becker had a patch where you could tell the kernel to
only use 50% of the buffer cache for a particular fs.
> as you'll recall, with ufs, and I presume ext2fs, once a file has
> more than X direct blocks, the filesystem starts allocating
> indirect blocks, double indirect blocks, etc. etc. - the upshot
> is the bigger the file, the more pointers you have to follow
> around the disk.
but that's not really a huge overhead /imo/. anyway, chances are you
already have the indirect blocks buffered.
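
To put a number on the indirect block cost, here is a minimal sketch of the classic direct/indirect pointer scheme. The figures are assumptions picked for illustration, not taken from any particular ufs or ext2fs build: 12 direct pointers per inode, 4 KB blocks and 4-byte block pointers (so 1024 pointers per indirect block). `indirection_level` is a hypothetical helper name.

```python
def indirection_level(block_index, direct=12, ptrs_per_block=1024):
    """Return how many indirect blocks must be followed to find a file block.

    0 = direct pointer in the inode, 1 = single indirect,
    2 = double indirect, 3 = triple indirect.
    Defaults are illustrative assumptions (4 KB blocks, 4-byte pointers).
    """
    if block_index < 0:
        raise ValueError("block index must be non-negative")
    if block_index < direct:
        return 0
    block_index -= direct
    span = ptrs_per_block  # blocks reachable at the current level
    for level in (1, 2, 3):
        if block_index < span:
            return level
        block_index -= span
        span *= ptrs_per_block
    raise ValueError("file block beyond triple-indirect range")


# Last block of a 4 GB file with 4 KB blocks.
last_block = (4 * 2**30) // 4096 - 1
print(indirection_level(last_block))
```

Under these assumptions even the last block of a 4 GB file sits behind two indirect blocks, and those are the blocks most likely to be sitting in the buffer cache already.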
> With an extent-based filesystem,
<unsure>aren't extents just a way to maintain groups of related
blocks, to try to keep these blocks in a relatively sequential order on
disk?</unsure> extents are just another layer of indirection, because
you will still have blocks, fragments, {double,triple} indirect
blocks to dereference...
> the
> typical commercial example being Veritas File System [VxFS], you
> could have a 4GB file, with the FS allocating it as just one 4GB
> extent. Applications can influence the way VxFS allocates files,
> hence you can approach the control you have with a raw disk,
> while avoiding the inconvenience of raw partitions.
>
urmm... even extent/higher tech FS's such as SGI XFS, DU AdvFS, (and
i think VxFS too) have a raw I/O interface.
also, the application control thing: that's probably an IOCTL/open
flag to tell the fs /NOT/ to buffer that device/file.
> This is why you should ask questions if someone tells you they're
> running Oracle on UFS.
or maybe they can't afford VxFS? :)
> If they're running Oracle over NFS, the
> question would be "Have you had your head examined recently?".
>
:0
> The buffering issue is essentially double-caching eating all your RAM
still can't be the full story. if it was then the answer would be
weakly buffered block i/o - and no-one would want the following:
bash-2.03# ls -l /dev/dsk/dks0d1s0 /dev/rdsk/dks0d1s0
brw------- 2 root sys 128, 16 Feb 24 01:54 /dev/dsk/dks0d1s0
crw------- 2 root sys 128, 16 Feb 24 01:54 /dev/rdsk/dks0d1s0
> - if Oracle is caching the data in its SGA, and your OS is caching
> that same data in its VM system, you may find you don't have much RAM
> left for other things, like, say, the OS 8) VxFS has a potentially
> useful feature, where it can decide if a given read should be buffered
> or not, just be sure you've tuned the threshold - see:
> <http://www.sun.com/blueprints/0400/ram-vxfs.pdf>
that's the kind of hackery that raw I/O avoids. Sticking loads of
clever little algorithms into your FS to determine whether or not to
buffer a /given/ read and if so, by how much, becomes pointless
beyond a certain point.
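
For illustration, the simplest form of such a heuristic might look like the sketch below. It is not VxFS's actual interface or algorithm: the function names and the 256 KB default threshold are hypothetical, chosen only to show the shape of a size-based "buffer or bypass" decision.

```python
def should_buffer(read_size, threshold=256 * 1024):
    """Hypothetical policy: small reads go through the OS cache,
    reads at or above the threshold bypass it."""
    return read_size < threshold


def cache_footprint(read_sizes, threshold=256 * 1024):
    """Bytes that would land in the OS cache for a series of reads."""
    return sum(size for size in read_sizes if should_buffer(size, threshold))


# A mix of small index lookups and one large scan-style read.
reads = [8192, 1 << 20, 4096]
print(cache_footprint(reads))
```

Everything hinges on the threshold: set it too high and the large reads get cached twice, too low and the small hot reads lose the cache, and the FS has no way of knowing which is which for a given app.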
Or do you want your FS to have an intimate knowledge of how oracle
works? Perhaps with a 100MB kernel table full of statistics on how
different observed Oracles access the disk?
That's silly, and that's what raw I/O is about - facing up to the
fact that bloating the kernel with lots of "second-guess
userspace" stuff is bad cause it can't come close to guessing right
most of the time.
Instead throw that crap back to userspace to the code that knows best
- the app itself - by using raw I/O.
> > Also: RAID systems are optimised for long sequential seeks, which
> > helps...
> Again, this is more the physics of disk drives than RAID systems per
> se.
*low cough* uhmmm.. yes i knew that... of course... *low cough*
:)
--
Paul Jakma paul at clubi.ie
PGP5 key: http://www.clubi.ie/jakma/publickey.txt
-------------------------------------------
Fortune:
It seems intuitively obvious to me, which means that it might be wrong.
-- Chris Torek