On Tue, 13 Jun 2000, Kenn Humborg wrote:
> (The merits of this is a discussion for another thread. Talking
> about this with Kevin Lyda a few weeks ago didn't enlighten
> either of us on why this is a good thing. So if anyone wants
> to educate me (or provide URLs) on the rationale behind this
> raw I/O, fire ahead.)
>
(please preface every paragraph, nay sentence, with: AIUI)
the biggest benefit:
sequential disk i/o is faster.. much faster. (at least it is on DU and IRIX).
(i assume linux raw i/o is character device based, as it is on
aforementioned Unix(TM)'s)
the database can therefore also optimise writes/reads to be as sequential
as possible. With block I/O it can't really do it, as the OS is second
guessing it by doing its own buffering.
benefit 2:
the database knows far far more about the data than the OS. In which case
the OS is just getting in the way. There's just no point in the OS trying to
cache reads/writes.
The db wants writes to happen /now/, so for a block device it has to set
FSYNC - which negates and even disrupts the OS's block buffering, ie the
OS might have to traverse lists for a global cache to find the blocks
relevant to that device/file, slow...
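A rough sketch of what "writes happen /now/" looks like from the application side. This assumes the FSYNC above corresponds to the flag Linux calls O_SYNC, and the helper name write_sync is made up for illustration. Note that this still goes through the OS buffer cache; it only forces each write to reach the disk before the call returns.

```python
import os

def write_sync(path, offset, data):
    """Write data at offset and return only after the OS reports it is on disk."""
    # O_SYNC asks for synchronous writes; getattr() guards platforms without it.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        written = os.pwrite(fd, data, offset)  # positioned write, no separate seek call
        os.fsync(fd)  # explicit flush as well, in case O_SYNC was unavailable
        return written
    finally:
        os.close(fd)
```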
Behaviour that holds true for simpler apps with simple data needs (eg if
it reads block x then it will probably read blocks x+n too, so the OS might
as well read ahead and have them in the buffer/page cache before the app asks
for them) doesn't hold true for apps with far more complex data, eg a database.
ie the relationship between data on disk and data access is non-obvious.
If the DB needs to write out a bunch of changes, chances are they will not
be to anything close to a sequential range of blocks. With block
buffering the OS is not buying you much in this case (in fact slowing you
down), probably it will seek all over the place as it tries to cluster
different blocks together that were accessed at different times.
With raw i/o however, when the db knows "i've got a bunch of stuff to
commit to disk" it doesn't matter that they are in different places on
disk - the db can do it in one long sequential operation - one
seek, which is much faster.
basically: with raw i/o the db in theory can just continually seek through
the disk, reading and writing to the appropriate places when they come
up.[1]
whereas buffered block i/o would be messily seeking left right and centre
with no real clue, because it can't get a handle on the pattern of data
access.
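To make the "one long sequential operation" idea concrete, here is a minimal sketch, assuming the db keeps its pending changes as a map of byte offset to data. The name flush_in_offset_order is hypothetical. The writes are issued in ascending offset order, so the device is walked in one direction instead of in the order the changes happened to be made.

```python
import os

def flush_in_offset_order(fd, pending):
    """Write every (offset -> bytes) entry in ascending offset order.

    Returns the offsets in the order they were written.
    """
    order = sorted(pending)  # sorting the dict sorts its keys, i.e. the offsets
    for offset in order:
        os.pwrite(fd, pending[offset], offset)
    os.fsync(fd)  # one flush for the whole batch
    return order
```

With pending = {4096: b"c", 0: b"a", 512: b"b"} the writes go out at offsets 0, 512, 4096, whatever order they were queued in.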
Also: RAID systems are optimised for long sequential transfers, which helps...
--paulj
(big AIUI and IMO applies to all above)
[1]. one way /perhaps/ of doing this would be for the db to maintain 2
queues, one for read, one for write, that point to the data to be
read/flushed.
the queues are written to by the db jobs indicating what they want
read/flushed, and the queues are read by an "i/o" job that continually loops
through the raw i/o device. the i/o job is aware of its position through
the disk, and is aware of the "closest" entries in the queues - and
just reads and writes in the appropriate places.
the i/o job could also do things like apply/verify checksums,
journals.. etc.. (with block i/o this i/o job wouldn't have much to do).
'twould be quite efficient. (disk performance anyway).
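A small sketch of the two-queue idea in [1], purely illustrative: the class and method names (SweepQueues, submit, next_request) are invented here, "closest" is taken to mean the nearest request at or ahead of the current position, and the job wraps back to offset 0 when nothing is left ahead of it. It only decides the order of requests; the actual raw reads and writes are left out.

```python
import bisect
import itertools

class SweepQueues:
    """Two sorted request queues serviced by one continuously sweeping i/o job."""

    def __init__(self):
        self.queues = {"read": [], "write": []}
        self.position = 0  # where the i/o job currently is on the device
        self._seq = itertools.count()  # tie-breaker so payloads are never compared

    def submit(self, kind, offset, payload=None):
        """Called by db jobs: queue a 'read' or 'write' request for an offset."""
        bisect.insort(self.queues[kind], (offset, next(self._seq), payload))

    def _ahead(self, kind):
        """Index of the first request in this queue at or past the current position."""
        q = self.queues[kind]
        i = bisect.bisect_left(q, (self.position, -1, None))
        return i if i < len(q) else None

    def next_request(self):
        """Pop the closest request ahead of the current position.

        Wraps to offset 0 if nothing is ahead; returns None if both queues are empty.
        """
        if not any(self.queues.values()):
            return None
        for _ in range(2):  # the second pass runs after wrapping around
            best = None
            for kind in ("read", "write"):
                i = self._ahead(kind)
                if i is not None:
                    offset = self.queues[kind][i][0]
                    if best is None or offset < best[0]:
                        best = (offset, kind, i)
            if best is not None:
                offset, kind, i = best
                _, _, payload = self.queues[kind].pop(i)
                self.position = offset
                return kind, offset, payload
            self.position = 0  # end of sweep: start again from the beginning
        return None
```

The i/o job would call next_request() in a loop and do the matching raw read or write for each result; checksum and journal handling would hang off the same loop.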