On Fri, 16 Jul 2004, Tony Bolger wrote:
> Netapps address their disks a bit differently to most RAID approaches.
>
> In a 'normal' RAID 4 or 5 setup, the RAID layer presents the same
> numbered blocks from each disk in turn (skipping the parity one) to
> the fs level, more or less as a virtual disk, with a predictable
> mapping between physical disk number / block and virtual disk block
> numbers. As you read the virtual blocks, on a 3 + 1 RAID 5 setup,
> you get:
>
> (Letters are disks, Blocks are numbers).
>
> If you add a new disk, you mess up the mappings, because then you're going
> A0,B0,C0,D0,A1,B1,C1,E1 ....
> Thus your FS sees the new disk and old parity disk interleaved with
> the data, and it's not likely to be happy about it.
>
> The netapp approach is to make the FS aware of the physical disks,
> and let it worry about spreading the content around. So if you have
> a file on B0,C0,D0, (calling A the parity disk), after you add disk
> E, you still have the same file on B0,C0,D0, with new blank blocks
> on E0,E1,E2....
Yum. That might work for netapp, but it's just insane for any kind of
general system, either hardware or software RAID.
Map the disks into stripes and just maintain mappings of virtual
blocks <-> blocks in the stripes. That is far more sane and flexible,
and adding a disk is relatively easy: map the new blocks in all the
stripes resulting from the addition of disk E either to the end of the
existing logical disk or to a new logical disk.
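
To make the difference concrete, here is a rough sketch of the idea, not how
Linux MD or any particular controller actually does it. Parity is ignored, and
the names (MappedArray, arithmetic_lookup, add_disk) are made up for
illustration. It contrasts a position computed from the disk count with an
explicit table that only ever grows at the end.

```python
# Hypothetical sketch: explicit virtual -> physical block map.
# Parity placement is ignored; names are illustrative only.

def arithmetic_lookup(virtual_block, disks):
    # Fixed striping: the location is computed from the disk count,
    # so adding a disk changes where existing blocks appear to live.
    return disks[virtual_block % len(disks)], virtual_block // len(disks)


class MappedArray:
    def __init__(self, disks, blocks_per_disk):
        self.blocks_per_disk = blocks_per_disk
        self.table = []  # table[virtual_block] = (disk, physical_block)
        # Initial layout: stripe across the disks, row by row.
        for row in range(blocks_per_disk):
            for disk in disks:
                self.table.append((disk, row))

    def add_disk(self, disk):
        # Map the new disk's blocks onto the end of the logical disk.
        # Existing entries are never touched, so old data stays put.
        for row in range(self.blocks_per_disk):
            self.table.append((disk, row))

    def lookup(self, virtual_block):
        return self.table[virtual_block]


# Computed mapping: virtual block 3 moves when disk E is added.
moved_from = arithmetic_lookup(3, ["B", "C", "D"])
moved_to = arithmetic_lookup(3, ["B", "C", "D", "E"])

# Table mapping: the first six virtual blocks are unchanged by the addition.
array = MappedArray(disks=["B", "C", "D"], blocks_per_disk=2)
before = [array.lookup(v) for v in range(6)]
array.add_disk("E")
after = [array.lookup(v) for v in range(6)]
```

With the table, the old virtual blocks keep their physical locations and the
blocks on E simply show up as new space at the end of the logical disk.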
> I'm sure a similar effect could be achieved using LVM and some
> _cunning_ algorithms,
Linux MD can already do it.
> but you'd want to check out just how well patented WAFL is first.
IIRC, Daniel Phillips had to refrain from continuing work on his Tux2
filesystem because of NetApp WAFL patents, specifically those
relating to the tree and switch-nodes-for-updates nature of
Tux2, which made it very robust without having to use a journal
(writes go to a new tree of metadata; when done, switch the old root
for the new root. End result: the only critical point in updating
metadata is the switch of nodes. He called it "phase tree", I think).
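
As a rough illustration of that "write a new tree, then switch roots" idea,
here is an in-memory toy. It is not Tux2's or WAFL's actual design; the Store
class and the crash_before_switch flag are invented for the example, and a
real filesystem would copy only the changed path of the tree and make the
root switch a single atomic on-disk write.

```python
# Hypothetical sketch of updating metadata by switching roots.
# In-memory only; names are illustrative, not taken from Tux2 or WAFL.

class Store:
    def __init__(self, metadata):
        self.root = dict(metadata)  # the committed tree

    def update(self, changes, crash_before_switch=False):
        new_root = dict(self.root)  # build the new tree off to the side
        new_root.update(changes)    # the committed tree is never modified
        if crash_before_switch:
            return                  # simulated crash: old root still intact
        self.root = new_root        # the one critical step: switch roots


store = Store({"a.txt": 1})

# A crash before the switch leaves the old, consistent tree in place.
store.update({"b.txt": 2}, crash_before_switch=True)
after_crash = dict(store.root)

# A completed update becomes visible all at once.
store.update({"b.txt": 2})
after_commit = dict(store.root)
```

Either the old tree or the new tree is visible, never a half-updated mix,
which is why no journal is needed in this scheme.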
> It's also possible that other people have done something similar
> with hardware RAID block mapping, but i'm not aware of anyone who
I'm pretty sure just about every sane hardware RAID controller that
can add disks to existing arrays does block mapping. :)
Paul Jakma  paul at clubi.ie  paul at jakma.org  Key ID: 64A2FF6A
warning: do not ever send email to spam at dishone.st
The game of life is a game of boomerangs. Our thoughts, deeds and words
return to us sooner or later with astounding accuracy.