-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi,
I have a weird, frustrating problem and would appreciate the insights
of anyone on this list. Please bear with me, it's a long mail but the
problem needs to be described.
Our research group focuses on CFD
(http://en.wikipedia.org/wiki/Computational_fluid_dynamics for those
interested).
Most of us use software called Fluent and one person in the group uses
CFX. All our desktop machines run Windows, and we use the Windows
versions there, but we also have a cluster of 9 Fujitsu-Siemens dual
processor Xeons.
When the cluster was initially delivered, it was running RedHat 6.
After a few months, some of the Fluent users found that their files
wouldn't read because they were corrupted.
Fluent files are made up of descriptive text at the top, a binary blob
of information in the middle, and text again at the bottom. Fluent has
support for gzip so I told people to gzip the files and that helped
for a while but the corruption came back. The occurrences seemed random and only
affected about 2 out of the 5 people using Fluent on the cluster. We
would find that the modification date on a corrupted data set would be
the same as a backup that was working.
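One thing that might help narrow this down is comparing a corrupted
file against its working backup to see where the damage sits and how
big it is. A minimal sketch in Python, assuming both copies are
available uncompressed; the function name and the 4096-byte block size
are my own illustrative choices and have nothing to do with Fluent's
file format:

```python
def differing_blocks(path_a, path_b, block=4096):
    """Return the indices of fixed-size blocks whose contents differ.

    Hypothetical helper: 'block' is an assumed size, chosen to match a
    common page/filesystem block size, not anything Fluent-specific.
    """
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(block)
            b = fb.read(block)
            if not a and not b:
                break          # both files exhausted
            if a != b:
                bad.append(index)   # also catches a length mismatch
            index += 1
    return bad
```

If the damage turned out to be whole, aligned blocks, that might point
more towards disk, page cache or NFS; a few scattered bytes might point
more towards memory or the application. That is only a heuristic, not
proof either way.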
The CFX user had no problem and 2 years later continues to have no
problem.
In short, I couldn't pin it down to anything but suspected that the
versions of software offered by RedHat 6 were old and possibly dodgy.
So about a year ago, I wiped all the machines and put Debian sarge on
them. It's not a supported platform for either Fluent or CFX but I've
managed to get both working from the tarball that each vendor provides.
It's started happening again and specifically, it's started happening
to my files. Considering that each of these datasets generally takes
about 12 hours to solve, it's more than a bit of a pain in the arse
that stuff is screwing up. One of the machines faces the network and
runs Kerberos, NIS, Nagios, NFS, DNS, Squid and ntpd. The other nodes
have the Fluent and CFX software NFS mounted from the master node.
Now, don't moan about this bit - it's the only way I could do it. The
master only had 50GB of disk free. Each of the nodes had about 20GB
free. To give everyone enough space for the thing to be useful, the
/home of the heaviest user was put on the master node and the other
users were given a /home on one of the nodes, which was NFS mounted to
the master (as /home/$user). Generally, a job is set running on more
than 1 node from the master - Fluent uses rsh to contact the other
nodes. As far as possible, no heavy computation is done on the master
node.
I don't think it's an NFS problem - the user with the home on the
master node was the first to go tits up. I don't think it's a Debian
problem because the same happened with RedHat. I don't think it's a
Linux problem because no other software seems to have a problem.
Nothing in logs or dmesg. I'm leaning towards a Fluent problem or a
hardware problem but I can't think of any way to test this. The problem
is sufficiently random that I can't provide good data to the software
maker to investigate - and the fact that we're running on an
unsupported platform doesn't help. And also, if it's a hardware
problem, why are only files read and written with this software
affected?
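Since the modification date doesn't change when a file goes bad, one
way to at least catch when it happens might be to record a checksum of
each data file right after a run finishes and re-check it later (from
cron, say). A rough sketch, assuming Python's hashlib; the function
names and the manifest dict are hypothetical, just for illustration:

```python
import hashlib

def digest_of(path, chunk=1 << 20):
    """SHA-256 of a file, read in chunks so big datasets fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for piece in iter(lambda: f.read(chunk), b""):
            h.update(piece)
    return h.hexdigest()

def snapshot(paths):
    """Record a digest for each file (hypothetical manifest format)."""
    return {p: digest_of(p) for p in paths}

def changed_since(manifest):
    """Return the files whose contents no longer match the manifest."""
    return sorted(p for p, d in manifest.items() if digest_of(p) != d)
```

Running the same check both on the node that holds the disk and on the
master over NFS might show whether the two views of a file ever
disagree, though cached reads could hide that.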
So, can anyone suggest something to try or troubleshooting steps to go
through?
Any help much appreciated.
Regards,
Cian
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFFT4/S2yUma7R/3b8RAj23AKCABCbCv/8c542nEkjZ/FdcJ2z0vwCeOx+L
Fo7gokVSzUyaWj3avxnJwTg=
=IbIz
-----END PGP SIGNATURE-----