[PVFS-users] RE: PVFS Hangups during concurrent reads/writes
David S Metheny
david.s.metheny at conwaycorp.net
Thu Aug 12 14:35:06 EDT 2004
Rob/Brannen,
We are seeing very similar problems with kpvfsd. We verified that
our job runs on 2 client nodes with the 1.6.3-pre1 release. The same job
"hangs up" on 1 of the clients when using 1.6.3-pre3 with kpvfsd. When I run
using the PVFS library from the 1.6.3-pre3 release, the job runs fine.
Each client does read/write accesses to the PVFS cluster. The
clients read portions of the same input file and write to individual output
files. Then each client reads its individual output files and writes to
different locations in a single output file. One client successfully
completes all these tasks, and the other hangs up.
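
The access pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the poster's actual job: run_client(), the helper names, the rank numbering, and the fixed chunk size are all hypothetical, and each client is assumed to own exactly one slice and one scratch file. It uses plain POSIX pread()/pwrite(), so it would go through the kernel interface only if the paths sit on a mounted PVFS volume.

```c
#define _XOPEN_SOURCE 700   /* pread()/pwrite() under strict -std=c99/c11 */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes at offset off; 0 on success, -1 on error or short file. */
static int read_full(int fd, char *buf, size_t len, off_t off)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, buf + done, len - done, off + (off_t)done);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}

/* Write exactly len bytes at offset off; 0 on success, -1 on error. */
static int write_full(int fd, const char *buf, size_t len, off_t off)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, buf + done, len - done, off + (off_t)done);
        if (n < 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}

/* Hypothetical per-client workload; rank selects this client's slice. */
int run_client(const char *input, const char *scratch, const char *shared,
               int rank, size_t chunk)
{
    off_t off;
    char *buf;
    int in = -1, tmp = -1, out = -1, rc = -1;

    assert(rank >= 0 && chunk > 0);
    off = (off_t)rank * (off_t)chunk;
    buf = malloc(chunk);
    if (buf == NULL)
        return -1;

    in  = open(input, O_RDONLY);
    tmp = open(scratch, O_RDWR | O_CREAT | O_TRUNC, 0644);
    out = open(shared, O_WRONLY | O_CREAT, 0644);
    if (in < 0 || tmp < 0 || out < 0)
        goto done;

    /* Phase 1: read this client's portion of the shared input file
     * and write it to the client's own output file. */
    if (read_full(in, buf, chunk, off) != 0 ||
        write_full(tmp, buf, chunk, 0) != 0)
        goto done;

    /* Phase 2: read the client's own file back and write it to this
     * client's location in the single combined output file. */
    if (read_full(tmp, buf, chunk, 0) != 0 ||
        write_full(out, buf, chunk, off) != 0)
        goto done;

    rc = 0;
done:
    if (in >= 0)  close(in);
    if (tmp >= 0) close(tmp);
    if (out >= 0) close(out);
    free(buf);
    return rc;
}
```

To approximate the report, one would start one process per client node at the same time, each with a distinct rank and scratch file under the mount point (for example under /mnt/pvfs); the file names themselves are placeholders.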
_____
From: pvfs-users-bounces at beowulf-underground.org
[mailto:pvfs-users-bounces at beowulf-underground.org] On Behalf Of Brannen S
Hough
Sent: Tuesday, August 10, 2004 3:09 PM
To: 'Rob Ross'
Cc: pvfs-users at beowulf-underground.org
Subject: [PVFS-users] RE: PVFS Hangups during concurrent reads/writes
Hi Rob,
Another follow up. I'm attaching the /etc/pvfstab file I'm using. I
mount the PVFS file system with the command
"mount -t pvfs TestHAClient:/pvfs-meta /mnt/pvfs".
I linked in the latest headers and recompiled my test program to run
it again. No dice: the same hang happens when running 2 test programs
simultaneously on 2 different machines. Running one of the test programs on
a machine that wasn't acting as part of the cluster (Manager or IONode)
didn't make a difference either.
Can you reproduce what I'm seeing there using my test program? And if
you have a test program that works well for you, could you send me a copy?
Maybe I can reproduce you not seeing what I see.
Do the IONodes use just the one socket (7000) for reading and writing?
Is it possible that it has two connections open, one for each of the pvfsd
instances communicating with it, and is getting reads vs. writes confused
(i.e. thinking it needs to wait on incoming data on a socket it actually
needs to write data to, and vice versa)? The hang-ups only happen when one
test program is reading while the other one is writing (though never to the
same files - I'm careful about that) - though that does not explain how they
can "wake up" after a long time and continue where they left off.
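
To make the "reads vs. writes confused" question concrete, here is a minimal sketch of how select() keeps the two directions apart: a descriptor goes into the read set, the write set, or both, and a timeout bounds the wait. This is not PVFS code; wait_ready(), the READY_* flags, and demo_wrong_direction() are hypothetical names, and a local socket pair stands in for the IONode connection.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define READY_READ  1
#define READY_WRITE 2

/* Wait until fd is ready in any of the requested directions.
 * Returns a mask of READY_* bits, 0 on timeout, -1 on select() error. */
int wait_ready(int fd, int want, long timeout_ms)
{
    fd_set rfds, wfds;
    struct timeval tv;
    int ready = 0;

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);

    FD_ZERO(&rfds);
    FD_ZERO(&wfds);
    if (want & READY_READ)
        FD_SET(fd, &rfds);
    if (want & READY_WRITE)
        FD_SET(fd, &wfds);

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    if (select(fd + 1, &rfds, &wfds, NULL, &tv) < 0)
        return -1;

    /* On timeout select() clears both sets, so ready stays 0. */
    if (FD_ISSET(fd, &rfds))
        ready |= READY_READ;
    if (FD_ISSET(fd, &wfds))
        ready |= READY_WRITE;
    return ready;
}

/* Waiting for the wrong direction on an idle socket: the read wait
 * times out while the same socket is immediately writable.
 * Returns 0 if that is what was observed, 1 if not, -1 on setup error. */
int demo_wrong_direction(void)
{
    int sv[2], r, w;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    r = wait_ready(sv[0], READY_READ, 50);   /* peer has sent nothing */
    w = wait_ready(sv[0], READY_WRITE, 50);  /* send buffer is empty */
    close(sv[0]);
    close(sv[1]);
    return (r == 0 && w == READY_WRITE) ? 0 : 1;
}
```

Without the timeout, the read-side wait in the demo would block until the peer sent something, which is the kind of stall that code waiting on the wrong direction would show. Whether that is what pvfsd is actually doing is an open question in this thread.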
Thanks,
- Brannen
> -----Original Message-----
> From: Rob Ross [mailto:rross at mcs.anl.gov]
> Sent: Monday, August 09, 2004 5:42 PM
> To: Brannen S Hough
> Cc: pvfs-users at beowulf-underground.org
> Subject: Re: PVFS Hangups during concurrent reads/writes
>
> Hi Brannen,
>
> I did a quick search and couldn't find any mention of 2.4.20 select()
> problems. Of course I would like this to be a kernel problem, or perhaps a
> libc problem, but I don't see anything indicating that others have had the
> same issues.
>
> At the same time, no, we haven't had anything like this reported either!
> It's particularly odd to me that things hang when on different
> machines while working just fine on the same machine! Usually it is the
> other way around :).
>
> Your test program is a little odd in that it moves back and forth between
> using the kernel and using the user library (if my cursory skim got the
> right impression). Also, you're playing a dangerous game keeping extra
> copies of the PVFS headers in the test subdirectory; there are changes
> between what I see in there and CVS for sure.
>
> Have you tried just using the kernel interface or just using the library?
> If so, did those work ok? Do you have an /etc/pvfstab file set up on your
> machine pointing exactly to the same directory as the mount point?
>
> Can you verify for me that PVFS_USE_NODELAY is defined in pvfs/config.h
> (not pvfs-kernel)? It's probably defined twice (it's ok).
>
> Thanks, and sorry we don't have a quick solution for you!
>
> Rob
>
>
> On Mon, 9 Aug 2004, Brannen S Hough wrote:
>
> > I've been trying to isolate this problem and find a way around
> > it. At its core it seems to be a select() call problem, which would mean a
> > Linux kernel problem. Attached is a screen shot of the trace from running
> > ddd on pvfsd; it gets hung up on line 199 in sockset.c, which is calling
> > dfd_select() (in pvfs-1.6.3-pre3/shared/dfd_set.c), which is calling
> > select().
> >
> > I tried updating my RedHat 9 to kernel version 2.4.20-31.9,
> > recompiling everything, and rerunning my tests, but I got the same results.
> > Any other ideas? I could try rewriting the dfd_select routine to break out
> > each socket file descriptor individually and calling select() on each
> > instead of passing the array of file descriptors to select(), but I'm not
> > sure that would fix the problem (and would make things slightly less
> > efficient).
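
The rewrite proposed in the quoted message, one select() call per socket instead of one call over the whole set, can be sketched as below. This is not the PVFS dfd_select() routine, whose signature is not shown in this thread; select_all(), select_each(), and demo_compare() are hypothetical names, and only readability is checked.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* One select() call over the whole set. Sets readable[i] to 1 for each
 * ready descriptor. Returns the number ready, 0 on timeout, -1 on error. */
int select_all(const int *fds, int nfds, int *readable, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int i, n, maxfd = -1;

    FD_ZERO(&rfds);
    for (i = 0; i < nfds; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
        readable[i] = 0;
    }
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(maxfd + 1, &rfds, NULL, NULL, &tv);
    if (n <= 0)
        return n;
    for (i = 0; i < nfds; i++)
        readable[i] = FD_ISSET(fds[i], &rfds) ? 1 : 0;
    return n;
}

/* One zero-timeout select() call per descriptor. Returns the number
 * ready, or -1 on error (readable[] is then only partly filled in). */
int select_each(const int *fds, int nfds, int *readable)
{
    int i, count = 0;

    for (i = 0; i < nfds; i++) {
        fd_set rfds;
        struct timeval tv = { 0, 0 };   /* poll, do not block */
        int n;

        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_ZERO(&rfds);
        FD_SET(fds[i], &rfds);
        n = select(fds[i] + 1, &rfds, NULL, NULL, &tv);
        if (n < 0)
            return -1;
        readable[i] = (n > 0);
        count += readable[i];
    }
    return count;
}

/* Self-check on two local socket pairs where only the second has data.
 * Returns 0 if both variants agree, 1 if not, -1 on setup error. */
int demo_compare(void)
{
    int a[2], b[2], fds[2], ok;
    int r1[2] = { 0, 0 }, r2[2] = { 0, 0 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, a) != 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, b) != 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }
    fds[0] = a[0];
    fds[1] = b[0];
    ok = (write(b[1], "x", 1) == 1);
    ok = ok && select_all(fds, 2, r1, 100) == 1;
    ok = ok && select_each(fds, 2, r2) == 1;
    ok = ok && r1[0] == 0 && r1[1] == 1 && r2[0] == 0 && r2[1] == 1;
    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return ok ? 0 : 1;
}
```

The per-descriptor version costs one system call per socket, and it cannot block on the whole set at once: it either polls with a zero timeout, as here, or blocks on one descriptor while another may already be ready. That is a larger behavioral change than the efficiency cost alone suggests.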