[PVFS-users] Re: PVFS access hangs
Rob Ross
rross at mcs.anl.gov
Fri Jul 16 16:01:30 EDT 2004
On Fri, 16 Jul 2004, Brannen S Hough wrote:
> I've gotten in some gigabit Ethernet equipment in the last couple of
> days, so I reconfigured my little cluster to use them. In the process I
> made them more homogeneous by installing RedHat 9 on the other two (so now
> all are running that). So at least I can get rid of any doubts I have about
> different kernel versions, etc.
Sounds good. But all that stuff was ok (except for the Xeon) before,
right?
> I see a lot of strange things that I cannot understand. Setting
> things up as in the quick start guide (except I have the pvfs data on a
> separate partition, /pvfs, with appropriate config file changes) and running
> tests on the Xeon client (running as Manager, one of the three IONodes, and
> Client) everything works fine. However, I tested both before and after
> setting up the kernel module and mounting the filesystem so standard I/O
> could access it (using the pvfs library calls) and it performed much faster
> after the filesystem was mounted. That seems odd. Also, the access using
> the library calls was slower than using standard I/O. Which also seems odd.
How were you testing with the library calls? With standard I/O?
> However, with the Xeon set up as the Manager, I was not able to use
> either the library or standard I/O (after mounting the pvfs filesystem using
> the kernel module) to access the pvfs files on either of the other machines
> running as IONodes. Here's the dump from the iolog on one of them:
>
>
> $Header: /projects/cvsroot/pvfs/iod/iod.c,v 1.66 2003/11/12 14:09:19 pcarns Exp $
> do_rw_req: empty job - removing
> do_rw_req: empty job - removing
> brecv_timeout: giving up
> do_rw_req: empty job - removing
> do_rw_req: empty job - removing
> do_rw_req: empty job - removing
> do_rw_req: empty job - removing
> do_rw_req: empty job - removing
> do_close_req: failed (i=0, ino=532259, cap=0)
> *** SENDING ERROR ***
> new_request: close failed
That "empty job" message occurs when a client asks for a region of a file
and the server has no data in that region -- a common case would be trying
to read beyond EOF.
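As a rough illustration of the read-beyond-EOF case, here is a local-file analogy in Python. It is not PVFS code, and the helper name read_region is made up for this sketch. It only shows that a request for a region past the end of a file comes back with no data.

```python
import os
import tempfile

def read_region(path, offset, length):
    """Return the bytes stored in [offset, offset + length) of a local file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Write a 10-byte file, then read one region inside it and one beyond it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

inside = read_region(path, 0, 4)      # region that holds data
beyond = read_region(path, 100, 4)    # region past EOF: nothing to return
os.remove(path)
```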
Are you sure that those messages are from the "cat" case below? If they
are, then they imply that some data was in fact getting to the iods from
the client node. But my first guess would be that that isn't happening.
Another possibility is that the IP addresses of your iods are not quite
right. For example, if you listed an address in the .iodtab that is
accessible from the local node but not remotely, then you would have
issues. Can you send us your .iodtab? And the pvfs-ping output too, in both
cases?
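A quick way to spot the "reachable locally but not remotely" kind of entry is sketched below. It assumes the .iodtab holds one host[:port] entry per line with "#" comments, and that 7000 is the port to fall back on; both are assumptions of this sketch, so check them against your own file. The sample contents and function names are hypothetical.

```python
import ipaddress

def parse_iodtab(text, default_port=7000):
    """Parse 'host[:port]' lines; '#' comments and blank lines are skipped.

    The file format and the default port are assumptions for this sketch.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host, sep, port = line.partition(":")
        entries.append((host, int(port) if sep else default_port))
    return entries

def is_local_only(host):
    """True for names/addresses that only make sense on the local node."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a hostname: it would need resolving on each client

# Hypothetical file contents, for illustration only.
sample = """# hypothetical .iodtab
127.0.0.1:7000
192.168.0.2:7000
192.168.0.3
"""
suspect = [(h, p) for h, p in parse_iodtab(sample) if is_local_only(h)]
```

Any entry that ends up in suspect would work from the node that hosts that iod but not from the other clients.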
> I could use pvfs-ping, pvfs-ls, etc (forgot to try u2p though)
> without a problem. After mounting the pvfs filesystem I could use 'ls',
> etc. However my test program hangs while trying to write files (though it
> can be interrupted). I also tried to 'cat' one of the files, but that hung
> also (the last errors above are from that attempt).
What does your test program do, and what interface does it use?
Please do try u2p also if you have a chance; I'm guessing that it won't
work.
> I'm using the Jan patch to pvfs-1.6.2, the two patches for redhat to
> the kernel code. Not sure what else to try - I could set up one of the
> other two machines as the Manager and see what happens then - so far my
> experience has been that unless a machine is the Manager node, I cannot
> access the pvfs filesystem for read or write even though the utilities work
> fine.
I'm almost certain that you're seeing either a misconfiguration problem
due to odd IP addresses in a PVFS config file, or a routing problem, or a
firewall problem.
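To help tell those cases apart, something like the following could be run from the client against each iod's address and port. This is a minimal sketch using plain TCP connects, not a PVFS tool; the function name probe and the interpretation in the comments are my assumptions.

```python
import socket

def probe(host, port, timeout=3.0):
    """Try a TCP connect to host:port and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"      # host reachable, nothing listening on that port
    except socket.timeout:
        return "timeout"      # often a firewall silently dropping packets
    except OSError:
        return "unreachable"  # e.g. no route to host, or name not resolved
```

For each address in the .iodtab, call probe(address, port) on the client that hangs. A "refused" result suggests a wrong address or port in the config, while "timeout" or "unreachable" points more toward firewall or routing.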
Regards,
Rob