[PVFS-developers] Re: [PVFS-users] crash when no place left
rross at mcs.anl.gov
Fri Jun 25 11:59:04 EDT 2004
Sorry there has been such a delay in work on this problem. But today is
PVFS1 Friday, so there is hope!
The problem appears to be coming from the iod's handling of the "run out
of space" (nospc) scenario.
I've been looking at the iod-side code quite a bit today to see what I
could figure out. A couple of things have come to my attention right off:
(1) We've got a memory leak in the "nospc" case; we calloc() space at line
510 of jobs.c (in my version) and we never free() it.
(2) It's sort of silly to calloc() that when we could malloc() it, since
we don't ever look at the contents.
Both those are really easy problems to fix, but they aren't what is
causing the behavior that you are seeing.
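For illustration, a fix for both points might look something like the
sketch below. This is not the jobs.c code: fake_wire, wire_read(), and
discard_from_wire() are hypothetical names, and the in-memory "wire" only
stands in for the socket. The scratch buffer is taken with malloc(), since
its contents are never inspected, and it is freed before returning.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a socket; not a PVFS type. */
struct fake_wire {
    const char *data;   /* next unread byte */
    size_t      left;   /* bytes still available */
};

/* Copy up to len bytes from the fake wire into buf; returns bytes copied. */
static size_t wire_read(struct fake_wire *w, char *buf, size_t len)
{
    size_t n = len < w->left ? len : w->left;
    memcpy(buf, w->data, n);
    w->data += n;
    w->left -= n;
    return n;
}

/*
 * Drain and discard up to len bytes, as a server might do when the disk
 * is full. Returns the number of bytes discarded, or (size_t)-1 if the
 * scratch buffer could not be allocated.
 */
static size_t discard_from_wire(struct fake_wire *w, size_t len)
{
    char *scratch;
    size_t n;

    assert(w != NULL);
    if (len == 0)
        return 0;

    scratch = malloc(len);      /* malloc, not calloc: contents are never read back */
    if (scratch == NULL)
        return (size_t)-1;

    n = wire_read(w, scratch, len);
    free(scratch);              /* the release that the leaking path omits */
    return n;
}
```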
There is a more subtle problem with do_write() in the nospc case. Here's
the issue: do_write() only returns one value, and that value is the amount
of data it wrote. However, in the nospc scenario do_write() has actually
pulled more data off the wire than it was able to write! So what happens
is that the iod loses track of some data that was read and (I'm theorizing
here) ends up coming up short on data to read.
So the client has sent everything and is waiting for an ACK from the iod,
and the iod is sitting there wondering where the last of the data is
(which it has actually already pulled off the wire).
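The accounting problem can be shown with a small sketch. Everything here
is hypothetical (xfer_result, fake_disk, write_step(), and the two
remaining_*() helpers are made-up names, not the iod's do_write()); it
only models a step that reports both the bytes pulled off the wire and the
bytes that reached the disk, instead of the single "bytes written" value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; not taken from the PVFS source. */
struct xfer_result {
    size_t consumed;    /* bytes pulled off the wire */
    size_t written;     /* bytes that actually reached the disk */
};

/* Hypothetical disk with a limited amount of free space. */
struct fake_disk {
    size_t free_bytes;
};

/* Write as many of len bytes as fit; returns the number written. */
static size_t disk_write(struct fake_disk *d, size_t len)
{
    size_t n = len < d->free_bytes ? len : d->free_bytes;
    d->free_bytes -= n;
    return n;
}

/* One step: take chunk bytes from the wire, write as many as fit. */
static struct xfer_result write_step(struct fake_disk *d, size_t chunk)
{
    struct xfer_result r;

    assert(d != NULL);
    r.consumed = chunk;               /* model: all of the chunk arrived */
    r.written  = disk_write(d, chunk);
    return r;
}

/* Tracking by bytes written: over-counts what is still on the wire. */
static size_t remaining_by_written(size_t expected, struct xfer_result r)
{
    return expected - r.written;
}

/* Tracking by bytes consumed: matches what the client has left to send. */
static size_t remaining_by_consumed(size_t expected, struct xfer_result r)
{
    return expected - r.consumed;
}
```

With a 10-byte request and 6 bytes free on disk, tracking by bytes written
leaves the server expecting 4 more bytes that it has already read, while
tracking by bytes consumed correctly gives 0.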
I'm working on a solution to this, and I think that I can probably get it
knocked out today. I'll keep you posted.
On Wed, 19 May 2004, Rob Ross wrote:
> Using the kpvfsd I find that the cp command doesn't seem to return,
> although I was able to ctrl-c the cp. The mgr and iod are still up and
> running ok. I was able, sort of, to unmount, but I cannot mount again;
> the mount.pvfs hangs waiting for something or another.
> That "md_stat: lstat: ..." message is normal, and I just took it out so we
> won't have to look at it in the future.
> I'm going to try again with the user-space pvfsd and see if I can narrow
> down what is going on. It appears that the iod and mgr are handling this
> just fine, and that there is simply a problem on the client side.
> We're trying to get another prerelease out ASAP, so a fix for this may not
> make that cut. I will try to get this fixed before the 1.6.3 release.
> On Tue, 18 May 2004, Rob Ross wrote:
> > Thanks for the problem report! It is normal for the mgr to not log
> > anything in this case -- the mgr is not involved in write operations.
> > I'll see if I can replicate this here.
> > On Tue, 18 May 2004 Stadrim.DRIM.CETMEF at i-carre.net wrote:
> > > I have 10 nodes running the iod server with 60 GB on each and one
> > > other running the mgr server.
> > >
> > > On a client (with pvfsd and the kernel module), if I try to copy a
> > > file when all the servers are full, the cp command never returns. It is
> > > impossible to kill this command (SIGKILL doesn't work), but killing
> > > the mgr process sometimes works; in the worst case, I have to reboot
> > > the client. I think that the kernel module waits for something that
> > > never happens.
> > >
> > > In the log files, I found that all the iod servers notice that there is
> > > no space left on their local partition, but the mgr and the pvfsd on
> > > the client do not.
> > >
> > > In the mgr log file, there are a lot of "md_stat: lstat:: No such file
> > > or directory" messages, and nothing in the client's log.
> > >
> > > PVFS runs on Mandrake 9.0 with a custom 2.4.26 kernel and PVFS 1.6.2.