[PVFS-users] Scaling problems with PVFS 1.6.2

Rob Ross rross at mcs.anl.gov
Tue May 11 11:06:26 EDT 2004


Hi all,

The first error that you're seeing is somewhat vague; it's difficult to 
tell for sure what is going on.  The second error is definitely concerned 
with initial socket setup to the iods, and Nathan's suggestion is likely 
to help with this.  You could also increase the values of:
        IODCOMM_CONNECT_TIMEOUT_SECS = 300,
        IODCOMM_BRECV_TIMEOUT_SECS = 300,
in include/pvfs_config.h, but you really shouldn't have to.

This might solve the first problem as well; I'm just not sure.

Have you looked to see if there is anything useful in the iod or mgr logs?  
That would be helpful to us, even just to know that there is nothing in 
there.

I'm still not 100% clear on where the error is coming from on the client 
side, in terms of location in the source.

Details below on what I think is going on...

I'm cc'ing this to pvfs-developers because there's a bug of sorts here. To
summarize what is happening, it goes like this:
- all your clients start up
- they hit the first point where they need to talk to the iods
  - this is in pvfs_open() actually
  - in all cases (unless you hack the code) all iod connections are opened 
    then, in order, starting with the first iod
  - pvfs_open() tries to send IOD_NOOPs to all the iods
    - build_simple_jobs() does this
    - iodcomm code opens up connections to iods
    - iodcomm code sends its own IOD_NOOP to iods (1)
    - after that one, the IOD_NOOP from pvfs_open is sent (2)
    - errors from here are ignored, although the error message is still 
      printed
- after all this, application I/O can take place (e.g. pvfs_write())
  - if we had a problem connecting before, we try again
  - iodcomm code tries to open connections to iods
  - iodcomm code sends IOD_NOOP to iods
  - errors in this path would give you the "pvfs_write: build_rw_jobs 
    failed" message

The code path in question is:
  pvfs_write() -> build_rw_jobs() -> add_accesses() -> add_iodtable()

So, there's this double-NOOP thing going on that creates some extra 
traffic, and it would be good to get rid of it.

Additionally, the build_simple_jobs() call always hits the servers in 
order.  If you're of the hacking mind, dropping some code in there to 
randomize where you start contacting would better balance the load across 
the iods.  Because connection establishment is done in a blocking manner 
(for our sanity), having every client contact the iods in the same order 
could substantially impact overall performance (however, this is hidden 
in pvfs_open(), so maybe you wouldn't see it?).
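
A minimal sketch of the randomized starting point described above.  This is 
not PVFS code: rotated_order() and pick_start() are hypothetical helpers, 
and how the seed is chosen is an assumption.  The idea is only that each 
client walks the same circular list of iods but begins at a different 
offset, so every iod is still contacted exactly once.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: choose the iod index at which this client starts.
 * The caller supplies the seed; something that differs per client
 * (e.g. the pid, or the MPI rank) is assumed. */
static int pick_start(int n_iods, unsigned seed)
{
    assert(n_iods > 0);
    srand(seed);
    return rand() % n_iods;
}

/* Hypothetical helper: fill order[0..n_iods-1] with iod indices,
 * beginning at 'start' and wrapping around to cover the rest. */
static void rotated_order(int n_iods, int start, int *order)
{
    int i;

    assert(n_iods > 0 && start >= 0);
    for (i = 0; i < n_iods; i++)
        order[i] = (start + i) % n_iods;
}

/* Example use in a client:
 *     int order[N_IODS];
 *     rotated_order(N_IODS, pick_start(N_IODS, (unsigned) getpid()), order);
 *     ...then contact the iods as order[0], order[1], ...
 */
```

With 4 iods and a start of 2, the visiting order is 2, 3, 0, 1.  Seeding 
from the clock alone would probably not help here, since clients launched 
in the same second would all pick the same offset.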

This suggests two potential improvements to the system:
1) removing the double-NOOP, perhaps by just calling add_iodtable() 
   directly in pvfs_open() rather than doing the build_simple_jobs() 
   thing
2) randomizing which iod each client contacts first; change 1 would 
   make this somewhat easier too, because there would be little or no 
   code in the loop

All this said, I think Nathan's approach of increasing the backlog will 
likely fix the problem in the short term :).
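
For reference, a small sketch of where a backlog value takes effect, 
assuming MGR_BACKLOG and IOD_BACKLOG end up as the backlog argument to 
listen(); the names suggest that, but it is not confirmed here.  
EXAMPLE_BACKLOG and open_listener() are made up for illustration and the 
value is not a recommendation.

```c
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define EXAMPLE_BACKLOG 1024   /* hypothetical value, not taken from PVFS */

/* Open a TCP listener on the loopback address with an ephemeral port.
 * 'backlog' bounds the queue of connections that have completed the
 * handshake but have not yet been accept()ed.
 * Returns the listening descriptor, or -1 on failure. */
static int open_listener(int backlog)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;             /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On Linux the value passed to listen() is silently capped at 
net.core.somaxconn, so if raising the #defines alone makes no difference, 
that sysctl is worth checking as well.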

Regards,

Rob

On Wed, 5 May 2004, Nathan Poznick wrote:

> Thus spake Kent F. Milfeld:
> >   Up to 128 clients, all seems to go well for a simple write
> >   of 64MB from each client using MPI-I/O.  However, when 
> >   writing from 256 clients, some processes have a write error of 
> >   8288 and the processes generally hang.  The processes that show
> 
> It sounds like you may be running into a limit in the number of
> connections that the manager / iods can accept at any given time.
> Testing out whether or not this is the case should be fairly simple.
> In include/pvfs_config.h, look for the #defines for MGR_BACKLOG and
> IOD_BACKLOG.  They are both currently set to 256 - try bumping them up
> significantly and recompile PVFS.  If that fixes your problem, you may
> want to consider bumping them up higher based on the maximum number of
> clients you expect to use.
> 
> -- 
> Nathan Poznick <poznick at conwaycorp.net>
> 
> We had better appear what we are, than affect to appear what we are
> not. - Francois de La Rochefoucauld
> 
> 