[Pvfs2-users] PVFS2 on BlueGene

Andrew Cherry acherry at mcs.anl.gov
Fri Apr 20 18:38:47 EDT 2007


I don't know the full details on the issues (our BGL was switched to  
jumbo frames before I came on board at ANL), but Susan Coghlan should  
be able to fill you in.  I've Cc:ed her on this note.

-Andrew

On Apr 20, 2007, at 5:31 PM, Michael Oberg wrote:

>
> I moved our BGL to Jumbo Frames several weeks ago, and we ran into
> similar issues to what you reported as well. We ended up establishing
> two networks, one for 9000MTU, and one for all of the equipment which
> must remain at 1500 (FC RAID controllers, PXE boot infrastructure for
> x86 nodes, loaner equipment, etc).
>
> Your note about having issues at or near 9000 MTU is concerning - can
> you give us some more insight into the problems that you experienced,
> and any ways that we could test for the same issues in our environment?
>
> Thanks,
>
> Michael Oberg
> Research Systems Evaluation Team
> National Center for Atmospheric Research (NCAR)
> Office: 303.497.1268, Cell: 720.938.6585
> oberg at ucar.edu
>
> Andrew Cherry wrote:
>> Matthew, Sam-
>>
>> FYI, using jumbo frames is not necessarily that simple.  The IBM file
>> servers that came with our BG/L don't support jumbo frames on the
>> internal NICs.  Ours are IBM x346 type 8840, and the built-in Broadcom
>> NICs couldn't handle jumbo frames.  I imagine other xSeries boxes with
>> integrated Broadcom NICs may have similar issues.  We ended up buying
>> PCI network cards in order to implement jumbo frames in our
>> environment.  Also, you'll need to make sure your network switch can
>> handle jumbo frames (ours is a Force10, don't know the exact model off
>> the top of my head but it supports jumbo frames).
>>
>> The other thing you need to be aware of is that switching to jumbo
>> frames is an all-or-nothing proposition; if you do it, you'll have to
>> do it for *all* of the hardware on the involved network segment.  You
>> can't just change a couple of servers.
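
One way to check that every host on a segment really passes full-size
frames is a ping with the don't-fragment bit set. The sketch below is
not from the thread. It assumes IPv4 and the Linux iputils ping options
-M do, -s and -c; the ICMP payload is the MTU minus 20 bytes of IPv4
header and 8 bytes of ICMP header. Other platforms use different ping
flags.

```python
import subprocess

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def df_ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command that forbids fragmentation."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(icmp_payload_for_mtu(mtu)), host]


def host_passes_mtu(host, mtu):
    """Return True if the host answers full-size, unfragmented pings."""
    result = subprocess.run(df_ping_command(host, mtu),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Calling host_passes_mtu("fileserver1", 8000) for each machine on the
segment (the host name here is hypothetical) should return True
everywhere before you rely on the larger frame size.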
>>
>> I'm Cc:ing a couple of folks at Argonne who worked on getting jumbo
>> frames working for our environment; they might be able to warn you of
>> any other gotchas.  We're using 8000 byte frames, but if I were
>> starting from scratch I'd try something closer to 8300 so that an
>> entire 8192-byte NFS packet can fit in a single frame (avoiding
>> fragmentation if you're using an 8192 byte NFS rsize/wsize).  Note,
>> 8300 is just a ballpark guess that I haven't been able to confirm.
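
The sizing argument above can be sketched as simple arithmetic. This is
an illustration, not a confirmed figure: it assumes NFS over UDP on
IPv4 (20-byte IP header, 8-byte UDP header), and the rpc_overhead value
is a placeholder for the RPC/NFS header bytes, which vary by NFS
version and operation and would need to be measured on your own
network.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
UDP_HEADER = 8    # bytes; NFS over TCP would need 20 or more instead


def min_mtu(nfs_payload, rpc_overhead,
            ip_header=IPV4_HEADER, transport_header=UDP_HEADER):
    """Smallest MTU that carries one NFS transfer in a single IP packet."""
    return nfs_payload + rpc_overhead + ip_header + transport_header


def fits_in_one_frame(mtu, nfs_payload, rpc_overhead):
    """True if the whole transfer fits in one frame at the given MTU."""
    return min_mtu(nfs_payload, rpc_overhead) <= mtu


# rpc_overhead=100 is a placeholder assumption, not a measured value.
print(min_mtu(8192, 100))                  # 8320
print(fits_in_one_frame(8000, 8192, 100))  # False
```

Whatever the real header overhead turns out to be, an 8000-byte MTU
cannot hold an 8192-byte payload in one frame, which is the motivation
for going somewhat above 8192.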
>>
>> Be warned -- in our environment, we started to have problems when we
>> got close to 9000 byte frames, so don't go too high.
>>
>> -Andrew Cherry
>>  BG/L Support
>>  Argonne National Laboratory
>>
>> On Apr 20, 2007, at 5:02 PM, Sam Lang wrote:
>>
>>>
>>> Hi Matthew,
>>>
>>> I think the version of PVFS in the Zepto release is pvfs2-1.5.1.
>>> Besides some performance improvements in the latest release
>>> (pvfs-2.6.3), there was a specific bugfix made in PVFS for largish
>>> mpi-io jobs.  If you could try the latest (at http://www.pvfs.org/),
>>> it would help us to verify that you're not running into the same
>>> problem.
>>>
>>> Regarding config options for PVFS on BGL, make sure you have jumbo
>>> frames enabled, i.e.
>>>
>>> ifconfig eth0 mtu 8000 up
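
After changing the MTU it is worth confirming that the setting took
effect on each machine. A minimal sketch, assuming a Linux host where
the kernel exposes the value in sysfs at /sys/class/net/<iface>/mtu;
the interface name eth0 comes from the command above, and the helper
names are made up for illustration.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU the kernel reports for a network interface."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def mtu_matches(iface, expected, sysfs_root="/sys/class/net"):
    """True if the interface's current MTU equals the expected value."""
    return read_mtu(iface, sysfs_root) == expected
```

For example, mtu_matches("eth0", 8000) should return True on a node
configured as shown.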
>>>
>>> Also, you should probably set the tcp buffer sizes explicitly in the
>>> pvfs config file, fs.conf:
>>>
>>> <Defaults>
>>>     ...
>>>         TCPBufferSend 524288
>>>         TCPBufferReceive 1048576
>>> </Defaults>
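
Buffer sizes like these only help if the operating system grants them.
The sketch below assumes the settings end up as ordinary SO_SNDBUF and
SO_RCVBUF socket options (an assumption about PVFS internals, not
something stated in this thread). It requests the same sizes on a
scratch TCP socket and prints what the kernel actually allows.

```python
import socket


def granted_buffers(send_bytes, recv_bytes):
    """Request socket buffer sizes and return what the kernel granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_bytes)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_bytes)
        return (s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
                s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))


# Sizes taken from the fs.conf fragment above.
send, recv = granted_buffers(524288, 1048576)
print("send buffer granted:", send)
print("recv buffer granted:", recv)
```

On Linux the reported value is normally twice the request, because the
kernel doubles it for bookkeeping. A figure below that suggests the
request was capped by net.core.wmem_max or net.core.rmem_max, which
would then need raising.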
>>>
>>> You might also see better performance with an alternative trove
>>> method for doing disk io:
>>>
>>> <StorageHints>
>>>     ...
>>>     TroveMethod alt-aio
>>> </StorageHints>
>>>
>>>
>>> Thanks,
>>>
>>> -sam
>>>
>>> On Apr 20, 2007, at 4:25 PM, Matthew Woitaszek wrote:
>>>
>>>>
>>>> Good afternoon,
>>>>
>>>> Michael Oberg and I are attempting to get PVFS2 working on NCAR's
>>>> 1-rack BlueGene/L system using ZeptoOS. We ran into a snag at over
>>>> 8 BG/L I/O nodes (>256 compute nodes).
>>>>
>>>> We've been using the mpi-io-test program shipped with PVFS2 to test
>>>> the system. For cases up to and including 8 I/O nodes (256
>>>> coprocessor or 512 virtual node mode tasks), everything works fine.
>>>> Larger jobs fail with file not found error messages, such as:
>>>>
>>>>    MPI_File_open: File does not exist, error stack:
>>>>    ADIOI_BGL_OPEN(54): File /pvfs2/mattheww/_file_0512_co does not exist
>>>>
>>>> The file is created on the PVFS2 filesystem and has a zero-byte
>>>> size. We've run the tests with 512 tasks on 256 nodes, and it
>>>> successfully created an 8589934592-byte file. Going to 257 nodes
>>>> fails.
>>>>
>>>> Has anyone seen this behavior before? Are there any PVFS2 server or
>>>> client configuration options that you would recommend for a BG/L
>>>> installation like this?
>>>>
>>>> Thanks for your time,
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> Pvfs2-users at beowulf-underground.org
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>
>>>
>>
>


