[Pvfs2-users] PVFS2 on BlueGene
Andrew Cherry
acherry at mcs.anl.gov
Fri Apr 20 18:38:47 EDT 2007
I don't know the full details on the issues (our BGL was switched to
jumbo frames before I came on board at ANL), but Susan Coghlan should
be able to fill you in. I've Cc:ed her on this note.
-Andrew
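
On the question quoted below about how to test for jumbo-frame problems: one common check is to ping across the segment with fragmentation prohibited, so that any NIC or switch port that cannot pass the full-size frame shows up as a failure. The following is only a minimal sketch, not something taken from this thread. It assumes Linux iputils ping (the `-M do` option), IPv4 with no header options, and a placeholder host name `fileserver1`.

```python
from typing import List

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError("MTU %d is smaller than the IPv4 + ICMP headers" % mtu)
    return payload


def df_ping_command(host: str, mtu: int, count: int = 3) -> List[str]:
    """Build a ping command that forbids fragmentation (Linux iputils syntax)."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(ping_payload_for_mtu(mtu)), host]


if __name__ == "__main__":
    # "fileserver1" is a hypothetical host name; substitute a real peer.
    for mtu in (1500, 8000, 9000):
        print(mtu, " ".join(df_ping_command("fileserver1", mtu)))
```

Running the printed commands between each pair of hosts, at the MTU configured and at sizes just below it, would show whether large frames actually make it end to end.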
On Apr 20, 2007, at 5:31 PM, Michael Oberg wrote:
>
> I moved our BGL to Jumbo Frames several weeks ago, and we ran into
> similar issues to what you reported as well. We ended up establishing
> two networks, one at MTU 9000, and one for all of the equipment which
> must remain at 1500 (FC RAID controllers, PXE boot infrastructure for
> x86 nodes, loaner equipment, etc.).
>
> Your note about having issues at or near 9000 MTU is concerning - can
> you give us some more insight into the problems that you experienced,
> and any ways that we could test for the same issues in our
> environment?
>
> Thanks,
>
> Michael Oberg
> Research Systems Evaluation Team
> National Center for Atmospheric Research (NCAR)
> Office: 303.497.1268, Cell: 720.938.6585
> oberg at ucar.edu
>
> Andrew Cherry wrote:
>> Matthew, Sam-
>>
>> FYI, using jumbo frames is not necessarily that simple. The IBM file
>> servers that came with our BG/L don't support jumbo frames on the
>> internal NICs. Ours are IBM x346 type 8840, and the built-in Broadcom
>> NICs couldn't handle jumbo frames. I imagine other xSeries boxes with
>> integrated Broadcom NICs may have similar issues. We ended up buying
>> PCI network cards in order to implement jumbo frames in our
>> environment. Also, you'll need to make sure your network switch can
>> handle jumbo frames (ours is a Force10; I don't know the exact model
>> off the top of my head, but it supports jumbo frames).
>>
>> The other thing you need to be aware of is that switching to jumbo
>> frames is an all-or-nothing proposition; if you do it, you'll have to
>> do it for *all* of the hardware on the involved network segment. You
>> can't just change a couple of servers.
>>
>> I'm Cc:ing a couple of folks at Argonne who worked on getting jumbo
>> frames working for our environment; they might be able to warn you of
>> any other gotchas. We're using 8000-byte frames, but if I were
>> starting from scratch I'd try something closer to 8300 so that an
>> entire 8192-byte NFS packet can fit in a single frame (avoiding
>> fragmentation if you're using an 8192-byte NFS rsize/wsize). Note,
>> 8300 is just a ballpark guess that I haven't been able to confirm.
>>
>> Be warned -- in our environment, we started to have problems when we
>> got close to 9000-byte frames, so don't go too high.
>>
>> -Andrew Cherry
>> BG/L Support
>> Argonne National Laboratory
>>
>> On Apr 20, 2007, at 5:02 PM, Sam Lang wrote:
>>
>>>
>>> Hi Matthew,
>>>
>>> I think the version of PVFS in the Zepto release is pvfs2-1.5.1.
>>> Besides some performance improvements in the latest release
>>> (pvfs-2.6.3), there was a specific bugfix made in PVFS for largish
>>> mpi-io jobs. If you could try the latest (at http://www.pvfs.org/),
>>> it would help us to verify that you're not running into the same
>>> problem.
>>>
>>> Regarding config options for PVFS on BGL, make sure you have jumbo
>>> frames enabled, i.e.
>>>
>>> ifconfig eth0 mtu 8000 up
>>>
>>> Also, you should probably set the tcp buffer sizes explicitly in the
>>> pvfs config file, fs.conf:
>>>
>>> <Defaults>
>>> ...
>>> TCPBufferSend 524288
>>> TCPBufferReceive 1048576
>>> </Defaults>
>>>
>>> You might also see better performance with an alternative trove
>>> method for doing disk io:
>>>
>>> <StorageHints>
>>> ...
>>> TroveMethod alt-aio
>>> </StorageHints>
>>>
>>>
>>> Thanks,
>>>
>>> -sam
>>>
>>> On Apr 20, 2007, at 4:25 PM, Matthew Woitaszek wrote:
>>>
>>>>
>>>> Good afternoon,
>>>>
>>>> Michael Oberg and I are attempting to get PVFS2 working on NCAR's
>>>> 1-rack BlueGene/L system using ZeptoOS. We ran into a snag at over
>>>> 8 BG/L I/O nodes (>256 compute nodes).
>>>>
>>>> We've been using the mpi-io-test program shipped with PVFS2 to test
>>>> the system. For cases up to and including 8 I/O nodes (256
>>>> coprocessor or 512 virtual node mode tasks), everything works fine.
>>>> Larger jobs fail with file not found error messages, such as:
>>>>
>>>> MPI_File_open: File does not exist, error stack:
>>>> ADIOI_BGL_OPEN(54): File /pvfs2/mattheww/_file_0512_co does not exist
>>>>
>>>> The file is created on the PVFS2 filesystem and has a zero-byte
>>>> size. We've run the tests with 512 tasks on 256 nodes, and it
>>>> successfully created an 8589934592-byte file. Going to 257 nodes
>>>> fails.
>>>>
>>>> Has anyone seen this behavior before? Are there any PVFS2 server or
>>>> client configuration options that you would recommend for a BG/L
>>>> installation like this?
>>>>
>>>> Thanks for your time,
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> Pvfs2-users at beowulf-underground.org
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>
>>>
>>
>
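
The 8300-byte figure quoted above can be checked with simple header arithmetic. This is a rough sketch rather than a confirmed number: it assumes IPv4 without options and NFS over UDP (20 + 8 bytes of IP and UDP headers). The size of the RPC and NFS headers that precede the 8192 bytes of data varies with NFS version and credentials and is not established anywhere in this thread, so it is left as a parameter.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
UDP_HEADER = 8    # bytes
TCP_HEADER = 20   # bytes, TCP header without options


def rpc_headroom(mtu, payload, transport_header=UDP_HEADER):
    """Bytes left for RPC/NFS headers after IP, transport and data payload.

    A negative result means the payload alone cannot fit in one packet.
    """
    return mtu - IPV4_HEADER - transport_header - payload


def min_mtu(payload, rpc_overhead, transport_header=UDP_HEADER):
    """Smallest MTU that carries payload plus an assumed RPC/NFS overhead."""
    return payload + rpc_overhead + IPV4_HEADER + transport_header


if __name__ == "__main__":
    nfs_block = 8192  # rsize/wsize discussed in the thread
    for mtu in (8000, 8300, 9000):
        print("MTU", mtu, "headroom", rpc_headroom(mtu, nfs_block))
```

Under these assumptions an MTU of 8000 cannot hold an 8192-byte block in one packet, and 8300 leaves 80 bytes for RPC and NFS headers. Whether 80 bytes is enough depends on the actual header sizes on the wire, which would need to be measured (for example from a packet capture) before settling on a value.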