[Pvfs2-users] Re: PVFS v2.8.x initial write performance

Phil Carns carns at mcs.anl.gov
Mon Apr 6 10:33:44 EDT 2009


Thanks for the extra information, Tony.  That's too bad that the 
metasync option wasn't helping for your configuration.

I don't think any relevant default configuration settings have changed
since 2.7.1.  I think it is just the size update issue that I mentioned
earlier, since it only really shows up in the initial write phase.  We
just have a performance regression there for serial applications.

I don't have an answer for you yet, but we are looking into it.

-Phil

Tony Kew wrote:
> Dear Phil,
> 
> I ran the iozone job manually four times.  Only once was there any output
> in any of the server logs (after the filesystem was initially built, the
> servers started and the filesystem mounted).
> 
> The one iozone run that gave errors failed after the writes had completed,
> while running the initial read pass:
> 
> ################################################################################
> Errors from node c14n29 PVFSv2 server logfile:
> ################################################################################
> [E 03/23 12:38] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 3647671.
> [E 03/23 12:38] fp_multiqueue_cancel: flow proto cancel called on 0x2a9586f540
> [E 03/23 12:38] fp_multiqueue_cancel: I/O error occurred
> [E 03/23 12:38] handle_io_error: flow proto error cleanup started on 0x2a9586f540: Operation cancelled (possibly due to timeout)
> [E 03/23 12:38] PVFS2 server: signal 11, faulty address is (nil), from (nil)
> [E 03/23 12:38] [bt] [(nil)]
> 
> 
> Other than this (which I consider an anomaly for now...), the average
> performance numbers for the three iozone runs that completed follow:
> 
>      Initial write:  37,625.48 KB/sec
>            Rewrite: 149,830.93 KB/sec
>               Read: 170,758.41 KB/sec
>            Re-read: 206,256.47 KB/sec
> 
> I would say there is a definite performance issue with initial writes.
> 
> Are there any filesystem configuration defaults that may have changed
> perhaps?...
> 
> Thanks Much,
> Tony
> 
> Tony Kew
> SAN Administrator
> The Center for Computational Research
> New York State Center of Excellence
> in Bioinformatics & Life Sciences
> 701 Ellicott Street, Buffalo, NY 14203
> 
> CoE Office: (716) 881-8930          Fax: (716) 849-6656
> CSE Office: (716) 645-3797 x2174
>      Cell: (716) 560-0910
> 
> "I love deadlines, I love the whooshing noise they make as they go by."
>                                                          Douglas Adams
> 
> 
> 
> Tony Kew wrote:
>> Dear Phil,
>>
>> The filesystem configuration in my tests is built as follows:
>>
>> pvfs2-genconfig --quiet --protocol tcp --tcpport --notrovesync --trove-method "alt-aio" \
>>     --server-job-timeout 60 --fsid=_a_job_specific_id --fsname=_a_job_specific_name_ \
>>     --storage _a_job_specific_storage_space_ --logfile _a_job_specific_logfile_ \
>>     --ioservers [list of nodes in the job] --metaservers [list of nodes in the job]
>>
>> I believe the "--notrovesync" option already sets "TroveSyncMeta no" in
>> the config file.
>>
>> I'm running an interactive PBS job to make sure the "msgpair failed" error
>> messages are generated during the filesystem build, and not subsequently.
>> That certainly appears to be the case, but I'm running an iozone job
>> manually to be sure...
>>
>> Tony
>>
>>
>>
>> Phil Carns wrote:
>>> Hi Tony,
>>>
>>> This is most likely due to a change in how PVFS 2.8.x tracks file 
>>> size during writes beyond EOF.  It now stores file size explicitly in 
>>> Berkeley DB for each data file.  This is required for the new 
>>> directio storage method, but we applied it to the other methods as 
>>> well to simplify compatibility.
>>>
>>> A test that you could run to confirm this would be to change your 
>>> PVFS server configuration file to this in the StorageHints section:
>>>
>>> TroveSyncMeta no
>>>
>>> With that set to "no", PVFS will still synchronize metadata 
>>> (including the explicit size field), but it may delay synchronization 
>>> until after an acknowledgement has been sent to the client.  This 
>>> will probably hide the size update cost for a serial application.
>>>
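For readers following along, the change suggested above would look roughly like the fragment below.  This is a sketch that assumes the Apache-style section layout of the server configuration files written by pvfs2-genconfig; only the TroveSyncMeta line comes from this thread, and any other options already present in the section are omitted here and should be left as they are.

```
<StorageHints>
    TroveSyncMeta no
</StorageHints>
```
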
>>> The size update overhead will only show up for serialized 
>>> applications that issue small writes beyond EOF (like iozone in the 
>>> "initial write" phase).  If it were a parallel application, PVFS 
>>> would coalesce the size updates to reduce overhead.  If it were a 
>>> serial application that used larger writes, the size update cost 
>>> would be amortized over a longer period of time.
>>>
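The amortization argument above can be illustrated with a toy model.  This is only a sketch: it assumes each serial write pays one fixed size-update cost on top of its transfer time, and the streaming rate and per-write cost used below are hypothetical numbers, not measured PVFS values.

```python
def effective_throughput_kb_s(write_kb, stream_kb_s, size_update_s):
    """Throughput of a serial writer that pays a fixed cost per write.

    write_kb      -- size of each write in KB
    stream_kb_s   -- raw streaming rate in KB/sec (hypothetical)
    size_update_s -- fixed size-update cost per write in seconds (hypothetical)
    """
    transfer_s = write_kb / stream_kb_s
    return write_kb / (transfer_s + size_update_s)


if __name__ == "__main__":
    # Larger writes spread the same fixed cost over more data.
    for write_kb in (4, 64, 1024, 16384):
        rate = effective_throughput_kb_s(
            write_kb, stream_kb_s=200_000, size_update_s=0.001
        )
        print(f"{write_kb:>6} KB writes: {rate:>10.0f} KB/sec")
```

In this model the effective rate approaches the streaming rate as the write size grows, which is the sense in which the size update cost is amortized for larger writes.
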
>>> Regarding your log file warnings, those are normal.  In 2.8.x the 
>>> servers communicate with each other on startup to precreate datafile 
>>> objects.  A server issues those warnings on occasion if one or more 
>>> of the others is not up and running yet when it tries to do that, but 
>>> the warnings will stop as soon as all servers are available.
>>>
>>> thanks,
>>> -Phil
>>>
>>>
>>> Tony Kew wrote:
>>>> Dear Phil,
>>>>
>>>> Irrespective of the --enable-mmap-racache option, there does seem to be
>>>> a marked performance drop between PVFS version 2.7.1 (with the 20 or
>>>> so patches along the way - not that any were for performance insofar as
>>>> I am aware) and version 2.8.x
>>>>
>>>> Version 2.8.1 was built with the following configure options:
>>>> ./configure --prefix=/usr \
>>>> --libdir=/usr/lib64 \
>>>> --enable-perf-counters \
>>>> --enable-fast \
>>>> --with-kernel=%{_kernelsrcdir} \
>>>> --enable-shared
>>>>
>>>> Version 2.7.1 (fully patched) was configured as above, with the
>>>> addition of the --enable-mmap-racache option.
>>>>
>>>> I ran three iozone tests for each of the tested distributions, using
>>>> a PBS batch job that creates a (new) filesystem across all the nodes
>>>> in the job.  The batch job runs iozone in parallel, with one
>>>> data stream on each node.  The test directory is configured as
>>>> a stripe across all the nodes, so each node is writing to all the other
>>>> nodes during the test.
>>>>
>>>> The average I/O numbers from three iozone runs, writing to a
>>>> directory configured using the "Simple Stripe" distribution:
>>>>
>>>> v2.7.1:
>>>>
>>>>  Initial write: 219,306.19 KB/sec
>>>>        Rewrite: 130,799.13 KB/sec
>>>>           Read: 183,249.66 KB/sec
>>>>        Re-read: 191,565.02 KB/sec
>>>>
>>>>
>>>> v2.8.1:
>>>>
>>>>  Initial write:  40,381.42 KB/sec
>>>>        Rewrite: 132,908.15 KB/sec
>>>>           Read: 203,758.06 KB/sec
>>>>        Re-read: 276,100.11 KB/sec
>>>>
>>>> For a TwoD Stripe distribution:
>>>>
>>>> v2.7.1:
>>>>
>>>>  Initial write: 343,876.68 KB/sec
>>>>        Rewrite: 229,740.04 KB/sec
>>>>           Read: 167,045.91 KB/sec
>>>>        Re-read: 166,417.03 KB/sec
>>>>
>>>> v2.8.1:
>>>>
>>>>  Initial write: 140,253.67 KB/sec
>>>>        Rewrite: 201,923.75 KB/sec
>>>>           Read: 182,109.70 KB/sec
>>>>        Re-read: 205,073.70 KB/sec
>>>>
>>>>
>>>>
>>>>
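To put the figures above side by side, the short script below computes v2.8.1 throughput as a fraction of v2.7.1 throughput for each phase.  The numbers are copied from the results in this thread; the dictionary layout and the function name are just illustrative choices.  On these figures the v2.8.1 initial write comes out at about 18% of the v2.7.1 rate for Simple Stripe and about 41% for TwoD Stripe.

```python
# Average throughput in KB/sec, copied from the iozone results in this thread.
RESULTS = {
    "Simple Stripe": {
        "v2.7.1": {"Initial write": 219306.19, "Rewrite": 130799.13,
                   "Read": 183249.66, "Re-read": 191565.02},
        "v2.8.1": {"Initial write": 40381.42, "Rewrite": 132908.15,
                   "Read": 203758.06, "Re-read": 276100.11},
    },
    "TwoD Stripe": {
        "v2.7.1": {"Initial write": 343876.68, "Rewrite": 229740.04,
                   "Read": 167045.91, "Re-read": 166417.03},
        "v2.8.1": {"Initial write": 140253.67, "Rewrite": 201923.75,
                   "Read": 182109.70, "Re-read": 205073.70},
    },
}


def relative_throughput(distribution, phase):
    """Return v2.8.1 throughput as a fraction of v2.7.1 throughput."""
    runs = RESULTS[distribution]
    return runs["v2.8.1"][phase] / runs["v2.7.1"][phase]


if __name__ == "__main__":
    for distribution, runs in RESULTS.items():
        for phase in runs["v2.7.1"]:
            frac = relative_throughput(distribution, phase)
            print(f"{distribution:13} {phase:13} {frac:6.1%}")
```
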
>>>> In the server log files for the v2.8.1 runs there are many of these:
>>>>
>>>> [E 03/06 16:26] Warning: msgpair failed to tcp://c14n24:3334, will retry: Connection refused
>>>>
>>>> ...but only at the time when the filesystem is created, so I don't
>>>> believe they have any bearing on the test results.
>>>>
>>>> Let me know if I can provide any more info, or if further tests
>>>> would be of use....
>>>>
>>>>
>>>> Many Thanks,
>>>> Tony
>>>>
>>>> [...]


