[Pvfs2-users] PVFS v2.8.x initial write performance

Tony Kew tonykew at ccr.buffalo.edu
Fri Mar 20 14:58:35 EST 2009


Dear Phil,

The filesystem configuration in my tests is built as follows:

pvfs2-genconfig --quiet --protocol tcp --tcpport --notrovesync \
 --trove-method "alt-aio" --server-job-timeout 60 \
 --fsid=_a_job_specific_id --fsname=_a_job_specific_name_ \
 --storage _a_job_specific_storage_space_ \
 --logfile _a_job_specific_logfile_ \
 --ioservers [list of nodes in the job] \
 --metaservers [list of nodes in the job]

I believe the "--notrovesync" option already sets "TroveSyncMeta no" in
the config file.
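
One way to confirm that setting rather than rely on the option's documented
behaviour is to read it back from the generated config file. The following is
a minimal sketch only: the function name and the sample text are hypothetical,
and the sample shows just the one directive discussed in this thread, not a
complete generated file.

```python
import re

def trove_sync_meta(config_text):
    """Return the TroveSyncMeta value found in the config text, or None."""
    match = re.search(
        r"^\s*TroveSyncMeta\s+(\S+)\s*$",
        config_text,
        re.MULTILINE | re.IGNORECASE,
    )
    return match.group(1).lower() if match else None

# Hypothetical excerpt; a real generated file contains many more directives.
sample = """
<StorageHints>
    TroveSyncMeta no
</StorageHints>
"""

print(trove_sync_meta(sample))
```

In practice the text would be read from whatever file pvfs2-genconfig wrote
for the job, and a result of None would mean the directive is absent.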

I'm running an interactive PBS job to make sure the "msgpair failed"
error messages are generated during the filesystem build, and not
subsequently. That certainly appears to be the case, but I'm running an
iozone job manually to be sure...
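
A check like this could also be scripted by tallying the warnings per
timestamp and comparing them against the time the filesystem was built. The
sketch below assumes the server log uses the same line format as the warning
quoted further down in this thread; the function name is hypothetical and the
third sample line is invented purely to show a non-matching entry.

```python
import re
from collections import Counter

# Matches lines like:
# [E 03/06 16:26] Warning: msgpair failed to tcp://c14n24:3334, will retry: ...
PATTERN = re.compile(
    r"^\[\w (\d{2}/\d{2} \d{2}:\d{2})\] Warning: msgpair failed to (\S+?),"
)

def msgpair_failures(lines):
    """Count 'msgpair failed' warnings per (timestamp, target) pair."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

sample = [
    "[E 03/06 16:26] Warning: msgpair failed to tcp://c14n24:3334, will retry: Connection refused",
    "[E 03/06 16:26] Warning: msgpair failed to tcp://c14n24:3334, will retry: Connection refused",
    "[D 03/06 16:27] an unrelated log line (invented for this example)",
]

for (stamp, target), n in sorted(msgpair_failures(sample).items()):
    print(stamp, target, n)
```

If every timestamp reported falls within the filesystem build window, the
warnings can be ruled out as a factor in the benchmark phase.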

Tony


Tony Kew
SAN Administrator
The Center for Computational Research
New York State Center of Excellence
 in Bioinformatics & Life Sciences
701 Ellicott Street, Buffalo, NY 14203

CoE Office: (716) 881-8930          Fax: (716) 849-6656
CSE Office: (716) 645-3797 x2174
      Cell: (716) 560-0910

"I love deadlines, I love the whooshing noise they make as they go by."
                                                          Douglas Adams



Phil Carns wrote:
> Hi Tony,
>
> This is most likely due to a change in how PVFS 2.8.x tracks file size 
> during writes beyond EOF.  It now stores file size explicitly in 
> Berkeley DB for each data file.  This is required for the new directio 
> storage method, but we applied it to the other methods as well to 
> simplify compatibility.
>
> A test that you could run to confirm this would be to change your PVFS 
> server configuration file to this in the StorageHints section:
>
> TroveSyncMeta no
>
> With that set to "no", PVFS will still synchronize metadata (including 
> the explicit size field), but it may delay synchronization until after 
> an acknowledgement has been sent to the client.  This will probably 
> hide the size update cost for a serial application.
>
> The size update overhead will only show up for serialized applications 
> that issue small writes beyond EOF (like iozone in the "initial write" 
> phase).  If it were a parallel application, PVFS would coalesce the 
> size updates to reduce overhead.  If it were a serial application that 
> used larger writes, the size update cost would be amortized over a 
> longer period of time.
>
> Regarding your log file warnings, those are normal.  In 2.8.x the 
> servers communicate with each other on startup to precreate datafile 
> objects.  It issues those warnings on occasion if one or more servers 
> is not up and running yet when it tries to do that, but it will stop 
> as soon as all servers are available.
>
> thanks,
> -Phil
>
>
> Tony Kew wrote:
>> Dear Phil,
>>
>> Irrespective of the --enable-mmap-racache option, there does seem to be
>> a marked performance drop between PVFS version 2.7.1 (with the 20 or
>> so patches along the way - not that any were for performance insofar as
>> I am aware) and version 2.8.x
>>
>> Version 2.8.1 was built with the following configure options:
>> ./configure --prefix=/usr \
>> --libdir=/usr/lib64 \
>> --enable-perf-counters \
>> --enable-fast \
>> --with-kernel=%{_kernelsrcdir} \
>> --enable-shared
>>
>> Version 2.7.1 (fully patched) was configured as above, with the
>> addition of the --enable-mmap-racache option.
>>
>> I ran three iozone tests for each of the tested distributions, using
>> a PBS batch job that creates a (new) filesystem across all the nodes
>> in the job.  The job then runs iozone in parallel, with one
>> data stream on each node.  The test directory is configured as
>> a stripe across all the nodes, so each node is writing to all the other
>> nodes during the test.
>>
>> The average I/O numbers from running three iozone runs,
>> writing to a directory configured using the "Simple Stripe" distribution:
>>
>> v2.7.1:
>>
>>  Initial write: 219,306.19 KB/sec
>>        Rewrite: 130,799.13 KB/sec
>>           Read: 183,249.66 KB/sec
>>        Re-read: 191,565.02 KB/sec
>>
>>
>> v2.8.1:
>>
>>  Initial write:  40,381.42 KB/sec
>>        Rewrite: 132,908.15 KB/sec
>>           Read: 203,758.06 KB/sec
>>        Re-read: 276,100.11 KB/sec
>>
>> For a TwoD Stripe distribution:
>>
>> v2.7.1:
>>
>>  Initial write: 343,876.68 KB/sec
>>        Rewrite: 229,740.04 KB/sec
>>           Read: 167,045.91 KB/sec
>>        Re-read: 166,417.03 KB/sec
>>
>> v2.8.1:
>>
>>  Initial write: 140,253.67 KB/sec
>>        Rewrite: 201,923.75 KB/sec
>>           Read: 182,109.70 KB/sec
>>        Re-read: 205,073.70 KB/sec
>>
>>
>>
>>
>> In the server log files for the v2.8.1 runs there are many of these:
>>
>> [E 03/06 16:26] Warning: msgpair failed to tcp://c14n24:3334, will 
>> retry: Connection refused
>>
>> ...but only at the time when the filesystem is created, so I don't
>> believe they have any bearing on the test results.
>>
>> Let me know if I can provide any more info, or if further tests
>> would be of use....
>>
>>
>> Many Thanks,
>> Tony
>>
>> [...]
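
To put the quoted figures in relative terms, the change from v2.7.1 to v2.8.1
can be computed per phase. This is a small sketch using only the averages
quoted above; the dictionary layout and function name are illustrative
choices, not part of any PVFS or iozone tooling.

```python
# Average throughput in KB/sec as (v2.7.1, v2.8.1), copied from the
# figures quoted above.
results = {
    "Simple Stripe": {
        "Initial write": (219306.19, 40381.42),
        "Rewrite": (130799.13, 132908.15),
        "Read": (183249.66, 203758.06),
        "Re-read": (191565.02, 276100.11),
    },
    "TwoD Stripe": {
        "Initial write": (343876.68, 140253.67),
        "Rewrite": (229740.04, 201923.75),
        "Read": (167045.91, 182109.70),
        "Re-read": (166417.03, 205073.70),
    },
}

def percent_change(old, new):
    """Relative change from old to new, in percent (negative = slower)."""
    return (new - old) / old * 100.0

for dist, phases in results.items():
    for phase, (v271, v281) in phases.items():
        change = percent_change(v271, v281)
        print(f"{dist:13s} {phase:13s} {change:+7.1f}%")
```

On these numbers the initial write phase is the outlier: it drops by roughly
80% for Simple Stripe and roughly 60% for TwoD Stripe, while the read and
re-read phases are faster in v2.8.1.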

