[Pvfs2-users] Re: PVFS v2.8.x initial write performance
Phil Carns
carns at mcs.anl.gov
Mon Apr 6 10:33:44 EDT 2009
Thanks for the extra information, Tony. That's too bad that the
metasync option wasn't helping for your configuration.
I don't think any relevant configuration defaults have changed since
2.7.1. I think it is just the size update issue that I mentioned
earlier, since it only really shows up in the initial write phase. We
just have a performance regression there for serial applications.
I don't have an answer for you yet, but we are looking into it.
-Phil
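
To put the regression in perspective, the initial-write figures quoted further down in this thread can be turned into slowdown factors. This is only a rough sketch: the throughput values are copied from the quoted iozone averages, while the dictionary layout and the `slowdown` helper are illustrative names, not part of any PVFS or iozone tooling.

```python
# Initial-write throughput (KB/sec) copied from the iozone averages
# quoted in this thread. Names below are illustrative only.
initial_write = {
    "Simple Stripe": {"2.7.1": 219306.19, "2.8.1": 40381.42},
    "TwoD Stripe": {"2.7.1": 343876.68, "2.8.1": 140253.67},
}


def slowdown(old_kb_s, new_kb_s):
    """Return how many times lower the new throughput is than the old."""
    return old_kb_s / new_kb_s


for dist, rates in initial_write.items():
    factor = slowdown(rates["2.7.1"], rates["2.8.1"])
    drop_pct = (1 - rates["2.8.1"] / rates["2.7.1"]) * 100
    print(f"{dist}: {factor:.2f}x slower ({drop_pct:.1f}% drop)")
```

The same helper can be applied to the rewrite, read, and re-read rows to confirm that only the initial write phase moves by a large factor.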
Tony Kew wrote:
> Dear Phil,
>
> I ran the iozone job manually four times. Only once was there any output
> in any of the server logs (after the filesystem was initially built, the
> servers started, and the filesystem mounted).
>
> The one iozone run that gave errors failed after the writes had completed,
> running the initial read pass:
>
> ################################################################################
>
> Errors from node c14n29 PVFSv2 server logfile:
> ################################################################################
>
> [E 03/23 12:38] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 3647671.
> [E 03/23 12:38] fp_multiqueue_cancel: flow proto cancel called on 0x2a9586f540
> [E 03/23 12:38] fp_multiqueue_cancel: I/O error occurred
> [E 03/23 12:38] handle_io_error: flow proto error cleanup started on 0x2a9586f540: Operation cancelled (possibly due to timeout)
> [E 03/23 12:38] PVFS2 server: signal 11, faulty address is (nil), from (nil)
> [E 03/23 12:38] [bt] [(nil)]
>
>
> Other than this (which I consider an anomaly for now...), the average
> performance numbers for the three iozone runs that completed follow:
>
> Initial write: 37,625.48 KB/sec
> Rewrite: 149,830.93 KB/sec
> Read: 170,758.41 KB/sec
> Re-read: 206,256.47 KB/sec
>
> I would say there is a definite performance issue with initial writes.
>
> Are there any filesystem configuration defaults that may have changed,
> perhaps?
>
> Thanks Much,
> Tony
>
> Tony Kew
> SAN Administrator
> The Center for Computational Research
> New York State Center of Excellence
> in Bioinformatics & Life Sciences
> 701 Ellicott Street, Buffalo, NY 14203
>
> CoE Office: (716) 881-8930 Fax: (716) 849-6656
> CSE Office: (716) 645-3797 x2174
> Cell: (716) 560-0910
>
> "I love deadlines, I love the whooshing noise they make as they go by."
> Douglas Adams
>
>
>
> Tony Kew wrote:
>> Dear Phil,
>>
>> The filesystem configuration in my tests is built as follows:
>>
>> pvfs2-genconfig --quiet --protocol tcp --tcpport --notrovesync \
>>   --trove-method "alt-aio" --server-job-timeout 60 \
>>   --fsid=_a_job_specific_id --fsname=_a_job_specific_name_ \
>>   --storage _a_job_specific_storage_space_ \
>>   --logfile _a_job_specific_logfile_ \
>>   --ioservers [list of nodes in the job] \
>>   --metaservers [list of nodes in the job]
>>
>> I believe the "--notrovesync" option already sets "TroveSyncMeta no"
>> in the config
>> file.
>>
>> I'm running an interactive PBS job to make sure the "msgpair failed"
>> error messages
>> are generated during the filesystem build, and not subsequently -
>> certainly it
>> appears to be the case, but I'm running an iozone job manually to be
>> sure...
>>
>> Tony
>>
>>
>>
>>
>>
>> Phil Carns wrote:
>>> Hi Tony,
>>>
>>> This is most likely due to a change in how PVFS 2.8.x tracks file
>>> size during writes beyond EOF. It now stores file size explicitly in
>>> berkeley db for each data file. This is required for the new
>>> directio storage method, but we applied it to the other methods as
>>> well to simplify compatibility.
>>>
>>> A test that you could run to confirm this would be to change your
>>> PVFS server configuration file to this in the StorageHints section:
>>>
>>> TroveSyncMeta no
>>>
>>> With that set to "no", PVFS will still synchronize metadata
>>> (including the explicit size field), but it may delay synchronization
>>> until after an acknowledgement has been sent to the client. This
>>> will probably hide the size update cost for a serial application.
>>>
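
Before rerunning the benchmark, it may be worth confirming what the generated server configuration actually contains. The following is a minimal sketch that looks up `TroveSyncMeta` in the StorageHints section. The sample text and its angle-bracket section syntax are an assumed fragment written for illustration, not output copied from a real `pvfs2-genconfig` run, and `storage_hint` is a hypothetical helper name.

```python
import re

# Assumed, hand-written fragment for illustration only; a real
# configuration file will contain many more sections and settings.
SAMPLE_CONF = """\
<StorageHints>
    TroveSyncMeta no
</StorageHints>
"""


def storage_hint(conf_text, key):
    """Return the value of `key` in the first StorageHints block, or None."""
    block = re.search(
        r"<StorageHints>(.*?)</StorageHints>", conf_text, re.S | re.I
    )
    if block is None:
        return None
    for line in block.group(1).splitlines():
        parts = line.split()
        # Each setting is assumed to be "Name value" on its own line.
        if len(parts) >= 2 and parts[0].lower() == key.lower():
            return parts[1]
    return None


print(storage_hint(SAMPLE_CONF, "TroveSyncMeta"))
```

In practice the text would be read from the server configuration file for the job rather than from an inline string.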
>>> The size update overhead will only show up for serialized
>>> applications that issue small writes beyond EOF (like iozone in the
>>> "initial write" phase). If it were a parallel application, PVFS
>>> would coalesce the size updates to reduce overhead. If it were a
>>> serial application that used larger writes, the size update cost
>>> would be amortized over a longer period of time.
>>>
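
The amortization argument above can be illustrated with a toy model. It assumes that every write beyond EOF pays one fixed size-update cost on top of its transfer time. The bandwidth and cost values are invented for illustration and are not PVFS measurements, and the model ignores the coalescing that applies to parallel writers.

```python
def effective_throughput(write_size_kb, bandwidth_kb_s, size_update_s):
    """Throughput when each write pays a fixed size-update cost.

    Hypothetical model: time per write = transfer time + fixed overhead.
    """
    transfer_s = write_size_kb / bandwidth_kb_s
    return write_size_kb / (transfer_s + size_update_s)


# Illustrative values only (assumptions, not measured on PVFS).
BANDWIDTH_KB_S = 200_000.0  # raw transfer rate without any overhead
SIZE_UPDATE_S = 0.001       # fixed cost charged per size update

for size_kb in (4, 64, 1024, 16384):
    rate = effective_throughput(size_kb, BANDWIDTH_KB_S, SIZE_UPDATE_S)
    print(f"{size_kb:>6} KB writes: {rate:,.0f} KB/sec")
```

Under these assumptions, small writes are dominated by the fixed cost, while large writes approach the raw transfer rate because the same cost is spread over more data.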
>>> Regarding your log file warnings, those are normal. In 2.8.x the
>>> servers communicate with each other on startup to precreate datafile
>>> objects. It issues those warnings on occasion if one or more servers
>>> is not up and running yet when it tries to do that, but it will stop
>>> as soon as all servers are available.
>>>
>>> thanks,
>>> -Phil
>>>
>>>
>>> Tony Kew wrote:
>>>> Dear Phil,
>>>>
>>>> Irrespective of the --enable-mmap-racache option, there does seem to be
>>>> a marked performance drop between PVFS version 2.7.1 (with the 20 or
>>>> so patches along the way - not that any were for performance insofar as
>>>> I am aware) and version 2.8.x
>>>>
>>>> Version 2.8.1 was built with the following configure options:
>>>> ./configure --prefix=/usr \
>>>> --libdir=/usr/lib64 \
>>>> --enable-perf-counters \
>>>> --enable-fast \
>>>> --with-kernel=%{_kernelsrcdir} \
>>>> --enable-shared
>>>>
>>>> version 2.7.1 (fully patched) configured as above, with the addition
>>>> of the --enable-mmap-racache option.
>>>>
>>>> I ran three iozone tests for each of the tested distributions, using
>>>> a PBS batch job that creates a (new) filesystem across all the nodes
>>>> in the job. The iozone job runs a parallel iozone job, with one
>>>> data stream on each node. The test directory is configured as
>>>> a stripe across all the nodes, so each node is writing to all the other
>>>> nodes during the test.
>>>>
>>>> The average I/O numbers from running three iozone runs,
>>>> writing to a directory configured using the "Simple Stripe"
>>>> distribution:
>>>>
>>>> v2.7.1:
>>>>
>>>> Initial write: 219,306.19 KB/sec
>>>> Rewrite: 130,799.13 KB/sec
>>>> Read: 183,249.66 KB/sec
>>>> Re-read: 191,565.02 KB/sec
>>>>
>>>>
>>>> v2.8.1:
>>>>
>>>> Initial write: 40,381.42 KB/sec
>>>> Rewrite: 132,908.15 KB/sec
>>>> Read: 203,758.06 KB/sec
>>>> Re-read: 276,100.11 KB/sec
>>>>
>>>> For a TwoD Stripe distribution:
>>>>
>>>> v2.7.1:
>>>>
>>>> Initial write: 343,876.68 KB/sec
>>>> Rewrite: 229,740.04 KB/sec
>>>> Read: 167,045.91 KB/sec
>>>> Re-read: 166,417.03 KB/sec
>>>>
>>>> v2.8.1:
>>>>
>>>> Initial write: 140,253.67 KB/sec
>>>> Rewrite: 201,923.75 KB/sec
>>>> Read: 182,109.70 KB/sec
>>>> Re-read: 205,073.70 KB/sec
>>>>
>>>>
>>>>
>>>>
>>>> In the server log files for the v2.8.1 runs there are many of these:
>>>>
>>>> [E 03/06 16:26] Warning: msgpair failed to tcp://c14n24:3334, will retry: Connection refused
>>>>
>>>> ...but only at the time when the filesystem is created, so I don't
>>>> believe they have any bearing on the test results.
>>>>
>>>> Let me know if I can provide any more info, or if further tests
>>>> would be of use....
>>>>
>>>>
>>>> Many Thanks,
>>>> Tony
>>>>
>>>> [...]