[Pvfs2-developers] Re: dtype I/O bugs

Sam Lang slang at mcs.anl.gov
Thu Jun 15 12:54:05 EDT 2006



On Jun 14, 2006, at 5:05 PM, Avery Ching wrote:

> I was able to identify at least one bug, I think.  The small I/O path
> is used whenever the amount of data going to the I/O servers is below
> the max_unexep_payload.  However, when the request gets to the server,
> small-io.sm calls PINT_Process_request() once and then calls
> job_trove_bstream_write_list once.  Then it returns.  If the number of
> stream offset-length pairs generated is greater than
> SMALL_IO_MAX_REGIONS, the operation won't finish.  This doesn't show
> up in the list I/O path since we break it up on 64 ol-pairs.  It shows
> up on the datatype I/O path since it doesn't get broken up.  You could
> probably trigger it in list I/O by just making SMALL_IO_MAX_REGIONS
> smaller.
>
> Suggestions for fixing:
>
> 1) (Preferred) Loop around the job_trove_bstream_write_list and
> job_trove_bstream_read_list calls to keep moving data until the entire
> datatype has been satisfied.
>
> 2) (Alternative) Make the offset-length pairs limit part of the
> requirement for small I/O.
>
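
For illustration only, option 1 above could look roughly like the sketch
below. This is not the small-io.sm code: struct region, next_batch(),
write_batch() and EXAMPLE_MAX_REGIONS are hypothetical stand-ins for the
request-processing pass, the list write call and SMALL_IO_MAX_REGIONS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-pass cap, standing in for SMALL_IO_MAX_REGIONS. */
#define EXAMPLE_MAX_REGIONS 4

/* One stream offset-length pair. */
struct region {
    long offset;
    long length;
};

/* Stand-in for one request-processing pass: copies at most `max` pending
 * regions into `out`, advances *next, and returns how many were copied. */
static size_t next_batch(const struct region *all, size_t total,
                         size_t *next, struct region *out, size_t max)
{
    size_t n = 0;
    while (*next < total && n < max)
        out[n++] = all[(*next)++];
    return n;
}

/* Stand-in for one list write: returns the bytes covered by the batch. */
static long write_batch(const struct region *batch, size_t n)
{
    long bytes = 0;
    for (size_t i = 0; i < n; i++)
        bytes += batch[i].length;
    return bytes;
}

/* The behaviour described above: a single pass, then return. */
static long write_once(const struct region *all, size_t total)
{
    struct region batch[EXAMPLE_MAX_REGIONS];
    size_t next = 0;
    size_t n = next_batch(all, total, &next, batch, EXAMPLE_MAX_REGIONS);
    return write_batch(batch, n);
}

/* Option 1: keep looping until every region has been handed off. */
static long write_all(const struct region *all, size_t total)
{
    struct region batch[EXAMPLE_MAX_REGIONS];
    size_t next = 0;
    size_t n;
    long written = 0;

    assert(all != NULL || total == 0);
    while ((n = next_batch(all, total, &next, batch,
                           EXAMPLE_MAX_REGIONS)) > 0)
        written += write_batch(batch, n);
    return written;
}
```

With ten 8-byte regions and a cap of 4, write_once() covers only 32 bytes
while write_all() covers all 80, which is the gap the loop is meant to close.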

Thanks for debugging this, Avery.  For now I went with option #2 since
it's easier :-).  If you find that small I/O is a big improvement for
list I/O then we can change it to do option 1.  Can you let me know if
this patch fixes the problem for you?
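
As a rough sketch of the idea behind option #2 (not the contents of the
attached patch), the small I/O eligibility test would look at the region
count as well as the payload size. The function name and both limit values
below are placeholders chosen for the example, not the real constants.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder limits for the example; the real values are defined in
 * the PVFS2 source and may differ. */
#define EXAMPLE_MAX_UNEXP_PAYLOAD    16384
#define EXAMPLE_SMALL_IO_MAX_REGIONS 64

/* Hypothetical check: returns 1 if a request may take the small I/O
 * path, 0 if it must fall back to the normal flow-based path. */
static int small_io_eligible(size_t total_bytes, size_t region_count)
{
    assert(EXAMPLE_SMALL_IO_MAX_REGIONS > 0);

    /* Existing condition: the data must fit in an unexpected message. */
    if (total_bytes > EXAMPLE_MAX_UNEXP_PAYLOAD)
        return 0;

    /* Added condition: the server must be able to handle every
     * offset-length pair in a single pass. */
    if (region_count > EXAMPLE_SMALL_IO_MAX_REGIONS)
        return 0;

    return 1;
}
```

Under these placeholder limits, a small request with 2048 regions (the
region_count in Rob's failing run) would no longer qualify for small I/O.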

Thanks,

-sam

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smallio.patch
Type: application/octet-stream
Size: 2907 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20060615/b6db27cc/smallio.obj
-------------- next part --------------


> Avery
>
> On Tue, 2006-06-13 at 17:44 -0500, Avery Ching wrote:
>> I was able to repeat the bug on the 4 server 20 client setup you had.
>> I also made it happen on 1 client and 2 servers.  It seems to work
>> fine with 1 server and 1 client or 1 server and 20 clients, therefore,
>> this probably is a multi-server issue.  I'll investigate further and
>> let you know the progress.  I hope it's not another one of those
>> PINT_Process_req() or flow type problems!
>>
>> Avery
>>
>> On Mon, 12 Jun 2006, Avery Ching wrote:
>>
>>> Yeah I have.  I am not sure exactly what the problem is to be honest.
>>> Basically that error message is just reporting what it got from the
>>> PVFS_Sys_write() call.  Therefore, it could be a lot of things.  The
>>> odd thing though is that it seems to happen at random places.  The
>>> test works fine for other sizes, just fails on certain ones.  I'm
>>> wondering whether it's related to the flow or Pint_process_request()
>>> problems we've been seeing on the listserv.  Oddly enough, when I did
>>> my IPDPS testing, I never ran into that issue for write, only for read
>>> (just sometimes - hence I had no read results =) ).
>>>
>>> Unfortunately, debugging the flow and PINT_process_req() areas is
>>> quite difficult.  I'll try and look into it a bit though.  At least
>>> see if I can repeat the bug.
>>>
>>> Avery
>>>
>>> suspect that the write call is not returning the correct amount of
>>> data processed.
>>>
>>> On Mon, 12 Jun 2006, Robert Latham wrote:
>>>
>>>> Hi Avery
>>>> I've got another hpio bug:
>>>>
>>>> with 4 servers, 20 clients, hpio ran for a long long time and then
>>>> died like this:
>>>> write | region_count | c-nc | datatype
>>>> ----------------time (seconds)--------------|-bandwidth (MB/s)|---test type---
>>>>   open  |   io   |  sync  | close  | total  |   IO   |  IOsyn |  region_count
>>>>   0.062 |  8.160 |  0.208 |  0.000 |  8.429 |  0.031 |  0.030 |  2048
>>>> ADIOI_PVFS2_StridedDtypeIO: Warning - PVFS_sys_read/write returned -1610612737 and completed -4611717612071138032 bytes.
>>>> ADIOI_PVFS2_StridedDtypeIO: Warning - PVFS_sys_read/write returned -1610612737 and completed -4611717612071081488 bytes.
>>>>
>>>> Seen anything like this before?
>>>> ==rob
>>>>
>>>> -- 
>>>> Rob Latham
>>>> Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
>>>> Argonne National Labs, IL USA                B29D F333 664A 4280 315B
>>>>
>>>
>