[Pvfs2-developers] libpvfs2 usage

Sam Lang slang at mcs.anl.gov
Wed Oct 18 17:35:12 EDT 2006


On Oct 18, 2006, at 2:18 PM, Sam Lang wrote:

>
> On Oct 16, 2006, at 5:40 PM, Brett Bode wrote:
>
>> Hello,
>>    We have modified an existing application to directly call
>> libpvfs2. Our pvfs2 setup has 6 servers and is set up to run pvfs2
>> over OpenIB verbs. We borrowed the code more or less from
>> pvfs2-cp. This seems to work and we have had several successful
>> runs. However, we have also had a couple of hangs on one node. The
>> traceback for the hang is:
>>
>> #0  0x00002ab9874a34bf in poll () from /lib/libc.so.6
>> #1  0x0000000001cbea67 in BMI_ib_testcontext ()
>> #2  0x0000000001c8feb4 in BMI_testcontext ()
>> #3  0x0000000001c99624 in PINT_thread_mgr_bmi_push ()
>> #4  0x0000000001c950d3 in do_one_work_cycle_all ()
>> #5  0x0000000001c95883 in job_testcontext ()
>> #6  0x0000000001ca37e4 in PINT_client_state_machine_test ()
>> #7  0x0000000001ca3c00 in PINT_client_wait_internal ()
>> #8  0x0000000001c7df71 in PVFS_sys_io ()
>> #9  0x0000000001c6e253 in flushBuffer ()
>>     at /afs/.scl.ameslab.gov/project/nodeimg/amd64.test/usr/src/gamess-pvfs/bypassIO-pvfs.c:355
>> #10 0x0000000005eb27b0 in userFilePos ()
>>
>> Eventually we time out and die. So the first question is: do you
>> have any suggestions as to where to look for the cause of the  
>> hang? That is a write, but I have seen it fail now during a read  
>> as well (it died on the 12th pass through after reading the  
>> complete file 11 times).
>>
>> We also have several usage and/or tuning related questions. First  
>> off, when the file is created there are options for the  
>> "dfile_count" and the "strip_size". Thus far I have left them at  
>> defaults. Can you comment on what sort of values would be optimal
>> for sequentially accessed large files? Would matching the IO buffer
>> size the application passes to the strip size be useful?
>
> You're already seeing that matching the strip size and request
> size gives you far fewer cache misses, which is new info we can add
> to the tuning guide, or maybe Pete can come up with some
> optimizations around that.

Rob pointed out that it's not matching the two that you really want.
The ideal strip size needs to be large enough to prevent a single
request from being many multiples of the strip size, but small
enough that a request still spans all servers.  So for your specific
case, ideally you would have:

strip_size = request_size / number_of_servers
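
To make the arithmetic concrete, here is a rough sketch in C. It assumes
a plain round-robin layout, where consecutive strips go to consecutive
servers. The function names are made up for illustration and are not
part of libpvfs2, and rounding the division up (so that one strip per
server always covers the whole request) is my assumption, not something
the library requires.

```c
#include <stdint.h>

/* Hypothetical helper, not a libpvfs2 call.
 * Returns request_size / n_servers rounded up, so that n_servers
 * strips always cover one full request. Returns -1 on bad input. */
int64_t suggested_strip_size(int64_t request_size, int n_servers)
{
    if (request_size <= 0 || n_servers <= 0)
        return -1;
    return (request_size + n_servers - 1) / n_servers;
}

/* Hypothetical helper, not a libpvfs2 call.
 * Counts how many distinct servers the byte range
 * [offset, offset + size) touches, assuming strips are handed out
 * round-robin across n_servers. Returns -1 on bad input. */
int servers_touched(int64_t offset, int64_t size,
                    int64_t strip_size, int n_servers)
{
    if (offset < 0 || size <= 0 || strip_size <= 0 || n_servers <= 0)
        return -1;

    int64_t first_strip = offset / strip_size;
    int64_t last_strip = (offset + size - 1) / strip_size;
    int64_t strips = last_strip - first_strip + 1;

    /* Once the range covers n_servers strips, every server is hit. */
    return strips >= n_servers ? n_servers : (int)strips;
}
```

With the 1MB buffer and 6 servers from this thread,
suggested_strip_size(1048576, 6) comes out to 174763 bytes, and a 1MB
request starting at offset 0 with that strip size touches all 6 servers
with one strip each. If the strip size were as large as the request
(1048576), the same request would land on a single server.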

-sam

>   Usually the strip size is used to control the behavior of disk
> IO, as it means the trove layer is able to do reads and writes in
> larger chunks.  I think we've generally found that for larger
> workloads the default strip size is ideally matched to the size of
> requests.  I don't think just increasing the strip size will
> necessarily help for sequential accesses.
>
> Up to this point, the dfile_count has only been used to improve  
> performance of IO on smaller files, by setting the value to 1, so  
> that small requests are not broken down even further.  In your case  
> it probably makes sense to leave it at its default value.
>
> What 'tuning guide', you say?  It's currently a work in
> progress :-).  If anyone is interested in helping out, especially
> for the IB sections, we could really use it.
>
>>
>> We also have a problem with too many memory registrations when
>> running on our IBM EHCAs. The odd part is that I am using the
>> same 1MB buffer all the time, so I don't see why it seems to be
>> reregistered at each write. My write code looks like this:
>>
>>                 file_req = PVFS_BYTE;
>>                 ret = PVFS_Request_contiguous(ioSize, PVFS_BYTE,
>>                     &mem_req);
>>                 if (ret < 0) {
>>                     PVFS_perror("PVFS_Request_contiguous", ret);
>>                     return;
>>                 }
>>                 ret = PVFS_sys_write(target_object.ref, file_req,
>>                     bufferedFilePos, myBuffer, mem_req,
>>                     &credentials, &resp_io);
>>                 if (ret == 0) {
>>                      PVFS_Request_free(&mem_req);
>>             /*       return(resp_io.total_completed);*/
>>                 } else
>>                     PVFS_perror("PVFS_sys_write", ret);
>>
>> One question is what does PVFS_Request_contiguous actually do?
>
> It creates a request structure that essentially contains the size  
> and offset into the memory buffer.
>
>> Since I am using the same buffer all the time, would it be ok to
>> set up the request once and then reuse it so long as the io size is
>> the same?
>
> Yes.  The request structure doesn't get modified by the IO call.   
> You (correctly) use PVFS_BYTE for the file request.  The reason you  
> can't just use PVFS_BYTE for the memory request is that the size  
> has to be encapsulated in the request as well (while the file  
> request gets tiled based on the actual file size).
>
> -sam
>
>>
>> Thanks for any help you can provide,
>>
>> Brett
>>
>>
>> ____________________________________________
>> Dr. Brett Bode
>> 329 Wilhelm Hall
>> Ames Laboratory
>> Iowa State University
>> Ames, IA 50011              (515) 294-9192
>> brett at scl.ameslab.gov  FAX: (515) 294-4491
>> ____________________________________________
>>
>>
>>
>> _______________________________________________
>> Pvfs2-developers mailing list
>> Pvfs2-developers at beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>


