[PVFS-users] mmargo
Ron W. Green
rwgree at sandia.gov
Fri Apr 2 23:34:10 EST 2004
You aren't going to like it: we take system time and restart all the
iods, pvfsds, and mgr. Then it runs OK for a while.
It seems we can restart iods and mgr without killing user jobs. However,
I've seen user jobs die when I restart pvfsclient on the client nodes.
I have an open question as to whether I can get away with just
restarting iods and mgr. If you find out, let me know.
ron
Martin Margo wrote:
> Ron,
>
> Thanks for the heads up. I thought that the January patch would fix
> the problem; at least some other folks on this list said so.
> What is your workaround for this problem?
>
> -Martin
> On Apr 2, 2004, at 2:18 PM, Ron W. Green wrote:
>
>> We're using 1.6.2 with the January patch and are seeing these
>> enqueue messages.
>>
>> If it helps, we have 236 client nodes talking to the one mgr node,
>> and have 6 iod nodes. Is it possible that we need a much greater
>> queue depth to accommodate long latencies in talking to the mgr?
>> Are the clients spilling out of their local queues with requests
>> waiting on mgr? I suspect it may be a scaling issue.
>>
>> Thanks, I do appreciate the work being done on PVFS. It is improving.
>>
>> ron
>>
>> Nathan Poznick wrote:
>>
>>> _______________________________________________
>>> PVFS-users mailing list
>>> PVFS-users at www.beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs-users
>>>
>>> Date: Fri, 2 Apr 2004 10:27:54 -0700
>>>
>>> Thus spake Ron W. Green:
>>>
>>>> Martin,
>>>>
>>>> We seem to get those "failed on enqueue" messages quite often. Of
>>>> course, our cluster is much bigger too. I've scratched my head on
>>>> this, and looked at the code. As best I can tell, it happens when the
>>>> pvfs client attempts a metadata operation on the mgr node. I suspect
>>>> that the mgr is slow in responding and/or has run out of queueing
>>>> space to enqueue the metadata operation request (create or stat).
>>>>
>>>> Anyone on the list know if mgr has a fixed queue size? Or can we
>>>> jack up the client timeouts? Multithread mgr?
>>>>
>>>> From our testing we're quite convinced the problem lies in mgr,
>>>> that it can't keep up with metadata requests from the clients.
>>>>
>>>
>>> Actually those messages are not referring to any sort of queuing on
>>> the manager at all - they refer to the pvfsdev_enqueue/dequeue
>>> functions in the kernel module which add/remove messages from the
>>> /dev/pvfs-req device.
>>>
>>>
>>>
>>
>> --
>> Ron W. Green
>> rwgree at sandia.gov
>> +1-505-284-1600
>>
>> Sr. Engineer, ICC Applications Support
>>
>
>
--
Ron W. Green
rwgree at sandia.gov
+1-505-284-1600
Sr. Engineer, ICC Applications Support