[Pvfs2-users] "Remote Endpoint is Closed" error starting pvfs2-server

Phil Carns carns at mcs.anl.gov
Tue Aug 24 19:30:52 EDT 2010


>>> I modified the header file, recompiled, and ran it again -- here is the relevant portion of the debug output:
>>>
>>>
>>>> [D 08/24 20:06] Passing mx://renton:0:3 as BMI listen address.
>>>> [D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
>>>> [D 08/24 20:06] Server using shm key hint: 1937657261
>>>> [D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 11
>>>> [D 08/24 20:06] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 12
>>>> [D 08/24 20:06] dbpf_thread_initialize: initialized
>>>> [D 08/24 20:06] dbpf_thread_function started
>>>> [D 08/24 20:06] [SYNC_COALESCE]: dbpf_sync_context_init for context 0 called
>>>> [D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
>>>> [D 08/24 20:06] bmi_mx: Setting peer mx://begbie:0:3 to BMX_PEER_WAIT.
>>>> [D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
>>>> [D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 0.
>>>> [D 08/24 20:06] bmi_mx: Setting peer mx://tommy:0:3 to BMX_PEER_WAIT.
>>>> [D 08/24 20:06] bmi_mx: bmx_peer_addref refcount was 1.
>>>> OMX: Completing iconnect request: Remote Endpoint is Closed
>>>>
>>> I don't really understand what is supposed to happen here -- the other two machines are not running a pvfs2 server at the moment because all three of them have this error and close before the others can be started.  Surely what should happen is some kind of polling loop waiting for the other servers to be ready?  That seems to be what is implied by going into the "BMX_PEER_WAIT" state, but it seems to be having a problem maintaining that state for some reason.
>>>
>>> Josh.
>>>
>> This output is from renton. It tries to connect to begbie and tommy, but they do not have an open MX endpoint. The connect fails and PVFS2 gives up.
>>
>> I have not experimented much with multiple servers. Perhaps someone else can chime in as to whether there should be specific order to bringing up servers (e.g. in Lustre the metadata server must come up before the storage servers).
>>
>> Another possibility is that PVFS2 retries the connection when using sockets but does not when using MX. Can anyone verify this?
>>
>> Lastly, I expected to see more messages from bmi_mx. Is BMX_DB_CONN set in the BMX_DB_MASK?
>>
>> Scott
>>

PVFS does sit in a loop and wait for the other servers to come up.  It 
doesn't matter what order they are started in, as long as they all 
eventually start.  My suspicion is that the open-mx library might 
be calling exit() or abort() when it encounters an error, causing the 
server to quit before it gets a chance to retry communication.
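
To make the expected behavior concrete, here is a minimal sketch of that kind 
of wait loop.  It is an illustration only, not the PVFS or bmi_mx code: the 
names wait_for_peers, try_connect and make_fake_connect are hypothetical, and 
the fake connect function just simulates peers that come up after a few 
failed attempts.  The point is that a failed connect has to return an error 
to the caller; if the library exits the process instead, the loop never gets 
to its next round.

```python
import time


def wait_for_peers(peers, try_connect, retry_delay=0.0, max_rounds=None):
    """Poll until every peer accepts a connection.

    try_connect(peer) is a hypothetical callable that returns True on
    success and False on failure. It must report failure by returning,
    not by terminating the process, or no retry can happen.
    Returns (rounds_used, peers_still_pending).
    """
    pending = set(peers)
    rounds = 0
    while pending:
        if max_rounds is not None and rounds >= max_rounds:
            break  # give up; the caller decides what to do next
        rounds += 1
        # Iterate over a sorted copy so we can remove peers as they connect.
        for peer in sorted(pending):
            if try_connect(peer):
                pending.discard(peer)
        if pending:
            time.sleep(retry_delay)  # back off before the next round
    return rounds, pending


def make_fake_connect(ready_after):
    """Simulate peers that refuse the first ready_after[peer] attempts."""
    attempts = {}

    def try_connect(peer):
        attempts[peer] = attempts.get(peer, 0) + 1
        return attempts[peer] > ready_after[peer]

    return try_connect


# Simulated run: tommy is already up, begbie comes up after two failures.
fake = make_fake_connect({"mx://begbie:0:3": 2, "mx://tommy:0:3": 0})
rounds, pending = wait_for_peers(["mx://begbie:0:3", "mx://tommy:0:3"], fake)

# A peer that never comes up within the allowed number of rounds.
never = make_fake_connect({"mx://begbie:0:3": 10})
rounds_limited, pending_limited = wait_for_peers(
    ["mx://begbie:0:3"], never, max_rounds=2
)
```

With a loop like this the start order is irrelevant: each server keeps 
retrying the peers that are not up yet and drops them from the pending set 
as they appear.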

What version of open-mx are you using?

-Phil