[Pvfs2-users] "Remote Endpoint is Closed" error starting pvfs2-server

Scott Atchley atchley at myri.com
Tue Aug 24 20:54:36 EDT 2010


On Aug 24, 2010, at 7:35 PM, Joshua Randall wrote:

>>> Lastly, I expected to see some more messages from bmi_mx. Is BMX_DB_CONN set in the BMX_DB_MASK?
> BMX_DB_CONN was set in BMX_DB_MASK, yes -- though now I've just set BMX_DB_ALL instead, so here is a run with the full bmi_mx debug output:
> 
>> [S 08/25 00:19] PVFS2 Server on node renton version 2.8.2 starting...
>> [D 08/25 00:19] Logging all (mask 18446744073709551615)
>> [D 08/25 00:19] PINT_encode_initialize
>> [D 08/25 00:19] lebf_initialize
>> [D 08/25 00:19] check_req_size
>> [D 08/25 00:19] encode_common
>> <snip>
>> [D 08/25 00:19] lebf_encode_rel
>> [D 08/25 00:19] check_resp_size
>> [D 08/25 00:19] encode_common
>> [D 08/25 00:19] lebf_encode_resp
>> [D 08/25 00:19] lebf_encode_rel
>> [D 08/25 00:19] Passing mx://renton:0:3 as BMI listen address.
>> [D 08/25 00:19] bmi_mx: entering BMI_mx_method_addr_lookup.
>> [D 08/25 00:19] bmi_mx: BMI_mx_method_addr_lookup with id mx://renton:0:3.
>> [D 08/25 00:19] bmi_mx: entering bmx_alloc_method_addr.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_alloc_method_addr.
>> [D 08/25 00:19] bmi_mx: exiting  BMI_mx_method_addr_lookup.
>> [D 08/25 00:19] bmi_mx: entering BMI_mx_initialize.
>> OMX: Emulating MX_DISABLE_SHMEM as OMX_DISABLE_SHARED
>> OMX: Forcing shared comms to disabled
>> OMX: Setting 4 bits of context id at offset 60 in matching
>> [D 08/25 00:19] bmi_mx: entering bmx_ctx_init.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_ctx_init.
>> <snip>
>> [D 08/25 00:19] bmi_mx: entering bmx_ctx_init.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_ctx_init.
>> [D 08/25 00:19] bmi_mx: exiting  BMI_mx_initialize.
>> [D 08/25 00:19] Server using shm key hint: 1937657261
>> [D 08/25 00:19] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 11
>> [D 08/25 00:19] bmi_mx: entering BMI_mx_set_info.
>> [D 08/25 00:19] bmi_mx: exiting  BMI_mx_set_info.
>> [D 08/25 00:19] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 12
>> [D 08/25 00:19] bmi_mx: entering BMI_mx_set_info.
>> [D 08/25 00:19] bmi_mx: exiting  BMI_mx_set_info.
>> [D 08/25 00:19] dbpf_thread_initialize: initialized
>> [D 08/25 00:19] dbpf_thread_function started
>> [D 08/25 00:19] [SYNC_COALESCE]: dbpf_sync_context_init for context 0 called
>> [D 08/25 00:19] bmi_mx: entering BMI_mx_testcontext.
>> [D 08/25 00:19] bmi_mx: entering bmx_connection_handlers.
>> [D 08/25 00:19] bmi_mx: entering BMI_mx_method_addr_lookup.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_connection_handlers.
>> [D 08/25 00:19] bmi_mx: BMI_mx_method_addr_lookup with id mx://begbie:0:3.
>> [D 08/25 00:19] bmi_mx: entering bmx_alloc_method_addr.
>> [D 08/25 00:19] bmi_mx: entering bmx_deq_completed.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_alloc_method_addr.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_deq_completed.
>> [D 08/25 00:19] bmi_mx: exiting  BMI_mx_testcontext.
>> [D 08/25 00:19] bmi_mx: entering bmx_ctx_init.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_ctx_init.
>> <snip>
>> [D 08/25 00:19] bmi_mx: exiting  bmx_ctx_init.
>> [D 08/25 00:19] bmi_mx: bmx_peer_addref refcount was 0.
>> [D 08/25 00:19] bmi_mx: entering bmx_peer_connect.
>> [D 08/25 00:19] bmi_mx: Setting peer mx://begbie:0:3 to BMX_PEER_WAIT.
>> [D 08/25 00:19] bmi_mx: bmx_peer_addref refcount was 1.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_peer_connect.
>> [D 08/25 00:19] bmi_mx: exiting  BMI_mx_method_addr_lookup.
>> [D 08/25 00:19] bmi_mx: entering BMI_mx_method_addr_lookup.
>> [D 08/25 00:19] bmi_mx: BMI_mx_method_addr_lookup with id mx://renton:0:3.
>> [D 08/25 00:19] bmi_mx: exiting  BMI_mx_method_addr_lookup.
>> [D 08/25 00:19] bmi_mx: entering BMI_mx_method_addr_lookup.
>> [D 08/25 00:19] bmi_mx: BMI_mx_method_addr_lookup with id mx://tommy:0:3.
>> [D 08/25 00:19] bmi_mx: entering bmx_alloc_method_addr.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_alloc_method_addr.
>> [D 08/25 00:19] bmi_mx: entering bmx_ctx_init.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_ctx_init.
>> <snip>
>> [D 08/25 00:19] bmi_mx: entering bmx_ctx_init.
>> [D 08/25 00:19] bmi_mx: exiting  bmx_ctx_init.
>> [D 08/25 00:19] bmi_mx: bmx_peer_addref refcount was 0.
>> [D 08/25 00:19] bmi_mx: entering bmx_peer_connect.
>> [D 08/25 00:19] bmi_mx: Setting peer mx://tommy:0:3 to BMX_PEER_WAIT.
>> [D 08/25 00:19] bmi_mx: bmx_peer_addref refcount was 1.
>> OMX: Completing iconnect request: Remote Endpoint is Closed
> 
> 
> It seems like repeatedly calling bmx_ctx_init might not be the expected normal behavior, but it's hard to tell without a known-working MX stack to test it against.  If anyone has any multiple-server PVFS2 installation actually working over any type of MX (Open or Myricom), would you be able to set BMX_DB_MASK to BMX_DB_ALL and send the output?
> 
> Thanks!
> 
> Josh.

This looks normal. The bmx_ctx_init() function is called a lot. Setting BMX_DB_ALL adds function entry and exit statements, which makes the output incredibly verbose, so I only use it when debugging the order of events (e.g. when chasing a race condition).
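
For anyone following along, here is a rough, stand-alone sketch of how a category-based debug mask of this kind can work. It is not the bmi_mx source: every EX_ and ex_ name and every bit value is hypothetical, chosen only to show why turning on all categories adds an entering/exiting pair around each call.

```c
#include <stdio.h>

/* Hypothetical debug categories; the real BMX_DB_* values may differ. */
enum {
    EX_DB_CONN = 1u << 0,   /* connection events */
    EX_DB_FUNC = 1u << 1,   /* function entry and exit */
    EX_DB_ALL  = 0xFFFFu    /* every category at once */
};

/* Current mask; only connection messages are enabled to start with. */
static unsigned ex_db_mask = EX_DB_CONN;

/* Returns nonzero when any bit of the given category is enabled. */
static int ex_db_enabled(unsigned category)
{
    return (ex_db_mask & category) != 0;
}

/* Print to stderr only when the message's category is in the mask. */
#define EX_DEBUG(category, ...)                 \
    do {                                        \
        if (ex_db_enabled(category))            \
            fprintf(stderr, __VA_ARGS__);       \
    } while (0)

/* Stand-in for a frequently called helper such as a context init. */
static int ex_ctx_init(void)
{
    EX_DEBUG(EX_DB_FUNC, "entering %s.\n", __func__);
    /* ... the real work would happen here ... */
    EX_DEBUG(EX_DB_FUNC, "exiting  %s.\n", __func__);
    return 0;
}
```

With the mask at EX_DB_CONN, ex_ctx_init() prints nothing. With EX_DB_ALL, every call prints two lines, which is how a log fills up with entering/exiting pairs.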

Phil's comment about the Open-MX error code looks to be the solution. bmi_mx sets the MX error handler to MX_ERRORS_RETURN rather than leaving it at the library default of MX_ERRORS_FATAL. I wonder if the Open-MX error handler is ignoring the MX one.
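
To make the distinction between the two handler policies concrete, here is a small stand-alone model. It is only a sketch and does not call the MX or Open-MX API: the ex_ names, the status codes, and the ex_connect() stand-in are all hypothetical. It shows that with a "return" handler a failed connect comes back to the caller as a status code, while with a "fatal" handler the process exits inside the library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes, not the real MX return values. */
typedef enum {
    EX_SUCCESS = 0,
    EX_REMOTE_ENDPOINT_CLOSED = 1
} ex_status_t;

/* An error handler decides what happens when a library call fails. */
typedef ex_status_t (*ex_error_handler_t)(const char *where, ex_status_t code);

/* "Fatal" policy: report the error and terminate the process. */
static ex_status_t ex_errors_are_fatal(const char *where, ex_status_t code)
{
    fprintf(stderr, "%s: fatal error %d\n", where, (int)code);
    exit(EXIT_FAILURE);
}

/* "Return" policy: hand the code back so the caller can react to it. */
static ex_status_t ex_errors_return(const char *where, ex_status_t code)
{
    (void)where;
    return code;
}

/* The model library defaults to the fatal policy. */
static ex_error_handler_t ex_current_handler = ex_errors_are_fatal;

/* Install a new handler and return the previous one. */
static ex_error_handler_t ex_set_error_handler(ex_error_handler_t handler)
{
    ex_error_handler_t old = ex_current_handler;
    ex_current_handler = handler;
    return old;
}

/* Stand-in for a connect call; it fails when the peer is not listening. */
static ex_status_t ex_connect(int peer_is_listening)
{
    if (peer_is_listening)
        return EX_SUCCESS;
    return ex_current_handler("ex_connect", EX_REMOTE_ENDPOINT_CLOSED);
}
```

In this model, a caller that installs ex_errors_return can see EX_REMOTE_ENDPOINT_CLOSED and decide to retry later, for example when a peer server has not started yet. If the installed handler were silently replaced by the fatal one, the same failure would end the process instead.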

Scott

