[Pvfs2-users] "Remote Endpoint is Closed" error starting pvfs2-server

Joshua Randall jrandall at well.ox.ac.uk
Wed Aug 25 11:02:59 EDT 2010


> Can you try repeating your experiment with the OMX_FATAL_ERRORS  
> environment variable set to 0?

With OMX_FATAL_ERRORS=0 and MX_IMM_ACK=1, the servers now start and
connect to each other.  Perhaps the FAQ entry that discusses
MX_IMM_ACK should also mention that OMX_FATAL_ERRORS=0 is needed (or
perhaps it can be set programmatically via the API?).  In any case,
thanks for the help in getting the servers started!
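
For reference, a minimal sketch of one way to set the two variables,
assuming a POSIX shell.  The variable names and values are the ones
used in this thread; whether to export them or pass them per command
is a matter of choice, not something the PVFS2 documentation
prescribes.

```shell
# Export both variables so every command started from this shell inherits them.
export MX_IMM_ACK=1
export OMX_FATAL_ERRORS=0

# Confirm what child processes will see.
env | grep -E '^(MX_IMM_ACK|OMX_FATAL_ERRORS)='

# sudo often resets the environment, so under sudo pass the variables on the
# command line instead, as in the pvfs2-ping runs in this message:
#   sudo MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 pvfs2-ping -m /ggeu
```

The same variables need to be present in the environment of each
pvfs2-server process as well as the client tools.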

However, I still cannot get the filesystem to actually work.  With the  
3 servers running, I tried pvfs2-ping, but all I get are connection  
errors.

Depending on which of the 3 servers I set in pvfs2tab, I get different
errors with pvfs2-ping.

With the server set to the same host as I am running pvfs2-ping on (in  
this case begbie), all three servers keep running without recognizing  
any connection request, and I get this output:
>
> begbie:~$ sudo MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 pvfs2-ping -m /ggeu
>
> (1) Parsing tab file...
>
> (2) Initializing system interface...
>
> (3) Initializing each file system found in tab file: /etc/pvfs2tab...
>
>    PVFS2 servers: mx://begbie:0:3
>    Storage name: pvfs2-fs
>    Local mount point: /ggeu
> [E 15:36:10.272413] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
> [E 15:36:33.203054] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
> [E 15:36:56.162436] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
> [E 15:37:19.112399] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
> [E 15:37:42.092670] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
> [E 15:38:05.052442] Warning: msgpair failed to mx://begbie:0:3, will retry: Network dropped connection on reset
> [E 15:38:05.052484] *** msgpairarray_completion_fn: msgpair to server [UNKNOWN] failed: Network dropped connection on reset
> [E 15:38:05.052499] *** Out of retries.
>    /ggeu: FAILURE!
>
> Failure: could not initialze at least one of the target file systems.
>
> (4) Searching for /ggeu in pvfstab...
> [E 15:38:05.052535] Error: /ggeu/ resides on a PVFS2 file system that has not yet been initialized.
> Failure: could not find filesystem for /ggeu in pvfs2tab /etc/pvfs2tab
> Entry 0: /ggeu

If I set the host in pvfs2tab to one of the other two hosts, the  
server on that host immediately crashes with a segmentation fault, and  
the pvfs2-ping output looks like this:
>
> begbie:~$ sudo MX_IMM_ACK=1 OMX_FATAL_ERRORS=0 pvfs2-ping -m /ggeu
>
> (1) Parsing tab file...
>
> (2) Initializing system interface...
>
> (3) Initializing each file system found in tab file: /etc/pvfs2tab...
>
>    PVFS2 servers: mx://renton:0:3
>    Storage name: pvfs2-fs
>    Local mount point: /ggeu
> [E 15:39:37.761174] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 4.
> [E 15:39:37.761355] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 5.
> [E 15:39:37.771212] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
> [E 15:40:07.031576] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 37.
> [E 15:40:07.031600] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 38.
> [E 15:40:07.041632] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
> [E 15:40:37.312411] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 68.
> [E 15:40:37.312457] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 69.
> [E 15:40:37.322401] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
> [E 15:41:07.631773] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 100.
> [E 15:41:07.631822] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 101.
> [E 15:41:07.641782] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
> [E 15:41:37.941766] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 132.
> [E 15:41:37.941808] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 133.
> [E 15:41:37.951778] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
> [E 15:42:07.271684] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 164.
> [E 15:42:07.271729] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 165.
> [E 15:42:07.281753] Warning: msgpair failed to mx://renton:0:3, will retry: Operation cancelled (possibly due to timeout)
> [E 15:42:07.281766] *** msgpairarray_completion_fn: msgpair to server [UNKNOWN] failed: Operation cancelled (possibly due to timeout)
> [E 15:42:07.281781] *** Out of retries.
>    /ggeu: FAILURE!
>
> Failure: could not initialze at least one of the target file systems.
>
> (4) Searching for /ggeu in pvfstab...
> [E 15:42:07.281824] Error: /ggeu/ resides on a PVFS2 file system that has not yet been initialized.
> Failure: could not find filesystem for /ggeu in pvfs2tab /etc/pvfs2tab
> Entry 0: /ggeu


The pvfs2-server output on that host just before the segmentation fault looks like this:
> [P 08/25 15:39] Start times (hr:min:sec):  15:39:06.049  15:39:04.999  15:39:03.968  15:39:02.919  15:39:01.829  15:39:00.779
> [P 08/25 15:39] Intervals (hr:min:sec)  :  00:00:01.090  00:00:01.050  00:00:01.031  00:00:01.049  00:00:01.090  00:00:01.050
> [P 08/25 15:39] -------------------------------------------------------------------------------------------------------------
> [P 08/25 15:39] bytes read              :             0             0             0             0             0             0
> [P 08/25 15:39] bytes written           :             0             0             0             0             0             0
> [P 08/25 15:39] metadata reads          :             0             0             0             0             0             0
> [P 08/25 15:39] metadata writes         :             0             0             0             0             0             0
> [P 08/25 15:39] metadata dspace ops     :             0             0             0             0             0             0
> [P 08/25 15:39] metadata keyval ops     :             2             2             2             2             2             2
> [P 08/25 15:39] request scheduler       :             0             0             0             0             0             0
> [D 08/25 15:39] [SM Exiting]: (0xc8f140) perf_update_sm:do_work (error code: 0), (action: DEFERRED)
> [D 08/25 15:39] [SM Entering]: (0xc904b0) job_timer_sm:do_work (status: 0)
> [D 08/25 15:39] [SM Exiting]: (0xc904b0) job_timer_sm:do_work (error code: 0), (action: DEFERRED)
> [D 08/25 15:39] bmi_mx: CONN_REQ from mx://begbie:0:0.
> [D 08/25 15:39] bmi_mx: bmx_unexpected_recv rx match= 0xc000000100000100 length= 16.
> [D 08/25 15:39] bmi_mx: bmx_handle_conn_req returned RX match 0xc000000100000100 with Success.
> [E 08/25 15:39] PVFS2 server: signal 11, faulty address is (nil), from 0x475818
> [E 08/25 15:39] [bt] pvfs2-server [0x475818]
> [E 08/25 15:39] [bt] pvfs2-server [0x475818]
> [E 08/25 15:39] [bt] pvfs2-server [0x476102]
> [E 08/25 15:39] [bt] pvfs2-server(BMI_testunexpected+0x392) [0x4549b2]
> [E 08/25 15:39] [bt] pvfs2-server [0x44d5c0]
> [E 08/25 15:39] [bt] /lib/libpthread.so.0 [0x7fe7ed8a6a04]
> [E 08/25 15:39] [bt] /lib/libc.so.6(clone+0x6d) [0x7fe7ed1e1d4d]
> Segmentation fault
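
Most of the pvfs2-server frames in the backtrace above are bare
addresses with no symbol name.  A rough sketch of how they could be
mapped to function names with addr2line (from GNU binutils) follows.
It assumes the pvfs2-server binary is not stripped; the helper name
bt_addrs, the log file name server.log, and the binary path are
hypothetical placeholders, not anything PVFS2 provides.

```shell
# Print the unique addresses of pvfs2-server backtrace frames that have
# no symbol name, in the order they first appear on stdin.
bt_addrs() {
    grep -E '\[bt\] pvfs2-server \[0x[0-9a-f]+\]' \
        | grep -oE '0x[0-9a-f]+' \
        | awk '!seen[$0]++'
}

# Usage (paths are placeholders; -f prints function names, -e names the binary):
#   bt_addrs < server.log | xargs addr2line -f -e /path/to/pvfs2-server
```

Frames inside libpthread and libc are skipped on purpose, since their
addresses do not belong to the pvfs2-server binary.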



