[PVFS2-developers] CVS pvfs2-server server daemons crashing

Robert Latham robl at mcs.anl.gov
Wed Jun 1 19:00:35 EDT 2005


On Wed, May 04, 2005 at 01:52:56PM -0400, Walter B. Ligon III wrote:
> Something is very wrong here.  To tell the truth, I'm not exactly sure
> why line 161 is in the code, because we should never get to that point.

Yes, I know it's taken me a month to follow up on this thread.  I
honestly didn't know we had an Opteron machine floating around that I
could use :>

Anyway, Walt, you're absolutely right.  If you build pvfs2 without
optimization, the pvfs2-server segfault occurs in a slightly different
location:

#0  0x0000000000427337 in PINT_Process_request (req=0x613ed0, mem=0x0, 
    rfdata=0x5e2540, result=0x5fd9d8, mode=17)
    at ../pvfs2-source/src/io/description/pint-request.c:215
#1  0x00000000004233b3 in trove_write_callback_fn (user_ptr=0x5fd9c8, 
    error_code=0)
    at ../pvfs2-source/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:1271
#2  0x0000000000421c67 in fp_multiqueue_post (flow_d=0x5e24d0)
    at ../pvfs2-source/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:619
#3  0x000000000041f8a9 in PINT_flow_post (flow_d=0x5e24d0)
    at ../pvfs2-source/src/io/flow/flow.c:405
#4  0x00000000004196c1 in job_flow (flow_d=0x5e24d0, user_ptr=0x5eb8a0, 
    status_user_tag=0, out_status_p=0x5a4df0, id=0x7fbffff058, context_id=0, 
    timeout_sec=30) at ../pvfs2-source/src/io/job/job.c:1177
#5  0x000000000044f1e4 in io_start_flow (s_op=0x5eb8a0, js_p=0x5a4df0)
    at io.sm:241
#6  0x000000000040ea8d in PINT_state_machine_next (s=0x5eb8a0, r=0x5a4df0)
    at state-machine-fns.h:140
#7  0x000000000040c1b2 in main (argc=4, argv=0x7fbffff1a8)
    at ../pvfs2-source/src/server/pvfs2-server.c:360


(This seems like the perfect time to bust out some valgrind, but
unfortunately the amd64 port does not yet know how to handle the
pwrite64 syscall.)

A couple of odd things turn up in GDB:

(gdb) p req->cur[req->lvl].rq->ereq
$3 = (struct PINT_Request *) 0xe70006437f8
(gdb) p * $3
Cannot access memory at address 0xe70006437f8
(gdb) p req->cur[req->lvl].rq->ereq->num_contig_chunks
Cannot access memory at address 0xe7000643828


Linking with efence (Electric Fence) didn't help narrow down the problem.

(gdb) p *req
$2 = {cur = 0x2a9c0f1fa0, lvl = 0, bytes = 0, type_offset = 0, 
  target_offset = 4194304, final_offset = 8388608, eof_flag = 0 '\0'}


I turned on 'request' logging.  The last thing reported before the
segfault was this:

PINT_New_request_state
PINT_Process_request
      tiling 3 copies
      skipping ahead to target_offset
      Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as 4194304 co 0
              lvl 0 el 0 blk 0 by 0
              to 0 ta 4194304 fi 8388608

Does any of this make more sense than last time?

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA                B29D F333 664A 4280 315B

