[Pvfs2-developers] help debugging request processor/distribution

Walter B. Ligon III walt at clemson.edu
Mon Jun 12 13:37:54 EDT 2006


Phil, that last log is definitely munged - and yeah it does look like 
two overlapping logs.

It's probably worth looking at Sam's suggestion first, as that has been 
at the center of a few similar bugs.  If that's not it I'm ready to help 
out debugging the request code if you can get a clean log.

As an asside, someone might want to think about an upgrade to the 
logging to deal with multiple threads.

Walt

Phil Carns wrote:
> Hi all,
> 
> I am looking at an I/O problem that I don't completely understand.  The 
> setup is that there are 15 servers and 20 clients (all RHEL3 SMP).  The 
> clients are running a proprietary application.  At the end of the run 
> they each write their share of a data set into a 36 GB file.  So each is 
> doing a contiguous write of 36/20 GB of data.
> 
> Roughly 80-90% of the time, one or more of the clients and servers hang, 
> though it isn't predictable which one will do it.  After quite a bit of 
> digging through logs, it turns out that there is a flow that never 
> completes on the server side.  There are two problems contributing here:
> 
> 1) The flow timeout mechanism doesn't seem to be working right for me 
> (the timer for the flow pops and checks progress one time, then never 
> gets triggered again).  This isn't the core problem, but it is causing 
> the flow to get stuck forever and hold up the request scheduler rather 
> than giving up eventually.
> 
> 2) For this particular flow, the client and server appear to disagree on 
> how much data to exchange.  The client sends one 256K buffer and 
> completes the flow.  The server receives that 256K buffer but still 
> expects at least one more to arrive (which never does).
> 
> I am pretty sure I can get to the bottom of the first problem.  The 
> second one is trickier.
> 
> I attached a few log file excerpts:
> 
> client-1st.txt: This is the request processing log output from the 
> client side when it is working on the flow in question
> 
> server-1st.txt: request processing log output from the first iteration 
> of the server side flow
> 
> server-2nd.txt: request processing log output from the second iteration 
> of the server side flow
> 
> Anyone have any suggestions, or see any indication from those log 
> portions about why the server and client arn't on the same page?
> 
> Some of this log output (the server-2nd.txt in particular?) looks a 
> little jumbled, like maybe there are two flows doing request processing 
> at once, but I don't really know how to interpret it.
> 
> I may try adding a mutex around PINT_process_request() next, just to see 
> if I can make the logs clearer by making sure only one invocation is 
> going at once.
> 
> -Phil
> 
> 
> ------------------------------------------------------------------------
> 
> [D 06/09 15:49] =========================================================
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49]         tiling 5484 copies
> [D 06/09 15:49]         skipping ahead to target_offset
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0 
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 0
> [D 06/09 15:49]                 to 0 ta 22993719939 fi 22997914243
> [D 06/09 15:49]                 mto 0 mta 0 mfi 4194304
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 0 contig_size = 23001563136 lvl_flag =
> 1
> [D 06/09 15:49]         process logical skip
> [D 06/09 15:49]         retval = 22993719939 
> [D 06/09 15:49]         exiting logical skip because distribute indicates done
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0 
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 22993719939
> [D 06/09 15:49]                 to 22993719939 ta 22993719939 fi 22997914243
> [D 06/09 15:49]                 mto 0 mta 0 mfi 4194304
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 22993719939 contig_size = 7843197
> lvl_flag = 1
> [D 06/09 15:49]         calling distribute
> [D 06/09 15:49]         PINT_distribute
> [D 06/09 15:49]                 of 22993719939 sz 4194304 ix 0 sm 64 by 0 bm
> 262144 fsz 0 exfl 1
> [D 06/09 15:49]                 begin iteration loff: 22994157568
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]                 client lstof 0 diff 437629 sgof 437629
> [D 06/09 15:49] **********CALL***PROCESS*********
> [D 06/09 15:49]                 segment of 437629 sz 65536
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49]         skipping ahead to target_offset
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 0
> [D 06/09 15:49]                 to 0 ta 437629 fi 503165
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 0 contig_size = 4194304 lvl_flag = 1
> [D 06/09 15:49]         process logical skip
> [D 06/09 15:49]         retval = 437629
> [D 06/09 15:49]         exiting logical skip because distribute indicates done
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 437629
> [D 06/09 15:49]                 to 437629 ta 437629 fi 503165
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 437629 contig_size = 3756675 lvl_flag
> = 1
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 437629 sz 65536
> [D 06/09 15:49]         add a segment 0
> [D 06/09 15:49]         retval = 65536
> [D 06/09 15:49]         exiting distribute returned less than expected
> [D 06/09 15:49]         done sg 1 sm 64 by 65536 bm 262144 ta 437629 to 503165
> fo 503165 eof 0
> [D 06/09 15:49] *****RETURN***FROM***PROCESS*****
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 begin iteration loff: 22995140608
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]                 client lstof 503165 diff 917504 sgof 1420669
> [D 06/09 15:49] **********CALL***PROCESS*********
> [D 06/09 15:49]                 segment of 1420669 sz 65536
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49]         skipping ahead to target_offset
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 503165
> [D 06/09 15:49]                 to 503165 ta 1420669 fi 1486205
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 503165 contig_size = 3691139 lvl_flag
> = 1
> [D 06/09 15:49]         process logical skip
> [D 06/09 15:49]         retval = 917504
> [D 06/09 15:49]         exiting logical skip because distribute indicates done
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 1420669
> [D 06/09 15:49]                 to 1420669 ta 1420669 fi 1486205
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 1420669 contig_size = 2773635 lvl_flag
> = 1
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 1420669 sz 65536
> [D 06/09 15:49]         add a segment 1
> [D 06/09 15:49]         retval = 65536
> [D 06/09 15:49]         exiting distribute returned less than expected
> [D 06/09 15:49]         done sg 2 sm 64 by 131072 bm 262144 ta 1420669 to
> 1486205 fo 1486205 eof 0
> [D 06/09 15:49] *****RETURN***FROM***PROCESS*****
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 begin iteration loff: 22996123648
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]                 client lstof 1486205 diff 917504 sgof 2403709
> [D 06/09 15:49] **********CALL***PROCESS*********
> [D 06/09 15:49]                 segment of 2403709 sz 65536
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49]         skipping ahead to target_offset
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 1486205
> [D 06/09 15:49]                 to 1486205 ta 2403709 fi 2469245
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 1486205 contig_size = 2708099 lvl_flag
> = 1
> [D 06/09 15:49]         process logical skip
> [D 06/09 15:49]         retval = 917504
> [D 06/09 15:49]         exiting logical skip because distribute indicates done
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 2403709
> [D 06/09 15:49]                 to 2403709 ta 2403709 fi 2469245
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 2403709 contig_size = 1790595 lvl_flag
> = 1
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 2403709 sz 65536
> [D 06/09 15:49]         add a segment 2
> [D 06/09 15:49]         retval = 65536
> [D 06/09 15:49]         exiting distribute returned less than expected
> [D 06/09 15:49]         done sg 3 sm 64 by 196608 bm 262144 ta 2403709 to
> 2469245 fo 2469245 eof 0
> [D 06/09 15:49] *****RETURN***FROM***PROCESS*****
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 begin iteration loff: 22997106688
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]                 client lstof 2469245 diff 917504 sgof 3386749
> [D 06/09 15:49] **********CALL***PROCESS*********
> [D 06/09 15:49]                 segment of 3386749 sz 65536
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49]         skipping ahead to target_offset
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 2469245
> [D 06/09 15:49]                 to 2469245 ta 3386749 fi 3452285
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 2469245 contig_size = 1725059 lvl_flag
> = 1
> [D 06/09 15:49]         process logical skip
> [D 06/09 15:49]         retval = 917504
> [D 06/09 15:49]         exiting logical skip because distribute indicates done
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 3386749
> [D 06/09 15:49]                 to 3386749 ta 3386749 fi 3452285
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 3386749 contig_size = 807555 lvl_flag
> = 1
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 3386749 sz 65536
> [D 06/09 15:49]         add a segment 3
> [D 06/09 15:49]         retval = 65536
> [D 06/09 15:49]         exiting distribute returned less than expected
> [D 06/09 15:49]         done sg 4 sm 64 by 262144 bm 262144 ta 3386749 to
> 3452285 fo 3452285 eof 0
> [D 06/09 15:49] *****RETURN***FROM***PROCESS*****
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 done with segments or bytes
> [D 06/09 15:49]                         of 22997172224 sz 742019 sg 4 sm 64 by
> 262144 bm 262144
> [D 06/09 15:49]                         next loff: 22998089728 next poff:
> 1533149184
> [D 06/09 15:49]                         (return value) 4194304
> [D 06/09 15:49]                 finished
> [D 06/09 15:49]         retval = 4194304
> [D 06/09 15:49]         exiting distribute returned less than expected
> [D 06/09 15:49]         done sg 4 sm 64 by 262144 bm 262144 ta 22993719939 to
> 22997914243 fo 22997914243 eof 0
> 
> 
> ------------------------------------------------------------------------
> 
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49]         tiling 9156 copies
> [D 06/09 15:49]         skipping ahead to target_offset
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 0
> [D 06/09 15:49]                 to 0 ta 38396548819 fi 38400743123
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 0 contig_size = 38403047424 lvl_flag =
> 1
> [D 06/09 15:49]         process logical skip
> [D 06/09 15:49]         retval = 38396548819
> [D 06/09 15:49]         exiting logical skip because distribute indicates done
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 38396548819
> [D 06/09 15:49]                 to 38396548819 ta 38396548819 fi 38400743123
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 38396548819 contig_size = 6498605
> lvl_flag = 1
> [D 06/09 15:49]         calling distribute
> [D 06/09 15:49]         PINT_distribute
> [D 06/09 15:49]                 of 38396548819 sz 4194304 ix 0 sm 64 by 0 bm
> 262144 fsz 2559770624 exfl 1
> [D 06/09 15:49]                 begin iteration loff: 38396690432
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 2559770624 sz 65536
> [D 06/09 15:49]         add a segment 0
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 begin iteration loff: 38397673472
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 2559836160 sz 65536
> [D 06/09 15:49]         combine a segment 0
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 begin iteration loff: 38398656512
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 2559901696 sz 65536
> [D 06/09 15:49]         combine a segment 0
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 begin iteration loff: 38399639552
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 2559967232 sz 65536
> [D 06/09 15:49]         combine a segment 0
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 done with segments or bytes
> [D 06/09 15:49]                         of 38399705088 sz 1038035 sg 1 sm 64
> by 262144 bm 262144
> [D 06/09 15:49]                         next loff: 38400622592 next poff:
> 2560032768
> [D 06/09 15:49]                         return value 3156269
> [D 06/09 15:49]                 finished
> [D 06/09 15:49]         retval = 3156269
> [D 06/09 15:49]         exiting distribute returned less than expected
> [D 06/09 15:49]         done sg 1 sm 64 by 262144 bm 262144 ta 38396548819 to
> 38399705088 fo 38400743123 eof 0
> 
> 
> ------------------------------------------------------------------------
> 
> [D 06/09 15:49] =========================================================
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49]         tiling 8489 copies
> [D 06/09 15:49]         skipping ahead to target_offset
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 0
> [D 06/09 15:49]                 to 0 ta 35599341710 fi 35603536014
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 0 contig_size = 35605446656 lvl_flag =
> 1
> [D 06/09 15:49]         process logical skip
> [D 06/09 15:49]         retval = 35599341710
> [D 06/09 15:49]         exiting logical skip because distribute indicates done
> [D 06/09 15:49] =========================================================
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 35599341710
> [D 06/09 15:49]         tiling 9156 copies
> [D 06/09 15:49]                 to 35599341710 ta 35599341710 fi 35603536014
> [D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]                 lvl 0 el 0 blk 0 by 38399705088
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]         contig_offset = 35599341710 contig_size = 6104946
> lvl_flag = 1
> [D 06/09 15:49]                 to 38399705088 ta 38396548819 fi 38400743123
> [D 06/09 15:49]         calling distribute
> [D 06/09 15:49]         PINT_distribute
> [D 06/09 15:49]         basic type or contiguous data
> [D 06/09 15:49]                 of 35599341710 sz 4194304 ix 0 sm 64 by 0 bm
> 262144 fsz 2559770624 exfl 1
> [D 06/09 15:49]         Request disp
> [D 06/09 15:49]                 begin iteration loff: 35599941632
> [D 06/09 15:49]         contig_offset = 38399705088 contig_size = 3342336
> lvl_flag = 1
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]         calling distribute
> [D 06/09 15:49]                         of 2373320704 sz 65536
> [D 06/09 15:49]         add a segment 0
> [D 06/09 15:49]         PINT_distribute
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 begin iteration loff: 35600924672
> [D 06/09 15:49]                 of 38399705088 sz 1038035 ix 0 sm 64 by 0 bm
> 262144 fsz 2560032768 exfl 1
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 2373386240 sz 65536
> [D 06/09 15:49]         combine a segment 0
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 begin iteration loff: 38400622592
> [D 06/09 15:49]                 begin iteration loff: 35601907712
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 2373451776 sz 65536
> [D 06/09 15:49]                 file being extended
> [D 06/09 15:49]         combine a segment 0
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                 begin iteration loff: 35602890752
> [D 06/09 15:49]                 frag extends beyond strip
> [D 06/09 15:49]                         of 2560032768 sz 65536
> [D 06/09 15:49]         process a segment
> [D 06/09 15:49]                         of 2373517312 sz 65536
> [D 06/09 15:49]         add a segment 0
> [D 06/09 15:49]         combine a segment 0
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 end iteration
> [D 06/09 15:49]                 done with segments or bytes
> [D 06/09 15:49]                         of 35602956288 sz 579726 sg 1 sm 64 by
> 262144 bm 262144
> [D 06/09 15:49]                         of 38400688128 sz 54995 sg 1 sm 64 by
> 65536 bm 262144
> [D 06/09 15:49]                         next loff: 35603873792 next poff:
> 2373582848
> [D 06/09 15:49]                         (return value) 4194304
> [D 06/09 15:49]                         next loff: 38401605632 next poff:
> 2560098304
> [D 06/09 15:49]                 finished
> [D 06/09 15:49]                         (return value) 1038035
> [D 06/09 15:49]         retval = 4194304
> [D 06/09 15:49]         exiting distribute returned less than expected
> [D 06/09 15:49]                 finished
> [D 06/09 15:49]         done sg 1 sm 64 by 262144 bm 262144 ta 35599341710 to
> 35603536014 fo 35603536014 eof 0
> [D 06/09 15:49]         retval = 1038035
> [D 06/09 15:49] =========================================================
> [D 06/09 15:49]         exiting distribute returned less than expected
> [D 06/09 15:49]         done sg 1 sm 64 by 65536 bm 262144 ta 38396548819 to
> 38400743123 fo 38400743123 eof 0
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

-- 
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University


More information about the Pvfs2-developers mailing list