[Pvfs2-developers] help debugging request processor/distribution
Walter B. Ligon III
walt at clemson.edu
Mon Jun 12 13:37:54 EDT 2006
Phil, that last log is definitely munged - and yeah it does look like
two overlapping logs.
It's probably worth looking at Sam's suggestion first, as that has been
at the center of a few similar bugs. If that's not it I'm ready to help
out debugging the request code if you can get a clean log.
As an asside, someone might want to think about an upgrade to the
logging to deal with multiple threads.
Walt
Phil Carns wrote:
> Hi all,
>
> I am looking at an I/O problem that I don't completely understand. The
> setup is that there are 15 servers and 20 clients (all RHEL3 SMP). The
> clients are running a proprietary application. At the end of the run
> they each write their share of a data set into a 36 GB file. So each is
> doing a contiguous write of 36/20 GB of data.
>
> Roughly 80-90% of the time, one or more of the clients and servers hang,
> though it isn't predictable which one will do it. After quite a bit of
> digging through logs, it turns out that there is a flow that never
> completes on the server side. There are two problems contributing here:
>
> 1) The flow timeout mechanism doesn't seem to be working right for me
> (the timer for the flow pops and checks progress one time, then never
> gets triggered again). This isn't the core problem, but it is causing
> the flow to get stuck forever and hold up the request scheduler rather
> than giving up eventually.
>
> 2) For this particular flow, the client and server appear to disagree on
> how much data to exchange. The client sends one 256K buffer and
> completes the flow. The server receives that 256K buffer but still
> expects at least one more to arrive (which never does).
>
> I am pretty sure I can get to the bottom of the first problem. The
> second one is trickier.
>
> I attached a few log file excerpts:
>
> client-1st.txt: This is the request processing log output from the
> client side when it is working on the flow in question
>
> server-1st.txt: request processing log output from the first iteration
> of the server side flow
>
> server-2nd.txt: request processing log output from the second iteration
> of the server side flow
>
> Anyone have any suggestions, or see any indication from those log
> portions about why the server and client arn't on the same page?
>
> Some of this log output (the server-2nd.txt in particular?) looks a
> little jumbled, like maybe there are two flows doing request processing
> at once, but I don't really know how to interpret it.
>
> I may try adding a mutex around PINT_process_request() next, just to see
> if I can make the logs clearer by making sure only one invocation is
> going at once.
>
> -Phil
>
>
> ------------------------------------------------------------------------
>
> [D 06/09 15:49] =========================================================
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49] tiling 5484 copies
> [D 06/09 15:49] skipping ahead to target_offset
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 0
> [D 06/09 15:49] to 0 ta 22993719939 fi 22997914243
> [D 06/09 15:49] mto 0 mta 0 mfi 4194304
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 0 contig_size = 23001563136 lvl_flag =
> 1
> [D 06/09 15:49] process logical skip
> [D 06/09 15:49] retval = 22993719939
> [D 06/09 15:49] exiting logical skip because distribute indicates done
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 22993719939
> [D 06/09 15:49] to 22993719939 ta 22993719939 fi 22997914243
> [D 06/09 15:49] mto 0 mta 0 mfi 4194304
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 22993719939 contig_size = 7843197
> lvl_flag = 1
> [D 06/09 15:49] calling distribute
> [D 06/09 15:49] PINT_distribute
> [D 06/09 15:49] of 22993719939 sz 4194304 ix 0 sm 64 by 0 bm
> 262144 fsz 0 exfl 1
> [D 06/09 15:49] begin iteration loff: 22994157568
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] client lstof 0 diff 437629 sgof 437629
> [D 06/09 15:49] **********CALL***PROCESS*********
> [D 06/09 15:49] segment of 437629 sz 65536
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49] skipping ahead to target_offset
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 0
> [D 06/09 15:49] to 0 ta 437629 fi 503165
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 0 contig_size = 4194304 lvl_flag = 1
> [D 06/09 15:49] process logical skip
> [D 06/09 15:49] retval = 437629
> [D 06/09 15:49] exiting logical skip because distribute indicates done
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 437629
> [D 06/09 15:49] to 437629 ta 437629 fi 503165
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 437629 contig_size = 3756675 lvl_flag
> = 1
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 437629 sz 65536
> [D 06/09 15:49] add a segment 0
> [D 06/09 15:49] retval = 65536
> [D 06/09 15:49] exiting distribute returned less than expected
> [D 06/09 15:49] done sg 1 sm 64 by 65536 bm 262144 ta 437629 to 503165
> fo 503165 eof 0
> [D 06/09 15:49] *****RETURN***FROM***PROCESS*****
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] begin iteration loff: 22995140608
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] client lstof 503165 diff 917504 sgof 1420669
> [D 06/09 15:49] **********CALL***PROCESS*********
> [D 06/09 15:49] segment of 1420669 sz 65536
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49] skipping ahead to target_offset
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 503165
> [D 06/09 15:49] to 503165 ta 1420669 fi 1486205
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 503165 contig_size = 3691139 lvl_flag
> = 1
> [D 06/09 15:49] process logical skip
> [D 06/09 15:49] retval = 917504
> [D 06/09 15:49] exiting logical skip because distribute indicates done
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 1420669
> [D 06/09 15:49] to 1420669 ta 1420669 fi 1486205
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 1420669 contig_size = 2773635 lvl_flag
> = 1
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 1420669 sz 65536
> [D 06/09 15:49] add a segment 1
> [D 06/09 15:49] retval = 65536
> [D 06/09 15:49] exiting distribute returned less than expected
> [D 06/09 15:49] done sg 2 sm 64 by 131072 bm 262144 ta 1420669 to
> 1486205 fo 1486205 eof 0
> [D 06/09 15:49] *****RETURN***FROM***PROCESS*****
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] begin iteration loff: 22996123648
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] client lstof 1486205 diff 917504 sgof 2403709
> [D 06/09 15:49] **********CALL***PROCESS*********
> [D 06/09 15:49] segment of 2403709 sz 65536
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49] skipping ahead to target_offset
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 1486205
> [D 06/09 15:49] to 1486205 ta 2403709 fi 2469245
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 1486205 contig_size = 2708099 lvl_flag
> = 1
> [D 06/09 15:49] process logical skip
> [D 06/09 15:49] retval = 917504
> [D 06/09 15:49] exiting logical skip because distribute indicates done
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 2403709
> [D 06/09 15:49] to 2403709 ta 2403709 fi 2469245
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 2403709 contig_size = 1790595 lvl_flag
> = 1
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 2403709 sz 65536
> [D 06/09 15:49] add a segment 2
> [D 06/09 15:49] retval = 65536
> [D 06/09 15:49] exiting distribute returned less than expected
> [D 06/09 15:49] done sg 3 sm 64 by 196608 bm 262144 ta 2403709 to
> 2469245 fo 2469245 eof 0
> [D 06/09 15:49] *****RETURN***FROM***PROCESS*****
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] begin iteration loff: 22997106688
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] client lstof 2469245 diff 917504 sgof 3386749
> [D 06/09 15:49] **********CALL***PROCESS*********
> [D 06/09 15:49] segment of 3386749 sz 65536
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49] skipping ahead to target_offset
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 2469245
> [D 06/09 15:49] to 2469245 ta 3386749 fi 3452285
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 2469245 contig_size = 1725059 lvl_flag
> = 1
> [D 06/09 15:49] process logical skip
> [D 06/09 15:49] retval = 917504
> [D 06/09 15:49] exiting logical skip because distribute indicates done
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 3386749
> [D 06/09 15:49] to 3386749 ta 3386749 fi 3452285
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 3386749 contig_size = 807555 lvl_flag
> = 1
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 3386749 sz 65536
> [D 06/09 15:49] add a segment 3
> [D 06/09 15:49] retval = 65536
> [D 06/09 15:49] exiting distribute returned less than expected
> [D 06/09 15:49] done sg 4 sm 64 by 262144 bm 262144 ta 3386749 to
> 3452285 fo 3452285 eof 0
> [D 06/09 15:49] *****RETURN***FROM***PROCESS*****
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] done with segments or bytes
> [D 06/09 15:49] of 22997172224 sz 742019 sg 4 sm 64 by
> 262144 bm 262144
> [D 06/09 15:49] next loff: 22998089728 next poff:
> 1533149184
> [D 06/09 15:49] (return value) 4194304
> [D 06/09 15:49] finished
> [D 06/09 15:49] retval = 4194304
> [D 06/09 15:49] exiting distribute returned less than expected
> [D 06/09 15:49] done sg 4 sm 64 by 262144 bm 262144 ta 22993719939 to
> 22997914243 fo 22997914243 eof 0
>
>
> ------------------------------------------------------------------------
>
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49] tiling 9156 copies
> [D 06/09 15:49] skipping ahead to target_offset
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 0
> [D 06/09 15:49] to 0 ta 38396548819 fi 38400743123
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 0 contig_size = 38403047424 lvl_flag =
> 1
> [D 06/09 15:49] process logical skip
> [D 06/09 15:49] retval = 38396548819
> [D 06/09 15:49] exiting logical skip because distribute indicates done
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 38396548819
> [D 06/09 15:49] to 38396548819 ta 38396548819 fi 38400743123
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 38396548819 contig_size = 6498605
> lvl_flag = 1
> [D 06/09 15:49] calling distribute
> [D 06/09 15:49] PINT_distribute
> [D 06/09 15:49] of 38396548819 sz 4194304 ix 0 sm 64 by 0 bm
> 262144 fsz 2559770624 exfl 1
> [D 06/09 15:49] begin iteration loff: 38396690432
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 2559770624 sz 65536
> [D 06/09 15:49] add a segment 0
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] begin iteration loff: 38397673472
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 2559836160 sz 65536
> [D 06/09 15:49] combine a segment 0
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] begin iteration loff: 38398656512
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 2559901696 sz 65536
> [D 06/09 15:49] combine a segment 0
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] begin iteration loff: 38399639552
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 2559967232 sz 65536
> [D 06/09 15:49] combine a segment 0
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] done with segments or bytes
> [D 06/09 15:49] of 38399705088 sz 1038035 sg 1 sm 64
> by 262144 bm 262144
> [D 06/09 15:49] next loff: 38400622592 next poff:
> 2560032768
> [D 06/09 15:49] return value 3156269
> [D 06/09 15:49] finished
> [D 06/09 15:49] retval = 3156269
> [D 06/09 15:49] exiting distribute returned less than expected
> [D 06/09 15:49] done sg 1 sm 64 by 262144 bm 262144 ta 38396548819 to
> 38399705088 fo 38400743123 eof 0
>
>
> ------------------------------------------------------------------------
>
> [D 06/09 15:49] =========================================================
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49] tiling 8489 copies
> [D 06/09 15:49] skipping ahead to target_offset
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 0
> [D 06/09 15:49] to 0 ta 35599341710 fi 35603536014
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 0 contig_size = 35605446656 lvl_flag =
> 1
> [D 06/09 15:49] process logical skip
> [D 06/09 15:49] retval = 35599341710
> [D 06/09 15:49] exiting logical skip because distribute indicates done
> [D 06/09 15:49] =========================================================
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] PINT_process_request
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 35599341710
> [D 06/09 15:49] tiling 9156 copies
> [D 06/09 15:49] to 35599341710 ta 35599341710 fi 35603536014
> [D 06/09 15:49] Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
> 4194304 co 0
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] lvl 0 el 0 blk 0 by 38399705088
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] contig_offset = 35599341710 contig_size = 6104946
> lvl_flag = 1
> [D 06/09 15:49] to 38399705088 ta 38396548819 fi 38400743123
> [D 06/09 15:49] calling distribute
> [D 06/09 15:49] PINT_distribute
> [D 06/09 15:49] basic type or contiguous data
> [D 06/09 15:49] of 35599341710 sz 4194304 ix 0 sm 64 by 0 bm
> 262144 fsz 2559770624 exfl 1
> [D 06/09 15:49] Request disp
> [D 06/09 15:49] begin iteration loff: 35599941632
> [D 06/09 15:49] contig_offset = 38399705088 contig_size = 3342336
> lvl_flag = 1
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] calling distribute
> [D 06/09 15:49] of 2373320704 sz 65536
> [D 06/09 15:49] add a segment 0
> [D 06/09 15:49] PINT_distribute
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] begin iteration loff: 35600924672
> [D 06/09 15:49] of 38399705088 sz 1038035 ix 0 sm 64 by 0 bm
> 262144 fsz 2560032768 exfl 1
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 2373386240 sz 65536
> [D 06/09 15:49] combine a segment 0
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] begin iteration loff: 38400622592
> [D 06/09 15:49] begin iteration loff: 35601907712
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 2373451776 sz 65536
> [D 06/09 15:49] file being extended
> [D 06/09 15:49] combine a segment 0
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] begin iteration loff: 35602890752
> [D 06/09 15:49] frag extends beyond strip
> [D 06/09 15:49] of 2560032768 sz 65536
> [D 06/09 15:49] process a segment
> [D 06/09 15:49] of 2373517312 sz 65536
> [D 06/09 15:49] add a segment 0
> [D 06/09 15:49] combine a segment 0
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] end iteration
> [D 06/09 15:49] done with segments or bytes
> [D 06/09 15:49] of 35602956288 sz 579726 sg 1 sm 64 by
> 262144 bm 262144
> [D 06/09 15:49] of 38400688128 sz 54995 sg 1 sm 64 by
> 65536 bm 262144
> [D 06/09 15:49] next loff: 35603873792 next poff:
> 2373582848
> [D 06/09 15:49] (return value) 4194304
> [D 06/09 15:49] next loff: 38401605632 next poff:
> 2560098304
> [D 06/09 15:49] finished
> [D 06/09 15:49] (return value) 1038035
> [D 06/09 15:49] retval = 4194304
> [D 06/09 15:49] exiting distribute returned less than expected
> [D 06/09 15:49] finished
> [D 06/09 15:49] done sg 1 sm 64 by 262144 bm 262144 ta 35599341710 to
> 35603536014 fo 35603536014 eof 0
> [D 06/09 15:49] retval = 1038035
> [D 06/09 15:49] =========================================================
> [D 06/09 15:49] exiting distribute returned less than expected
> [D 06/09 15:49] done sg 1 sm 64 by 65536 bm 262144 ta 38396548819 to
> 38400743123 fo 38400743123 eof 0
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
--
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University
More information about the Pvfs2-developers
mailing list