[Pvfs2-developers] MD server crash, again

Kyle Schochenmaier kschoche at scl.ameslab.gov
Fri Sep 8 11:34:15 EDT 2006


I've pulled the latest code from CVS head and rebuilt it.
Unfortunately the MD server still crashes hard, and now the servers
appear to be eating 100% CPU (I didn't check this before updating,
which would have been useful to know).

Here's a log from the IB version. I'm almost positive this is an
IB-specific problem at this point, as nobody else that I know of is
seeing it.

---- client ----
p5l6:~# pvfs2-cp -t /tmp/junkfile /pvfs2/6node/
Wrote 2147483648 bytes in 2.695799 seconds. 759.700592 MB/seconds
p5l6:~# pvfs2-cp -t /pvfs2/6node/junkfile /dev/null


< ctrl-c after a minute >
< all the data servers are now off in runaway land, metadata server is 
dead now, as verified below >

p5l6:~# pvfs2-ls
[E 10:18:53.867913] Warning: ib_tcp_client_connect: connect to server da6:3336: Connection refused.
[E 10:18:53.868039] Receive immediately failed: Connection refused
[E 10:18:53.868098] msgpair failed, will retry: Connection refused
[E 10:18:55.870880] Warning: ib_tcp_client_connect: connect to server da6:3336: Connection refused.
[E 10:18:55.870926] Receive immediately failed: Connection refused
[E 10:18:55.870995] msgpair failed, will retry: Connection refused
[E 10:18:57.873345] Warning: ib_tcp_client_connect: connect to server da6:3336: Connection refused.
[E 10:18:57.873383] Receive immediately failed: Connection refused
[E 10:18:57.873444] msgpair failed, will retry: Connection refused


---- MD server log, times are not at all sync'd with above ----

[D 10:15:17.161630] PVFS2 Server version 1.5.1pre1-2006-09-07-182738 starting.
[E 10:23:20.739431] Job time out: cancelling flow operation, job_id: 4370.
[E 10:23:20.739511] Flow proto cancel called on 0x63f640
[E 10:23:20.739526] Flow proto error cleanup started on 0x63f640, error_code: -1610612737
[E 10:23:20.739628] Flow proto 0x63f640 canceling a total of 7 BMI or Trove operations
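
For what it's worth, the error_code in the cleanup line above
(-1610612737 once the wrapped number is rejoined) is easier to read in
hex. Below is a minimal Python sketch, not taken from the PVFS2 source,
that splits the magnitude into its two high bits and the remaining low
bits. The helper name split_error_code is made up for illustration, and
treating bits 30 and 29 as flag bits is my assumption about the
encoding, not something the log confirms.

```python
def split_error_code(code: int) -> dict:
    """Split a signed error code into high flag bits and a low payload.

    Assumption (not confirmed by the log): bits 30 and 29 of the
    magnitude are flag bits, and the rest is the actual error number.
    """
    magnitude = -code if code < 0 else code
    return {
        "hex": hex(magnitude),
        "bit30": bool(magnitude & (1 << 30)),
        "bit29": bool(magnitude & (1 << 29)),
        "low_bits": magnitude & ((1 << 29) - 1),
    }


# Value copied from the "Flow proto error cleanup" log line.
print(split_error_code(-1610612737))
# {'hex': '0x60000001', 'bit30': True, 'bit29': True, 'low_bits': 1}
```

If that assumption holds, the payload is a small code of 1 under two
flag bits, which would be worth looking up in the PVFS2 error
definitions before turning on more debugging.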

-----
With my current level (lack) of debugging, none of the data servers
log anything beyond the startup line, yet they are all running away at
100% CPU.
-----

Pete, which debugging level would give the most useful log: trove or
network?


Thanks,
    -- Kyle



-- 
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 


