[Pvfs2-developers] MD server crash, again
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Fri Sep 8 11:34:15 EDT 2006
I've pulled the latest code from CVS HEAD and rebuilt it.
Unfortunately the MD server still crashes hard, and now the servers
seem to be eating 100% CPU (though I didn't check this prior to
updating, which I suppose could have been useful).
Here's a log from the IB version. I'm almost positive this is an
IB-specific problem at this point, as nobody else that I know of is
having these problems.
---- client ----
p5l6:~# pvfs2-cp -t /tmp/junkfile /pvfs2/6node/
Wrote 2147483648 bytes in 2.695799 seconds. 759.700592 MB/seconds
p5l6:~# pvfs2-cp -t /pvfs2/6node/junkfile /dev/null
< ctrl-c after a minute >
< all the data servers are now off in runaway land, metadata server is
dead now, as verified below >
p5l6:~# pvfs2-ls
[E 10:18:53.867913] Warning: ib_tcp_client_connect: connect to server da6:3336: Connection refused.
[E 10:18:53.868039] Receive immediately failed: Connection refused
[E 10:18:53.868098] msgpair failed, will retry: Connection refused
[E 10:18:55.870880] Warning: ib_tcp_client_connect: connect to server da6:3336: Connection refused.
[E 10:18:55.870926] Receive immediately failed: Connection refused
[E 10:18:55.870995] msgpair failed, will retry: Connection refused
[E 10:18:57.873345] Warning: ib_tcp_client_connect: connect to server da6:3336: Connection refused.
[E 10:18:57.873383] Receive immediately failed: Connection refused
[E 10:18:57.873444] msgpair failed, will retry: Connection refused
---- MD server log, times are not at all sync'd with above ----
[D 10:15:17.161630] PVFS2 Server version 1.5.1pre1-2006-09-07-182738 starting.
[E 10:23:20.739431] Job time out: cancelling flow operation, job_id: 4370.
[E 10:23:20.739511] Flow proto cancel called on 0x63f640
[E 10:23:20.739526] Flow proto error cleanup started on 0x63f640, error_code: -1610612737
[E 10:23:20.739628] Flow proto 0x63f640 canceling a total of 7 BMI or Trove operations
-----
With my current level (lack) of debugging, none of the data servers
show anything. They are running away at 100% CPU, and their logs
contain nothing other than the startup line.
-----
Pete, which level of debugging would be best for getting a good log:
trove or network?
Thanks,
-- Kyle
--
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory