[Pvfs2-users] pvfs2-induced system crashes
Sam Lang
slang at mcs.anl.gov
Thu Apr 2 11:05:25 EST 2009
Hi Jim,
Kyle is right, errors of this magnitude in PVFS should show up in the
log file. Before we get into enabling debugging output for you
though, it would be nice to know if there are error messages being
thrown from PVFS.
PVFS has two components in the client, a kernel module that integrates
with the kernel VFS layer, and a userspace daemon that runs as root.
The default locations for the logs for these two components are
different. The kernel module writes error messages to syslog. The
PVFS daemon writes error messages to the log file at
/tmp/pvfs2-client.log. When you start to see the instabilities you
mentioned, do you see anything from running dmesg, or does anything
show up in the /tmp/pvfs2-client.log file?
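As a rough sketch of that check (not part of PVFS itself), a small
Python script could gather both sources in one pass. The helper names
and the "pvfs" search string are assumptions made for illustration;
the only things taken from above are dmesg and the default log path.

```python
import subprocess
from pathlib import Path

# Default client log location mentioned above.
CLIENT_LOG = Path("/tmp/pvfs2-client.log")


def matching_lines(lines, needle="pvfs"):
    """Return the lines that contain `needle`, ignoring case."""
    needle = needle.lower()
    return [line for line in lines if needle in line.lower()]


def last_lines(lines, count=50):
    """Return at most the last `count` lines as a list."""
    lines = list(lines)
    return lines[-count:] if count > 0 else []


def read_dmesg():
    """Return dmesg output as a list of lines (empty if dmesg cannot be run)."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return []
    return result.stdout.splitlines()


def main():
    # Kernel module side: messages land in the kernel ring buffer / syslog.
    print("== dmesg lines mentioning pvfs ==")
    for line in matching_lines(read_dmesg()):
        print(line)

    # Userspace daemon side: show the end of the client log, if it exists.
    print(f"== last lines of {CLIENT_LOG} ==")
    if CLIENT_LOG.exists():
        text = CLIENT_LOG.read_text(errors="replace")
        for line in last_lines(text.splitlines()):
            print(line)
    else:
        print("(log file not found)")


if __name__ == "__main__":
    main()
```

Running it as root right after the load starts climbing is probably the
most useful moment, since dmesg may be restricted for ordinary users.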
Also, if you monitor the pvfs2-client-core process (using ps or top),
does the memory of that process grow over time?
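If watching ps or top by hand is awkward, something along these lines
could record the numbers for you. It is only a sketch: it assumes a
Linux /proc filesystem, reads VmRSS from /proc/<pid>/status, and
matches the process by the basename of argv[0]. The function names are
made up for this example.

```python
import os
import time
from pathlib import Path

PROCESS_NAME = "pvfs2-client-core"


def parse_vmrss_kb(status_text):
    """Extract VmRSS (in kB) from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def find_pids(name, proc_root=Path("/proc")):
    """Return PIDs whose argv[0] basename equals `name`."""
    if not proc_root.is_dir():
        return []
    pids = []
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited or is not readable
        argv0 = raw.split(b"\0")[0].decode(errors="replace")
        if os.path.basename(argv0) == name:
            pids.append(int(entry.name))
    return pids


def rss_by_pid(name):
    """Return {pid: VmRSS in kB} for every matching process."""
    usage = {}
    for pid in find_pids(name):
        try:
            text = Path(f"/proc/{pid}/status").read_text()
        except OSError:
            continue
        rss = parse_vmrss_kb(text)
        if rss is not None:
            usage[pid] = rss
    return usage


def growth_kb(readings):
    """Return last reading minus first reading (0 for an empty list)."""
    return readings[-1] - readings[0] if readings else 0


def monitor(name=PROCESS_NAME, samples=10, interval=60.0):
    """Print the total RSS of matching processes `samples` times."""
    readings = []
    for i in range(samples):
        total = sum(rss_by_pid(name).values())
        readings.append(total)
        print(f"{time.strftime('%H:%M:%S')} {name} total RSS: {total} kB")
        if i < samples - 1:
            time.sleep(interval)
    print(f"growth over run: {growth_kb(readings)} kB")
    return readings
```

Calling monitor(samples=60, interval=60.0) during one of the long scp
transfers would give an hour of readings; a steadily rising total would
suggest the client daemon is where the memory is going.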
Thanks,
-sam
On Apr 1, 2009, at 8:09 PM, Kyle Schochenmaier wrote:
> Jim -
>
> Anytime a server process hangs, or has any kind of connection
> issues, it will cause client side connectivity issues.
> Is it possible that your head node rebooted itself because of the
> high load?
>
> If one of your server processes had a timeout, it should be reflected
> in the log file.
> Let's turn on debugging for the client and see what happens.
> Also it might be beneficial to just restart all of the pvfs2-server
> processes if you have a minute or two of downtime. If there was an issue
> with IP conflicts, I have no idea how that may affect the filesystem
> as a whole; I'll leave that to everyone else to comment on. -- This
> may be an interesting conversation, if you'd like to describe your
> experiences with it.
>
> So let's get client-side logging (network level? if there are timeouts,
> which should occur if there are connectivity issues, this should be
> reflected in logs)
> /tmp/pvfs2-client.log
>
> An additional option may be to think about upgrading to 2.8.1.
> Hopefully we can establish a cause before having to do that. But it
> may be worthwhile at some point.
>
> ~Kyle
>
>
> Kyle Schochenmaier
>
>
>
> On Wed, Apr 1, 2009 at 1:56 PM, Jim Kusznir <jkusznir at gmail.com> wrote:
>> 1. It appears that I didn't look at the right place this time...I've
>> had that problem in the past, but didn't thoroughly investigate it, and
>> when one takes buffers/cache into account, I have over 6GB available.
>> This was reported for the cluster headnode only, which is a pvfs2
>> client (no pvfs2-server-related processes).
>> 2. Workload on the head node is user access to pvfs2 data. Sometimes
>> it's users scp'ing data into or out of pvfs2; sometimes it's them
>> working with it directly (viewing, tar/untar, etc). None of this
>> should be "high performance", and there should not be any direct
>> application utilization of data here. Cluster nodes are a different
>> story....
>> 3. I don't have any cron jobs of my own. I doubt any users would, but
>> if so, their jobs would just be transferring data anyway. The high
>> load average is actually from processes that are hung trying to do I/O
>> on a file in pvfs2. The CPU is basically idle. At times, I've seen
>> the load average as high as 40+.
>> 4. My pvfs2 server logs have not been modified since March 2nd.
>>
>> I do know that at the beginning of the week, one of the pvfs2-server
>> nodes was having intermittent connectivity (IP conflict) issues.
>> However, these problems have started after that was corrected, and the
>> week prior, the head node rebooted spontaneously at least 2 times; I
>> think it was more like 4 (I was on vacation).
>>
>> On Wed, Apr 1, 2009 at 11:38 AM, Kyle Schochenmaier <kschoche at gmail.com> wrote:
>>> Jim -
>>>
>>> A couple things to start with.
>>>
>>> 1. wrt 'climbing memory usage' - is this on the client (head) node,
>>> or the pvfs2-servers (data servers)?
>>> 2. Is your workload basically some users scp'ing data to a
>>> pvfs2-mounted location on the client/head node?
>>> 3. Is it possible that you have some errant cron jobs or 'bad'
>>> scripts running that are eating up the cpu that are not related to
>>> pvfs?
>>> 4. Are there any timeouts on the pvfs2-server logs?
>>> It would make sense that users' files would be inaccessible if one or
>>> more of the data servers is having connectivity issues or other
>>> issues.
>>>
>>> Can you send your pvfs2 config file as well as log files for the client
>>> process and the servers (if there is anything there)?
>>>
>>>
>>> ~Kyle
>>>
>>> Kyle Schochenmaier
>>>
>>>
>>>
>>> On Wed, Apr 1, 2009 at 1:27 PM, Jim Kusznir <jkusznir at gmail.com>
>>> wrote:
>>>> Hi all:
>>>>
>>>> I'm (once again) experiencing system instability that appears to be
>>>> traceable to pvfs2. Symptoms usually show up when one or more users
>>>> start long SCP sessions for transferring 5+GB of data lasting several
>>>> hours. I believe they usually have 1-3 sessions running in parallel.
>>>> Symptoms include:
>>>>
>>>> * High load averages (and climbing slowly with additional use) without
>>>> supporting CPU load. The ONLY way to recover from this is a reboot. My
>>>> load average is currently 7.10 with 99.8% idle CPU.
>>>> * hung SCP and other I/O processes
>>>> * large amounts of RAM "missing" (Currently free -m reports 7552MB in
>>>> use; adding up usage from all processes comes to about 1GB).
>>>> * Often (always?) some users' files become inaccessible (although
>>>> users have stopped reporting those problems as it's happened so
>>>> frequently).
>>>>
>>>> If I let this go a bit longer, there's a reasonable chance that the
>>>> machine will just spontaneously reboot. There's nothing logged as to
>>>> the cause. No OOM or other errors...Just one minute everything's
>>>> fine; the next it's booting up.
>>>>
>>>> Sometimes it will take a long time for these problems to build up (for
>>>> example, right now the system load and memory issues are here with a
>>>> couple days of "building"); sometimes the system will spontaneously
>>>> reboot several times in one day (with no notice of climbing loads or
>>>> the like).
>>>>
>>>> These problems so far have only happened on the head node (pvfs
>>>> client); our compute nodes have not shown this problem.
>>>>
>>>> System configuration:
>>>> Rocks 5.1 with manual pvfs setup (NOT using rocks-supplied PVFS
>>>> binaries or configurations)
>>>> pvfs 2.7.1 + patches from pcarns
>>>> 3 CentOS 5 dedicated PVFS servers (each with ~10TB storage, Dell PERC
>>>> 6/e + MD1000's)
>>>> PVFS servers are running over bonded dual-gig connections using the
>>>> Linux kernel ethernet bonding driver
>>>> Clients are single-gig connected.
>>>> no off-site pvfs2 access (scp/ssh/sftp access only, via the head node)
>>>>
>>>> Any suggestions?
>>>> I'm getting fairly desperate for help, as pvfs2 has been the main
>>>> destabilizing factor for the cluster since it went online, and causing
>>>> spontaneous reboots is not a good thing....
>>>>
>>>> Thanks!
>>>> --Jim
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> Pvfs2-users at beowulf-underground.org
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>
>>>
>>
>