[Pvfs2-users] kzalloc undefined
Sam Lang
slang at mcs.anl.gov
Thu Feb 22 20:55:59 EST 2007
On Feb 21, 2007, at 2:49 PM, Trach-Minh Tran wrote:
> On 02/21/2007 06:42 PM, Sam Lang wrote:
>>
>> On Feb 21, 2007, at 11:28 AM, Trach-Minh Tran wrote:
>>
>>> On 02/21/2007 06:10 PM, Sam Lang wrote:
>>>>
>>>> On Feb 21, 2007, at 10:49 AM, Trach-Minh Tran wrote:
>>>>
>>>>> On 02/21/2007 05:18 PM, Sam Lang wrote:
>>>>>>
>>>>>> Hi Minh,
>>>>>>
>>>>>> I got the order of my AC_TRY_COMPILE arguments wrong. That was
>>>>>> pretty sloppy on my part. I've attached a patch that should fix
>>>>>> the error you're getting. I'm not sure it will apply cleanly to
>>>>>> the already patched 2.6.2 source that you have. Better to start
>>>>>> with a clean 2.6.2 tarball.
>>>>>
>>>>> Hi Sam,
>>>>>
>>>>> Thanks for your prompt response. I can now load the module. I will
>>>>> do some more tests with this 2.6.2 version. So far, using my MPI-IO
>>>>> program, I've found that it is not as stable as the 2.6.1 version:
>>>>> after about half an hour of running the test, 2 data servers (out
>>>>> of 8) have already died!
>>>>
>>>> That's surprising; the 2.6.2 release didn't include any changes to
>>>> the servers from 2.6.1. Did you get any messages in the server logs
>>>> on the nodes that died?
>>>>
>>>>>
>>>>> Do you think that I should stay with 2.6.1 + the misc-bug.patch
>>>>> from Murali?
>>>>
>>>> There aren't any other significant fixes in 2.6.2 besides support
>>>> for the latest Berkeley DB release and the misc-bug patch that you
>>>> mention, so using 2.6.1 shouldn't be a problem for you. That being
>>>> said, if the servers crash for you on 2.6.2, it's likely that they
>>>> will do so with 2.6.1 and you just haven't hit it yet. I'd also
>>>> like to figure out exactly what is causing the servers to crash.
>>>> Can you send your MPI-IO program to us?
>>>>
>>>
>>> Hi Sam,
>>>
>>> There is nothing in the server logs! Maybe tomorrow (it is now
>>> 6:30 pm here) I will have more information from the MPI-IO runs I've
>>> just submitted.
>>
>> Rob thinks this might be related to the ROMIO ad_pvfs bug reported a
>> couple of days ago, but even so, corruption on the client shouldn't
>> cause the servers to segfault (esp. if the corruption is outside the
>> PVFS system interfaces). If possible, it would be great to get a
>> stack trace from one of the crashed servers.
>
> Hi Sam,
>
> How can I get the stack trace of the pvfs2 server when it dies?
> I have run another series of tests with the MPI-IO program for another
> hour, but none of the servers died! I can add that when one of the
> servers died previously, I got the following messages from my MPI
> program, while nothing appeared in the pvfs2_server.log file:
Hi Minh,
To get the stack trace, you need to configure pvfs with
--enable-segv-backtrace. This will cause the segfault to print a stack
trace to the log where the segfault occurs.
-sam
>
> =====================================
> [E 17:26:51.714686] msgpair failed, will retry: Broken pipe
> [E 17:26:51.736877] handle_io_error: flow proto error cleanup started on 0x6fd870, error_code: -1073741973
> [E 17:26:51.737091] handle_io_error: flow proto 0x6fd870 canceled 0 operations, will clean up.
> [E 17:26:51.737108] handle_io_error: flow proto 0x6fd870 error cleanup finished, error_code: -1073741973
> [E 17:26:53.734663] msgpair failed, will retry: Connection refused
> [E 17:26:55.754647] msgpair failed, will retry: Connection refused
> [E 17:26:57.774636] msgpair failed, will retry: Connection refused
> [E 17:26:59.794622] msgpair failed, will retry: Connection refused
> [E 17:27:01.814610] msgpair failed, will retry: Connection refused
> [E 17:27:01.814651] *** msgpairarray_completion_fn: msgpair to server tcp://io4:3334 failed: Connection refused
> [E 17:27:01.814666] *** Out of retries.
> =====================================
>
> -Minh.
>
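Since the server log stays empty in this case, the client-side messages are the only record of which server went away and when. Below is a minimal Python sketch for pulling that out of such output. It is only an illustration: the function name summarize and the two patterns are made up for this example and are not part of PVFS, and it assumes each message sits on a single line (not wrapped by the mailer) in the format quoted above.

```python
import re

# Patterns modelled on the client messages quoted above (assumed format,
# one message per line; not an official PVFS log specification).
RETRY = re.compile(r"^\[E ([\d:.]+)\] msgpair failed, will retry: (.+)$")
FAILED = re.compile(
    r"^\[E ([\d:.]+)\] \*\*\* msgpairarray_completion_fn: "
    r"msgpair to server (\S+) failed: (.+)$"
)


def summarize(lines):
    """Return (retries, failed).

    retries: list of (timestamp, reason) for each "will retry" message.
    failed:  list of (timestamp, server, reason) for each final failure.
    """
    retries, failed = [], []
    for line in lines:
        line = line.strip()
        m = RETRY.match(line)
        if m:
            retries.append((m.group(1), m.group(2)))
            continue
        m = FAILED.match(line)
        if m:
            failed.append((m.group(1), m.group(2), m.group(3)))
    return retries, failed


# A few of the lines from the excerpt above, unwrapped.
sample = [
    "[E 17:26:51.714686] msgpair failed, will retry: Broken pipe",
    "[E 17:26:53.734663] msgpair failed, will retry: Connection refused",
    "[E 17:26:55.754647] msgpair failed, will retry: Connection refused",
    "[E 17:27:01.814651] *** msgpairarray_completion_fn: msgpair to "
    "server tcp://io4:3334 failed: Connection refused",
    "[E 17:27:01.814666] *** Out of retries.",
]

retries, failed = summarize(sample)
for when, server, reason in failed:
    print(f"{server} gave up at {when} ({reason}); "
          f"first retry at {retries[0][0]} ({retries[0][1]})")
```

The first retry timestamp gives a rough time at which the server stopped answering, which can help when matching against system logs on that node.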