[PVFS2-developers] does anyone know what this is? (fwd)
Rob Ross
rross at mcs.anl.gov
Thu Jun 23 03:06:02 EDT 2005
This sounds great to me. I also can't believe that I (we?) never
knew/saw this before.
Rob
Walter B. Ligon III wrote:
> --------
>
> well, I'm just a genius at finding obscure bugs! ;-)
>
> The macro is what I was figuring on, or something like that.
> OK, I'll start looking at that.
>
> Anyone else with insights?
>
> Walt
>
> ps - Nathan, thanks for looking that up while I was at lunch! :-)
>
>
>>Wow, that's amazing that this was never caught before in sockio. The
>>PVFS1 code path is the same here. I wonder if we lucked out in the
>>past, maybe with glibcs that happened to set errno as well even though
>>it isn't required. At any rate, I would think the error code could just
>>be converted on the spot with a macro (bmi_h_errno_to_pvfs(), to go
>>along with bmi_errno_to_pvfs()). The man page only lists a handful of
>>possible h_errno values to look for.
>>
>>-Phil
>>
>>
>>Walter B. Ligon III wrote:
>>
>>>--------
>>>
>>>OK, I believe the problem is that gethostbyname returns its error in
>>>h_errno rather than errno, at least according to my man pages. I
>>>will have to see whether h_errno codes overlap with errno codes,
>>>but I am assuming they do (otherwise why do it that way). So the
>>>question is how to pass that error back out. I can certainly use
>>>gossip to log an error on the spot. Otherwise I can probably map
>>>h_errno codes to some unique codes and decode them later in gossip.
>>>
>>>Any other advice and/or suggestions on handling this? Otherwise I'll
>>>just deal with it.
>>>
>>>Walt
>>>
>>>
>>>
>>>>If you've got time to knock it out, that would be great. RobL and Sam
>>>>are working on the test system and I'm dealing with a 7-hour time
>>>>change, 90 degree days with no AC in the hotel, and a bunch of random
>>>>deadlines...
>>>>
>>>>Rob
>>>>
>>>>Walter B. Ligon III wrote:
>>>>
>>>>
>>>>>--------
>>>>>Crap! I cut and pasted the pvfs2tab line from the quickstart and
>>>>>*thought* I had edited it. But I hadn't - it still said "testhost"
>>>>>
>>>>>But we REALLY should catch this error and produce a meaningful error.
>>>>>Like "No route to host testhost" or "Host testhost refuses connection"
>>>>>or something.
>>>>>
>>>>>Can someone familiar with that code look at that, or do I need to?
>>>>>
>>>>>Walt
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>Whoops, just realized that the earlier message in this thread was on the
>>>>>>wrong list, moving over now.
>>>>>>
>>>>>>I'm not really sure looking at the code how you could get a zero errno
>>>>>>value out of a failure in that path. You may need to just gdb break on
>>>>>>BMI_sockio_connect_sock() when you run pvfs2-ping and see if you can
>>>>>>tell what's failing or why the errno value isn't being set.
>>>>>>
>>>>>>I guess it's possible that BMI_sockio_connect_sock() isn't even being
>>>>>>called at all (see blocks of code just before your bmi-tcp.c:1676
>>>>>>error), but that shouldn't be the case in this client side code path
>>>>>>unless something has jumbled memory and cleared the hostname out of the
>>>>>>bmi address structure, or if the hostname was broken somehow to begin
>>>>>>with.
>>>>>>
>>>>>>Anything strange in your pvfs2tab or fstab files? Maybe the hostname is
>>>>>>empty or something?
>>>>>>
>>>>>>Actually I just tried that- if I list the server as
>>>>>>tcp://:3334/pvfs2-fs instead of tcp://localhost:3334/pvfs2-fs, then I
>>>>>>see the same message you do. The bmi address parser should probably
>>>>>>check for that condition and stop things before it gets that far.
>>>>>>
>>>>>>-Phil
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>"bt" stands for back trace. If you have back tracing enabled, then any
>>>>>>>gossip_lerr() call not only prints the line number the message occurred
>>>>>>>on, but also the stack trace. The numbers off to the right are
>>>>>>>addresses that you can convert to code locations with the "addr2line"
>>>>>>>utility.
>>>>>>>
>>>>>>>The patch that I mentioned in response to Brad's email earlier happens
>>>>>>>to also convert this gossip_lerr() call to a gossip_err() call; I don't
>>>>>>>think that network/socket failures should result in a backtrace and line
>>>>>>>number print - it's pretty confusing, as you have discovered :)
>>>>>>>
>>>>>>>As far as why it is failing in the first place, I don't have any clue at
>>>>>>>the moment...
>>>>>>>
>>>>>>>-Phil
>>>>>>>
>>>>>>>Walter B. Ligon III wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>I've built the latest CVS build - none of my changes, installed it and
>>>>>>>>run pvfs2-ping, and I get this:
>>>>>>>>
>>>>>>>>[walt at sidious pvfs]> bin/pvfs2-ping -m /mnt/pvfs2
>>>>>>>>
>>>>>>>>(1) Parsing tab file...
>>>>>>>>
>>>>>>>>(2) Initializing system interface...
>>>>>>>>
>>>>>>>>(3) Initializing each file system found in tab file: /etc/pvfs2tab...
>>>>>>>>
>>>>>>>>[16:02:45.639438] src/io/bmi/bmi_tcp/bmi-tcp.c line 1676: Error: BMI_sockio_connect_sock: Success
>>>>>>>>[16:02:45.639742] [bt] bin/pvfs2-ping [0x8086d91]
>>>>>>>>[16:02:45.639763] [bt] bin/pvfs2-ping [0x8088923]
>>>>>>>>[16:02:45.639772] [bt] bin/pvfs2-ping(BMI_tcp_post_sendunexpected_list+0xa6) [0x808664e]
>>>>>>>>[16:02:45.639781] [bt] bin/pvfs2-ping(BMI_post_sendunexpected_list+0x166) [0x8073a2a]
>>>>>>>>[16:02:45.639790] [bt] bin/pvfs2-ping(job_bmi_send_list+0x21b) [0x8078f07]
>>>>>>>>[16:02:45.639800] [bt] bin/pvfs2-ping [0x807041f]
>>>>>>>>[16:02:45.639864] [bt] bin/pvfs2-ping(vfprintf+0x3c9f) [0x8053973]
>>>>>>>>[16:02:45.639875] [bt] bin/pvfs2-ping(PINT_client_state_machine_post+0x1cd) [0x8052bb9]
>>>>>>>>[16:02:45.639886] [bt] bin/pvfs2-ping(PINT_server_get_config+0x12f) [0x8063ad7]
>>>>>>>>[16:02:45.639896] [bt] bin/pvfs2-ping(PVFS_sys_fs_add+0xc6) [0x8053e02]
>>>>>>>>[16:02:45.639904] [bt] bin/pvfs2-ping(main+0xdd) [0x805025d]
>>>>>>>>Broken pipe
>>>>>>>>[walt at sidious pvfs]>
>>>>>>>>
>>>>>>>>Very similar error messages, only this time they have some function names
>>>>>>>>embedded. I've never seen this kind of message out of PVFS before; does
>>>>>>>>no one recognize these "bt" messages?
>>>>>>>>
>>>>>>>>Walt
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>I have a branch of the code I was working on at ANL, installed it
>>>>>>>>>down here, and it is doing very different things. In particular the
>>>>>>>>>client spit out this error which I'm having a hard time understanding:
>>>>>>>>>
>>>>>>>>>[walt at sidious pvfs]> bin/create.set.get.eattr /foo key1 value1
>>>>>>>>>[15:04:42.072103] src/io/bmi/bmi_tcp/bmi-tcp.c line 1676: Error: BMI_sockio_connect_sock: Success
>>>>>>>>>[15:04:42.072264] [bt] bin/create.set.get.eattr [0x8081a99]
>>>>>>>>>[15:04:42.072278] [bt] bin/create.set.get.eattr [0x808362b]
>>>>>>>>>[15:04:42.072291] [bt] bin/create.set.get.eattr [0x8081356]
>>>>>>>>>[15:04:42.072304] [bt] bin/create.set.get.eattr [0x806d53a]
>>>>>>>>>[15:04:42.072316] [bt] bin/create.set.get.eattr [0x80729b3]
>>>>>>>>>[15:04:42.072329] [bt] bin/create.set.get.eattr [0x806a733]
>>>>>>>>>[15:04:42.072341] [bt] bin/create.set.get.eattr(vfprintf+0x366f) [0x804cc5b]
>>>>>>>>>[15:04:42.072377] [bt] bin/create.set.get.eattr(vfprintf+0x289d) [0x804be89]
>>>>>>>>>[15:04:42.072389] [bt] bin/create.set.get.eattr [0x805ddeb]
>>>>>>>>>[15:04:42.072402] [bt] bin/create.set.get.eattr [0x807c7aa]
>>>>>>>>>[15:04:42.072414] [bt] bin/create.set.get.eattr [0x8068166]
>>>>>>>>>Broken pipe
>>>>>>>>>[walt at sidious pvfs]>
>>>>>>>>>As you see, it threw an error in BMI. I ran that down and found where
>>>>>>>>>the BMI function that connects the socket returned <0, but the strerror
>>>>>>>>>translation is "Success", which doesn't make much sense to me.
>>>>>>>>>
>>>>>>>>>Then there are all of these lines starting with [bt], followed by the
>>>>>>>>>command line string of the client program and an unknown hex value. I
>>>>>>>>>have no idea where that is coming from.
>>>>>>>>>
>>>>>>>>>Have I misconfigured something? I thought I configured just like I did
>>>>>>>>>the last time on this RedHat EL box. Anyone recognize this? It may have
>>>>>>>>>something to do with my code, but my code should not have come into play
>>>>>>>>>yet, unless I did something terribly wrong.
>>>>>>>>>
>>>>>>>>>Walt
>>>>>>>>>
>>>>>>>
>>>>>>>_______________________________________________
>>>>>>>PVFS-developers mailing list
>>>>>>>PVFS-developers at www.beowulf-underground.org
>>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-developers
>>>>>>
>>>>>>_______________________________________________
>>>>>>PVFS2-developers mailing list
>>>>>>PVFS2-developers at beowulf-underground.org
>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>>>>
>>>>>
>>>>>
>>
>
>
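
The thread settles on converting h_errno on the spot, alongside the existing bmi_errno_to_pvfs(). A minimal sketch of that idea in C follows. The names sketch_h_errno_to_err(), sketch_err_str(), sketch_resolve(), and the SKETCH_* codes are hypothetical placeholders for illustration, not PVFS2 identifiers; real code would map into PVFS2's own error space. The empty-hostname check reflects Phil's tcp://:3334/pvfs2-fs observation.

```c
/* Ask glibc to expose h_errno and its constants even under strict -std modes. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <netdb.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch only (not PVFS2's real values). */
enum sketch_err {
    SKETCH_OK = 0,
    SKETCH_EBADHOST,   /* hostname missing or empty */
    SKETCH_EHOSTNTFD,  /* resolver reported HOST_NOT_FOUND */
    SKETCH_ETRYAGAIN,  /* resolver reported TRY_AGAIN */
    SKETCH_ENORECVR,   /* resolver reported NO_RECOVERY */
    SKETCH_ENODATA,    /* resolver reported NO_DATA */
    SKETCH_EUNKNOWN    /* any other h_errno value */
};

/* Convert an h_errno value into one of the sketch codes above.
 * NO_ADDRESS is not listed separately because some systems define it
 * as a synonym for NO_DATA, which would create a duplicate case. */
int sketch_h_errno_to_err(int h_err)
{
    switch (h_err) {
    case HOST_NOT_FOUND: return SKETCH_EHOSTNTFD;
    case TRY_AGAIN:      return SKETCH_ETRYAGAIN;
    case NO_RECOVERY:    return SKETCH_ENORECVR;
    case NO_DATA:        return SKETCH_ENODATA;
    default:             return SKETCH_EUNKNOWN;
    }
}

/* Human-readable text for a sketch code, suitable for a log message. */
const char *sketch_err_str(int err)
{
    switch (err) {
    case SKETCH_OK:        return "success";
    case SKETCH_EBADHOST:  return "hostname is missing or empty";
    case SKETCH_EHOSTNTFD: return "unknown host";
    case SKETCH_ETRYAGAIN: return "temporary name server failure";
    case SKETCH_ENORECVR:  return "unrecoverable name server failure";
    case SKETCH_ENODATA:   return "host has no address record";
    default:               return "unknown resolver error";
    }
}

/* Resolve a hostname. Returns 0 and sets *out on success, or a negative
 * sketch code on failure. errno is never consulted, because
 * gethostbyname() reports its failures through h_errno. */
int sketch_resolve(const char *host, struct hostent **out)
{
    struct hostent *hp;

    assert(out != NULL);

    /* Reject an empty hostname before it ever reaches the resolver. */
    if (host == NULL || host[0] == '\0')
        return -SKETCH_EBADHOST;

    hp = gethostbyname(host);
    if (hp == NULL)
        return -sketch_h_errno_to_err(h_errno);

    *out = hp;
    return 0;
}
```

A caller could then log sketch_err_str(-ret) through gossip_err() instead of passing errno to strerror(), which appears to be what produced the misleading "Success" text in the logs above.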