[PVFS2-developers] does anyone know what this is? (fwd)

Rob Ross rross at mcs.anl.gov
Thu Jun 23 03:06:02 EDT 2005


This sounds great to me.  I also can't believe that I (we?) never 
knew/saw this before.

Rob

Walter B. Ligon III wrote:
> --------
> 
> well, I'm just a genius at finding obscure bugs!  ;-)
> 
> The macro is what I was figuring on, or something like that.
> OK, I'll start looking at that.
> 
> Anyone else with insights?
> 
> Walt
> 
> ps - Nathan, thanks for looking that up while I was at lunch!  :-)
> 
> 
>>Wow, that's amazing that this was never caught before in sockio.  The 
>>PVFS1 code path is the same here.  I wonder if we lucked out in the 
>>past, maybe with glibcs that happened to set errno as well even though 
>>it isn't required.  At any rate, I would think the error code could just 
>>be converted on the spot with a macro (bmi_h_errno_to_pvfs(), to go 
>>along with bmi_errno_to_pvfs()).  The man page only lists a handful of 
>>possible h_errno values to look for.
>>
>>-Phil
>>
>>
>>Walter B. Ligon III wrote:
>>
>>>-------- 
>>>
>>>OK, I believe the problem is that gethostbyname returns its error in
>>>h_errno rather than errno.  At least according to my man pages.  I
>>>will have to see if h_errno codes are non-overlapping with errno codes
>>>but I am assuming they are (otherwise why do it that way).  So the
>>>question is how to pass that error back out.  I can certainly use
>>>gossip to log an error on the spot.  Otherwise I can probably map
>>>h_errno codes to some unique codes and decode them later in gossip.
>>>
>>>Any other advice and/or suggestions on handling this?  Otherwise I'll
>>>just deal with it.
>>>
>>>Walt
>>>
>>>
>>>
>>>>If you've got time to knock it out, that would be great.  RobL and Sam 
>>>>are working on the test system and I'm dealing with a 7-hour time 
>>>>change, 90 degree days with no AC in the hotel, and a bunch of random 
>>>>deadlines...
>>>>
>>>>Rob
>>>>
>>>>Walter B. Ligon III wrote:
>>>>
>>>>
>>>>>--------
>>>>>Crap!  I cut/paste the pvfs2tab line from the quickstart and *thought*
>>>>>I had edited it.  But I didn't - still said "testhost"
>>>>>
>>>>>But we REALLY should catch this error and produce a meaningful error.
>>>>>Like "No route to host testhost" or "Host testhost refuses connection"
>>>>>or something.
>>>>>
>>>>>Can someone familiar with that code look at that, or do I need to?
>>>>>
>>>>>Walt
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>Whoops, just realized that the earlier message in this thread was on the 
>>>>>>wrong list, moving over now.
>>>>>>
>>>>>>I'm not really sure looking at the code how you could get a zero errno 
>>>>>>value out of a failure in that path.  You may need to just gdb break on 
>>>>>>BMI_sockio_connect_sock() when you run pvfs2-ping and see if you can 
>>>>>>tell what's failing or why the errno value isn't being set.
>>>>>>
>>>>>>I guess it's possible that BMI_sockio_connect_sock() isn't even being 
>>>>>>called at all (see blocks of code just before your bmi-tcp.c:1676 
>>>>>>error), but that shouldn't be the case in this client side code path 
>>>>>>unless something has jumbled memory and cleared the hostname out of the 
>>>>>>bmi address structure, or if the hostname was broken somehow to begin 
>>>>>>with.
>>>>>>Anything strange in your pvfs2tab or fstab files?  Maybe the hostname is 
>>>>>>empty or something?
>>>>>>
>>>>>>Actually I just tried that: if I list the server as 
>>>>>>tcp://:3334/pvfs2-fs instead of tcp://localhost:3334/pvfs2-fs, then I 
>>>>>>see the same message you do.  The bmi address parser should probably 
>>>>>>check for that condition and stop things before it gets that far.
>>>>>>
>>>>>>-Phil
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>"bt" stands for back trace.  If you have back tracing enabled, then any 
>>>>>>>gossip_lerr() call not only prints the line number the message occurred 
>>>>>>>on, but also the stack trace.  The numbers off to the right are 
>>>>>>>addresses that you can convert to code locations with the "addr2line" 
>>>>>>>utility.
>>>>>>>
>>>>>>>The patch that I mentioned in response to Brad's email earlier happens 
>>>>>>>to also convert this gossip_lerr() call to a gossip_err() call; I don't 
>>>>>>>think that network/socket failures should result in a backtrace and line 
>>>>>>>number print; it's pretty confusing, as you have discovered :)
>>>>>>>
>>>>>>>As far as why it is failing in the first place, I don't have any clue at 
>>>>>>>the moment...
>>>>>>>
>>>>>>>-Phil
>>>>>>>
>>>>>>>Walter B. Ligon III wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>I've built the latest CVS build - none of my changes, installed it and
>>>>>>>>run pvfs2-ping, and I get this:
>>>>>>>>
>>>>>>>>[walt at sidious pvfs]> bin/pvfs2-ping -m /mnt/pvfs2
>>>>>>>>
>>>>>>>>(1) Parsing tab file...
>>>>>>>>
>>>>>>>>(2) Initializing system interface...
>>>>>>>>
>>>>>>>>(3) Initializing each file system found in tab file: /etc/pvfs2tab...
>>>>>>>>
>>>>>>>>[16:02:45.639438] src/io/bmi/bmi_tcp/bmi-tcp.c line 1676: Error: BMI_sockio_connect_sock: Success
>>>>>>>>[16:02:45.639742]       [bt] bin/pvfs2-ping [0x8086d91]
>>>>>>>>[16:02:45.639763]       [bt] bin/pvfs2-ping [0x8088923]
>>>>>>>>[16:02:45.639772]       [bt] bin/pvfs2-ping(BMI_tcp_post_sendunexpected_list+0xa6) [0x808664e]
>>>>>>>>[16:02:45.639781]       [bt] bin/pvfs2-ping(BMI_post_sendunexpected_list+0x166) [0x8073a2a]
>>>>>>>>[16:02:45.639790]       [bt] bin/pvfs2-ping(job_bmi_send_list+0x21b) [0x8078f07]
>>>>>>>>[16:02:45.639800]       [bt] bin/pvfs2-ping [0x807041f]
>>>>>>>>[16:02:45.639864]       [bt] bin/pvfs2-ping(vfprintf+0x3c9f) [0x8053973]
>>>>>>>>[16:02:45.639875]       [bt] bin/pvfs2-ping(PINT_client_state_machine_post+0x1cd) [0x8052bb9]
>>>>>>>>[16:02:45.639886]       [bt] bin/pvfs2-ping(PINT_server_get_config+0x12f) [0x8063ad7]
>>>>>>>>[16:02:45.639896]       [bt] bin/pvfs2-ping(PVFS_sys_fs_add+0xc6) [0x8053e02]
>>>>>>>>[16:02:45.639904]       [bt] bin/pvfs2-ping(main+0xdd) [0x805025d]
>>>>>>>>Broken pipe
>>>>>>>>[walt at sidious pvfs]>
>>>>>>>>
>>>>>>>>Very similar error messages, only this time they have some function names
>>>>>>>>embedded.  I've never seen this kind of message out of PVFS before, does
>>>>>>>>no one recognize these "bt" messages?
>>>>>>>>
>>>>>>>>Walt
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>I have a branch of the code I was working on at ANL, installed it down here
>>>>>>>>>and it is doing very different things.  In particular the client spit out
>>>>>>>>>this error which I'm having a hard time understanding:
>>>>>>>>>
>>>>>>>>>[walt at sidious pvfs]> bin/create.set.get.eattr /foo key1 value1
>>>>>>>>>[15:04:42.072103] src/io/bmi/bmi_tcp/bmi-tcp.c line 1676: Error: BMI_sockio_connect_sock: Success
>>>>>>>>>[15:04:42.072264]       [bt] bin/create.set.get.eattr [0x8081a99]
>>>>>>>>>[15:04:42.072278]       [bt] bin/create.set.get.eattr [0x808362b]
>>>>>>>>>[15:04:42.072291]       [bt] bin/create.set.get.eattr [0x8081356]
>>>>>>>>>[15:04:42.072304]       [bt] bin/create.set.get.eattr [0x806d53a]
>>>>>>>>>[15:04:42.072316]       [bt] bin/create.set.get.eattr [0x80729b3]
>>>>>>>>>[15:04:42.072329]       [bt] bin/create.set.get.eattr [0x806a733]
>>>>>>>>>[15:04:42.072341]       [bt] bin/create.set.get.eattr(vfprintf+0x366f) [0x804cc5b]
>>>>>>>>>[15:04:42.072377]       [bt] bin/create.set.get.eattr(vfprintf+0x289d) [0x804be89]
>>>>>>>>>[15:04:42.072389]       [bt] bin/create.set.get.eattr [0x805ddeb]
>>>>>>>>>[15:04:42.072402]       [bt] bin/create.set.get.eattr [0x807c7aa]
>>>>>>>>>[15:04:42.072414]       [bt] bin/create.set.get.eattr [0x8068166]
>>>>>>>>>Broken pipe
>>>>>>>>>[walt at sidious pvfs]>
>>>>>>>>>As you see it threw an error in BMI.  Ran that down and found where
>>>>>>>>>the BMI function that connects the socket returned <0 but the strerror
>>>>>>>>>translation is "Success" which doesn't make much sense to me.
>>>>>>>>>
>>>>>>>>>Then all of these lines starting [bt] followed by the command line
>>>>>>>>>string of the client program, and an unknown hex value.  I have no idea
>>>>>>>>>where that is coming from.
>>>>>>>>>
>>>>>>>>>Have I misconfigured something?  I thought I configured just like I did
>>>>>>>>>the last time on this RedHat EL box.  Anyone recognize this?  It may have
>>>>>>>>>something to do with my code, but my code should not have come into play
>>>>>>>>>yet, unless I did something terribly wrong.
>>>>>>>>>
>>>>>>>>>Walt
>>>>>>>>>
>>>>>>>
>>>>>>>_______________________________________________
>>>>>>>PVFS-developers mailing list
>>>>>>>PVFS-developers at www.beowulf-underground.org
>>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-developers
>>>>>>
>>>>>>_______________________________________________
>>>>>>PVFS2-developers mailing list
>>>>>>PVFS2-developers at beowulf-underground.org
>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>>>>
>>>>>
>>>>>
> 
> 
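
For reference, the failure mode discussed in this thread can be sketched in C roughly as follows: gethostbyname() reports its error through h_errno rather than errno, so errno can still be 0 and strerror() on glibc then prints "Success". The sketch also shows the kind of on-the-spot conversion that was proposed. It is only an illustration under stated assumptions: h_errno_to_errno(), describe_resolve_failure() and resolve_ipv4() are hypothetical names, and the errno values returned are placeholders, not real PVFS2/BMI error codes.

```c
/* Assumption: a glibc-style system where these names need _DEFAULT_SOURCE. */
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: map an h_errno value to an errno-style code.
 * The target codes are illustrative placeholders only. */
int h_errno_to_errno(int herr)
{
    switch (herr) {
    case HOST_NOT_FOUND: return ENOENT;        /* unknown host */
    case TRY_AGAIN:      return EAGAIN;        /* temporary resolver failure */
    case NO_RECOVERY:    return EIO;           /* unrecoverable resolver error */
    case NO_DATA:        return EADDRNOTAVAIL; /* name is valid but has no address */
    default:             return EINVAL;        /* anything unexpected */
    }
}

/* Hypothetical helper: build a message that names the host, using
 * hstrerror() for the resolver's own description of the failure. */
void describe_resolve_failure(const char *host, int herr,
                              char *buf, size_t len)
{
    snprintf(buf, len, "cannot resolve host %s: %s", host, hstrerror(herr));
}

/* Hypothetical wrapper: resolve an IPv4 address and return 0 on success
 * or a negative errno-style code. It reads h_errno, not errno, on failure. */
int resolve_ipv4(const char *host, struct in_addr *out)
{
    assert(host != NULL && out != NULL);

    struct hostent *hep = gethostbyname(host);
    if (hep == NULL)
        return -h_errno_to_errno(h_errno);

    if (hep->h_addrtype != AF_INET || hep->h_length != (int)sizeof(*out))
        return -EAFNOSUPPORT;

    memcpy(out, hep->h_addr_list[0], sizeof(*out));
    return 0;
}
```

A caller that gets a negative return from resolve_ipv4() could log the text from describe_resolve_failure() instead of passing an unset errno to strerror().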

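
The empty-hostname case mentioned in the thread (tcp://:3334/pvfs2-fs versus tcp://localhost:3334/pvfs2-fs) could be rejected before any connect attempt with a check along these lines. This is a minimal sketch, not the actual BMI address parser: tcp_uri_host_len() is a hypothetical function, and it assumes the address always has the form tcp://host:port/name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical check: return the length of the host part of a
 * "tcp://host:port/name" string, 0 if the host is empty, or -1 if the
 * string is NULL or does not start with "tcp://". */
int tcp_uri_host_len(const char *uri)
{
    static const char prefix[] = "tcp://";
    const size_t plen = sizeof(prefix) - 1;

    if (uri == NULL || strncmp(uri, prefix, plen) != 0)
        return -1;

    /* The host runs up to the first ':' (port) or '/' (fs name). */
    return (int)strcspn(uri + plen, ":/");
}
```

A parser could treat any result less than or equal to 0 as a malformed address and report it right away, naming the offending string.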
