[PVFS2-developers] does anyone know what this is? (fwd)

Walter B. Ligon III walt at CLEMSON.EDU
Wed Jun 22 14:20:26 EDT 2005


--------

well, I'm just a genius at finding obscure bugs!  ;-)

The macro is what I was figuring on, or something like that.
OK, I'll start looking at that.
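
A minimal sketch of the kind of conversion being discussed. The SKETCH_* codes and both function names are made up for illustration; they are not the real PVFS error values or the eventual bmi_h_errno_to_pvfs() macro. One reason a mapping is needed at all: on glibc the h_errno values are small integers (HOST_NOT_FOUND is 1), so they collide numerically with ordinary errno values and cannot be passed back unchanged.

```c
#define _DEFAULT_SOURCE   /* make sure glibc exposes the h_errno constants */
#include <netdb.h>
#include <stdio.h>

/* Hypothetical codes for illustration only; not the real PVFS error values.
 * They are chosen well above the usual errno range. */
enum sketch_err {
    SKETCH_ERR_HOST_NOT_FOUND = 1001,
    SKETCH_ERR_NO_DATA        = 1002,
    SKETCH_ERR_NO_RECOVERY    = 1003,
    SKETCH_ERR_TRY_AGAIN      = 1004,
    SKETCH_ERR_UNKNOWN        = 1099
};

/* Map an h_errno value (as left by gethostbyname) to a sketch code. */
int sketch_h_errno_to_err(int h_err)
{
    switch (h_err) {
    case HOST_NOT_FOUND: return SKETCH_ERR_HOST_NOT_FOUND;
    case NO_DATA:        return SKETCH_ERR_NO_DATA;
    case NO_RECOVERY:    return SKETCH_ERR_NO_RECOVERY;
    case TRY_AGAIN:      return SKETCH_ERR_TRY_AGAIN;
    default:             return SKETCH_ERR_UNKNOWN;
    }
}

/* Format a message that names the host, so the user sees what failed.
 * Returns whatever snprintf returns. */
int sketch_resolve_msg(char *buf, size_t len, const char *host, int h_err)
{
    const char *why;

    switch (h_err) {
    case HOST_NOT_FOUND: why = "host not found"; break;
    case NO_DATA:        why = "name is valid but has no address"; break;
    case NO_RECOVERY:    why = "unrecoverable name server error"; break;
    case TRY_AGAIN:      why = "temporary name server error, try again"; break;
    default:             why = "unknown resolver error"; break;
    }
    return snprintf(buf, len, "cannot resolve host %s: %s", host, why);
}
```

The idea would be to call something like this right where gethostbyname() returns NULL, passing in h_errno, instead of handing errno (which may still be 0) to strerror().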

Anyone else with insights?

Walt

ps - Nathan, thanks for looking that up while I was at lunch!  :-)

> Wow, that's amazing that this was never caught before in sockio.  The 
> PVFS1 code path is the same here.  I wonder if we lucked out in the 
> past, maybe with glibcs that happened to set errno as well even though 
> it isn't required.  At any rate, I would think the error code could just 
> be converted on the spot with a macro (bmi_h_errno_to_pvfs(), to go 
> along with bmi_errno_to_pvfs()).  The man page only lists a handful of 
> possible h_errno values to look for.
> 
> -Phil
> 
> 
> Walter B. Ligon III wrote:
> > -------- 
> > 
> > OK, I believe the problem is that gethostbyname returns its error in
> > h_errno rather than errno.  At least according to my man pages.  I
> > will have to see if h_errno codes are non-overlapping with errno codes
> > but I am assuming they are (otherwise why do it that way).  So the
> > question is how to pass that error back out.  I can certainly use
> > gossip to log an error on the spot.  Otherwise I can probably map
> > h_errno codes to some unique codes and decode them later in gossip.
> > 
> > Any other advice and/or suggestions on handling this?  Otherwise I'll
> > just deal with it.
> > 
> > Walt
> > 
> > 
> >>If you've got time to knock it out, that would be great.  RobL and Sam 
> >>are working on the test system and I'm dealing with a 7-hour time 
> >>change, 90 degree days with no AC in the hotel, and a bunch of random 
> >>deadlines...
> >>
> >>Rob
> >>
> >>Walter B. Ligon III wrote:
> >>
> >>>--------
> >>>Crap!  I cut/paste the pvfs2tab line from the quickstart and *thought*
> >>>I had edited it.  But I didn't - still said "testhost"
> >>>
> >>>But we REALLY should catch this error and produce a meaningful error.
> >>>Like "No route to host testhost" or "Host testhost refuses connection"
> >>>or something.
> >>>
> >>>Can someone familiar with that code look at that, or do I need to?
> >>>
> >>>Walt
> >>>
> >>>
> >>>
> >>>>Whoops, just realized that the earlier message in this thread was on the 
> >>>>wrong list, moving over now.
> >>>>
> >>>>I'm not really sure looking at the code how you could get a zero errno 
> >>>>value out of a failure in that path.  You may need to just gdb break on 
> >>>>BMI_sockio_connect_sock() when you run pvfs2-ping and see if you can 
> >>>>tell what's failing or why the errno value isn't being set.
> >>>>
> >>>>I guess it's possible that BMI_sockio_connect_sock() isn't even being 
> >>>>called at all (see blocks of code just before your bmi-tcp.c:1676 
> >>>>error), but that shouldn't be the case in this client side code path 
> >>>>unless something has jumbled memory and cleared the hostname out of the 
> >>>>bmi address structure, or if the hostname was broken somehow to begin with.
> >>>>
> >>>>Anything strange in your pvfs2tab or fstab files?  Maybe the hostname is 
> >>>> empty or something?
> >>>>
> >>>>Actually I just tried that- if I list the server as 
> >>>>tcp://:3334/pvfs2-fs instead of tcp://localhost:3334/pvfs2-fs, then I 
> >>>>see the same message you do.  The bmi address parser should probably 
> >>>>check for that condition and stop things before it gets that far.
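
As a rough illustration of that check (a sketch only: the function name is invented, and the assumption that addresses always have the form tcp://host:port/fsname comes from the two examples above, not from the actual BMI address parser):

```c
#include <string.h>

/* Return 1 if addr starts with "tcp://" and has a non-empty host part
 * before the port or path, 0 otherwise.  Hypothetical helper, not BMI code. */
int sketch_tcp_addr_has_host(const char *addr)
{
    static const char prefix[] = "tcp://";
    const size_t plen = sizeof(prefix) - 1;
    size_t hlen;

    if (addr == NULL || strncmp(addr, prefix, plen) != 0)
        return 0;

    /* Length of the host: everything up to the first ':' or '/'. */
    hlen = strcspn(addr + plen, ":/");
    return hlen > 0;
}
```

With this, tcp://localhost:3334/pvfs2-fs passes and tcp://:3334/pvfs2-fs is rejected before any connect is attempted.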
> >>>>
> >>>>-Phil
> >>>>
> >>>>
> >>>>
> >>>>>"bt" stands for back trace.  If you have back tracing enabled, then any 
> >>>>>gossip_lerr() call not only prints the line number the message occurred 
> >>>>>on, but also the stack trace.  The numbers off to the right are 
> >>>>>addresses that you can convert to code locations with the "addr2line" 
> >>>>>utility.
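
For anyone who has not seen this kind of output before, here is a small sketch of how such a trace can be produced with glibc's backtrace facility. The "[bt]" lines in this thread look like the format backtrace_symbols() emits, but whether gossip uses exactly these calls is an assumption on my part, and the function name below is invented.

```c
#include <execinfo.h>
#include <unistd.h>

#define SKETCH_MAX_FRAMES 32

/* Write the current call stack to stderr, one frame per line.
 * Returns the number of frames captured.  Hypothetical helper. */
int sketch_print_backtrace(void)
{
    void *frames[SKETCH_MAX_FRAMES];
    int n = backtrace(frames, SKETCH_MAX_FRAMES);

    /* Prints "binary(function+offset) [address]" where symbols are known,
     * or just "binary [address]" where they are not. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    return n;
}
```

An address from the trace can then be resolved against the binary, for example with addr2line -e bin/pvfs2-ping 0x8086d91, provided the binary was built with debugging information.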
> >>>>>
> >>>>>The patch that I mentioned in response to Brad's email earlier happens 
> >>>>>to also convert this gossip_lerr() call to a gossip_err() call; I don't 
> >>>>>think that network/socket failures should result in a backtrace and
> >>>>>line number print; it's pretty confusing, as you have discovered :)
> >>>>>
> >>>>>As far as why it is failing in the first place, I don't have any clue
> >>>>>at the moment...
> >>>>>
> >>>>>-Phil
> >>>>>
> >>>>>Walter B. Ligon III wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>>I've built the latest CVS build - none of my changes, installed it and
> >>>>>>run pvfs2-ping, and I get this:
> >>>>>>
> >>>>>>[walt at sidious pvfs]> bin/pvfs2-ping -m /mnt/pvfs2
> >>>>>>
> >>>>>>(1) Parsing tab file...
> >>>>>>
> >>>>>>(2) Initializing system interface...
> >>>>>>
> >>>>>>(3) Initializing each file system found in tab file: /etc/pvfs2tab...
> >>>>>>
> >>>>>>[16:02:45.639438] src/io/bmi/bmi_tcp/bmi-tcp.c line 1676: Error: BMI_sockio_connect_sock: Success
> >>>>>>[16:02:45.639742]       [bt] bin/pvfs2-ping [0x8086d91]
> >>>>>>[16:02:45.639763]       [bt] bin/pvfs2-ping [0x8088923]
> >>>>>>[16:02:45.639772]       [bt] bin/pvfs2-ping(BMI_tcp_post_sendunexpected_list+0xa6) [0x808664e]
> >>>>>>[16:02:45.639781]       [bt] bin/pvfs2-ping(BMI_post_sendunexpected_list+0x166) [0x8073a2a]
> >>>>>>[16:02:45.639790]       [bt] bin/pvfs2-ping(job_bmi_send_list+0x21b) [0x8078f07]
> >>>>>>[16:02:45.639800]       [bt] bin/pvfs2-ping [0x807041f]
> >>>>>>[16:02:45.639864]       [bt] bin/pvfs2-ping(vfprintf+0x3c9f) [0x8053973]
> >>>>>>[16:02:45.639875]       [bt] bin/pvfs2-ping(PINT_client_state_machine_post+0x1cd) [0x8052bb9]
> >>>>>>[16:02:45.639886]       [bt] bin/pvfs2-ping(PINT_server_get_config+0x12f) [0x8063ad7]
> >>>>>>[16:02:45.639896]       [bt] bin/pvfs2-ping(PVFS_sys_fs_add+0xc6) [0x8053e02]
> >>>>>>[16:02:45.639904]       [bt] bin/pvfs2-ping(main+0xdd) [0x805025d]
> >>>>>>Broken pipe
> >>>>>>[walt at sidious pvfs]>
> >>>>>>
> >>>>>>Very similar error messages, only this time they have some function names
> >>>>>>embedded.  I've never seen this kind of message out of PVFS before; does
> >>>>>>no one recognize these "bt" messages?
> >>>>>>
> >>>>>>Walt
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>I have a branch of the code I was working on at ANL, installed it down here
> >>>>>>>and it is doing very different things.  In particular the client spit out
> >>>>>>>this error which I'm having a hard time understanding:
> >>>>>>>
> >>>>>>>[walt at sidious pvfs]> bin/create.set.get.eattr /foo key1 value1
> >>>>>>>[15:04:42.072103] src/io/bmi/bmi_tcp/bmi-tcp.c line 1676: Error: BMI_sockio_connect_sock: Success
> >>>>>>>[15:04:42.072264]       [bt] bin/create.set.get.eattr [0x8081a99]
> >>>>>>>[15:04:42.072278]       [bt] bin/create.set.get.eattr [0x808362b]
> >>>>>>>[15:04:42.072291]       [bt] bin/create.set.get.eattr [0x8081356]
> >>>>>>>[15:04:42.072304]       [bt] bin/create.set.get.eattr [0x806d53a]
> >>>>>>>[15:04:42.072316]       [bt] bin/create.set.get.eattr [0x80729b3]
> >>>>>>>[15:04:42.072329]       [bt] bin/create.set.get.eattr [0x806a733]
> >>>>>>>[15:04:42.072341]       [bt] bin/create.set.get.eattr(vfprintf+0x366f) [0x804cc5b]
> >>>>>>>[15:04:42.072377]       [bt] bin/create.set.get.eattr(vfprintf+0x289d) [0x804be89]
> >>>>>>>[15:04:42.072389]       [bt] bin/create.set.get.eattr [0x805ddeb]
> >>>>>>>[15:04:42.072402]       [bt] bin/create.set.get.eattr [0x807c7aa]
> >>>>>>>[15:04:42.072414]       [bt] bin/create.set.get.eattr [0x8068166]
> >>>>>>>Broken pipe
> >>>>>>>[walt at sidious pvfs]>
> >>>>>>>As you see it threw an error in BMI.  Ran that down and found where
> >>>>>>>the BMI function that connects the socket returned <0 but the strerror
> >>>>>>>translation is "Success" which doesn't make much sense to me.
> >>>>>>>
> >>>>>>>Then all of these lines starting [bt] followed by the command line
> >>>>>>>string of the client program, and an unknown hex value.  I have no idea
> >>>>>>>where that is coming from.
> >>>>>>>
> >>>>>>>Have I misconfigured something?  I thought I configured just like I did
> >>>>>>>the last time on this RedHat EL box.  Anyone recognize this?  It may have
> >>>>>>>something to do with my code, but my code should not have come into play
> >>>>>>>yet, unless I did something terribly wrong.
> >>>>>>>
> >>>>>>>Walt
> >>>>>>>
> >>>>>
> >>>>>_______________________________________________
> >>>>>PVFS-developers mailing list
> >>>>>PVFS-developers at www.beowulf-underground.org
> >>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-developers
> >>>>
> >>>>_______________________________________________
> >>>>PVFS2-developers mailing list
> >>>>PVFS2-developers at beowulf-underground.org
> >>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
> >>>>
> >>>
> >>>
> > 
> 
> 

-- 
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University



