[Pvfs2-users] PVFS2 installation problem
Florin Isaila
florin.isaila at gmail.com
Thu Jul 5 08:50:55 EDT 2007
Murali, you are great!! Many thanks, it works perfectly.
Best regards
Florin
On 7/4/07, Murali Vilayannur <murali.vilayannur at gmail.com> wrote:
> Florin,
> Many thanks for setting up an account on your system!
> Your system's aio callback/glibc libraries are broken.
> We have a configure option to work around this:
> --disable-aio-threaded-callbacks
> Like you mentioned though, the builds were broken.
> Attached diffs will fix that.
> Subsequently, I am able to do pvfs2-cp etc without any problems.
> I have already modified the pvfs-2.6.3 sources under
> /home/A40001/u72877927/florin/apps/pvfs-2.6.3 to build properly with
> the new option mentioned above.
> Let us know if you hit any problems!
> If you happen to upgrade your glibc, then do try pvfs2 with aio
> threaded callbacks. It might work; until then you will have to employ
> this workaround. There shouldn't be any noticeable performance drop, I
> think, although others can correct me if I am wrong on that.
> thanks,
> murali
>
> --- /tmp/gen-locks.h 2007-07-04 18:10:49.223697505 +0200
> +++ ./src/common/gen-locks/gen-locks.h 2007-07-04 08:59:27.603350642 +0200
> @@ -62,7 +62,7 @@
> #endif /* __GEN_POSIX_LOCKING__ */
>
>
> -#ifdef __GEN_NULL_LOCKING__
> +#if defined(__GEN_NULL_LOCKING__) && !defined(__GEN_POSIX_LOCKING__)
> /* this stuff messes around just enough to prevent warnings */
> typedef int gen_mutex_t;
> typedef unsigned long gen_thread_t;
>
>
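The effect of the one-line gen-locks.h change above can be illustrated with a small standalone sketch. It assumes a build in which both locking macros end up defined at once (the thread does not say exactly how the failing build got there, so treat that as a guess). The DEMO_* macros and demo_* names are hypothetical stand-ins, not the real PVFS2 identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two locking modes; assume both are set. */
#define DEMO_POSIX_LOCKING 1
#define DEMO_NULL_LOCKING 1

#ifdef DEMO_POSIX_LOCKING
/* Stand-in for the "real" lock type used with POSIX locking. */
typedef struct { long opaque[4]; } demo_mutex_t;
#endif

/* With a bare "#ifdef DEMO_NULL_LOCKING" this branch would also be compiled
 * here and the second typedef would conflict with the one above. The extra
 * !defined(...) test skips it whenever POSIX locking is selected. */
#if defined(DEMO_NULL_LOCKING) && !defined(DEMO_POSIX_LOCKING)
typedef int demo_mutex_t;
#define DEMO_NULL_ACTIVE 1
#endif

/* Returns 1 if the null-locking definitions were compiled in, else 0. */
int demo_null_locking_active(void)
{
#ifdef DEMO_NULL_ACTIVE
    return 1;
#else
    return 0;
#endif
}

/* Size of whichever demo_mutex_t definition survived preprocessing. */
size_t demo_mutex_size(void)
{
    return sizeof(demo_mutex_t);
}

/* Sanity check: with both macros set, the POSIX-style type must win. */
void demo_selftest(void)
{
    assert(demo_null_locking_active() == 0);
    assert(demo_mutex_size() == sizeof(long) * 4);
}
```

Changing the guard back to a plain #ifdef DEMO_NULL_LOCKING makes this file fail to compile with a conflicting-typedef error, which is one plausible form of the build breakage mentioned above.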
> On 7/3/07, Florin Isaila <florin.isaila at gmail.com> wrote:
> > Hi guys, many thanks for your replies. Unfortunately, in spite of all
> > attempts, it keeps having problems:
> >
> > 1) --disable-thread-safety does not compile
> > 2) --enable-nptl-workaround does not create the file system correctly
> > 3) Configured without any extra parameters and with debugging, it
> > gets stuck for 5 minutes and then tries again:
> >
> > [D 15:38:53.732290] Posted PVFS_SYS_IO (waiting for test)
> > [E 15:43:53.314350] job_time_mgr_expire: job time out: cancelling bmi
> > operation, job_id: 29.
> >
> > Murali, you mentioned live debugging. Would it be possible for you
> > to do that? If yes, just send me your public RSA/DSA keys and then
> > I will give you the machine details.
> >
> > Many thanks
> > Florin
> >
> >
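The two debug lines above are almost exactly five minutes apart, which lines up with the 300-second default job timeout mentioned elsewhere in this thread. A minimal sketch of that arithmetic follows; the function names are made up for illustration, and it assumes the timestamp always has the HH:MM:SS.uuuuuu shape seen in the log, with both stamps on the same day.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "HH:MM:SS.uuuuuu" (six fractional digits assumed) into
 * microseconds since midnight. Returns -1 if the string does not parse. */
long long stamp_to_us(const char *stamp)
{
    int h, m, s, us;

    assert(stamp != NULL);
    if (sscanf(stamp, "%2d:%2d:%2d.%6d", &h, &m, &s, &us) != 4)
        return -1;
    return (h * 3600LL + m * 60LL + s) * 1000000LL + us;
}

/* Elapsed microseconds between two same-day timestamps, start <= end. */
long long elapsed_us(const char *start, const char *end)
{
    long long a = stamp_to_us(start);
    long long b = stamp_to_us(end);

    assert(a >= 0 && b >= a);
    return b - a;
}
```

For the log above, elapsed_us("15:38:53.732290", "15:43:53.314350") gives 299582060 microseconds, i.e. about 299.58 seconds between posting the I/O and the timeout message.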
> > On 7/3/07, Phil Carns <pcarns at wastedcycles.org> wrote:
> > > Is it possible that a response just isn't making it back to the clients
> > > for some reason?
> > >
> > > If the client library can't find anything else to do, it will be normal
> > > for it to spend the majority of its time sleeping in either poll() or
> > > epoll() until some messages show up that it needs. It should give up
> > > eventually, but the job timeouts may be set rather high by default. It
> > > looks like the defaults are 300 second timeouts with 5 retries.
> > >
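The behaviour described in the paragraph above (sleep in poll() until something arrives, time out, retry, eventually give up) can be sketched roughly as below. This is not the PVFS2 job or BMI code; wait_readable_with_retries and its parameters are hypothetical, and whether "5 retries" means five attempts in total or one attempt plus five retries is not stated in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait until fd becomes readable. Each attempt sleeps in poll() for up to
 * timeout_ms; after max_attempts timed-out attempts the caller gives up.
 * Returns the 0-based attempt on which data arrived, or -1 on give-up or
 * error. An EINTR restart begins a fresh timeout, which is fine for a
 * sketch but would stretch the total wait in real code. */
int wait_readable_with_retries(int fd, int timeout_ms, int max_attempts)
{
    assert(timeout_ms >= 0 && max_attempts > 0);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        int rc;

        do {
            rc = poll(&pfd, 1, timeout_ms);
        } while (rc < 0 && errno == EINTR);

        if (rc < 0)
            return -1;                 /* poll() itself failed */
        if (rc > 0)
            return (pfd.revents & POLLIN) ? attempt : -1;
        /* rc == 0: this attempt timed out; a real client would cancel the
         * outstanding operation and repost it before waiting again. */
    }
    return -1;                         /* every attempt timed out */
}
```

With the defaults quoted in the thread this would be called with a 300000 ms timeout; a short timeout, as suggested for debugging, makes the time-out-and-retry cycle visible while you watch.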
> > > You might find some more information by setting the PVFS2_DEBUGMASK
> > > environment variable to "network" before running one of the pvfs2-*
> > > utilities that hangs. If that doesn't indicate anything useful you
> > > could try setting it to "verbose" to get even more debugging output. In
> > > conjunction with this you might want to set ClientJobBMITimeoutSecs and
> > > ClientJobFlowTimeoutSecs to something lower (like 30 seconds) so you can
> > > see if the client times out and retries while you watch.
> > >
> > > -Phil
> > >
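Setting PVFS2_DEBUGMASK for a single run, as suggested above, is normally done from the shell, but the same idea can be sketched in C as a small launcher. Only the variable name PVFS2_DEBUGMASK and the values "network" and "verbose" come from the thread; run_with_debugmask is a hypothetical helper, and the command it launches is whatever the caller supplies.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with PVFS2_DEBUGMASK set to mask in the child's environment
 * only. Returns the child's exit status, or -1 if it could not be started
 * or did not exit normally. */
int run_with_debugmask(const char *mask, char *const argv[])
{
    assert(mask != NULL && argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: export the debug mask, then replace ourselves. */
        if (setenv("PVFS2_DEBUGMASK", mask, 1) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);                    /* exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass the argument vector of the hanging pvfs2-* utility (for example pvfs2-cp with its source and destination) and "network" as the mask, switching to "verbose" if that does not show anything useful.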
> > > Murali Vilayannur wrote:
> > > > Hi Florin,
> > > > Thanks for getting back on that!
> > > > This is quite weird. It probably points to some platform-specific
> > > > library issue.
> > > > Since we do use threads, perhaps it is time to retry running configure
> > > > by disabling usage of threads and see if that helps?
> > > >
> > > > ./configure --disable-thread-safety is something you can try
> > > > perhaps ./configure --enable-nptl-workaround is also something you can
> > > > try (not together with the previous one though) to work around glibc
> > > > oddities.
> > > > Sam, RobL, Pete any ideas? I am lost..:(
> > > > Final alternative is to perhaps do a live debug on your machine if
> > > > possible..
> > > > thanks,
> > > > Murali
> > > >
> > > > On 7/2/07, Florin Isaila <florin.isaila at gmail.com> wrote:
> > > >> Hi,
> > > >>
> > > >> many thanks Murali. I have just tried that, but it keeps getting stuck
> > > >> with an even stranger stack trace:
> > > >>
> > > >> (gdb) bt
> > > >> #0 0x0ff4b2d0 in poll () from /lib/tls/libc.so.6
> > > >> #1 0x0ffc871c in ?? () from /lib/tls/libc.so.6
> > > >> #2 0x0ffc871c in ?? () from /lib/tls/libc.so.6
> > > >> Previous frame identical to this frame (corrupt stack?)
> > > >>
> > > >> Any other suggestions?
> > > >>
> > > >> Best regards
> > > >> Florin
> > > >>
> > > >> On 7/2/07, Murali Vilayannur <murali.vilayannur at gmail.com> wrote:
> > > >> > Hi Florin,
> > > >> > Given that both your backtraces point to epoll(), can you run make
> > > >> > clean followed by configure with --disable-epoll, rebuild everything
> > > >> > and see if that works?
> > > >> > If it does work, it probably points to some epoll specific bug on ppc
> > > >> > either in pvfs2 or the libepoll code..
> > > >> > thanks,
> > > >> > Murali
> > > >> >
> > > >> > On 7/2/07, Florin Isaila <florin.isaila at gmail.com> wrote:
> > > >> > > Hi,
> > > >> > >
> > > >> > > We have installed PVFS2 2.6.3 over Ethernet on a SUSE distribution,
> > > >> > > locally on a biprocessor (PowerPC 970FX) machine.
> > > >> > >
> > > >> > > Some commands like pvfs2-ping, pvfs2-mkdir, pvfs2-ls (w/o parameters)
> > > >> > > work fine.
> > > >> > >
> > > > >> > > But some pvfs2-* commands do not run. For instance,
> > > > >> > > pvfs2-cp gets stuck. Here is the gdb trace:
> > > >> > >
> > > >> > > (gdb) bt
> > > >> > > #0 0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
> > > >> > > #1 0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
> > > > >> > > incount=128, outcount=0xffff97b0, maps=0xffff93b0, status=0xffff95b0,
> > > >> > > poll_timeout=10, external_mutex=0x100d2ce0)
> > > >> > > at socket-collection-epoll.c:281
> > > >> > > #2 0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
> > > > >> > > #3 0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
> > > > >> > > outcount=0xffff9864, error_code_array=0x100d2b80,
> > > > >> > > actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0, max_idle_time=10,
> > > > >> > > context_id=0) at bmi-tcp.c:1303
> > > > >> > > #4 0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
> > > >> > > outcount=0x100d14cc, error_code_array=0x100d2b80,
> > > >> > > actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> > > >> > > max_idle_time_ms=10, context_id=0) at bmi.c:944
> > > >> > > #5 0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
> > > >> > > #6 0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
> > > >> > > at thread-mgr.c:815
> > > > >> > > #7 0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
> > > > >> > > #8 0x1007025c in job_testcontext (out_id_array_p=0xffff99d0,
> > > > >> > > inout_count_p=0xffff99b8, returned_user_ptr_array=0xffffd1d0,
> > > > >> > > out_status_array_p=0xffffa1d0, timeout_ms=10, context_id=1) at job.c:4068
> > > >> > > #9 0x1000fdb0 in PINT_client_state_machine_test (op_id=3,
> > > >> > > error_code=0xffffd670) at client-state-machine.c:536
> > > >> > > ---Type <return> to continue, or q <return> to quit---
> > > >> > > #10 0x1001041c in PINT_client_wait_internal (op_id=3,
> > > >> > > in_op_str=0x100b209c "fs_add", out_error=0xffffd670,
> > > >> > > in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
> > > > >> > > #11 0x10010734 in PVFS_sys_wait (op_id=3, in_op_str=0x100b209c "fs_add",
> > > > >> > > out_error=0xffffd670) at client-state-machine.c:861
> > > > >> > > #12 0x10035c4c in PVFS_sys_fs_add (mntent=0x100d3030) at fs-add.sm:205
> > > >> > > #13 0x1004c220 in PVFS_util_init_defaults () at pvfs2-util.c:1040
> > > >> > > #14 0x1000a5c8 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:135
> > > >> > >
> > > > >> > > At other times (but rarely) it gets stuck at a different place:
> > > >> > >
> > > >> > > (gdb) bt
> > > >> > > #0 0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
> > > >> > > #1 0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
> > > > >> > > incount=128, outcount=0xffff9b30, maps=0xffff9730, status=0xffff9930,
> > > >> > > poll_timeout=10, external_mutex=0x100d2ce0)
> > > >> > > at socket-collection-epoll.c:281
> > > >> > > #2 0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
> > > > >> > > #3 0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
> > > > >> > > outcount=0xffff9be4, error_code_array=0x100d2b80,
> > > > >> > > actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0, max_idle_time=10,
> > > > >> > > context_id=0) at bmi-tcp.c:1303
> > > > >> > > #4 0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
> > > >> > > outcount=0x100d14cc, error_code_array=0x100d2b80,
> > > >> > > actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> > > >> > > max_idle_time_ms=10, context_id=0) at bmi.c:944
> > > >> > > #5 0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
> > > >> > > #6 0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
> > > >> > > at thread-mgr.c:815
> > > > >> > > #7 0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
> > > > >> > > #8 0x1007025c in job_testcontext (out_id_array_p=0xffff9d50,
> > > > >> > > inout_count_p=0xffff9d38, returned_user_ptr_array=0xffffd550,
> > > > >> > > out_status_array_p=0xffffa550, timeout_ms=10, context_id=1) at job.c:4068
> > > >> > > #9 0x1000fdb0 in PINT_client_state_machine_test (op_id=28,
> > > >> > > error_code=0xffffda1c) at client-state-machine.c:536
> > > >> > > ---Type <return> to continue, or q <return> to quit---
> > > >> > > #10 0x1001041c in PINT_client_wait_internal (op_id=28,
> > > >> > > in_op_str=0x100ac1b8 "io", out_error=0xffffda1c,
> > > >> > > in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
> > > >> > > #11 0x10010734 in PVFS_sys_wait (op_id=28, in_op_str=0x100ac1b8 "io",
> > > >> > > out_error=0xffffda1c) at client-state-machine.c:861
> > > >> > > #12 0x1001b78c in PVFS_sys_io (ref=
> > > >> > > {handle = 1048570, fs_id = 1957135728, __pad1 = -26176},
> > > >> > > file_req=0x100d07d8, file_req_offset=0, buffer=0x40068008,
> > > >> > > mem_req=0x100efbd0, credentials=0xffffe060, resp_p=0xffffda90,
> > > >> > > io_type=PVFS_IO_WRITE) at sys-io.sm:363
> > > >> > > #13 0x1000b078 in generic_write (dest=0xffffddb0,
> > > >> > > buffer=0x40068008 "\177ELF\001\002\001", offset=0, count=2469777,
> > > >> > > credentials=0xffffe060) at pvfs2-cp.c:365
> > > >> > > #14 0x1000a824 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:180
> > > >> > >
> > > >> > >
> > > > >> > > After interrupting the program with Ctrl-C, the files appear to
> > > > >> > > have been created. Any clue where this could come from? It appears
> > > > >> > > that the metadata communication works but the data communication
> > > > >> > > does not.
> > > > >> > >
> > > > >> > > Below is the result of the ping command.
> > > >> > >
> > > >> > > Many thanks
> > > >> > > Florin
> > > >> > >
> > > >> > > pvfs2-ping -m ~/florin/mnt/pvfs2/
> > > >> > >
> > > >> > > (1) Parsing tab file...
> > > >> > >
> > > >> > > (2) Initializing system interface...
> > > >> > >
> > > > >> > > (3) Initializing each file system found in tab file:
> > > > >> > > /home/A40001/u72877927/florin/apps/etc/pvfs2tab...
> > > >> > >
> > > >> > > PVFS2 servers: tcp://localhost:55555
> > > >> > > Storage name: pvfs2-fs
> > > >> > > Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
> > > >> > > /home/A40001/u72877927/florin/mnt/pvfs2: Ok
> > > >> > >
> > > > >> > > (4) Searching for /home/A40001/u72877927/florin/mnt/pvfs2/ in pvfstab...
> > > >> > >
> > > >> > > PVFS2 servers: tcp://localhost:55555
> > > >> > > Storage name: pvfs2-fs
> > > >> > > Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
> > > >> > >
> > > >> > > meta servers:
> > > >> > > tcp://localhost:55555
> > > >> > >
> > > >> > > data servers:
> > > >> > > tcp://localhost:55555
> > > >> > >
> > > >> > > (5) Verifying that all servers are responding...
> > > >> > >
> > > >> > > meta servers:
> > > >> > > tcp://localhost:55555 Ok
> > > >> > >
> > > >> > > data servers:
> > > >> > > tcp://localhost:55555 Ok
> > > >> > >
> > > >> > > (6) Verifying that fsid 1957135728 is acceptable to all servers...
> > > >> > >
> > > >> > > Ok; all servers understand fs_id 1957135728
> > > >> > >
> > > >> > > (7) Verifying that root handle is owned by one server...
> > > >> > >
> > > >> > > Root handle: 1048576
> > > >> > > Ok; root handle is owned by exactly one server.
> > > >> > >
> > > >> > > =============================================================
> > > >> > >
> > > >> > > The PVFS2 filesystem at /home/A40001/u72877927/florin/mnt/pvfs2/
> > > >> > > appears to be correctly configured.
> > > >> > > _______________________________________________
> > > >> > > Pvfs2-users mailing list
> > > >> > > Pvfs2-users at beowulf-underground.org
> > > >> > > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> > > >> > >
> > > >> >
> > > >>
> > >
> > >
> >
>