Server Help

ASSS Questions - Debugging deadlocks

Samapico - Sun Feb 13, 2011 6:01 pm
Post subject: Debugging deadlocks
Any tips for debugging deadlocks? Is it even possible to know which module is locked?
Is GDB supposed to give debug info when a deadlock occurs (like the call stack and such) ?


I don't even know which event triggers the current deadlock issue I'm having, and it happens in the middle of gameplay, so it's kinda hard to spot the bug.
JoWie - Sun Feb 13, 2011 6:18 pm
Post subject:
I am not sure if GDB is automatically fired with the deadlock module (I think it does). But if it does not you can attach it and set a breakpoint in the exit call of deadlock.c
if you get the full backtrace you can see which threads are currently at a mutex lock.

You can also try using helgrind

Deadlocks often occur because 2 modules / routines acquire the locks of multiple mutex locks in different order.
All modules use the order of:
individual player locks
arena status
player status

If you for example use
Code: Show/Hide

pd->Lock();
aman->Lock();
...
aman->Lock();
pd->Lock();

You can cause a deadlock
Arnk Kilo Dylie - Sun Feb 13, 2011 6:29 pm
Post subject:
another method i used to use which i have not been bothered to write formally is to write macros for pthread_mutex_[un]lock which will call a function to store where locks are being held, and on deadlock a log of all active locks will be printed.

unixsignal should be calling gdb and producing an abrt backtrace on deadlock...
in this backtrace you can sometimes figure it out by searching for pthread_mutex_lock to see where things are waiting. however if it's due to corruption it's just going to confuse you.

and yeah double locks are some of the most frustrating things to debug, always have your internal mutexes inside any external locking you want to use, be very careful around messes like balls.c which hold mutexes between interface calls.

if you have backtraces and modules, i can help spot common pitfalls if you're willing.
Samapico - Sun Feb 13, 2011 6:33 pm
Post subject:
The server is running on a remote machine, using breakpoints would be quite tricky (or impossible?)

I'm only using pd->Lock / Unlock in my modules...

I guess I could add a log before each lock/unlock
Dr Brain - Sun Feb 13, 2011 8:56 pm
Post subject:
Take a look at the abrt_* files generated by the deadlock module first. They may tell what you need to know without modifying anything.
Samapico - Sun Feb 13, 2011 10:13 pm
Post subject:
emmm... where would those files be?

Looking at deadlock.c, I can't see where it would be created icon_surprised.gif
Dr Brain - Mon Feb 14, 2011 7:10 am
Post subject:
core/backtrace (no .c) is the shell script that generates the backtraces. I think it's called by unixsignal.c. My copy says bt/ is the directory, but on 1.4.4 they were placed into the asss root.
JoWie - Mon Feb 14, 2011 7:30 am
Post subject:
Do you call something (callback, interface, etc), that could lead back to your module?
If so, make sure you do not lock during the calls or use a recursive mutex

Code: Show/Hide

#include "pthread.h"
static pthread_mutex_t data_mtx;

EXPORT int MM_bla(int action, Imodman *mm_, Arena *arena)
{
   pthread_mutexattr_t attr;
   if (action == MM_LOAD)
   {
      
      pthread_mutexattr_init(&attr);
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
      pthread_mutex_init(&data_mtx, &attr);
      pthread_mutexattr_destroy(&attr);
      
   }
   else if (action == MM_UNLOAD)
   {

      pthread_mutex_destroy(&data_mtx);

   }
}

Arnk Kilo Dylie - Mon Feb 14, 2011 5:46 pm
Post subject:
don't get sloppy with recursive mutexes though, they can solve certain problems but you might make more of a headache if you rely on them (see: what I said about double locks.)
Samapico - Mon Feb 14, 2011 6:39 pm
Post subject:
Dr Brain wrote:
core/backtrace (no .c) is the shell script that generates the backtraces. I think it's called by unixsignal.c. My copy says bt/ is the directory, but on 1.4.4 they were placed into the asss root.
And.... what if I don't see anything?


Also, there's a segmentation fault that doesn't give me any debug info either... could there be something wrong with GDB itself on the server?
Dr Brain - Mon Feb 14, 2011 7:31 pm
Post subject:
Is GDB installed?
Samapico - Mon Feb 14, 2011 8:14 pm
Post subject:
Well, it was, but Lynx might have broken something.... tongue.gif
Samapico - Mon Feb 14, 2011 10:57 pm
Post subject:
It is installed...

Code: Show/Hide


$ gdb
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.


And yet:
Code: Show/Hide

[...]/zone$ bin/asss
asss 1.5.0rc1 built at Feb 14 2011 22:44:28
Loading modules...
I <cmod> loading C module 'deadlock' from 'internal'
I <cmod> loading C module 'prng' from 'internal'
I <cmod> loading C module 'logman' from 'internal'
I <cmod> loading C module 'mainloop' from 'internal'
I <cmod> loading C module 'config' from 'internal'
I <cmod> loading C module 'log_file' from 'internal'
I <cmod> loading C module 'playerdata' from 'internal'
I <cmod> loading C module 'lagdata' from 'internal'
I <cmod> loading C module 'bw_default' from 'internal'
I <cmod> loading C module 'net' from 'internal'
I <cmod> loading C module 'chatnet' from 'internal'
I <cmod> loading C module 'arenaman' from 'internal'
I <cmod> loading C module 'mapdata' from 'internal'
I <cmod> loading C module 'capman' from 'internal'
I <cmod> loading C module 'mapnewsdl' from 'internal'
I <cmod> loading C module 'core' from 'internal'
I <cmod> loading C module 'clientset' from 'internal'
I <cmod> loading C module 'persist' from 'scoring'
Segmentation fault
[...]/zone$


What I did, is that I put my src, conf, arenas and maps folder in a svn repository, and I keep them up to date on the server with that. So I can edit code locally, commit, and svn checkout on server and test.

It seems I broke a couple of things during this transition, cause as you see, it doesn't even get past the persist module...
If I had some trace I could have a clue of what's happening, but right now it's kinda hard to debug.
I did lots of comparisons between the pre-svn src folder and the post-svn one,

Note that this issue came after the deadlock that happens in one of my module... they're unrelated, but the lack of gdb stack trace is quite annoying in both cases.



here's the stderr output when I do 'make'... The make succeeds since these are only warnings.
Code: Show/Hide


~/asss/zone/src$ make > makelog.txt
Makefile:114: /home/parallel/asss/zone/build/deps.mk: No such file or directory
makedeps: ignoring option: -I/home/parallel/asss/zone/build
makedeps: ignoring option: -D_REENTRANT
makedeps: ignoring option: -D_GNU_SOURCE
makedeps: ignoring option: -fPIC
makedeps: ignoring option: -I/usr/include/db-4.6
makedeps: ignoring option: -I/usr/include/mysql
makedeps: ignoring option: -I/usr/include/python2.5
makedeps: warning: "pthread.h" not found
makedeps: warning: "zlib.h" not found
makedeps: warning: "mysql.h" not found
makedeps: warning: "db.h" not found
makedeps: warning: "errno.h" not found
makedeps: warning: "Python.h" not found
makedeps: warning: "structmember.h" not found
core/net.c: In function 'RelThread':
core/net.c:1783: warning: no return statement in function returning non-void
cmdlist/cmdlist.c: In function 'RemoveCommandInt':
cmdlist/cmdlist.c:65: warning: implicit declaration of function 'strcasecmp'
cmdlist/cmdlist.c: In function 'LLSort_CmdStringCompare':
cmdlist/cmdlist.c:92: warning: implicit declaration of function 'strcmp'
In file included from /usr/include/python2.5/Python.h:8,
                 from pymod/pymod.c:10:
/usr/include/python2.5/pyconfig.h:982:1: warning: "_XOPEN_SOURCE" redefined
In file included from /usr/include/signal.h:29,
                 from pymod/pymod.c:4:
/usr/include/features.h:160:1: warning: this is the location of the previous definition
pymod/pymod.c:387: warning: 'cvt_c2p_dict' defined but not used
pymod/pymod.c:488: warning: 'cvt_p2c_dict' defined but not used
turf/turf_reward.c: In function 'flagTag':
turf/turf_reward.c:1008: warning: 'l_pid' may be used uninitialized in this function

I attached a complete output of the make command (make &> makelogfull.txt)
JoWie - Tue Feb 15, 2011 4:43 am
Post subject:
I think you copied over modules.conf from a windows ASSS.

Add "unixsignal" as the first module in modules.conf
Samapico - Tue Feb 15, 2011 8:20 am
Post subject:
Ahhhhhhh.... yeah, I probably did that...

I'll compare with the stock modules.conf I guess tongue.gif

Any other things to check for that are different in win and linux?
Samapico - Tue Feb 15, 2011 11:33 pm
Post subject:
Alright.... so for some reason the persist module from scoring causes a segfault, here's the backtrace:

I might have added some stat codes, but other than that I don't see what could be different than in a stock server...
Code: Show/Hide
[Thread debugging using libthread_db enabled]
[New Thread 0xb4e28b70 (LWP 13693)]
[New Thread 0xb5629b70 (LWP 13692)]
[New Thread 0xb5e2ab70 (LWP 13691)]
[New Thread 0xb662bb70 (LWP 13690)]
[New Thread 0xb6e2cb70 (LWP 13689)]
[New Thread 0xb762db70 (LWP 13688)]
[New Thread 0xb7e2eb70 (LWP 13687)]
0xb7fdf402 in __kernel_vsyscall ()

Thread 8 (Thread 0xb7e2eb70 (LWP 13687)):
#0  0xb7fdf402 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7fbe876 in nanosleep () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#2  0x0804efa1 in fullsleep (millis=60000) at cmdlist/util.c:67
        ts = {tv_sec = 58, tv_nsec = 870625322}
#3  0x0809cd3e in thread_check (dummy=0x0) at core/deadlock.c:25
No locals.
#4  0xb7fb6955 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#5  0xb7efbe7e in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.

Thread 7 (Thread 0xb762db70 (LWP 13688)):
#0  0xb7fdf402 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7fbaf7f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#2  0x08050ed1 in MPRemove (q=0x80c23a0) at cmdlist/util.c:1283
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {135002048, 0,
                4001536, -1218260120, 1993351518, -1184167904},
              __mask_was_saved = 0}}, __pad = {0xb762d454, 0x0, 0x0, 0x0}}
        __cancel_routine = 0xb7fba300 <pthread_mutex_unlock>
        __cancel_arg = 0x80c23a8
        not_first_call = 0
        data = 0x0
#3  0x08059665 in thread_main (dummy=0x0) at core/mainloop.c:156
        wd = 0x0
#4  0xb7fb6955 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#5  0xb7efbe7e in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.

Thread 6 (Thread 0xb6e2cb70 (LWP 13689)):
#0  0xb7fdf402 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7fbaf7f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#2  0x08050ed1 in MPRemove (q=0x80c23a0) at cmdlist/util.c:1283
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {135002048, 0,
                4001536, -1226652824, 1995448669, -1184167904},
              __mask_was_saved = 0}}, __pad = {0xb6e2c454, 0x0, 0x0, 0x0}}
        __cancel_routine = 0xb7fba300 <pthread_mutex_unlock>
        __cancel_arg = 0x80c23a8
        not_first_call = 0
        data = 0x0
#3  0x08059665 in thread_main (dummy=0x0) at core/mainloop.c:156
        wd = 0x0
#4  0xb7fb6955 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#5  0xb7efbe7e in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.

Thread 5 (Thread 0xb662bb70 (LWP 13690)):
#0  0xb7fdf402 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7fbaf7f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#2  0x08050ed1 in MPRemove (q=0x80c23a0) at cmdlist/util.c:1283
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {135002048, 0,
                4001536, -1235045528, 1980768604, -1184167904},
              __mask_was_saved = 0}}, __pad = {0xb662b454, 0x0, 0x0, 0x0}}
        __cancel_routine = 0xb7fba300 <pthread_mutex_unlock>
        __cancel_arg = 0x80c23a8
        not_first_call = 0
        data = 0x0
#3  0x08059665 in thread_main (dummy=0x0) at core/mainloop.c:156
        wd = 0x0
#4  0xb7fb6955 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#5  0xb7efbe7e in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.

Thread 4 (Thread 0xb5e2ab70 (LWP 13691)):
#0  0xb7fdf402 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7ef5011 in select () from /lib/i686/cmov/libc.so.6
No symbol table info available.
#2  0x0805bab9 in RecvThread (dummy=0x0) at core/net.c:1268
        tv = {tv_sec = 0, tv_usec = 921000}
        myfds = {fds_bits = {224, 0 <repeats 31 times>}}
        selfds = {fds_bits = {224, 0 <repeats 31 times>}}
        maxfd = 7
        l = 0x0
#3  0xb7fb6955 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#4  0xb7efbe7e in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.

Thread 3 (Thread 0xb5629b70 (LWP 13692)):
#0  0xb7fdf402 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7ec7516 in nanosleep () from /lib/i686/cmov/libc.so.6
No symbol table info available.
#2  0xb7ef5a4c in usleep () from /lib/i686/cmov/libc.so.6
No symbol table info available.
#3  0x0805ce09 in SendThread (dummy=0x0) at core/net.c:1742
        conn = 0x0
        link = 0x0
        gtc = 933797352
        p = 0x0
        dropme = 0x0
        tofree = {start = 0x0, end = 0x0}
        tokill = {start = 0x0, end = 0x0}
#4  0xb7fb6955 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#5  0xb7efbe7e in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.

Thread 2 (Thread 0xb4e28b70 (LWP 13693)):
#0  0xb7fdf402 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7fbaf7f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#2  0x08050ed1 in MPRemove (q=0x80c28a0) at cmdlist/util.c:1283
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {135002048, 0,
                4001536, -1260223640, 1987060057, -1184167904},
              __mask_was_saved = 0}}, __pad = {0xb4e28454, 0x0, 0x0, 0x0}}
        __cancel_routine = 0xb7fba300 <pthread_mutex_unlock>
        __cancel_arg = 0x80c28a8
        not_first_call = 0
        data = 0x0
#3  0x0805ce33 in RelThread (dummy=0x0) at core/net.c:1758
        conn = 0x0
        buf = 0x0
        spot = 0
#4  0xb7fb6955 in start_thread () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#5  0xb7efbe7e in clone () from /lib/i686/cmov/libc.so.6
No symbol table info available.

Thread 1 (Thread 0xb7e2f6c0 (LWP 13686)):
#0  0xb7fdf402 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7ec702b in waitpid () from /lib/i686/cmov/libc.so.6
No symbol table info available.
#2  0xb7e68e63 in ?? () from /lib/i686/cmov/libc.so.6
No symbol table info available.
#3  0xb7e691f2 in system () from /lib/i686/cmov/libc.so.6
No symbol table info available.
#4  0xb7fbf3bd in system () from /lib/i686/cmov/libpthread.so.0
No symbol table info available.
#5  0x0809d1b2 in handle_sigsegv (sig=11) at core/unixsignal.c:105
        cmd = "/bin/sh bin/backtrace bin/asss 13686 segv", '\000' <repeats 86 times>
        type = 0x80ba65d "segv"
#6  <signal handler called>
No symbol table info available.
#7  0xb458ed53 in __mutex_get_align () from /usr/lib/libdb-4.6.so
No symbol table info available.
#8  0xb7fd0eb3 in init_db () at scoring/persist.c:925
        err = 0
#9  0xb7fd10ae in MM_persist (action=0, mm_=0x80bfa80, arena=0x0)
    at scoring/persist.c:994
No locals.
#10 0x0804ec26 in load_c_module (spec_=0xbf8626c8 "scoring:persist",
    args=0x80f3950) at main/cmod.c:179
        buf = "info_persist\000l/asss/zone/bin/scoring.so\000g.so\000\017\000\000\n\000\000\000\001\000\000\000\000\000\000\000\364\037\367\267\000\000\000\000\327&\206\277\000\000\000\000\000\000\000\000x\026\206\277V\362\004\b\310&\206\277\230\026\206\277\377", '\000' <repeats 11 times>"\300, \367\v\b\250&\206\277S0\005\b\310&\206\277\230\026\206\277\000\001", '\000' <repeats 14 times>, "scoring:persist\000\000 verifying that it's a valid load order.\000\000 move stuff", '\000' <repeats 3442 times>, "t\213\343\267\020ii\r\220$\206\277\026\212\376\267f"...
        spec = "scoring\000persist", '\000' <repeats 4080 times>
        modname = 0xbf860610 "persist"
        filename = 0xbf860608 "scoring"
        path = 0xbf861608 "info_persist"
        ret = 0
        cmd = 0x80f39a0
        lm = 0x80bfc20
#11 0x0804edd1 in loader (action=0, args=0x80f3950,
    line=0xbf8626c8 "scoring:persist", arena=0x0) at main/cmod.c:226
        cmd = 0x0
#12 0x0804d770 in LoadModule_ (spec=0xbf8626c8 "scoring:persist")
    at main/module.c:204
        ret = 1
        mod = 0x80f3950
        loadername = "c", '\000' <repeats 30 times>
        loader = 0x804ed74 <loader>
        t = 0xbf8626c8 "scoring:persist"
#13 0x0804cc7e in LoadModuleFile (fname=0x809dab8 "conf/modules.conf")
    at main/main.c:119
        line = "scoring:persist", '\000' <repeats 240 times>
        ret = 0
        ctx = 0x80e2350
#14 0x0804d020 in main (argc=1, argv=0xbf8628c4) at main/main.c:272
        code = -1208541196
A debugging session is active.

   Inferior 1 [process 13686] will be detached.

Quit anyway? (y or n) [answered Y; input not from terminal]

Arnk Kilo Dylie - Wed Feb 16, 2011 12:05 am
Post subject:
If persist is causing trouble, nuke your data/ folder and try again.
Dr Brain - Wed Feb 16, 2011 7:10 am
Post subject:
Yeah, it's segfaulting inside libdb, so it's probably nothing you did. Perhaps the data got corrupted, or was from a different version. As Arnk said, the quick solution is to empty data/.
Samapico - Wed Feb 16, 2011 8:28 am
Post subject:
i tried that, didn't seem to help...
Cheese - Wed Feb 16, 2011 1:26 pm
Post subject:
is it me or does persist seem to break far too often

whats going on here
Dr Brain - Wed Feb 16, 2011 7:48 pm
Post subject:
Maybe the libdb on your system is messed up? Try recompiling the scoring module too.
Samapico - Wed Feb 16, 2011 8:50 pm
Post subject:
well, I recompiled everything... (see make log file provided earlier)
Samapico - Sun Feb 20, 2011 1:00 am
Post subject:
There's something wrong when I compile the persist module, for sure... if I use the pre-built scoring.so, it runs fine. But if I compile it, it dies.
JoWie - Sun Feb 20, 2011 6:46 am
Post subject:
Try updating to libdb 4.8, make sure your makefile is correct after doing this.
Samapico - Sun Feb 20, 2011 1:36 pm
Post subject:
Ok.... so I finally got the backtrace of my deadlock, which was the reason I started this thread tongue.gif

Confusing stuff...
http://pastebin.com/jD2nJYHa


It seems to be triggered by an event of my module (sockey), yet that module still runs fine (for 10 seconds) after that event. So it's like another thread was locking because of it. Probably a lock() bug or something?
Arnk Kilo Dylie - Sun Feb 20, 2011 2:27 pm
Post subject:
That one is not the most helpful backtrace. Could you determine what specific event causes it?
Samapico - Sun Feb 20, 2011 2:57 pm
Post subject:
Here's my module...

http://pastebin.com/pYK5FLuj
JoWie - Sun Feb 20, 2011 3:33 pm
Post subject:
Have you tried valgrind --tool=helgrind

If all else fails and you know how to reliably trigger the bug, start uncommenting and unloading stuff until you know where it comes from.
Dr Brain - Sun Feb 20, 2011 4:38 pm
Post subject:
Line 1043, you return inside a lock. Unlock before the return.

EDIT: knew to look for this because the deadlock only had one thread waiting for a lock. All the others were happy. So it was clearly not a locking order issue, it was a case of a forgotten unlock.
Samapico - Sun Feb 20, 2011 5:10 pm
Post subject:
aaaaaaaaaaaaaah


... i was searching for exactly these kinds of things...... the eyes of someone else always see more clearly tongue.gif

thanks
Arnk Kilo Dylie - Sun Feb 20, 2011 6:19 pm
Post subject:
good practice would probably dictate to not return in the middle of a function, or at least not in the middle of a lock. then you just write lock, unlock, operations in between, and return a result at the end, and never worry.
Samapico - Sun Feb 20, 2011 7:21 pm
Post subject:
I added the locks afterwards, which is probably why I didn't notice this at the time... But yeah, obviously tongue.gif
All times are -5 GMT
View topic
Powered by phpBB 2.0 .0.11 © 2001 phpBB Group