View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000905||LDMud 3.6||General||public||2022-03-23 02:44||2023-10-14 19:17|
|Fixed in Version||3.6.7|
|Summary||0000905: Repeat crashes from stack overlapping heap|
I've seen several crashes that manifest as this form of output in stdout:
> 2022.03.12 08:18:32 Out of memory: Stack (0x7fcd03cf2d37..0x7ffcf52127a7) overlaps heap (0x7fcce98ce000..0x7fcd0c806000).
I haven't been able to pin down a cause related to load or any specific program in our mudlib. I've confirmed in each instance that we have ample physical memory available on the host. Graphs of the driver's own driver_info(DI_SIZE_MEMORY_UNUSED) statistic show free memory available at the time of each crash. I suspect there is some kind of stack/heap corruption occurring and this isn't a true OOM event.
I've collected up some example core dumps, binaries, and log snippets from some of the crashes we've observed. You can find a .tar.gz (~400mb) with this information here (please try to keep the cores private if possible - they have all of our player data resident and the `ldmud` binary has a mysql password compiled in):
Graphs from my monitoring showing the host, process and mud memory statistics over the period where two crashes were observed are here:
I'm working on upgrading our production game to 3.6.6. Perhaps the crashes will stop at that point, but since I'm not sure they will, and the update will take me some time, I thought I'd report this bug now.
Thank you for all of your help!
|Steps To Reproduce||Unclear at this point.|
|Tags||No tags attached.|
I don't have a full picture, because I don't have the debugging symbols for some of the libraries used. You can check whether you get meaningful backtraces yourself (gdb ldmud ldmud.core -ex bt).
But this is what I have gathered: the crash seems to happen in SSL routines called from Python in another thread. This multi-threading is a problem (and will remain one in 3.6), because each non-main thread gets a stack that is allocated from the heap. So of course the stack seems to overlap with the heap.
Normally it would be sufficient to avoid calling any LDMud routines (and thus LDMud memory allocations) in any non-main thread. The problem here is the SSL routines: LDMud configures the SSL library to use the LDMud memory management, and the SSL library will happily do so in another thread as well.
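The addresses in the reported log line are consistent with this explanation, which a little arithmetic makes visible. The sketch below is not the driver's code; it assumes the check amounts to a plain interval comparison and only uses the two ranges copied from the log line. The helper names `overlaps` and `contains` are made up for illustration.

```python
# Address ranges copied verbatim from the "Out of memory" log line.
STACK = (0x7fcd03cf2d37, 0x7ffcf52127a7)
HEAP = (0x7fcce98ce000, 0x7fcd0c806000)


def overlaps(a, b):
    """Return True if the half-open ranges a and b share any address."""
    return a[0] < b[1] and b[0] < a[1]


def contains(rng, addr):
    """Return True if addr lies inside the half-open range rng."""
    return rng[0] <= addr < rng[1]


if __name__ == "__main__":
    print("ranges overlap:              ", overlaps(STACK, HEAP))
    print("low stack bound inside heap: ", contains(HEAP, STACK[0]))
    print("high stack bound inside heap:", contains(HEAP, STACK[1]))
    span_gib = (STACK[1] - STACK[0]) / 2**30
    print("reported stack span: %.1f GiB" % span_gib)
```

The low end of the reported "stack" falls inside the heap while the high end sits far above it, and the span between them is far larger than any real stack. That is what one would expect if the low bound was taken from a stack pointer in a thread whose stack was allocated from the heap.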
So what are possible solutions:
1. Avoid using threads.
2. Avoid using LDMud memory management in the SSL library (comment out the line with CRYPTO_set_mem_functions in pkg-openssl.c).
We might think about introducing a configuration flag to disable registering our memory management routines in third-party libraries (we do this for GnuTLS, Iksemel, OpenSSL, SQLite and XML2).
Thanks for taking a look!
> I don't have a full picture, because I don't have the debugging symbols for some of the libraries used. You can check whether you get meaningful backtraces yourself (gdb ldmud ldmud.core -ex bt).
That makes sense, sorry about that.
> 1. Avoid using threads.
Interesting! I'm not explicitly doing anything threaded but have been using a few async Python libraries (aiohttp and discordpy most notably). I'll see if I can figure out what might be doing work in a separate thread.
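For context, async Python code can end up on worker threads without any explicit threading: the stock asyncio event loop hands `run_in_executor(None, ...)` calls to a default thread pool, and its own `loop.getaddrinfo()` goes through that same pool. Whether aiohttp or discord.py take such a path in this setup is not established here; the sketch below only demonstrates the asyncio mechanism using the standard library. The function `current_thread_name` is a hypothetical stand-in for real work.

```python
import asyncio
import threading


def current_thread_name():
    """Report the name of whichever thread executes this call."""
    return threading.current_thread().name


async def main():
    loop = asyncio.get_running_loop()
    # Code running directly in a coroutine stays on the event loop's thread.
    on_loop = current_thread_name()
    # Passing None selects the loop's default ThreadPoolExecutor, so this
    # call is executed by a worker thread rather than by the loop thread.
    in_executor = await loop.run_in_executor(None, current_thread_name)
    # Snapshot of all live threads while the worker still exists.
    alive = [t.name for t in threading.enumerate()]
    return on_loop, in_executor, alive


if __name__ == "__main__":
    on_loop, in_executor, alive = asyncio.run(main())
    print("coroutine ran on:   ", on_loop)
    print("executor call ran on:", in_executor)
    print("live threads:        ", alive)
```

Any native library called from such a worker thread (OpenSSL, for instance) then runs on that thread's stack.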
> 2. Avoid using LDMud memory management in the SSL library (comment out the line with CRYPTO_set_mem_functions in pkg-openssl.c).
Can you help me understand what the disadvantages of this might be? It seems worth trying.
This depends on whether you believe that the LDMud allocator performs better than the system (libc) allocator. This was certainly true many years ago. Whether it is still true nowadays I don't know. We unfortunately don't have any numbers on that.
So I can't say whether this is an advantage or a disadvantage: in SSL you'll get the performance of the system allocator instead of the LDMud allocator.
Another disadvantage is that LDMud can no longer keep track of those memory allocations. So the statistics you can gather in-game (driver_info()) will no longer cover the SSL allocations. (This is already true for any memory used by Python.)
Oh ok! That makes sense. Sounds very reasonable. I will give this workaround a shot.
Do you have any thoughts on how I could catch what might be doing work on a separate thread? I'd love to fix the root cause too. I'll give it some thought myself but if you have ideas I'm all ears.
Thanks again Gnomi. Our players will really appreciate it if I can figure this out.
I cannot answer that. It would help to have meaningful backtraces to find out what started those threads.
You could try attaching gdb to your test instance and setting a breakpoint on pthread_create. Then you might catch the creation of a thread and get an idea of the situation.
No problem. I'll see if I can get more meaningful backtraces and will pursue the other options we've discussed in parallel.
Many thanks! Your expertise here is invaluable.
If you'd like to close this bug that's fine with me. It seems clearly in my court now and not an LDMud problem.
Just wanted to follow up and say that disabling the LDMud allocator for pkg-openssl.c has prevented any further crashes. The culprit thread seems to come from the async http library we're using. I have to set up a build with the python debug symbols to make more progress on figuring out why a new thread is being spawned for something that should be done async on the existing one.
> We might think about introducing a configuration flag to disable registering our memory management routines in third-party libraries (we do this for GnuTLS, Iksemel, OpenSSL, SQLite and XML2).
Would it be helpful if I made a PR for this? I would be interested.
> I've collected up some example core dumps, binaries, and log snippets from some of the crashes we've observed. You can find a .tar.gz (~400mb) with this information here (please try to keep the cores private if possible - they have all of our player data resident and the `ldmud` binary has a mysql password compiled in):
FYI: I've deleted the linked heapdumps. This bug could be made public.
> Would it be helpful if I made a PR for this? I would be interested.
This can be closed since PR 80 was merged and the `--disable-allocator-wrappers` flag at configure time resolves the issue for us.
|2022-03-23 02:44||paradox||New Issue|
|2022-03-23 22:45||Gnomi||Note Added: 0002675|
|2022-03-23 22:51||paradox||Note Added: 0002676|
|2022-03-23 23:00||Gnomi||Note Added: 0002677|
|2022-03-23 23:02||paradox||Note Added: 0002678|
|2022-03-23 23:18||Gnomi||Note Added: 0002679|
|2022-03-23 23:21||paradox||Note Added: 0002680|
|2022-04-27 23:55||paradox||Note Added: 0002681|
|2022-04-27 23:55||paradox||Note Added: 0002682|
|2022-10-01 15:15||paradox||Note Added: 0002683|
|2022-10-02 11:03||zesstra||View Status||private => public|
|2022-11-13 23:18||Gnomi||Assigned To||=> Gnomi|
|2022-11-13 23:18||Gnomi||Status||new => assigned|
|2022-11-13 23:19||Gnomi||Project||LDMud 3.5 => LDMud 3.6|
|2022-11-13 23:19||Gnomi||Category||Runtime => General|
|2022-11-13 23:19||Gnomi||Status||assigned => resolved|
|2022-11-13 23:19||Gnomi||Resolution||open => fixed|
|2022-11-13 23:19||Gnomi||Fixed in Version||=> 3.6.7|
|2023-10-14 19:17||paradox||Status||resolved => closed|
|2023-10-14 19:17||paradox||Note Added: 0002712|