View Issue Details

ID: 0000905
Project: LDMud 3.6
Category: General
View Status: public
Last Update: 2023-10-14 19:17
Reporter: paradox
Assigned To: Gnomi
Priority: normal
Severity: crash
Reproducibility: sometimes
Status: closed
Resolution: fixed
Platform: Ubuntu
OS: Linux
OS Version: 20.04
Fixed in Version: 3.6.7
Summary: 0000905: Repeat crashes from stack overlapping heap

Description:

Hi there,

I've seen several crashes that manifest as this form of output in stdout:

> 2022.03.12 08:18:32 Out of memory: Stack (0x7fcd03cf2d37..0x7ffcf52127a7) overlaps heap (0x7fcce98ce000..0x7fcd0c806000).

I haven't been able to pin down a cause related to load or any specific program in our mudlib. I've confirmed in each instance that the host had ample physical memory available. Graphs of the driver's own driver_info(DI_SIZE_MEMORY_UNUSED) statistic show free memory available at the times of the crashes. I suspect there is some kind of stack/heap corruption occurring and this isn't a true OOM event.

I've collected some example core dumps, binaries, and log snippets from some of the crashes we've observed. You can find a .tar.gz (~400mb) with this information here (please try to keep the cores private if possible - they have all of our player data resident and the `ldmud` binary has a mysql password compiled in):

https://binaryparadox.net/d/ldmud/220322/dune.3.5.4.crashes.tar.gz

Graphs from my monitoring showing the host, process and mud memory statistics over the period where two crashes were observed are here:

https://binaryparadox.net/d/ldmud/220322/driver.memory.png
https://binaryparadox.net/d/ldmud/220322/host.memory.png
https://binaryparadox.net/d/ldmud/220322/process.memory.png

I'm working on upgrading our production game to 3.6.6. Perhaps the crashes will stop at that point, but since I'm not sure they will and the update will take me some time, I thought I'd report this bug now.

Thank you for all of your help!

Steps To Reproduce: Unclear at this point.
Tags: No tags attached.

Activities

Gnomi

2022-03-23 22:45

manager   ~0002675

I don't have the full picture, because I don't have the debugging symbols for some of the libraries used. You can check whether you get meaningful backtraces yourself (gdb ldmud ldmud.core -ex bt).

But this is what I have gathered: the crash seems to happen in SSL routines called from Python in another thread. This multi-threading is a problem (and will be in 3.6 as well), because each non-main thread gets a stack that is allocated from the heap. So of course the stack appears to overlap the heap.
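To make the address ranges in the reported log line a bit more concrete, here is a small Python sketch that only does interval arithmetic on the numbers from that message. It does not reproduce the driver's own check, and the helper names (`parse_ranges`, `overlaps`) are made up for illustration.

```python
import re

# The log line quoted in the issue description.
LOG_LINE = (
    "2022.03.12 08:18:32 Out of memory: "
    "Stack (0x7fcd03cf2d37..0x7ffcf52127a7) "
    "overlaps heap (0x7fcce98ce000..0x7fcd0c806000)."
)


def parse_ranges(line):
    """Return ((stack_lo, stack_hi), (heap_lo, heap_hi)) parsed from the line."""
    addrs = [int(a, 16) for a in re.findall(r"0x[0-9a-fA-F]+", line)]
    if len(addrs) != 4:
        raise ValueError("expected exactly four hex addresses")
    return (addrs[0], addrs[1]), (addrs[2], addrs[3])


def overlaps(a, b):
    """True if the half-open intervals a and b share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


stack, heap = parse_ranges(LOG_LINE)

# Size of the reported "stack" range in GiB.
stack_span_gib = (stack[1] - stack[0]) / 2**30

# Does the low end of the reported stack range lie inside the heap range?
stack_low_in_heap = heap[0] <= stack[0] < heap[1]

print("ranges overlap:        ", overlaps(stack, heap))
print("stack low end in heap: ", stack_low_in_heap)
print("reported stack span GiB:", round(stack_span_gib, 1))
```

The reported stack range is far larger than any real stack, and its low end falls inside the heap range. That fits the explanation that one end of the range comes from a stack living in heap-allocated memory rather than from genuine memory exhaustion.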

Normally it would be sufficient to avoid using any LDMud routines (and thus LDMud memory allocations) in any non-main thread. The problem here is the SSL routines: LDMud configures the SSL library to use the LDMud memory management, and the SSL library will happily do so from another thread as well.

So what are the possible solutions?
1. Avoid using threads.
2. Avoid using LDMud memory management in the SSL library (comment out the line with CRYPTO_set_mem_functions in pkg-openssl.c).

We might think about introducing a configuration flag to disable registering our memory management routines in third-party libraries (we do this for GnuTLS, Iksemel, OpenSSL, SQLite and XML2).

paradox

2022-03-23 22:51

reporter   ~0002676

Thanks for taking a look!

> I don't have the full picture, because I don't have the debugging symbols for some of the libraries used. You can check whether you get meaningful backtraces yourself (gdb ldmud ldmud.core -ex bt).

That makes sense, sorry about that.

> 1. Avoid using threads.

Interesting! I'm not explicitly doing anything threaded, but I have been using a few async Python libraries (most notably aiohttp and discord.py). I'll see if I can figure out what might be doing work in a separate thread.
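For context, async code can still end up on worker threads without any explicit use of `threading`. The sketch below uses only the standard library and shows that work handed to asyncio's default executor runs on a different thread than the event loop. It says nothing about what aiohttp or discord.py actually do internally; `where_am_i` is a made-up helper.

```python
import asyncio
import threading


def where_am_i():
    """Return the identifier of the thread that executes this function."""
    return threading.get_ident()


async def main():
    loop = asyncio.get_running_loop()
    # The coroutine itself runs on the event loop's thread.
    loop_thread = threading.get_ident()
    # Passing None selects the loop's default executor, which is a thread pool.
    worker_thread = await loop.run_in_executor(None, where_am_i)
    return loop_thread, worker_thread


loop_thread, worker_thread = asyncio.run(main())
print("event loop thread:", loop_thread)
print("executor thread:  ", worker_thread)
print("same thread?      ", loop_thread == worker_thread)
```

Any library call that goes through the default executor in this way would show up as a non-main thread from the driver's point of view.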

> 2. Avoid using LDMud memory management in the SSL library (comment out the line with CRYPTO_set_mem_functions in pkg-openssl.c).

Can you help me understand what the disadvantages of this might be? It seems worth trying.

Gnomi

2022-03-23 23:00

manager   ~0002677

This depends on whether you believe that the LDMud allocator performs better than the system (libc) allocator. That was certainly true many years ago; whether it still is today, I don't know. Unfortunately we don't have any numbers on that.

So I can't say whether this is an advantage or a disadvantage: for SSL you'll get the performance of the system allocator instead of the LDMud allocator.

Another disadvantage is that LDMud can no longer keep track of these memory allocations, so the statistics you can gather in-game (driver_info()) will no longer cover the SSL allocations. (This is already true for any memory used by Python.)

paradox

2022-03-23 23:02

reporter   ~0002678

Oh ok! That makes sense. Sounds very reasonable. I will give this workaround a shot.

Do you have any thoughts on how I could catch what might be doing work on a separate thread? I'd love to fix the root cause too. I'll give it some thought myself but if you have ideas I'm all ears.

Thanks again Gnomi. Our players will really appreciate it if I can figure this out.

Gnomi

2022-03-23 23:18

manager   ~0002679

I cannot answer that. It would help to have meaningful backtraces to find out what started those threads.

You could try attaching gdb to your test instance and setting a breakpoint on pthread_create. Then you might catch the creation of a thread and get an idea of the situation.

paradox

2022-03-23 23:21

reporter   ~0002680

No problem. I'll see if I can get more meaningful backtraces and will pursue the other options we've discussed in parallel.

Many thanks! Your expertise here is invaluable.

If you'd like to close this bug, that's fine with me. It seems clearly in my court now and not an LDMud problem.

paradox

2022-04-27 23:55

reporter   ~0002681

Just wanted to follow up and say that disabling the LDMud allocator in pkg-openssl.c has prevented any further crashes. The culprit thread seems to come from the async http library we're using. I have to set up a build with the Python debug symbols to make more progress on figuring out why a new thread is spawned for something that should be done asynchronously on the existing one.

paradox

2022-04-27 23:55

reporter   ~0002682

> We might think about introducing a configuration flag to disable registering our memory management routines in third-party libraries (we do this for GnuTLS, Iksemel, OpenSSL, SQLite and XML2).

Would it be helpful if I made a PR for this? I would be interested.

paradox

2022-10-01 15:15

reporter   ~0002683

> I've collected some example core dumps, binaries, and log snippets from some of the crashes we've observed. You can find a .tar.gz (~400mb) with this information here (please try to keep the cores private if possible - they have all of our player data resident and the `ldmud` binary has a mysql password compiled in):

FYI: I've deleted the linked core dumps. This bug could be made public.

> Would it be helpful if I made a PR for this? I would be interested.

https://github.com/ldmud/ldmud/pull/80

paradox

2023-10-14 19:17

reporter   ~0002712

This can be closed, since PR 80 was merged and the `--disable-allocator-wrappers` flag at configure time resolves the issue for us.

Thanks!

Issue History

Date Modified | Username | Field | Change
2022-03-23 02:44 | paradox | New Issue |
2022-03-23 22:45 | Gnomi | Note Added: 0002675 |
2022-03-23 22:51 | paradox | Note Added: 0002676 |
2022-03-23 23:00 | Gnomi | Note Added: 0002677 |
2022-03-23 23:02 | paradox | Note Added: 0002678 |
2022-03-23 23:18 | Gnomi | Note Added: 0002679 |
2022-03-23 23:21 | paradox | Note Added: 0002680 |
2022-04-27 23:55 | paradox | Note Added: 0002681 |
2022-04-27 23:55 | paradox | Note Added: 0002682 |
2022-10-01 15:15 | paradox | Note Added: 0002683 |
2022-10-02 11:03 | zesstra | View Status | private => public
2022-11-13 23:18 | Gnomi | Assigned To | => Gnomi
2022-11-13 23:18 | Gnomi | Status | new => assigned
2022-11-13 23:19 | Gnomi | Project | LDMud 3.5 => LDMud 3.6
2022-11-13 23:19 | Gnomi | Category | Runtime => General
2022-11-13 23:19 | Gnomi | Status | assigned => resolved
2022-11-13 23:19 | Gnomi | Resolution | open => fixed
2022-11-13 23:19 | Gnomi | Fixed in Version | => 3.6.7
2023-10-14 19:17 | paradox | Status | resolved => closed
2023-10-14 19:17 | paradox | Note Added: 0002712 |