Hot-reloading Lua code causes a skynet worker thread to deadlock #1637

Open
wanghongshuai137 opened this issue Sep 2, 2022 · 4 comments
Comments

@wanghongshuai137

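On our live game server, players reported lag this morning. When I checked the server, CPU usage and load were both very low. The log showed a large number of "endless" messages for one service, and that service produced no other log output (other services were normal). Given this, I ran "pstack PID" to inspect the threads of the skynet process and found that Thread 2 was abnormal, as shown below: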

Thread 2 (Thread 0x7f19cf5f7700 (LWP 27881)):
0  0x00007f19dad6989c in __lll_lock_wait_private () from /lib64/libc.so.6
1  0x00007f19dacdb45d in _L_lock_121 () from /lib64/libc.so.6
2  0x00007f19dacd9023 in __GI__IO_un_link () from /lib64/libc.so.6
3  0x00007f19dacd7fe8 in __GI__IO_file_close_it () from /lib64/libc.so.6
4  0x00007f19dacd55a9 in freopen64 () from /lib64/libc.so.6
5  0x00000000004263b9 in luaL_loadfilex_ (L=L@entry=0x7f189487af88, filename=filename@entry=0x7f18ace7bac0 "scripts/apis/army.lua", mode=mode@entry=0x7f19bb5d5be0 "bt") at lauxlib.c:794
6  0x00000000004271fa in luaL_loadfilex (L=L@entry=0x7f189487af88, filename=filename@entry=0x7f18ace7bac0 "scripts/apis/army.lua", mode=mode@entry=0x7f19bb5d5be0 "bt") at lauxlib.c:1232
7  0x000000000042b3d6 in luaB_loadfile (L=0x7f189487af88) at lbaselib.c:322
8  0x0000000000415763 in luaD_precall (L=L@entry=0x7f189487af88, func=func@entry=0x7f17a5598360, nresults=2) at ldo.c:532
9  0x0000000000422576 in luaV_execute (L=L@entry=0x7f189487af88, ci=<optimized out>) at lvm.c:1626
10 0x00000000004158e0 in ccall (L=L@entry=0x7f189487af88, func=<optimized out>, nResults=nResults@entry=-1, inc=inc@entry=1) at ldo.c:577
11 0x00000000004159ca in luaD_call (L=L@entry=0x7f189487af88, func=<optimized out>, nResults=nResults@entry=-1) at ldo.c:587
12 0x0000000000412a40 in lua_pcallk (L=L@entry=0x7f189487af88, nargs=nargs@entry=1, nresults=nresults@entry=-1, errfunc=errfunc@entry=2, ctx=ctx@entry=2, k=k@entry=0x42afd0 <finishpcall>) at lapi.c:1071
13 0x000000000042b07f in luaB_xpcall (L=0x7f189487af88) at lbaselib.c:473
14 0x0000000000415763 in luaD_precall (L=L@entry=0x7f189487af88, func=func@entry=0x7f17a55981b0, nresults=3) at ldo.c:532
15 0x0000000000422576 in luaV_execute (L=L@entry=0x7f189487af88, ci=<optimized out>, ci@entry=0x7f17dc692500) at lvm.c:1626
16 0x0000000000415423 in unroll (L=0x7f189487af88, ud=<optimized out>) at ldo.c:685
17 0x0000000000414b9a in luaD_rawrunprotected (L=L@entry=0x7f189487af88, f=f@entry=0x415910 <resume>, ud=ud@entry=0x7f19cf5f4f7c) at ldo.c:144
18 0x0000000000415a54 in lua_resume (L=L@entry=0x7f189487af88, from=from@entry=0x7f19a53d4a08, nargs=<optimized out>, nargs@entry=5, nresults=nresults@entry=0x7f19cf5f4fbc) at ldo.c:788
19 0x00007f19d8ffd05d in lua_resumeX (nresults=0x7f19cf5f4fbc, nargs=5, from=0x7f19a53d4a08, L=0x7f189487af88) at service-src/service_snlua.c:90
20 auxresume (narg=5, co=0x7f189487af88, L=0x7f19a53d4a08) at service-src/service_snlua.c:146
21 timing_resume (L=L@entry=0x7f19a53d4a08, co_index=co_index@entry=1, n=5) at service-src/service_snlua.c:198
22 0x00007f19d8ffd530 in luaB_coresume (L=0x7f19a53d4a08) at service-src/service_snlua.c:217
23 0x0000000000415763 in luaD_precall (L=L@entry=0x7f19a53d4a08, func=func@entry=0x7f17d8bf8410, nresults=nresults@entry=-1) at ldo.c:532
24 0x000000000042227b in luaV_execute (L=L@entry=0x7f19a53d4a08, ci=<optimized out>) at lvm.c:1656
25 0x00000000004158e0 in ccall (L=0x7f19a53d4a08, func=<optimized out>, nResults=<optimized out>, inc=65537) at ldo.c:577
26 0x0000000000414b9a in luaD_rawrunprotected (L=L@entry=0x7f19a53d4a08, f=f@entry=0x411370 <f_call>, ud=ud@entry=0x7f19cf5f52c0) at ldo.c:144
27 0x0000000000415c5e in luaD_pcall (L=L@entry=0x7f19a53d4a08, func=func@entry=0x411370 <f_call>, u=u@entry=0x7f19cf5f52c0, old_top=192, ef=<optimized out>) at ldo.c:892
28 0x00000000004129c7 in lua_pcallk (L=L@entry=0x7f19a53d4a08, nargs=<optimized out>, nresults=nresults@entry=-1, errfunc=errfunc@entry=0, ctx=ctx@entry=0, k=k@entry=0x42afd0 <finishpcall>) at lapi.c:1059
29 0x000000000042b0f0 in luaB_pcall (L=0x7f19a53d4a08) at lbaselib.c:456
30 0x0000000000415763 in luaD_precall (L=L@entry=0x7f19a53d4a08, func=func@entry=0x7f17d8bf82a0, nresults=2) at ldo.c:532
31 0x0000000000422576 in luaV_execute (L=L@entry=0x7f19a53d4a08, ci=<optimized out>) at lvm.c:1626
32 0x00000000004158e0 in ccall (L=0x7f19a53d4a08, func=<optimized out>, nResults=<optimized out>, inc=65537) at ldo.c:577
33 0x0000000000414b9a in luaD_rawrunprotected (L=L@entry=0x7f19a53d4a08, f=f@entry=0x411370 <f_call>, ud=ud@entry=0x7f19cf5f5590) at ldo.c:144
34 0x0000000000415c5e in luaD_pcall (L=L@entry=0x7f19a53d4a08, func=func@entry=0x411370 <f_call>, u=u@entry=0x7f19cf5f5590, old_top=48, ef=<optimized out>) at ldo.c:892
35 0x00000000004129c7 in lua_pcallk (L=L@entry=0x7f19a53d4a08, nargs=nargs@entry=5, nresults=nresults@entry=0, errfunc=errfunc@entry=1, ctx=ctx@entry=0, k=k@entry=0x0) at lapi.c:1059
36 0x00007f19ce6115b8 in _cb (context=0x7f19b6d38580, ud=0x7f19a53d4a08, type=20, session=2481, source=136, msg=0x7f19b8021900, sz=241) at lualib-src/lua-skynet.c:75
37 0x00000000004095d6 in dispatch_message (ctx=ctx@entry=0x7f19b6d38580, msg=msg@entry=0x7f19cf5f5650) at skynet-src/skynet_server.c:276
38 0x000000000040a1ac in skynet_context_message_dispatch (sm=sm@entry=0x7f19da80b300, q=0x7f19290a86c0, weight=weight@entry=1) at skynet-src/skynet_server.c:336
39 0x000000000040a95e in thread_worker (p=<optimized out>) at skynet-src/skynet_start.c:163
40 0x00007f19db955e25 in start_thread () from /lib64/libpthread.so.0
41 0x00007f19dad5bbad in clone () from /lib64/libc.so.6

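Judging from the thread stack, the deadlock appears to be triggered by the thread's call to luaL_loadfilex_. It is hard to reproduce.
The skynet version is 1.5 and the Lua version is 5.4.3.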

@cloudwu
Owner

cloudwu commented Sep 2, 2022

I don't think this is a skynet problem.

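The deadlock occurs inside freopen64. You can google reports of freopen and fclose deadlocking (for example: https://www.cygwin.com/bugzilla/show_bug.cgi?id=24963 ). You could try upgrading the CRT to see whether a relevant bug has been fixed. Also check whether the number of files the process has open exceeds the limit.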

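Also, freopen is only called for binary files. I think you can avoid using binary sources, or modify the code to open the source file in binary mode directly instead of going through freopen.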
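A minimal sketch of the second option, assuming the file is read by a separate helper rather than by patching lauxlib.c: open the file once with `fopen(path, "rb")` and read it fully into memory, so no `freopen` is ever needed. The names `read_file_binary` and `looks_like_binary_chunk` are hypothetical and do not exist in Lua or skynet.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file with a single binary-mode fopen().
 * Returns a malloc'ed buffer (caller frees) and stores its length in
 * *out_size; returns NULL on any failure. */
char *read_file_binary(const char *path, size_t *out_size) {
    assert(path != NULL && out_size != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) { fclose(f); return NULL; }

    for (;;) {
        size_t n = fread(buf + len, 1, cap - len, f);
        len += n;
        if (n == 0) break;              /* end of file or read error */
        if (len == cap) {               /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); fclose(f); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(f)) { free(buf); fclose(f); return NULL; }
    fclose(f);
    *out_size = len;
    return buf;
}

/* A precompiled Lua chunk starts with the first byte of LUA_SIGNATURE
 * ("\x1bLua"); anything else is treated as text source here. */
int looks_like_binary_chunk(const char *buf, size_t size) {
    return size > 0 && buf[0] == '\x1b';
}
```

The resulting buffer could then be handed to `luaL_loadbufferx(L, buf, size, chunkname, "bt")`, which accepts both text and binary chunks. Unlike `luaL_loadfilex`, this sketch does not skip a leading `#` line, so files that start with a shebang would need extra handling.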

@wanghongshuai137
Author

wanghongshuai137 commented Sep 2, 2022

Thank you very much!!
It is probably not a problem with the number of open files.

@sniper00
Contributor

sniper00 commented Sep 2, 2022

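With agent mode, a hot reload does indeed open a large number of files all at once. My approach is to first store the contents of the files being hot-reloaded in a lock-protected hashmap, and then notify the agents to hot-reload by fetching the contents from that map.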
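The idea can be sketched as a small mutex-protected cache keyed by file name. This is not sniper00's actual code: the names `cache_put` and `cache_get` and the capacity `CACHE_MAX` are hypothetical, and a fixed-size table with linear search stands in for a real hashmap to keep the example short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_MAX 64   /* hypothetical capacity, chosen for the sketch */

struct cache_entry { char *name; char *data; size_t size; };

static struct cache_entry g_entries[CACHE_MAX];
static int g_count = 0;
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copy n bytes into a fresh allocation (at least 1 byte is allocated). */
static char *dup_bytes(const char *src, size_t n) {
    char *p = malloc(n ? n : 1);
    if (p != NULL && n > 0) memcpy(p, src, n);
    return p;
}

/* Store (or replace) the contents for `name`. Returns 0 on success, -1 on
 * allocation failure or when the table is full. */
int cache_put(const char *name, const char *data, size_t size) {
    assert(name != NULL && data != NULL);
    char *copy = dup_bytes(data, size);
    if (copy == NULL) return -1;
    int rc = -1;
    pthread_mutex_lock(&g_lock);
    int i;
    for (i = 0; i < g_count; i++)
        if (strcmp(g_entries[i].name, name) == 0) break;
    if (i < g_count) {                      /* existing key: swap contents */
        free(g_entries[i].data);
        g_entries[i].data = copy;
        g_entries[i].size = size;
        rc = 0;
    } else if (g_count < CACHE_MAX) {       /* new key: append */
        char *key = dup_bytes(name, strlen(name) + 1);
        if (key != NULL) {
            g_entries[g_count].name = key;
            g_entries[g_count].data = copy;
            g_entries[g_count].size = size;
            g_count++;
            rc = 0;
        }
    }
    pthread_mutex_unlock(&g_lock);
    if (rc != 0) free(copy);
    return rc;
}

/* Return a malloc'ed copy of the contents for `name` (caller frees), or NULL
 * if the name is unknown. Copying under the lock means readers never hold a
 * pointer into the table after the lock is released. */
char *cache_get(const char *name, size_t *out_size) {
    assert(name != NULL && out_size != NULL);
    char *result = NULL;
    pthread_mutex_lock(&g_lock);
    for (int i = 0; i < g_count; i++) {
        if (strcmp(g_entries[i].name, name) == 0) {
            result = dup_bytes(g_entries[i].data, g_entries[i].size);
            if (result != NULL) *out_size = g_entries[i].size;
            break;
        }
    }
    pthread_mutex_unlock(&g_lock);
    return result;
}
```

With this shape, one service reads each changed file from disk once and calls `cache_put`; every agent then calls `cache_get` and loads the chunk from memory, so the number of file opens no longer scales with the number of agents.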

@wanghongshuai137
Author

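We are not using agent mode. We don't have many services and the hot reload is done synchronously, so the number of files open at the same time should not be large.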
