Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
Geth MoarVM/new-disp-nativecall-libffi: 8 commits pushed by (Stefan Seifert)++, (Nicholas Clark)++ 12:12
nine Turns out on new-disp-nativecall the Inline::Perl5 segfaults/assertion errors disappear 12:39
lizmat so maybe a push forward is more efficient than trying to fix the issue with the current setup? 12:43
nine That's certainly tempting. But without knowing what the exact problem is, it's hard to decide whether the problem is really fixed by new-disp-nativecall or if it just goes into hiding again. 12:48
The issue also goes away if I prohibit JIT compilation of sp_resumption. Of course that doesn't mean that sp_resumption is at fault as this could just stop the JIT from reaching the actually broken part. But then, in the affected frame, sp_resumption is only followed by sp_guardconc and sp_runbytecode_o. 13:00
nine Prohibiting JIT of sp_runbytecode_o does _not_ fix the problem. And sp_guardconc is ooooold (JITed since 2014) and will appear in most JIT compiled frames. Would be surprising if a problem only appears now. 13:03
So sp_resumption it is? Well yes, but how? All it actually does at runtime is write VMNull into a register. It is kinda hard to get this wrong. 13:04
nine And not just that, since we already have an op for writing a VMNull into a register (nqp::null), the actual implementation has been there since 2014 as well. 13:06
lizmat hmmm 13:09
lizmat assumes battery operated humming duck mode
nine Now sp_resumption is a strange beast. If it only wrote that VMNull there wouldn't be a point of having this op in the first place. Its purpose seems more internal to spesh. It takes a variable number of operands with the apparent purpose of keeping spesh from eliminating them. 13:15
timo yeah, it reserves a bit of space on the stack frame for use in resumption data 13:22
like access to the original dispatch's arguments
nine Btw. the "is built" feature is a 6.e thing and thus not yet available for user code, isn't it? 13:27
The docs don't mention this 13:28
timo i don't know 13:33
nine Well it got introduced in 2020, so can't be in 6.c or 6.d 13:34
lizmat is built works everywhere, afaik 13:35
it's not versioned afaik 13:36
jnthnwrthngtn nine: See src/core/oplist, which has an explanation of sp_resumption just above the op definition
The purpose is partly "keep spesh from eliminating them", but also to make sure we can recover those registers in the event of a resumption 13:37
nine lizmat: but the only way to tell the system that I need a compiler feature is to state a minimum language version. There are 6.d compiler versions without "is built" support, so I'd have to state 6.e, which doesn't exist yet. 13:41
Another interesting fact: the error goes away when I remove the local patch that speeds up the objectkeeper by using an IterationBuffer instead of array: gist.github.com/niner/23eedda15d16...a20fc7c19c 13:45
Now the ObjectKeeper's .free method is involved as it's called by the broken frame. What's really strange though is that that broken frame (found by spesh bisecting) is not what I get for tc->cur_frame. And the assertion failure happens in getlexstatic_o which is not in use in that frame 13:48
jnthnwrthngtn Does the JITted machine code correspond to the frame? 13:51
nine That's the thing.... though the errors go away (absolutely reliably) when I disable the JIT, they do not actually occur in JITed frames. 13:53
jnthnwrthngtn Is there a deopt from a JITted frame just before the issue? 13:55
timo does introducing IterationBuffer as a "dependency" to the serialization context change anything?
does rr's chaos mode do anything interesting? 13:56
nine jnthnwrthngtn: there are different failure modes. The "Internal error: Unwound entire stack and missed handler" one does make some sense though. It happens when a nested runloop executes a return_o. This goes via MVM_frame_try_return/MVM_callstack_unwind_frame/unwind_after_handler/MVM_frame_unwind_to to MVM_callstack_unwind_frame which returns 0 due to the MVM_CALLSTACK_RECORD_NESTED_RUNLOOP entry on the callstack, leading to the error message 13:57
I don't see any relevant deopts 13:58
The weird thing about this is that it's trying to unwind to command_eval. Definitely not the right target for the return 14:03
Aha, there's an exception and it's "Attempt to read past end of string heap when locating string" 14:08
So just another symptom of some general screw up
timo: the program is non-threaded (and I'm running with MVM_SPESH_BLOCKING=1), so chaos mode probably won't show anything interesting. 14:09
timo ah, dang
nine Smaller nursery makes it appear sooner. Still in a native callback though 14:11
timo hm, i wonder if we need to introduce optional redzones in more places for use in --valgrind 14:12
maybe something's exploding for some reason like that and isn't getting caught because reasons
nine And with a 4K nursery I can reproduce it even on new-disp-nativecall, so no, can't just storm ahead on this :( 14:15
But still no joy reproducing it without JIT
jnthnwrthngtn nine: Hm, that'd imply that there's an unhandled exception in a callback? 14:17
(The presence of the nested runloop boundary I mean) 14:18
I think we used to detect those and try to nicely report them, but I wonder if it regressed (a possible victim of my work on rearranging returns)
(Nicely report as in "oops", as in we don't consider it a condition we can recover from) 14:19
The wrong string heap number and the getlexstatic_o together make me wonder if there is no getlexstatic_o really, it's just we're in a bad location in the bytecode stream (a mis-deopt would explain it but you didn't spot one of those) 14:20
And so interpreting random things (and so interpreting things as string indexes that aren't, etc.)
That or the bytecode stream is out of sync with the cu, static info, etc. 14:21
afk for a bit, going to zizkov for walk/beer/curry :)
patrickb jnt 14:25
jnthn: The cert of commaide.com does not apply to www.commaide.com. But the links at the top of cro.services link to www.commaide.com 14:26
nine jnthnwrthngtn: the wrong place in the bytecode part kinda fits with sp_resumption and what I meant with it being a strange beast. It's clearly not the runtime effect of JITed sp_resumption. But maybe we somehow handle it wrong when calculating the bytecode position when we return to the interpreter. 14:27
Of course that would make much more sense if some actual deopt happened
timo something going wrong with the callsite that's referenced in the resumption op? 14:56
so it sort of changes its length on accident?
nine resumption doesn't reference a callsite 14:57
timo oh ok so the number of arguments it takes is in an inline cache or something 14:58
nine Nah, it's just sp_resumption reg, int, int, ... with reg getting VMNulled, the first int being some index and the second int the number of varargs 14:59
Somehow it's a mixture of JITed sp_resumption, finalizers and nested runloops 15:00
I get the "Unwound entire stack and missed handler" message even though all callbacks have a CATCH block 15:02
New one: MoarVM panic: No frame at top of callstack
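The operand layout nine describes above (reg, int, int, then a variable number of operands) means the length of an sp_resumption instruction depends on its own count operand, which is why a miscalculated bytecode position was a plausible suspect at this point. A minimal sketch of that idea follows. It assumes a made-up encoding in which every field is one 16-bit word; the names toy_resumption_words and toy_skip_resumption are hypothetical and this is not MoarVM's actual decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding, assumed for illustration only:
 * every field (opcode, register, index, count, varargs) is one 16-bit word. */
enum { TOY_OP_RESUMPTION = 1 };

/* Number of 16-bit words taken by the instruction starting at code[0]:
 * opcode, target register, index, count, then `count` variadic operands. */
static size_t toy_resumption_words(const uint16_t *code) {
    uint16_t count = code[3];   /* the second "int" in nine's description */
    return 4 + (size_t)count;
}

/* Steps over one such instruction and returns a pointer to the next one.
 * Reading the wrong count here would land in the middle of an instruction. */
static const uint16_t *toy_skip_resumption(const uint16_t *code) {
    return code + toy_resumption_words(code);
}
```

In this toy encoding, an instruction with three variadic operands occupies seven words, and skipping it lands exactly on the following opcode.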
timo so, CONTROL then?
nine No, they also got CONTROL blocks 15:06
timo OK
well it sounds kind of like memory corruption from where i'm standing, which is maybe a bit too far away to be of much use 15:07
dogbert17 there seems to be quite a few bugs present in MoarVM atm, unless it's the same problem showing itself under different circumstances 15:16
nine dogbert17: that's not terribly surprising considering the amount of changes that went in lately 20:10
dogbert17 true, now it's a question of finding them :) 20:25
nine LOL, this is hilarious 20:28
So...my bug somehow involves sp_resumption, GC and nested runloops, right? Except that it actually doesn't. sp_resumption is innocent and the GC just caused more callbacks to appear. 20:29
japhb "hilarious" in the "OMG seriously?" sense?
nine What happens is that the frame that the callback is running is completely JIT compiled, including the return_o. Now return_o replaces the current frame with its caller which in this case is the frame that calls the native code that eventually runs the callback. 20:30
Exiting from the nested runloop is signified by the MVM_CALLSTACK_RECORD_NESTED_RUNLOOP record on the call stack. When MVM_callstack_unwind_frame encounters that it immediately returns 0 to signal that we need to stop the runloop. 20:31
MVM_frame_try_return just forwards that result: return MVM_callstack_unwind_frame(tc, 0); 20:32
The return_o op then checks this result: if (MVM_frame_try_return(tc) == 0) goto return_label; 20:33
Now what does JIT code do? if (MVM_UNLIKELY(!tc->cur_frame)) { /* somehow unwound our top frame */ goto return_label; }
s/JIT code/sp_jit_enter/ 20:34
It doesn't ever see that result and instead checks tc->cur_frame which at that time already points at the caller
So we happily continue a runloop and venture forth into unexplored territory of random memory 20:35
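The mismatch nine describes can be modelled in a few lines. The sketch below is a toy, not MoarVM source: Frame, ToyThread, toy_try_return, interp_should_exit and jit_should_exit are all hypothetical names, and the nested-runloop record is reduced to a single flag. It only aims to show why checking the current frame is not equivalent to checking the return value once the unwind has already switched to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; none of these names exist in MoarVM. */
typedef struct Frame { struct Frame *caller; } Frame;

typedef struct {
    Frame *cur_frame;        /* stands in for tc->cur_frame */
    int nested_runloop_mark; /* stands in for the nested-runloop callstack record */
} ToyThread;

/* Models the unwind on return: the caller becomes the current frame, and
 * 0 is returned if a nested-runloop boundary was crossed (1 otherwise). */
static int toy_try_return(ToyThread *tc) {
    assert(tc->cur_frame != NULL);
    tc->cur_frame = tc->cur_frame->caller;
    if (tc->nested_runloop_mark) {
        tc->nested_runloop_mark = 0;
        return 0;
    }
    return 1;
}

/* Interpreter-style check: trusts the return value.
 * Returns 1 when the runloop should exit. */
static int interp_should_exit(ToyThread *tc) {
    return toy_try_return(tc) == 0;
}

/* JIT-entry-style check as described in the chat: the result is never
 * seen, only the current frame is inspected afterwards. */
static int jit_should_exit(ToyThread *tc) {
    (void)toy_try_return(tc);
    return tc->cur_frame == NULL;
}
```

With a callback frame whose caller is the frame that invoked the native code, interp_should_exit reports that the runloop must stop, while jit_should_exit reports that it may continue, because the current frame is already the (non-NULL) caller.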
timo wheeeee! 20:36
.oO( "We're going on a trip, / in our favorite rocket ship, / zooming through the sky ..." )
Geth MoarVM/fix_jited_return_from_native_runloops: 8a91bf8eb0 | (Stefan Seifert)++ | src/core/interp.c
Fix JITed return from nested runloops

When a callback frame is completely JIT compiled, including a return_o, we did not notice that it's time to exit the runloop. MVM_callstack_unwind_frame will already have set tc->cur_frame to the frame that called the native routine that in turn ran the callback and returned 0 to signal that the runloop should end. This 0 got forwarded by MVM_frame_try_return but JIT compiled code does not ... (8 more lines)
MoarVM: niner++ created pull request #1601:
Fix JITed return from nested runloops
timo got a clue why the mac build may have failed the test for `use Test; use Test; print "pass"`? 22:03
dev.azure.com/MoarVM/MoarVM/_build...&l=4577 22:04