Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes. Set by lizmat on 24 May 2021. |
|||
00:02
reportable6 left
00:05
reportable6 joined
00:06
colemanx left
00:07
colemanx joined
00:48
patrickb left,
patrickz joined
01:44
patrickz left
03:04
bisectable6 left,
releasable6 left,
greppable6 left,
benchable6 left,
unicodable6 left,
sourceable6 left,
evalable6 left,
committable6 left,
linkable6 left,
coverable6 left,
reportable6 left,
quotable6 left,
nativecallable6 left,
notable6 left,
squashable6 left,
tellable6 left,
bloatable6 left,
statisfiable6 left,
shareable6 left
03:05
quotable6 joined,
bloatable6 joined,
unicodable6 joined
03:06
shareable6 joined,
squashable6 joined
03:07
reportable6 joined,
committable6 joined,
statisfiable6 joined
04:06
sourceable6 joined,
coverable6 joined
04:07
tellable6 joined,
linkable6 joined,
evalable6 joined,
bisectable6 joined,
greppable6 joined,
notable6 joined
04:57
frost joined
05:05
releasable6 joined
05:06
benchable6 joined
06:02
reportable6 left
06:03
squashable6 left
06:05
reportable6 joined
06:06
nativecallable6 joined
08:01
squashable6 joined
09:27
frost left
12:02
reportable6 left
|
|||
Geth | MoarVM/new-disp-nativecall-libffi: 8 commits pushed by (Stefan Seifert)++, (Nicholas Clark)++
|
12:12 | |
nine | Turns out on new-disp-nativecall the Inline::Perl5 segfaults/assertion errors disappear | 12:39 | |
lizmat | so maybe a push forward is more efficient than trying to fix the issue with the current setup? | 12:43 | |
nine | That's certainly tempting. But without knowing what the exact problem is, it's hard to decide whether the problem is really fixed by new-disp-nativecall or if it just goes into hiding again. | 12:48 | |
The issue also goes away if I prohibit JIT compilation of sp_resumption. Of course that doesn't mean that sp_resumption is at fault as this could just stop the JIT from reaching the actually broken part. But then, in the affected frame, sp_resumption is only followed by sp_guardconc and sp_runbytecode_o. | 13:00 | ||
13:02
patrickb joined
|
|||
nine | Prohibiting JIT of sp_runbytecode_o does _not_ fix the problem. And sp_guardconc is ooooold (JITed since 2014) and will appear in most JIT compiled frames. Would be surprising if a problem only appears now. | 13:03 | |
So sp_resumption it is? Well yes, but how? All it actually does at runtime is write VMNull into a register. It is kinda hard to get this wrong. | 13:04 | ||
13:05
reportable6 joined
|
|||
nine | And not just that, since we already have an op for writing a VMNull into a register (nqp::null), the actual implementation has been there since 2014 as well. | 13:06 | |
lizmat | hmmm | 13:09 | |
lizmat assumes battery operated humming duck mode | |||
nine | Now sp_resumption is a strange beast. If it only wrote that VMNull there wouldn't be a point of having this op in the first place. Its purpose seems more internal to spesh. It takes a variable number of operands with the apparent purpose of keeping spesh from eliminating them. | 13:15 | |
timo | yeah, it reserves a bit of space on the stack frame for use in resumption data | 13:22 | |
like access to the original dispatch's arguments | |||
nine | Btw. the "is built" feature is a 6.e thing and thus not yet available for user code, isn't it? | 13:27 | |
The docs don't mention this | 13:28 | ||
timo | i don't know | 13:33 | |
nine | Well it got introduced in 2020, so can't be in 6.c or 6.d | 13:34 | |
lizmat | is built works everywhere, afaik | 13:35 | |
it's not versioned afaik | 13:36 | ||
afk& | |||
jnthnwrthngtn | nine: See src/core/oplist, which has an explanation of sp_resumption just above the op definition | ||
The purpose is partly "keep spesh from eliminating them", but also to make sure we can recover those registers in the event of a resumption | 13:37 | ||
nine | lizmat: but the only way to tell the system that I need a compiler feature is to state a minimum language version. There are 6.d compiler versions without "is built" support, so I'd have to state 6.e, which doesn't exist yet. | 13:41 | |
Another interesting fact: the error goes away when I remove the local patch that speeds up the objectkeeper by using an IterationBuffer instead of array: gist.github.com/niner/23eedda15d16...a20fc7c19c | 13:45 | ||
Now the ObjectKeeper's .free method is involved as it's called by the broken frame. What's really strange though is that that broken frame (found by spesh bisecting) is not what I get for tc->cur_frame. And the assertion failure happens in getlexstatic_o which is not in use in that frame | 13:48 | ||
jnthnwrthngtn | Does the JITted machine code correspond to the frame? | 13:51 | |
nine | That's the thing.... though the errors go away (absolutely reliably) when I disable the JIT, they do not actually occur in JITed frames. | 13:53 | |
jnthnwrthngtn | Is there a deopt from a JITted frame just before the issue? | 13:55 | |
timo | does introducing IterationBuffer as a "dependency" to the serialization context change anything? | ||
does rr's chaos mode do anything interesting? | 13:56 | ||
nine | jnthnwrthngtn: there are different failure modes. The "Internal error: Unwound entire stack and missed handler" one does make some sense though. It happens when a nested runloop executes a return_o. This goes via MVM_frame_try_return/MVM_callstack_unwind_frame/unwind_after_handler/MVM_frame_unwind_to to MVM_callstack_unwind_frame which returns 0 due to the MVM_CALLSTACK_RECORD_NESTED_RUNLOOP entry on the | 13:57 | |
callstack, leading to the error message | |||
I don't see any relevant deopts | 13:58 | ||
The weird thing about this is that it's trying to unwind to command_eval. Definitely not the right target for the return | 14:03 | ||
Aha, there's an exception and it's "Attempt to read past end of string heap when locating string" | 14:08 | ||
So just another symptom of some general screw up | |||
timo: the program is non-threaded (and I'm running with MVM_SPESH_BLOCKING=1), so chaos mode probably won't show anything interesting. | 14:09 | ||
timo | ah, dang | ||
nine | Smaller nursery makes it appear sooner. Still in a native callback though | 14:11 | |
timo | hm, i wonder if we need to introduce optional redzones in more places for use in --valgrind | 14:12 | |
maybe something's exploding for some reason like that and isn't getting caught because reasons | |||
nine | And with a 4K nursery I can reproduce it even on new-disp-nativecall, so no, can't just storm ahead on this :( | 14:15 | |
But still no joy reproducing it without JIT | |||
jnthnwrthngtn | nine: Hm, that'd imply that there's an unhandled exception in a callback? | 14:17 | |
(The presence of the nested runloop boundary I mean) | 14:18 | ||
I think we used to detect those and try to nicely report them, but I wonder if it regressed (a possible victim of my work on rearranging returns) | |||
(Nicely report as in "oops", as in we don't consider it a condition we can recover from) | 14:19 | ||
The wrong string heap number and the getlexstatic_o together make me wonder if there is no getlexstatic_o really, it's just we're in a bad location in the bytecode stream (a mis-deopt would explain it but you didn't spot one of those) | 14:20 | ||
And so interpreting random things (and so interpreting things as string indexes that aren't, etc.) | |||
That or the bytecode stream is out of sync with the cu, static info, etc. | 14:21 | ||
afk for a bit, going to zizkov for walk/beer/curry :) | |||
patrickb | jnt | 14:25 | |
jnthn: The cert of commaide.com does not apply to www.commaide.com. But the links at the top of cro.services link to www.commaide.com | 14:26 | ||
nine | jnthnwrthngtn: the wrong place in the bytecode part kinda fits with sp_resumption and what I meant with it being a strange beast. It's clearly not the runtime effect of JITed sp_resumption. But maybe we somehow handle it wrong when calculating the bytecode position when we return to the interpreter. | 14:27 | |
Of course that would make much more sense if some actual deopt happened | |||
timo | something going wrong with the callsite that's referenced in the resumption op? | 14:56 | |
so it sort of changes its length on accident? | |||
nine | resumption doesn't reference a callsite | 14:57 | |
timo | oh ok so the number of arguments it takes is in an inline cache or something | 14:58 | |
nine | Nah, it's just sp_resumption reg, int, int, ... with reg getting VMNulled, the first int being some index and the second int the number of varargs | 14:59 | |
Somehow it's a mixture of JITed sp_resumption, finalizers and nested runloops | 15:00 | ||
I get the "Unwound entire stack and missed handler" message even though all callbacks have a CATCH block | 15:02 | ||
New one: MoarVM panic: No frame at top of callstack | |||
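The operand layout nine gives for sp_resumption (a register, an index, a count, then that many varargs) means the op has no fixed length, so anything that walks the bytecode has to read the count to find the next op. A minimal sketch of that idea follows. It is not MoarVM source: the 16-bit word size, the field order within the word stream, and every `toy_` name are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding, for illustration only. Every field is one
 * 16-bit word:
 *   [opcode][dst reg][index][nargs][arg 0] ... [arg nargs-1]
 */
enum { TOY_FIXED_WORDS = 4 };

/* Length of one variable-arity op, in 16-bit words. */
static size_t toy_resumption_length(const uint16_t *op) {
    assert(op != NULL);
    uint16_t nargs = op[3];          /* the second int: number of varargs */
    return (size_t)TOY_FIXED_WORDS + nargs;
}

/* Pointer to the op that follows. Misreading nargs here would leave the
 * caller decoding operands as if they were opcodes. */
static const uint16_t *toy_skip_resumption(const uint16_t *op) {
    return op + toy_resumption_length(op);
}
```

With a stream such as `{900, 5, 0, 2, 7, 8, 901}`, the sketch reports a length of 6 words and the skip lands on the trailing `901`.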
timo | so, CONTROL then? | ||
nine | No, they also got CONTROL blocks | 15:06 | |
timo | OK | ||
well it sounds kind of like memory corruption from where i'm standing, which is maybe a bit too far away to be of much use | 15:07 | ||
dogbert17 | there seem to be quite a few bugs present in MoarVM atm, unless it's the same problem showing itself under different circumstances | 15:16 | |
18:02
reportable6 left
|
|||
nine | dogbert17: that's not terribly surprising considering the amount of changes that went in lately | 20:10 | |
dogbert17 | true, now it's a question of finding them :) | 20:25 | |
nine | LOL, this is hilarious | 20:28 | |
So...my bug somehow involves sp_resumption, GC and nested runloops, right? Except that it actually doesn't. sp_resumption is innocent and the GC just caused more callbacks to appear. | 20:29 | ||
japhb | "hilarious" in the "OMG seriously?" sense? | ||
nine | What happens is that the frame that the callback is running is completely JIT compiled, including the return_o. Now return_o replaces the current frame with its caller which in this case is the frame that calls the native code that eventually runs the callback. | 20:30 | |
Exiting from the nested runloop is signified by the MVM_CALLSTACK_RECORD_NESTED_RUNLOOP record on the call stack. When MVM_callstack_unwind_frame encounters that it immediately returns 0 to signal that we need to stop the runloop. | 20:31 | ||
MVM_frame_try_return just forwards that result: return MVM_callstack_unwind_frame(tc, 0); | 20:32 | ||
The return_o op then checks this result: if (MVM_frame_try_return(tc) == 0) goto return_label; | 20:33 | ||
Now what does JIT code do? if (MVM_UNLIKELY(!tc->cur_frame)) { /* somehow unwound our top frame */ goto return_label; } | |||
s/JIT code/sp_jit_enter/ | 20:34 | ||
It doesn't ever see that result and instead checks tc->cur_frame which at that time already points at the caller | |||
So we happily continue a runloop and venture forth into unexplored territory of random memory | 20:35 | ||
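The mismatch nine describes can be reduced to a toy model: the return path reports "leave the runloop" through its return value, the interpreter reads that value, and the JIT entry path looks at the current frame instead. The sketch below is an illustration under assumptions, not MoarVM code. The `Frame` and `ThreadContext` structs, the `runloop_floor` field standing in for the MVM_CALLSTACK_RECORD_NESTED_RUNLOOP record, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; none of these types or names come from MoarVM. */
typedef struct Frame { struct Frame *caller; } Frame;

typedef struct {
    Frame *cur_frame;      /* frame currently executing */
    Frame *runloop_floor;  /* stand-in for the nested-runloop record:
                              returning from this frame ends the runloop */
} ThreadContext;

/* Pop the current frame. Returns 0 when the pop crossed the nested
 * runloop boundary (the runloop must stop), 1 otherwise. Note that
 * cur_frame already points at the caller when 0 is returned. */
static int toy_try_return(ThreadContext *tc) {
    assert(tc->cur_frame != NULL);
    Frame *returning = tc->cur_frame;
    tc->cur_frame = returning->caller;
    return returning == tc->runloop_floor ? 0 : 1;
}

/* Interpreter-style check: acts on the return value. */
static int interp_should_exit(ThreadContext *tc) {
    return toy_try_return(tc) == 0;
}

/* JIT-entry-style check as described in the log: the return value is
 * lost, and only a NULL cur_frame would stop the loop. */
static int jit_should_exit_buggy(ThreadContext *tc) {
    (void)toy_try_return(tc);
    return tc->cur_frame == NULL;
}
```

Starting from a callback frame whose caller is the frame that invoked the native code, the interpreter-style check asks to leave the runloop, while the JIT-style check sees a non-NULL caller frame and keeps running.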
timo | wheeeee! | 20:36 | |
japhb | .oO( "We're going on a trip, / in our favorite rocket ship, / zooming through the sky ..." ) |
20:37 | |
Geth | MoarVM/fix_jited_return_from_native_runloops: 8a91bf8eb0 | (Stefan Seifert)++ | src/core/interp.c Fix JITed return from nested runloops When a callback frame is completely JIT compiled, including a return_o, we did not notice that it's time to exit the runloop. MVM_callstack_unwind_frame will already have set tc->cur_frame to the frame that called the native routine that in turn ran the callback and returned 0 to signal that the runloop should end. This 0 got forwarded by MVM_frame_try_return but JIT compiled code does not ... (8 more lines) |
20:42 | |
MoarVM: niner++ created pull request #1601: Fix JITed return from nested runloops |
|||
21:04
reportable6 joined
|
|||
timo | got a clue why the mac build may have failed the test for `use Test; use Test; print "pass"`? | 22:03 | |
dev.azure.com/MoarVM/MoarVM/_build...amp;l=4577 | 22:04 |