#moarvm on 14 November 2021 - Raku Programming Language Log

Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes. Set by lizmat on 24 May 2021.
00:02 reportable6 left
00:05 reportable6 joined
00:06 colemanx left
00:07 colemanx joined
00:48 patrickb left, patrickz joined
01:44 patrickz left
03:04 bisectable6 left, releasable6 left, greppable6 left, benchable6 left, unicodable6 left, sourceable6 left, evalable6 left, committable6 left, linkable6 left, coverable6 left, reportable6 left, quotable6 left, nativecallable6 left, notable6 left, squashable6 left, tellable6 left, bloatable6 left, statisfiable6 left, shareable6 left
03:05 quotable6 joined, bloatable6 joined, unicodable6 joined
03:06 shareable6 joined, squashable6 joined
03:07 reportable6 joined, committable6 joined, statisfiable6 joined
04:06 sourceable6 joined, coverable6 joined
04:07 tellable6 joined, linkable6 joined, evalable6 joined, bisectable6 joined, greppable6 joined, notable6 joined
04:57 frost joined
05:05 releasable6 joined
05:06 benchable6 joined
06:02 reportable6 left
06:03 squashable6 left
06:05 reportable6 joined
06:06 nativecallable6 joined
08:01 squashable6 joined
09:27 frost left
12:02 reportable6 left
Geth	MoarVM/new-disp-nativecall-libffi: 8 commits pushed by (Stefan Seifert)++, (Nicholas Clark)++ - Support JIT compilation of native calls with rw args - Support JIT compilation of native calls with rw args for libffi - Fix NULL pointer results getting boxed after native calls - Fix NULL pointer results getting boxed after native calls for libffi - Log types of native routine's return values. - Support JIT compilation of native calls with VMArray arguments - Remove obsolete native call JIT implementation based on invoke protocol - Remove old invoke protocol	12:12
nine	Turns out on new-disp-nativecall the Inline::Perl5 segfaults/assertion errors disappear	12:39
lizmat	so maybe a push forward is more efficient than trying to fix the issue with the current setup?	12:43
nine	That's certainly tempting. But without knowing what the exact problem is, it's hard to decide whether the problem is really fixed by new-disp-nativecall or if it just goes into hiding again.	12:48
	The issue also goes away if I prohibit JIT compilation of sp_resumption. Of course that doesn't mean that sp_resumption is at fault as this could just stop the JIT from reaching the actually broken part. But then, in the affected frame, sp_resumption is only followed by sp_guardconc and sp_runbytecode_o.	13:00
13:02 patrickb joined
nine	Prohibiting JIT of sp_runbytecode_o does _not_ fix the problem. And sp_guardconc is ooooold (JITed since 2014) and will appear in most JIT compiled frames. Would be surprising if a problem only appears now.	13:03
	So sp_resumption it is? Well yes, but how? All it actually does at runtime is write VMNull into a register. It is kinda hard to get this wrong.	13:04
13:05 reportable6 joined
nine	And not just that, since we already have an op for writing a VMNull into a register (nqp::null), the actual implementation has been there since 2014 as well.	13:06
lizmat	hmmm	13:09
	lizmat assumes battery operated humming duck mode
nine	Now sp_resumption is a strange beast. If it only wrote that VMNull there wouldn't be a point of having this op in the first place. Its purpose seems more internal to spesh. It takes a variable number of operands with the apparent purpose of keeping spesh from eliminating them.	13:15
timo	yeah, it reserves a bit of space on the stack frame for use in resumption data	13:22
	like access to the original dispatch's arguments
nine	Btw. the "is built" feature is a 6.e thing and thus not yet available for user code, isn't it?	13:27
	The docs don't mention this	13:28
timo	i don't know	13:33
nine	Well it got introduced in 2020, so can't be in 6.c or 6.d	13:34
lizmat	is built works everywhere, afaik	13:35
	it's not versioned afaik	13:36
	afk&
jnthnwrthngtn	nine: See src/core/oplist, which has an explanation of sp_resumption just above the op definition
	The purpose is partly "keep spesh from eliminating them", but also to make sure we can recover those registers in the event of a resumption	13:37
nine	lizmat: but the only way to tell the system that I need a compiler feature is to state a minimum language version. There are 6.d compiler versions without "is built" support, so I'd have to state 6.e, which doesn't exist yet.	13:41
	Another interesting fact: the error goes away when I remove the local patch that speeds up the objectkeeper by using an IterationBuffer instead of array: gist.github.com/niner/23eedda15d16...a20fc7c19c	13:45
	Now the ObjectKeeper's .free method is involved as it's called by the broken frame. What's really strange though is that that broken frame (found by spesh bisecting) is not what I get for tc->cur_frame. And the assertion failure happens in getlexstatic_o which is not in use in that frame	13:48
jnthnwrthngtn	Does the JITted machine code correspond to the frame?	13:51
nine	That's the thing.... though the errors go away (absolutely reliably) when I disable the JIT, they do not actually occur in JITed frames.	13:53
jnthnwrthngtn	Is there a deopt from a JITted frame just before the issue?	13:55
timo	does introducing IterationBuffer as a "dependency" to the serialization context change anything?
	does rr's chaos mode do anything interesting?	13:56
nine	jnthnwrthngtn: there are different failure modes. The "Internal error: Unwound entire stack and missed handler" one does make some sense though. It happens when a nested runloop executes a return_o. This goes via MVM_frame_try_return/MVM_callstack_unwind_frame/unwind_after_handler/MVM_frame_unwind_to to MVM_callstack_unwind_frame which returns 0 due to the MVM_CALLSTACK_RECORD_NESTED_RUNLOOP entry on the	13:57
	callstack, leading to the error message
	I don't see any relevant deopts	13:58
	The weird thing about this is that it's trying to unwind to command_eval. Definitely not the right target for the return	14:03
	Aha, there's an exception and it's "Attempt to read past end of string heap when locating string"	14:08
	So just another symptom of some general screw up
	timo: the program is non-threaded (and I'm running with MVM_SPESH_BLOCKING=1), so chaos mode probably won't show anything interesting.	14:09
timo	ah, dang
nine	Smaller nursery makes it appear sooner. Still in a native callback though	14:11
timo	hm, i wonder if we need to introduce optional redzones in more places for use in --valgrind	14:12
	maybe something's exploding for some reason like that and isn't getting caught because reasons
nine	And with a 4K nursery I can reproduce it even on new-disp-nativecall, so no, can't just storm ahead on this :(	14:15
	But still no joy reproducing it without JIT
jnthnwrthngtn	nine: Hm, that'd imply that there's an unhandled exception in a callback?	14:17
	(The presence of the nested runloop boundary I mean)	14:18
	I think we used to detect those and try to nicely report them, but I wonder if it regressed (a possible victim of my work on rearranging returns)
	(Nicely report as in "oops", as in we don't consider it a condition we can recover from)	14:19
	The wrong string heap number and the getlexstatic_o together make me wonder if there is no getlexstatic_o really, it's just we're in a bad location in the bytecode stream (a mis-deopt would explain it but you didn't spot one of those)	14:20
	And so interpreting random things (and so interpreting things as string indexes that aren't, etc.)
	That or the bytecode stream is out of sync with the cu, static info, etc.	14:21
	afk for a bit, going to zizkov for walk/beer/curry :)
patrickb	jnt	14:25
	jnthn: The cert of commaide.com does not apply to www.commaide.com. But the links at the top of cro.services link to www.commaide.com	14:26
nine	jnthnwrthngtn: the wrong place in the bytecode part kinda fits with sp_resumption and what I meant with it being a strange beast. It's clearly not the runtime effect of JITed sp_resumption. But maybe we somehow handle it wrong when calculating the bytecode position when we return to the interpreter.	14:27
	Of course that would make much more sense if some actual deopt happened
timo	something going wrong with the callsite that's referenced in the resumption op?	14:56
	so it sort of changes its length on accident?
nine	resumption doesn't reference a callsite	14:57
timo	oh ok so the number of arguments it takes is in an inline cache or something	14:58
nine	Nah, it's just sp_resumption reg, int, int, ... with reg getting VMNulled, the first int being some index and the second int the number of varargs	14:59
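[Editor's sketch] The operand layout nine describes (a target register, an index, a count, then that many extra register operands) makes this a variable-length instruction, which is why a miscalculated bytecode position was a plausible suspect at this point. Below is a minimal C sketch of how a decoder might compute the operand length of such an instruction. The 16-bit operand widths and the names `read_u16` and `resumption_operand_bytes` are assumptions made for this example; they are not taken from MoarVM's oplist or source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read one 16-bit operand in native byte order (assumed width). */
static uint16_t read_u16(const uint8_t *p) {
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical layout, following the chat description:
 *   operand 0: target register (gets VMNull at runtime)
 *   operand 1: some index
 *   operand 2: number of variable register operands that follow
 *   operand 3..: that many register operands
 * Returns the total number of operand bytes to skip. */
static size_t resumption_operand_bytes(const uint8_t *operands) {
    assert(operands != NULL);
    uint16_t count = read_u16(operands + 2 * sizeof(uint16_t));
    return 3 * sizeof(uint16_t) + (size_t)count * sizeof(uint16_t);
}
```

If a decoder skipped only the three fixed operands and ignored the count, it would resume in the middle of the instruction and start interpreting register numbers as opcodes, which is the kind of symptom discussed above.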
	Somehow it's a mixture of JITed sp_resumption, finalizers and nested runloops	15:00
	I get the "Unwound entire stack and missed handler" message even though all callbacks have a CATCH block	15:02
	New one: MoarVM panic: No frame at top of callstack
timo	so, CONTROL then?
nine	No, they also got CONTROL blocks	15:06
timo	OK
	well it sounds kind of like memory corruption from where i'm standing, which is maybe a bit too far away to be of much use	15:07
dogbert17	there seems to be quite a few bugs present in MoarVM atm, unless it's the same problem showing itself under different circumstances	15:16
18:02 reportable6 left
nine	dogbert17: that's not terribly surprising considering the amount of changes that went in lately	20:10
dogbert17	true, now it's a question of finding them :)	20:25
nine	LOL, this is hilarious	20:28
	So...my bug somehow involves sp_resumption, GC and nested runloops, right? Except that it actually doesn't. sp_resumption is innocent and the GC just caused more callbacks to appear.	20:29
japhb	"hilarious" in the "OMG seriously?" sense?
nine	What happens is that the frame that the callback is running is completely JIT compiled, including the return_o. Now return_o replaces the current frame with its caller which in this case is the frame that calls the native code that eventually runs the callback.	20:30
	Exiting from the nested runloop is signified by the MVM_CALLSTACK_RECORD_NESTED_RUNLOOP record on the call stack. When MVM_callstack_unwind_frame encounters that it immediately returns 0 to signal that we need to stop the runloop.	20:31
	MVM_frame_try_return just forwards that result: return MVM_callstack_unwind_frame(tc, 0);	20:32
	The return_o op then checks this result: if (MVM_frame_try_return(tc) == 0) goto return_label;	20:33
	Now what does JIT code do? if (MVM_UNLIKELY(!tc->cur_frame)) { /* somehow unwound our top frame */ goto return_label; }
	s/JIT code/sp_jit_enter/	20:34
	It doesn't ever see that result and instead checks tc->cur_frame which at that time already points at the caller
	So we happily continue a runloop and venture forth into unexplored territory of random memory	20:35
timo	wheeeee!	20:36
japhb	.oO( "We're going on a trip, / in our favorite rocket ship, / zooming through the sky ..." )	20:37
Geth	MoarVM/fix_jited_return_from_native_runloops: 8a91bf8eb0 | (Stefan Seifert)++ | src/core/interp.c Fix JITed return from nested runloops When a callback frame is completely JIT compiled, including a return_o, we did not notice that it's time to exit the runloop. MVM_callstack_unwind_frame will already have set tc->cur_frame to the frame that called the native routine that in turn ran the callback and returned 0 to signal that the runloop should end. This 0 got forwarded by MVM_frame_try_return but JIT compiled code does not ... (8 more lines)	20:42
	MoarVM: niner++ created pull request #1601: Fix JITed return from nested runloops
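[Editor's sketch] The bug nine describes can be modelled in a few lines of C. This is a toy model only: `Frame`, `ThreadContext`, `unwind_frame` and `runloop_stops_after_return` are hypothetical stand-ins invented for illustration, not MoarVM's real types or functions. It assumes, as stated in the discussion, that unwinding the callback frame sets the current frame to the native caller and returns 0 when the nested-runloop boundary is reached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the real structures. */
typedef struct Frame {
    const char   *name;
    struct Frame *caller;
    int           entered_from_nested_runloop; /* models the boundary record */
} Frame;

typedef struct {
    Frame *cur_frame;
} ThreadContext;

/* Models the unwind described in the chat: the current frame is replaced by
 * its caller, and 0 is returned when the nested-runloop boundary was hit. */
static int unwind_frame(ThreadContext *tc) {
    Frame *returning = tc->cur_frame;
    tc->cur_frame = returning->caller;
    return returning->entered_from_nested_runloop ? 0 : 1;
}

typedef enum { CHECK_RETURN_VALUE, CHECK_CUR_FRAME } ExitCheck;

/* Returns 1 if the nested runloop would stop after the callback returns,
 * 0 if it would (wrongly) keep running in the caller's frame. */
static int runloop_stops_after_return(ExitCheck check) {
    Frame native_caller = { "native_caller", NULL, 0 };
    Frame callback      = { "callback", &native_caller, 1 };
    ThreadContext tc    = { &callback };

    int result = unwind_frame(&tc);
    assert(tc.cur_frame == &native_caller); /* already points at the caller */

    if (check == CHECK_RETURN_VALUE)
        return result == 0;          /* interpreter-style check on the result */
    return tc.cur_frame == NULL;     /* JIT-entry-style check on cur_frame */
}
```

In this model the result-based check stops the runloop, while the cur_frame-based check does not, because the current frame is non-NULL once it points at the caller. That matches the failure mode described: the runloop continues past the point where it should have exited.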
21:04 reportable6 joined
timo	got a clue why the mac build may have failed the test for `use Test; use Test; print "pass"`?	22:03
	dev.azure.com/MoarVM/MoarVM/_build...amp;l=4577	22:04
