Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
timo is too bed 00:05
i'm not actually in bed 01:01
timo hmm, for a callsite transformation cache design, it'll have to not only be threadsafe, but ownership is also important. i guess only interned callsites are allowed to go in anyway 01:48
drop_arg is apparently never called on an interned callsite object 02:18
oh, or drop_arg is not used instead of drop_args 02:19
using directly the intern cache seems to be an okay idea 02:27
timo MVM_callsite_drop_positionals, even when it looks through the intern cache first and tries to intern at the end if the incoming cs was interned but nothing appropriate was found in the intern cache 03:26
is at only 0.16% as per perf
callstack_find_topmost_dispatch_recording is at 0.16% but a little further up 03:27
0.02% drop_positionals when there's neither the looking through the cache nor the intern attempt at the end 03:30
all of this measured for rakudo -e ''
japhb Exclusive times, I assume? 03:55
Nicholas good *, * 05:43
en.wikipedia.org/wiki/X86_calling_...onventions -- Microsoft x64 calling convention ... ... System V AMD64 ABI ... If the callee is a variadic function, then the number of floating point arguments passed to the function in vector registers must be provided by the caller in the AL register. 05:50
I *thought* that there was also some requirement on integer arguments. Anyway, that one bites 05:51
MasterDuke huh. on (roughly) master, i got 19s then 14s for m-test, and 130s and 126s for m-spectest 10:26
lizmat so faster than master ? 10:38
MasterDuke well, i was comparing to yesterday when i was on new-disp and got 19s and 19s for m-test, and 178s and 171s for m-spectest 10:40
jnthnwrthngtn MasterDuke: Curious, I get 18s and 15s here 10:47
No way is spectest going to be faster than master given spectest is hugely dependent on startup time.
MasterDuke well, you do have a much faster machine. but i'm surprised my second run isn't any faster than the first
jnthnwrthngtn Yes, that's the part I'm surprised about. 10:48
I mean, nativecall is already pre-compiled, so it should be no trouble
(For the second run)
MasterDuke maybe it was just a bad hash randomization/spesh not being blocking interaction
jnthnwrthngtn I figure that's the main slowdown
MasterDuke timo++'s prs should help with startup/spectest, correct? 10:49
lizmat perhaps it's an effect of my work ? 10:50
:-)
MasterDuke my new-disp times don't account for your commits from today, so hopefully they'll be faster next time i run it 10:51
jnthnwrthngtn lizmat: I highly doubt the setting changes you're doing would have an impact on this, if that's what you're meaning 10:52
lizmat ah, ok 10:53
jnthnwrthngtn takes a look at timo++'s work 10:54
jnthnwrthngtn Hm, I was going to say that fix to args tail is making an assumption that something with no capture is an arg drop, and that could be fragile if we later do a replace arg, although replace would be drops + insert... It'd take an insert multiple to be a problem I guess 11:03
MasterDuke there are a bunch of ops where the interpreter implementation is roughly `if (REPR(foo)->ID != MVM_REPR_ID_something || !IS_CONCRETE(foo)) MVM_exception_throw_adhoc(tc, msg) else <do something>`. sometimes the jit version it just <do something>, and sometimes it has that `if` 11:07
lizmat jnthnwrthngtn: sanity check: use nqp; sub a(Range:D $a) { dd nqp::iscont($a) }; my $r = 1..3; a $ gives 1 on new-disp 11:12
I thought they were to be deconted?
MasterDuke is there an easy way to know if the jitted version does/does not need the repr and/or concreteness checks? 11:13
lizmat /they were/$a is/
MasterDuke m: use nqp; sub a(Range:D $a) { dd nqp::iscont($a) }; my $r = 1..3; a $r 11:14
camelia 1
jnthnwrthngtn lizmat: Hm, I'd have expected master to do the same given the Range constraint 11:20
lizmat: Is the :D significant?
lizmat no
jnthnwrthngtn One can't rely on the caller side of have decont'd in new-disp, it just may have done it 11:21
Oh!
m: say Range ~~ Iterable
camelia True
jnthnwrthngtn The container is required here because it's an Iterable
Otherwise it would flatten
So the signature binder re-wraps it
So yeah, it's correct 11:22
MasterDuke: If the C thing being called does the checks, then the interpreter (and JIT if needed) can have them removed. If it doesn't, they're needed in both
MasterDuke: One of the reasons we'll gradually move towards syscalls, though, is that we can enforce the types outside of the C function and elide them 11:23
jnthnwrthngtn lunch, bbiab 11:23
MasterDuke ok. for some reason i thought we could assume in some cases that things were concrete and/or the right REPR when being run by the jit. but i'll make sure any checks done by the interpreter are also done in the jitted versions i'm making 11:26
lizmat down to 1.251 / .703 11:47
jnthnwrthngtn MasterDuke: Yes, though do check the interpreter really needs them also 12:04
MasterDuke ah, how do i check that? 12:05
jnthnwrthngtn lizmat: With further ops converted to $ ?
MasterDuke: Look at the C function called by the interp and see if it repeats the check
MasterDuke oh, ha. so far i'm just creating the c funcs, so not repeating anything, but i can check the existing ones i come across 12:06
Geth MoarVM/new-disp: edc4fb9d57 | (Timo Paulssen)++ (committed by Jonathan Worthington) | 7 files
add dispatcher-drop-n-args to optimize allocations

Instead of creating a MVMCapture and MVMCallsite for each step of removing arguments, we now offer a syscall that drops multiple arguments that live at the same index in one go.
The result is that the transformations tree can now contain null entries for the capture entry, which we have to interpret and deal with.
jnthnwrthngtn timo: I did a small change in order I could get some asserts in that will help us if we try to do further things like this 12:09
timo: Merged the NQP and Rakudo ones as is (well, rebsaed Rakudo one) 12:13
lizmat jnthnwrthngtn: yes 12:14
fwiw, no noticeable change in test-t after timo's work 12:20
MasterDuke any changes in spectest? 12:21
lizmat I don't time that atm
MasterDuke `t/02-rakudo/03-corekeys-6d.t .................................... Dubious, test returned 1 (wstat 256, 0x100)`, doh 12:23
lizmat did I broke that? 12:30
jnthnwrthngtn lizmat: It was primarily aimed at startup, and does managed to give sound couple of thousand allocations less at lesat. 12:31
*least
MasterDuke no, this is on my branch off of master. it was just a flap, but that test is pretty simple... 12:32
jnthnwrthngtn ah, actually more than that, the thousands are only those one profile-compile starts measuring but there are others in NQP setup 12:33
lizmat argh I didn't pull rakudo itself, so I missed that part of timo's work 13:03
new timings in a mo
jnthnwrthngtn Be sure to pull NQP too 13:05
lizmat no noticeable change in test-t
perl Configure.pl --force-rebuild --gen-moar=new-disp --gen-nqp=new-disp --make-install 13:06
will do that, but not for Rakudo itself :-)
afk for a few hours&
jnthnwrthngtn Just been doing some comparative measurements of master/new-disp (mostly microbenchmarks, also measuring a Cro app). We're doing a lot better at a bunch of the targetted features, of course, but also a bit better on various things that are effectively just multi/method dispatch based. 13:50
MasterDuke very cool 13:51
jnthnwrthngtn The Cro app gets a couple of hundred more requests per second, around 10% more. 13:52
Nicholas ./rakudo-m -Ilib t/spec/S32-list/grep.rakudo.moar
jnthnwrthngtn What's worse, other than the obvious (startup)
Nicholas MoarVM oops in spesh thread: Spesh: failed to fix up inline 1 () -1 -1
that was MoarVM edc4fb9d57d245929ee5d4d013b22bef1a63bf9b
(not sure if the rest matters)
jnthnwrthngtn Are almost entirely I/O benchmarks (reading lines, writing lines)
Nicholas ASAN made no comment
jnthnwrthngtn The reason for the I/O ones is seemingly that OSR does not function 13:54
MasterDuke ah. that might explain why my spesh log processing one-liner is slower
jnthnwrthngtn This is due to a7c0cc8d2b, which is a bug fix
Somewhere on the I/O path we do a role composition, and do a `ctx` op. We actually only want it for the current context, not for traversal. 13:55
Nicholas my assertion failure was with rakudo back at 9c587d92d0cdb2aa86c2ca70ed15b5c478443b02 -- Use new dispatcher-drop-n-args syscall 13:56
jnthnwrthngtn However, we don't have a way to indicate that, and so it assumes it's wanted for traversal and marks everything in the caller chain
We then are unable to OSR
m: say 0.6814 / 0.4247 13:57
camelia 1.604427
Nicholas "obviously" (it seems to be, as a poor quality teddy bear) that the brute force solution to this is a second op that is just "the current context". But is there a better way?
jnthnwrthngtn A 60% slowdown. That's not nice. 13:58
So that probably needs a solution
The other case that I don't have an explnation for yet is an object creation benchmark 13:59
MasterDuke is `ctxlexpad` sort of "the current context"?
jnthnwrthngtn ctxlexpad turns out to be the identity function :/ 14:01
I suspect it hadn't used to be
Nowadays the thing from ctx is just directly indexable for the current context 14:02
Nicholas "in spesh thread" - this might be the first "win" from commit 998ea76a17cb8dbafc6dc392d15d40a487d236c3 14:04
jnthnwrthngtn Nicholas: I've been happy about that at least a copule of times before recently 14:05
s/before//
Nicholas ah OK. It's the first that *I* noticed. I slack more. 14:05
(a *lot* more)
timo hm looks like the discord bridge works only one-way at the moment 14:12
cdn.discordapp.com/attachments/633...004322.jpg 14:20
jnthnwrthngtn Nicholas: I just did a spectest with blocking/nodelay to verify my change to get OSR back and also see that inline fixup exception 15:01
Geth MoarVM/new-disp: 6f2b01c275 | (Jonathan Worthington)++ | 8 files
Introduce non-traversable contexts

These are for when we will only read the lexicals of the exact frame we obtained it in, and thus can avoid marking the whole callstack up as needing caller position information preserved.
15:06
MoarVM/new-disp: baf1423327 | (Jonathan Worthington)++ | 3 files
Be more precise about OSR caller positions

There are two situations in which we set the caller info needed flag: one when we throw an exception and want to produce a backtrace, and another when we need to do context introspection. Only the latter is in absolute need of accurate position information, and thus must poison OSR. This, together with non-traversable contexts, lets us get OSR back in various situations, including some common cases of I/O, fixing a performance regression relative to `master`.
Nicholas running that spectest with a non-ASAN build with valgrind produced quite a bit of excitement at optimize_bb_switch (optimize.c:2299) and optimize_bb_switch (optimize.c:2280) 15:07
Conditional jump or move depends on uninitialised value(s)
and once at at 0x4B82E97: build_cfg (graph.c:487)
Geth MoarVM/new-disp: ea63d91730 | (Jonathan Worthington)++ | src/core/ext.c
Fix uninitialized read in spesh graph building
15:09
jnthnwrthngtn Nicholas: The second of those was easy, the first I've spent a while trying to figure out and can't
(A while before now, that is)
Geth MoarVM: MasterDuke17++ created pull request #1550:
Add '.new()' suggestion to type object errors
Nicholas jnthnwrthngtn: the ones you can't figure out - is there (at least) a short(er) way to trigger then? 15:11
them
jnthnwrthngtn Nicholas: I didn't repro them, just went through the code involved a few times 15:12
Hm, or if I did it didn't give me any extra clues...
Geth MoarVM/attempt_use_intern_cache_for_drop_positionals: 23ebdab1ce | (Timo Paulssen)++ | src/core/callsite.c
If possible, use the intern cache for transforms

Doesn't actually seem faster than allocating them every time we do transformations. I have only measured using an empty raku program, however, since I was hoping to make startup cheaper.
15:18
timo ah, yes. sometimes you ctx, but sometimes you ctxn't 15:19
jnthnwrthngtn Well, there really are no annotations for inline 1 in the spesh graph... 15:21
MasterDuke i just switched to new-disp and pulled all three repos, built, and ran two `make m-test m-spectest`. got 22s and 19s for m-test, and 176s and 171s for m-spectest 15:22
Nicholas timo: reason why you spotted that PyPy blog post and I didn't - it's not on the front page. 15:24
timo that's odd 15:25
Nicholas not totally.
timo you think it's not "common interest" or whatever?
Nicholas I didn't dig into *how* they made the side, but it looked like it might have been that it required "manual" work to update the front page. (No idea if that's a script to bake a new front page, or what) 15:26
I think that this was oversight. But I failed to be helpful and try to create a decent bug report
MasterDuke hm, does look like maybe my spesh log processing one-liner is a bit faster after that OSR fix though... 15:27
timo that's the code that uses -n that you mentioned the other day, yes? 15:30
MasterDuke yeah
jnthnwrthngtn OK, I figured out the inline fixup bug and it's terrible 15:31
Nicholas um, like "headdesk, how did I make that mistake?" or "oh, erk, this is gnarly to get right?" 15:32
jnthnwrthngtn It occurs when all of the following happens:
1. We are doing a nested inline 15:33
2. The thing we are inlining, which has its own inlines, has an inline that shrank to zero instructions
3. The annotations about it end up on an sp_bindcomplete, which we delete as part of inlining
It processes the annotations on the bindcomplete instruction and fixes them up. We then delete said instruction. The annotations then move onto the next instruction so as not to get lost. 15:34
We then fix them up again
Making them bogus
timo and then they bug us
MasterDuke am i correct in thinking that if possible, it's better to jit something via writing some asm in emit.dasc than moving it to a function and calling that from the interpreter and the jit? 15:37
timo we're essentially making the same trade-off the compiler does when deciding whether to inline a given function 15:39
Geth MoarVM/new-disp: 75560fd2ec | (Jonathan Worthington)++ | src/spesh/inline.c
Correct deletion of sp_bindcomplete

We cannot do it immediately, as annotation motion might cause us to fix up the same annotation twice, which is wrong. Thus do the deletion after all fixups of annotations are completed.
15:40
timo if the code to Do The Thing is about as short as the stuff to call the function and the parts of the function that deal with being called, then we can probably prefer emit.dasc
jnthnwrthngtn Nicholas: That seems to do it.
MasterDuke i'm going to guess they are here github.com/MoarVM/MoarVM/blob/mast...3010-L3027 15:45
jnthnwrthngtn With OSR reinstated we now beat master at the I/O benchmarks, and have caught up with Ruby in a "write a million lines of utf-8" one 15:49
timo in theory spesh could put markcode* into the repr-speshed ops and allow MVMCode REPR to optimize it into sp_get_i* and sp_bind_i* or whatever 15:50
jnthnwrthngtn timo: Yes, I'd already figured we want something like that, just didn't quite figure out how 15:51
(As in, a nice way to factor it)
timo getstaticcode and gedcodecuid could also
do you mean how i described it isn't that nice way to factor it? 15:53
japhb jnthnwrthngtn: Nice to hear we're caught up with Ruby on that benchmark, but where is Ruby on the utf-8 I/O efficiency scale? Is this a major achievement?
MasterDuke somehow i missed that comma 2021.08 was released, i'll have to give its profile viewer a try 15:54
jnthnwrthngtn japhb: More efficient than Python, less than Perl. 15:55
japhb: Although I should add: less than *recent* Perl.
(I think there were UTF-8 I/O speedups there) 15:56
japhb: Major achievement only really in so far as our I/O handle impl and coordination of encoding is all in Raku, whereas I suspect in Ruby one ends up far more quickly in C code. 15:57
OK, so about the object creation benchmark we've lost something on from master: profiler says we JIT 99.95% of frames on master but 36.35% of frames on new-disp. 4% less inlining. 15:58
timo oof
jnthnwrthngtn Though oddly, I can't find any "bailed completely" in the spesh log 15:59
lesson, bbl 16:01
timo are there any prof_enter that should have become enterspesh in the spesh log?
MasterDuke that sounds like exactly what a profile of my one-liner shows 16:02
japhb jnthnwrthngtn: Ah, interesting re: UTF-8 I/O efficiency. That all gives me good context, thanks. 16:28
MasterDuke timo: show can i know if a prof_enter should have been prof_enterspesh? if it's in the 'after'? 16:30
timo yeah
MasterDuke they're all in the 'Before', don't see in an 'After' 16:36
timo OK
MasterDuke any other ideas? 16:45
timo i'd perhaps perf record and see if there's actually a big portion of samples in interp_run rather than jitted frames which would be identified from having the perf map on 16:47
except i've seen a boatload of 0x000000asdfgh frames in perf report results as well even with the perf map turned on 16:48
MasterDuke: could you give me your -n code right quick? i thought i had it but i don't 16:59
MasterDuke raku -ne 'BEGIN my (%h, $f); if .starts-with(q|Spesh of |) and /^"Spesh of " $<func>=(<-[\ ]>+)/ { $f = ~$<func> } elsif .contains(q|JIT: bailed completely because of <|) and /"JIT: bailed completely because of <" $<op>=(<-[>]>+)/ { %h{q|l_|~$<op>}.push($f) } elsif .contains(q|expr bail: Cannot get template for: |) and /"expr bail: Cannot get
template for: " $<op>=(\w+)/ { %h{q|t_|~$<op>}.push($f) }; END for %h.keys.sort -> $k { say qq|$k: %h{$k}.Bag()| }'
timo haha that's long
MasterDuke 23.75% MVM_interp_run 17:00
10.01% MVM_string_utf8_decodestream
yeah, guess i could pull those strings out into a variable 17:01
of course i only duplicated them when it was too slow with just the regex 17:03
of course i only duplicated them when it was too slow with just the regex 17:04
timo hehe. 17:05
MasterDuke 23% interp_run is way higher than usual, seems to suggest stuff actually isn't getting jitted. but why can't we tell why? 17:07
timo <anon> from -e:1 here has 1.3 mega entries and 0% jit 17:08
hum. 17:11
there's no complete bail
but it does say "jit not successful"
jnthnwrthngtn oh 17:13
timo it does succeed jitting in my non-profiled version here 17:14
wonder what's wrong there
jnthnwrthngtn I'd missed the "jit was not sucessful"
timo oh, is it normal to have more than one return_o in a resulting frame?
jnthnwrthngtn timo: "resulting"? 17:16
It's OK for there to be more than one return_o in general
timo "After:"
jnthnwrthngtn Oh 17:17
Well, did the before have it? 17:18
timo ok i searched further, there's more than one -e:1 and the longer one is also not usccessfully jitted without profile
nine Darn.... the "Type check failed for return value; expected CompUnit::Handle:D but got BOOTIO (BOOTIO)" is still here. Will have a look at this on Saturday I guess
jnthnwrthngtn nine: I was gonna see how new-disp did on agrammon and it also blew up with that
nine Oooh...so it's not just this one application. 17:19
Gives hope for a reduced test case
jnthnwrthngtn Agrammon is kinda the opposite of a reduced test case, but I dunno how big your application is :D
nine tree says 47 directories, 201 files
jnthnwrthngtn Hm, it may be smaller 17:20
(Agrammon, that is)
Wonder if it's a deopt-o
nine I guess the deciding factor is just: load tons of modules so load-precomp-file gets speshed
not sensitive to inlining for a change 17:21
jnthnwrthngtn I couldn't imagine there would be as many ways to screw up deopt as I've managed to create...
nine MVM_JIT_DISABLE=1 MVM_SPESH_OSR_DISABLE=1 MVM_SPESH_PEA_DISABLE=1 MVM_SPESH_INLINE_DISABLE=1 and its still there
jnthnwrthngtn timo: I shoved in some debug prints and it turns out that we make a JIT graph but fail to compile it 17:22
nine ought to start making dinner though
jnthnwrthngtn timo: And it happens without profiling, so I suspsect that really is the problem
timo right, i see it regardless of profiling or not as well
so somewhere in the final jitting step it's failing but not bailing 17:23
jnthnwrthngtn I made a nice batch of potato salad yesterday and so dinner preparation is easy today :)
timo if you want i'll reverse-step in rr to see where exactly it stops
jnthnwrthngtn I was trying to do exactly that, set a breakpoint on printf, and it didn't hit it, wat. 17:25
timo we vsprintf 17:26
that may not go through printf
jnthnwrthngtn ah 17:34
ah, with MVM_JIT_DEBUG I get:
JIT ERROR: Negative offset for dynamic label 18
Ohh. With MVM_JIT_EXPR_DISABLE=1 it goes away 17:43
And we get 99.9% JIT
So it's apparently about the expression JIT 17:44
Though despite that it still doesn't really get back all the perf...
Ah, the bigger discrepancy may well be that `new` doesn't get inlined 17:46
Ah, just a "bytecode too large". Guess I need to look at why 17:48
But first, food
dogbert17 so, it's time for brrt to make an appearance ... 18:01
MasterDuke wow, with MVM_JIT_EXPR_DISABLE=1 i get 12.36% MVM_string_utf8_decodestream and 9.25% MVM_interp_run 18:13
Xliff Anybody wanna play around on a 64-core Google Engine instance? 18:38
Looks like Raku may have a parallelism bug ... or maybe my script does. 18:39
Working script conks out when parallel compiling my p6-ICal project with no reason.
I have no idea what to look for.
If no takers, I'll pack it up and shut it down. It's costing me $3/hour
I'll check again in a couple of hours
Nicholas I sugest that you shut it down for now, assuming it doesn't cost you lots more than $3 to set it all up again 18:40
I'm going AFK soon, and I think most folks are not really awake
[Coke] I'm in the right time zone, but also am zonking.
Nicholas Thanks to the GCC compiler farm I have access to 32 core x86_64 machines. Which, sure, aren't 64 cores. But aren't costing me. 18:42
(and insane PPC machines, I think thanks to IBM. But if the bug is in the JIT, they won't help)
Xliff I just tried at concurrency level 32 and the hangup did NOT occur. 18:50
I will try my whole set of projects at level 32 and see what I get! ;q 18:51
Nicholas jnthnwrthngtn: yes, you fixed the failure in t/spec/S32-list/grep.rakudo.moar 18:54
(forgot to confirm)
timo i wonder if i should try exposing "percentage of calls that aren't jitted because they were inlined calls from a non-jitted frame" in moarperf? 19:09
at the moment you can open the "callers" table in the routines list, then you see one with 99.4% inlined, 0.0757% jitted and one with - inlined but 99.5% jit 19:10
MasterDuke huh, could be useful 19:14
timo we don't have separate nodes in the call graph for inlined calls vs regular calls, so we can't "follow" inlined calls to the original inliner so to speak 19:16
anybody feel like we should maybe statically determine what branches have not been taken at all and throwing them out of our spesh graphs and put an unconditional deopt there? 19:37
nine timo: sounds like it help with inlining by getting the bytecode size under the limit. Also doesn't sound like something very common? 19:45
timo can search for "never dispatched"
it's common for subs that have a path that throws an exception in some cases
like division that has to check for zero for example 19:46
jnthnwrthngtn Exception paths often aren't taken and could indeed be handled with deopt 20:00
And yeah, lack of inline cache entries is a really good hint. 20:01
We don't even have to record branch stats that way
timo since we have a dispatch_* every few meters now anyway ... :) :) 20:03
jnthnwrthngtn Indeed :) 20:05
Hm, this is weird. `method new` is not getting any type tuples recorded in its stats 20:06
(Mu.new, that is) 20:07
omg 20:17
I missed updating a spot in the spesh stats code for the change to the way named parameters are handled 20:21
As a result, we lost type info for everything with named args 20:22
timo :D
jnthnwrthngtn Anyway, that gets things much better :) 20:31
Will do a blocking + nodelay test to make sure the extra optortunities don't shake out new problems
Geth MoarVM/new-disp: 7d3cba4e2d | (Jonathan Worthington)++ | src/spesh/stats.c
Correct handling of named arg type stats

This wasn't updated for the new calling conventions, and thus we would consider type tuples with named arguments to have incomplete type info, and thus specialize them suboptimally.
20:47
MasterDuke whoops 21:58
jnthnwrthngtn Good for my private benchmark set though; it's caught two regressions that have both been fixed today. 22:13
jnthnwrthngtn I'm content so far as performance goes with the merge now. The startup hit is the only significant regression I'm aware of. Guess we'll see what blin's verdict is. 22:16
MasterDuke i still see the problem where my script is only 55% jitted 22:19
jnthnwrthngtn Was it more JItted on master? 22:23
And does the rate go up with MVM_JIT_EXPR_DISABLE=1 ?
MasterDuke yes, it looks like the rate goes up with MVM_JIT_EXPR_DISABLE=1. i didn't get a profile like that, but interp_run goes from ~23% to ~10% 22:24
jnthnwrthngtn OK, probably it's the thing I discovered earlier then
lizmat down to 1.241 / .719 # all within noise, but feels a bit faster
jnthnwrthngtn MasterDuke: If you run MVM_JIT_DEBUG=1 does it spit out a message about a negative label? 22:25
MasterDuke let me see... 22:26
JIT ERROR: Negative offset for dynamic label 185
JIT ERROR: Negative offset for dynamic label 65
jnthnwrthngtn That's the one
I do wonder if we can isolate it to a particular template
Also wonder how widespread this is 22:27
Ah, I see a bunch of them during the Rakudo build if I set it while doing that 22:28
Several of them in Test::CSV too 22:29
MasterDuke ha. profile with no env variables is 2mb. profile with MVM_JIT_EXPR_DISABLE=1 is 11mb
and 89% jitted instead of 55%
both have 35k deopts 22:30
whoops, without is 44% jitted, not 55%
jnthnwrthngtn Yeah, this is worth trying to hunt down. 22:33
MasterDuke interesting. on master, the profile says it's 20s slower (than new-disp with MVM_JIT_EXPR_DISABLE=1), but 93% jitted instead of 89%. still ~33k deopts 22:42
jnthnwrthngtn MasterDuke: What's the wallclock times on master/new-disp? 22:51
MasterDuke master is ~58s 22:52
building new-disp... 22:54
timo were you able to rr it by breakpointing vsnprintf or whatever, jnthnwrthngtn? 22:55
MasterDuke new-disp is ~55s 22:56
new-disp + MVM_JIT_EXPR_DISABLE=1 is ~45s
hot damn 22:57
jnthnwrthngtn MasterDuke: Ah, so the issue not that new-disp is slower, but that it should be even faster. OK, that's a nice problem. :)
(I'd misunderstood it as new-disp being slower) 22:58
MasterDuke a couple days ago it was ~15s slower, so there have been some good improvements recently
timo worst case ever would be: when this problem from exprjit happens, start again from the start but turn exprjit off, either completely, or after the given BB or ins or whatever that caused trouble
jnthnwrthngtn timo: I just put a breakpoint on the line where printf was. I did trace it back a bit further, but then spotted there's a jit debug option, ran with that, and it told me where it was bailing out 22:59
timo i wonder if i can find anything interesting by finding the exact spot where it happens tho 23:00
jnthnwrthngtn At a wild guess, we create a label but never emit it, so it never gets fixed up by dynasm
timo did JIT_DEBUG also spit out the graphs and tile lists in the spesh log? perhaps that only comes after the error, so wouldn't show anything in our troubled case 23:01
jnthnwrthngtn Hm, not sure...I think that's another option?
The error is really late, fwiw
It's storing the labels produced by dynasm into the spesh candidate and notices a negative one 23:02
timo yeah, after all the nodes have been put down, which includes one node for each exprjit graph
jnthnwrthngtn Yeah, we've even produced machine code by that point
I wonder if labels get negative offsets before dynasm emits code and fixes them up as it does so 23:03
Thus the "not emitted label" theory
I don't know the expr jit well enough to know how plausible/likely that is
timo right. same, really 23:06
timo 13: (branch (label $name)) 23:13
and then the label is nowhere to be seen!
there is only (label $name) and (label :fail) in the tile list logs? ok that just means that's the tile that implements a piece of the tree, that's why it says $name there 23:16
but it's definitely missing another appearance of the (label ...) tile 23:17
it's a bit tricky to navigate the huge spesh logs we have, especially when there's thousands and thousands of lines just for updating stats but no specializations 23:23
i'll print out some pointers or something to help me find the spot where actually a thing happens 23:25
yeah i don't actually know where to look here to see what's going on 23:51
i can dump the compiled bytecode before it tries to do the dynamic label fixup 23:53
but i think i still need brrt to make sense of this problem 23:57