Nicholas good *, #moarvm
MasterDuke releasable6: status 08:23
releasable6 MasterDuke, Next release in ≈15 days and ≈10 hours. 3 blockers. Changelog for this release was not started yet
MasterDuke, Details: gist.github.com/5403529993f6bb901d...8fabfc4930
MasterDuke the last two blockers might've already been fixed? 08:25
any objections to merging github.com/MoarVM/MoarVM/pull/1555 ?
Nicholas I'm not competent to review it, so I can't usefully comment. (But obviously, d'oh, I can't really object either. Which was your actual question) 08:38
MasterDuke those commits have had quite a large number of spectests run, with no (new) problems. however, if people want to wait until after the release since we did already have the large new-disp merge that's fine 08:42
jnthnwrthngtn moarning o/
Nicholas \o
jnthnwrthngtn MasterDuke: I think 15 days is plenty of time to shake out issues, and running with JIT disabled is a good way to see if any issues might relate to them. 09:42
MasterDuke: I assume you've done spectest with blocking + nodelay also?
MasterDuke no, but i can run that now
jnthnwrthngtn OK, do nqp and rakudo build and test with that; if no regressions in those, I'd say merge it. 09:47
MasterDuke wow, i don't usually run full spectests with those. so much slower! 10:05
Geth MoarVM/master: 16 commits pushed by (Daniel Green)++, Unknown++, MasterDuke17++
review: github.com/MoarVM/MoarVM/compare/6...33aef886e7
MasterDuke i guess probably a good time for nqp+rakudo bumps to help with any bisecting if needed 10:11
lizmat shall I do the honours then?
MasterDuke sure 10:12
lizmat 2021.09-624-ge733aef88 # wow, that's a high number of commits since the release :-)
hmmm... not sure if it's something to do with my MBP, but test-t times appear to have almost doubled for me? 10:34
jnthnwrthngtn lizmat: Hm, can you isolate it to a particular change? 10:36
lizmat it was a few days ago since I last did it... :-(
I thought: let's run it again, see if MasterDuke's changes helped
MasterDuke there haven't been all that many changes after the new-disp merge, right? so mine probably caused it? 10:39
jnthnwrthngtn oh, gah, I was about to say "I don't see much change" but was running the MQTT test instead of test-t 10:40
lizmat MasterDuke: I'm not sure
please let someone else confirm my numbers
it could well be something on my machine...
seems I have a Spotlight indexing run atm 10:42
will check again in 30 mins 10:43
yeah, it was something local: 11:22
1.105 as a new lowest for me
sorry for the noise 11:23
MasterDuke does anybody have any idea how to diagnose/debug why the expr jit currently can make things slower? 11:43
lizmat what was the way to disable it again? 11:46
MasterDuke but, uh, i now get a segv in that mqtt test if i disable it 11:47
Thread 1 "raku" received signal SIGSEGV, Segmentation fault.
0x00007ffff78db71b in compose (tc=0x55555555a110, st=0x48000000c8ec8148, info_hash=0x7fffefa0dde8) at src/6model/reprs/P6opaque.c:691
691 if (st->REPR_data)
lizmat lowest test-t with expr jit disabled: 1.043 11:49
m: say 1.105 / 1.043
camelia 1.059444
lizmat so 5% faster ?
MasterDuke and i just jitted newtype, newmixintype, and composetype (which calls compose in the emit.dasc implementation i added)
think i see the problem 11:51
lizmat ah?
Geth MoarVM: a6ff2c031b | (Daniel Green)++ | src/jit/x64/emit.dasc
Fix segfault in lego jit of composetype

FUNCTION is aliased with TMP5, so TMP5 was being overwritten and that meant we were getting the wrong STABLE later.
lizmat MasterDuke: another bump warranted ? 12:03
MasterDuke it's unlikely that people are running with the expr jit disabled (the template for composetype is fine), so i wouldn't say it's vital, but it couldn't hurt 12:05
jnthnwrthngtn MasterDuke: It'd help to figure out if the slowdown we see is either a) because the machine code produced is worse, or b) because we spend more time producing said machine code, and so spend more time interpreting
If it's b) then we'd expect to see the difference fade away by increasing the amount of time the benchmark runs for.
lizmat is it easy to switch off the actual deployment of machine code ?
to find out how much overhead it is 12:06
jnthnwrthngtn lizmat: Don't know of an easy way. We can probably somewhat see the effect in profiles of MoarVM though (by looking at functions involved in the expr JIT)
The spesh log also has times taken to JIT things. 12:07
We could grep those out and sum them
lizmat ah... but is that all of jitting, or just the expr jit ? 12:08
jnthnwrthngtn All 12:09
But you could still compare the numbers with it enabled and disabled
MasterDuke just built everything and ran all tests with the expr jit disabled, no problems 12:10
jnthnwrthngtn I can believe there's a situation where the machine code produced is worse, but taking longer to produce the machine code in the first place is worth investigating.
Analyzing what's going on if code quality is worse will be much harder, so it'd be better to not do that if it's not really to blame. 12:11
bbi10 12:12
MasterDuke 637077us total with the expr jit 12:19
344700us total without the expr jit
from a spesh log of running the mqtt test 12:20
874 instances of 'JIT was successful and compilation took' with the expr jit 12:22
870 instances without the expr jit
longest individual time with the expr jit was 26658us 12:23
longest individual time without the expr jit was 18840us
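The counts, totals, and maxima quoted above can be gathered by scanning a spesh log for the JIT timing lines. A minimal sketch, assuming each line has the shape `JIT was successful and compilation took 123us` (or `JIT was not successful ...`), which is inferred only from the phrases quoted in this discussion; the function name `summarize_jit_times` is hypothetical.

```python
import re

# Assumed line shapes, based only on the phrases quoted in the chat:
#   "JIT was successful and compilation took 123us"
#   "JIT was not successful and compilation took 123us"
JIT_LINE = re.compile(r"JIT was (not )?successful and compilation took (\d+)us")

def summarize_jit_times(log_text):
    """Return count, total and longest time (in us) per JIT outcome."""
    times = {"successful": [], "unsuccessful": []}
    for match in JIT_LINE.finditer(log_text):
        outcome = "unsuccessful" if match.group(1) else "successful"
        times[outcome].append(int(match.group(2)))
    return {
        outcome: {
            "count": len(values),
            "total_us": sum(values),
            "max_us": max(values, default=0),
        }
        for outcome, values in times.items()
    }
```

To use it, read the text of a spesh log (for example one written via `MVM_SPESH_LOG`) and pass it to `summarize_jit_times`, once for a run with the expr JIT enabled and once for a run with it disabled, then compare the two results.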
dogbert17 I have a program which runs in 34s, without the expr-jit it's 26s
lizmat test-t on a 20x larger file shows with / without expr jit *ENABLED* as: 15.020 / 14.447 12:25
so even on a longer running process, not using expr jit is faster
MasterDuke dogbert17: that would seem to indicate bad code being generated. can you check the compilation times in spesh logs
lizmat which to me indicates the generated code is not an advantage?
MasterDuke the routine that took the longest to compile with the expr jit was 'lexical_vars_to_locals' 12:27
316 BBs, Frame size: 9180 bytes (1568 from inlined frames), Specialization took 39093us (total 71578us), Bytecode size: 43270 byte
jnthnwrthngtn m: say 15.020 / 14.447 12:39
camelia 1.039662
jnthnwrthngtn So around 4% rather than 5% after some time, so we could maybe interpret that as "compilation time is a factor but not the dominating one" 12:40
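One way to make the "compilation time is a factor but not the dominating one" reading concrete is to fit a straight line through the two test-t measurements. This is only a sketch: it assumes a simple model `with = slope * without + fixed`, where `fixed` stands in for one-off costs such as compilation and `slope` for a proportional slowdown, and two data points cannot validate that model. The function name is hypothetical.

```python
def fit_linear_overhead(short_pair, long_pair):
    """Fit with = slope * without + fixed through two (with, without) pairs."""
    (with_short, without_short), (with_long, without_long) = short_pair, long_pair
    slope = (with_long - with_short) / (without_long - without_short)
    fixed = with_short - slope * without_short
    return slope, fixed

# Timings in seconds quoted in the chat: (with expr JIT, without expr JIT)
slope, fixed = fit_linear_overhead((1.105, 1.043), (15.020, 14.447))
print(f"proportional slowdown: {(slope - 1) * 100:.1f}%")
print(f"fixed overhead: {fixed * 1000:.0f} ms")
```

Under that assumed model the numbers come out at roughly a 3.8% proportional slowdown plus about 22 ms of fixed cost, which is consistent with the gap shrinking from about 6% to about 4% without disappearing.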
dogbert17: That's a really interesting case. Are there any indications in a spesh log of JIT being unsuccessful?
Or alternatively can profile and see percent JITted or not 12:41
lizmat: Percent JITted in a comparative profile of test-t is also interesting, also any difference in deopt rates. 12:44
lizmat feels all within noise levels for the standard test-t run 12:49
only significant difference I see is 2 On Stack Replacements with the expr jit enabled, and none with it disabled 12:50
jnthnwrthngtn Hm, and deopts?
lizmat both 7 deopts
and no global deopts
jnthnwrthngtn Curious.
No smoking gun there, then. 12:51
Although the extra OSRs are a little curious
lizmat oddly enough, disabling the expr jit results in *more* jit compiled frames
98.19% with disabled, 98.08% enabled 12:52
but that feels like noise
jnthnwrthngtn That's frames in the dynamic sense, not the static one, so it's showing that we spend more time before the JITted version is available 12:54
I'd expect it's repeatedly observable rather than noise, but it's also a small effect. 12:55
MasterDuke dogbert17: can you share that program?
dogbert17 jnthnwrthngtn: I can check
jnthnwrthngtn So is another hint we're looking at a machine code quality issue 12:56
(I asked about deopts in case there's a bug in the expr JIT guard generation that sees us deopt in cases we should not.)
(But no evidence so far.)
MasterDuke fwiw, 39 'JIT was not successful' with the expr jit, 47 without (still for the mqtt test) 12:57
dogbert17 MasterDuke, jnthnwrthngtn: since I'm a nice guy :) I'll share the code. gist.github.com/dogbert17/7099a67e...3b6ac0a08f 12:58
dogbert17 hello brrt
# [012] dispatch not compiled: op MVMDispOpcodeBindFailureToResumption NYI 13:02
brrt ohai dogbert17
dogbert17 we're trying to figure out why a program runs faster when the expr jit is turned off
brrt ah, that's... a good thing 13:05
and the answer is 'we don't have a benchmarking suite'
or if we do, we don't have a systematic way to run it 13:06
dogbert17 this is a bizarre case, from 34s with exprjit to 26s without
in case you're intrigued the src gist is about ten lines up in the irc log 13:08
brrt I am
(I am also chronically short in time)
jnthnwrthngtn dogbert17: Thanks for the script; I can reproduce the difference too (29.9 with, 22.3 without) 13:11
dogbert17 cool but now I'm getting envious of your hardware :)
brrt hmmm... if it were bad code generation, that much worse? 13:13
there's an obvious fix though
disable the expression JIT :-)
imo the register allocator is suspect... 13:15
and consider; the expr jit needs to be 'clever' about function calls, the lego jit does not
MasterDuke interesting, i can't repro the time difference 13:24
jnthnwrthngtn Uhhh...did I mess something up or does --profile make the difference vanish?
Or at least produce identical profiles 13:25
dogbert17 FWIW, there are two 'JIT was not successful and compilation took 123us' when the exprjit is enabled but three such messages when it's disabled 13:28
MasterDuke what about sum of jit compilation times? 13:29
dogbert17 normal, i.e. with exprjit I get 133756 and without 42342 13:30
MasterDuke so almost triple, but the actual time it took wouldn't explain the runtime difference 13:32
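The point that the extra compilation time cannot explain the runtime gap can be checked with a little arithmetic. A small sketch, assuming the two sums dogbert17 quoted are in microseconds (the unit is not stated in the chat) and using the 34s and 26s run times mentioned earlier; the function name is hypothetical.

```python
def compile_share_of_gap(jit_us_with, jit_us_without, run_s_with, run_s_without):
    """Return extra JIT compile time in seconds and its share of the runtime gap."""
    extra_compile_s = (jit_us_with - jit_us_without) / 1_000_000
    runtime_gap_s = run_s_with - run_s_without
    return extra_compile_s, extra_compile_s / runtime_gap_s

# Sums assumed to be in microseconds; run times in seconds, as quoted in the chat.
extra, share = compile_share_of_gap(133756, 42342, 34.0, 26.0)
print(f"extra compile time: {extra:.3f}s, share of runtime gap: {share:.1%}")
```

With those assumptions the extra compilation time is under a tenth of a second against a gap of about eight seconds, so it accounts for only around one percent of the difference.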
dogbert17 and as jnthnwrthngtn wrote above, the profiles look remarkably similar 13:34
MasterDuke what about perf, does it show any noticeable differences? 13:35
dogbert17 MasterDuke: strange that you couldn't repro though
MasterDuke it does look like i'm seeing a difference now, it's just pretty small. ~24s with expr jit, ~22.5 without 13:38
wild thought, but what if you clear out your precomp directory? i just had to do that to fix a problem after i tested building nqp/rakudo with the expr jit disabled 13:41
jnthnwrthngtn This is a bit odd: if I make it 400 rather than 500 then the difference is pretty small 13:44
dogbert17 if I run the program normally I get many runs taking 27s (more or less the same as with exprjit disabled) but all of a sudden runtime jumps to 34s 13:45
what could cause the runtime to differ so much between executions
and no, my system isn't loaded
jnthnwrthngtn m: say 6.085 / 5.228
camelia 1.163925
jnthnwrthngtn m: say 29.2 / 22.3 13:46
camelia 1.309417
jnthnwrthngtn m: say 1.070 / 0.837
camelia 1.278375
brrt that is very odd yes 13:47
MasterDuke dogbert17: what if you disable hash randomization and/or run with spesh blocking?
dogbert17 MasterDuke: let me try with spesh blocking 13:48
with MVM_SPESH_BLOCKING=1 all runs, 10 atm, take 28s 13:54
is it just a coincidence 13:55
MasterDuke i'm seeing about the same 1.5s difference with MVM_SPESH_BLOCKING=1 13:57
and still the same if i disable hash randomization 14:04
jnthnwrthngtn Did a callgrind run; 88,387,766,589 Ir with expr JIT, 81,491,356,897 without 14:06
dogbert17 ha, perf top shows that when a run is slow MVM_fixed_size_alloc is on top of the chart, when the program suddenly runs fast it's in like third or fourth place 14:07
jnthnwrthngtn: about ten percent difference
nah; i was mistaken MVM_fixed_size_alloc is always on top regardless 14:08
MasterDuke so now we just dump all generated machine code for with/without and compare, should just take a min or two, right? 14:11
jnthnwrthngtn The callgrind output is a bit odd 14:12
It shows 53.45% under MVM_jit_code_enter with the expr JIT and 71.55% without it 14:13
And oddly 35 million calls to MVM_frame_dispatch with it, 41 million calls without? 14:15
dogbert17 how is that possible? 14:16
jnthnwrthngtn 40 million calls to dispatch_monomorphic with expr JIT, only 21 million without
brrt that is indeed very, very odd
jnthnwrthngtn That last one is...wat
4285 calls to deopt_one with, 9,128 without 14:17
None of this makes ense
MasterDuke would it be any easier to debug this before the new-disp merge?
jnthnwrthngtn *sense
Other question: did this discrepancy exist before the new-disp merge?
MasterDuke i'm pretty sure we were talking about it before the merge. don't remember if everybody was on the branch though 14:18
who has a 2021.09 lying around...
dogbert17 jnthnwrthngtn: I believe that it did
although I'm not 100% certain 14:19
MasterDuke shareable6: 2021.09 14:20
shareable6 MasterDuke, whateverable.6lang.org/2021.09
MasterDuke so it's much slower overall with ^^^, and the numbers vary quite a bit 14:30
but with expr jit the numbers were consistently ~42s. without had greater variation, as low as 33s once, but usually ~40s 14:32
dogbert17 this is so bizarre 14:42
MasterDuke www.youtube.com/watch?v=C2cMG33mWVY 14:43
brrt :-D 15:08
timo oh, could y'all try commenting out reprops from jit/graph.c 15:13
since the exprjit doesn't do devirt of reprops etc., removing that from the lego jit could get us an idea how much we save from that feature
i can't work right now, a cat is sitting right in front of the monitor making the bottom half pretty much unusable 15:14
MasterDuke just comment out the cases in consume_reprop()? 15:15
timo i'm not sure if that causes trouble 15:16
actually, there's one spot in consume_reprop where we can turn devirt off
by making sure the facts near the top don't give us the type
so just null it out or skip looking at the facts or something
Nicholas timo: the cat doesn't have some sort of icon you can use to minimise it? Or it does, but your mouse is scared of it? 15:17
timo cdn.discordapp.com/attachments/557...715493.jpg 15:19
MasterDuke ok, commented out all but the default case at the top of consume_reprop() and ran dogbert17's script with MVM_SPESH_BLOCKING=1 15:24
~25s with expr jit, ~27s without 15:25
timo praise the devirtualization
MasterDuke so without slows down by ~4-5s 15:26
jnthnwrthngtn I compared MVM_SPESH_INLINE_LOG output between the two of them and there are some curious differences there 15:27
For example:
-Can inline slip-all (1003) with bytecode size 180 into push-all (2091)
-Can inline push (4911) with bytecode size 28 into push-all (2091)
+Can NOT inline slip-all (1003) with bytecode size 416 into push-all (2091): no spesh candidate available and bytecode too large to produce an inline
+Can inline unspecialized push (4911) with bytecode size 124 into push-all (2091)
Notice how the dependent things aren't specialized yet in the second case
timo hm, max stack depth getting updated at unlucky spots during deep recursion? 15:28
jnthnwrthngtn Maybe yes, given that's the sort order 15:29
Turning on the spesh log seems to hide the issue though 15:30
timo does spesh blocking help for that particular part of the issue? 15:31
jnthnwrthngtn It gets me a much smaller difference 15:33
Which is perhaps the reprops one you just mentioned?
But is much smaller in magnitude than the whole difference
So it seems the repr ops thing is one part of it
But also that the spesh thread working for longer causes longer periods where we don't record stats, in turn leading to instability 15:35
timo true 15:36
jnthnwrthngtn I note that this code probably does gather/take and wonder if that makes issues more likely 15:38
I wonder what'd happen if we did something like blocking in normal execution, except only do it when we've run out of log buffers 15:39
jnthnwrthngtn So we get concurrent specialization and execution to a point 15:39
But stop and wait if we get too ahead 15:40
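The idea of running concurrently but waiting once the log buffers run out is essentially a bounded buffer with backpressure. The following is an illustrative model only, not MoarVM's implementation: the names `run_model`, `max_in_flight`, and `block_when_full` are hypothetical, and the sleep merely stands in for specialization work. It contrasts blocking when the buffer is full with carrying on and losing the data.

```python
import queue
import threading
import time

def run_model(total_buffers, max_in_flight, block_when_full):
    """Hypothetical model: an executing thread hands log buffers to a slower worker."""
    in_flight = queue.Queue(maxsize=max_in_flight)
    processed = []

    def worker():
        while True:
            buf = in_flight.get()
            if buf is None:        # sentinel: the producer has finished
                return
            time.sleep(0.001)      # stand-in for time spent specializing
            processed.append(buf)

    thread = threading.Thread(target=worker)
    thread.start()
    dropped = 0
    for buf in range(total_buffers):
        if block_when_full:
            in_flight.put(buf)             # wait until the worker catches up
        else:
            try:
                in_flight.put_nowait(buf)  # keep running, lose this buffer
            except queue.Full:
                dropped += 1
    in_flight.put(None)
    thread.join()
    return len(processed), dropped
```

Calling `run_model(50, 2, True)` delivers every buffer at the cost of stalling the producer, while `run_model(50, 2, False)` never stalls but may drop buffers, which mirrors the trade-off being discussed.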
Hm, quick impl of that fixes it 15:44
MasterDuke fixes it == you don't see a speedup disabling expr jit? 15:46
jnthnwrthngtn Uhh...I thought so but in fact it only makes it less likely, so there must be something about log boundary handling that makes it interesting. 15:52
oops, gotta go for lesson, bbiab 16:01
nine Ok, got NativeCall callbacks up and running :) 17:42
lizmat whee!
nine++ 17:43
brrt \o/ 17:49
Nicholas "dispatch all the things" 17:51
brrt so, if I get it correctly, the current hypothesis is that it's reprops which are slow in expr JIT 17:58
... did we by any chance have an optimization there, that we don't have in the expr jit
timotimo: do I recall correctly that you had a devirtualization for reprops in the lego jit but not in the expr jit? 17:59
timo: ^
MasterDuke well, lego jit does stuff like github.com/MoarVM/MoarVM/blob/mast...#L799-L856 but the template for atpos_i is just github.com/MoarVM/MoarVM/blob/mast...1041-L1050 18:02
timo correct, the exprjit only has a tiny start of devirt in a branch, but it's really just a new repr method that the jit calls 18:05
nine Ok, fixed the segfault in t/04-nativecall/00-misc.t which actually wasn't because of the dispatcher but was a pre-existing issue with cloned native subs and serialization. Of course no idea why this hasn't been an issue so far. 18:13
MasterDuke m: my int @a = (^100); my int $b; for ^10_000_000 -> int $i { $b = @a[$i % 64] }; say now - INIT now; say $b 18:29
camelia 0.409680642
MasterDuke m: my int @a = (^100); my int $b; my int $i = (^100).pick; say $i; for ^10_000_000 { $b = @a[$i] }; say now - INIT now; say $b
camelia 52
nine jnthnwrthngtn: I think there's a bug in pass-decontainerized: my $track-arg := nqp::dispatch('boot-syscall', 'dispatcher-track-arg', $args, $i); runs only if in the first run the arg is in a Scalar container. But what if it is not and instead is in one in a following run? Then no guard would trigger and we wouldn't run the dispatcher again and wouldn't decontainerize.
MasterDuke ^^^ seems very counterintuitive
huh. postcircumfix:<[]> is the third most expensive for both according to a profile, and has essentially the same time. but <unit> is twice as long for the version with .pick and <anon> is also longer (both are 1 and 2 when sorted by exclusive time) 18:35
ha. mod version enters 19k frames, but the pick version enters 50m 18:38
nine jnthnwrthngtn: also it's missing a nqp::dispatch('boot-syscall', 'dispatcher-guard-type', $track-arg); in any case! 18:42
Sadly just restoring tc->stack_top after a callback doesn't seem to be enough. 18:46
rakudo: src/core/callstack.c:472: MVM_callstack_unwind_frame: Assertion `(char *)tc->stack_top < tc->stack_current_region->alloc' failed. 18:47
brrt then I think that's a direction to investigate 19:05
MasterDuke ha 19:21
m: my int @a = (^100); my int $b; my Int $i = (^100).pick; say $i; for ^50_000_000 -> int $n { $b = @a[$i] }; say now - INIT now; say $b
camelia 88
MasterDuke m: my int @a = (^100); my int $b; my int $i = (^100).pick; say $i; for ^50_000_000 -> int $n { $b = @a[$i] }; say now - INIT now; say $b
camelia 59
MasterDuke i noticed an `inline-preventing instruction: getlexref_i` in the spesh log of that ^^^ second version 19:23
m: my int @a = (^100); my int $b; my int $c; for ^50_000_000 -> int $i { $c = $i % 64; $b = @a[$c] }; say now - INIT now; say $b # and now we can make the mod version slow 19:24
camelia 8.875664986
MasterDuke really a dramatic difference
timo: weren't you talking recently about how to do better with lexrefs? 19:26
timo more like how we have to do better :P 19:54
jnthnwrthngtn nine: When one does track-attr, it implies both a type and concreteness guard on the thing we're reading from. 22:13
nine: Since it actually doesn't store attribute name/class handle, just an offset
If we add explicit guards they are deduplicated, but it's wasteful. 22:14
(Wasteful to make the two syscalls when the track-attr one does the same job anyway)
jnthnwrthngtn The experiment to make spesh only somewhat concurrent seems to be a failure: there's a (somewhat mitigatable) startup penalty, 10% spectest time penalty, a minor but negative effect across microbenchmarks...and to top it off, it doesn't even reliably fix the instability anyway. 23:29
A change to how stack depth is tracked provides a slightly improved chance of the triangle number script running in a better time with expr JIT enabled. Increasing the spesh buffer sizes has a bigger chance of doing that, but we still sometimes see the worse result. 23:32
However, the latter two make it clear that discrepancies in timing and log buffer send points between runs (likely aided by hash randomization) are a dominating factor. 23:34
The expr JIT probably does carry some "blame" (the repr op devirt missing), but it seems the primary factor is that it being enabled causes us to fill spesh log buffers, stop logging for a while, and end up with some problems as a result. 23:35
(Where the problems seem to be sub-optimal specialization, perhaps due to wrong ordering)
Sleep time, will poke at it a bit more tomorrow. 23:38