Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes. Set by lizmat on 24 May 2021. |
|||
00:02
reportable6 left
00:04
reportable6 joined
03:32
linkable6 left,
evalable6 left
06:02
reportable6 left
06:06
reportable6 joined
07:34
linkable6 joined
07:35
evalable6 joined
07:56
cognominal_ joined
08:00
cognominal left
|
|||
Nicholas | good *, #moarvm | 08:14 | |
MasterDuke | releasable6: status | 08:23 | |
releasable6 | MasterDuke, Next release in ≈15 days and ≈10 hours. 3 blockers. Changelog for this release was not started yet | ||
MasterDuke, Details: gist.github.com/5403529993f6bb901d...8fabfc4930 | |||
MasterDuke | the last two blockers might've already been fixed? | 08:25 | |
any objections to merging github.com/MoarVM/MoarVM/pull/1555 ? | |||
Nicholas | I'm not competent to review it, so I can't usefully comment. (But obviously, d'oh, I can't really object either. Which was your actual question) | 08:38 | |
MasterDuke | those commits have had quite a large number of spectests run, with no (new) problems. however, if people want to wait until after the release since we did already have the large new-disp merge that's fine | 08:42 | |
08:55
MasterDuke left
09:35
MasterDuke joined
|
|||
jnthnwrthngtn | moarning o/ | 09:41 | |
Nicholas | \o | ||
jnthnwrthngtn | MasterDuke: I think 15 days is plenty of time to shake out issues, and running with JIT disabled is a good way to see if any issues might relate to them. | 09:42 | |
MasterDuke: I assume you've done spectest with blocking + nodelay also? | |||
MasterDuke | no, but i can run that now | ||
jnthnwrthngtn | OK, do nqp and rakudo build and test with that; if no regressions in those, I'd say merge it. | 09:47 | |
MasterDuke | wow, i don't usually run full spectests with those. so much slower! | 10:05 | |
Geth | MoarVM/master: 16 commits pushed by (Daniel Green)++, Unknown++, MasterDuke17++ review: github.com/MoarVM/MoarVM/compare/6...33aef886e7 |
10:09 | |
MasterDuke | i guess probably a good time for nqp+rakudo bumps to help with any bisecting if needed | 10:11 | |
lizmat | shall I do the honours then? | ||
MasterDuke | sure | 10:12 | |
lizmat | 2021.09-624-ge733aef88 # wow, that's a high number of commits since the release :-) | ||
hmmm... not sure if it's something to do with my MBP, but test-t times appear to have almost doubled for me? | 10:34 | ||
jnthnwrthngtn | lizmat: Hm, can you isolate it to a particular change? | 10:36 | |
lizmat | it was a few days ago since I last did it... :-( | ||
I thought: let's run it again, see if MasterDuke's changes helped | |||
MasterDuke | there haven't been all that many changes after the new-disp merge, right? so mine probably caused it? | 10:39 | |
jnthnwrthngtn | oh, gah, I was about to say "I don't see much change" but was running the MQTT test instead of test-t | 10:40 | |
lizmat | MasterDuke: I'm not sure | ||
please let someone else confirm my numbers | |||
it could well be something on my machine... | |||
seems I have a Spotlight indexing run atm | 10:42 | ||
will check again in 30 mins | 10:43 | ||
yeah, it was something local: | 11:22 | ||
1.105 as a new lowest for me | |||
sorry for the noise | 11:23 | ||
11:30
sena_kun joined
|
|||
MasterDuke | does anybody have any idea how to diagnose/debug why the expr jit currently can make things slower? | 11:43 | |
lizmat | what was the way to disable it again? | 11:46 | |
MasterDuke | MVM_JIT_EXPR_DISABLE=1 | ||
but, uh, i now get a segv in that mqtt test if i disable it | 11:47 | ||
Thread 1 "raku" received signal SIGSEGV, Segmentation fault. | |||
0x00007ffff78db71b in compose (tc=0x55555555a110, st=0x48000000c8ec8148, info_hash=0x7fffefa0dde8) at src/6model/reprs/P6opaque.c:691 | |||
691 if (st->REPR_data) | |||
lizmat | lowest test-t with expr jit disabled: 1.043 | 11:49 | |
m: say 1.105 / 1.043 | |||
camelia | 1.059444 | ||
lizmat | so 5% faster ? | ||
MasterDuke | and i just jitted newtype, newmixintype, and composetype (which calls compose in the emit.dasc implementation i added) | ||
think i see the problem | 11:51 | ||
lizmat | ah? | ||
Geth | MoarVM: a6ff2c031b | (Daniel Green)++ | src/jit/x64/emit.dasc Fix segfault in lego jit of composetype FUNCTION is aliased with TMP5, so TMP5 was being overwritten and that meant we were getting the wrong STABLE later. |
11:56 | |
12:02
reportable6 left
12:03
reportable6 joined
|
|||
lizmat | MasterDuke: another bump warranted ? | 12:03 | |
MasterDuke | it's unlikely that people are running with the expr jit disabled (the template for composetype is fine), so i wouldn't say it's vital, but it couldn't hurt | 12:05 | |
jnthnwrthngtn | MasterDuke: It'd help to figure out if the slowdown we see is either a) because the machine code produced is worse, or b) because we spend more time producing said machine code, and so spend more time interpreting | ||
If it's b) then we'd expect to see the difference fade away by increasing the amount of time the benchmark runs for. | |||
lizmat | is it easy to switch off the actual deployment of machine code ? | ||
to find out how much overhead it is | 12:06 | ||
jnthnwrthngtn | lizmat: Don't know of an easy way. We can probably somewhat see the effect in profiles of MoarVM though (by looking at functions involved in the expr JIT) | ||
The spesh log also has times taken to JIT things. | 12:07 | ||
We could grep those out and sum them | |||
lizmat | ah... but is that all of jitting, or just the expr jit ? | 12:08 | |
*jit | |||
jnthnwrthngtn | All | 12:09 | |
But you could still compare the numbers with it enabled and disabled | |||
MasterDuke | just built everything and ran all tests with the expr jit disabled, no problems | 12:10 | |
jnthnwrthngtn | I can believe there's a situation where the machine code produced is worse, but taking longer to produce the machine code in the first place is worth investigating. | ||
Analyzing what's going on if code quality is worse will be much harder, so it'd be better to not do that if it's not really to blame. | 12:11 | ||
bbi10 | 12:12 | ||
MasterDuke | 637077us total for with the expr jit | 12:19 | |
344700us total for without the expr jit | |||
from a spesh log of running the mqtt test | 12:20 | ||
874 instances of 'JIT was successful and compilation took' with the expr jit | 12:22 | ||
870 instances without the expr jit | |||
longest individual time with the expr jit was 26658us | 12:23 | ||
longest individual time without the expr jit was 18840us | |||
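The "grep those out and sum them" step can be sketched in Python. This is a minimal sketch, not the script anyone in the channel actually used: the line shape is assumed from the two phrases quoted in this log ('JIT was successful and compilation took' and 'JIT was not successful and compilation took 123us'), and the name summarize_jit_times is hypothetical.

```python
import re

# Assumed line shapes, taken only from the phrases quoted in this log:
#   "JIT was successful and compilation took 123us"
#   "JIT was not successful and compilation took 123us"
JIT_LINE = re.compile(r"JIT was (not )?successful and compilation took (\d+)us")


def summarize_jit_times(lines):
    """Count JIT attempts and add up their compilation times in microseconds.

    The total covers successful and unsuccessful attempts alike; whether the
    totals quoted in the channel did the same is not stated.
    """
    stats = {"successful": 0, "unsuccessful": 0, "total_us": 0, "longest_us": 0}
    for line in lines:
        match = JIT_LINE.search(line)
        if match is None:
            continue  # not a JIT timing line
        took = int(match.group(2))
        key = "unsuccessful" if match.group(1) else "successful"
        stats[key] += 1
        stats["total_us"] += took
        stats["longest_us"] = max(stats["longest_us"], took)
    return stats
```

Passing an open spesh log file (any iterable of lines) once for a run with the expr JIT and once for a run without it gives the two sets of numbers to compare.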
dogbert17 | I have a program which runs in 34s, without the expr-jit it's 26s | ||
lizmat | test-t on a 20x larger files shows with / without expr jit *ENABLED* as: 15.020 / 14.447 | 12:25 | |
so even on a longer running process, not using expr jit is faster | |||
MasterDuke | dogbert17: that would seem to indicate bad code being generated. can you check the compilation times in spesh logs | ||
lizmat | which to me indicates the generated code is not an advantage? | ||
MasterDuke | the routine that took the longest to compile with the expr jit was 'lexical_vars_to_locals' | 12:27 | |
316 BBs, Frame size: 9180 bytes (1568 from inlined frames), Specialization took 39093us (total 71578us), Bytecode size: 43270 byte | |||
jnthnwrthngtn | m: say 15.020 / 14.447 | 12:39 | |
camelia | 1.039662 | ||
jnthnwrthngtn | So around 4% rather than 5% after some time, so we could maybe interpret that as "compilation time is a factor but not the dominating one" | 12:40 | |
dogbert17: That's a really interesting case. Are there any indications in a spesh log of JIT being unsuccessful? | |||
Or alternatively can profile and see percent JITted or not | 12:41 | ||
lizmat: Percent JITted in a comparative profile of test-t is also interesting, also any difference in deopt rates. | 12:44 | ||
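The reasoning about case a) versus case b) can be checked with a little arithmetic on the test-t timings quoted above. This is only a sketch: the function name is hypothetical, and the "fixed overhead" prediction assumes that compilation cost would stay the same when the input grows 20x, which the log does not establish.

```python
def slowdown_percent(with_expr_jit, without_expr_jit):
    """Percent by which the run with the expr JIT is slower than the run without."""
    return (with_expr_jit / without_expr_jit - 1.0) * 100.0


# Timings in seconds, as reported in the channel.
short_with, short_without = 1.105, 1.043    # standard test-t
long_with, long_without = 15.020, 14.447    # 20x larger file

short_pct = slowdown_percent(short_with, short_without)
long_pct = slowdown_percent(long_with, long_without)

# Absolute extra time spent when the expr JIT is enabled.
short_gap = short_with - short_without
long_gap = long_with - long_without

# If the gap were a one-off compilation cost (case b), it would stay at
# short_gap and the long run would only be this much slower:
fixed_overhead_pct = slowdown_percent(long_without + short_gap, long_without)

print(f"short run: {short_pct:.1f}% slower, gap {short_gap:.3f}s")
print(f"long run:  {long_pct:.1f}% slower, gap {long_gap:.3f}s")
print(f"long run if overhead were fixed: {fixed_overhead_pct:.1f}% slower")
```

The relative slowdown shrinks only a little while the absolute gap grows with the run length, which fits the reading that compilation time is a factor but not the dominating one.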
lizmat | feels all within noise levels for the standard test-t run | 12:49 | |
only significant difference I see is 2 On Stack Replacements with the expr jit enabled, and none with it disabled | 12:50 | ||
jnthnwrthngtn | Hm, and deopts? | ||
lizmat | both 7 deopts | ||
and no global deopts | |||
jnthnwrthngtn | Curious. | ||
No smoking gun there, then. | 12:51 | ||
Although the extra OSRs are a little curious | |||
lizmat | oddly enough, disabling the expr jit results in *more* jit compiled frames | ||
98.19% with disabled, 98.08% enabled | 12:52 | ||
but that feels like noise | |||
jnthnwrthngtn | That's frames in the dynamic sense, not the static one, so it's showing that we spend more time before the JITted version is available | 12:54 | |
I'd expect it's repeatedly observable rather than noise, but it's also a small effect. | 12:55 | ||
MasterDuke | dogbert17: can you share that program? | ||
dogbert17 | jnthnwrthngtn: I can check | ||
jnthnwrthngtn | So is another hint we're looking at a machine code quality issue | 12:56 | |
(I asked about deopts in case there's a bug in the expr JIT guard generation that sees us deopt in cases we should not.) | |||
(But no evidence so far.) | |||
MasterDuke | fwiw, 39 'JIT was not successful' with the expr jit, 47 without (still for the mqtt test) | 12:57 | |
dogbert17 | Masterduke, jnthnwrthngtn: since I'm a nice guy :) I'll share the code. gist.github.com/dogbert17/7099a67e...3b6ac0a08f | 12:58 | |
12:58
brrt joined
|
|||
dogbert17 | hello brrt | 13:01 | |
# [012] dispatch not compiled: op MVMDispOpcodeBindFailureToResumption NYI | 13:02 | ||
brrt | ohai dogbert17 | 13:03 | |
dogbert17 | we're trying to figure out why a program runs faster when the expr jit is turned off | ||
brrt | ah, that's... a good thing | 13:05 | |
and the answer is 'we don't have a benchmarking suite' | |||
or if we do, we don't have a systematic way to run it | 13:06 | ||
dogbert17 | this is a bizarre case, from 34s with exprjit to 26s without | ||
in case you're intrigued the src gist is about ten lines up in the irc log | 13:08 | ||
brrt | I am | ||
(I am also chronically short in time) | |||
jnthnwrthngtn | dogbert17: Thanks for the script; I can reproduce the difference too (29.9 with, 22.3 without) | 13:11 | |
dogbert17 | cool but now I'm getting envious of your hardware :) | ||
brrt | hmmm... if it were bad code generation, that much worse? | 13:13 | |
there's an obvious fix though | |||
disable the expression JIT :-) | |||
imo the register allocator is suspect... | 13:15 | ||
and consider; the expr jit needs to be 'clever' about function calls, the lego jit does not | |||
13:17
sena_kun left
|
|||
MasterDuke | interesting, i can't repro the time difference | 13:24 | |
jnthnwrthngtn | Uhhh...did I mess something up or does --profile make the difference vanish? | ||
Or at least produce identical profiles | 13:25 | ||
dogbert17 | FWIW, there are two 'JIT was not successful and compilation took 123us' when the exprjit is enabled but three such messages when it's disabled | 13:28 | |
MasterDuke | what about sum of jit compilation times? | 13:29 | |
dogbert17 | normal, i.e. with exprjit I get 133756 and without 42342 | 13:30 | |
MasterDuke | so almost triple, but the actual time it took wouldn't explain the runtime difference | 13:32 | |
dogbert17 | and as jnthnwrthngtn wrote above, the profiles look remarkably similar | 13:34 | |
MasterDuke | what about perf, does it show any noticeable differences? | 13:35 | |
dogbert17 | MasterDuke: strange that you couldn't repro though | ||
MasterDuke | it does look like i'm seeing a difference now, it's just pretty small. ~24s with expr jit, ~22.5 without | 13:38 | |
wild thought, but what if you clear out your precomp directory? i just had to do that to fix a problem after i tested building nqp/rakudo with the expr jit disabled | 13:41 | ||
jnthnwrthngtn | This is a bit odd: if I make it 400 rather than 500 then the difference is pretty small | 13:44 | |
dogbert17 | if I run the program normally I get many runs taking 27s (more or less the same as with exprjit disabled) but all of a sudden runtime jumps to 34s | 13:45 | |
what could cause the runtime to differ so much between executions | |||
and no, my system isn't loaded | |||
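Because the run time jumps between executions, a single timing per configuration can mislead. A minimal Python sketch of a repeat-and-summarize harness follows. The environment variables MVM_JIT_EXPR_DISABLE and MVM_SPESH_BLOCKING are the ones named in this log; the function names and the script path in the usage comment are hypothetical.

```python
import os
import statistics
import subprocess
import time


def env_with(overrides):
    """Copy the current environment and apply overrides,
    e.g. {"MVM_JIT_EXPR_DISABLE": "1"} or {"MVM_SPESH_BLOCKING": "1"}."""
    env = dict(os.environ)
    env.update(overrides)
    return env


def time_runs(command, overrides, runs=10):
    """Run `command` `runs` times; return the wall-clock seconds of each run."""
    env = env_with(overrides)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, env=env, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Min, median and max expose outliers that one run or a mean would hide."""
    return {
        "min": min(times),
        "median": statistics.median(times),
        "max": max(times),
    }


# Hypothetical usage (script.raku stands in for the program under test):
#   summarize(time_runs(["raku", "script.raku"], {}))
#   summarize(time_runs(["raku", "script.raku"], {"MVM_JIT_EXPR_DISABLE": "1"}))
```

Comparing the median and the max for each configuration shows whether a 34s result is the typical case or an occasional outlier.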
jnthnwrthngtn | m: say 6.085 / 5.228 | ||
camelia | 1.163925 | ||
jnthnwrthngtn | m: say 29.2 / 22.3 | 13:46 | |
camelia | 1.309417 | ||
jnthnwrthngtn | m: say 1.070 / 0.837 | ||
camelia | 1.278375 | ||
brrt | that is very odd yes | 13:47 | |
MasterDuke | dogbert17: what if you disable hash randomization and/or run with spesh blocking? | ||
dogbert17 | MasterDuke: let me try with spesh blocking | 13:48 | |
with MVM_SPESH_BLOCKING=1 all runs, 10 atm, takes 28s | 13:54 | ||
is it just a coincidence | 13:55 | ||
MasterDuke | i'm seeing about the same 1.5s difference with MVM_SPESH_BLOCKING=1 | 13:57 | |
and still the same if i disable hash randomization | 14:04 | ||
jnthnwrthngtn | Did a callgrind run; 88,387,766,589 IR with expr JIT, 81,491,356,897 without | 14:06 | |
dogbert17 | ha, perf top shows that when a run is slow MVM_fixed_size_alloc is on top of the chart, when the program suddenly runs fast it's in like third or fourth place | 14:07 | |
jnthnwrthngtn: about ten percent difference | |||
nah; i was mistaken MVM_fixed_size_alloc is always on top regardless | 14:08 | ||
MasterDuke | so now we just dump all generated machine code for with/without and compare, should just take a min or two, right? | 14:11 | |
jnthnwrthngtn | The callgrind output is a bit odd | 14:12 | |
It shows 53.45% under MVM_jit_code_enter with the expr JIT and 71.55 without it | 14:13 | ||
And oddly 35 million calls to MVM_frame_dispatch with it, 41 million calls without? | 14:15 | ||
! | |||
dogbert17 | how is that possible? | 14:16 | |
jnthnwrthngtn | 40 million calls to dispatch_monomorphic with expr JIT, only 21 million without | ||
brrt | that is indeed very, very odd | ||
jnthnwrthngtn | That last one is...wat | ||
4285 calls to deopt_one with, 9,128 without | 14:17 | ||
None of this makes ense | |||
MasterDuke | would it be any easier to debug this before the new-disp merge? | ||
jnthnwrthngtn | *sense | ||
Other question: did this discrepancy exist before the new-disp merge? | |||
MasterDuke | i'm pretty sure we were talking about it before the merge. don't remember if everybody was on the branch though | 14:18 | |
who has a 2021.09 lying around... | |||
dogbert17 | jnthnwrthngtn: I believe that it did | ||
although I'm not 100% certain | 14:19 | ||
MasterDuke | shareable6: 2021.09 | 14:20 | |
shareable6 | MasterDuke, whateverable.6lang.org/2021.09 | ||
MasterDuke | so it's much slower overall with ^^^, and the numbers vary quite a bit | 14:30 | |
but with expr jit the numbers were consistently ~42s. without had greater variation, as low as 33s once, but usually ~40s | 14:32 | ||
dogbert17 | this is so bizarre | 14:42 | |
MasterDuke | www.youtube.com/watch?v=C2cMG33mWVY | 14:43 | |
brrt | :-D | 15:08 | |
timo | oh, could y'all try commenting out reprops from jit/graph.c | 15:13 | |
since the exprjit doesn't do devirt of reprops etc., removing that from the lego jit could get us an idea how much we save from that feature | |||
i can't work right now, a cat is sitting right in front of monitor making the bottom half pretty much unusable | 15:14 | ||
MasterDuke | just comment out the cases in consume_reprop()? | 15:15 | |
timo | i'm not sure if that causes trouble | 15:16 | |
actually, there's one spot in consume_reprop where we can turn devirt off | |||
by making sure the facts near the top don't give us the type | |||
so just null it out or skip looking at the facts or something | |||
Nicholas | timo: the cat doesn't have some sort of icon you can use to minimise it? Or it does, but your mouse is scared of it? | 15:17 | |
timo | cdn.discordapp.com/attachments/557...715493.jpg | 15:19 | |
MasterDuke | ok, commented out all but the default case at the top of consume_reprop() and ran dogbert17's script with MVM_SPESH_BLOCKING=1 | 15:24 | |
~25s with expr jit, ~27s without | 15:25 | ||
timo | praise the devirtualization | ||
MasterDuke | so without slows down by ~4-5s | 15:26 | |
jnthnwrthngtn | I compared MVM_SPESH_INLINE_LOG output between the two of them and there are some curious differences there | 15:27 | |
For example: | |||
-Can inline slip-all (1003) with bytecode size 180 into push-all (2091) | |||
-Can inline push (4911) with bytecode size 28 into push-all (2091) | |||
+Can NOT inline slip-all (1003) with bytecode size 416 into push-all (2091): no spesh candidate available and bytecode too large to produce an inline | |||
+Can inline unspecialized push (4911) with bytecode size 124 into push-all (2091) | |||
Notice how the dependent things aren't specialized yet in the second case | |||
timo | hm, max stack depth getting updated at unlucky spots during deep recursion? | 15:28 | |
jnthnwrthngtn | Maybe yes, given that's the sort order | 15:29 | |
Turning on the spesh log seems to hide the issue though | 15:30 | ||
timo | does spesh blocking help for that particular part of the issue? | 15:31 | |
jnthnwrthngtn | It gets me a much smaller difference | 15:33 | |
Which is perhaps the reprops one you just mentioned? | |||
But is much smaller in magnitude than the whole difference | |||
So it seems the repr ops thing is one part of it | |||
But also that the spesh thread working for longer causes longer periods where we don't record stats, in turn leading to instability | 15:35 | ||
timo | true | 15:36 | |
jnthnwrthngtn | I note that this code probably does gather/take and wonder if that makes issues more likely | 15:38 | |
I wonder what'd happen if we did something like blocking in normal execution, except only do it when we've run out of log buffers | 15:39 | ||
15:39
brrt left
|
|||
jnthnwrthngtn | So we get concurrent specialization and execution to a point | 15:39 | |
But stop and wait if we get too ahead | 15:40 | ||
Hm, quick impl of that fixes it | 15:44 | ||
MasterDuke | fixes it == you don't see a speedup disabling expr jit? | 15:46 | |
jnthnwrthngtn | Uhh...I thought so but in fact it only makes it less likely, so there must be something about log boundary handling that makes it interesting. | 15:52 | |
oops, gotta go for lesson, bbiab | 16:01 | ||
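The idea of running concurrently until the log buffers run out, then waiting, is a bounded producer/consumer hand-off. The following Python sketch is a generic model of the two policies and not MoarVM's implementation: the names are hypothetical, a queue slot stands in for a log buffer, and dropping an item stands in for the period where no stats are recorded.

```python
import queue
import threading


def run(n_buffers, pool_size, block_when_full):
    """Model an executing thread handing log buffers to a specializer thread.

    Returns how many buffers the specializer actually received.
    """
    pending = queue.Queue(maxsize=pool_size)  # the limited pool of buffers
    received = []

    def specializer():
        while True:
            item = pending.get()
            if item is None:  # sentinel: execution finished
                break
            received.append(item)

    worker = threading.Thread(target=specializer)
    worker.start()
    for i in range(n_buffers):
        if block_when_full:
            pending.put(i)  # stop and wait until a slot is free
        else:
            try:
                pending.put_nowait(i)
            except queue.Full:
                pass  # pool exhausted: this buffer's stats are lost
    pending.put(None)
    worker.join()
    return len(received)
```

With block_when_full=True every buffer reaches the specializer at the cost of stalling the executing thread. With block_when_full=False the executing thread never waits, but the number of buffers delivered depends on thread timing and can differ from run to run.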
16:20
ilogger2 left
16:30
brrt joined
16:32
ilogger2 joined
17:30
rypervenche left
|
|||
nine | Ok, got NativeCall callbacks up and running :) | 17:42 | |
lizmat | whee! | ||
nine++ | 17:43 | ||
brrt | \o/ | 17:49 | |
Nicholas | "dispatch all the things" | 17:51 | |
17:53
squashable6 left
17:57
squashable6 joined
|
|||
brrt | so, if I get it correctly, the current hypothesis is that it's reprops which are slow in expr JIT | 17:58 | |
... did we by any chance have an optimization there, that we don't have in the expr jit | |||
timotimo: do I recall correctly that you had a devirtualization for reprops in the lego jit but not in the expr jit? | 17:59 | ||
timo: ^ | |||
MasterDuke | well, lego jit does stuff like github.com/MoarVM/MoarVM/blob/mast...#L799-L856 but the template for atpo_i is just github.com/MoarVM/MoarVM/blob/mast...1041-L1050 | 18:02 | |
18:02
reportable6 left
|
|||
timo | correct, the exprjit only has a tiny start of devirt in a branch, but it's really just a new repr method that the jit calls | 18:05 | |
nine | Ok, fixed the segfault in t/04-nativecall/00-misc.t which actually wasn't because of the dispatcher but was a pre-existing issue with cloned native subs and serialization. Of course no idea why this hasn't been an issue so far. | 18:13 | |
18:14
brrt left
|
|||
MasterDuke | m: my int @a = (^100); my int $b; for ^10_000_000 -> int $i { $b = @a[$i % 64] }; say now - INIT now; say $b | 18:29 | |
camelia | 0.409680642 63 |
||
MasterDuke | m: my int @a = (^100); my int $b; my int $i = (^100).pick; say $i; for ^10_000_000 { $b = @a[$i] }; say now - INIT now; say $b | ||
camelia | 52 1.77087144 52 |
||
nine | jnthnwrthngtn: I think there's a bug in pass-decontainerized: my $track-arg := nqp::dispatch('boot-syscall', 'dispatcher-track-arg', $args, $i); runs only if in the first run the arg is in a Scalar container. But what if it is not and instead is in one in a following run? Then no guard would trigger and we wouldn't run the dispatcher again and wouldn't decontainerize. | ||
MasterDuke | ^^^ seems very counterintuitive | ||
huh. postcircumfix:<[]> is the third most expensive for both according to a profile, and has essentially the same time. but <unit> is twice as long for the version with .pick and <anon> is also longer (both are 1 and 2 when sorted by exclusive time) | 18:35 | ||
ha. mod version enters 19k frames, but the pick version enters 50m | 18:38 | ||
nine | jnthnwrthngtn: also it's missing a nqp::dispatch('boot-syscall', 'dispatcher-guard-type', $track-arg); in any case! | 18:42 | |
Sadly just restoring tc->stack_top after a callback doesn't seem to be enough. | 18:46 | ||
rakudo: src/core/callstack.c:472: MVM_callstack_unwind_frame: Assertion `(char *)tc->stack_top < tc->stack_current_region->alloc' failed. | 18:47 | ||
camelia | ===SORRY!=== Error while compiling <tmp> Confused at <tmp>:1 ------> src/core/callstack.c:⏏472: MVM_callstack_unwind_frame: Asserti expecting any of: colon pair |
||
19:04
brrt joined
19:05
reportable6 joined
|
|||
brrt | then I think that's a direction to investigate | 19:05 | |
MasterDuke | ha | 19:21 | |
m: my int @a = (^100); my int $b; my Int $i = (^100).pick; say $i; for ^50_000_000 -> int $n { $b = @a[$i] }; say now - INIT now; say $b | |||
camelia | 88 1.496132246 88 |
||
MasterDuke | m: my int @a = (^100); my int $b; my int $i = (^100).pick; say $i; for ^50_000_000 -> int $n { $b = @a[$i] }; say now - INIT now; say $b | ||
camelia | 59 8.831049123 59 |
||
MasterDuke | i noticed a `inline-preventing instruction: getlexref_i` in the spesh log of that ^^^ second version | 19:23 | |
m: my int @a = (^100); my int $b; my int $c; for ^50_000_000 -> int $i { $c = $i % 64; $b = @a[$c] }; say now - INIT now; say $b # and now we can make the mod version slow | 19:24 | ||
camelia | 8.875664986 63 |
19:25 | |
MasterDuke | really a dramatic difference | ||
timo: weren't you talking recently about how to do better with lexrefs? | 19:26 | ||
19:33
brrt left
|
|||
timo | more like how we have to do better :P | 19:54 | |
21:24
leont left,
tbrowder left,
SmokeMachine left
21:26
SmokeMachine joined
21:28
tbrowder joined
21:38
leont joined
|
|||
jnthnwrthngtn | nine: When one does track-attr, it implies both a type and concreteness guard on the thing we're reading from. | 22:13 | |
nine: Since it actually doesn't store attribute name/class handle, just an offset | |||
If we add explicit guards they are deduplicated, but it's wasteful. | 22:14 | ||
(Wasteful to make the two syscalls when the track-attr one does the same job anyway) | |||
23:29
squashable6 left
|
|||
jnthnwrthngtn | The experiment to make spesh only somewhat concurrent seems to be a failure: there's a (somewhat mitigatable) startup penalty, 10% spectest time penalty, a minor but negative effect across microbenchmarks...and to top it off, it doesn't even reliably fix the instability anyway. | 23:29 | |
A change to how stack depth is tracked provides a slightly improved chance of the triangle number script running in a better time with expr JIT enabled. Increasing the spesh buffer sizes has a bigger chance of doing that, but we still sometimes see the worse result. | 23:32 | ||
However, the latter two make it clear that discrepancies in timing and log buffer send points between runs (likely aided by hash randomization) are a dominating factor. | 23:34 | ||
The expr JIT probably does carry some "blame" (the repr op devirt missing), but it seems the primary factor is that it being enabled causes us to fill spesh log buffers, stop logging for a while, and end up with some problems as a result. | 23:35 | ||
(Where the problems seem to be sub-optimal specialization, perhaps due to wrong ordering) | |||
Sleep time, will poke at it a bit more tomorrow. | 23:38 |