Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
00:02 reportable6 left 01:04 reportable6 joined
japhb m: my int @a = (^100); my int $b; my int $i = (^100).pick; say $i; for ^50_000_000 -> int $n { $b = @a.AT-POS($i) }; say now - INIT now; say $b 01:16
camelia 97
16.736326012
97
japhb ???
That's almost exactly twice as slow, which is very suspicious 01:17
moon-child twice as slow as what?
japhb moon-child: MasterDuke posted the same thing 6 hours ago, except with `$b = @a[$i]`. That version was itself several times slower with $i being a native int rather than an Int, thus tripping over failed optimizations of lexical refs. 01:19
Combined, both problems lead to my version being over 11x slower than the faster version. 01:20
moon-child oh; I need to read more scrollback
01:32 squashable6 joined 02:36 bloatable6 left, evalable6 left, committable6 left, squashable6 left, shareable6 left, quotable6 left, greppable6 left, coverable6 left, nativecallable6 left, bisectable6 left, benchable6 left, statisfiable6 left, linkable6 left, sourceable6 left, unicodable6 left, reportable6 left, tellable6 left, notable6 left, releasable6 left, statisfiable6 joined 02:37 notable6 joined, releasable6 joined, greppable6 joined, evalable6 joined, bloatable6 joined, shareable6 joined 02:38 nativecallable6 joined, committable6 joined, linkable6 joined, sourceable6 joined 02:39 squashable6 joined, quotable6 joined 03:36 bisectable6 joined 03:38 tellable6 joined, benchable6 joined 04:36 unicodable6 joined 04:37 coverable6 joined
japhb I've updated my draft MoarVM JIT AArch64 port scoping doc to address all the feedback I've gotten so far: gist.github.com/japhb/0c2108affd31...b3eb6a7947 05:05
It has several new sections (and new subsections of existing sections).
moon-child 'even x64 may trap on unaligned SIMD access' there are unaligned versions of most simd ops 05:09
(so, not a blocker; just a nice-to-have in the same respects as regular alignment improvements) 05:10
japhb moon-child: Do those unaligned variants require more recent updates to x64 SIMD? At least one reviewer considered this problem a blocker, so I'm curious if it's just a difference in expected ISA extensions .... 05:34
moon-child doesn't look like it. E.g. movapd and movupd are both sse2 (amd64 baseline) 05:44
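The distinction, as a minimal sketch in SSE2 intrinsic terms (standard emmintrin.h intrinsics, nothing MoarVM-specific):

    #include <emmintrin.h>   /* SSE2, part of the amd64 baseline */

    /* Compiles to movapd: requires p to be 16-byte aligned, traps otherwise */
    __m128d load_aligned(const double *p)   { return _mm_load_pd(p); }

    /* Compiles to movupd: any alignment accepted, same SSE2 baseline ISA */
    __m128d load_unaligned(const double *p) { return _mm_loadu_pd(p); }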
Nicholas good *, * 05:46
I didn't really mean to cause an accidental digression on SIMD. The reason I'd mentioned it was not in the context of x86_64 JITs, but really more C code. Don't assume that C compilers on x86_64 will keep compiling code that relies on lax alignment (i.e. undefined behaviour) 05:48
it's a trap! :-)
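A minimal sketch of the trap in plain C: the cast form is undefined behaviour even where the hardware tolerates unaligned loads, while the memcpy form is well-defined and compiles to the same single load on x86_64:

    #include <stdint.h>
    #include <string.h>

    /* Undefined behaviour: the cast promises 8-byte alignment the buffer
     * may lack; optimizers may later assume it, e.g. when vectorizing. */
    uint64_t read_u64_ub(const unsigned char *p) {
        return *(const uint64_t *)p;
    }

    /* Well-defined at any alignment; compilers emit a single mov for it */
    uint64_t read_u64_ok(const unsigned char *p) {
        uint64_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }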
06:00 linkable6 left, evalable6 left
nine jnthnwrthngtn: but pass-decontainerized does not do a track-attr, but a track-arg! 06:12
Well both, but track-arg comes first and the value's type is used in the conditional
Nicholas japhb: looks good to me. I like how you've phrased some of the things we mentioned, and I can't see what to add/change
nine jnthnwrthngtn: ah, we're both right. Yes the track-attr does imply that guard, but we only do the track-attr if the value is of a certain type. If it's not in the first run, we never do the track-attr, thus no guard gets installed 06:14
japhb moon-child: Ah OK, thank you
Nicholas: Excellent, thank you.
Anyone have any objections to adding the current porting scope draft to github.com/MoarVM/MoarVM/tree/master/docs/jit ? Seems like this research ought to be captured for posterity, no matter who actually picks up the porting project. 06:18
Nicholas I don't, but I guess that brrt ought to check the draft first (but is that just "a PR and ask brrt to review it?") 06:21
japhb Nicholas: That's a decent point. 06:22
Geth MoarVM: japhb++ created pull request #1560:
Doc research on scoping for an AArch64 JIT port
06:34
japhb (Feel free to pile on as reviewers if y'all so desire; I've already added brrt at Nicholas's suggestion.) 06:36
07:01 linkable6 joined 07:04 patrickb joined 07:05 reportable6 joined 07:40 brrt joined
brrt good * #moarvm 07:57
Nicholas good *, brrt
patrickb o/ 08:05
08:57 brrt left 09:02 evalable6 joined
jnthnwrthngtn moarning o/ 10:00
Nicholas \o 10:01
jnthnwrthngtn nine: Ah, I guess the point you're making is that a given callsite may deal in both containerized and non-containerized args. 10:02
In the same positions
10:02 evalable6 left, linkable6 left
jnthnwrthngtn Which could happen, yes, although the majority of callsites will be monomorphic in everything including this, and a bunch that are polymorphic otherwise will still be monomorphic in this, so we're talking about a small number of cases 10:04
10:04 linkable6 joined
jnthnwrthngtn And then the question is whether having all the possible permutations of deconts stacked up at the callsite is wise. 10:04
10:05 evalable6 joined
jnthnwrthngtn Explicitly looking at whether the site is blowing up may be a more bullet-proof solution. 10:05
otoh, in the case of a non-multi dispatch we'd end up with type guards on values that'd not be there otherwise in many cases 10:06
on the third hand, those guards might be inserted to pick a linked specialization
In summary, worse might be better or better might be better :) 10:07
Nicholas jnthnwrthngtn: I think you meant "gripping hand". ie en.wikipedia.org/wiki/The_Gripping_Hand 10:10
10:35 brrt joined
jnthnwrthngtn Hm, I can't remember if I already had my second coffee or not...this isn't a good sign. 10:42
Geth MoarVM/master: 8 commits pushed by (Jonathan Worthington)++ 10:45
10:53 sena_kun joined
nine jnthnwrthngtn: I came across the situation with github.com/niner/Inline-Perl5/blob...5.pm6#L257 where the first caller passed a non-containerized value and the second caller a Scalar, and the value gets passed on to $!p5.p5_get_type, which then exploded 11:16
11:31 frost joined
nine So, back to square 1 with the stack_top assertions 11:34
jnthnwrthngtn nine: Oh...you re-used it for NativeCall 11:52
nine: Yeah, this isn't good re-use. NativeCall absolutely depends on it. The normal situation has it as an optimization. 11:53
nine I didn't re-use the sub, but copied and modified the code. Then I ran into this issue and figured that it'd probably affect the original sub as well
jnthnwrthngtn Yes, but their contexts are different
nine Ah, ok, that makes sense then.
jnthnwrthngtn Pre-coffee me earlier didn't catch on to your problem being what it does with NativeCall :)
nine In hindsight, I could have mentioned that :) 11:54
12:02 reportable6 left 12:03 reportable6 joined
nine jnthnwrthngtn: I think the most important takeaway from this is that, misunderstanding aside, I understand dispatchers well enough to be able to figure out such issues :) 12:06
jnthnwrthngtn bus-factor++ :) 12:10
12:16 ggoebel joined
jnthnwrthngtn Curiously, on the machine I'm working on today, the script dogbert17 posted yesterday is less time-variable than my home one...but it does sometimes take rather longer to run 12:39
From doing a few measurements it at first looked like simplifying the bytecode size threshold scheme yielded an improvement, but I wasn't sure, so I hacked up a script to do 100 runs and give me a histogram. Turns out that with that to look at, it's actually no help at all 12:40
dogbert17 I noticed that the code can run slowly even if MVM_JIT_EXPR_DISABLE=1. That doesn't happen very often though. 12:46
lizmat fwiw, I've noticed over the years that sometimes Raku just runs a lot slower, for the same program 12:47
I've always assumed some worst case in hashing
jnthnwrthngtn Yes, but the worst case in hashing is potentially so much worse, because it affects optimization decisions that in turn have a significant impact. 12:50
Anyway, it's clear from this that doing a lot of measurements and looking at the histogram is going to be important here, because it's easy to do a change and a handful of runs and think there's an improvement. 12:51
Whereas one can just get lucky/unlucky 12:52
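Such a harness can be small; a hedged sketch in C (not jnthn's actual script, which isn't shown; the "raku bench.raku" command, run count, and bucket width are placeholders):

    /* Run a benchmark repeatedly and print a wall-clock histogram. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        enum { RUNS = 100, BUCKETS = 20 };
        const double width = 0.5;                /* seconds per bucket */
        int hist[BUCKETS] = {0};
        for (int i = 0; i < RUNS; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (system("raku bench.raku > /dev/null") != 0)
                return 1;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double secs = (t1.tv_sec - t0.tv_sec)
                        + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            int b = (int)(secs / width);
            hist[b >= BUCKETS ? BUCKETS - 1 : b]++;
        }
        for (int b = 0; b < BUCKETS; b++) {
            printf("%5.1fs |", b * width);       /* bucket lower bound */
            for (int n = 0; n < hist[b]; n++) putchar('#');
            putchar('\n');
        }
        return 0;
    }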
nine Benchmarking is hard :/ 12:53
jnthnwrthngtn Indeed, especially with VMs
Which do complex opts based on sampling
MoarVM is in very good company in having problems in this area.
dl.acm.org/doi/pdf/10.1145/3133876 for anyone curious 12:54
But the punchline is 12:55
""Repeating our experiment on 3 different machines, we found that at most 43.5% of ?VM, benchmark? pairs consistently reach a steady state of peak performance."
This is how it looks in my office machine for the triangle numbers case: gist.github.com/jnthn/a3bfc7d0a32c...e46ee63706 12:57
(this is without any changes applied)
lizmat wow
nine That 9.2 is when MoarVM went for a coffee break? 12:58
jnthnwrthngtn That or just a couple of important missed inlinings. 12:59
Obtaining full spesh logs distorts the problem, but I did dump out inlinings when looking at this before writing the tool to do the histogram and noticed in the longest runs we missed inlining of slip-all, for example. 13:01
dogbert17 what can cause that to happen?
jnthnwrthngtn One experiment I did (added to gist) is what happens if we have larger spesh log buffers. That turns out not to help at all, and in fact seems to make things slightly worse 13:03
Specialization order is partly decided on by stack depth, so this can be involved. 13:05
(if the numbers are inaccurate for example) 13:06
I do wonder a bit if the fact that the stack model doesn't really account for continuations is involved
nine With a MVM_CALLSTACK_RECORD_NESTED_RUNLOOP record, Inline::Perl5 makes it through its tests as well! Well, except for a single test that ends up passing a Proxy to a native function 13:15
jnthnwrthngtn Yay :) 13:16
How is the performance at this point? I know it's not wired through to JIT and stuff yet.
13:20 psydroid left, AlexDaniel left
nine Can't really tell yet. csv-ip5xs.pl breaks if run with more than 1931 lines of input - even with spesh disabled 13:22
13:22 AlexDaniel joined
jnthnwrthngtn OK 13:22
13:22 psydroid joined
jnthnwrthngtn Hm, so... 13:22
gist.github.com/jnthn/a3bfc7d0a32c...-frame-txt
Turns out that maintaining stack depth in the sim stack frame rather than as a property of the sim stack itself does have a very noticeable effect: while still not eliminating the problem, it seems to increase the likelihood of reaching one of the lower timings 13:23
nine So some 1900 times the native sub returns a Pointer. But suddenly it returns Int instead. What the? 13:24
jnthnwrthngtn o.O
nine That'd be a classic spesh issue, except that I'm pretty sure I've spelt MVM_SPESH_DISABLE=1 right 13:25
Next guess is GC
dogbert17 missing root? 13:29
timo my guess is rr knows :P
dogbert17 haha 13:30
timo: btw, has the cat moved away from your monitor?
timo yes 13:33
Geth MoarVM/spesh-stability: a3b6e7bb34 | (Jonathan Worthington)++ | 2 files
Increase stack depth tracking accuracy

Stack depth is used in order to make decisions about what order to specialize things in. Since we sample the interpreted program, we can end up with missing data at some points; further, we update the statistics in batches, and buffer ends can be reached in all kinds of situations. Tracking the stack depth by keeping it on the frame, rather than as a global property, seems to lead to more accurate results, and thus more regularly reaching peak performance.
13:48
timo you may not like it, but this is what peak performance looks like 13:49
jnthnwrthngtn dogbert17: I'm curious if that branch makes any difference on your machine 13:50
nine Ooh...looks like it is GC, but not something mundane like missing rooting. Looks more like a destructor is doing a native call (with a callback even) at an inopportune time
jnthnwrthngtn o.O 13:53
nine OTOH removing the DESTROY methods doesn't actually change anything. So that call might be innocent after all 13:54
Nicholas timo: I'm confused by the context of "this is what peak performance looks like" and how it relates to the cat
jnthnwrthngtn It might relate to my commit message instead :P
timo we postpone destruction / finalizer calls to after garbage collection finishes don't we?
nine we do
timo that is supposed to make things like your first assumption not able to happen :D 13:56
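The usual shape of postponed finalization, as a generic sketch with invented names (not MoarVM's actual implementation): the collector only enqueues, and user-level finalizers run once collection has finished and the heap is consistent again.

    #include <stddef.h>

    typedef struct Obj Obj;
    struct Obj {
        void (*finalize)(Obj *);
        Obj  *next_finalize;
    };

    static Obj *finalize_queue = NULL;

    /* Called mid-collection: never runs user code, just enqueues */
    static void gc_enqueue_finalizable(Obj *o) {
        o->next_finalize = finalize_queue;
        finalize_queue = o;
    }

    /* Called after collection completes: now it is safe to run them */
    static void run_postponed_finalizers(void) {
        while (finalize_queue) {
            Obj *o = finalize_queue;
            finalize_queue = o->next_finalize;
            o->finalize(o);
        }
    }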
nine timo: That's probably why it took me a year to figure this one out: github.com/niner/Inline-Perl5/comm...251d9f0d6d 13:57
Well that and that it simply was incredibly hard to reproduce 14:00
dogbert17 jnthnwrthngtn: let me check 14:01
timo wow, that's tricky 14:03
14:04 frost left
timo i wonder if we can do anything better in spesh dumping, in terms of how much it slows the spesh thread down 14:08
in theory a second thread could be made responsible for rendering spesh graphs to text and outputting them, then freeing the spesh data and such
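That idea in sketch form, as a generic pthreads producer/consumer with invented names (not actual MoarVM code; the queue is stack-shaped for brevity, and render_and_free is a hypothetical helper):

    #include <pthread.h>
    #include <stdlib.h>

    /* Hypothetical: renders one spesh graph to text, then frees it */
    extern void render_and_free(void *graph);

    typedef struct Node { void *graph; struct Node *next; } Node;

    static Node *queue_head = NULL;
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    /* Called by the spesh thread: hand the graph off, keep optimizing */
    void enqueue_graph(void *graph) {
        Node *n = malloc(sizeof *n);
        n->graph = graph;
        pthread_mutex_lock(&mu);
        n->next = queue_head;
        queue_head = n;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mu);
    }

    /* The dump thread: all text rendering and I/O happens here */
    void *dump_thread(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&mu);
            while (!queue_head)
                pthread_cond_wait(&cv, &mu);
            Node *n = queue_head;
            queue_head = n->next;
            pthread_mutex_unlock(&mu);
            render_and_free(n->graph);
            free(n);
        }
    }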
dogbert17 jnthnwrthngtn: executed the program ten times, eight of those were fast and two slow
jnthnwrthngtn dogbert17: Whereas before none of them were fast? 14:10
timo remember that paper that showed steady states of performance are often reached after a surprisingly long time?
dogbert17 I need to get the numbers ... 14:11
jnthnwrthngtn timo: Is it a different one than that which I linked earlier? 14:12
nine jnthnwrthngtn: preliminary results obtained by disabling the p5_sv_refcnt_dec function completely: master 22.78s, origin/new-disp-nativecall (dispatching to generated function bodies) 10.92s, dispatch to boot-foreign-code (calling the generic MVM_nativecall_invoke): 11.73s 14:13
timo oh, earlier? 14:14
i missed the link haha
that looks like the one i was thinking of
jnthnwrthngtn nine: so iiuc, with boot-foreign-code we are running in half the time of master, and competitive with the generated function bodies but without having to do all the generation pain? 14:16
nine Yes, that's about it. It's also roughly the performance we've had pre-new-disp
jnthnwrthngtn I assume the master timiing is with the generated function bodies too?
Oh, master is master after new-disp
OK :)
I'd be very glad to see the generated function bodies vanish before rakuast :D 14:17
nine master is not using the generated function bodies, because the $!do replacing trickery doesn't work as well anymore, for unknown reasons. Or rather, it's unknown why it worked before new-disp
jnthnwrthngtn hah :)
nine It doesn't, because we're replacing the $!do of the original function objects but calling clones (which retain the original $!do). No idea why that should be different with new-disp 14:18
jnthnwrthngtn Me either 14:19
So in theory, once we integrate the boot-foreign-code approach with the JIT, we should be winning quite nicely
nine That's what I hope :) 14:20
Will be a good base for NativeCall 2.0 as well 14:21
dogbert17 jnthnwrthngtn: it seems (after ten 'before' tests) that we had six fast executions (with a two sec variability) and four slow ones (approx. 8 secs slower)
jnthnwrthngtn dogbert17: Aha, so the "runs slower with expr JIT" was really "often runs slower with expr JIT"? 14:22
dogbert17 indeed 14:23
if expr JIT is off the program will run 'fast' 9-10 times out of ten 14:25
timo so the average time went up, but the fastest reachable state was perhaps better or at least the same?
jnthnwrthngtn timo: More like the chance of it running fast went up 14:26
OK, so that seems to help, even if not a full solution
dogbert17 yes 14:27
I guess it helped in your tests as well 14:28
timo ah i meant when we turn it on, not when we turn it off
jnthnwrthngtn Yes, compare baseline (first histogram) and stack-depth-in-frame (final histogram) in gist.github.com/jnthn/a3bfc7d0a32c...e46ee63706 14:29
dogbert17 just to avoid any misunderstandings :) 1 - most programs seem to run a bit faster when the expr JIT is turned off. 2 - some programs (e.g. the gist) can sometimes run quite a bit slower than usual. 14:48
so with the gist on my machine with expr JIT enabled: slow ~35s, fast ~27s, with a 40 vs 60 percent chance of occurring. Without expr JIT a fast run is at ~26s and a slow one at ~32s, but statistically it seems to run fast 90-95 percent of the time. 14:50
with the patch from jnthnwrthngtn and with expr JIT on (i.e. normal settings) the stats are like 80% fast and 20% slow 14:54
dogbert17 continues his ramblings 14:56
another example is thundergnat's 'White Noise' script; with expr JIT on the frame rate on my machine is ~74, if it's turned off the FPS is ~80 14:57
jnthnwrthngtn I guess the smaller factor in the expr JIT being slower is missing devirt, and the wider variability in timings is about how well we make optimization decisions 14:58
With the expr JIT taking a bit longer to produce code, it's more likely we get dubious timings
dogbert17 perhaps it's not relevant here but do you remember this commit comment: 14:59
"This does mean that there is a discrepancy between when the template JIT wishes to add labels and when the lego JIT wishes to, which is worth a closer look, however even if that is figured out, it's still better not to do work in the template JIT that will ultimately be a throwaway due to a missing template." 15:00
jnthnwrthngtn Yes, though I don't think that's a factor here.
dogbert17 ok, just thought I should throw it out there :) 15:01
jnthnwrthngtn Best I can tell the problems relating to decision making mostly impact upon inlining choices
*specialization ordering decisions 15:02
timo i've been pondering like regularly throwing old spesh stuff out and redoing it with more info, and with outdated info tossed out. that is probably dependent on making that one object a collectable, forgot which it was, but it was difficult 15:03
15:04 ggoebel left
timo could be more trouble than it's worth, dunno 15:08
SETTING::src/core.c/ThreadPoolScheduler.pm6:602 does an allocation :o 15:10
15:10 ggoebel joined
timo was pretty cool when the TPS supervisor didn't allocate at all 15:10
jnthnwrthngtn timo: MasterDuke++ has a PR that does much of the work on letting us discard old candidates 15:12
Based on excessive deopts
I need to look at it now we've got new-disp merged
timo ah, yes, MVMSpeshCandidate was heap-allocated and we wanted it gc-managed so we can let it die 15:13
lizmat timo: what exactly is doing the allocation?
the loop itself?
timo MVM_frame_takeclosure it looks like 15:14
lizmat timo: remind me why that is a loop { } and not an nqp::while(1) ? 15:15
timo dunno
lizmat I guess readability, but still
afk for a few hours&
timo the second time it hit the gc was in the clone op, probably cloning a Block object 15:16
it does take a while for it to hit gc, so that's nice
15:17 brrt left
timo one of the frames being invoked also has "please allocate on heap" set, which we set if we notice a frame tends to need to go on the heap, like when we take continuations 15:19
15:20 ggoebel left
timo yeah it allocates at a very leisurely pace, nothing to worry about i guess 15:23
[Coke] Nicholas: I was ready to come back and mention the gripping hand after reviewing, see you beat me... handily. 15:56
<hades.gif> 16:00
MasterDuke yeah, that remove-spesh-candidates PR was 99% done, hopefully just needed a final review and confirmation that something i thought was a bit odd was ok. i haven't tried rebasing it yet, i can probably start on that this weekend
16:40 patrickb left
jnthnwrthngtn OK, so if we don't try to reconstruct stack depth in spesh, but instead just track stack depth in the interpreter and send that to spesh, we get a completely accurate stack depth, even accounting for continuations. 16:51
That gets us to histograms like this: gist.github.com/jnthn/a3bfc7d0a32c...-depth-txt 16:52
Which is a bimodal distribution. I did another run and it did the same. 16:53
So I think this means that:
1. Precise stack depths get rid of a lot of the cases where we do a little bit worse, leading to the majority of runs clustering around the lower end. 16:54
2. Perhaps something else, not relating to the stack depths and thus ordering decisions, is going on to give us the few cases on the second, less common (on this machine) mode. 16:55
dogbert17 does this mean that you have another patch? 17:00
Geth MoarVM/spesh-stability: 824fb58b72 | (Jonathan Worthington)++ | 7 files
Track stack depths precisely in interpreter

And pass those along to spesh. This means that it always has accurate stack depths to use when making decisions about ordering of the specializations.
17:07
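The shape of that change, as a sketch with all names invented for illustration (the real code is in the commit above): the interpreter owns the counter and spesh merely copies it, rather than reconstructing depth from sampled logs.

    #include <stdint.h>

    typedef struct {
        uint32_t stack_depth;        /* owned by the interpreter */
    } InterpState;

    typedef struct {
        uint32_t depth;              /* sent along with each log entry */
        /* ... frame, kind, type information ... */
    } SpeshLogEntry;

    static void frame_enter(InterpState *st) { st->stack_depth++; }
    static void frame_exit (InterpState *st) { st->stack_depth--; }

    /* Logging copies the exact depth; even continuations that splice the
     * stack around cannot desynchronize it from reality. */
    static void log_entry_depth(const InterpState *st, SpeshLogEntry *e) {
        e->depth = st->stack_depth;
    }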
jnthnwrthngtn dogbert17: Yes, I'm curious what the results are on your machine
dogbert17 let's see 17:09
ten attempts, five slow and five fast 17:17
jnthnwrthngtn Interesting. 17:20
Though curiously while I made it bi-modal, there were a few more entries towards the second mode than before 17:22
dogbert17 fwiw, 12 runs with MVM_SPESH_BLOCKING=1 were all 'fast' 17:27
Geth MoarVM: e22a190b7d | (Geoffrey Broadwell)++ | 2 files
Doc research on scoping for an AArch64 JIT port

  * Reasons to work on AArch64 as our second JIT platform
  * Currently known JIT porting risks
  * Required knowledge for porters
  * Potential development environments
  * Porting roadmap sketch
  * Alternate roadmap paths to consider
  * Example of expected x64 versus AArch64 tile differences
  * Resource links discovered during research
17:30
MoarVM: 78fd4944f2 | (Geoffrey Broadwell)++ (committed using GitHub Web editor) | 2 files
Merge pull request #1560 from japhb/master

Doc research on scoping for an AArch64 JIT port
MoarVM/spesh-stability: 5040b94723 | (Jonathan Worthington)++ | 2 files
Slightly lower seems-monomorphic thresholds

When we have a threshold of 150 invocations before specialization, it only takes one or two lost datapoints in sampling in order to end up determining something is not type-stable from the statistics. Loosen this up a little.
17:53
MoarVM/spesh-stability: 1a2cf2d5b4 | (Jonathan Worthington)++ | src/spesh/log.c
Ensure we always spesh log full type tuples

If, when we log an entry, we don't have enough space in the spesh log to send all of the parameter types, send the current log now, so we can record the complete entry and parameter types in the new one. This is because the spesh stats incorporation will, at log boundaries, discard any incomplete type tuple it sees.
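A sketch of the described logic with invented names (the real change lives in src/spesh/log.c, per the commit above): reserve space for the whole tuple up front, flushing first if it would not fit.

    #include <stdint.h>

    typedef struct {
        uint32_t used;               /* entries written so far */
        uint32_t limit;              /* buffer capacity in entries */
    } SpeshLog;

    static void send_log_and_start_fresh(SpeshLog *log) { log->used = 0; }

    static void log_invocation(SpeshLog *log, uint32_t num_params) {
        uint32_t needed = 1 + num_params;   /* entry plus one per param type */
        if (log->used + needed > log->limit)
            send_log_and_start_fresh(log);  /* flush *before* writing, so no
                                               tuple is split across logs */
        /* ... append the entry record and all parameter types ... */
        log->used += needed;
    }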
jnthnwrthngtn dogbert17: That final one seems to have gotten rid of the second mode for me, although I've only done one run (they're time-consuming, and I should go home and cook dinner :)) 17:55
17:59 sena_kun left 18:02 reportable6 left
dogbert17 jnthnwrthngtn++, will check. Dinner sounds like an excellent idea :) 18:03
19:01 nebuchadnezzar left 19:04 reportable6 joined 19:19 dogbert17 left, dogbert17 joined
jnthnwrthngtn Uff, back on my machine at home the results look no better at all 20:33
lizmat m: sub a() {}; say a.^ver # jnthnwrthngtn does that seem like a sane thing ? 20:34
camelia 6.c
jnthnwrthngtn Well, it's the same as saying Sub.^ver 20:37
And I guess that's defined since 6.c :) 20:38
lizmat I'm surprised we don't need the &
jnthnwrthngtn oh hah :D 20:39
Well, then it's the same as Nil.^ver, and same story :)
(It's calling the sub, which returns Nil)
It can't be about the Sub itself, because .^ means it's about the type. 20:40
(Even if an & were present, I mean)
lizmat m: class a:ver<0.0.1> { }; sub a() { }; say a.^ver
camelia v0.0.1
lizmat so how does that work then?
I guess it's in the grammar handling .^ right ? 20:41
jnthnwrthngtn Not at all, it's just how we parse names
If it's a type it's a term, if not it's a listop
The decision is made before we see the .^ 20:42
See term:sym<name> (iirc) in the grammar
lizmat ok
jnthnwrthngtn m: class say {}; say 42
camelia ===SORRY!=== Error while compiling <tmp>
Two terms in a row
at <tmp>:1
------> class say {}; say⏏ 42
expecting any of:
infix
infix stopper
statement end
statement modifier
st…
lizmat well, at least I've figured out how App::Mi6 insists on versioning some of my modules as "v6.c" :-) 20:43
jnthnwrthngtn Another example :)
lizmat when using App::Mi6 you cannot have a distribution "foo" exporting just a sub "foo" 20:46
as it will always set the version to "v6.c" because of this behaviour :-(
contemplating solutions overnight&
moon-child just noticed: moarvm.org/contributing.html still mentions freenode 21:03
21:17 linkable6 left, evalable6 left, evalable6 joined 21:19 linkable6 joined 21:36 ggoebel joined
dogbert17 jnthnwrthngtn: I got 6 fast and 4 slow with your latest fixes 21:54
btw, did you get some nice food?
22:06 ggoebel left 22:38 ggoebel joined 22:49 colemanx left 23:01 ggoebel left
jnthnwrthngtn I cooked some pasta and sausage dish and it was nice :) 23:04
Goodness, I have numbers and they are weird 23:17
Unlike the sort-of-modality I saw on my office machine before it became bimodal, it's bimodal from the start (before any of my changes) on my machine at home 23:18
The following are the percentages of runs achieving the desirable mode: 23:19
52% before any changes
63% with just simple depth improvements
65% with just complex depth improvements
65% with just lower type stability percentages
64% with just logging changes for full type tuples
57% with all improvements together
It gets better: if you combine lower stability percentages with logging changes for full types, you get 53% and a trimodal distribution 23:35
At this point I'm hoping this is a bad dream I'll wake up from, or that I'm miraculously drunk on two beers and doing the measurements wrong. 23:36
dogbert17 it's an odd problem, not easily solved it seems 23:37
guess it depends on the beers, are they from some monastery in Belgium by any chance :) 23:41
jnthnwrthngtn Alas no, they're two tasty but not especially strong IPAs.
53% again for depth improvements + lower thresholds, although the histogram is even weirder 23:52