Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes. Set by lizmat on 24 May 2021. |
|||
00:00
linkable6 joined,
evalable6 joined
00:02
reportable6 left
|
|||
timo is too bed | 00:05 | ||
i'm not actually in bed | 01:01 | ||
01:43
frost joined
|
|||
timo | hmm, for a callsite transformation cache design, it'll have to not only be threadsafe, but ownership is also important. i guess only interned callsites are allowed to go in anyway | 01:48 | |
drop_arg is apparently never called on an interned callsite object | 02:18 | ||
oh, or drop_arg is not used instead of drop_args | 02:19 | ||
using directly the intern cache seems to be an okay idea | 02:27 | ||
02:44
linkable6 left,
evalable6 left
02:49
[Coke] left
02:51
[Coke] joined
|
|||
timo | MVM_callsite_drop_positionals, even when it looks through the intern cache first and tries to intern at the end if the incoming cs was interned but nothing appropriate was found in the intern cache | 03:26 | |
is at only 0.16% as per perf | |||
callstack_find_topmost_dispatch_recording is at 0.16% but a little further up | 03:27 | ||
0.02% drop_positionals when there's neither the looking through the cache nor the intern attempt at the end | 03:30 | ||
all of this measured for rakudo -e '' | |||
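A minimal sketch of the lookup order timo describes for dropping positionals: consult the intern cache first, and only intern the new result when the incoming callsite was itself interned. This is a toy model, not MoarVM code; a callsite is reduced to a tuple of argument flags, and `intern`, `is_interned` and `drop_positionals` are hypothetical names.

```python
# Toy model of "use the intern cache for callsite transforms".
# Assumption: a callsite is just a tuple of argument flags; every name
# below is illustrative and does not correspond to MoarVM's C API.
from typing import Dict, Tuple

Callsite = Tuple[str, ...]

_interned: Dict[Callsite, Callsite] = {}


def intern(cs: Callsite) -> Callsite:
    """Return the canonical shared object for this callsite shape."""
    return _interned.setdefault(cs, cs)


def is_interned(cs: Callsite) -> bool:
    """True only if this exact object is the canonical one in the cache."""
    return _interned.get(cs) is cs


def drop_positionals(cs: Callsite, idx: int, count: int) -> Callsite:
    """Drop `count` args starting at `idx`, reusing an interned result if any."""
    if idx < 0 or count < 0 or idx + count > len(cs):
        raise IndexError("drop range outside callsite")
    wanted = cs[:idx] + cs[idx + count:]
    hit = _interned.get(wanted)      # 1. look through the intern cache first
    if hit is not None:
        return hit
    if is_interned(cs):              # 2. only interned inputs get interned results
        return intern(wanted)
    return wanted                    # 3. otherwise the caller owns a fresh object
```

Keying ownership on whether the input was interned mirrors the concern raised at 01:48: a shared cache entry must never point at an object that some caller might free.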
03:45
linkable6 joined
03:47
evalable6 joined
|
|||
japhb | Exclusive times, I assume? | 03:55 | |
05:00
notable6 left,
benchable6 left,
nativecallable6 left,
committable6 left,
bisectable6 left,
squashable6 left,
releasable6 left,
quotable6 left,
bloatable6 left,
unicodable6 left,
evalable6 left,
statisfiable6 left,
greppable6 left,
coverable6 left,
shareable6 left,
sourceable6 left,
tellable6 left,
linkable6 left,
linkable6 joined,
sourceable6 joined,
tellable6 joined
05:01
coverable6 joined,
squashable6 joined,
unicodable6 joined,
evalable6 joined
05:02
benchable6 joined,
nativecallable6 joined
05:35
codesections left
05:36
codesections joined
|
|||
Nicholas | good *, * | 05:43 | |
en.wikipedia.org/wiki/X86_calling_...onventions -- Microsoft x64 calling convention ... ... System V AMD64 ABI ... If the callee is a variadic function, then the number of floating point arguments passed to the function in vector registers must be provided by the caller in the AL register. | 05:50 | ||
I *thought* that there was also some requirement on integer arguments. Anyway, that one bites | 05:51 | ||
06:01
releasable6 joined,
statisfiable6 joined
06:05
reportable6 joined
07:00
quotable6 joined,
notable6 joined
07:02
committable6 joined
07:24
sena_kun joined
07:57
dogbert17 left
08:00
greppable6 joined
08:01
bisectable6 joined
08:25
dogbert17 joined
08:31
discord-raku-bot left
08:32
discord-raku-bot joined
09:00
bloatable6 joined
10:01
shareable6 joined
|
|||
MasterDuke | huh. on (roughly) master, i got 19s then 14s for m-test, and 130s and 126s for m-spectest | 10:26 | |
lizmat | so faster than master ? | 10:38 | |
MasterDuke | well, i was comparing to yesterday when i was on new-disp and got 19s and 19s for m-test, and 178s and 171s for m-spectest | 10:40 | |
jnthnwrthngtn | MasterDuke: Curious, I get 18s and 15s here | 10:47 | |
No way is spectest going to be faster than master given spectest is hugely dependent on startup time. | |||
MasterDuke | well, you do have a much faster machine. but i'm surprised my second run isn't any faster than the first | ||
jnthnwrthngtn | Yes, that's the part I'm surprised about. | 10:48 | |
I mean, nativecall is already pre-compiled, so it should be no trouble | |||
(For the second run) | |||
MasterDuke | maybe it was just a bad hash randomization/spesh not being blocking interaction | ||
jnthnwrthngtn | I figure that's the main slowdown | ||
MasterDuke | timo++'s prs should help with startup/spectest, correct? | 10:49 | |
lizmat | perhaps it's an effect of my work ? | 10:50 | |
:-) | |||
MasterDuke | my new-disp times don't account for your commits from today, so hopefully they'll be faster next time i run it | 10:51 | |
jnthnwrthngtn | lizmat: I highly doubt the setting changes you're doing would have an impact on this, if that's what you're meaning | 10:52 | |
lizmat | ah, ok | 10:53 | |
jnthnwrthngtn takes a look at timo++'s work | 10:54 | ||
11:01
linkable6 left,
evalable6 left
11:02
linkable6 joined
|
|||
jnthnwrthngtn | Hm, I was going to say that fix to args tail is making an assumption that something with no capture is an arg drop, and that could be fragile if we later do a replace arg, although replace would be drops + insert... It'd take an insert multiple to be a problem I guess | 11:03 | |
MasterDuke | there are a bunch of ops where the interpreter implementation is roughly `if (REPR(foo)->ID != MVM_REPR_ID_something || !IS_CONCRETE(foo)) MVM_exception_throw_adhoc(tc, msg) else <do something>`. sometimes the jit version is just <do something>, and sometimes it has that `if` | 11:07 | |
lizmat | jnthnwrthngtn: sanity check: use nqp; sub a(Range:D $a) { dd nqp::iscont($a) }; my $r = 1..3; a $ gives 1 on new-disp | 11:12 | |
I thought they were to be deconted? | |||
MasterDuke | is there an easy way to know if the jitted version does/does not need the repr and/or concreteness checks? | 11:13 | |
lizmat | /they were/$a is/ | ||
MasterDuke | m: use nqp; sub a(Range:D $a) { dd nqp::iscont($a) }; my $r = 1..3; a $r | 11:14 | |
camelia | 1 | ||
jnthnwrthngtn | lizmat: Hm, I'd have expected master to do the same given the Range constraint | 11:20 | |
lizmat: Is the :D significant? | |||
lizmat | no | ||
jnthnwrthngtn | One can't rely on the caller side to have decont'd in new-disp, it just may have done it | 11:21 | |
Oh! | |||
m: say Range ~~ Iterable | |||
camelia | True | ||
jnthnwrthngtn | The container is required here because it's an Iterable | ||
Otherwise it would flatten | |||
So the signature binder re-wraps it | |||
So yeah, it's correct | 11:22 | ||
MasterDuke: If the C thing being called does the checks, then the interpreter (and JIT if needed) can have them removed. If it doesn't, they're needed in both | |||
MasterDuke: One of the reasons we'll gradually move towards syscalls, though, is that we can enforce the types outside of the C function and elide them | 11:23 | ||
11:23
brrt joined
|
|||
jnthnwrthngtn | lunch, bbiab | 11:23 | |
MasterDuke | ok. for some reason i thought we could assume in some cases that things were concrete and/or the right REPR when being run by the jit. but i'll make sure any checks done by the interpreter are also done in the jitted versions i'm making | 11:26 | |
lizmat | down to 1.251 / .703 | 11:47 | |
12:02
reportable6 left
12:03
evalable6 joined
|
|||
jnthnwrthngtn | MasterDuke: Yes, though do check the interpreter really needs them also | 12:04 | |
MasterDuke | ah, how do i check that? | 12:05 | |
jnthnwrthngtn | lizmat: With further ops converted to $ ? | ||
MasterDuke: Look at the C function called by the interp and see if it repeats the check | |||
MasterDuke | oh, ha. so far i'm just creating the c funcs, so not repeating anything, but i can check the existing ones i come across | 12:06 | |
Geth | MoarVM/new-disp: edc4fb9d57 | (Timo Paulssen)++ (committed by Jonathan Worthington) | 7 files add dispatcher-drop-n-args to optimize allocations Instead of creating a MVMCapture and MVMCallsite for each step of removing arguments, we now offer a syscall that drops multiple arguments that live at the same index in one go. The result is that the transformations tree can now contain null entries for the capture entry, which we have to interpret and deal with. |
||
jnthnwrthngtn | timo: I did a small change in order I could get some asserts in that will help us if we try to do further things like this | 12:09 | |
timo: Merged the NQP and Rakudo ones as is (well, rebased Rakudo one) | 12:13 | ||
lizmat | jnthnwrthngtn: yes | 12:14 | |
fwiw, no noticeable change in test-t after timo's work | 12:20 | ||
MasterDuke | any changes in spectest? | 12:21 | |
lizmat | I don't time that atm | ||
MasterDuke | `t/02-rakudo/03-corekeys-6d.t .................................... Dubious, test returned 1 (wstat 256, 0x100)`, doh | 12:23 | |
lizmat | did I break that? | 12:30 | |
jnthnwrthngtn | lizmat: It was primarily aimed at startup, and does manage to give a sound couple of thousand fewer allocations at least. | 12:31 | |
MasterDuke | no, this is on my branch off of master. it was just a flap, but that test is pretty simple... | 12:32 | |
jnthnwrthngtn | ah, actually more than that, the thousands are only those one profile-compile starts measuring but there are others in NQP setup | 12:33 | |
lizmat | argh I didn't pull rakudo itself, so I missed that part of timo's work | 13:03 | |
new timings in a mo | |||
13:04
reportable6 joined
13:05
frost left
|
|||
jnthnwrthngtn | Be sure to pull NQP too | 13:05 | |
lizmat | no noticeable change in test-t | ||
perl Configure.pl --force-rebuild --gen-moar=new-disp --gen-nqp=new-disp --make-install | 13:06 | ||
will do that, but not for Rakudo itself :-) | |||
afk for a few hours& | |||
13:36
brrt left
|
|||
jnthnwrthngtn | Just been doing some comparative measurements of master/new-disp (mostly microbenchmarks, also measuring a Cro app). We're doing a lot better at a bunch of the targeted features, of course, but also a bit better on various things that are effectively just multi/method dispatch based. | 13:50 | |
MasterDuke | very cool | 13:51 | |
jnthnwrthngtn | The Cro app gets a couple of hundred more requests per second, around 10% more. | 13:52 | |
Nicholas | ./rakudo-m -Ilib t/spec/S32-list/grep.rakudo.moar | ||
jnthnwrthngtn | What's worse, other than the obvious (startup) | ||
Nicholas | MoarVM oops in spesh thread: Spesh: failed to fix up inline 1 () -1 -1 | ||
that was MoarVM edc4fb9d57d245929ee5d4d013b22bef1a63bf9b | |||
(not sure if the rest matters) | |||
jnthnwrthngtn | Are almost entirely I/O benchmarks (reading lines, writing lines) | ||
Nicholas | ASAN made no comment | ||
jnthnwrthngtn | The reason for the I/O ones is seemingly that OSR does not function | 13:54 | |
MasterDuke | ah. that might explain why my spesh log processing one-liner is slower | ||
jnthnwrthngtn | This is due to a7c0cc8d2b, which is a bug fix | ||
Somewhere on the I/O path we do a role composition, and do a `ctx` op. We actually only want it for the current context, not for traversal. | 13:55 | ||
Nicholas | my assertion failure was with rakudo back at 9c587d92d0cdb2aa86c2ca70ed15b5c478443b02 -- Use new dispatcher-drop-n-args syscall | 13:56 | |
jnthnwrthngtn | However, we don't have a way to indicate that, and so it assumes it's wanted for traversal and marks everything in the caller chain | ||
We then are unable to OSR | |||
m: say 0.6814 / 0.4247 | 13:57 | ||
camelia | 1.604427 | ||
Nicholas | "obviously" (it seems to be, as a poor quality teddy bear) that the brute force solution to this is a second op that is just "the current context". But is there a better way? | ||
jnthnwrthngtn | A 60% slowdown. That's not nice. | 13:58 | |
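The 60% figure follows from the ratio camelia printed. A quick check, assuming 0.6814 is the new-disp timing and 0.4247 the master timing (the chat does not label the two numbers):

```python
# Assumption: the larger number is new-disp and the smaller is master;
# the log only shows the division itself.
new_disp_seconds = 0.6814
master_seconds = 0.4247

ratio = new_disp_seconds / master_seconds
slowdown_percent = (ratio - 1) * 100

print(f"ratio {ratio:.6f}, about {slowdown_percent:.0f}% slower")
```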
So that probably needs a solution | |||
The other case that I don't have an explanation for yet is an object creation benchmark | 13:59 | ||
MasterDuke | is `ctxlexpad` sort of "the current context"? | ||
jnthnwrthngtn | ctxlexpad turns out to be the identity function :/ | 14:01 | |
I suspect it hadn't used to be | |||
Nowadays the thing from ctx is just directly indexable for the current context | 14:02 | ||
Nicholas | "in spesh thread" - this might be the first "win" from commit 998ea76a17cb8dbafc6dc392d15d40a487d236c3 | 14:04 | |
14:04
linkable6 left
|
|||
jnthnwrthngtn | Nicholas: I've been happy about that at least a couple of times recently | 14:05 | |
14:05
linkable6 joined
|
|||
Nicholas | ah OK. It's the first that *I* noticed. I slack more. | 14:05 | |
(a *lot* more) | |||
timo | hm looks like the discord bridge works only one-way at the moment | 14:12 | |
cdn.discordapp.com/attachments/633...004322.jpg | 14:20 | ||
jnthnwrthngtn | Nicholas: I just did a spectest with blocking/nodelay to verify my change to get OSR back and also see that inline fixup exception | 15:01 | |
Geth | MoarVM/new-disp: 6f2b01c275 | (Jonathan Worthington)++ | 8 files Introduce non-traversable contexts These are for when we will only read the lexicals of the exact frame we obtained it in, and thus can avoid marking the whole callstack up as needing caller position information preserved. |
15:06 | |
MoarVM/new-disp: baf1423327 | (Jonathan Worthington)++ | 3 files Be more precise about OSR caller positions There are two situations in which we set the caller info needed flag: one when we throw an exception and want to produce a backtrace, and another when we need to do context introspection. Only the latter is in absolute need of accurate position information, and thus must poison OSR. This, together with non-traversable contexts, lets us get OSR back in various situations, including some common cases of I/O, fixing a performance regression relative to `master`. |
|||
Nicholas | running that spectest with a non-ASAN build with valgrind produced quite a bit of excitement at optimize_bb_switch (optimize.c:2299) and optimize_bb_switch (optimize.c:2280) | 15:07 | |
Conditional jump or move depends on uninitialised value(s) | |||
and once at 0x4B82E97: build_cfg (graph.c:487) | |||
Geth | MoarVM/new-disp: ea63d91730 | (Jonathan Worthington)++ | src/core/ext.c Fix uninitialized read in spesh graph building |
15:09 | |
jnthnwrthngtn | Nicholas: The second of those was easy, the first I've spent a while trying to figure out and can't | ||
(A while before now, that is) | |||
Geth | MoarVM: MasterDuke17++ created pull request #1550: Add '.new()' suggestion to type object errors |
||
Nicholas | jnthnwrthngtn: the ones you can't figure out - is there (at least) a short(er) way to trigger them? | 15:11 | |
jnthnwrthngtn | Nicholas: I didn't repro them, just went through the code involved a few times | 15:12 | |
Hm, or if I did it didn't give me any extra clues... | |||
Geth | MoarVM/attempt_use_intern_cache_for_drop_positionals: 23ebdab1ce | (Timo Paulssen)++ | src/core/callsite.c If possible, use the intern cache for transforms Doesn't actually seem faster than allocating them every time we do transformations. I have only measured using an empty raku program, however, since I was hoping to make startup cheaper. |
15:18 | |
timo | ah, yes. sometimes you ctx, but sometimes you ctxn't | 15:19 | |
jnthnwrthngtn | Well, there really are no annotations for inline 1 in the spesh graph... | 15:21 | |
MasterDuke | i just switched to new-disp and pulled all three repos, built, and ran two `make m-test m-spectest`. got 22s and 19s for m-test, and 176s and 171s for m-spectest | 15:22 | |
Nicholas | timo: reason why you spotted that PyPy blog post and I didn't - it's not on the front page. | 15:24 | |
timo | that's odd | 15:25 | |
Nicholas | not totally. | ||
timo | you think it's not "common interest" or whatever? | ||
Nicholas | I didn't dig into *how* they made the site, but it looked like it might have been that it required "manual" work to update the front page. (No idea if that's a script to bake a new front page, or what) | 15:26 | |
I think that this was oversight. But I failed to be helpful and try to create a decent bug report | |||
MasterDuke | hm, does look like maybe my spesh log processing one-liner is a bit faster after that OSR fix though... | 15:27 | |
timo | that's the code that uses -n that you mentioned the other day, yes? | 15:30 | |
MasterDuke | yeah | ||
jnthnwrthngtn | OK, I figured out the inline fixup bug and it's terrible | 15:31 | |
Nicholas | um, like "headdesk, how did I make that mistake?" or "oh, erk, this is gnarly to get right?" | 15:32 | |
jnthnwrthngtn | It occurs when all of the following happens: | ||
1. We are doing a nested inline | 15:33 | ||
2. The thing we are inlining, which has its own inlines, has an inline that shrank to zero instructions | |||
3. The annotations about it end up on an sp_bindcomplete, which we delete as part of inlining | |||
It processes the annotations on the bindcomplete instruction and fixes them up. We then delete said instruction. The annotations then move onto the next instruction so as not to get lost. | 15:34 | ||
We then fix them up again | |||
Making them bogus | |||
timo | and then they bug us | ||
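The double fixup can be reproduced in a toy model. This is only a sketch of the failure mode as described in the chat, not the code in src/spesh/inline.c: annotations are reduced to a list of inline indices, "fixing up" means adding an offset, and `Ins`, `delete_ins`, `fixup_eager` and `fixup_deferred` are hypothetical names.

```python
# Toy reproduction of "fix up, delete, annotations move, fix up again".
# Assumption: an annotation is just an inline index and the fixup adds an
# offset to it; the real data structures are considerably richer.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ins:
    op: str
    inline_ids: List[int] = field(default_factory=list)


def make_graph() -> List[Ins]:
    # The zero-instruction inline's annotation sits on sp_bindcomplete.
    return [Ins("sp_bindcomplete", [0]), Ins("add_i"), Ins("return_o")]


def delete_ins(graph: List[Ins], pos: int) -> None:
    """Delete graph[pos]; its annotations move onto the next instruction."""
    dead = graph.pop(pos)
    if pos < len(graph):
        graph[pos].inline_ids = dead.inline_ids + graph[pos].inline_ids


def fixup_eager(graph: List[Ins], offset: int) -> None:
    """Buggy order: delete while walking, so moved annotations are seen twice."""
    i = 0
    while i < len(graph):
        ins = graph[i]
        ins.inline_ids = [n + offset for n in ins.inline_ids]
        if ins.op == "sp_bindcomplete":
            delete_ins(graph, i)  # annotations land on a not-yet-visited ins
            continue
        i += 1


def fixup_deferred(graph: List[Ins], offset: int) -> None:
    """Fixed order: finish every fixup first, then perform the deletions."""
    for ins in graph:
        ins.inline_ids = [n + offset for n in ins.inline_ids]
    doomed = [i for i, ins in enumerate(graph) if ins.op == "sp_bindcomplete"]
    for pos in reversed(doomed):
        delete_ins(graph, pos)
```

With an offset of 1, the eager walk leaves the surviving instruction carrying index 2 instead of 1, which is the kind of bogus annotation that later fails to match any inline.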
MasterDuke | am i correct in thinking that if possible, it's better to jit something via writing some asm in emit.dasc than moving it to a function and calling that from the interpreter and the jit? | 15:37 | |
timo | we're essentially making the same trade-off the compiler does when deciding whether to inline a given function | 15:39 | |
Geth | MoarVM/new-disp: 75560fd2ec | (Jonathan Worthington)++ | src/spesh/inline.c Correct deletion of sp_bindcomplete We cannot do it immediately, as annotation motion might cause us to fix up the same annotation twice, which is wrong. Thus do the deletion after all fixups of annotations are completed. |
15:40 | |
timo | if the code to Do The Thing is about as short as the stuff to call the function and the parts of the function that deal with being called, then we can probably prefer emit.dasc | ||
jnthnwrthngtn | Nicholas: That seems to do it. | ||
MasterDuke | i'm going to guess they are here github.com/MoarVM/MoarVM/blob/mast...3010-L3027 | 15:45 | |
jnthnwrthngtn | With OSR reinstated we now beat master at the I/O benchmarks, and have caught up with Ruby in a "write a million lines of utf-8" one | 15:49 | |
timo | in theory spesh could put markcode* into the repr-speshed ops and allow MVMCode REPR to optimize it into sp_get_i* and sp_bind_i* or whatever | 15:50 | |
jnthnwrthngtn | timo: Yes, I'd already figured we want something like that, just didn't quite figure out how | 15:51 | |
(As in, a nice way to factor it) | |||
timo | getstaticcode and getcodecuid could also | |||
do you mean how i described it isn't that nice way to factor it? | 15:53 | ||
japhb | jnthnwrthngtn: Nice to hear we're caught up with Ruby on that benchmark, but where is Ruby on the utf-8 I/O efficiency scale? Is this a major achievement? | ||
MasterDuke | somehow i missed that comma 2021.08 was released, i'll have to give its profile viewer a try | 15:54 | |
jnthnwrthngtn | japhb: More efficient than Python, less than Perl. | 15:55 | |
japhb: Although I should add: less than *recent* Perl. | |||
(I think there were UTF-8 I/O speedups there) | 15:56 | ||
japhb: Major achievement only really in so far as our I/O handle impl and coordination of encoding is all in Raku, whereas I suspect in Ruby one ends up far more quickly in C code. | 15:57 | ||
OK, so about the object creation benchmark we've lost something on from master: profiler says we JIT 99.95% of frames on master but 36.35% of frames on new-disp. 4% less inlining. | 15:58 | ||
timo | oof | ||
jnthnwrthngtn | Though oddly, I can't find any "bailed completely" in the spesh log | 15:59 | |
lesson, bbl | 16:01 | ||
timo | are there any prof_enter that should have become enterspesh in the spesh log? | ||
MasterDuke | that sounds like exactly what a profile of my one-liner shows | 16:02 | |
16:02
AlexDaniel left,
psydroid left
16:04
AlexDaniel joined
16:14
psydroid joined
|
|||
japhb | jnthnwrthngtn: Ah, interesting re: UTF-8 I/O efficiency. That all gives me good context, thanks. | 16:28 | |
MasterDuke | timo: how can i know if a prof_enter should have been prof_enterspesh? if it's in the 'after'? | 16:30 | |
timo | yeah | ||
MasterDuke | they're all in the 'Before', don't see in an 'After' | 16:36 | |
timo | OK | ||
16:36
Altai-man joined
16:37
Altai-man left
|
|||
MasterDuke | any other ideas? | 16:45 | |
timo | i'd perhaps perf record and see if there's actually a big portion of samples in interp_run rather than jitted frames which would be identified from having the perf map on | 16:47 | |
except i've seen a boatload of 0x000000asdfgh frames in perf report results as well even with the perf map turned on | 16:48 | ||
MasterDuke: could you give me your -n code right quick? i thought i had it but i don't | 16:59 | ||
MasterDuke | raku -ne 'BEGIN my (%h, $f); if .starts-with(q|Spesh of |) and /^"Spesh of " $<func>=(<-[\ ]>+)/ { $f = ~$<func> } elsif .contains(q|JIT: bailed completely because of <|) and /"JIT: bailed completely because of <" $<op>=(<-[>]>+)/ { %h{q|l_|~$<op>}.push($f) } elsif .contains(q|expr bail: Cannot get template for: |) and /"expr bail: Cannot get template for: " $<op>=(\w+)/ { %h{q|t_|~$<op>}.push($f) }; END for %h.keys.sort -> $k { say qq|$k: %h{$k}.Bag()| }' | ||
timo | haha that's long | ||
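For readers who find the one-liner hard to parse, here is a rough Python rendering of the same logic. It is a sketch, not a drop-in replacement: the three marker strings come straight from the one-liner, while the function name `summarize` and the use of `Counter` in place of Raku's `Bag` are choices made for this illustration.

```python
# Rough equivalent of the spesh-log one-liner above.
# Assumption: the log is fed in line by line; `summarize` is a hypothetical name.
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional

SPESH_OF = re.compile(r"^Spesh of ([^ ]+)")
JIT_BAIL = re.compile(r"JIT: bailed completely because of <([^>]+)")
EXPR_BAIL = re.compile(r"expr bail: Cannot get template for: (\w+)")


def summarize(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count, per bail reason, which speshed routines hit it."""
    bails: Dict[str, Counter] = defaultdict(Counter)
    func: Optional[str] = None  # most recent "Spesh of <name>" seen
    for line in lines:
        m = SPESH_OF.match(line)
        if m:
            func = m.group(1)
            continue
        m = JIT_BAIL.search(line)
        if m:
            bails["l_" + m.group(1)][func] += 1  # legacy JIT bailed on this op
            continue
        m = EXPR_BAIL.search(line)
        if m:
            bails["t_" + m.group(1)][func] += 1  # expr JIT lacks a template
    return dict(bails)
```

Printing `sorted(summarize(open(path)).items())` for some log file `path` gives output comparable to the `END` block of the Raku version.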
MasterDuke | 23.75% MVM_interp_run | 17:00 | |
10.01% MVM_string_utf8_decodestream | |||
yeah, guess i could pull those strings out into a variable | 17:01 | ||
of course i only duplicated them when it was too slow with just the regex | 17:03 | ||
timo | hehe. | 17:05 | |
MasterDuke | 23% interp_run is way higher than usual, seems to suggest stuff actually isn't getting jitted. but why can't we tell why? | 17:07 | |
timo | <anon> from -e:1 here has 1.3 mega entries and 0% jit | 17:08 | |
hum. | 17:11 | ||
there's no complete bail | |||
but it does say "jit not successful" | |||
jnthnwrthngtn | oh | 17:13 | |
timo | it does succeed jitting in my non-profiled version here | 17:14 | |
wonder what's wrong there | |||
jnthnwrthngtn | I'd missed the "jit was not successful" | ||
timo | oh, is it normal to have more than one return_o in a resulting frame? | ||
jnthnwrthngtn | timo: "resulting"? | 17:16 | |
It's OK for there to be more than one return_o in general | |||
timo | "After:" | ||
jnthnwrthngtn | Oh | 17:17 | |
Well, did the before have it? | 17:18 | ||
timo | ok i searched further, there's more than one -e:1 and the longer one is also not successfully jitted without profile | ||
nine | Darn.... the "Type check failed for return value; expected CompUnit::Handle:D but got BOOTIO (BOOTIO)" is still here. Will have a look at this on Saturday I guess | ||
jnthnwrthngtn | nine: I was gonna see how new-disp did on agrammon and it also blew up with that | ||
nine | Oooh...so it's not just this one application. | 17:19 | |
Gives hope for a reduced test case | |||
jnthnwrthngtn | Agrammon is kinda the opposite of a reduced test case, but I dunno how big your application is :D | ||
nine | tree says 47 directories, 201 files | ||
jnthnwrthngtn | Hm, it may be smaller | 17:20 | |
(Agrammon, that is) | |||
Wonder if it's a deopt-o | |||
nine | I guess the deciding factor is just: load tons of modules so load-precomp-file gets speshed | ||
not sensitive to inlining for a change | 17:21 | ||
jnthnwrthngtn | I couldn't imagine there would be as many ways to screw up deopt as I've managed to create... | ||
nine | MVM_JIT_DISABLE=1 MVM_SPESH_OSR_DISABLE=1 MVM_SPESH_PEA_DISABLE=1 MVM_SPESH_INLINE_DISABLE=1 and its still there | ||
jnthnwrthngtn | timo: I shoved in some debug prints and it turns out that we make a JIT graph but fail to compile it | 17:22 | |
nine ought to start making dinner though | |||
jnthnwrthngtn | timo: And it happens without profiling, so I suspect that really is the problem | ||
timo | right, i see it regardless of profiling or not as well | ||
so somewhere in the final jitting step it's failing but not bailing | 17:23 | ||
jnthnwrthngtn | I made a nice batch of potato salad yesterday and so dinner preparation is easy today :) | ||
timo | if you want i'll reverse-step in rr to see where exactly it stops | ||
jnthnwrthngtn | I was trying to do exactly that, set a breakpoint on printf, and it didn't hit it, wat. | 17:25 | |
17:25
Xliff joined
|
|||
timo | we vsprintf | 17:26 | |
that may not go through printf | |||
jnthnwrthngtn | ah | 17:34 | |
ah, with MVM_JIT_DEBUG I get: | |||
JIT ERROR: Negative offset for dynamic label 18 | |||
Ohh. With MVM_JIT_EXPR_DISABLE=1 it goes away | 17:43 | ||
And we get 99.9% JIT | |||
So it's apparently about the expression JIT | 17:44 | ||
Though despite that it still doesn't really get back all the perf... | |||
Ah, the bigger discrepancy may well be that `new` doesn't get inlined | 17:46 | ||
Ah, just a "bytecode too large". Guess I need to look at why | 17:48 | ||
But first, food | |||
dogbert17 | so, it's time for brrt to make an appearance ... | 18:01 | |
18:02
reportable6 left
18:05
reportable6 joined
|
|||
MasterDuke | wow, with MVM_JIT_EXPR_DISABLE=1 i get 12.36% MVM_string_utf8_decodestream and 9.25% MVM_interp_run | 18:13 | |
Xliff | Anybody wanna play around on a 64-core Google Engine instance? | 18:38 | |
Looks like Raku may have a parallelism bug ... or maybe my script does. | 18:39 | ||
Working script conks out when parallel compiling my p6-ICal project with no reason. | |||
I have no idea what to look for. | |||
If no takers, I'll pack it up and shut it down. It's costing me $3/hour | |||
I'll check again in a couple of hours | |||
Nicholas | I suggest that you shut it down for now, assuming it doesn't cost you lots more than $3 to set it all up again | 18:40 | |
I'm going AFK soon, and I think most folks are not really awake | |||
[Coke] | I'm in the right time zone, but also am zonking. | ||
Nicholas | Thanks to the GCC compiler farm I have access to 32 core x86_64 machines. Which, sure, aren't 64 cores. But aren't costing me. | 18:42 | |
(and insane PPC machines, I think thanks to IBM. But if the bug is in the JIT, they won't help) | |||
Xliff | I just tried at concurrency level 32 and the hangup did NOT occur. | 18:50 | |
I will try my whole set of projects at level 32 and see what I get! ;q | 18:51 | ||
Nicholas | jnthnwrthngtn: yes, you fixed the failure in t/spec/S32-list/grep.rakudo.moar | 18:54 | |
(forgot to confirm) | |||
19:05
linkable6 left,
evalable6 left
19:06
evalable6 joined
|
|||
timo | i wonder if i should try exposing "percentage of calls that aren't jitted because they were inlined calls from a non-jitted frame" in moarperf? | 19:09 | |
at the moment you can open the "callers" table in the routines list, then you see one with 99.4% inlined, 0.0757% jitted and one with - inlined but 99.5% jit | 19:10 | ||
MasterDuke | huh, could be useful | 19:14 | |
timo | we don't have separate nodes in the call graph for inlined calls vs regular calls, so we can't "follow" inlined calls to the original inliner so to speak | 19:16 | |
anybody feel like we should maybe statically determine what branches have not been taken at all and throwing them out of our spesh graphs and put an unconditional deopt there? | 19:37 | ||
nine | timo: sounds like it'd help with inlining by getting the bytecode size under the limit. Also doesn't sound like something very common? | 19:45 | |
timo | can search for "never dispatched" | ||
it's common for subs that have a path that throws an exception in some cases | |||
like division that has to check for zero for example | 19:46 | ||
jnthnwrthngtn | Exception paths often aren't taken and could indeed be handled with deopt | 20:00 | |
And yeah, lack of inline cache entries is a really good hint. | 20:01 | ||
We don't even have to record branch stats that way | |||
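The idea can be sketched on a toy control-flow model. Everything here is an assumption made for illustration: a block is a name plus a list of ops, each dispatch op in it carries a count of inline cache entries, and a block whose dispatches all have zero entries is treated as never taken. `Block` and `prune_cold_blocks` are hypothetical names, not spesh data structures.

```python
# Toy sketch: replace never-dispatched blocks with an unconditional deopt.
# Assumption: zero inline cache entries on every dispatch in a block means
# the block was never executed while statistics were gathered.
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    name: str
    ops: List[str]
    cache_entries: List[int]  # one count per dispatch op in this block


def prune_cold_blocks(blocks: List[Block]) -> List[Block]:
    """Return a copy of the graph with cold blocks shrunk to a single deopt."""
    pruned = []
    for block in blocks:
        never_dispatched = bool(block.cache_entries) and all(
            n == 0 for n in block.cache_entries
        )
        if never_dispatched:
            # Falling back to the interpreter keeps the rare path correct.
            pruned.append(Block(block.name, ["deopt"], []))
        else:
            pruned.append(block)
    return pruned
```

Blocks without any dispatch op are left alone, since this heuristic has no evidence about them either way; shrinking the cold blocks is what would help a routine fit under the inlining size limit nine mentions.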
timo | since we have a dispatch_* every few meters now anyway ... :) :) | 20:03 | |
jnthnwrthngtn | Indeed :) | 20:05 | |
Hm, this is weird. `method new` is not getting any type tuples recorded in its stats | 20:06 | ||
(Mu.new, that is) | 20:07 | ||
omg | 20:17 | ||
I missed updating a spot in the spesh stats code for the change to the way named parameters are handled | 20:21 | ||
As a result, we lost type info for everything with named args | 20:22 | ||
timo | :D | ||
jnthnwrthngtn | Anyway, that gets things much better :) | 20:31 | |
Will do a blocking + nodelay test to make sure the extra opportunities don't shake out new problems | |||
Geth | MoarVM/new-disp: 7d3cba4e2d | (Jonathan Worthington)++ | src/spesh/stats.c Correct handling of named arg type stats This wasn't updated for the new calling conventions, and thus we would consider type tuples with named arguments to have incomplete type info, and thus specialize them suboptimally. |
20:47 | |
21:37
jgaz joined
|
|||
MasterDuke | whoops | 21:58 | |
22:11
jgaz left
|
|||
jnthnwrthngtn | Good for my private benchmark set though; it's caught two regressions that have both been fixed today. | 22:13 | |
22:15
jgaz joined
|
|||
jnthnwrthngtn | I'm content so far as performance goes with the merge now. The startup hit is the only significant regression I'm aware of. Guess we'll see what blin's verdict is. | 22:16 | |
MasterDuke | i still see the problem where my script is only 55% jitted | 22:19 | |
22:19
jgaz left
|
|||
jnthnwrthngtn | Was it more JITted on master? | 22:23 | |
And does the rate go up with MVM_JIT_EXPR_DISABLE=1 ? | |||
MasterDuke | yes, it looks like the rate goes up with MVM_JIT_EXPR_DISABLE=1. i didn't get a profile like that, but interp_run goes from ~23% to ~10% | 22:24 | |
jnthnwrthngtn | OK, probably it's the thing I discovered earlier then | ||
lizmat | down to 1.241 / .719 # all within noise, but feels a bit faster | ||
jnthnwrthngtn | MasterDuke: If you run MVM_JIT_DEBUG=1 does it spit out a message about a negative label? | 22:25 | |
MasterDuke | let me see... | 22:26 | |
JIT ERROR: Negative offset for dynamic label 185 | |||
JIT ERROR: Negative offset for dynamic label 65 | |||
jnthnwrthngtn | That's the one | ||
I do wonder if we can isolate it to a particular template | |||
Also wonder how widespread this is | 22:27 | ||
Ah, I see a bunch of them during the Rakudo build if I set it while doing that | 22:28 | ||
Several of them in Test::CSV too | 22:29 | ||
MasterDuke | ha. profile with no env variables is 2mb. profile with MVM_JIT_EXPR_DISABLE=1 is 11mb | ||
and 89% jitted instead of 55% | |||
both have 35k deopts | 22:30 | ||
whoops, without is 44% jitted, not 55% | |||
jnthnwrthngtn | Yeah, this is worth trying to hunt down. | 22:33 | |
MasterDuke | interesting. on master, the profile says it's 20s slower (than new-disp with MVM_JIT_EXPR_DISABLE=1), but 93% jitted instead of 89%. still ~33k deopts | 22:42 | |
jnthnwrthngtn | MasterDuke: What's the wallclock times on master/new-disp? | 22:51 | |
MasterDuke | master is ~58s | 22:52 | |
building new-disp... | 22:54 | ||
timo | were you able to rr it by breakpointing vsnprintf or whatever, jnthnwrthngtn? | 22:55 | |
MasterDuke | new-disp is ~55s | 22:56 | |
new-disp + MVM_JIT_EXPR_DISABLE=1 is ~45s | |||
hot damn | 22:57 | ||
jnthnwrthngtn | MasterDuke: Ah, so the issue is not that new-disp is slower, but that it should be even faster. OK, that's a nice problem. :) | ||
(I'd misunderstood it as new-disp being slower) | 22:58 | ||
MasterDuke | a couple days ago it was ~15s slower, so there have been some good improvements recently | ||
timo | worst case ever would be: when this problem from exprjit happens, start again from the start but turn exprjit off, either completely, or after the given BB or ins or whatever that caused trouble | ||
jnthnwrthngtn | timo: I just put a breakpoint on the line where printf was. I did trace it back a bit further, but then spotted there's a jit debug option, ran with that, and it told me where it was bailing out | 22:59 | |
timo | i wonder if i can find anything interesting by finding the exact spot where it happens tho | 23:00 | |
jnthnwrthngtn | At a wild guess, we create a label but never emit it, so it never gets fixed up by dynasm | ||
timo | did JIT_DEBUG also spit out the graphs and tile lists in the spesh log? perhaps that only comes after the error, so wouldn't show anything in our troubled case | 23:01 | |
jnthnwrthngtn | Hm, not sure...I think that's another option? | ||
The error is really late, fwiw | |||
It's storing the labels produced by dynasm into the spesh candidate and notices a negative one | 23:02 | ||
timo | yeah, after all the nodes have been put down, which includes one node for each exprjit graph | ||
jnthnwrthngtn | Yeah, we've even produced machine code by that point | ||
I wonder if labels get negative offsets before dynasm emits code and fixes them up as it does so | 23:03 | ||
Thus the "not emitted label" theory | |||
I don't know the expr jit well enough to know how plausible/likely that is | |||
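The "not emitted label" theory can be illustrated with a toy assembler. Whether dynasm really keeps unplaced dynamic labels at a negative offset is exactly the open question in the chat; this sketch simply assumes it, and `ToyAssembler` with all its methods is hypothetical.

```python
# Toy assembler illustrating "label created but never emitted".
# Assumption: a label's offset starts out negative and only becomes valid
# once the label is placed in the instruction stream.
from typing import List


class ToyAssembler:
    def __init__(self) -> None:
        self.code = bytearray()
        self.labels: List[int] = []

    def new_label(self) -> int:
        """Reserve a dynamic label; it has no position yet."""
        self.labels.append(-1)
        return len(self.labels) - 1

    def emit(self, data: bytes) -> None:
        self.code += data

    def place_label(self, label: int) -> None:
        """Bind the label to the current end of the code buffer."""
        self.labels[label] = len(self.code)

    def unplaced_labels(self) -> List[int]:
        """Labels that were branched to or reserved but never placed."""
        return [i for i, offset in enumerate(self.labels) if offset < 0]
```

A check like `unplaced_labels()` run right after assembly would correspond to the late point at which the "Negative offset for dynamic label" error is reported: machine code already exists, but one label never received a position.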
timo | right. same, really | 23:06 | |
23:07
linkable6 joined
|
|||
timo | 13: (branch (label $name)) | 23:13 | |
and then the label is nowhere to be seen! | |||
there is only (label $name) and (label :fail) in the tile list logs? ok that just means that's the tile that implements a piece of the tree, that's why it says $name there | 23:16 | ||
but it's definitely missing another appearance of the (label ...) tile | 23:17 | ||
it's a bit tricky to navigate the huge spesh logs we have, especially when there's thousands and thousands of lines just for updating stats but no specializations | 23:23 | ||
i'll print out some pointers or something to help me find the spot where actually a thing happens | 23:25 | ||
yeah i don't actually know where to look here to see what's going on | 23:51 | ||
i can dump the compiled bytecode before it tries to do the dynamic label fixup | 23:53 | ||
but i think i still need brrt to make sense of this problem | 23:57 |