00:16
brrt joined
|
|||
Geth | MoarVM/make_unbox_removal_available: 7 commits pushed by (Bart Wiegmans)++, (Timo Paulssen)++
|
01:18 | |
timotimo | it's not based off of the spesh-refactor-iffy branch because that doesn't have the getrusage change so "make test" fails | 01:19 | |
i didn't yet inspect spesh logs to see if this actually works properly and eliminates boxing actually | 01:20 | ||
i found a frame from BOOTSTRAP that uses the optimization. neat. it doesn't actually remove the boxing, just decides based on the result of atpos_i instead of the result of box_i | 01:39 | ||
oh, it's find_best_dispatchee | 01:40 | ||
interestingly it is marked as still having 18 usages | |||
it's only that one routine multiple times in the speshlog %) | 01:42 | ||
more on this tomorrow i guess | 01:53 | ||
01:55
ilbot3 joined
|
|||
MasterDuke | ooh, speeding up find_best_dispatchee would be nice, i know i've seen it toward the top of profiles before | 02:03 | |
02:17
brrt joined
02:19
AlexDaniel joined
04:18
brrt joined
05:29
domidumont joined
05:37
domidumont joined
05:42
brrt joined
|
|||
Geth | MoarVM: bdw++ created pull request #826: Refactor optimize_iffy |
05:44 | |
brrt | \o | ||
05:59
domidumont joined
07:02
brrt joined
|
|||
brrt | .tell timotimo that refactor-iffy should be based on top of the latest master so that shouldn't be a problem | 07:18 | |
yoleaux | brrt: I'll pass your message to timotimo. | ||
brrt | also, if you feel possitive that this optimize_unbox actually works, feel free to add it to the refactor-iffy branch :-) | 07:19 | |
07:26
robertle joined
08:10
zakharyas joined
08:21
zakharyas joined
09:26
zakharyas joined
10:21
committable6 joined
|
|||
timotimo | as i see it, the optimization works, but is essentially worthless because the boxed data is - in the few examples i looked at - always held alive by other pieces of the spesh graph | 10:42 | |
yoleaux | 07:18Z <brrt> timotimo: that refactor-iffy should be based on top of the latest master so that shouldn't be a problem | ||
timotimo | and bailing out immediately when i see a PHI node makes it only ever work during the first few BBs, which is very rare | 10:43 | |
grmbl, i see a bunch of PHI nodes with only two arguments ... i wonder what happens if i consider these harmless | 10:56 | ||
lizmat | timotimo: if I profile 'Nil for ^1000000', I see 2 strange things | 11:02 | |
timotimo | uh oh | ||
lizmat | 1. it allocates almost a million Ints | ||
2. but not quite (999985) so it looks like it's lost a few (perhaps in the final nursery?) | |||
timotimo | the allocation tracking happens immediately after a thing gets allocated, but it's not entirely accurate, since a little heuristic is still involved | 11:03 | |
lizmat | ok, so that would explain the 15 missing :-) | 11:04 | |
timotimo | there's ops that we don't know whether they actually do allocate or just give back something that had been allocated earlier | ||
lizmat | I guess I will have to look at Optimizer.nqp to see why the Int's are allocated ? | ||
timotimo | so what i did was check the memory address and only count it if it's in the newest few bytes of the nursery | ||
you should be able to see it in the spesh log | |||
since the profiler tells you which routine allocates them you can narrow it down in the spesh log by filename:linenumber | 11:05 | ||
and then there'll be ops like add_I or [p6]?box_i | |||
oh, the unbox optimization doesn't follow set instructions | 11:06 | ||
well, that's dumb :) | |||
oh, huh. we can't just blindly follow just any set | 11:07 | ||
brrt | timotimo: we can still reduce the unbox operation (evne if not the box operation) | 11:11 | |
and that in turn might give us a KNONW_VALUE etc | |||
timotimo | fair enough | 11:12 | |
lizmat | timotimo: looking at optimize_for_range in Optimizer.nqp: what is the reason we bind rather than assign ? | 11:19 | |
m: use nqp; my int $i = -1; nqp::while(nqp::islt_i(($i = nqp::add_i($i,1)),1000000),nqp::null); say now - INIT now | 11:20 | ||
camelia | 0.0180523 | ||
lizmat | m: use nqp; my $i = -1; nqp::while(nqp::islt_i(($i := nqp::add_i($i,1)),1000000),nqp::null); say now - INIT now | ||
camelia | 0.103382 | ||
lizmat | roughly 5.5x slower | ||
timotimo | huh, where's the check for int boundaries? i.e. where int goes over into Int? | 11:21 | |
m: int64.Range.say | 11:22 | ||
lizmat | get_bound | ||
camelia | -9223372036854775808..9223372036854775807 | ||
timotimo | ah, good | ||
lizmat | m: Int32.Range.say | ||
camelia | ===SORRY!=== Error while compiling <tmp> Undeclared name: Int32 used at line 1. Did you mean 'int32', 'uint32'? |
||
lizmat | m: int32.Range.say | ||
camelia | -2147483648..2147483647 | ||
lizmat | that are the values in get_bound | ||
I guess we need to restrict that because of 32bit builds ? | 11:23 | ||
timotimo | no, int is always 64bit | ||
lizmat | also on 32bit builds ? | ||
timotimo | yes | ||
lizmat has learned | |||
timotimo | ... at least i believe so? | ||
lizmat | was pretty sure it isn't | 11:24 | |
timotimo | huh. | ||
i mean we always reserve 64 bit of storage for int registers in our frames | 11:25 | ||
and every _i reprop takes an MVMint64 | |||
and all the _i math ops use the .i64 member of the register struct | |||
brrt | anyway, i'm decently sure the right way to do it, is not to do the chase the code between the box and unbox operation, but to have a set to a unique register inserted prior to the unbox, and use that unique registers' value | 11:26 | |
timotimo | but then our frames will grow :( | 11:27 | |
i mean, if the chase fails we can still do that | 11:30 | ||
lizmat | so how does one create the equivalent of 'my int $i' in QAST ? | 11:40 | |
QAST::Var.new( :name($it_var), :scope('local'), :decl('var'), :returns(int) ) ?? | |||
timotimo | i think :returns is it, yeah | 11:41 | |
lizmat | m: my int $i; $i := 42 | 11:42 | |
camelia | ===SORRY!=== Error while compiling <tmp> Cannot bind to natively typed variable '$i'; use assignment instead at <tmp>:1 ------> my int $i; $i := 42ā<EOL> |
||
lizmat | so how come we're binding in the optimizer then ? | ||
(line 1950) | 11:43 | ||
timotimo | not sure, maybe := isn't exactly what the bind op does? | ||
lizmat | m: use nqp; my int $i; nqp::assign($i,42); say $i # works | 11:45 | |
camelia | 42 | ||
lizmat | m: use nqp; my int $i; nqp::bind($i,42); say $i # doesn't work | ||
camelia | ===SORRY!=== Cannot bind to non-reference QAST::Var '$i' |
||
lizmat | m: use nqp; my $i; nqp::assign($i,42); say $i # does work, but isn't native | 11:46 | |
camelia | 42 | ||
timotimo | not sure what's up with that | 11:52 | |
lizmat | ok, I guess I'll leave it for another time | 11:53 | |
timotimo | i mean, nqp has native ints and you bind these, so some kind of binding must be possible there | ||
lizmat | ok, so maybe something else is generating the Ints | 11:54 | |
(I couldn't see in the spesh log :-) | |||
timotimo | that's still Nil for ^1000000 right? | 11:55 | |
lizmat | yup | ||
it allocates N-15 Ints | |||
Nil for ^1000000 takes 156 msecs in my profile | 11:57 | ||
the equivalent with native ints in nqp takes 12 msecs (and basically *no* allocations) | |||
so it feels to me we are missing out on a good optimization opportunity | 11:58 | ||
but I just can't seem to find where the Int is actually allocated | |||
timotimo | the block with the Nil in it gets a native int argument and stores it in a box | 12:02 | |
puts it into $_ and returns | |||
lizmat | ah, so you're saying it's actually an artefact ? | 12:03 | |
timotimo | no, it totally allocates the Ints | 12:04 | |
just not in the iterator machinery | |||
it happens after the call, so to say | |||
or "as part of" | |||
lizmat | ah, indeed | 12:06 | |
m: for ^100000 -> int $ { } # much faster | 12:07 | ||
camelia | ( no output ) | ||
timotimo | it's nice to see that the body gets inlined for the Nil for ^foo case | ||
we don't really do any kind of lexical analysis in spesh yet | |||
lizmat | m: for ^100000 -> { } | 12:08 | |
camelia | Too many positionals passed; expected 0 arguments but got 1 in block <unit> at <tmp> line 1 |
||
timotimo | perhaps the static optimizer would be able to throw out the $_ in the block? or at least turn it into a local ... it's marked "is dynamic", though, so that's only ever safe if no calls are made etc | ||
lizmat | perhaps we should just support "for ^100000 -> { }" in the mean time | 12:10 | |
timotimo | probably quite possible | 12:11 | |
12:18
brrt joined
|
|||
lizmat | timotimo: profiling "my @a = ^1000; for ^1000 { my $b = @a.min }" seems to generate a broken HTML-file | 12:18 | |
"The average nursery collection time was NaNms" and stuff like that | 12:19 | ||
brrt | . | ||
lizmat | timotimo: should I just make tickets for things like that? | ||
brrt | i think the cost of making the frame a bit bigger is totally not a problem | 12:24 | |
especially if that avoids doing a repr op (function call) | |||
12:25
AlexDaniel joined
|
|||
jnthn | m: for ^100000 -> { } | 12:46 | |
camelia | Too many positionals passed; expected 0 arguments but got 1 in block <unit> at <tmp> line 1 |
||
jnthn | That error is correct; it's surely a thinko | 12:47 | |
lizmat | jnthn: my point was: if I just want to repeat something N times | ||
and not interested in the iteration number | |||
jnthn | for ^N { } | ||
foo xx N | |||
lizmat | that would still pass $_ to the block, with its associated overhead | 12:48 | |
jnthn | Yeah, that isn't optimized out yet. | ||
lizmat | you mean, if $_ isn't referenced in the block ? | ||
jnthn | Just write a `loop` if wanting to squeeze that much out | 12:49 | |
I think it's tricky-ish in that $_ is marked dynamic | |||
Perhaps something for spesh that can see lack of $_ usage though | |||
(Like, some layers deep) | |||
lizmat | but you don't like "for ^100000 -> { }" as a valid syntax | 12:50 | |
I don't see it as a thinko, but as a compiler hint that I'm not interested in the iterator value | 12:51 | ||
jnthn | No | ||
I've sure made the thinko after forgetting to stick a variable there and been glad of the error | 12:52 | ||
Not often, but still... | |||
12:56
zakharyas joined
13:00
zakharyas joined
|
|||
lizmat | jnthn: understood | 13:06 | |
timotimo | lizmat: seems like the profiler is not expecting no GCs at all to happen ;) ;) | 13:16 | |
lizmat | aha! :-) | ||
perhaps a nqp::force_gc is in order then :-) | |||
timotimo | haha, no :) | 13:17 | |
oh! | |||
now i see what you meant | |||
lizmat | just the one, at the end, if there were none yet, indeed | ||
timotimo | holy wow | 13:18 | |
lizmat | ? | ||
timotimo | the call graph | ||
it seems to somehow infinitely recurse or something like that | 13:24 | ||
OK, this time it's an example of actually wrong inlining | 13:28 | ||
i mean in the profiler report | |||
it's thinking -e:1 is calling itself over and over, and so the call graph balloons up to gigantic sizes | |||
lizmat | ok, glad to have been able to provide you with a test case :-) | 13:30 | |
jnthn | OSR-sensitive? | ||
timotimo: It just occurred to me that if we're in an inline (or nested inline) and we deopt, then we never fall out of the inlines | 13:31 | ||
timotimo: Or more like, never reach a natural end of them and hit the profiler op saying we're leaving them | |||
I'm not sure how well the profiler copes with that | |||
Of course we will leave things but they won't be inlined any more :) | 13:32 | ||
timotimo | there's only one prof_exit, though, right? | ||
so as long as deopt doesn't change the current profile call node, the prof_exit should still hit the right data? | |||
jnthn | You'd hit prof_exit, but not from an inline, I guess | 13:33 | |
Hmm, do we actually have separate ops for that? | |||
timotimo | we only have different enter ops, only one kind of exit | 13:34 | |
jnthn | I can't remember whether exit is sensitive to inline or not. If not, well, ignore all I said :P | ||
timotimo | MVM_frame_unwind_to is the only other place we'd ever call "log exit" | 13:36 | |
and the implementation of the exit op is really just going back one step back the call graph | |||
jnthn | Hmm, ok | 13:38 | |
timotimo | there are 292 deopt_one's in the min function, though | 13:40 | |
jnthn | hah, there's a MoreVMs workshop... :P 2018.programming-conference.org/tr...reVMs-2018 | 14:01 | |
14:49
ggoebel joined
15:00
zakharyas joined
|
|||
timotimo | the call graph begins going further and further to the right as soon as the enter mode for this particular staticframe goes from 0 (i.e. regular interpreted entry) to 4 (inlined & jitted entry) | 15:05 | |
actually, it goes to 3 once in between, which is regular jit entry | 15:06 | ||
with jit disabled it also starts going further and further to the right, but this time it goes to entry mode 2, which is spesh_inline | 15:08 | ||
so at least the jit isn't at fault :) | |||
[Coke] | I hope someone submits a moarvm morevm paper | 15:13 | |
timotimo | i wouldn't want more VMS in my life | 15:15 | |
(this is a joke about VMS, which i know next to nothing about) | |||
jnthn: there's an FH around the inlined portion and the prof_exit is one BB before the FH Goto; could this be our issue? | 15:18 | ||
jnthn | ooh | 15:20 | |
Yes, it maybe could be that we rewrite control exceptions into gotos | |||
Across inline boundaries | |||
Which is very clever but...maybe very bad for the poor profiler :) | |||
timotimo | poor profiler indeed | ||
jnthn | Easy way to find out: disable that opt :) | ||
timotimo | aye | 15:21 | |
is that only optimize_throwcat? | 15:22 | ||
no other opt with a goto in it | |||
jnthn | Yeah, believe so | 15:23 | |
timotimo | i put in a very early return and that didn't help | 15:24 | |
if we have anything that unwinds the stack, it'll allegedly also prof_exit for us | |||
barely any cases where prof_log_unwind happens, though. perhaps it isn't that after all | 15:27 | ||
i should already have bisected this %) | 15:28 | ||
15:30
zakharyas joined
15:31
zakharyas joined
15:35
FROGGS joined
|
|||
timotimo | that didn't work so well :| | 15:35 | |
with an FH_GOTO we are supposed to unwind, so why is this going so wrong ... | 15:40 | ||
16:19
zakharyas joined
16:22
robertle joined
16:23
zakharyas joined
|
|||
timotimo | this is not relevant to this particular bug, but we can remove quite a few more prof_allocated than we do so far, for example when returning from an inlinee we prof_allocated the return value; i think that's superfluous at least some of the time. | 17:03 | |
17:25
domidumont joined
18:23
bisectable6 joined
18:31
ggoebel joined
19:07
committable6 joined
19:28
AlexDaniel joined
19:52
ggoebel joined
19:56
committable6 joined
|
|||
AlexDaniel | github.com/perl6/nqp/issues/431#is...-376655247 | 20:02 | |
robertle | I think I have some sort of patch to get moar/nqp work on armhf: NQP#431 not sure it's the right strategy though, any input welcome... | ||
synopsebot | NQP#431 [open]: github.com/perl6/nqp/issues/431 Bus error on armhf during t/moar/02-qast-references.t test | ||
AlexDaniel | timotimo: by the way, what should be done with your patch here? github.com/rakudo/rakudo/issues/12...-372052002 | 20:03 | |
20:03
releasable6 joined
|
|||
robertle | I'll check on the | 20:08 | |
state of that one once I get the armhf build to work... | |||
20:16
committable6 joined
|
|||
AlexDaniel | robertle: TL;DR on that one is that the patch actually fixes the issue on big endian systems, but I think dod38fr applied the patch manually, and so it's not in master | 20:18 | |
robertle | that's my understanding as well, but also that there is a remaining problem around t/02-rakudo/08-slangs.t that isn't understood yet | 20:19 | |
but getting confidence in that patch and it into master would be fab | |||
i'll try to find out what is going on in 08-slangs.t next | 20:20 | ||
[Coke] | if we get it as a PR, would that trigger some automatic testing? | ||
20:53
AlexDaniel joined
22:25
ggoebel joined
|
|||
timotimo | through painstaking single-stepping through bytecode and the interpreter i have come to the following conclusion: | 22:37 | |
throwpayloadlexcaller doesn't cause an unwind, so no exit is logged. this causes the call graph to become lopsided, or possibly popsicle-shaped i guess | 22:38 | ||
OK, so, we're reaching MVM_frame_unwind_to and the frame to unwind to is actually the tc->cur_frame | 22:48 | ||
so i suppose the LocatedHandler should include how many layers of inlines it went through so that the unwind can call MVM_profile_log_exit the right number of times | 22:54 | ||
jnthn | Aha | 23:00 | |
23:01
Voldenet joined
|
|||
timotimo | i'm not deep in the handler and exception stuff | 23:04 | |
could you have a look? | |||
you surely have a good inkling of what line to put a ++ of some variable into that'd count this? | 23:05 | ||
i've found one spot that makes no difference, and one that gives me too many exits | 23:06 | ||
jnthn | Well, in theory, every time you cross an inline boundary, that's a frame one | 23:10 | |
*gone | |||
Though at this point we're just finding the handler, not unwinding | |||
timotimo | right | ||
jnthn | Which could come somewhat later | ||
timotimo | i think there might be some hypothetical parts to the search, too? there's multiple while loops involved | 23:11 | |
jnthn | Does unwind_to know the address etc. we're unwinding from? | 23:12 | |
If so, it would be possible to call the profiler and have it walk the inlines table or some such | |||
Thing is | |||
timotimo | it has the abs_addr and the rel_addr parameters | ||
jnthn | I think this could happen: | 23:13 | |
1) Throw happens. We locate the handler, counting frames along the way | |||
2) Stash that count away | |||
3) Invoke the handler | |||
4) Handler does something that does global deopt | |||
That'd perhaps invalidate the count stashed away in step 2? | |||
(Handlers run on the stack top, in general) | 23:14 | ||
timotimo | right, they would | ||
well, if it does a global deopt, we don't have any inlines active any more, right? | |||
jnthn | So it might be more robust to try and do the figuring out at unwind time | ||
Correct | |||
timotimo | right | ||
this should happen not only if unwind_to gets passed a frame that's equal to tc->cur_frame, right? | 23:15 | ||
jnthn | Indeed, that's what I was about to mention | ||
But as I was going to do so I realized you'd walk over all the inline boundaries down the stack anyway | |||
So probably still have a sensible number...in absence of deopt | 23:16 | ||
timotimo | we wouldn't ever step *into* an inlinee with this, right? | ||
like, i wouldn't have to find the staticframe to MVM_profiler_log_enter for this case? | |||
jnthn | No, calling the handler is always a standard invocation at this point if it's a block handler | ||
I don't see that changing any time soon | 23:17 | ||
timotimo | good | ||
jnthn | So I don't think there's any trouble in that direction | ||
My gut feeling is it'd be more robust to do it at unwind time, however | |||
timotimo | right, i'll try that | ||
i.e. near the end of MVM_frame_unwind_to where it changes the actual address | 23:18 | ||
jnthn | I guess you can perhaps look at the current PC or JIT label or return address to figure out "where am I" | ||
Yes, exactly | |||
timotimo | hm, but also counting all inlines active in frames we completely skip | ||
jnthn | Nice think about this approach is it's also one nicely branch-predictable conditional to call into the profiler code | ||
*thing | |||
Or, in the non-profiling case, not call it | 23:19 | ||
timotimo | yeah, if not profiling, we'd ideally do nothing | ||
jnthn | Probably whatever logic is used for dynamic variable search is also somewhat applicable | ||
That's probably a somewhat relevant bit of prior art in the "reconstruct call stack as if it wasn't inlined" area | 23:20 | ||
timotimo | the find_contextual_by_name function mentions that what it's doing "isn't correct for leaf frames" because that'd never happen with getdynlex | ||
jnthn | .oO( Some day we need to do this for stack traces too... ) |
||
timotimo | so can't steal it as easily it seems | ||
jnthn | *nod* | ||
Yeah, this is true | 23:21 | ||
iirc exceptions stash away a throw_address or something though, which may help with the leaf frame? | |||
timotimo | i remember seeing that from dump_stacktrace | ||
jnthn | :) | 23:22 | |
OK, sleep time for me...good luck! :) | |||
timotimo | good night! | 23:24 | |
i might not stay up very late today (famous last words) | |||
OK, so throw_address is actually only available if we have an exception object, which we don't have in the case of throwpayload. however, existing code also just uses tc->interp_cur_op instead in that case. | 23:29 | ||
omfg it seems to work | 23:39 | ||
untested for more than one nested inline | 23:44 | ||
23:49
MasterDuke joined
|
|||
MasterDuke | timotimo: no more giant profiles? | 23:51 | |
timotimo | hopefully! | 23:54 | |
MasterDuke | ++timotimo | 23:58 |