00:16 brrt joined
Geth MoarVM/make_unbox_removal_available: 7 commits pushed by (Bart Wiegmans)++, (Timo Paulssen)++ 01:18
timotimo it's not based off of the spesh-refactor-iffy branch because that doesn't have the getrusage change so "make test" fails 01:19
i didn't yet inspect spesh logs to see if this actually works properly and eliminates the boxing 01:20
i found a frame from BOOTSTRAP that uses the optimization. neat. it doesn't actually remove the boxing, just decides based on the result of atpos_i instead of the result of box_i 01:39
oh, it's find_best_dispatchee 01:40
interestingly it is marked as still having 18 usages
it's only that one routine multiple times in the speshlog %) 01:42
more on this tomorrow i guess 01:53
01:55 ilbot3 joined
MasterDuke ooh, speeding up find_best_dispatchee would be nice, i know i've seen it toward the top of profiles before 02:03
02:17 brrt joined 02:19 AlexDaniel joined 04:18 brrt joined 05:29 domidumont joined 05:37 domidumont joined 05:42 brrt joined
Geth MoarVM: bdw++ created pull request #826:
Refactor optimize_iffy
05:44
brrt \o
05:59 domidumont joined 07:02 brrt joined
brrt .tell timotimo that refactor-iffy should be based on top of the latest master so that shouldn't be a problem 07:18
yoleaux brrt: I'll pass your message to timotimo.
brrt also, if you feel positive that this optimize_unbox actually works, feel free to add it to the refactor-iffy branch :-) 07:19
07:26 robertle joined 08:10 zakharyas joined 08:21 zakharyas joined 09:26 zakharyas joined 10:21 committable6 joined
timotimo as i see it, the optimization works, but is essentially worthless because the boxed data is - in the few examples i looked at - always held alive by other pieces of the spesh graph 10:42
yoleaux 07:18Z <brrt> timotimo: that refactor-iffy should be based on top of the latest master so that shouldn't be a problem
timotimo and bailing out immediately when i see a PHI node makes it only ever work during the first few BBs, which is very rare 10:43
grmbl, i see a bunch of PHI nodes with only two arguments ... i wonder what happens if i consider these harmless 10:56
lizmat timotimo: if I profile 'Nil for ^1000000', I see 2 strange things 11:02
timotimo uh oh
lizmat 1. it allocates almost a million Ints
2. but not quite (999985) so it looks like it's lost a few (perhaps in the final nursery?)
timotimo the allocation tracking happens immediately after a thing gets allocated, but it's not entirely accurate, since a little heuristic is still involved 11:03
lizmat ok, so that would explain the 15 missing :-) 11:04
timotimo there's ops that we don't know whether they actually do allocate or just give back something that had been allocated earlier
lizmat I guess I will have to look at Optimizer.nqp to see why the Ints are allocated ?
timotimo so what i did was check the memory address and only count it if it's in the newest few bytes of the nursery
you should be able to see it in the spesh log
since the profiler tells you which routine allocates them you can narrow it down in the spesh log by filename:linenumber 11:05
and then there'll be ops like add_I or [p6]?box_i
oh, the unbox optimization doesn't follow set instructions 11:06
well, that's dumb :)
oh, huh. we can't just blindly follow just any set 11:07
brrt timotimo: we can still reduce the unbox operation (even if not the box operation) 11:11
and that in turn might give us a KNOWN_VALUE etc
timotimo fair enough 11:12
lizmat timotimo: looking at optimize_for_range in Optimizer.nqp: what is the reason we bind rather than assign ? 11:19
m: use nqp; my int $i = -1; nqp::while(nqp::islt_i(($i = nqp::add_i($i,1)),1000000),nqp::null); say now - INIT now 11:20
camelia 0.0180523
lizmat m: use nqp; my $i = -1; nqp::while(nqp::islt_i(($i := nqp::add_i($i,1)),1000000),nqp::null); say now - INIT now
camelia 0.103382
lizmat roughly 5.5x slower
timotimo huh, where's the check for int boundaries? i.e. where int goes over into Int? 11:21
m: int64.Range.say 11:22
lizmat get_bound
camelia -9223372036854775808..9223372036854775807
timotimo ah, good
lizmat m: Int32.Range.say
camelia ===SORRY!=== Error while compiling <tmp>
Undeclared name:
Int32 used at line 1. Did you mean 'int32', 'uint32'?
lizmat m: int32.Range.say
camelia -2147483648..2147483647
lizmat those are the values in get_bound
I guess we need to restrict that because of 32bit builds ? 11:23
timotimo no, int is always 64bit
lizmat also on 32bit builds ?
timotimo yes
lizmat has learned
timotimo ... at least i believe so?
lizmat was pretty sure it isn't 11:24
timotimo huh.
i mean we always reserve 64 bit of storage for int registers in our frames 11:25
and every _i repr op takes an MVMint64
and all the _i math ops use the .i64 member of the register struct
brrt anyway, i'm decently sure the right way to do it is not to chase the code between the box and unbox operations, but to insert a set to a unique register prior to the unbox, and use that unique register's value 11:26
timotimo but then our frames will grow :( 11:27
i mean, if the chase fails we can still do that 11:30
lizmat so how does one create the equivalent of 'my int $i' in QAST ? 11:40
QAST::Var.new( :name($it_var), :scope('local'), :decl('var'), :returns(int) ) ??
timotimo i think :returns is it, yeah 11:41
lizmat m: my int $i; $i := 42 11:42
camelia ===SORRY!=== Error while compiling <tmp>
Cannot bind to natively typed variable '$i'; use assignment instead
at <tmp>:1
------> my int $i; $i := 42⏏<EOL>
lizmat so how come we're binding in the optimizer then ?
(line 1950) 11:43
timotimo not sure, maybe := isn't exactly what the bind op does?
lizmat m: use nqp; my int $i; nqp::assign($i,42); say $i # works 11:45
camelia 42
lizmat m: use nqp; my int $i; nqp::bind($i,42); say $i # doesn't work
camelia ===SORRY!===
Cannot bind to non-reference QAST::Var '$i'
lizmat m: use nqp; my $i; nqp::assign($i,42); say $i # does work, but isn't native 11:46
camelia 42
timotimo not sure what's up with that 11:52
lizmat ok, I guess I'll leave it for another time 11:53
timotimo i mean, nqp has native ints and you bind these, so some kind of binding must be possible there
lizmat ok, so maybe something else is generating the Ints 11:54
(I couldn't see in the spesh log :-)
timotimo that's still Nil for ^1000000 right? 11:55
lizmat yup
it allocates N-15 Ints
Nil for ^1000000 takes 156 msecs in my profile 11:57
the equivalent with native ints in nqp takes 12 msecs (and basically *no* allocations)
so it feels to me we are missing out on a good optimization opportunity 11:58
but I just can't seem to find where the Int is actually allocated
timotimo the block with the Nil in it gets a native int argument and stores it in a box 12:02
puts it into $_ and returns
lizmat ah, so you're saying it's actually an artefact ? 12:03
timotimo no, it totally allocates the Ints 12:04
just not in the iterator machinery
it happens after the call, so to say
or "as part of"
lizmat ah, indeed 12:06
m: for ^100000 -> int $ { } # much faster 12:07
camelia ( no output )
timotimo it's nice to see that the body gets inlined for the Nil for ^foo case
we don't really do any kind of lexical analysis in spesh yet
lizmat m: for ^100000 -> { } 12:08
camelia Too many positionals passed; expected 0 arguments but got 1
in block <unit> at <tmp> line 1
timotimo perhaps the static optimizer would be able to throw out the $_ in the block? or at least turn it into a local ... it's marked "is dynamic", though, so that's only ever safe if no calls are made etc
lizmat perhaps we should just support "for ^100000 -> { }" in the mean time 12:10
timotimo probably quite possible 12:11
12:18 brrt joined
lizmat timotimo: profiling "my @a = ^1000; for ^1000 { my $b = @a.min }" seems to generate a broken HTML-file 12:18
"The average nursery collection time was NaNms" and stuff like that 12:19
brrt .
lizmat timotimo: should I just make tickets for things like that?
brrt i think the cost of making the frame a bit bigger is totally not a problem 12:24
especially if that avoids doing a repr op (function call)
12:25 AlexDaniel joined
jnthn m: for ^100000 -> { } 12:46
camelia Too many positionals passed; expected 0 arguments but got 1
in block <unit> at <tmp> line 1
jnthn That error is correct; it's surely a thinko 12:47
lizmat jnthn: my point was: if I just want to repeat something N times
and not interested in the iteration number
jnthn for ^N { }
foo xx N
lizmat that would still pass $_ to the block, with its associated overhead 12:48
jnthn Yeah, that isn't optimized out yet.
lizmat you mean, if $_ isn't referenced in the block ?
jnthn Just write a `loop` if wanting to squeeze that much out 12:49
I think it's tricky-ish in that $_ is marked dynamic
Perhaps something for spesh that can see lack of $_ usage though
(Like, some layers deep)
lizmat but you don't like "for ^100000 -> { }" as a valid syntax 12:50
I don't see it as a thinko, but as a compiler hint that I'm not interested in the iterator value 12:51
jnthn No
I've sure made that thinko by forgetting to stick a variable there, and been glad of the error 12:52
Not often, but still...
12:56 zakharyas joined 13:00 zakharyas joined
lizmat jnthn: understood 13:06
timotimo lizmat: seems like the profiler is not expecting no GCs at all to happen ;) ;) 13:16
lizmat aha! :-)
perhaps a nqp::force_gc is in order then :-)
timotimo haha, no :) 13:17
oh!
now i see what you meant
lizmat just the one, at the end, if there were none yet, indeed
timotimo holy wow 13:18
lizmat ?
timotimo the call graph
it seems to somehow infinitely recurse or something like that 13:24
OK, this time it's an example of actually wrong inlining 13:28
i mean in the profiler report
it's thinking -e:1 is calling itself over and over, and so the call graph balloons up to gigantic sizes
lizmat ok, glad to have been able to provide you with a test case :-) 13:30
jnthn OSR-sensitive?
timotimo: It just occurred to me that if we're in an inline (or nested inline) and we deopt, then we never fall out of the inlines 13:31
timotimo: Or more like, never reach a natural end of them and hit the profiler op saying we're leaving them
I'm not sure how well the profiler copes with that
Of course we will leave things but they won't be inlined any more :) 13:32
timotimo there's only one prof_exit, though, right?
so as long as deopt doesn't change the current profile call node, the prof_exit should still hit the right data?
jnthn You'd hit prof_exit, but not from an inline, I guess 13:33
Hmm, do we actually have separate ops for that?
timotimo we only have different enter ops, only one kind of exit 13:34
jnthn I can't remember whether exit is sensitive to inline or not. If not, well, ignore all I said :P
timotimo MVM_frame_unwind_to is the only other place we'd ever call "log exit" 13:36
and the implementation of the exit op is really just going back one step back the call graph
jnthn Hmm, ok 13:38
timotimo there are 292 deopt_one's in the min function, though 13:40
jnthn hah, there's a MoreVMs workshop... :P 2018.programming-conference.org/tr...reVMs-2018 14:01
14:49 ggoebel joined 15:00 zakharyas joined
timotimo the call graph begins going further and further to the right as soon as the enter mode for this particular staticframe goes from 0 (i.e. regular interpreted entry) to 4 (inlined & jitted entry) 15:05
actually, it goes to 3 once in between, which is regular jit entry 15:06
with jit disabled it also starts going further and further to the right, but this time it goes to entry mode 2, which is spesh_inline 15:08
so at least the jit isn't at fault :)
[Coke] I hope someone submits a moarvm morevm paper 15:13
timotimo i wouldn't want more VMS in my life 15:15
(this is a joke about VMS, which i know next to nothing about)
jnthn: there's an FH around the inlined portion and the prof_exit is one BB before the FH Goto; could this be our issue? 15:18
jnthn ooh 15:20
Yes, it maybe could be that we rewrite control exceptions into gotos
Across inline boundaries
Which is very clever but...maybe very bad for the poor profiler :)
timotimo poor profiler indeed
jnthn Easy way to find out: disable that opt :)
timotimo aye 15:21
is that only optimize_throwcat? 15:22
no other opt with a goto in it
jnthn Yeah, believe so 15:23
timotimo i put in a very early return and that didn't help 15:24
if we have anything that unwinds the stack, it'll allegedly also prof_exit for us
barely any cases where prof_log_unwind happens, though. perhaps it isn't that after all 15:27
i should already have bisected this %) 15:28
15:30 zakharyas joined 15:31 zakharyas joined 15:35 FROGGS joined
timotimo that didn't work so well :| 15:35
with an FH_GOTO we are supposed to unwind, so why is this going so wrong ... 15:40
16:19 zakharyas joined 16:22 robertle joined 16:23 zakharyas joined
timotimo this is not relevant to this particular bug, but we can remove quite a few more prof_allocated than we do so far, for example when returning from an inlinee we prof_allocated the return value; i think that's superfluous at least some of the time. 17:03
17:25 domidumont joined 18:23 bisectable6 joined 18:31 ggoebel joined 19:07 committable6 joined 19:28 AlexDaniel joined 19:52 ggoebel joined 19:56 committable6 joined
AlexDaniel github.com/perl6/nqp/issues/431#is...-376655247 20:02
robertle I think I have some sort of patch to get moar/nqp to work on armhf: NQP#431; not sure it's the right strategy though, any input welcome...
synopsebot NQP#431 [open]: github.com/perl6/nqp/issues/431 Bus error on armhf during t/moar/02-qast-references.t test
AlexDaniel timotimo: by the way, what should be done with your patch here? github.com/rakudo/rakudo/issues/12...-372052002 20:03
20:03 releasable6 joined
robertle I'll check on the state of that one once I get the armhf build to work... 20:08
20:16 committable6 joined
AlexDaniel robertle: TL;DR on that one is that the patch actually fixes the issue on big endian systems, but I think dod38fr applied the patch manually, and so it's not in master 20:18
robertle that's my understanding as well, but also that there is a remaining problem around t/02-rakudo/08-slangs.t that isn't understood yet 20:19
but getting confidence in that patch and it into master would be fab
i'll try to find out what is going on in 08-slangs.t next 20:20
[Coke] if we get it as a PR, would that trigger some automatic testing?
20:53 AlexDaniel joined 22:25 ggoebel joined
timotimo through painstaking single-stepping through bytecode and the interpreter i have come to the following conclusion: 22:37
throwpayloadlexcaller doesn't cause an unwind, so no exit is logged. this causes the call graph to become lopsided, or possibly popsicle-shaped i guess 22:38
OK, so, we're reaching MVM_frame_unwind_to and the frame to unwind to is actually the tc->cur_frame 22:48
so i suppose the LocatedHandler should include how many layers of inlines it went through so that the unwind can call MVM_profile_log_exit the right number of times 22:54
jnthn Aha 23:00
23:01 Voldenet joined
timotimo i'm not deep in the handler and exception stuff 23:04
could you have a look?
you surely have a good inkling of what line to put a ++ of some variable into that'd count this? 23:05
i've found one spot that makes no difference, and one that gives me too many exits 23:06
jnthn Well, in theory, every time you cross an inline boundary, that's a frame one 23:10
*gone
Though at this point we're just finding the handler, not unwinding
timotimo right
jnthn Which could come somewhat later
timotimo i think there might be some hypothetical parts to the search, too? there's multiple while loops involved 23:11
jnthn Does unwind_to know the address etc. we're unwinding from? 23:12
If so, it would be possible to call the profiler and have it walk the inlines table or some such
Thing is
timotimo it has the abs_addr and the rel_addr parameters
jnthn I think this could happen: 23:13
1) Throw happens. We locate the handler, counting frames along the way
2) Stash that count away
3) Invoke the handler
4) Handler does something that does global deopt
That'd perhaps invalidate the count stashed away in step 2?
(Handlers run on the stack top, in general) 23:14
timotimo right, they would
well, if it does a global deopt, we don't have any inlines active any more, right?
jnthn So it might be more robust to try and do the figuring out at unwind time
Correct
timotimo right
this should happen not only if unwind_to gets passed a frame that's equal to tc->cur_frame, right? 23:15
jnthn Indeed, that's what I was about to mention
But as I was going to do so I realized you'd walk over all the inline boundaries down the stack anyway
So probably still have a sensible number...in absence of deopt 23:16
timotimo we wouldn't ever step *into* an inlinee with this, right?
like, i wouldn't have to find the staticframe to MVM_profiler_log_enter for this case?
jnthn No, calling the handler is always a standard invocation at this point if it's a block handler
I don't see that changing any time soon 23:17
timotimo good
jnthn So I don't think there's any trouble in that direction
My gut feeling is it'd be more robust to do it at unwind time, however
timotimo right, i'll try that
i.e. near the end of MVM_frame_unwind_to where it changes the actual address 23:18
jnthn I guess you can perhaps look at the current PC or JIT label or return address to figure out "where am I"
Yes, exactly
timotimo hm, but also counting all inlines active in frames we completely skip
jnthn Nice think about this approach is it's also one nicely branch-predictable conditional to call into the profiler code
*thing
Or, in the non-profiling case, not call it 23:19
timotimo yeah, if not profiling, we'd ideally do nothing
jnthn Probably whatever logic is used for dynamic variable search is also somewhat applicable
That's probably a somewhat relevant bit of prior art in the "reconstruct call stack as if it wasn't inlined" area 23:20
timotimo the find_contextual_by_name function mentions that what it's doing "isn't correct for leaf frames" because that'd never happen with getdynlex
jnthn
.oO( Some day we need to do this for stack traces too... )
timotimo so can't steal it as easily it seems
jnthn *nod*
Yeah, this is true 23:21
iirc exceptions stash away a throw_address or something though, which may help with the leaf frame?
timotimo i remember seeing that from dump_stacktrace
jnthn :) 23:22
OK, sleep time for me...good luck! :)
timotimo good night! 23:24
i might not stay up very late today (famous last words)
OK, so throw_address is actually only available if we have an exception object, which we don't have in the case of throwpayload. however, existing code also just uses tc->interp_cur_op instead in that case. 23:29
omfg it seems to work 23:39
untested for more than one nested inline 23:44
23:49 MasterDuke joined
MasterDuke timotimo: no more giant profiles? 23:51
timotimo hopefully! 23:54
MasterDuke ++timotimo 23:58