00:02 raiph joined
dalek | MoarVM: 4538f61 | jnthn++ | src/ (3 files): Cache dynlex lookups. As suggested by TimToady++, we stash them within frames, so we get lifetime management for free (including if continuations happen). We poke it a few frames down the stack at various intervals, to try and maximize the benefit. Can likely tune this a bit more yet. | 00:12 |
timotimo | nice :) | 00:14 | |
jnthn | Need to lose another 1.48s before I can say I can build Rakudo in 70s. :) | 00:15 | |
timotimo | how much is that worth? | 00:16 | |
er. | |||
how much did your last commit improve build times? | 00:17 | ||
00:17 avuserow joined
jnthn | Was about another second off the Rakudo build. | 00:19 | |
So, at least a % | |||
timotimo | sweet! | ||
you said you timed it at about 1.8% recently; so maybe you halved the time spent in dynvar lookups? :) | 00:20 | ||
jnthn | Yeah; I'll need to do a C level profile again at some point. | 00:21 | |
Wowza. Attempting to optimize junctions creates 103966 closures when compiling CORE.setting... | 00:22 | ||
timotimo | attempting? | ||
jnthn | Well, we may succeed | ||
timotimo | how did i do that :( | ||
too many non-inlined blocks? | |||
jnthn | non-inlinable | 00:24 | |
3 nested subs | 00:25 | ||
The optimizer (and I'm guilty too) has some very large methods in it. | |||
timotimo | ah, those nested subs could be un-nested and just take more arguments | ||
that would help, right? | |||
jnthn | Which aren't too maintainer friendly, but aren't exactly spesh-friendly or optimizer friendly either | ||
Well, trying an easier refactor that's probably as effective. | |||
timotimo | OK | 00:26 | |
jnthn | Basically, pull the transform into a separate method from the analysis. | 00:27 | |
Yes, that helps a lot. | 00:29 | ||
Though it's not the biggest source of issues, just the most stand-out one | |||
timotimo | how do you measure what part of the process generates how many closures? | 00:33 | |
jnthn | Patch to takeclosure in frame.c that just prints out the outer frame name. | 00:35 | |
timotimo | ah, OK | 00:37 | |
and a | sort | uniq -c | sort -n | |||
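The counting pipeline works on any one-name-per-line log; for example, with a fabricated log standing in for the takeclosure debug output:

```shell
# Fabricated stand-in for the takeclosure patch's output: one outer
# frame name printed per closure taken.
printf 'optimize_call\nvisit_op\noptimize_call\noptimize_call\nvisit_op\nannotate\n' > closures.log

# Aggregate: occurrences per frame name, least frequent first.
sort closures.log | uniq -c | sort -n
```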
00:50 colomon joined
01:48 FROGGS_ joined
01:56 cognominal joined
02:01 FROGGS_ joined
03:00 jimmyz joined
jimmyz | Stage parse : 36.933, 1.4s lower since yesterday :) | 03:00 | |
03:01 tadzik joined, ventica joined, cognominal joined
03:02 avuserow joined
xiaomiao | I wonder what the standard deviation of those benchmarks is ;) | 03:31 | |
37sec +-1 sec, that's about 3% ... that could be "noise" | |||
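xiaomiao's point can be checked directly: a few repeated runs give an estimate of the run-to-run noise, against which a 1.4s delta can be judged. The timings below are made up for illustration, not real measurements from this channel:

```python
import statistics

# Illustrative Stage parse timings in seconds (fabricated numbers).
runs = [36.9, 37.4, 36.5, 37.1, 36.8]
sd = statistics.stdev(runs)
print(f"mean {statistics.mean(runs):.2f}s, stdev {sd:.2f}s")

# A 1.4s day-over-day drop is only convincing if it clearly exceeds
# run-to-run noise; a 2-sigma threshold is a common rough cut.
print("improvement exceeds noise" if 1.4 > 2 * sd else "within noise")
```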
03:52 ilbot3 joined
03:56 ventica joined
05:18 avuserow joined
05:35 bcode joined
05:57 avuserow joined
sergot | o/ | 06:18 | |
07:19 ventica joined
07:24 cognome joined
07:30 cognome_ joined
07:50 ventica joined
08:01 ventica joined
08:13 zakharyas joined
08:18 Ven joined
masak | \o | 08:39 | |
nwc10 | o/ | ||
08:47 FROGGS[mobile] joined
08:53 brrt joined
09:06 brrt joined, brrt left
jnthn | o/ | 09:21 | |
nwc10 | OK, so one key part of testing is "don't fill the disk" | 09:31 | |
masak | heh. | 09:32 | |
09:33 colomon joined
09:52 jose__ joined
nwc10 | m: say 6.7812e+01/6.8598e+01 | 10:31 | |
camelia | rakudo-moar 89c8e4: OUTPUT«0.988541939998251» | ||
nwc10 | jnthn: that's the setting build speedup, once the disk is only 90% used | ||
jnthn | nwc10: Speedup since when, exactly? :) | 10:36 | |
nwc10 | er, last time I measure it. Which was probably yesterday morning. | ||
grammar/fingers gah | 10:37 | ||
I'm going to measure parrot performance again, to see if it gained more | 10:38 | ||
11:05 carlin joined
dalek | MoarVM: b6a9cad | jnthn++ | src/core/frame.c: Fix an uninitialized variable bug. | 12:20 |
12:22 klaas-janstol joined
12:42 oetiker joined
dalek | MoarVM: e92aa36 | jnthn++ | src/6model/ (13 files): De-virtualize most reader functions. No point to call the same thing every time through a function pointer. | 12:53 |
MoarVM: 9a3a96d | jnthn++ | src/6model/serialization.c: Bump minimum serialization format version. This in turn enables us to assume we have varints in the thing we are reading, which we have for quite a while now. |
MoarVM: 6da5b90 | jnthn++ | src/6model/ (9 files): De-virtualize read_var_int. |
MoarVM: f55e682 | jnthn++ | src/6model/ (14 files): De-virtualize serialization write functions. Again, the abstraction was unused and unrequired. | 13:03 |
nwc10 | m: say 6.597e+01/6.7812e+01 | 13:55 | |
camelia | rakudo-moar 4d347f: OUTPUT«0.972836666076801» | ||
nwc10 | er, so that's 2.5% speedup since this morning | ||
FROGGS[mobile] | O.o | 14:06 | |
14:15 zakharyas joined
[Coke] | does the .msi have rakudo-moar in it? | 14:36 | |
so, I have someone who is a cpan module author, who has written XS stuff, has hacked on perl core in the past... and he's too intimidated to use perl 6. | 14:37 | ||
timotimo | to *use* it? | ||
interesting, should be a good "test subject" :) | |||
[Coke] | even to get a copy setup to play with. | 14:38 | |
timotimo | he's on windows, yeah? | ||
froggs had a rakudo star moarvm msi release candidate at one point | |||
[Coke] | well, step one, we need a better story on perl6.org. | ||
timotimo | nobody tested it, so it disappeared again | ||
btyler | perl6 is extremely intimidating at first, because the vast majority of the code you encounter 'casually' is straight from the core rakudo crowd | 14:39 | |
and that code tends to be rather dense, in the interest of maximally demonstrating power in minimal space | 14:40 | ||
[Coke] | I think we could borrow some ideas from the mojolicio.us site. | ||
btyler | most perl 5 code you might encounter randomly is more or less baby perl | ||
[Coke] | btyler: he's not even at code. too many options before that. | 14:41 | |
btyler | ah, sorry, projected from my own experience too much :) | ||
timotimo | that does happen, yeah :( | 14:44 | |
i thought about giving perl6.org an "express lane" | |||
[Coke] creates a playground to test with... | 14:47 | ||
timotimo | what is this "playground"? :) | ||
[Coke] | a fork. | ||
timotimo | ah, of course | 14:48 | |
[Coke] trips over the prereqs. whoops. | 14:56 | ||
hoelzro | [Coke]++ | 15:00 | |
[Coke] | ... I thought I was in #perl6 this whole time. | 15:05 | |
timotimo | ah | ||
[Coke] | whoops | ||
15:18 ventica joined
dalek | Heuristic branch merge: pushed 16 commits to MoarVM/moar-jit by jnthn | 15:29 | |
jnthn | brrt: Updated moar-jit to master, are confirming it works. :) | ||
timotimo | "are confirming"? | 15:35 | |
jnthn | *after | ||
timotimo | ah, excellent! | ||
and even jit-moar-ops is in there | |||
things are looking mighty fine :) | 15:36 | ||
jnthn | Except that things are slower with the JIT enabled... | 15:37 | |
timotimo | yeah | 15:38 | |
probably just spending too much time aborting frames, still? | |||
jnthn | Not sure yet | 15:39 | |
Seeing if I can discover anything. | |||
timotimo | have you counted how often the jit-invocation opcode got hit? | ||
dalek | MoarVM/moar-jit: 3f22397 | jnthn++ | Configure.pl: Make dynasm rule work on nmake. | 15:46 |
MoarVM/moar-jit: bafbc3b | jnthn++ | src/jit/emit_win32_x64.c: Win32 JIT output was behind. |
jnthn | Oddly, my profiler claims that we spend 6% of the time in JITted code, but the time spent in the interpreter only goes down by 1% | 16:00 | |
timotimo | oh, huh? | 16:02 | |
but the jitted code ought to be at least a bit faster, right? | 16:03 | ||
hm, except | |||
if gcc strongly optimizes the interpreter loop, maybe it handles moving stuff from register to register directly instead of going through our locals storage? | |||
i don't quite see how that would be doable without "unrolling" the interpreter loop, though | 16:04 | ||
16:04 ventica joined
lizmat | btyler / [Coke] : TheDamian gave a nice example of how he ported a perl 5 utility of his to perl 6 | 16:07 | |
at OSCON, wonder where that code lives nowadays | |||
japhb | lizmat: Is there a video of that? | 16:25 | |
timotimo | i want to know, too | ||
lizmat | yes, check out OSCON videos :-) | ||
japhb | 2014? | ||
timotimo | jnthn: does the jit dump the generated bytecode to files, perhaps? | 16:29 | |
Got negative offset for dynamic label 6 - i wonder where that comes from? | 16:30 | ||
jnthn | Not by default, afaict | 16:31 | |
timotimo | even with a jit log i get 32.407 for stage parse | 16:32 | |
that's not too bad, is it? | |||
jnthn | If you set MVM_JIT_DISABLE=1 here, it comes out slower than with JIT, though. | 16:33 | |
uh, faster than with JIT | |||
timotimo | hold on. | 16:34 | |
only about 0.3 seconds | 16:35 | ||
hm. maybe 0.5 | 16:36 | ||
16:39 cognome joined
timotimo | 788 frames compiled | 16:39 | |
japhb | I'm not sure we can expect the JIT to be loads faster than spesh until we move from "the easy way that works" to "optimizing all the cycles". A JIT is an expensive thing, and you have to win it back with seriously tuned output. | ||
timotimo | 882 bails | ||
japhb | Especially while the execution flow has to bounce in and out of JIT land | ||
timotimo | sp_findmeth is still the king | ||
with 271 | 16:40 | ||
(probably because of much improved bytecode? maybe we have less frames all-in-all now?) | |||
japhb | Getting it working with just neutral performance v. spesh is already a good thing, because it would mean the generated code is enough faster to make up for the cost of generating it. | ||
timotimo | yes | 16:41 | |
japhb | oh, timotimo: did you look at the flame chart info I sent you in #perl6 earlier? | 16:42 | |
timotimo | yes, pretty! | ||
japhb | Man, I want that for my Perl 6 code .... | ||
timotimo | well, with the "perf" line from that one blog post you can already get that for the c-level stuff | 16:43 | |
16:45 cognominal joined, cognome joined
jnthn | Thing is that it's hard to explain it as "JIT takes time", when my profiler is telling me 0.1% of the time is spent doing that. | 16:46 | |
timotimo | hm. how does that measure time spent in c functions called from the jit? | 16:47 | |
oh, that number is for "jitting frames" | |||
jnthn | yES | 16:48 | |
*yes | |||
I'm just wondering if it's because CORE.setting's deopt count is epic. | |||
timotimo | how come we have "loadlib" ops in "name", "type", "box_target", "positional_delegate" and "associative_delegate"? | ||
jnthn | And falling back out of the JIT when deopting is more expensive than a switch-code-in--interpreter deopt. | 16:49 | |
timotimo | and has_accessor? | ||
jnthn | timotimo: um...not sure I follow? | ||
timotimo | in the jit bail log i see a bunch of failures with the loadlib opcode | ||
i ... don't think i understand what it does | |||
ah, that op would expect to hit the cache a bunch of times | 16:50 | ||
i hope the lock contention isn't too bad on that when we get to multithreaded apps. but i don't even know under what circumstances loadlib opcodes are generated | 16:51 | ||
jnthn | loadlib is hot? | ||
timotimo | don't think it is | ||
jnthn | That'd be...odd | ||
timotimo | just 9 bails | ||
ah, loadlib is probably just used to get a handle to a library and then findsym would be used to get at whatever symbols it'd expose | 16:52 | ||
that sounds like something that could spesh well. | |||
jnthn | What are you seeing loadlib in? | 16:54 | |
timotimo | jit bail log | ||
jnthn | For? | ||
timotimo | the core setting | 16:55 | |
don't let me distract you, it's probably nothing | |||
oh, that could be the methods of the Perl6::Compiler | 16:58 | ||
jnthn | timotimo: Did you do some work on reducing guards at some point? | 17:03 | |
origin/split_get_use_facts <- was that pending review? | 17:04 | ||
17:08 FROGGS joined
FROGGS | o/ | 17:08 | |
jnthn | o/ FROGGS | ||
TimToady | \o | 17:14 | |
carlin | ∿ | 17:15 | |
17:32 colomon joined
dalek | MoarVM: 9d377a3 | (Timo Paulssen)++ | src/ (3 files): split get_facts and use_facts from get_and_use_facts. | 17:45 |
MoarVM: be8cfdf | (Timo Paulssen)++ | src/spesh/optimize.h: fix teh build |
MoarVM: b57061e | jnthn++ | src/spesh/osr.c: Ensure OSR-triggered optimize is used next invoke. |
MoarVM: 8df127a | jnthn++ | src/ (3 files): Merge remote-tracking branch 'origin/split_get_use_facts' |
MoarVM: 49f19ca | jnthn++ | src/spesh/log.h: Tweak spesh log run count. Bump minimum bytecode version to 2. |
jnthn | timotimo: merged the branch, thanks :) | 17:46 | |
timotimo | oh, that | 18:53 | |
nice :) | |||
nwc10 | Result: PASS | 19:01 | |
jnthn | Nice. Time to break more stuff :P | 19:14 | |
nwc10 | other people could just write more tests | 19:15 | |
timotimo | jnthn: about the loadlib thing i said earlier: there's a bunch of frames that look exactly like this: gist.github.com/timo/9e49a3806f02857a484f | ||
jnthn | What on earth... | 19:16 | |
[Coke] | do we have a pic of some kind somewhere to show the flow of a program through rakudo when it's on Moar? (esp. with the new spesh/jit stuff?) | 19:17 | |
timotimo | my thoughts exactly. | ||
jnthn | No. If you're lucky I might draw one for my YAPC::EU talk though :) | 19:18 | |
[Coke] | jnthn: perfect, that'd be fine! | 19:20 | |
DAMMIT, it's in Sofia!? | |||
I have free beer waiting for me in Sofia! | 19:21 | ||
... I cannot remember the name of the guy who owes me the beer. *sadface*. it's been too long. | |||
timotimo | jnthn: what's keeping us from closing the loop on the "put argument names into callsites" optimization? | 19:22 | |
jnthn | timotimo: No much; it's just fiddly and annoying to do and will have a fairly low ROI | 19:23 | |
timotimo | OK then | 19:24 | |
timotimo pushes it further to the back :P | 19:25 | ||
jnthn: would you be interested to sketch out ideas for how to turn spesh into a profiling thingie in the future? | 19:27 | ||
19:29 ventica joined
carlin | [Coke]: ahh, so that's why rakudo 2014.07 is codenamed Sofia | 19:29 | |
19:32 FROGGS joined
nwc10 | m: say 6.636e+01/6.597e+01 | 19:35 | |
camelia | rakudo-moar 085ab9: OUTPUT«1.00591177808095» | ||
nwc10 | slight negative speedup since lunchtime. | 19:36 | |
jnthn | Hmm | ||
Wonder what's to thank for that... | |||
nwc10 | but, given I've had repeatable speed diferences depending on the order that object files are linked | ||
there is some level of insanity in performance metrics | |||
dalek | MoarVM: 0043778 | jnthn++ | src/ (3 files): Split out part of frame deserialization. The split out part will be able to happen lazily, the first time we need it. (At present that won't be much of a win as we touch many of the frames at startup to install static lexical information; the plan is to move this information into the bytecode file also). |
timotimo | nwc10: maybe we should start putting -flto into our gcc commandlines? | 20:02 | |
jnthn | timotimo: How much difference does it make? | 20:03 | |
nwc10 | I have no good idea about that | ||
timotimo | haven't measured yet | ||
jnthn: that commit above combined with the plan you mention in it ... would that make a difference for memory usage? | 20:05 | ||
like, not using 99% of the frames in core setting would free up a bit of memory? | |||
jnthn | timotimo: That's the hope, yes | 20:07 | |
timotimo: And maybe a bit of a startup saving too | |||
timotimo | i'd like that a whole lot | ||
dalek | MoarVM: 0098c0c | jnthn++ | src/ (5 files): Preparations for lazy frame deserialization. | 20:53 |
MoarVM: cdda218 | jnthn++ | src/core/bytecode.c: Switch on lazy frame deserialization. Or at least, the parts we can easily get away with putting off until later. While it needs further work to take further advantage, NQP shows a 2.2% and Rakudo shows a 1.4% memory reduction for the empty loop program. |
timotimo | 1.4% would be about 2 megabytes? | 20:54 | |
jnthn | Yeah, just short of | 20:55 | |
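A quick cross-check of the arithmetic above: if a 1.4% reduction comes to just short of 2 MB, the implied baseline for the empty loop program is roughly:

```python
# Back-of-envelope check of the figures in the conversation above.
saving_mb = 2.0     # "just short of" 2 MB
fraction = 0.014    # Rakudo's 1.4% reduction
print(f"{saving_mb / fraction:.0f} MB")  # ~143 MB baseline
```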
21:07 zakharyas joined
21:23 btyler joined
dalek | MoarVM: c65b2a6 | jnthn++ | docs/bytecode.markdown: Spec static lexical values table in bytecode. | 21:34 |
MoarVM: 9ba5d15 | jnthn++ | src/mast/compiler.c: No longer need to support Parrot cross-compiler. It's almost certainly broken beyond repair to cross-compile from Parrot to Moar anyway, so no need to keep these last bits around. | 22:03 |
MoarVM: ac33547 | jnthn++ | lib/MAST/Nodes.nqp: Update MAST::Frame to hold static lex values. |
MoarVM: c0984eb | jnthn++ | src/ (4 files): Write static lex values; read but don't apply them | 23:25 |
MoarVM: e64c5eb | jnthn++ | src/core/bytecode.c: Read in static lexicals. |
MoarVM: f25affb | jnthn++ | src/mast/nodes_moar.h: MAST nodes can be identified by exact type. | 23:27 |
timotimo | oh, that ought to help a lot | 23:30 | |
we do istype on mast nodes all the time | 23:31 | ||
oh, that's only for inside the mastcompiler | |||
but it should still help | |||
jnthn | It's a small improvement...the cache-only istype is quite cheap anyway | ||
timotimo | #define EMPTY_STRING(vm) (MVM_string_ascii_decode_nt(tc, tc->instance->VMString, "")) | 23:32 | |
we have a per-tc (or per vm?) empty string nowadays | |||
jnthn | per vm | ||
where on earth do we use that macro.. | |||
oh, once per compilation | |||
no big saving | |||
timotimo | ./src/mast/compiler.c: hll_str_idx = get_string_heap_index(vm, ws, EMPTY_STRING(vm)); | ||
jnthn | but yeah, feel free to tweak it | ||
timotimo | oke | 23:33 | |
dalek | MoarVM: ff15814 | (Timo Paulssen)++ | src/mast/nodes_moar.h: we can use the vm's empty string constant here. | 23:37 |
timotimo | should i perhaps teach the ascii encoding about strlen(0) strings re-routing them to the global empty string constant if it exists? | 23:39 | |
jnthn | I think they are widely interned... | 23:41 | |
And utf8 would be a better one to teach it | |||
timotimo | mhm | 23:42 | |
jnthn | Fun fact: somewhere in Grammar.pm is a frame with 612 labels | 23:43 | |
timotimo | oh, cute | ||
is that after inlining? | |||
jnthn | No! | ||
timotimo | oh wow! | ||
jnthn | Well, aside from NQP's block flattening of course. | 23:44 | |
timotimo | seems pretty jumpy | ||
jnthn | Yeah | ||
Well, I'm pondering some MAST::Label changes. | |||
Today, we always make a string name for a MAST::Label, passing it to its constructor | |||
timotimo | could be integers, too, right? | 23:45 | |
jnthn | However, we never - afaik - in the compiler make two MAST::Labels with the same identifier | ||
Well, they could be integers, yes. | |||
The alternative is that they just work by object identity | |||
Which I believe would work with the current codebase. | |||
Saving 8 bytes per MAST::Label | |||
timotimo | hey, with jit enabled and latest master i get 30.5 seconds stage parse on my laptop :3 | ||
oh, even better | 23:46 | ||
jnthn | But I was then thinking "hm, I have no hash key" | ||
And wondering what happens if I make a linear scan of the labels. | |||
It'll be a C array so not *too* bad. | |||
timotimo | even if you have a frame with 612 labels? | ||
jnthn | A hash may be O(1) but the constant overhead isn't automatically cheap. | 23:47 | |
Well, that's an extreme/rare case. | |||
timotimo | that's right | ||
jnthn | Most frames are tiny. | ||
We might lose out on the odd extreme one. | |||
timotimo | so at least the linear search is going to be limited to each frame individually | ||
jnthn | Right. | ||
timotimo | that does sound sensible; do you have a histogram of frame sizes or something? | ||
jnthn | No | ||
I just looked for maximum ones | |||
But I'm quite used to reading spesh logs :) | |||
And labels <=> basic blocks are clsoe | 23:48 | ||
*close | |||
timotimo | ah, yes | ||
i've not seen any with 4 digit BBs in core setting :) | |||
jnthn | I wonder how many labels we create in compilation... | ||
m: say 21160 * 8 | 23:52 | ||
camelia | rakudo-moar fb0521: OUTPUT«169280» | ||
jnthn | That's how much we'd save on MAST::Label directly | ||
But we save all the strings too | |||
m: say 21160 * (6 * 8 #`(string size) + 10 #`(conservative label length estimate) * 4 #`(per grapheme)) | 23:54 | ||
camelia | rakudo-moar fb0521: OUTPUT«1862080» | ||
jnthn | Not so much I guess. | 23:55 | |
Though there's at least 1 intermediate string too, which is the numification of the number stuck onto it. | |||
Well, may give it a go tomorrow to see how it helps | 23:56 |