github.com/moarvm/moarvm | IRC logs at colabti.org/irclogger/irclogger_logs/moarvm | Set by AlexDaniel on 12 June 2018.
01:23
SmokeMachine left
01:26
SmokeMachine joined
01:37
SmokeMachine left
01:38
lucasb left
01:52
SmokeMachine joined
03:09
quotable6 left,
squashable6 left,
unicodable6 left,
undersightable6 left,
notable6 left,
benchable6 left,
reportable6 left,
statisfiable6 left,
greppable6 left,
evalable6 left,
committable6 left,
bisectable6 left,
shareable6 left,
releasable6 left,
nativecallable6 left,
bloatable6 left,
coverable6 left,
nativecallable6 joined
03:10
releasable6 joined,
notable6 joined,
bisectable6 joined
03:11
committable6 joined,
undersightable6 joined,
squashable6 joined,
quotable6 joined
03:12
statisfiable6 joined,
benchable6 joined,
coverable6 joined
03:13
shareable6 joined,
bloatable6 joined,
unicodable6 joined,
greppable6 joined
03:14
evalable6 joined,
reportable6 joined
05:03
robertle left
06:08
nebuchadnezzar left,
nebuchadnezzar joined
06:15
domidumont joined
06:49
brrt joined
brrt | \o | 07:02 | |
nwc10 | o/ | ||
07:11
patrickb joined
08:34
robertle joined
jnthn | So yesterday I implemented some "is this point in a branch relative to this one" thing as a quick heuristic check, but was thinking, hmm, this is ad-hoc, but there must be a proper way to do this... | 09:10 | |
Realized in the pub last night: I wanted postdominance. | |||
Though not sure if it's worth calculating if all we really need it for is a heuristic. | 09:11 | ||
lizmat | .oO( post dominance for a post modern language ) | 09:12
nwc10 | pubs++ | ||
09:15
zakharyas joined
09:27
dogbert11 joined
09:29
dogbert17 left
brrt | what is postdominance? | 09:58 | |
timotimo | probably very similar to the dominance we use to get the SSA form right, but instead of from the beginning towards the end it'd go from the end towards the beginning? | 09:59 | |
jnthn | brrt: Dominance but with the arrows reversed | ||
Or intuitively, "what do we always exit via" | 10:00 | ||
(Where dominance is "what do we always enter via") | |||
Woodi | would be nice to have in precompiled modules but if runtime needs just heuristics then can it be costly? | 10:02
timotimo | we only calculate those things for things that are measured to be hot during run time | 10:06 | |
calculating it for everything while precompiling modules is wasteful | |||
especially since earlier optimizations can already throw out branches completely | |||
jnthn | Well, also probably what the VM optimizer considers a basic block is similar to what Perl 6 does anyway, especially given it has to be re-calc'd after inlines | 10:07 | |
brrt | oh, I see | 10:14 | |
.... that would've been a nice term to have, I had that precise problem half a year ago | 10:15 | ||
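For context, postdominators can be computed with the same iterative dataflow used for ordinary dominators, just walking the edges in reverse; a minimal standalone sketch with a hand-rolled bitset CFG (hypothetical layout, not MoarVM's spesh graph types):

    /* Iterative postdominator computation over a tiny CFG.
     * postdom(b) = {b} + intersection of postdom(s) over all successors s,
     * i.e. dominance with the arrows reversed ("what do we always exit via"). */
    #include <stdint.h>
    #include <stdio.h>

    #define NBLOCKS 5

    /* succ[b] is a bitmask of the successors of block b; block 4 is the exit */
    static const uint32_t succ[NBLOCKS] = {
        1u << 1 | 1u << 2,  /* 0 -> 1, 2  (a branch) */
        1u << 3,            /* 1 -> 3                */
        1u << 3,            /* 2 -> 3                */
        1u << 4,            /* 3 -> 4                */
        0,                  /* 4: exit               */
    };

    int main(void) {
        uint32_t postdom[NBLOCKS];
        const uint32_t all = (1u << NBLOCKS) - 1;
        for (int b = 0; b < NBLOCKS; b++)
            postdom[b] = succ[b] == 0 ? 1u << b : all;

        int changed = 1;
        while (changed) {
            changed = 0;
            for (int b = NBLOCKS - 1; b >= 0; b--) {
                if (succ[b] == 0)
                    continue;
                uint32_t meet = all;
                for (int s = 0; s < NBLOCKS; s++)
                    if (succ[b] & (1u << s))
                        meet &= postdom[s];
                uint32_t next = meet | 1u << b;
                if (next != postdom[b]) { postdom[b] = next; changed = 1; }
            }
        }
        for (int b = 0; b < NBLOCKS; b++)
            printf("postdom(%d) = %#x\n", b, postdom[b]);  /* block 0: 0x19, i.e. {0,3,4} */
        return 0;
    }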
10:55
brrt left
11:17
zakharyas left
11:30
domidumont left
11:31
domidumont joined
timotimo | the MVM_op_infos array is about 35 kib big | 11:45 | |
i do believe we have a few fields that don't need to be 8 bit wide | |||
timotimo tries bitfield support in C for this | 11:46 | ||
28 kib, cool. | |||
hmm. we have an MVMuint16 num_operands, but the maximum operand count we allow is 8; the higher values are just for the benefit of PHI nodes | 11:47
hm, 28.6 kib actually | 11:48 | ||
having the operands in-line with the structs themselves means we pay 8 bytes for every op, which is 917 bytes | 11:50
there's only 4 ops with 8 operands, 6 with 7, 14 with 6 | 11:53 | ||
silly me, it's not 917 bytes, it's 917 * 8 bytes | 11:54 | ||
m: say 2360 / (917 * 8) | |||
evalable6 | 0.321701 | ||
timotimo | about 66% space savings available with a more packed operands storage | 11:55 | |
m: say 917 * 8 | |||
evalable6 | 7336 | ||
timotimo | save 5 more kilobytes off these 28 | ||
that's my useless timewaste for the day, thank you very much | 11:56 | ||
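A minimal sketch of the kind of bitfield packing being tried above, with made-up field names rather than MoarVM's actual MVMOpInfo layout:

    /* Illustrative only: hypothetical fields, not the real MVMOpInfo.
     * Fields that never exceed a few bits don't each need a full byte
     * (or MVMuint16), so bitfields shrink the per-op metadata. */
    typedef struct {
        const char    *name;
        unsigned short opcode;
        unsigned       num_operands : 4;  /* real operand counts top out at 8 */
        unsigned       pure         : 1;
        unsigned       may_cause_gc : 1;
        unsigned       deopt_point  : 1;
        unsigned char  operands[8];       /* still in-line: 8 bytes per op */
    } packed_op_info;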
11:58
brrt joined
timotimo | but if we see what spesh and the bytecode validator tend to need vs not need, we might get the tiniest bit of performance out of less cache contention | 12:00 | |
brrt | Well, yes, spesh etc use the op_infos table quite heavily | 12:05 | |
so a saving would be nice :-) | |||
timotimo | oh, i have to add 917 * 2 back for the savings, because we'll obviously need an index into the ops list | ||
(though actually that can re-use parts that repeat, which would be cool) | |||
collect all operands bits, calculate overlaps, tile the shortest array with it, make update_ops.p6 take ten minutes to complete | 12:07 | ||
i'm kind of sort of interested now | |||
FWIW, tools/lib/MAST/Ops.pm has a setup much like this already :) | 12:08 | ||
completely without overlap finding, though | |||
brrt | well, if you first make update_ops.p6 take ten minutes, we'll all have a challenge to reduce that to a couple seconds | 12:11 | |
timotimo | :D | 12:12 | |
a simple greedy algorithm should be just fine and very fast | |||
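What is being sketched here, packing many short operand sequences into one shared array so that sequences may overlap and each op only keeps a start index, is essentially a small shortest-common-superstring problem. A rough greedy sketch in C (made-up names and sizes, not the actual update_ops.p6 logic):

    /* Greedy packing of operand byte sequences into one shared array:
     * reuse an existing occurrence if possible, otherwise overlap the
     * tail of the packed array with a prefix of the new sequence. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint8_t packed[4096];
    static size_t  packed_len = 0;

    /* Return the offset at which seq lives in packed[], adding it if needed. */
    static size_t pack_operands(const uint8_t *seq, size_t len) {
        /* 1. already contained somewhere? */
        for (size_t i = 0; i + len <= packed_len; i++)
            if (memcmp(packed + i, seq, len) == 0)
                return i;
        /* 2. overlap a suffix of packed[] with a prefix of seq */
        size_t overlap = 0;
        for (size_t k = len - 1; k > 0; k--)
            if (k <= packed_len && memcmp(packed + packed_len - k, seq, k) == 0) {
                overlap = k;
                break;
            }
        size_t off = packed_len - overlap;
        memcpy(packed + packed_len, seq + overlap, len - overlap);
        packed_len += len - overlap;
        return off;
    }

    int main(void) {
        const uint8_t a[] = { 71, 39, 63 };
        const uint8_t b[] = { 39, 63, 23 };  /* shares a 2-byte overlap with a */
        size_t off_a = pack_operands(a, sizeof a);
        size_t off_b = pack_operands(b, sizeof b);
        /* prints: a at 0, b at 1, packed_len 4 (instead of 3 + 3 = 6) */
        printf("a at %zu, b at %zu, packed_len %zu\n", off_a, off_b, packed_len);
        return 0;
    }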
brrt | so, you know what else I found | ||
I expect that tiling will Just Work on a linear IR | |||
timotimo | oh? | 12:13 | |
that sounds good | |||
no changes needed for that part of the code | |||
brrt | well, some changes | ||
timotimo | which i imagine is a difficult part of the code | ||
like the rest of the code :) | |||
brrt | I expect it will be simpler | ||
right now, we do a postorder (bottom-up) iteration, and then a top-down iteration | 12:14 | ||
but all we really need bottom-up for, is to ensure that nodes are processed before the nodes that reference them; | 12:15 | ||
that is easily ensured by the order of the linear IR | |||
timotimo | nice | ||
sounds like a whole phase can be tossed | |||
brrt | and the inverse is true for top-down | ||
not sure about the phase, but we can translate a tree traversal to a linear iteration, without a change in semantics | 12:16 | ||
timotimo | ah, sweet | ||
brrt | another thing that the tiler currently does, is create the basic blocks | ||
that would be handled in an earlier phase | |||
(and assigning labels) | 12:17 | ||
because it translates between the graph form, and the linear form | |||
so if we have a linear form throughout, we no longer need that translation (it'll happen in the perl compiler)
and finally. Because of the linear IR, I can ensure that any operand loads are emitted before the template that references them. This will eliminate entirely the missing-ancestor problem | 12:18 | ||
and the optimizer can move them and eliminate them if it proves that this is unnecessary | 12:19
I guess the point is that adding an explicit relative order of operations allows you to work on that relative order | 12:20
which isn't possible in the graph IR as that leaves them unspecified | |||
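A small illustration of that ordering argument: when the IR is already linear with every definition preceding its uses, one forward pass sees each operand before the instruction that references it, so no separate postorder walk is required. Hypothetical node layout, not the actual expr-JIT types:

    /* Hypothetical linear-IR node: operands refer to earlier nodes by index,
     * so a single forward walk processes definitions before uses. */
    #include <stdio.h>

    typedef struct {
        const char *op;
        int         operand[2];  /* indices of earlier nodes, or -1 */
    } lin_ins;

    int main(void) {
        lin_ins ir[] = {
            { "load",  { -1, -1 } },  /* 0 */
            { "const", { -1, -1 } },  /* 1 */
            { "add",   {  0,  1 } },  /* 2: uses 0 and 1, both already seen */
            { "store", {  2, -1 } },  /* 3 */
        };
        int n = sizeof ir / sizeof ir[0];
        for (int i = 0; i < n; i++)
            /* the linear order guarantees operand[j] < i, so anything a tile
             * needs has already been handled by the time we reach node i */
            printf("tile %s (node %d)\n", ir[i].op, i);
        return 0;
    }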
timotimo | i've got a preliminary result | 12:25 | |
brrt | I'm interested? | 12:26 | |
timotimo | it's built a list of 722 elements | ||
compare that to 917 * 8 elements for "no overlap attempts" | 12:27 | ||
er | |||
that's "not tightly packed" | |||
and 2360 for "tightly packed" | |||
that's not all yet | |||
because when i find no match, currently i just append the whole thing to the end without checking for a tail overlap | |||
m: say 722 / (917 * 8) | 12:30 | ||
evalable6 | 0.098419 | ||
timotimo | m: say 722 / (2360) | ||
evalable6 | 0.305932 | ||
timotimo | i had meant to work on something else today, but that's not a terrible win | ||
m: say 722 - (917 * 8) | 12:31 | ||
evalable6 | -6614 | ||
timotimo | so a bit more than 6kb savings for the operands array | ||
12:33
brrt left
timotimo | if i don't "have to" put the opcodes in at the same order as they are in the input, i can potentially make better decisions | 12:41 | |
Cowardly refusing to permutate more than 20 elements, tried 821 | 12:44 | ||
%) | |||
641 is my best score so far based on "just" reordering the ops randomly | 12:48 | ||
reproducible builds this is not | 12:49 | ||
except i srand($attempt-number) | |||
632, cool | |||
12:50
sena_kun joined
timotimo | and it parallelizes trivially, of course | 12:53 | |
oh, hm, but random state is per thread context, and if any of the blocks had an await in it or were otherwise descheduled, they would resume with different random values | |||
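One way to keep such a search reproducible regardless of which thread runs an attempt, or whether it gets descheduled partway through, is to derive the PRNG state from the attempt number alone; a sketch using POSIX rand_r, not the actual script:

    /* Deterministic per-attempt randomness: the PRNG state lives in a local
     * variable seeded from the attempt number, so the result never depends
     * on thread-local random state or on scheduling. */
    #include <stdio.h>
    #include <stdlib.h>

    static unsigned score_attempt(unsigned attempt) {
        unsigned seed  = attempt;  /* the srand($attempt-number) idea */
        unsigned score = 0;
        for (int i = 0; i < 10; i++)
            score += rand_r(&seed) % 100;  /* stand-in for a random reordering */
        return score;
    }

    int main(void) {
        for (unsigned a = 0; a < 4; a++)
            printf("attempt %u -> score %u\n", a, score_attempt(a));
        return 0;
    }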
12:55
zakharyas joined
timotimo | oh, throwing out all the 1-parameter ops at the beginning could be interesting; there's 117 of them, and 12 0-parameter ops, too | 12:56 | |
they can just be put in at the end. either they overlap somewhere or they'll go in at the end anyway, but they shouldn't cause a potential chain to break by being appended at the end | 12:57
13:00
Guest2775 left
timotimo | i can't find the name of this problem; it should totally be a common comp-sci problem to solve | 13:02 | |
13:18
Guest55475 joined
timotimo | oh! | 13:21 | |
write registers | |||
two opcodes that have a write register can only have overlapping beginnings | 13:22 | ||
only all-read-operand opcodes can be partially inside or overlapping beginning and end of other ops | |||
13:22
lucasb joined
timotimo | ([71 71 71 71 167 71 39 63 39 39 71 55 55 39 71 63 71 71 71 71 39 71 71 63 167 23 63 63 39 39 39 71 63 63 63 71 39 39 63 39 23 23 71 71 71 71 71 71 63 71 63 39 79 39 55 55 55 71 71 79 63 39 71 71 63 63 71 39 63 63 39 71 71 71 39 39 71 39 23 79 71 23 23 55 167 55 167 63 55 71 63 71 71 71 63 39 39 71 167 143 ...] 396 31) | 13:27 | |
([2 1 1 0 0 2 1 1 1 1 1 1 1 1 1 2 0 0 0 1 1 0 1 0 2 3 2 1 0 2 0 1 4 1] 34 5) | |||
these are the lists, each followed by its score and random seed: the first for all operands with the r/w and lex/reg/literal bits masked out, the second for all operands with only their r/w and lex/reg/literal bits | 13:28
if we keep those separate, which may be a PITA, we can get a much tighter packing | |||
13:36
robertle left
13:38
robertle joined
timotimo | on top of the fact that the values themselves are now smaller, so they could be less than 8 bits per entry, but that's also annoying to deal with depending on the API we have to offer | 13:39 | |
anyway, the r/w, lit/reg/lex bits are only 3 out of 8, so i can't just store one in 4 bits per entry and the other in 4 bits per entry | 14:02 | ||
however | |||
if i do a 4/4 split, even though the values aren't entirely meaningful then, i can get the whole thing stored in 240 bytes | 14:03 | ||
whereas if i use 8 bits per entry with one list for the upper 5 and one for the lower 3 bits i'd spend 430 bytes | |||
either is better than spending the 624 | |||
though of course there's diminishing returns at some point, and i might be going into "actually useless" territory by now | 14:04 | ||
nine | Well if you manage to make it fit in 64 bytes or so, you'd be down to a single cache line :) | 14:49
timotimo | rather unlikely :D | ||
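A sketch of the 4/4 split being weighed above, with two 4-bit entries stored per byte and helpers to get them back out (made-up names, not an actual MoarVM API):

    /* Two 4-bit entries per byte: entry 2i in the low nibble, 2i+1 in the
     * high nibble. Illustrates the space/API trade-off only. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint8_t nibble_get(const uint8_t *arr, size_t idx) {
        uint8_t byte = arr[idx >> 1];
        return idx & 1 ? byte >> 4 : byte & 0x0F;
    }

    static void nibble_set(uint8_t *arr, size_t idx, uint8_t val) {
        if (idx & 1)
            arr[idx >> 1] = (arr[idx >> 1] & 0x0F) | (uint8_t)(val << 4);
        else
            arr[idx >> 1] = (arr[idx >> 1] & 0xF0) | (val & 0x0F);
    }

    int main(void) {
        uint8_t packed[8] = { 0 };
        nibble_set(packed, 0, 0x3);
        nibble_set(packed, 1, 0xA);
        printf("%x %x\n", nibble_get(packed, 0), nibble_get(packed, 1));  /* 3 a */
        return 0;
    }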
MFW a sequence that doesn't try to put only parts of the new operand data at the end when no match was found yet is better than sequences that try to find overlaps for the end | 15:08 | ||
gaaah tantalizingly close to getting the "start" indices for one of the arrays to fit into 8 bits | 15:20 | ||
271 entries | |||
15:20
domidumont left
15:26
AlexDani` joined
15:29
AlexDani` is now known as AlexDaniel,
AlexDaniel left,
AlexDaniel joined
AlexDaniel | ⚠ github.com/perl6/problem-solving/p...-488716403 | 15:30 | |
15:42
robertle left
15:59
patrickb left
16:28
brrt joined
16:36
patrickb joined
16:39
zakharyas left
16:41
squashable6 left
16:42
squashable6 joined
16:47
zakharyas joined
16:55
brrt left
16:58
domidumont joined
17:12
robertle joined
17:34
zakharyas left
18:41
domidumont left
18:44
brrt joined
19:21
zakharyas joined
timotimo | only 35 of the first 820 have any marks; that's 2 bytes in every op | 19:43 | |
m: say (820 - 35) * 2 | |||
evalable6 | 1570 | ||
timotimo | and i think the marks are only ever needed by the validator anyway | 19:44 | |
brrt | you can extract them in a separate array? | 19:48 | |
so that the rest is more compact | |||
timotimo | sure | ||
would also be possible to have an array of spans that'd be quick to scan through | |||
the only two usages of the mark struct member are in validation.c and ext.c | 19:52 | ||
so that shouldn't be too bad | 19:55 | ||
moving it away would already be a good thing i'm sure | |||
i mean ... barely | 19:56 | ||
but, you know | |||
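Since only validation.c and ext.c consult the mark, one option (hypothetical layout, purely to illustrate the idea) is to drop it from the hot struct and keep a small side table keyed by opcode:

    /* Hypothetical side table for the rarely-used mark data: the hot op-info
     * struct loses those bytes per op, and the handful of marked ops live in
     * a small sorted array that only the validator searches. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint16_t opcode;
        char     mark[2];
    } op_mark_entry;

    /* sorted by opcode; only the few ops that actually carry a mark appear
     * here (the entries below are made up) */
    static const op_mark_entry op_marks[] = {
        { 12, { '.', 's' } },
        { 48, { '.', 'r' } },
    };

    /* binary search; cheap enough for the validator's occasional lookups */
    static const char *find_mark(uint16_t opcode) {
        size_t lo = 0, hi = sizeof op_marks / sizeof op_marks[0];
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (op_marks[mid].opcode == opcode)
                return op_marks[mid].mark;
            if (op_marks[mid].opcode < opcode)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;  /* this op has no mark */
    }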
so, how widespread is bitfield support in C compilers that we target? | |||
MasterDuke | can we set up *BSDs in travis, appveyor, or circleci? that's the main platform missing right? otherwise i'd say create a PR and see what ci reports | 20:21 | |
brrt | if a C compiler doesn't support bitfields, it's not a C compiler at all imho | 20:22 | |
timotimo | in other words, it would be safe to use bitfields in moarvm code? | 20:23 | |
20:40
zakharyas left
20:44
brrt left
21:33
sena_kun left
21:40
sena_kun joined
21:49
sena_kun left
22:02
patrickb left
mornfall | bitfields are fine, what isn't are assumptions about their layout | 22:18 | |
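A tiny example of that point: the named fields behave the same everywhere, but how they map onto bytes is implementation-defined, so nothing should serialize or alias the raw storage:

    /* Bitfields themselves are standard C; their ordering and padding within
     * the storage unit are implementation-defined. Rely on the named fields,
     * never on the byte layout. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct flags {
        unsigned a : 3;
        unsigned b : 5;
    };

    int main(void) {
        struct flags f = { .a = 5, .b = 17 };
        printf("a=%u b=%u\n", f.a, f.b);  /* portable: always 5 and 17 */

        uint8_t raw;
        memcpy(&raw, &f, 1);
        printf("first byte = %#x\n", raw);  /* NOT portable: compiler-specific */
        return 0;
    }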
22:22
Kaiepi joined
22:49
sena_kun joined
23:06
lizmat_ joined
23:08
lizmat left
23:34
Kaypie joined,
Kaiepi left
23:37
sena_kun left,
sena_kun joined