MasterDuke is there a reason ctxouter isn't jitted? the implementation in interp.c looks pretty simple 09:28
MasterDuke hm. "18:37 brrt I don't recall the exact reason, but there was a reason ctxouter didn't work." 09:31
that was a year ago, not sure that a lot has been done to the jit in the meantime, so whatever reason probably still holds? 09:32
lizmat yeah, fraid so, although it might be worth pinging brrt 09:50
lizmat hopes brrt is doing ok
sena_kun lizmat, I saw his messages an hour ago or so in another place. 10:04
lizmat ok, good to hear!
MasterDuke yep, this patch causes `Frame has no lexical with name '$?PACKAGE' at gen/moar/stage2/NQPHLL.nqp:1499 (/home/dan/Source/perl6/install/share/nqp/lib/NQPHLL.moarvm:SET_BLOCK_OUTER_CTX)` when running install-core-dist.raku after successfully building rakudo 14:09
and lots of rakudo's tests fail 14:10
MasterDuke hm. v2 of the patch has a very similar failure `Frame has no lexical with name '::?CLASS'` 15:56
i don't have any bash history for jit-bisect anymore, anybody remember how it's supposed to be run? 16:09
nine MasterDuke: are there any other JIT implementations of ops that use contexts and/or the framewalker? 16:16
Could be that MVM_context_apply_traversal relies on some book keeping data that's just not set up by the JIT
MasterDuke nine: i copied the implementation of ctxcallerskipthunks (its interp.c implementation is identical except for the literal passed to MVM_context_apply_traversal) 16:17 and 16:18
a bisect is currently running 16:20
`JIT Broken Frame/BB: 1 / 91===SORRY!===Frame has no lexical with name '$_'` 16:22
nine Ah, I see. Then I'd guess that the error is actually in another JITed op and implementing ctxouter just unlocks that 16:23
MasterDuke nine: care to see the log the jit bisect produced? i've never really understood them enough to find anything in them that points out where to look 16:24
nine can take a look 16:25
MasterDuke has it 16:26
jnthn So working on dispatch has led me to our calling conventions.
MasterDuke they need changing? 16:27
jnthn And looking at how we can efficiently implement the whole capture tweakery thing 16:28
Because the naive approach - well, also what we'd do when evaluating a dispatcher to record a guard/transform chain - is just to produce new MVMCaptures each time
nine You're gonna tell us that it will be a lot faster to pass on arguments in the future? 16:29
jnthn But we don't want to do that for the real guard chain walk.
Anyway, focusing back on what we do today for a moment 16:30
prepargs <callsite> - OK, so the callsite contains the argument register kinds, and also now the named argument names 16:31
arg_o 0, r(0)
arg_o 1, r(2)
The integer in the middle there writes into the args buffer. But we always, afaik, emit those in order. That's pretty redundant.
But wait, the information that it's an object argument is redundant too, 'cus that's in the callsite 16:32
And in fact, why do we even have an args buffer at all? It means we have to copy twice. 16:33
First, register to args buffer
Then in binding, args buffer to parameter
nine Couldn't the callsite contain the list of work registers that contain the args? They are determined at compile time anyway 16:38
jnthn I don't think it should contain the actual work register indices 16:41
Because we can't intern callsites so widely then
But I think it could contain constants 16:42
So then we have
prepargs <callsite>
[list of 16-bit integers identifying registers]
dispatch ...
That way, every arg is 2 bytes instead of 6 bytes today (or 2 bytes instead of 14 bytes for named args) 16:43
timotimo list of integers, like, literally where we'd normally have bytecode?
jnthn Yes 16:44
They're effectively "varargs" to the prepargs
nine Basically a prepargs OP with a variable number of arguments
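A minimal sketch of the "varargs" encoding described above, to make the idea concrete. It assumes a 16-bit opcode, a 32-bit callsite index, and one 16-bit register index per argument. The opcode number, the `CALLSITE_ARG_COUNTS` table, and both function names are hypothetical and not taken from MoarVM. The point it illustrates is that the decoder learns the argument count from the callsite, so the bytecode stream does not need to store it.

```python
import struct

# Hypothetical callsite table: callsite index -> number of arguments.
# In the design discussed, the count comes from the callsite, so the
# instruction stream itself carries only the register indices.
CALLSITE_ARG_COUNTS = {0: 2, 1: 3}

PREPARGS = 1  # made-up opcode number, for illustration only


def encode_prepargs(callsite_idx, registers):
    """Pack opcode (16 bits), callsite index (32 bits), then one
    16-bit register index per argument, little-endian, no padding."""
    assert len(registers) == CALLSITE_ARG_COUNTS[callsite_idx]
    return struct.pack(f"<HI{len(registers)}H", PREPARGS, callsite_idx, *registers)


def decode_prepargs(bytecode, offset=0):
    """Return (callsite index, register list, offset of next instruction)."""
    opcode, callsite_idx = struct.unpack_from("<HI", bytecode, offset)
    assert opcode == PREPARGS
    offset += 6  # 2 bytes of opcode + 4 bytes of callsite index
    count = CALLSITE_ARG_COUNTS[callsite_idx]
    registers = list(struct.unpack_from(f"<{count}H", bytecode, offset))
    return callsite_idx, registers, offset + 2 * count


if __name__ == "__main__":
    code = encode_prepargs(0, [0, 2])   # the r(0), r(2) example from the log
    print(len(code), decode_prepargs(code))
```

Each argument costs exactly 2 bytes in this layout, which matches the "2 bytes instead of 6" figure quoted in the discussion.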
jnthn hah!
And what if we take it even further? 16:45
dispatch_o r(0), <callsite>, 'dispatcher-name'
And then followed by the list of 16-bit integers 16:46
So instead of a 2-argument call today being prepargs (2 + 4 bytes), 2 arg_o instructions (2 * 6 bytes) and one invoke instruction (2 + 2 + 2 bytes), for a total of 24 bytes *and* 4 instructions to interpret 16:47
It'd be 2 (dispatch_o instruction code) + 2 (result register) + 4 (callsite) + 4 (dispatcher name) + 3 * 2 registers (one register is the invokee) = 18 bytes 16:48
1 instruction to interpret 16:49
And no copying into an arg buffer
No arg buffer for the GC to have to collect
In fact, no arg buffer to allocate at all
So every frame takes less ->work too
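The byte counts above can be re-derived with a small tally. This is only a sketch of the arithmetic in the log: the field widths are the ones jnthn quotes, while the split of `arg_o` into opcode, args-buffer index, and register (2 bytes each) is an assumption chosen to match the stated 6 bytes per argument. The function names are made up.

```python
OPCODE = 2      # bytes for an instruction code
REGISTER = 2    # bytes for a register index
CALLSITE = 4    # bytes for a callsite reference
STRING = 4      # bytes for a string reference (the dispatcher name)
ARG_INDEX = 2   # assumed: bytes for the args-buffer index in arg_o


def current_call_bytes(nargs):
    """prepargs + one arg_o per argument + invoke, as described in the log."""
    prepargs = OPCODE + CALLSITE
    arg_o = OPCODE + ARG_INDEX + REGISTER
    invoke = OPCODE + REGISTER + REGISTER
    return prepargs + nargs * arg_o + invoke


def proposed_call_bytes(nargs):
    """dispatch_o result, callsite, name, then invokee plus args as registers."""
    return OPCODE + REGISTER + CALLSITE + STRING + (nargs + 1) * REGISTER


def current_instruction_count(nargs):
    """One prepargs, one arg_o per argument, one invoke."""
    return 1 + nargs + 1


if __name__ == "__main__":
    print(current_call_bytes(2), proposed_call_bytes(2), current_instruction_count(2))
```

For a 2-argument call this reproduces 24 bytes and 4 instructions today versus 18 bytes and 1 instruction in the proposed form.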
The other thing I'm thinking to do is move flattening up front 16:51
So we do it at the callsite 16:52
And for cases where we have, say, up to N positional args flattened in, we resolve it to an interned callsite
Maybe some rule for named ones too
(Need to be careful that a malicious program doesn't explode the memory use :))
jnthn And if I hang this new way of doing things off the new `dispatch` instruction, I've got a gradual migration path for implementing this. :) 16:58
jnthn Ok, home time 17:04
jnthn hopes the time invested in the design work will mean he has an easier/shorter time of the impl work :) 17:05
MasterDuke nine: guess nothing jumped out at you in that bisect log? 18:50
nwc10 jnthn: er, hangon, currently each *call* causes allocation? Or "each call site on first call"? 19:00
timotimo which allocation are you referring to? 19:12
nwc10 17:49 < jnthn> No arg buffer for the GC to have to collect 19:13
timotimo that's more a "have a couple pointers that have to be put into a worklist" thing
nwc10 timotimo: I'm not familiar (at all) with the MoarVM calling convention, so I can't easily follow from jnthn's long description what is "plan he can rule out now" versus "current"
(sort of clear what "future" is intended to be, but of course "no plan survives contact with the enemy") 19:14
timotimo arg_buffer is actually a pointer into *work, eh? so maybe we're currently just allocating it at the end of the registers area or something?
lizmat also, will these plans affect the JIT in any way ? 19:17
will new ops need to be JITted
I assume so
jnthn nwc10: Currently the registers area for a frame has an area known as the "args buffer"; we keep a pointer into it also. 19:56
nwc10: The GC needs to walk these registers based on the callsite describing which ones are objects/strings 19:57
It's not a big amount of work, but every little helps.
nwc10 ah OK thanks 19:58
jnthn lizmat: Remains to be seen exactly how it works out, but it's unlikely that the op the interpreter uses will be JITted directly.
lizmat yeah, figured as much 19:59
by having the ops do more, wouldn't that make it harder to JIT ?
jnthn We'll be able to do things from inlining (op disappears) through specialization linking and so turning it into a fastinvoke of a specialization and fall back to at least a variant that avoids some of the overheads. 20:00
lizmat: Only if we ever let the JIT see it. :) 20:01
lizmat ok, so you're saying the JIT is going to have simpler targets ?
jnthn Well, in the inlining case it's got no op, in the linked specialization case it's a lot like today. That covers the monomorphic majority without really needing any changes. But yeah, a nicer fallback form for the JIT is possible, perhaps even including JITting the guard tree as it exists at the point we produce the specialization. 20:03
jnthn tbh, I'm mostly worried at this point about how badly we'll behave on the megamorphic minority, 'cus as it stands the design hasn't got a great answer to that. 20:06
And the worst would be $so-many-types."$so-many-names"() :) 20:12
lizmat couldn't a guard be something like "type seen"? 20:14
jnthn Well, normally you'd see a type and a method name and they won't change much, so the approach of "guard on type and name" (if name ain't already a constant) works out fine. 20:15
But if you see 100 types and 100 method names, you don't want to build a tree of 10,000 entries 20:16
At some point you're better off with having a per-target-type hash 20:17
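A rough illustration of that trade-off, not a description of what MoarVM does. The `CallSiteCache` class, the `max_guards` threshold of 16, and the use of a dict as a stand-in for a guard tree are all hypothetical. It shows one possible policy: guard on (type, name) pairs while the call site stays small, and fall back to a per-type method hash once it turns megamorphic, so that 100 types times 100 names never builds 10,000 guard entries.

```python
class CallSiteCache:
    """Toy call-site cache: guard entries first, per-type hash as fallback."""

    def __init__(self, method_tables, max_guards=16):
        # method_tables: type name -> {method name -> target}; stands in
        # for a per-target-type method hash.
        self.method_tables = method_tables
        self.max_guards = max_guards   # hypothetical threshold
        self.guards = {}               # (type, name) -> target
        self.megamorphic = False

    def resolve(self, type_name, method_name):
        if self.megamorphic:
            # Megamorphic: skip guards, do a two-step hash lookup.
            return self.method_tables[type_name][method_name]
        key = (type_name, method_name)
        if key in self.guards:
            return self.guards[key]
        target = self.method_tables[type_name][method_name]
        if len(self.guards) < self.max_guards:
            self.guards[key] = target
        else:
            # Too many distinct pairs seen: stop growing the guard set.
            self.megamorphic = True
            self.guards.clear()
        return target


if __name__ == "__main__":
    tables = {f"T{t}": {f"m{n}": (t, n) for n in range(100)} for t in range(100)}
    cache = CallSiteCache(tables)
    for t in range(100):
        for n in range(100):
            cache.resolve(f"T{t}", f"m{n}")
    print(cache.megamorphic, len(cache.guards))
```

With the naive "one guard per pair" approach the same loop would leave 100 * 100 = 10,000 entries behind.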
lizmat so why not start out with one?
jnthn ?
lizmat a per-target hash ?
or a per-target list ? 20:18
jnthn We do that today. 20:19
$ perl6 -e 'say X::AdHoc.^methods(:all).elems'
$ cat src/core.c/Exception.pm6 | grep class | wc -l 20:20
m: say 165 * 322
camelia 53130
jnthn Just for that one file, there's 53,000 serialized hash entries in CORE.setting's precomp thanks to this.
Even if we assume we manage to do it compact enough that there's 2 bytes each for the key and hash (it'll be worse in a big comp unit like CORE.setting), that's 200KB. 20:22
That's *before* you use the type and we deserialize the per-type method cache hash.
lizmat I wonder if X::AdHoc needs that many methods 20:24
maybe Exception should be made outside of Any ?
jnthn Was just doing the calculation, and I reckon it's 40 bytes just for the hash bucket storage once expanded...
m: say 165 * 40 20:25
camelia 6600
jnthn This only happens for the types you use, but still...
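The estimates above can be checked with a few lines of arithmetic. This sketch assumes that 165 is the per-type method count and 322 is the class count, which is how the `165 * 40` line reads but is not stated outright in the log. The 2-byte and 40-byte sizes are the ones jnthn quotes, and the function names are made up.

```python
METHODS_PER_TYPE = 165   # assumed: result of the ^methods(:all).elems command
CLASSES_IN_FILE = 322    # assumed: result of the grep | wc -l command


def serialized_cost(bytes_per_key=2, bytes_per_value=2):
    """Return (entry count, total bytes) for the serialized method caches."""
    entries = METHODS_PER_TYPE * CLASSES_IN_FILE
    return entries, entries * (bytes_per_key + bytes_per_value)


def expanded_bytes_per_type(bytes_per_bucket=40):
    """Hash bucket storage for one type once its cache is deserialized."""
    return METHODS_PER_TYPE * bytes_per_bucket


if __name__ == "__main__":
    entries, total = serialized_cost()
    print(entries, total, expanded_bytes_per_type())
```

That gives 53,130 entries and 212,520 bytes serialized (the "200KB" figure), plus 6,600 bytes of bucket storage for each type that actually gets used.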
lizmat: It's not really to do with exceptions, it's everything. I just picked it as a file that illustrates that Raku code is quite class-dense. 20:26
Or at least, can be.
lizmat yeah, but this was really outside of this discussion :-)
jnthn Especially given they have safety/performance benefits over hashes.
Anyway, no, I don't really think Exceptions not being Any would help matters. :) 20:27
lizmat it doesn't break the build, but it does break installing core modules
jnthn I'm just noting why the pre-calculation of a method cache for every type is costly now we have the size of standard library and people running the size of applications they do :)
And why I'm keen to move away from it as part of this set of changes, so we at least only build it for the cases that really need it. 20:28
(The other part of the story here is that I relied on this pre-calc to resolve a bootstrap loop also, and will probably have to find another way to circularity saw that too...) 20:32
