Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
Geth MoarVM/master: 4 commits pushed by (Nicholas Clark)++, niner++ 08:56
nine Looks like LibXML suffers from at least 2 bugs. One spesh-related (but not JIT) and one that appears even without spesh. And one test seems to get stuck currently because LibXML is trying to fetch something from the network but just hangs instead (or runs into a timeout, wasn't patient enough to check) 09:53
dogbert17 nine: still down in the signed/unsigned rabbit hole? 10:59
nine That's on pause while I am debugging Blin issues 11:01
dogbert17 I stumbled upon what I believe is a nativecall-related bug yesterday 11:03
github.com/MoarVM/MoarVM/issues/1621 11:05
nine Oh, the non-spesh bug exposed by LibXML is about Proxy. Once we encounter a Proxy in the args list, we stop processing arguments and delegate to raku-native-dispatch-deproxy instead. This causes us to not install guards for the remaining arguments 11:10
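A toy model of the guard problem nine describes may help. It is not MoarVM's dispatcher code: `ToyArg`, `guard_args_buggy` and `guard_args_fixed` are hypothetical names, and the "fixed" variant only shows the invariant (every argument gets a guard before delegating). It does not reproduce nine's actual fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy argument: either a plain value or a Proxy. */
typedef struct {
    bool is_proxy;
    bool guarded;   /* set once a guard has been recorded for this slot */
} ToyArg;

/* Shape of the bug: stop at the first Proxy and delegate to the deproxy
 * step, so arguments after it never get a guard. Returns the index of the
 * first Proxy, or n if there is none. */
size_t guard_args_buggy(ToyArg *args, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (args[i].is_proxy)
            return i;            /* later arguments are left unguarded */
        args[i].guarded = true;
    }
    return n;
}

/* Invariant-preserving shape: record a guard for every argument first,
 * then report whether any Proxy still needs deproxying. */
bool guard_args_fixed(ToyArg *args, size_t n) {
    bool saw_proxy = false;
    for (size_t i = 0; i < n; i++) {
        args[i].guarded = true;
        if (args[i].is_proxy)
            saw_proxy = true;
    }
    return saw_proxy;
}
```

With arguments `(plain, Proxy, plain)`, the first function guards only slot 0, which is exactly the kind of gap that lets a later call with a different third argument reuse a dispatch it should not.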
dogbert17 so you found one of the bugs already, impressive 11:11
nine Now I've even got a fix. Though it's kinda awful and I don't understand 100 % why it doesn't work the other way. 11:59
MasterDuke how do i turn on github.com/rakudo/rakudo/blob/mast...#L301-L305 when building rakudo? moarvm/nqp/rakudo were all built with `make MVM_TRACING=1 install`, but when i run with --tracing it's instead going to the default (even though the value would be triggering that case) 12:20
also, when i just remove that #if, tracing still doesn't happen, though maybe that's because the MVM_interp_enable_tracing call is happening before the MVM_vm_create_instance call? 12:24
well, moving MVM_interp_enable_tracing after the MVM_vm_create_instance doesn't change anything 12:27
nine So, I've rewritten Proxy support in NativeCall so that instead of the clever but inefficient trick that would require that ugly fix, it works like multi dispatch and uses the ProxyReaderFactory. Works like a charm, except that now in a few cases the dispatch just does not get resumed 16:44
dogbert17 uh oh 17:00
MasterDuke every problem in computer science can be solved by adding a layer of indirection, time to break out the ProxyReaderFactoryFactory! 17:09
nine This isn't Java. The solution is obviously a ProxyReaderFactoryProxy 17:39
.oO( with maybe a touch of sudo ? )
japhb lizmat: sudo MakeMeAProxyReaderFactoryProxy ? 20:27
sudo write-my-boilerplate 20:28
timo `no nonsense;` 21:20
MasterDuke huh. using alloca where possible in the rakudo runner took 56,532,210 more instructions for 100 runs of `MVM_SPESH_BLOCKING=1 raku -e ''` 21:31
timo oh? that is odd 21:34
MasterDuke i do add conditionals for if the size is too big, but i still sort of expected it to be fewer instructions overall. going to check elapsed time with /usr/bin/time now 21:36
timo huh
MasterDuke well, /usr/bin/time isn't very precise, but for 100 runs it thinks using alloca is 0.19s slower (in total) 21:41
the patch gist.github.com/MasterDuke17/497d9...9b29845492 if you're curious 21:42
maybe some of those checks for the size can be assumed to always be ok and removed 21:44
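The gist link above is truncated, so the patch itself isn't reproduced here. What follows is a hypothetical sketch of the general pattern being discussed: a stack buffer via `alloca` when the needed size is under a cutoff, `malloc` otherwise. `STACK_BUF_LIMIT` and `join_path_matches` are invented for illustration, and `<alloca.h>` is non-standard (glibc, BSD, macOS; MSVC spells it `_alloca` in `<malloc.h>`).

```c
#include <alloca.h>   /* non-standard header: glibc/BSD/macOS */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cutoff: buffers up to this many bytes go on the stack. */
#define STACK_BUF_LIMIT 4096

/* Join dir and file with '/', then compare against expected.
 * *used_heap reports which allocation path was taken.
 * Returns 1 on match, 0 on mismatch, -1 if malloc fails. */
int join_path_matches(const char *dir, const char *file,
                      const char *expected, bool *used_heap) {
    size_t dlen = strlen(dir), flen = strlen(file);
    size_t need = dlen + 1 + flen + 1;   /* '/' separator plus the NUL */
    char *buf;

    /* This size check is the extra branch the alloca version pays for. */
    *used_heap = need > STACK_BUF_LIMIT;
    if (*used_heap) {
        buf = malloc(need);
        if (buf == NULL)
            return -1;
    } else {
        /* alloca memory lives until this function returns,
         * not just until the end of this block. */
        buf = alloca(need);
    }

    memcpy(buf, dir, dlen);
    buf[dlen] = '/';
    memcpy(buf + dlen + 1, file, flen + 1);   /* includes file's NUL */

    int match = strcmp(buf, expected) == 0;
    if (*used_heap)
        free(buf);                            /* only the heap path is freed */
    return match;
}
```

If an upper bound on the input size is known in advance (as with path lengths), the branch and the `malloc` fallback can be dropped in favour of a fixed-size buffer plus a length check.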
timo ok the first thing i see is file path lengths 21:51
we should be able to get a proper value for "maximum file path length the system allows"
so anything more than that would be an error later anyway
MasterDuke "Linux has a maximum filename length of 255 characters for most filesystems (including EXT4), and a maximum path of 4096 characters" to quote the first hit on google 21:54
timo we'll want to have this value for every system we support
and of course never accidentally write past our allocated buffer when we are fed something longer
MasterDuke looks like for bsd max path is 1024 21:55
and macos 21:56
oh, and windows is 260
timo oh christ, 260? 21:57
that's nothing?!
MasterDuke well, of course it's more complicated than that docs.microsoft.com/en-us/windows/w...limitation 21:57
but that looks like the initial default 21:58
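One way to get the "maximum path length" value at compile time, plus a copy that refuses oversized input instead of overrunning the buffer, is sketched below. `PATH_MAX` typically comes from `<limits.h>` (4096 on Linux with glibc, 1024 on the BSDs and macOS) and `MAX_PATH` (260) from `<windows.h>`; the final 4096 fallback, `SKETCH_PATH_MAX` and `copy_path_bounded` are assumptions for illustration. POSIX also offers `pathconf(path, _PC_PATH_MAX)` for a per-filesystem value at run time.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>   /* MAX_PATH (260) */
#endif

/* Pick a compile-time upper bound for path length.
 * The last-resort value is an assumption. */
#if defined(PATH_MAX)
#  define SKETCH_PATH_MAX PATH_MAX
#elif defined(MAX_PATH)
#  define SKETCH_PATH_MAX MAX_PATH
#else
#  define SKETCH_PATH_MAX 4096
#endif

/* Copy src into dst (capacity dst_size) only if it fits, terminator
 * included. Never writes past dst; on failure dst becomes "" (when it
 * has room for that) and false is returned. */
bool copy_path_bounded(char *dst, size_t dst_size, const char *src) {
    size_t len = strlen(src);
    if (dst_size == 0)
        return false;
    if (len >= dst_size) {
        dst[0] = '\0';
        return false;
    }
    memcpy(dst, src, len + 1);
    return true;
}
```

A caller would size its buffer as `char buf[SKETCH_PATH_MAX];` and treat a `false` return as an error, matching the point that anything longer "would be an error later anyway".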
i'll just remove all those checks and see if it's faster. if not, no reason to continue the experiment 22:00
timo ah, good point 22:02
to get an upper and lower bound
MasterDuke huh. even with no checks, 6,889,088 more instructions for alloca and 0.04s more time according to /usr/bin/time 22:23
timo that is very odd. what does alloca actually compile to, then? 22:39
just the call into malloc is short, but then you'll also run through malloc, which does a bunch of stuff, whereas alloca should essentially be barely a change at all?? 22:40
Ir also goes by cache lines, right? could it have something to do, then, with little changes giving us whole-cacheline differences at once? 22:42
even if the malloc implementation just stays in cache most of the time, it'd still be more code than many alloca usages combined?
but 7 million is out of how much? 10,000 million? 22:43
MasterDuke what do you mean, out of? 22:52
95769046524 total instructions for 100 runs with alloca, 95762157436 total instructions on master 22:53
timo m: say 95769046524 * 100 / 95762157436 22:54
camelia 100.007193956553
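The same comparison worked out by hand from the two 100-run callgrind totals quoted above:

$$
\begin{aligned}
\Delta &= 95\,769\,046\,524 - 95\,762\,157\,436 = 6\,889\,088 \text{ instructions (100 runs)} \\
\frac{95\,769\,046\,524}{95\,762\,157\,436} \times 100 &\approx 100.00719 \\
\frac{\Delta}{95\,762\,157\,436} &\approx 7.19 \times 10^{-5} \approx 0.0072\,\% \\
\frac{\Delta}{100} &\approx 68\,891 \text{ instructions per run}
\end{aligned}
$$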
timo i wonder. will this be difficult to measure reliably until we've made everything else a lot faster already? 22:55
perhaps the big difference is when we have a multithreaded program, but valgrind wouldn't tell us in that case 22:56
also, can you try jemalloc or some other malloc implementation? 22:57
MasterDuke well, i figure the empty program is the best possible case for this since this is just about trying to speedup startup 22:58
i'll try with mimalloc. btw, did you see my comments about using it a little while ago? 22:59
timo i did not 23:00
MasterDuke colabti.org/irclogger/irclogger_lo...-10-16#l56 23:01
heh. with mimalloc LD_PRELOADed, alloca is .17 faster than master (though guess now i have to try master with mimalloc) 23:03
timo oh i see that i actually even replied to that
.17 seconds?
MasterDuke they just released 2.0, i think we should seriously consider bundling/using it with moarvm 23:04
well, sum of all the elapsed times with master is 1827, 1649 for alloca+mimalloc 23:05
timo empty program still?
MasterDuke yeah
timo that means a regular run of empty program is 1.8 seconds? 23:06
MasterDuke 0.18
timo you only used 100 for valgrind i guess?
MasterDuke for both
timo m: say 1827 / 100 23:07
camelia 18.27
timo i'm confused :) :)
MasterDuke divide by 100 again 23:08
/usr/bin/time reports 0.18, i log 18 to a file (and do that 100 times) 23:09
and then just sum all the lines in the file 23:14
hm, wonder if using hyperfine would be more rigorous 23:16
MasterDuke master+mimalloc is 1561 for startup time 23:19
MasterDuke so somehow it's really looking like alloca isn't helping in this case 23:20
(but mimalloc is)
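Each run's elapsed time is logged in hundredths of a second (0.18 s is stored as 18) and the 100 values are summed. Converting the three quoted totals back to per-run means therefore divides by 100 twice:

$$
\begin{aligned}
\bar t_{\text{master}} &= \frac{1827}{100 \cdot 100}\,\text{s} = 0.1827\,\text{s} \\
\bar t_{\text{alloca+mimalloc}} &= \frac{1649}{100 \cdot 100}\,\text{s} = 0.1649\,\text{s} \\
\bar t_{\text{master+mimalloc}} &= \frac{1561}{100 \cdot 100}\,\text{s} = 0.1561\,\text{s}
\end{aligned}
$$

Per run, alloca+mimalloc comes out about 0.0088 s slower than master+mimalloc, while mimalloc alone is about 0.0266 s faster than master. Each individual sample is quantized to 0.01 s, so differences this small should be read with that resolution in mind.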
ugh, so much slower to run callgrind 100 times... 23:25
timo oh for sure
MasterDuke an interesting, yet involved, experiment would be to rip out the fsa and just see how mimalloc does 23:28
timo mhh the code is already there, you'd just want to toss out the size checks including allocating the size bit before the data 23:29
MasterDuke oh right, FSA_SIZE_DEBUG is essentially that, isn't it? 23:30
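The "size bit before the data" that timo mentions is a common debugging technique: store the requested size in a header just in front of the returned pointer, then check it when the block is freed with a caller-supplied size. The sketch below is a generic illustration of that idea, not MoarVM's FSA code. `SizeHeader`, `sized_alloc` and `sized_free` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header stored in front of every block. The union with max_align_t keeps
 * the pointer handed to the caller aligned for any type. */
typedef union {
    size_t size;
    max_align_t align;
} SizeHeader;

/* Allocate size bytes and remember the size just before the user data. */
void *sized_alloc(size_t size) {
    if (size > SIZE_MAX - sizeof(SizeHeader))
        return NULL;                      /* header + size would overflow */
    SizeHeader *h = malloc(sizeof(SizeHeader) + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;                         /* user data starts after header */
}

/* Free a block, checking that the caller's size matches the stored one.
 * On a mismatch nothing is freed and false is returned so the bug can be
 * reported. */
bool sized_free(void *p, size_t size) {
    if (p == NULL)
        return true;
    SizeHeader *h = (SizeHeader *)p - 1;
    if (h->size != size)
        return false;
    free(h);
    return true;
}
```

Dropping the header and the check is what makes the allocator cheaper in normal builds, which is the part timo suggests tossing out for the mimalloc experiment.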
maybe that'll be a christmas break project
alloca+mimalloc is 162919560 instructions more than master+mimalloc 23:31
and master is 9017008487 instructions more than master+mimalloc 23:33
all total for 100 runs