Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
timo1 this isn't microoptimization, this is nanooptimization, except when you do nanotechnology you can do impressive things you wouldn't be able to do with regular "small stuff", and nanooptimization is just useless :P 09:52
nine It looks impressive though ;) 09:53
timo1 this improvement in the bytecode came from changing the nqp source code tho, so it really literally only applies in this one frame 10:41
nine It's a common frame though 10:42
lizmat and yet another Rakudo Weekly News hits the Net: rakudoweekly.blog/2023/02/06/2023-...en-davies/ 11:53
el gatito (** advocate) what is a "frame"? 13:41
Voldenet stack frame perhaps 13:49
nine Actually in this context a piece of bytecode, i.e. a code block 13:53
el gatito (** advocate) oh 13:54
timo1 nine: are we talking about the same code? the piece of code inside EXPORTHOW.nqp inside the nqp source? 14:04
does this actually run more than once?
nine I guess once in every process? 14:06
timo1 i see it up to twice (after putting an nqp::sin_n that fprintf stderr in) during build 14:10
nqp startup is only 0.04 so this can barely do anything :P 14:14
japhb I'd love to get that down an order of magnitude, but I suspect that requires more large-scale engineering. :-) 14:16
timo1 for sure
nine I have always wondered what exactly we spend that 100ms on in rakudo startup 14:17
timo1 we do a load of deserialization of stuff for example 14:18
nine Is there so much that we have to deserialize right away? 14:19
timo1 the "work queue" nature of the deserialization work code makes it a little tricky to attribute work to what "caused" it 14:24
japhb Deserialization is lazy, isn't it? 14:44
Kindof wonder if there's any performance benefit to figuring out what needs to be deserialized for -e '' and just doing that always, as fast as we can. 14:45
timo1 deserialization is lazy, yes 14:47
you're thinking maybe less "context switches" would benefit startup performance if we blaze through a whole chunk of serialized data ahead of time? 14:49
nine Even more if we store that stuff close together and benefit from caching 14:53
timo1 how hard is it going to be to reshuffle objects in the serialized blob? 14:54
japhb Yeah, what both of you said
Woodi maybe just serializing it once, compiling it, saving as executable to disc and then just loading it at startup ? if it is too specific then options for compiling different executables, eg. for one-liners, for long running services ? and if it can un-jit when needed then nothing is lost 15:57
Woodi ultimate option: generating asm code / executable that just do what asked, like generating grep-like binary without object stuff... 16:01
Voldenet perl5 starts in 5ms, nqp takes 33ms to start, raku takes 140ms to start, so nqp takes around 25% of startup time 16:07
timo1 the nqp command does some things that the raku command doesn't have to, i'm not sure how much sense it makes to express nqp as a fraction of raku startup 16:08
for example, rakudo shouldn't load the nqp grammar, actions, world, and optimizer 16:09
Voldenet I see, so the only real way to measure perf reliably is to actually use the profiler 16:11
timo1 i would say that is accurate, yeah
don't forget that rakudo doesn't start up too much slower than perl5 if you include some support for classes like moo or moose or whichever 16:12
Voldenet Moo itself takes 15ms to load 16:15
timo1 moose takes a lot longer, right? since moo is kind of "moose but lighter"? 16:16
Voldenet Yeah, moo and mouse are a lot faster, moose takes 120ms 16:18
nine While that makes us look less bad, it distracts from the fact that we could load a lot faster 16:19
timo1 right, not saying we shouldn't improve load times, it is definitely a goal up there in terms of priority 16:20
you think perl5 without any modules is something we could eventually reach? or at least load in 2x to 3x the time? 16:24
nine I honestly don't know. Perl doesn't have to do much when starting up 16:28
Voldenet python3: 23ms, nodejs: 60ms, ruby: 50ms 16:29
timo1 rakudo runs just over 200M branches during startup and misses 1.94% of them, where python runs 15M and misses 4.20% (lol) of them 16:30
how do we feel about turning spesh on a little later than when the program starts? 16:36
disabling spesh gives me a time of 0.181 wallclock vs spesh enabled gives 0.178, but the task-clock is 171msec vs 255msec, which just means when spesh is on we use 1.43 cpus and when it's off we use 0.95 cpus 16:37
Voldenet page-faults is especially big number on raku (17045) compared to python3 (1111) nodejs (2494) ruby (2263) 16:46
timo1 yeah that probably has something to do with how much ram we use also 16:49
and how much of the files we map we read from probably?
Voldenet ram usage probably matters a bit, but nodejs uses 3x more rss than ruby and there's not much of a difference in startup time 16:53
timo1 interesting 16:55
we can use perf to measure where page faults tend to happen
haha, 61% in __memset_sse2_unaligned_erms, 13.4% mi_page_free_list_extend, 9.18% in _dl_relocate_object 16:57
here, MVM_bytecode_unpack has 2.28%, maybe_grow_hash lands at 1.26%, MVM_spesh_log_entry at a surprising (to me) 1.1%, another .85% in MVM_spesh_log_decont, 0.73% in MVM_serialization_demand_object, 0.55% in MVM_spesh_log_type 16:59
this is the page-faults performance counter, not the minor-faults one 17:00
Voldenet so apparently just allocating larger chunks could improve performance 17:01
timo1 for spesh logs we already allocate one big chunk that we just write data to linearly 17:02
Voldenet I can see why page faults here are surprising