Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
timo MVM_nfg_is_concat_stable also takes a good chunk of time. i wonder if there's easy wins there that are also helpful outside of microbenchmarks 00:01
oops, because we use #ifdef instead of #if we have always had the null check at the start of MVM_gc_root_temp_push instead of turning it off when not desired for debugging 00:03
i'm slightly surprised that there are actually calls to MVM_gc_root_temp_push even though it's MVM_STATIC_INLINE 00:07
MasterDuke i was under the impression with modern compilers that one could only suggest to inline, not force it 00:23
with my changes mi_malloc and mi_free essentially disappear, but are 4.3% and 3.4% respectively without 00:26
huh, why can't i annotate functions in perf? `Couldn't annotate MVM_nfg_is_concat_stable: Internal error: Invalid -1 error code`. never seen that before 00:28
why isn't it inlining MVM_nfg_crlf_grapheme!! that's just `return tc->instance->nfg->crlf_grapheme;` 00:35
oh, and `src/strings/siphash/csiphash.h:37:#define MVM_STATIC_INLINE static` is the only definition of MVM_STATIC_INLINE? 00:37
marking MVM_unicode_relative_ccc and MVM_nfg_crlf_grapheme as inline drops the time for 10m iterations down to ~0.39s and the instruction count for 1m iterations down to ~753m 00:44
we don't have anything else with just a plain `inline` annotation, but it definitely helps runtime 01:26
at least for that micro-benchmark 01:27
we spend a lot of time in MVM_disp_program_run when building rakudo 01:51
Geth MoarVM: MasterDuke17++ created pull request #1829:
Fast path for concatenating two in_situ_8 strings
03:09
timo yeah, disp programs do a lot of work, so it makes sense it'd be hot 08:15
some dispatch programs are compiled down to moar bytecode as part of spesh, you'll see them in spesh log output, it's mostly guards and sp_*get_* operations and whatnot 08:17
could be interesting to get statistics of which dispatch program runs how often and for how much time
or even from where it's called most often
Geth MoarVM/main: f1108304ae | MasterDuke17++ (committed using GitHub Web editor) | src/strings/ops.c
Fast path for concatenating two in_situ_8 strings

This happens pretty frequently when building Rakudo (~650k times), so adding a check for it turns out to be beneficial.
08:34
timo MVM_STATIC_INLINE is supposed to be defined in config.h or something like that? the build system figures out what the used compiler needs it to look like 08:42
the define for MVM_STATIC_INLINE in csiphash.h seems incorrect, but at least harmless since it only does it if MVM_STATIC_INLINE is not defined yet 08:44
you probably didn't find the one in config.h because it's in .gitignore since it is not meant to be checked in, and many tools don't show search results in such files
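A minimal sketch of the pattern discussed above: a fallback define that only applies when the build system has not already provided one, plus a one-line accessor marked for inlining. Everything prefixed `DEMO_`/`Demo`/`demo_` is hypothetical and only mirrors the shape of `tc->instance->nfg->crlf_grapheme`; it is not MoarVM's actual config.h or struct layout. Note that `inline` is still only a hint to the compiler.

```c
#include <stdint.h>

/* Fallback only: a generated config header would normally define this first. */
#ifndef DEMO_STATIC_INLINE
#define DEMO_STATIC_INLINE static inline
#endif

/* Hypothetical stand-ins for the thread context -> instance -> NFG state chain. */
typedef struct { int32_t crlf_grapheme; } DemoNFGState;
typedef struct { DemoNFGState *nfg; } DemoInstance;
typedef struct { DemoInstance *instance; } DemoThreadContext;

/* Trivial accessor; a call to it costs more than the load it performs,
 * which is why it is a good candidate for inlining. */
DEMO_STATIC_INLINE int32_t demo_nfg_crlf_grapheme(const DemoThreadContext *tc) {
    return tc->instance->nfg->crlf_grapheme;
}

/* Builds a tiny context on the stack and reads the value back through the accessor. */
static int32_t demo_lookup(int32_t grapheme) {
    DemoNFGState nfg = { grapheme };
    DemoInstance instance = { &nfg };
    DemoThreadContext tc = { &instance };
    return demo_nfg_crlf_grapheme(&tc);
}
```

With a plain `static` fallback (as in the csiphash.h define quoted earlier) the function is still correct, it just carries no inlining hint.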
Geth MoarVM/main: 39b9b88c83 | (Timo Paulssen)++ | src/gc/roots.h
fix check for root debug mode

We used #ifdef here even though we unconditionally define the symbol further up, but we set it to 0 by default. However, the 0 we set it to makes it defined, so we accidentally were building the debug check for temp roots all the time
08:50
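The `#ifdef` versus `#if` difference described in that commit message can be reproduced in a few lines. This is a sketch with a hypothetical macro name (`DEMO_ROOT_DEBUG`), not the actual contents of src/gc/roots.h.

```c
/* Hypothetical debug switch, unconditionally defined and off by default. */
#define DEMO_ROOT_DEBUG 0

/* Returns 1 if the "debug check" branch was compiled in when tested with #ifdef. */
static int demo_check_with_ifdef(void) {
#ifdef DEMO_ROOT_DEBUG
    return 1;   /* taken: the macro is defined, even though its value is 0 */
#else
    return 0;
#endif
}

/* Returns 1 if the "debug check" branch was compiled in when tested with #if. */
static int demo_check_with_if(void) {
#if DEMO_ROOT_DEBUG
    return 1;
#else
    return 0;   /* taken: the macro's value is 0 */
#endif
}
```

`#ifdef` only asks whether the name exists, so defining it to 0 still enables the guarded code; `#if` evaluates the value.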
lizmat would that have performance effects, timo ? 08:52
timo this last one? yes it would.
probably not huge, though
lizmat ok, will have CI run through it and then bump
timo it'll make a bigger difference on something like MD's last microbenchmark, which spends a very large percentage in a function that pushes multiple temp roots every time (MVM_string_concatenate), and less in "real code" where lots of stuff happens outside of such functions 08:53
lizmat well, MD's last fix ate about .5 second of stage parse of the core.c setting for me 08:55
I'll take every .5 second there, as I tend to do that a *lot*
timo you mean improved the time?
timo that's pretty great 08:57
i don't expect the temp root push fix to give anything remotely close to that 08:58
lizmat hmmm looks like it breaks all CI tests ? 09:39
ahhh... caused by rakudo breakage 09:40
yeah, looks like... restarting all CI to make sure 09:52
yeah.. more like .05 second improvement by the looks of it, and that's well within noise levels 10:13
timo after: sp_runcfunc_o r8(12), r17(7), callsite(0x45bf44551a0, 3 arg, 3 pos, nonflattening, interned) - 'dispatcher_drop_n_args_impl' 11:29
before: sp_runcfunc_o r1(2), r28(0), liti64(5903873283424), r0(2), r6(1) 11:30
lizmat and yet another Rakudo Weekly News hits the Net: rakudoweekly.blog/2024/08/05/2024-32-de-python/ 12:58
jdv wow, quite early. going for a longer ride? 13:14
lizmat hehe... sadly, no... just other things I'm working on that need quality time today, and the Weekly inbetween would affect the quality of the remaining time :-) 13:15
jdv it's been terrible weather for riding here, also sadly. Haven't been out much at all. 13:25
lizmat well, I understand Debby is going to drop 470mm of rain in the coming days in some places... 13:28
timo ok so we have an optimization that turns a "get slurpy args" into "fastcreate a hash, then directly bind the arguments by their name into the hash" 13:46
mostly, the next thing that happens in the code is checking for the number of elements in the hash and then branching 13:47
i now have a piece of code that can give a constant result of 0 for elems if nothing is bound into the freshly created hash, eliminating a branch and the BBs that follow it in that case
making it return the correct elems even when stuff is bound into it would also be possible with a bit more work 13:48
slurpy named args* 13:49
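A toy sketch of the idea described above, assuming a single straight-line block of instructions: if `elems` is asked of a hash that was just created and nothing touched it in between, the result is known to be 0. The instruction format and all `Demo`/`demo_` names are hypothetical; the real spesh optimizer works on its own graph and facts, not on this structure.

```c
#include <stddef.h>

typedef enum { OP_CREATE_HASH, OP_BIND_KEY, OP_ELEMS, OP_CONST_I64, OP_OTHER } DemoOp;

typedef struct {
    DemoOp op;
    int dst;    /* register written by this instruction, or -1 */
    int obj;    /* register holding the hash being operated on, or -1 */
    long value; /* immediate, only meaningful for OP_CONST_I64 */
} DemoIns;

/* Rewrites every OP_ELEMS on a freshly created, untouched hash into a
 * constant 0. Returns how many instructions were rewritten. */
static int demo_fold_empty_elems(DemoIns *ins, size_t n) {
    int folded = 0;
    for (size_t i = 0; i < n; i++) {
        if (ins[i].op != OP_ELEMS)
            continue;
        int hash = ins[i].obj;
        int fresh = 0;
        /* Walk backwards towards the instruction that created the hash. */
        for (size_t j = i; j-- > 0; ) {
            if (ins[j].op == OP_CREATE_HASH && ins[j].dst == hash) {
                fresh = 1;
                break;
            }
            /* Anything else reading or writing that register: give up. */
            if (ins[j].obj == hash || ins[j].dst == hash)
                break;
        }
        if (fresh) {
            ins[i].op = OP_CONST_I64;
            ins[i].obj = -1;
            ins[i].value = 0;
            folded++;
        }
    }
    return folded;
}
```

Once `elems` is a constant, a later pass can resolve the branch on it and drop the unreachable basic blocks. Returning the exact count when keys are bound would mean counting the `OP_BIND_KEY` instructions instead of bailing out on them.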
and the award for most elems eliminated goes to src/Perl6/Actions.nqp:9480 with 22 eliminations 13:55
Frame size: 45814 bytes (5512 from inlined frames) Specialization took 100550us (total 119637us) JIT was successful and compilation took 12983us Bytecode size: 148668 byte last BB: 912 14:03
Frame size: 44082 bytes (3860 from inlined frames) Specialization took 99936us (total 119158us) JIT was successful and compilation took 12802us Bytecode size: 142081 byte last BB: 893
38,851.92 msec task-clock / 39,347.00 msec task-clock hmmm 14:54
208,480,312,092 cycles 58,528,061,857 stalled-cycles-frontend 14:55
213,014,411,336 cycles 59,423,848,534 stalled-cycles-frontend
that was with the optimizations in the first line, without the optimizations in the second line 14:56
lizmat so that'd be just under 2% improvement? 14:57
timo with a +/- of 0.4s on the improved one and 0.1s on the unimproved one, somehow 15:03
i think sp_boolify_iter* are no longer used anywhere 15:11
timo runs with a few more optimization thingies, 30 repeats, and takes a nap 15:15
[Coke] timo++ 15:20
timo 37.664 +- 0.117 seconds time elapsed ( +- 0.31% ) 17:26
38.424 +- 0.117 seconds time elapsed ( +- 0.30% )
i dunno, it looks like an actual improvement 17:29
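A quick way to sanity-check the two timings above, as a sketch: compute the relative improvement and compare the gap against the reported +- values. The helper names are hypothetical, and "gap larger than the sum of both +- values" is a deliberately conservative rule of thumb, not a proper significance test.

```c
/* Relative improvement of `after` over `before`, in percent. */
static double demo_improvement_pct(double before, double after) {
    return (before - after) / before * 100.0;
}

/* Returns 1 if the gap between the two means is larger than the sum of
 * their reported +- values, i.e. the two ranges do not overlap. */
static int demo_ranges_separate(double before, double before_pm,
                                double after, double after_pm) {
    return (before - after) > (before_pm + after_pm);
}
```

For the elapsed times quoted above, `demo_improvement_pct(38.424, 37.664)` comes out at just under 2%, and the 0.76 s gap is well clear of the combined 0.234 s of +-.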
Geth MoarVM/runcy_funcy_optimizations: a9e08475b8 | (Timo Paulssen)++ | 3 files
optimize some runcfunc into moar operations

those can then be further optimized, for example boolify_boxed_int to unbox_i that then becomes sp_get_i64 or maybe a set from the original source that boxed the object in the first place.
18:10
MoarVM/runcy_funcy_optimizations: 55c6002b30 | (Timo Paulssen)++ | src/spesh/optimize.c
optimize elems in rare cases

such as when we have the create or fastcreate for the elems directly in front of it and nothing changes it. This happens when we optimize a "get slurpy named arguments" and we know what arguments there are.
Currently we only optimize elems to a constant 0 if nothing is put into the object, but in the future we could get the exact number.
timo lizmat_: you could give this branch a try and see what it does to performance
MasterDuke timo: i must admit i didn't follow your before/after earlier. before had some large literal value and after had a dispatcher call? 21:59