mef Hi, 11:11
lizmat o/ mef 11:11
mef I've downloaded thinkc11@makoto 20:11:53/250215 (devel/MoarVM)% ls ../../distfiles/MoarVM-2025.01.tar.gz 11:12
../../distfiles/MoarVM-2025.01.tar.gz
thinkc11@makoto 20:11:57/250215 (devel/MoarVM)% sha1 ../../distfiles/MoarVM-2025.01.tar.gz
SHA1 (../../distfiles/MoarVM-2025.01.tar.gz) = d43fcbc8140457070e411d94af85b463876571f6
the timestamp inside looks strange
tar ztvf segfaults
lizmat I think this was discussed here before... 11:13
mef OK, thanks. Just use that file is OK ?
lizmat irclogs.raku.org/moarvm/2025-01-25.html#23:23 11:14
mef (sorry, I'm new on this channel)
lizmat yeah, it's a known issue if you downloaded it from moarvm.org, it should be ok
mef the packaging can be done OK, ignoring some time stamp warning. 11:17
I'll try zip version.
(I may be reading zip related topic wrongly ;-) 11:19
(I'm on NetBSD by the way, thanks always) 11:20
MasterDuke well, i've now found even more slower ways to do find_cclass() github.com/MoarVM/MoarVM/compare/m...ass_slower 16:47
tellable6 2025-02-13T20:43:26Z #moarvm <lizmat> MasterDuke17 looks like the find_cclass update is causing HTTP::Tiny to fail
2025-02-13T20:44:04Z #moarvm <lizmat> MasterDuke17 symptoms are having newlines at the end of a string where they were not expected according to the test
MasterDuke but thanks timo for fixing the 8-bit case 16:48
lizmat yeah, glad t was fixed so quickly. also glad it was spotted by an early blin run 16:49
MasterDuke jnthn, nine, timo: any idea how difficult that python interpreter change (from computed goto to tail-calling) would be to implement for moarvm? 16:53
jnthn: it looks like moarvm uses the cytron paper to compute SSA form in spesh. i ran across bernsteinbear.com/blog/ssa/ the other day and wondered if there would be any benefit to switching to a different algorithm? 16:56
MasterDuke can i quickly tell if an array of utf8 bytes will result in any 32-bit graphemes without decoding it/them? 18:33
we have `MVM_string_buf32_can_fit_into_8bit()`, but that takes an array of already-decoded graphemes
timo it's not too hard to detect the case of "doesn't have even a single continuation byte" 18:43
however, i'm not sure "only 7bit codepoints" corresponds directly to "no 32bit graphemes" 18:44
after NFG happens
are there any combining characters in the lowest 127 unicode codepoints? 18:45
MasterDuke dunno 18:46
timo if that is possible, a utf8 input that only has no-continuation-byte codepoints encoded in it could require a new synthetic codepoint to be allocated, and if that's possible, we just need 126 or 125 different sequences to be created in order to spill over into 32bit grapheme territory 18:47
MasterDuke so as long as the byte sequence was less than 125 bytes long it could be know to be safe? 18:48
timo no
synthetics are global per instance
they also do not get recycled 18:49
MasterDuke hm. well, faking it by calling `MVM_string_buf32_can_fit_into_8bit((MVMGrapheme32 *)utf8, bytes)` and then duplicating the decoding code into branches for 8 vs 32 doesn't seem to be faster, so i guess it doesn't really matter 18:53
timo have you been looking at something like perf's per-instruction annotated assembly output for some metrics? 18:54
MasterDuke sort of. i know arm assembly even less than x86. mostly just looking at the time taken to call `"CORE.c.setting".IO.slurp.lines` a bunch of times in a loop 18:56
timo yeah, it will probably be quite tricky to figure out what's making it slow without some very low-level output i'm thinking 18:57
MasterDuke since i figure that has a good mix of line lengths and char sizes
timo doesn't have to be perf, can be cachegrind's cache simulation for example that could be of interest
we don't have anything at the raku level that decodes utf8 without also doing normalization? 19:00
ugexe i haven't followed along, but is there still a performance benefit for those string code paths despite any additional fixes that came after the original bench marking?
im curious but im also too lazy to look at the fixes to have an intuitive idea 19:01
timo right, the fix made it more expensive again
MasterDuke haven't checked
timo that would be worthwhile. maybe we should actually revert that and continue searching for something faster
MasterDuke gist.github.com/MasterDuke17/dc4d3...3ece2ee734 19:03
that's MVM_string_utf8_decode() at HEAD of main
timo oh in my head i was still at the "found more ways to make find_cclass slower" from earlier today 19:04
i know almost none of these mnemonics :D 19:05
MasterDuke i'm looking at both MVM_string_utf8_decode and MVM_string_find_cclass, they're 2 and 3 in perf for that example i gave 19:06
i think MVM_unicode_normalizer_get_grapheme has tons of branches 19:07
or MVM_unicode_normalizer_process_codepoint_to_grapheme 19:08
timo branches can be unproblematic if they are overwhelmingly predicted correctly 19:10
but i would presume that for diverse input text like the core setting there's plenty times when it hits some rare branches? 19:11
MasterDuke also the code is largish, so might not get inlined well
timo branch mispredictions are also a thing perf can measure 19:12
MasterDuke 0.33% of all branches missed 19:14
timo since inlining of C code happens only at compile time, you should be able to see it from the assembly view, whether it goes back and forth between one function and the other 19:15
if you crank the -F parameter up a bit, we should see a slightly more even spread of instructions with nonzero percent values, which will also give a little bit of a hint about coverage 19:21
in the last gif you shared, i think that really just looks like the main place it's hitting is where it has to actually wait for new parts of the input to arrive from memory. so that's the "ldr q31, [x0] #16" that queues the load and gets hit by sampling only 0.29% and the "bic v31.4s, #0x7f" after it which gets 5.55% of the hits 19:24
ah, there are entries with more than 10% further up too 19:25
sorry, i'm really not entirely completely good at this 19:27
MasterDuke gist.github.com/MasterDuke17/eefd4...d3f4d27ba2 is with `-F max` and for 200 loop iterations instead of 50 19:28
gist.github.com/MasterDuke17/0213d...ed0db46104 is the same thing on my x86 desktop 19:32
or what about a fast way to find out if anything will need normalization? then we could have a decode body that doesn't need to go through the normalizer 19:34
timo that will probably require attention every time we do a unicode update 19:35
MasterDuke oh, don't we need to do a unicode update now regardless? 19:36
we're on 15... 19:37
yeah, 16 came out in september 19:38
timo here's an idea that is probably worth barely anything at all: is it always safe to skip the `if (n->buffer_norm_end == n->buffer_start)` check from inside this code? actually, i thought i saw some time spent there but now i can't find it again 19:41
gist.github.com/MasterDuke17/0213d...1-txt-L215 maybe i was seeing this? 19:47
MasterDuke don't really notice a difference after removing it 19:49
timo OK
the unicode stuff is unfortunately an area where i've only got a surface-level understanding :| 19:58
MasterDuke and i've got less than that 19:59
timo i'm going to have a look at the intel vtune profiler 21:22
jnthn .tell MasterDuke It's not clear to me that any of the other algorithms mentioend there offer a significant performance win. I believe it's using a more modern (and faster) algorithm than was available when the Cryton paper was written to calculate the dominance. 22:27
tellable6 jnthn, I'll pass your message to MasterDuke 22:28
jnthn .tell MasterDuke to be clear, MoarVM is already using the faster dominance algorithm
tellable6 jnthn, I'll pass your message to MasterDuke