00:37
guifa left
10:23
sena_kun joined
11:11
mef joined
|
|||
mef | Hi, | 11:11 | |
11:11
sena_kun left
|
|||
lizmat | o/ mef | 11:11 | |
11:11
sena_kun joined
|
|||
mef | I've downloaded thinkc11@makoto 20:11:53/250215 (devel/MoarVM)% ls ../../distfiles/MoarVM-2025.01.tar.gz | 11:12 | |
../../distfiles/MoarVM-2025.01.tar.gz | |||
thinkc11@makoto 20:11:57/250215 (devel/MoarVM)% sha1 ../../distfiles/MoarVM-2025.01.tar.gz | |||
SHA1 (../../distfiles/MoarVM-2025.01.tar.gz) = d43fcbc8140457070e411d94af85b463876571f6 | |||
the timestamp inside looks strange | |||
tar ztvf segfaults | |||
lizmat | I think this was discussed here before... | 11:13 | |
mef | OK, thanks. Just use that file is OK ? | ||
lizmat | irclogs.raku.org/moarvm/2025-01-25.html#23:23 | 11:14 | |
mef | (sorry, I'm new on this channel) | ||
lizmat | yeah, it's a known issue if you downloaded it from moarvm.org, it should be ok | ||
mef | the packaging can be done OK, ignoring some time stamp warning. | 11:17 | |
I'll try zip version. | |||
(I may be reading zip related topic wrongly ;-) | 11:19 | ||
(I'm on NetBSD by the way, thanks always) | 11:20 | ||
15:37
sena_kun left
15:40
sena_kun joined
15:41
camelia left
15:42
camelia joined
15:49
sena_kun left
16:46
MasterDuke joined
|
|||
MasterDuke | well, i've now found even more slower ways to do find_cclass() github.com/MoarVM/MoarVM/compare/m...ass_slower | 16:47 | |
tellable6 | 2025-02-13T20:43:26Z #moarvm <lizmat> MasterDuke17 looks like the find_cclass update is causing HTTP::Tiny to fail | ||
2025-02-13T20:44:04Z #moarvm <lizmat> MasterDuke17 symptoms are having newlines at the end of a string where they were not expected according to the test | |||
MasterDuke | but thanks timo for fixing the 8-bit case | 16:48 | |
lizmat | yeah, glad t was fixed so quickly. also glad it was spotted by an early blin run | 16:49 | |
MasterDuke | jnthn, nine, timo: any idea how difficult that python interpreter change (from computed goto to tail-calling) would be to implement for moarvm? | 16:53 | |
jnthn: it looks like moarvm uses the cytron paper to compute SSA form in spesh. i ran across bernsteinbear.com/blog/ssa/ the other day and wondered if there would be any benefit to switching to a different algorithm? | 16:56 | ||
17:07
MasterDuke left
18:31
MasterDuke joined
|
|||
MasterDuke | can i quickly tell if an array of utf8 bytes will result in any 32-bit graphemes without decoding it/them? | 18:33 | |
we have `MVM_string_buf32_can_fit_into_8bit()`, but that takes an array of already-decoded graphemes | |||
timo | it's not too hard to detect the case of "doesn't have even a single continuation byte" | 18:43 | |
however, i'm not sure "only 7bit codepoints" corresponds directly to "no 32bit graphemes" | 18:44 | ||
after NFG happens | |||
are there any combining characters in the lowest 127 unicode codepoints? | 18:45 | ||
MasterDuke | dunno | 18:46 | |
timo | if that is possible, a utf8 input that only has no-continuation-byte codepoints encoded in it could require a new synthetic codepoint to be allocated, and if that's possible, we just need 126 or 125 different sequences to be created in order to spill over into 32bit grapheme territory | 18:47 | |
MasterDuke | so as long as the byte sequence was less than 125 bytes long it could be know to be safe? | 18:48 | |
timo | no | ||
synthetics are global per instance | |||
they also do not get recycled | 18:49 | ||
MasterDuke | hm. well, faking it by calling `MVM_string_buf32_can_fit_into_8bit((MVMGrapheme32 *)utf8, bytes)` and then duplicating the decoding code into branches for 8 vs 32 doesn't seem to be faster, so i guess it doesn't really matter | 18:53 | |
timo | have you been looking at something like perf's per-instruction annotated assembly output for some metrics? | 18:54 | |
MasterDuke | sort of. i know arm assembly even less than x86. mostly just looking at the time taken to call `"CORE.c.setting".IO.slurp.lines` a bunch of times in a loop | 18:56 | |
timo | yeah, it will probably be quite tricky to figure out what's making it slow without some very low-level output i'm thinking | 18:57 | |
MasterDuke | since i figure that has a good mix of line lengths and char sizes | ||
timo | doesn't have to be perf, can be cachegrind's cache simulation for example that could be of interest | ||
we don't have anything at the raku level that decodes utf8 without also doing normalization? | 19:00 | ||
ugexe | i haven't followed along, but is there still a performance benefit for those string code paths despite any additional fixes that came after the original bench marking? | ||
im curious but im also too lazy to look at the fixes to have an intuitive idea | 19:01 | ||
timo | right, the fix made it more expensive again | ||
MasterDuke | haven't checked | ||
timo | that would be worthwhile. maybe we should actually revert that and continue searching for something faster | ||
MasterDuke | gist.github.com/MasterDuke17/dc4d3...3ece2ee734 | 19:03 | |
that's MVM_string_utf8_decode() at HEAD of main | |||
timo | oh in my head i was still at the "found more ways to make find_cclass slower" from earlier today | 19:04 | |
i know almost none of these mnemonics :D | 19:05 | ||
MasterDuke | i'm looking at both MVM_string_utf8_decode and MVM_string_find_cclass, they're 2 and 3 in perf for that example i gave | 19:06 | |
i think MVM_unicode_normalizer_get_grapheme has tons of branches | 19:07 | ||
or MVM_unicode_normalizer_process_codepoint_to_grapheme | 19:08 | ||
timo | branches can be unproblematic if they are overwhelmingly predicted correctly | 19:10 | |
but i would presume that for diverse input text like the core setting there's plenty times when it hits some rare branches? | 19:11 | ||
MasterDuke | also the code is largish, so might not get inlined well | ||
timo | branch mispredictions are also a thing perf can measure | 19:12 | |
MasterDuke | 0.33% of all branches missed | 19:14 | |
timo | since inlining of C code happens only at compile time, you should be able to see it from the assembly view, whether it goes back and forth between one function and the other | 19:15 | |
if you crank the -F parameter up a bit, we should see a slightly more even spread of instructions with nonzero percent values, which will also give a little bit of a hint about coverage | 19:21 | ||
in the last gif you shared, i think that really just looks like the main place it's hitting is where it has to actually wait for new parts of the input to arrive from memory. so that's the "ldr q31, [x0] #16" that queues the load and gets hit by sampling only 0.29% and the "bic v31.4s, #0x7f" after it which gets 5.55% of the hits | 19:24 | ||
ah, there are entries with more than 10% further up too | 19:25 | ||
sorry, i'm really not entirely completely good at this | 19:27 | ||
MasterDuke | gist.github.com/MasterDuke17/eefd4...d3f4d27ba2 is with `-F max` and for 200 loop iterations instead of 50 | 19:28 | |
gist.github.com/MasterDuke17/0213d...ed0db46104 is the same thing on my x86 desktop | 19:32 | ||
or what about a fast way to find out if anything will need normalization? then we could have a decode body that doesn't need to go through the normalizer | 19:34 | ||
timo | that will probably require attention every time we do a unicode update | 19:35 | |
MasterDuke | oh, don't we need to do a unicode update now regardless? | 19:36 | |
we're on 15... | 19:37 | ||
yeah, 16 came out in september | 19:38 | ||
timo | here's an idea that is probably worth barely anything at all: is it always safe to skip the `if (n->buffer_norm_end == n->buffer_start)` check from inside this code? actually, i thought i saw some time spent there but now i can't find it again | 19:41 | |
gist.github.com/MasterDuke17/0213d...1-txt-L215 maybe i was seeing this? | 19:47 | ||
MasterDuke | don't really notice a difference after removing it | 19:49 | |
timo | OK | ||
the unicode stuff is unfortunately an area where i've only got a surface-level understanding :| | 19:58 | ||
MasterDuke | and i've got less than that | 19:59 | |
20:39
MasterDuke left
|
|||
timo | i'm going to have a look at the intel vtune profiler | 21:22 | |
22:19
sena_kun joined
|
|||
jnthn | .tell MasterDuke It's not clear to me that any of the other algorithms mentioend there offer a significant performance win. I believe it's using a more modern (and faster) algorithm than was available when the Cryton paper was written to calculate the dominance. | 22:27 | |
tellable6 | jnthn, I'll pass your message to MasterDuke | 22:28 | |
jnthn | .tell MasterDuke to be clear, MoarVM is already using the faster dominance algorithm | ||
tellable6 | jnthn, I'll pass your message to MasterDuke |