Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
MasterDuke after way too much flailing at the keyboard (i hate c++ libraries), i managed to call simdutf's validate_utf8() in moarvm. so i added a fast path (that still uses our normalization) when decoding utf8 if validate_utf8() says it's valid 01:23
i was hoping to measure if doing the validation + fast path was faster than what we do now (decoding that checks for validity along the way) 01:24
but annoyingly, `raku -e ''` dies with `insert duplicate key at SETTING::src/core.c/allomorphs.rakumod:309  (/home/dan/r/install/share/perl6/runtime/CORE.c.setting.moarvm:val)` 01:25
github.com/MoarVM/MoarVM/compare/m...th_simdutf if anybody feels like seeing what i tried 01:28
if i use a moarvm built on main with a printf added in the UTF8_REJECT case, it never hits. so i guess my `decode_utf8_byte_nocheck()` is wrong? 01:44
ugexe do you know what the string (key) is? maybe that is a clue 01:45
MasterDuke not yet, working on that now 01:46
timo i think your code is treating utf8 bytes as if they were codepoints? 01:54
ready = MVM_unicode_normalizer_process_codepoint_to_grapheme(tc, &norm, decode_utf8_byte_nocheck((MVMuint8)*utf8), &g);
like, this only takes a single byte and feeds it directly into codepoint_to_grapheme? 01:56
MasterDuke but the old code did the same? `decode_utf8_byte(&state, &codepoint, (MVMuint8)*utf8)` writes a value into `codepoint`, and then `ready = MVM_unicode_normalizer_process_codepoint_to_grapheme(tc, &norm, codepoint, &g);` 01:57
timo that takes the value from "codepoint", which is only valid if the state that decode_utf8_byte returns is UTF8_ACCEPT 01:58
i think it does that only after a full utf8 codepoint was read
MasterDuke ah, i was assuming that continually getting UTF8_ACCEPT would be the same as `validate_utf8()` returning true 01:59
timo right, that's not the case
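To illustrate timo's point that a single UTF-8 byte is not a codepoint: a fast path over input that has already been validated still has to gather all the bytes of each sequence before passing one value to the normalizer. The sketch below is not MoarVM code. `decode_valid_utf8_codepoint` and `decode_valid_utf8` are hypothetical names, and the code assumes the caller has already confirmed the buffer is valid UTF-8 (for example with a validator such as `validate_utf8()`), so it performs no error checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of MoarVM): decode ONE codepoint from input
 * that is already known to be valid UTF-8. Returns the number of bytes used.
 * The lead byte tells us the sequence length; continuation bytes each
 * contribute their low 6 bits. */
static size_t decode_valid_utf8_codepoint(const uint8_t *s, uint32_t *cp) {
    assert(s != NULL && cp != NULL);
    if (s[0] < 0x80) {                       /* 0xxxxxxx: ASCII, 1 byte */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {             /* 110xxxxx: 2-byte sequence */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6)
            |  (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {             /* 1110xxxx: 3-byte sequence */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            |  (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    /* 11110xxx: 4-byte sequence (the only case left in valid input) */
    *cp = ((uint32_t)(s[0] & 0x07) << 18)
        | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6)
        |  (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Hypothetical driver: decode a whole validated buffer into `out`, which
 * must have room for `len` entries (one codepoint per byte is the worst
 * case). Returns the number of codepoints written. In a real fast path,
 * each decoded codepoint would be handed to the normalizer here. */
static size_t decode_valid_utf8(const uint8_t *s, size_t len, uint32_t *out) {
    size_t i = 0, n = 0;
    while (i < len) {
        i += decode_valid_utf8_codepoint(s + i, &out[n]);
        n++;
    }
    return n;
}
```

The difference from feeding every byte to the normalizer is that the loop advances by the length of the whole sequence, so the normalizer would see one value per codepoint, not one per byte.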
MasterDuke oh, i guess the old code actually only throws if there are two UTF8_REJECT in a row 02:00
timo i think it jumps backwards after the first UTF8_REJECT
MasterDuke hm. is there actually an easy way to use simdutf? 02:03
or does our use of NFG make it impossible?
since simdutf doesn't (yet) do any normalization 02:04
could we use simdutf to decode if valid, and then post-process the result to get it into NFG? 02:05
timo no clue tbh 02:10
the speed benefit to simdutf, and many other things, may be in large part that they don't have to split the input into codepoints? maybe? 02:12
MasterDuke i just tried printing the results of simdutf::count_utf8(), simdutf::utf8_length_from_latin1(), and the final `count` from our processing. `count` was always equal or slightly smaller (which is what i expected), so maybe if it's fast enough using that to preallocate the buffer to guaranteed-to-be-big-enough would be helpful 02:24
oh, looks like more often we allocate it bigger than we need to, so maybe we'd be more likely to save memory (and the realloc to shrink it that we do if it's too big) 02:29
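The preallocation idea above can be sketched with a scalar codepoint count. This is only a stand-in for what a codepoint counter such as `simdutf::count_utf8()` computes, not MoarVM or simdutf code, and `count_utf8_codepoints` is a hypothetical name. It assumes valid UTF-8 input. Whether the codepoint count is always a safe upper bound for the final grapheme `count` is based only on the observation reported above, not on a proof.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count codepoints in valid UTF-8 by counting every
 * byte that is NOT a continuation byte (continuation bytes look like
 * 10xxxxxx). Each codepoint has exactly one such lead byte. */
static size_t count_utf8_codepoints(const uint8_t *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += (s[i] & 0xC0) != 0x80;
    }
    return n;
}
```

A caller could size the grapheme buffer from this count before decoding, which might avoid both growing the buffer during decoding and shrinking it afterwards, provided the count itself is cheap enough to compute.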
afk 02:47