Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes. Set by lizmat on 24 May 2021. |
|||
01:20
MasterDuke joined
|
|||
MasterDuke | after way too much flailing at the keyboard (i hate c++ libraries), i managed to call simdutf's validate_utf8() in moarvm. so i added a fast path (that still uses our normalization) when decoding utf8 if validate_utf8() says it's valid | 01:23 | |
i was hoping to measure if doing the validation + fast path was faster than what we do now (decoding that checks for validity along the way) | 01:24 | ||
but annoyingly, `raku -e ''` dies with `insert duplicate key | 01:25 | ||
at SETTING::src/core.c/allomorphs.rakumod:309 (/home/dan/r/install/share/perl6/runtime/CORE.c.setting.moarvm:val)` | |||
github.com/MoarVM/MoarVM/compare/m...th_simdutf if anybody feels like seeing what i tried | 01:28 | ||
if i use a moarvm built on main with a printf added in the UTF8_REJECT case, it never hits. so i guess my `decode_utf8_byte_nocheck()` is wrong? | 01:44 | ||
ugexe | do you know what the string (key) is? maybe that is a blue | 01:45 | |
clue | |||
MasterDuke | not yet, working on that now | 01:46 | |
timo | i think your code is treating utf8 bytes as if they were codepoints? | 01:54 | |
ready = MVM_unicode_normalizer_process_codepoint_to_grapheme(tc, &norm, decode_utf8_byte_nocheck((MVMuint8)*utf8), &g); | |||
like, this only takes a single byte and feeds it directly into codepoint_to_grapheme? | 01:56 | ||
MasterDuke | but the old code did the same? `decode_utf8_byte(&state, &codepoint, (MVMuint8)*utf8)` writes a value in to `codepoint`, and then `ready = MVM_unicode_normalizer_process_codepoint_to_grapheme(tc, &norm, codepoint, &g);` | 01:57 | |
timo | that takes the value from "codepoint", which is only valid if the state that decode_utf8_byte returns is UTF8_ACCEPT | 01:58 | |
i think it does that only after a full utf8 codepoint was read | |||
MasterDuke | ah, i was assuming that continually getting UTF8_ACCEPT would be the same as `validate_utf8()` returning true | 01:59 | |
timo | right, that's not the case | ||
MasterDuke | oh, i guess the old code actually only throws if there are two UTF8_REJECT in a row | 02:00 | |
timo | i think it jumps backwards after the first UTF8_REJECT | ||
MasterDuke | hm. is there actually an easy way to use simdutf? | 02:03 | |
or does our use of NFG make it impossible? | |||
since simdutf doesn't (yet) do any normalization | 02:04 | ||
could we use simdutf to decode if valid, and then post-process the result to get it into NFG? | 02:05 | ||
timo | no clue tbh | 02:10 | |
the speed benefit to simdutf, and many other things, may be in large part that they don't have to split the input into codepoints? maybe? | 02:12 | ||
MasterDuke | i just tried printing the results of simdutf::count_utf8(), simdutf::utf8_length_from_latin1(), and the final `count` from our processing. `count` was always equal or slightly smaller (which is what i expected), so maybe if it's fast enough using that to preallocate the buffer to guaranteed-to-be-big-enough would be helpful | 02:24 | |
oh, looks like more often we allocate it bigger than we need to, so maybe we'd be more likely to save memory (and the realloc to shrink it that we do if it's too big) | 02:29 | ||
afk | 02:47 | ||
02:52
MasterDuke left
04:53
ab5tract left
04:56
ab5tract joined
10:48
nebuchadnezzar joined
11:15
sena_kun joined
12:33
nebuchadnezzar left
14:06
sena_kun left
14:07
sena_kun joined
22:45
sena_kun left
23:27
bloatable6 left,
benchable6 left,
shareable6 left,
greppable6 left,
notable6 left,
sourceable6 left,
evalable6 left,
releasable6 left,
unicodable6 left,
linkable6 left,
coverable6 left,
committable6 left,
nativecallable6 left,
quotable6 left,
bisectable6 left,
tellable6 left
23:29
committable6 joined
23:30
sourceable6 joined,
linkable6 joined,
nativecallable6 joined,
notable6 joined,
greppable6 joined,
coverable6 joined,
quotable6 joined,
releasable6 joined,
guifa joined
23:31
evalable6 joined,
bloatable6 joined,
bisectable6 joined,
shareable6 joined,
benchable6 joined
23:32
unicodable6 joined,
tellable6 joined
|