Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
timo anyway, a non-deduplicated coverage log file of core.c setting compilation is just over 80 gigabytes big 00:41
made QASTNode's "set" method use sorted_keys and now it's taking a long while to find the first difference in the logs 00:47
line 185900790 had the first difference 00:53
linkable6 (2022-01-05) github.com/danaisi6/folder17434280...2794c6356e commit 185900790 00:54
timo ... uhhhh?
[Coke] Weird 01:57
07:46 [Coke] left 07:48 [Coke] joined 15:49 Nicholas left 16:04 Nicholas joined 16:46 lizmat joined 18:15 lizmat left 18:48 lizmat joined 18:49 lizmat left, lizmat joined 19:59 lizmat_ joined, lizmat left 20:31 lizmat_ left, lizmat joined 20:54 lizmat left, MasterDuke joined
MasterDuke does anyone have any thoughts on how to do normalization (NFG) faster? 20:55
20:55 lizmat joined
MasterDuke i've been experimenting with using simdutf (github.com/simdutf/simdutf) in `MVM_string_utf8_decode()`, but i believe it's the normalizing that's taking the time 20:58
i added an `if (<is valid utf8 according to simdutf>) { <decode with simdutf>; <normalize with our code>; } else { <the existing code> }` 21:01
but it isn't any faster. i guess there's a chance i wouldn't get as much of a speedup on this arm laptop, but i do believe there is at least some neon code in simdutf 21:02
i think part of the problem is that simdutf writes into a char buffer, but our normalization works on/creates an int buffer 21:07
japhb How is it writing into a char buffer if it is decoding a UTF8 bytestream? Is it only doing that if it can guarantee ASCII or Latin-* codepoint sets or something? 21:09
MasterDuke yeah, that's for latin1_to_utf8. if you do utf8_to_utf16 it writes into a char16_t buffer 21:11
i'm not tied to that library, any and all suggestions are welcome 21:13
so right now i have the code malloc two buffer, a char and an int. but that seems non-optimal. i then tried just casting the last quarter of the int buffer into a char buffer, decoding into that, and then normalizing from there into the start of the int buffer 21:16
but that didn't work
japhb Part of the annoying problem is that as of a couple years ago, there wasn't much software that could *correctly* normalize. As language support goes, only Raku and Swift had it right. And a lot of libraries could validate a UTF-8 stream, and that it was properly *de*normalized (which IIRC is the first step to renormalizing as NFG), but that's not the hardest part.
MasterDuke you know, i've known that swift support was good, but i've never looked at how they implement it 21:19
japhb Might be worth a look I suppose. Curious if they do everything bespoke or if they've found (or created) an independent library that accelerates all or part of it.
patrickb Can I in new-disp prevent dispatch programs from being recorded? I.e. force to always cal 21:22
do the record phase and never create or run a dispatch program?
Or phrased differently, can I misuse the dispatch mechanism to hook into some function calls to run some custom code before reaching the caller? 21:24
I suspect it's a probably a stupid idea and I shouldn't try...
japhb I suspect that something that can absurdly fast go from (proposed UTF8 bytestream) --> valid? + (max codepoint width) + NFD would be quite welcome though
MasterDuke hm. looks like github.com/swiftlang/swift/blob/ma...tion.swift is implemented in swift. that's not immediately useful for us 21:27
patrickb: that sounds like a 'hold my beer' challenge 21:28
zhaskell.github.io/utf8rewind/html...9837ca69db looks like it does normalization 21:41
no idea if it's any faster than what moarvm has implemented
and doesn't do NFG (which is just a rakudo/moarvm thing, right?) 21:42
hm, i don't really know the difference between NFD and NFG. would it be fast to convert from NFD to NFG? 21:44
japhb NFD means "Normalization Form: canonically Decomposed" 21:56
Similarly, NFKD is "NF: C[K]ompatibility Decomposed" -- this is less common
NFC is "NF: canonically Composed" (and there's NFKC for the compatibility form) 21:57
These all work at the level of available codepoints; if there is a pre-composed codepoint already defined in Unicode for some sequence of base character and modifiers, great, you can replace the modified sequence with a single replacement codepoint. But if there isn't, it needs to remain decomposed, because there's no defined thing to compose it to. 21:58
NFG is "NF: Grapheme composed" meaning that *all* possible sequences of base character and accents/modifiers (together creating a grapheme, which is what a user would tend to think of a "character" as) are composed into a single unit, even if that unit isn't already in the Unicode list. (But preferring to use the Unicode composed codepoints if they exist.) 22:00
One of the key things about the decomposition phase is that it gives a strict ordering to all the accents/modifiers, so that two strings that have the same graphemes also have the same codepoint sequence in normalized decomposed form. 22:02
That's used when composing, to make sure normalized composition does the same thing (produces the same sequence of codepoints or NFG units) no matter the exact mess that got fed in as input. 22:03
So typically the sequence is (input --> UTF-8 decode --> NFD --> NFG --> str/Str), at least semantically (even if the real code smashes some of those conversions together) 22:05
22:06 lizmat left
japhb One other key thing that some other languages/engines/VMs don't have to deal with is that MoarVM allows a single string to be made of lots of individual pieces that are strung together -- and normalization has to do the right thing across the boundaries between them. 22:07
And you can't do it "in place" because strings are supposed to be immutable. 22:08
22:09 MasterDuke left
japhb samcv, timo, and of course jnthn have all spent a fair amount of time on these sections of the code and can probably give you a more official rundown than I just did. 22:09
MCP: EOL. 22:11