#moarvm on 10 May 2025 - Raku Programming Language Log

Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes. Set by lizmat on 24 May 2021.
timo	anyway, a non-deduplicated coverage log file of core.c setting compilation is just over 80 gigabytes big	00:41
	made QASTNode's "set" method use sorted_keys and now it's taking a long while to find the first difference in the logs	00:47
	line 185900790 had the first difference	00:53
linkable6	(2022-01-05) github.com/danaisi6/folder17434280...2794c6356e commit 185900790	00:54
timo	... uhhhh?
[Coke]	Weird	01:57
07:46 [Coke] left 07:48 [Coke] joined 15:49 Nicholas left 16:04 Nicholas joined 16:46 lizmat joined 18:15 lizmat left 18:48 lizmat joined 18:49 lizmat left, lizmat joined 19:59 lizmat_ joined, lizmat left 20:31 lizmat_ left, lizmat joined 20:54 lizmat left, MasterDuke joined
MasterDuke	does anyone have any thoughts on how to do normalization (NFG) faster?	20:55
20:55 lizmat joined
MasterDuke	i've been experimenting with using simdutf (github.com/simdutf/simdutf) in `MVM_string_utf8_decode()`, but i believe it's the normalizing that's taking the time	20:58
	i added an `if (<is valid utf8 according to simdutf>) { <decode with simdutf>; <normalize with our code>; } else { <the existing code> }`	21:01
	but it isn't any faster. i guess there's a chance i wouldn't get as much of a speedup on this arm laptop, but i do believe there is at least some neon code in simdutf	21:02
	i think part of the problem is that simdutf writes into a char buffer, but our normalization works on/creates an int buffer	21:07
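
The branch structure described above, plus a way to check which stage dominates, can be sketched roughly as follows. This is only an illustration in Python, not MoarVM or simdutf code: the strict built-in decoder stands in for the "valid according to simdutf" fast path, `errors="replace"` is a hypothetical stand-in for "the existing code", and NFC from `unicodedata` stands in for NFG. The function names are made up for this sketch.

```python
import time
import unicodedata


def decode_and_normalize(data: bytes) -> str:
    """Validate-then-decode fast path with a lenient fallback."""
    try:
        text = data.decode("utf-8")  # strict: raises on invalid UTF-8
    except UnicodeDecodeError:
        # Hypothetical stand-in for the "existing code" branch.
        text = data.decode("utf-8", errors="replace")
    # NFC is used here only as a stand-in for NFG.
    return unicodedata.normalize("NFC", text)


def time_stages(data: bytes, repeats: int = 5) -> tuple[float, float]:
    """Return (decode_seconds, normalize_seconds), summed over repeats."""
    decode_s = 0.0
    normalize_s = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        text = data.decode("utf-8")
        t1 = time.perf_counter()
        unicodedata.normalize("NFC", text)
        t2 = time.perf_counter()
        decode_s += t1 - t0
        normalize_s += t2 - t1
    return decode_s, normalize_s


sample = ("cafe\u0301 nai\u0308ve " * 20_000).encode("utf-8")
decode_s, normalize_s = time_stages(sample)
print(f"decode: {decode_s:.4f}s  normalize: {normalize_s:.4f}s")
```

If the normalize figure is much larger than the decode figure for representative input, a faster decoder alone is unlikely to change the total by much.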
japhb	How is it writing into a char buffer if it is decoding a UTF8 bytestream? Is it only doing that if it can guarantee ASCII or Latin-* codepoint sets or something?	21:09
MasterDuke	yeah, that's for latin1_to_utf8. if you do utf8_to_utf16 it writes into a char16_t buffer	21:11
	i'm not tied to that library, any and all suggestions are welcome	21:13
	so right now i have the code malloc two buffers, a char and an int. but that seems non-optimal. i then tried just casting the last quarter of the int buffer into a char buffer, decoding into that, and then normalizing from there into the start of the int buffer	21:16
	but that didn't work
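
The single-buffer idea can be illustrated with a toy in Python. It is a sketch under assumptions, not the code from the experiment: it only widens one byte to one 32-bit unit (no normalization at all), and `widen_in_place` is a hypothetical name. The narrow data is parked in the last quarter of the wide buffer and read front to back while the wide units are written from the start.

```python
def widen_in_place(latin1: bytes) -> list[int]:
    """Widen 1-byte units to 4-byte units inside a single buffer."""
    n = len(latin1)
    if n == 0:
        return []
    buf = bytearray(4 * n)            # room for n 32-bit units
    buf[3 * n:] = latin1              # park the narrow data in the last quarter
    wide = memoryview(buf).cast("I")  # view the same bytes as unsigned ints
    for i in range(n):
        # Read byte i of the tail, then write wide unit i. The write ends at
        # byte 4*i + 4, which never passes the next unread byte at 3*n + i + 1.
        wide[i] = buf[3 * n + i]
    return wide.tolist()


print(widen_in_place(b"caf\xe9"))  # [99, 97, 102, 233]
```

The overlap is safe here only because each input unit produces exactly one output unit. A normalizer that can emit several codepoints for one input unit (decomposition, for example) could let the write position overtake the read position; whether that is what went wrong in the experiment above is not known from the log.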
japhb	Part of the annoying problem is that as of a couple years ago, there wasn't much software that could correctly normalize. As language support goes, only Raku and Swift had it right. And a lot of libraries could validate a UTF-8 stream, and that it was properly denormalized (which IIRC is the first step to renormalizing as NFG), but that's not the hardest part.
MasterDuke	you know, i've known that swift support was good, but i've never looked at how they implement it	21:19
japhb	Might be worth a look I suppose. Curious if they do everything bespoke or if they've found (or created) an independent library that accelerates all or part of it.
patrickb	Can I in new-disp prevent dispatch programs from being recorded? I.e. force to always cal	21:22
	do the record phase and never create or run a dispatch program?
	Or phrased differently, can I misuse the dispatch mechanism to hook into some function calls to run some custom code before reaching the caller?	21:24
	I suspect it's probably a stupid idea and I shouldn't try...
japhb	I suspect that something that can absurdly fast go from (proposed UTF8 bytestream) --> valid? + (max codepoint width) + NFD would be quite welcome though
MasterDuke	hm. looks like github.com/swiftlang/swift/blob/ma...tion.swift is implemented in swift. that's not immediately useful for us	21:27
	patrickb: that sounds like a 'hold my beer' challenge	21:28
	zhaskell.github.io/utf8rewind/html...9837ca69db looks like it does normalization	21:41
	no idea if it's any faster than what moarvm has implemented
	and doesn't do NFG (which is just a rakudo/moarvm thing, right?)	21:42
	hm, i don't really know the difference between NFD and NFG. would it be fast to convert from NFD to NFG?	21:44
japhb	NFD means "Normalization Form: canonically Decomposed"	21:56
	Similarly, NFKD is "NF: C[K]ompatibility Decomposed" -- this is less common
	NFC is "NF: canonically Composed" (and there's NFKC for the compatibility form)	21:57
	These all work at the level of available codepoints; if there is a pre-composed codepoint already defined in Unicode for some sequence of base character and modifiers, great, you can replace the modified sequence with a single replacement codepoint. But if there isn't, it needs to remain decomposed, because there's no defined thing to compose it to.	21:58
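
The standard forms described above can be tried directly with Python's `unicodedata` module. This is a small illustration only; the `codepoints` helper is a made-up name, and the sample strings were picked for this sketch (e + combining acute has a precomposed codepoint, q + combining tilde does not).

```python
import unicodedata


def codepoints(s: str) -> list[str]:
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


e_acute = "e\u0301"  # e + COMBINING ACUTE ACCENT
q_tilde = "q\u0303"  # q + COMBINING TILDE
fi_lig = "\ufb01"    # LATIN SMALL LIGATURE FI

# NFC composes when Unicode defines a precomposed codepoint ...
print(codepoints(unicodedata.normalize("NFC", e_acute)))   # ['U+00E9']
# ... and leaves the sequence decomposed when it does not.
print(codepoints(unicodedata.normalize("NFC", q_tilde)))   # ['U+0071', 'U+0303']
# NFD goes the other way.
print(codepoints(unicodedata.normalize("NFD", "\u00e9")))  # ['U+0065', 'U+0301']
# Canonical forms keep the ligature; compatibility forms split it.
print(codepoints(unicodedata.normalize("NFD", fi_lig)))    # ['U+FB01']
print(codepoints(unicodedata.normalize("NFKD", fi_lig)))   # ['U+0066', 'U+0069']
```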
	NFG is "NF: Grapheme composed" meaning that all possible sequences of base character and accents/modifiers (together creating a grapheme, which is what a user would tend to think of a "character" as) are composed into a single unit, even if that unit isn't already in the Unicode list. (But preferring to use the Unicode composed codepoints if they exist.)	22:00
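
A toy version of that idea, in Python, might look like the following. It is a deliberately simplified sketch and not how MoarVM does it: clusters are formed by attaching characters with a non-zero canonical combining class to the preceding character (real grapheme segmentation has more rules), and the `ToyNFG` class and its use of negative integers for units that have no Unicode codepoint are choices made for this illustration.

```python
import unicodedata


class ToyNFG:
    """Map each grapheme to one integer: a codepoint, or a negative synthetic id."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}   # cluster -> synthetic id
        self._clusters: list[str] = []   # synthetic id -> cluster

    def _unit(self, cluster: str) -> int:
        if len(cluster) == 1:
            return ord(cluster)          # prefer a real Unicode codepoint
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def encode(self, text: str) -> list[int]:
        units: list[int] = []
        cluster = ""
        for ch in unicodedata.normalize("NFC", text):
            # Simplification: a character with combining class 0 starts a new cluster.
            if cluster and unicodedata.combining(ch) == 0:
                units.append(self._unit(cluster))
                cluster = ""
            cluster += ch
        if cluster:
            units.append(self._unit(cluster))
        return units

    def decode(self, units: list[int]) -> str:
        return "".join(chr(u) if u >= 0 else self._clusters[-u - 1] for u in units)


nfg = ToyNFG()
units = nfg.encode("q\u0303e\u0301q\u0303")
print(units)  # [-1, 233, -1]
```

Each grapheme ends up as exactly one unit, so indexing and length work per grapheme, and the same cluster always gets the same synthetic id.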
	One of the key things about the decomposition phase is that it gives a strict ordering to all the accents/modifiers, so that two strings that have the same graphemes also have the same codepoint sequence in normalized decomposed form.	22:02
	That's used when composing, to make sure normalized composition does the same thing (produces the same sequence of codepoints or NFG units) no matter the exact mess that got fed in as input.	22:03
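
The ordering point can be seen with two marks that sit in different positions on the base letter. A short Python illustration, with sample strings chosen for this sketch: an "a" carrying an acute accent (above) and a dot below, typed in both orders.

```python
import unicodedata

acute_first = "a\u0301\u0323"  # a + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
dot_first = "a\u0323\u0301"    # same marks, opposite order

print(acute_first == dot_first)  # False: the raw codepoint sequences differ

nfd_a = unicodedata.normalize("NFD", acute_first)
nfd_b = unicodedata.normalize("NFD", dot_first)
print(nfd_a == nfd_b)  # True: decomposition puts the marks in one canonical order

# The order is driven by each mark's canonical combining class.
print([unicodedata.combining(c) for c in nfd_a])  # [0, 220, 230]
```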
	So typically the sequence is (input --> UTF-8 decode --> NFD --> NFG --> str/Str), at least semantically (even if the real code smashes some of those conversions together)	22:05
22:06 lizmat left
japhb	One other key thing that some other languages/engines/VMs don't have to deal with is that MoarVM allows a single string to be made of lots of individual pieces that are strung together -- and normalization has to do the right thing across the boundaries between them.	22:07
	And you can't do it "in place" because strings are supposed to be immutable.	22:08
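
The boundary problem shows up even with plain NFC in Python, where strings are also immutable. This is only an illustration with made-up helper names, and it renormalizes the whole result for simplicity; a real implementation would presumably only need to look at the graphemes on either side of the join.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def concat_normalized(left: str, right: str) -> str:
    """Join two pieces into a new normalized string; the inputs are not modified."""
    return nfc(left + right)


left = "cafe"
right = "\u0301 au lait"  # starts with a combining acute accent

piecewise = nfc(left) + nfc(right)     # each piece normalized on its own
joined = concat_normalized(left, right)

print(len(piecewise), len(joined))  # 13 12
print(piecewise == joined)          # False: the accent must merge across the boundary
```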
22:09 MasterDuke left
japhb	samcv, timo, and of course jnthn have all spent a fair amount of time on these sections of the code and can probably give you a more official rundown than I just did.	22:09
	MCP: EOL.	22:11
