Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes. Set by lizmat on 24 May 2021. |
nine | patrickb: maybe we should lock ourselves in a room for a couple hours at the RCS and talk this through? | 08:21 | |
08:53 lizmat joined
09:10 lizmat left
patrickb | nine: Gladly! | 14:16 | |
I do hope it doesn't take a couple of hours though. I still think this should be straightforward (once the approach is fully ironed out). | 14:21 | ||
16:22 mandrilllone joined
mandrilllone | maybe an approach could be to make a Unicode-disabled version of MoarVM, just by silently breaking most Unicode functions | 16:23 | |
nine | mandrilllone: that's still assuming that normalization is the actual problem | 16:24 | |
mandrilllone | My bet is on the 32-bit characters taking up space | ||
16:26 mandrilllone left
19:48 MasterDuke joined
MasterDuke | i just timed slurping 'rakudo/gen/moar/CORE.c.setting' 100 times and summing the chars. both normally and also with `:enc("latin1")` | 19:50 | |
normally was ~1.2s and latin1 was ~0.2s | 19:51 | ||
so utf8 decoding+normalization definitely takes time | 19:53 | ||
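A rough Raku sketch of the timing comparison described above; the file path, iteration count, and :enc("latin1") flag come from the messages, while the harness itself (variable names, use of now) is an assumption:

    # Compare slurping the same file as utf8 (decode + NFG normalization)
    # versus latin1 (a straight byte-to-codepoint mapping, no normalization).
    my $path = 'rakudo/gen/moar/CORE.c.setting';
    for 'utf8', 'latin1' -> $enc {
        my $start = now;
        my $chars = 0;
        $chars += slurp($path, :enc($enc)).chars for ^100;
        say "$enc: {now - $start} seconds, $chars total chars";
    }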
and currently, if i'm understanding MVM_string_utf8_decode correctly, both pretty much happen somewhat simultaneously. i.e., it walks through the incoming bytes, decoding each one from utf8 and normalizing it (which may delay writing several codepoints to the output until a sequence is fully normalized) | 19:58 | ||
i think my experiment changed that so the utf8 decoding happens first, and then the output of that is walked to do normalization | 19:59 | ||
but it wasn't any faster. so assuming that simdutf is in fact faster at decoding utf8 than our code (and maybe it isn't, but i know lemire's stuff is usually pretty fast), that would mean that the normalizing is in fact dominating the time | 20:02 | ||
jnthn: istr you had a series of blog posts a couple years ago about how you made file IO faster, which i believe included speeding up utf8 decoding and normalization. do you have any suggestions for further improvements? anything concrete as todos? | 20:05 | ||
ab5tract | MasterDuke: simdutf sounds like a necessary experiment, if nothing else. Are you already planning to poke at it? | 20:48 | |
MasterDuke | ab5tract: not sure what you mean. something other than what i've already done? | 20:51 | |
ab5tract | sorry, it looks like I missed some context. I'm caught up on the scrollback now | 20:53 | |
MasterDuke | ah, no worries | ||
ab5tract | I'm happy you are taking stabs at this MasterDuke, it's not exactly low hanging fruit from an implementation standpoint, but in terms of speedup gains it looks like a bit of a gold mine | 20:54 | |
Did you already come across github.com/uni-algo/uni-algo? | 20:55 | ||
Ah, just saw that normalization does not include NFG :( | 20:57 | ||
MasterDuke | i hadn't seen that, thanks. maybe there will be some inspiration there at least | 20:59 | |
ab5tract | their README really emphasizes adherence to the Unicode standard, but then it's missing NFG :( | 21:00 | |
MasterDuke | aiui NFG isn't a Unicode standard, it's our own thing | ||
ab5tract recalls some beer-hazy conversation from years ago where someone confidently expressed that Raku was far from unique in its handling of graphemes / NFG is no big deal | 21:02 | |
should we perhaps submit our work to the standards body? |||
MasterDuke | hm, i thought raku was pretty unique, with swift being the only other language doing graphemes | 21:04 | |
ab5tract | I've always thought we did it the best | 21:05 | |
Swift's approach means that the same input always produces the same output, byte for byte, though | 21:06 | ||
21:06 mandrillone joined
MasterDuke | i thought ours does too? | 21:06 | |
mandrillone | suckless provides some algos to manage graphemes, maybe interesting to give it a look libs.suckless.org/libgrapheme/ | 21:07 | |
ab5tract | MasterDuke: nope, that's the sacrifice we make for being able to match two semantically-equal-but-composed-differently characters with each other | 21:08 | |
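A small Raku illustration of that trade-off (the strings here are hypothetical, not from the discussion): the two spellings compare equal because both are normalized on input, which also means the decomposed byte sequence is not preserved on the way back out:

    my $composed   = "\x[00E9]";    # é as a single precomposed codepoint
    my $decomposed = "e\x[0301]";   # 'e' followed by COMBINING ACUTE ACCENT
    say $composed eq $decomposed;   # True — both normalize to the same grapheme
    say $decomposed.encode.list;    # (195 169) — re-encodes as the composed form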
it's such a simple fix though, you can just write a class that stores the original content as a Buf, with the corresponding hooks to spit the Buf back out (Str/gist) | 21:09 | ||
(or even spurt) | |||
IIRC Swift does the above internally. | 21:10 | ||
All these memories are from before 6.c though, so throw some salt on those miles that may be varying up ahead |||
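A minimal sketch of the kind of wrapper ab5tract describes, assuming the goal is simply to keep the undecoded bytes alongside a normalized text view; the class name and methods here are invented for illustration:

    class RawPreservingStr {
        has Buf $.raw is required;       # the bytes exactly as read, untouched
        has Str $.text = $!raw.decode;   # normalized (NFG) view for matching etc.

        method Str  { $!text }
        method gist { $!text }
        method spurt-to($path) { $path.IO.spurt($!raw) }   # byte-for-byte round trip
    }

    # 'e' + combining acute as raw bytes; the text view normalizes, the Buf does not
    my $s = RawPreservingStr.new(raw => Buf.new(0x65, 0xCC, 0x81));
    say ~$s;          # é
    say $s.raw.list;  # (101 204 129) — original decomposed bytes preserved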
MasterDuke | mandrillone: thanks, that also looks interesting | 21:11 | |
ab5tract | Both languages have changed quite a bit | ||
21:16 mandrillone left
MasterDuke | afk, but thanks for the suggestions. will check logs for any more | 21:16 | |
21:21 MasterDuke left
21:29 lizmat joined
japhb | We follow the Unicode standards for how to cluster codepoints into a grapheme and suchlike, but NFG in particular (where we assign new "codepoints" to grapheme clusters as needed) is our own thing because it can theoretically be overwhelmed. (Every possible legal composition of combining characters, accents, etc. with every possible base character is *a lot of characters*.) | 21:31 | |
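A small Raku illustration of the distinction japhb is drawing (hypothetical strings): where a precomposed codepoint exists, NFC supplies it; where none exists, the grapheme still counts as a single character, which is where the synthetic "codepoints" come in:

    say "e\x[0301]".chars;   # 1 — NFC composes this to U+00E9, no synthetic needed
    say "e\x[0301]".codes;   # 1
    say "x\x[0301]".chars;   # 1 — no precomposed form exists; NFG assigns a synthetic internally
    say "x\x[0301]".codes;   # 2 — still two codepoints underneath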
However I can imagine that having everything up through NFC done at some absurd speed, and then layering NFG on top of that in a way that takes advantage of being able to expect NFC compliance, *might* make for a faster algorithm. Still have to deal with the multi-piece string problem though; I'm guessing most high-speed Unicode libraries aren't expecting a string to be chunked. | 21:34 | ||
ab5tract | Just curious, but why are they able to get away with non-chunked strings while we are stuck supporting chunked strings? | 22:01 | |
japhb: I hope the above question doesn't come off as too pointed. Thank you for the details! | 22:04 | ||
japhb | Because C doesn't have an 'x' operator. ;-) | 23:05 | |
But seriously, the idea is to be able to efficiently do joins, substrings, and repetitions. | |||
ab5tract: ^^ | 23:06 | ||
Imagine `substr($a, 3, 4) ~ substr($q, 10, 1) x 25` |
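For illustration, a runnable version of that expression with hypothetical values; this is the kind of expression that strands are meant to make cheap, since the substrings, the concatenation, and the repetition can reference existing string data rather than copying it up front:

    my $a = 'The quick brown fox jumps over the lazy dog';
    my $q = '0123456789abcdef';
    say substr($a, 3, 4) ~ substr($q, 10, 1) x 25;   # " qui" followed by 25 copies of "a"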