Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes.
Set by lizmat on 24 May 2021.
nine patrickb: maybe we should lock ourselves in a room for a couple hours at the RCS and talk this through? 08:21
08:53 lizmat joined 09:10 lizmat left
patrickb nine: Gladly! 14:16
I do hope it doesn't take a couple of hours though. I still think this should be straightforward (once the approach is fully ironed out). 14:21
16:22 mandrilllone joined
mandrilllone maybe an approach could be to make a Unicode-disabled version of MoarVM, just by silently breaking most Unicode functions 16:23
nine mandrilllone: that's still assuming that normalization is the actual problem 16:24
mandrilllone My bet is on the 32-bit characters taking up space
16:26 mandrilllone left 19:48 MasterDuke joined
MasterDuke i just timed slurping 'rakudo/gen/moar/CORE.c.setting' 100 times and summing the chars. both normally and also with `:enc("latin1")` 19:50
normally was ~1.2s and latin1 was ~0.2s 19:51
so utf8 decoding+normalization definitely takes time 19:53
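A rough sketch of the kind of timing comparison described above (the path assumes a local Rakudo build tree; exact numbers will differ by machine):

    # Slurp the same file 100 times as UTF-8 (decode + NFG normalization)
    # and as Latin-1 (no normalization pass), and compare wall-clock time.
    my $path = 'rakudo/gen/moar/CORE.c.setting';
    for 'utf8', 'latin1' -> $enc {
        my $start = now;
        my $chars = 0;
        $chars += slurp($path, :enc($enc)).chars for ^100;
        say "$enc: {now - $start}s, $chars chars";
    }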
and currently, if i'm understanding MVM_string_utf8_decode correctly, both pretty much happen simultaneously. i.e., it walks through the incoming bytes, decoding them from utf8 and normalizing the resulting codepoints as it goes (which may delay writing several codepoints to the output until a sequence is fully normalized) 19:58
i think my experiment changed that so the utf8 decoding happens first, and then the output of that is walked to do normalization 19:59
but it wasn't any faster. so assuming that simdutf is in fact faster at decoding utf8 than our code (and maybe it isn't, but i know lemire's stuff is usually pretty fast), that would mean that the normalizing is in fact dominating the time 20:02
jnthn: istr you had a series of blog posts a couple years ago about how you made file IO faster, which i believe included speeding up utf8 decoding and normalization. do you have any suggestions for further improvements? anything concrete as todos? 20:05
ab5tract MasterDuke: simdutf sounds like a necessary experiment, if nothing else. Are you already planning to poke at it? 20:48
MasterDuke ab5tract: not sure what you mean. something other than what i've already done? 20:51
ab5tract sorry, it looks like I missed some context. I'm caught up on the scrollback now 20:53
MasterDuke ah, no worries
ab5tract I'm happy you are taking stabs at this MasterDuke, it's not exactly low hanging fruit from an implementation standpoint, but in terms of speedup gains it looks like a bit of a gold mine 20:54
Did you already come across github.com/uni-algo/uni-algo? 20:55
Ah, just saw that normalization does not include NFG :( 20:57
MasterDuke i hadn't seen that, thanks. maybe there will be some inspiration there at least 20:59
ab5tract their README really emphasizes adherence to the Unicode standard, but then it's missing NFG :( 21:00
MasterDuke aiui NFG isn't a Unicode standard, it's our own thing
ab5tract recalls some beer-hazy conversation from years ago where someone confidently expressed that Raku was far from unique in its handling of graphemes / NFG is no big deal 21:02
should we perhaps submit our work to the standards body?
MasterDuke hm, i thought raku was pretty unique, with swift being the only other language doing graphemes 21:04
ab5tract I've always thought we did it the best 21:05
Swift's approach means that the same input always produces the same output, byte for byte, though 21:06
21:06 mandrillone joined
MasterDuke i thought ours does too? 21:06
mandrillone suckless provides some algos to manage graphemes, maybe interesting to give it a look libs.suckless.org/libgrapheme/ 21:07
ab5tract MasterDuke: nope, that's the sacrifice we make for being able to match two semantically-equal-but-composed-differently characters with each other 21:08
it's such a simple fix though, you can just write a class that stores the original content as a Buf, with the corresponding hooks to spit the Buf back out (Str/gist) 21:09
(or even spurt)
IIRC Swift does the above internally. 21:10
All these memories are from before 6.c though, so throw some salt on those miles that may be varying up ahead
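A minimal Raku sketch of the Buf-backed wrapper idea described above; the class and method names are made up for illustration, not an existing API:

    # Keep the original bytes next to the (NFG-normalized) Str, so the exact
    # input can be written back out even though the Str itself is normalized.
    class RawStr {
        has Blob $.bytes is required;             # original, untouched bytes
        has Str  $.text  = $!bytes.decode('utf8');
        method Str  { $!text }
        method gist { $!text }
    }

    # "cafe" with a decomposed e + COMBINING ACUTE ACCENT on the wire:
    my $raw = RawStr.new(bytes => Blob.new(0x63, 0x61, 0x66, 0x65, 0xCC, 0x81));
    say $raw;                           # café (normalized view)
    say $raw.bytes.elems;               # 6    (original decomposed bytes)
    say $raw.Str.encode('utf8').elems;  # 5    (re-encoding gives the composed form)
    # spurt 'out.raw', $raw.bytes;      # round-trips the exact original bytes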
MasterDuke mandrillone: thanks, that also looks interesting 21:11
ab5tract Both languages have changed quite a bit
21:16 mandrillone left
MasterDuke afk, but thanks for the suggestions. will check logs for any more 21:16
21:21 MasterDuke left 21:29 lizmat joined
japhb We follow the Unicode standards for how to cluster codepoints into a grapheme and suchlike, but NFG in particular (where we assign new "codepoints" to grapheme clusters as needed) is our own thing because it can theoretically be overwhelmed. (Every possible legal composition of combining characters, accents, etc. with every possible base character is *a lot of characters*.) 21:31
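For a concrete picture of the grapheme/codepoint split, an illustrative Raku snippet using a cluster that has no precomposed form:

    my $g = "7\c[COMBINING DIAERESIS]";   # base character plus a combining mark
    say $g.chars;   # 1 — NFG treats the whole cluster as a single grapheme
    say $g.codes;   # 2 — but it is still two Unicode codepoints underneath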
However I can imagine that having everything up through NFC done at some absurd speed, and then layering NFG on top of that in a way that takes advantage of being able to expect NFC compliance, *might* make for a faster algorithm. Still have to deal with the multi-piece string problem though; I'm guessing most high-speed Unicode libraries aren't expecting a string to be chunked. 21:34
ab5tract Just curious, but why are they able to get away with non-chunked strings while we are stuck supporting chunked strings? 22:01
japhb: I hope the above question doesn't come off as too pointed. Thank you for the details! 22:04
japhb Because C doesn't have an 'x' operator. ;-) 23:05
But seriously, the idea is to be able to efficiently do joins, substrings, and repetitions.
ab5tract: ^^ 23:06
Imagine `substr($a, 3, 4) ~ substr($q, 10, 1) x 25`
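A toy model of the strand idea in Raku (not MoarVM's actual C-level representation): substrings and repetitions just record references into existing storage, and copying only happens if the result is ever flattened.

    class Strand {
        has Str $.source;       # underlying flat string
        has Int $.from;         # start offset into $!source
        has Int $.chars;        # length of the piece
        has Int $.repeat = 1;   # how many times the piece occurs
    }
    class Rope {
        has Strand @.strands;
        method concat(Rope $other) {
            Rope.new(strands => (|@!strands, |$other.strands))
        }
        method flatten {        # the only place any copying happens
            @!strands.map({ .source.substr(.from, .chars) x .repeat }).join
        }
    }
    sub rope-substr(Str $s, Int $from, Int $chars) {
        Rope.new(strands => [Strand.new(:source($s), :$from, :$chars)])
    }
    sub rope-repeat(Str $s, Int $from, Int $chars, Int $n) {
        Rope.new(strands => [Strand.new(:source($s), :$from, :$chars, :repeat($n))])
    }

    # substr($a, 3, 4) ~ substr($q, 10, 1) x 25, with no copying until .flatten:
    my $a = 'the quick brown fox';
    my $q = 'jumps over the lazy dog';
    say rope-substr($a, 3, 4).concat(rope-repeat($q, 10, 1, 25)).flatten;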