Welcome to the main channel on the development of MoarVM, a virtual machine for NQP and Rakudo (moarvm.org). This channel is being logged for historical purposes. Set by lizmat on 24 May 2021. |
nine | patrickb: maybe we should lock ourselves in a room for a couple hours at the RCS and talk this through? | 08:21 | |
08:53 lizmat joined
09:10 lizmat left
patrickb | nine: Gladly! | 14:16 | |
I do hope it doesn't take a couple of hours though. I still think this should be straightforward (once the approach is fully ironed out). | 14:21 | ||
16:22 mandrilllone joined
mandrilllone | maybe an approach could be to make a Unicode-disabled version of MoarVM, just by silently breaking most Unicode functions | 16:23 | |
nine | mandrilllone: that's still assuming that normalization is the actual problem | 16:24 | |
mandrilllone | My bet is on the 32-bit characters taking up space | ||
16:26 mandrilllone left
19:48 MasterDuke joined
MasterDuke | i just timed slurping 'rakudo/gen/moar/CORE.c.setting' 100 times and summing the chars. both normally and also with `:enc("latin1")` | 19:50 | |
normally was ~1.2s and latin1 was ~0.2s | 19:51 | ||
so utf8 decoding+normalization definitely takes time | 19:53 | ||
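A rough Raku sketch of the timing comparison described above; the file path, iteration count, and :enc("latin1") flag come from the messages, while the harness itself (variable names, use of now) is an assumption:

    # Compare slurping the same file as utf8 (decode + NFG normalization)
    # versus latin1 (a straight byte-to-codepoint mapping, no normalization).
    my $path = 'rakudo/gen/moar/CORE.c.setting';
    for 'utf8', 'latin1' -> $enc {
        my $start = now;
        my $chars = 0;
        $chars += slurp($path, :enc($enc)).chars for ^100;
        say "$enc: {now - $start} seconds, $chars total chars";
    }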
and currently, if i'm understanding MVM_string_utf8_decode correctly, both pretty much happen somewhat simultaneously. i.e., it walks through the incoming bytes, decoding each one from utf8 and normalizing it (which may delay writing several codepoints to the output until a sequence is fully normalized) | 19:58 | ||
i think my experiment changed that so the utf8 decoding happens first, and then the output of that is walked to do normalization | 19:59 | ||
but it wasn't any faster. so assuming that simdutf is in fact faster at decoding utf8 than our code (and maybe it isn't, but i know lemire's stuff is usually pretty fast), that would mean that the normalizing is in fact dominating the time | 20:02 | ||
jnthn: istr you had a series of blog posts a couple years ago about how you made file IO faster, which i believe included speeding up utf8 decoding and normalization. do you have any suggestions for further improvements? anything concrete as todos? | 20:05 | ||
ab5tract | MasterDuke: simdutf sounds like a necessary experiment, if nothing else. Are you already planning to poke at it? | 20:48 | |
MasterDuke | ab5tract: not sure what you mean. something other than what i've already done? | 20:51 | |
ab5tract | sorry, it looks like I missed some context. I'm caught up on the scrollback now | 20:53 | |
MasterDuke | ah, no worries | ||
ab5tract | I'm happy you are taking stabs at this MasterDuke, it's not exactly low hanging fruit from an implementation standpoint, but in terms of speedup gains it looks like a bit of a gold mine | 20:54 | |
Did you already come across github.com/uni-algo/uni-algo? | 20:55 | ||
Ah, just saw that normalization does not include NFG :( | 20:57 | ||
MasterDuke | i hadn't seen that, thanks. maybe there will be some inspiration there at least | 20:59 | |
ab5tract | their README really emphasizes adherence to the Unicode standard, but then it's missing NFG :( | 21:00 | |
MasterDuke | aiui NFG isn't a Unicode standard, it's our own thing | ||
ab5tract recalls some beer-hazy conversation from years ago where someone confidently expressed that Raku was far from unique in its handling of graphemes / NFG is no big deal | 21:02 | |
should we perhaps submit our work to the standards body? |||
MasterDuke | hm, i thought raku was pretty unique, with swift being the only other language doing graphemes | 21:04 | |
ab5tract | I've always thought we did it the best | 21:05 | |
Swift's approach means that the same input always produces the same output, byte for byte, though | 21:06 | ||
21:06 mandrillone joined
MasterDuke | i thought ours does too? | 21:06 | |
mandrillone | suckless provides some algos to manage graphemes, maybe interesting to give it a look libs.suckless.org/libgrapheme/ | 21:07 | |
ab5tract | MasterDuke: nope, that's the sacrifice we make for being able to match two semantically-equal-but-composed-differently characters with each other | 21:08 | |
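A small Raku illustration of that trade-off (the strings here are hypothetical, not from the discussion): the two spellings compare equal because both are normalized on input, which also means the decomposed byte sequence is not preserved on the way back out:

    my $composed   = "\x[00E9]";    # é as a single precomposed codepoint
    my $decomposed = "e\x[0301]";   # 'e' followed by COMBINING ACUTE ACCENT
    say $composed eq $decomposed;   # True — both normalize to the same grapheme
    say $decomposed.encode.list;    # (195 169) — re-encodes as the composed form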
it's such a simple fix though, you can just write a class that stores the original content as a Buf, with the corresponding hooks to spit the Buf back out (Str/gist) | 21:09 | ||
(or even spurt) | |||
IIRC Swift does the above internally. | 21:10 | ||
All these memories are from before 6.c though, so throw some salt on those miles that may be varying up ahead |||
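A minimal sketch of the kind of wrapper ab5tract describes, assuming the goal is simply to keep the undecoded bytes alongside a normalized text view; the class name and methods here are invented for illustration:

    class RawPreservingStr {
        has Buf $.raw is required;       # the bytes exactly as read, untouched
        has Str $.text = $!raw.decode;   # normalized (NFG) view for matching etc.

        method Str  { $!text }
        method gist { $!text }
        method spurt-to($path) { $path.IO.spurt($!raw) }   # byte-for-byte round trip
    }

    # 'e' + combining acute as raw bytes; the text view normalizes, the Buf does not
    my $s = RawPreservingStr.new(raw => Buf.new(0x65, 0xCC, 0x81));
    say ~$s;          # é
    say $s.raw.list;  # (101 204 129) — original decomposed bytes preserved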
MasterDuke | mandrillone: thanks, that also looks interesting | 21:11 | |
ab5tract | Both languages have changed quite a bit | ||
21:16 mandrillone left
MasterDuke | afk, but thanks for the suggestions. will check logs for any more | 21:16 | |
21:21 MasterDuke left
21:29 lizmat joined
japhb | We follow the Unicode standards for how to cluster codepoints into a grapheme and suchlike, but NFG in particular (where we assign new "codepoints" to grapheme clusters as needed) is our own thing because it can theoretically be overwhelmed. (Every possible legal composition of combining characters, accents, etc. with every possible base character is *a lot of characters*.) | 21:31 | |
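A small Raku illustration of the distinction japhb is drawing (hypothetical strings): where a precomposed codepoint exists, NFC supplies it; where none exists, the grapheme still counts as a single character, which is where the synthetic "codepoints" come in:

    say "e\x[0301]".chars;   # 1 — NFC composes this to U+00E9, no synthetic needed
    say "e\x[0301]".codes;   # 1
    say "x\x[0301]".chars;   # 1 — no precomposed form exists; NFG assigns a synthetic internally
    say "x\x[0301]".codes;   # 2 — still two codepoints underneath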
However I can imagine that having everything up through NFC done at some absurd speed, and then layering NFG on top of that in a way that takes advantage of being able to expect NFC compliance, *might* make for a faster algorithm. Still have to deal with the multi-piece string problem though; I'm guessing most high-speed Unicode libraries aren't expecting a string to be chunked. | 21:34 | ||
ab5tract | Just curious, but why are they able to get away with non-chunked strings while we are stuck supporting chunked strings? | 22:01 | |
japhb: I hope the above question doesn't come off as too pointed. Thank you for the details! | 22:04 | ||
japhb | Because C doesn't have an 'x' operator. ;-) | 23:05 | |
But seriously, the idea is to be able to efficiently do joins, substrings, and repetitions. | |||
ab5tract: ^^ | 23:06 | ||
Imagine `substr($a, 3, 4) ~ substr($q, 10, 1) x 25` |
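For illustration, a runnable version of that expression with hypothetical values; this is the kind of expression that strands are meant to make cheap, since the substrings, the concatenation, and the repetition can reference existing string data rather than copying it up front:

    my $a = 'The quick brown fox jumps over the lazy dog';
    my $q = '0123456789abcdef';
    say substr($a, 3, 4) ~ substr($q, 10, 1) x 25;   # " qui" followed by 25 copies of "a"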