MasterDuke is there a bound on the size of the resulting string github.com/MoarVM/MoarVM/blob/main...#L171-L173 based on the number of bytes? 13:39
lizmat I'd say the maximum size is 4 * bytes ? 13:45
ah, it already starts out that way I see 13:46
hmmm 13:47
nine I don't think there's anything else one can do
lizmat so I'd see the realloc would be rare? Or do you see it happen more often MasterDuke ? 13:48
MasterDuke well, that was really just a tangent. i'm trying to get rakudo to build with the in-situ strings and happened to see that and wondered 13:49
Voldenet there is no upper bound on the number of codepoints in a grapheme cluster
MasterDuke still trying to figure out how this "Requested synthetic 85 when only 1 have been created." happens
Voldenet: right, but the number decoded from a fixed number of bytes should be upper bounded i believe 13:50
Voldenet HM, the reverse is probably true 13:51
nine MasterDuke: time to pull out rr? 13:54
nine I don't see how a buffer with x utf-8 encoded bytes could lead to more than x graphemes 13:55
Isn't it a strict 1:n between graphemes and code points and between code points and bytes?
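The byte-to-code-point half of that bound can be checked with a minimal sketch. `count_utf8_codepoints` is a hypothetical helper, not a MoarVM function, and it assumes the buffer holds well-formed UTF-8. It says nothing about what normalization does afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of MoarVM. Assumes well-formed UTF-8.
 * Each code point has exactly one non-continuation byte, so the result
 * can never exceed the number of bytes scanned. */
static size_t count_utf8_codepoints(const uint8_t *buf, size_t bytes) {
    size_t count = 0;
    assert(buf != NULL || bytes == 0);
    for (size_t i = 0; i < bytes; i++) {
        /* Continuation bytes have the bit pattern 10xxxxxx. */
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the three bytes 0x3A 0xC2 0xAB (the UTF-8 encoding of ":«") this counts two code points.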
MasterDuke nine: yep, annoyingly had to rebuild rr first 13:59
Voldenet one codepoint can't represent multiple graphemes, `malloc(sizeof(MVMuint32) * bufsize)` was in MVM_string_utf8_decode from the start 14:06
and realloc (well, malloc + memcpy) was there from the start 14:08
I do think there's a subtle error in that code
if it started with `malloc(sizeof(MVMGrapheme32) * bufsize / 4)` it'd represent most cases well 14:09
eh wait, nevermind, grapheme is always 32 bits 14:10
nine sizeof(MVMGrapheme32) * bufsize will most of the time be the correct size 14:12
lizmat not true: I think there's also an 8-bit representation if only ASCII chars ?
of graphemes I mean ?
MVMGrapheme8 vs MVMGrapheme32 14:13
Voldenet > typedef MVMint32 MVMGrapheme32; 14:14
since this decoder allocates buffer as MVMGrapheme32[buflen], isn't the realloc pointless? 14:15
nine That's what I think 14:16
Voldenet it'd only make sense if one byte could somehow become multiple graphemes
MasterDuke which it looks like it can, if you follow the MVM_unicode_normalizer_process_codepoint_to_grapheme call 14:17
lizmat hmmm maybe if the decoding is utf8-c8 ? 14:18
nine Aaah NFD
lizmat of course, decomposing composed codepoints 14:19
is what you mean, nine?
MasterDuke hm. i've tracked back to where that -85 is assigned. unfortunately the previous value was -48, which is still not a value i want to see 14:21
[Coke] m: say (-85,-48...*)[2] 14:22
camelia -11
nine lizmat: yes 14:23
lizmat but composed codepoints would still be only one grapheme ? 14:24
perhaps NFD should generate just integers, instead of graphemes ? 14:25
Voldenet Wait, so you're saying there's a case where 1 byte of input becomes more than 4 bytes? 14:49
MasterDuke the string in question (re too many synthetics, not the bytes question) is ":«" 14:51
and i think the problem might be that it's represented as an in_situ_32 (in_situ_32 = {58, 171}), and then it tries to convert to an in_situ_8 14:52
ab5tract MasterDuke: that's interesting.. why would it want to downgrade its representation? it seems like things should go towards 32, but not the other way around 15:09
(keep in mind, I've still got a lot to learn about MVMGrapheme8 and MVMGrapheme32) 15:11
nine ab5tract: Grapheme8 is more memory and cache efficient. 15:21
ab5tract that much is clear, sure. but if something is stored as Grapheme32, I would presume that it is taking advantage of the more expansive representation. 15:23
nine No, that's just the default. We get to Grapheme8 by scanning a Grapheme32 string and noticing that 8 is enough 15:29
ab5tract Ah got it, thanks 15:30
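A minimal sketch of the scan nine describes, applied to the {58, 171} string MasterDuke mentions. The type names are stand-ins, and treating the 8-bit grapheme type as a signed byte is an assumption; `fits_in_8` and `wrap_to_8` are hypothetical helpers, not MoarVM functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t Grapheme32; /* stand-in for MVMGrapheme32 */
typedef int8_t  Grapheme8;  /* stand-in; assumed to be a signed 8-bit type */

/* Returns 1 if every grapheme fits in a signed 8-bit slot, else 0. */
static int fits_in_8(const Grapheme32 *g, size_t n) {
    assert(g != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (g[i] < -128 || g[i] > 127)
            return 0;
    }
    return 1;
}

/* Shows what keeping only the low 8 bits yields when read back as a
 * signed byte, written without relying on implementation-defined casts. */
static Grapheme8 wrap_to_8(Grapheme32 g) {
    int32_t low = (int32_t)((uint32_t)g & 0xFFu);
    return (Grapheme8)(low >= 128 ? low - 256 : low);
}
```

Under these assumptions {58, 171} fails the scan, because 171 is above 127. Forcing 171 into a signed byte anyway gives -85, which lines up with the number in the "Requested synthetic 85" error. Whether that is actually how the error arises is MasterDuke's hypothesis, not something this sketch confirms.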
MasterDuke heh. thought i figured out what the problem was, now the nqp build fails... 18:38
got the nqp build working again... 19:19
well, started compiling CORE.c, but `Directive f not applicable for value of type BOOTNum`. *no* idea how that happens 19:26
MasterDuke rakudo just built and installed with in-situ-strings 20:09
[Coke] nice 20:36
MasterDuke and `make m-test` had one fail in t/02-rakudo/reproducible-builds.t, but `make m-spectest` passed 21:33
lizmat fwiw, I also see t/02-rakudo/reproducible-builds.t fail occasionally (feels about 2% of the time) 21:34
MasterDuke yeah, i've seen a rare fail before. but somewhat ironically, this fail appears to be 100% reproducible 21:35
lizmat ah, that'd be different :-) 21:36
MasterDuke wow, that test spews a ton of output to my console when run directly 21:37
lizmat yeah :-)
MasterDuke i assume it's related to the fact that this branch adds two new storage_types to MVMString 21:39
if anyone is interested, github.com/MoarVM/MoarVM/compare/m...tu-strings is the current state 21:40
it's not really ready to be PRed yet though. there's a bunch of cleanup and optimization left to do, but nine, timo1, et al., if you take a look and have any suggestions please let me know 21:48
[Coke] MasterDuke++ 22:38