11 May 2025 |
ab5tract |
MasterDuke: nope, that's the sacrifice we make for being able to match two semantically-equal-but-composed-differently characters with each other |
21:08 |
|
|
it's such a simple fix though, you can just write a class that stores the original content as a Buf, with the corresponding hooks to spit the Buf back out (Str/gist) |
21:09 |
|
|
(or even spurt) |
|
|
|
IIRC Swift does the above internally. |
21:10 |
|
|
All these memories are from before 6.c though, so throw some salt on those miles that may be varying up ahead
|
|
MasterDuke |
mandrillone: thanks, that also looks interesting |
21:11 |
|
ab5tract |
Both languages have changed quite a bit |
|
|
MasterDuke |
afk, but thanks for the suggestions. will check logs for any more |
21:16 |
|
japhb |
We follow the Unicode standards for how to cluster codepoints into a grapheme and suchlike, but NFG in particular (where we assign new "codepoints" to grapheme clusters as needed) is our own thing because it can theoretically be overwhelmed. (Every possible legal composition of combining characters, accents, etc. with every possible base character is *a lot of characters*.) |
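A toy sketch of the NFG idea in Python (hypothetical code, nothing like MoarVM's actual data structures): grapheme clusters made of more than one codepoint get a fresh negative "synthetic codepoint" allocated on first sight, so a string can stay one integer per grapheme.

```python
# Toy NFG-style synthetic codepoint table (illustrative only):
# multi-codepoint grapheme clusters get a negative synthetic id,
# allocated on demand and reused for identical clusters.

class SyntheticTable:
    def __init__(self):
        self.by_cluster = {}   # cluster (tuple of codepoints) -> synthetic id
        self.by_id = {}        # synthetic id -> cluster

    def codepoint_for(self, cluster):
        """Return one integer for a grapheme cluster: single codepoints
        map to themselves, longer clusters get a synthetic id."""
        if len(cluster) == 1:
            return cluster[0]
        if cluster not in self.by_cluster:
            synth = -(len(self.by_cluster) + 1)
            self.by_cluster[cluster] = synth
            self.by_id[synth] = cluster
        return self.by_cluster[cluster]

table = SyntheticTable()
# "a" + COMBINING ACUTE ACCENT is one grapheme; identical clusters
# share one synthetic id.
a_acute = (ord("a"), 0x301)
first = table.codepoint_for(a_acute)
second = table.codepoint_for(a_acute)
plain = table.codepoint_for((ord("b"),))
```

This also makes the exhaustion concern visible: every distinct cluster costs a table entry forever.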
21:31 |
|
|
However I can imagine that having everything up through NFC done at some absurd speed, and then layering NFG on top of that in a way that takes advantage of being able to assume NFC compliance, *might* make for a faster algorithm. We'd still have to deal with the multi-piece string problem though; I'm guessing most high-speed Unicode libraries aren't expecting a string to be chunked.
21:34 |
|
ab5tract |
Just curious, but why are they able to get away with non-chunked strings while we are stuck supporting chunked strings? |
22:01 |
|
|
japhb: I hope the above question doesn't come off as too pointed. Thank you for the details! |
22:04 |
|
japhb |
Because C doesn't have an 'x' operator. ;-) |
23:05 |
|
|
But seriously, the idea is to be able to efficiently do joins, substrings, and repetitions. |
|
|
|
ab5tract: ^^ |
23:06 |
|
|
Imagine `substr($a, 3, 4) ~ substr($q, 10, 1) x 25` |
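A minimal strand/rope sketch in Python of why that expression is cheap (hypothetical names, not MoarVM's real code): substrings and repetitions just record (source, start, length, repeat) tuples instead of copying characters, and the copying happens once, on flattening.

```python
# Minimal rope sketch (illustrative only): substr, repeat, and concat
# record strands rather than copying characters.

class Strand:
    def __init__(self, source, start, length, repeat=1):
        self.source, self.start = source, start
        self.length, self.repeat = length, repeat

class Rope:
    def __init__(self, strands):
        self.strands = strands

    @classmethod
    def substr(cls, s, start, length):
        return cls([Strand(s, start, length)])     # O(1), no copying

    def repeat(self, n):
        return Rope([Strand(st.source, st.start, st.length, st.repeat * n)
                     for st in self.strands])      # O(#strands)

    def concat(self, other):
        return Rope(self.strands + other.strands)  # O(#strands)

    def flatten(self):
        # the only place characters are actually copied
        return "".join(st.source[st.start:st.start + st.length] * st.repeat
                       for st in self.strands)

a, q = "abcdefgh", "0123456789X"
# the chat's example: substr($a, 3, 4) ~ substr($q, 10, 1) x 25
result = Rope.substr(a, 3, 4).concat(Rope.substr(q, 10, 1).repeat(25))
```

The whole expression builds just two strands, regardless of the repetition count.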
|
|
12 May 2025 |
lizmat |
and yet another Rakudo Weekly News hits the Net: rakudoweekly.blog/2025/05/12/2025-...c-in-time/ |
13:19 |
|
timo |
one case where our ropes implementation should really give good benefits is when getting the substrings out of a big heap of Match objects |
14:51 |
|
|
though we basically already store the orig/target string and offset and length in the Match object and only create the roped substring object when the Match has .Str called on it |
14:52 |
|
|
(because we also want the start and length separately, and we don't have an API to ask a rope for its parts, and it would also be kind of awkward to handle) |
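A sketch of that lazy-Match idea (hypothetical class, not Rakudo's actual one): the match stores only the original string plus offset and length, and the substring object is materialized the first time it is stringified.

```python
# Hypothetical lazy match object: orig/start/length stored eagerly,
# the substring built only when .Str-equivalent is called.

class LazyMatch:
    def __init__(self, orig, start, length):
        self.orig, self.start, self.length = orig, start, length
        self._str = None   # no substring object yet

    def Str(self):
        # the substring is materialized here, on first use, and cached
        if self._str is None:
            self._str = self.orig[self.start:self.start + self.length]
        return self._str

text = "the quick brown fox"
m = LazyMatch(text, 4, 5)
```

Matching a big file into thousands of such objects allocates no substring at all until some consumer actually asks for one.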
|
|
|
but imagine feeding /usr/share/dict/words into a grammar that just matches one line into one match object, and imagine we were eagerly creating these substrings as full objects |
14:54 |
|
|
we should probably still look into immediately turning roped strings into buffer-backed strings if they are very small, after some measurements of how performance differs between the two |
|
|
|
strings that fit in-situ will probably always outperform rope-based substrings |
14:55 |
|
|
also consider the case of having a `my Str $result = "";` and then a loop that does `$result ~= $something` repeatedly
14:56 |
|
|
if we don't have ropes, we would be copying the buffer in the $result over and over again to make the next $result |
14:57 |
|
|
this is something common in python and CPython has an optimization where if a string's reference count is 1, then they will just mutate it in-place |
|
|
|
and early on, Pypy didn't have something specifically for this use case, and some existing python scripts would run in quadratic time in pypy but not in cpython |
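The quadratic blow-up is easy to make concrete by counting copied characters (a toy model, not a benchmark of any real VM): naive immutable concatenation rewrites the whole accumulated buffer on every append, while a rope or an in-place mutable buffer writes each piece only once.

```python
# Count characters copied: naive immutable append vs. rope-style append.

def naive_append_cost(pieces):
    """result = result + piece copies the whole new buffer each time."""
    copied, length = 0, 0
    for piece in pieces:
        length += len(piece)
        copied += length          # entire accumulated buffer rewritten
    return copied

def rope_append_cost(pieces):
    """A rope (or refcount-1 in-place buffer) writes each piece once."""
    return sum(len(p) for p in pieces)

pieces = ["x" * 10] * 1000
naive = naive_append_cost(pieces)   # grows like n^2 / 2
roped = rope_append_cost(pieces)    # grows like n
```

For 1000 appends of 10 characters, the naive strategy copies 5,005,000 characters versus 10,000 for the rope.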
14:58 |
|
|
we can potentially still do something smarter in our case. right now when appending a string to a roped string we create a new rope array by copying the previous one and adding the extra ropes. maybe at some point we want to collapse the strands if we're not already doing that? |
14:59 |
|
|
we already don't do nested stranded strings |
|
|
|
i kind of use ropes and strands interchangeably. i think a rope is supposed to refer to a string made of strands? |
15:00 |
|
|
something we don't have any logic or heuristics for is what happens when you have a huge string that you take a few substrings of, and never look at the original string ever again. we do keep the huge string around in memory because the substrings point at it, but we don't do any stats or anything to figure that out at run time |
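One possible heuristic for that (purely hypothetical, echoing what the JDK did when `String.substring` stopped sharing buffers): copy a substring out into its own buffer when it covers only a tiny fraction of a large source, so small matches don't pin a huge original string in memory.

```python
# Hypothetical share-vs-copy heuristic for substrings. The thresholds
# are made up for illustration; a real one would be tuned by measurement.

SHARE_THRESHOLD = 0.25   # share only if we use >= 25% of the source
MIN_SOURCE_SIZE = 1024   # always copy out of small sources

def substring_storage(source_len, sub_len):
    """Return 'copy' or 'share' for a substring of the given sizes."""
    if source_len < MIN_SOURCE_SIZE:
        return "copy"    # small source: copying is cheap anyway
    if sub_len / source_len < SHARE_THRESHOLD:
        return "copy"    # tiny slice of a big string: don't pin it
    return "share"       # large slice: sharing saves real memory
```

Under these made-up thresholds, a 10-character match against a 10 MB source gets copied out, so the 10 MB source can be collected.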
15:03 |
|
|
similarly we don't measure what synthetic code points have been created, and whether any of them have become unused |
|
|
|
that's the exhaustion problem japhb alluded to earlier. a very long-running moar process can slowly accumulate synthetics over time. an "attack" on this particular feature may take gigabytes of text to be fed in? still have to make a proof-of-concept to figure out if it's possible to make more synthetics with less input |
15:04 |
|
|
oh wow. Konsole is *not* happy about my combining character printing test |
15:10 |
|
|
$4 = 0xe716a |
15:32 |
|
|
:utfsize(9712968) |
|
|
|
that's 0xe716a synthetics generated after 9_712_968 bytes of utf8, but i'm just using one A followed by 4 combining characters out of the range of 0x300..0x36f |
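A sketch of that stress input in Python (reconstructing the construction described above, not the actual test script): clusters of one base character plus four combining marks drawn from U+0300..U+036F. Each such cluster is only 9 bytes of UTF-8, yet every distinct mark sequence can mint a brand-new synthetic.

```python
# Generate distinct "A" + 4-combining-mark grapheme clusters, the
# shape of stress input described above (reconstruction, not the
# original script).

from itertools import product

MARKS = [chr(c) for c in range(0x300, 0x370)]   # 112 combining marks

def clusters(n):
    """Yield up to n distinct one-base, four-mark grapheme clusters."""
    for i, marks in enumerate(product(MARKS, repeat=4)):
        if i >= n:
            return
        yield "A" + "".join(marks)

sample = list(clusters(1000))
bytes_per_cluster = len(sample[0].encode("utf-8"))  # 1 + 4 * 2 = 9
distinct_possible = len(MARKS) ** 4                 # ordered sequences
```

With 112^4 (over 157 million) possible ordered mark sequences at 9 bytes each, the input needed to exhaust a synthetics table is bounded mainly by how many distinct clusters the attacker bothers to emit.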
15:33 |
|
|
choosing the same initial character also results in a very wide NFG trie node I think |
15:35 |
|
|
whew, memory usage is also going up quite a bit |
15:40 |
|
|
29k entries in the "free at safepoint" linked list |
15:44 |
|
|
that's not just 16 bytes per entry for the linked list alone ... |
15:45 |
|
mandrillone |
timo: looks like adding complexity to complexity |
16:37 |
|
|
I'd get rid of substrings altogether
16:38 |
|
|
In any case, it’s an implementation detail |
16:39 |
|
15 May 2025 |
lizmat |
timo what is the cutoff point for in-situ strings again? |
09:54 |
|
|
feels to me those strings wouldn't need interning? |
|
|
nine |
They still would |
10:05 |
|
|
The interning is also to avoid duplication of those strings in the mbc files |
|
|
|
Of those HLL Str objects. The low level strings are deduplicated anyway |
10:06 |
|
lizmat |
ok |
10:10 |
|
|
the current interning logic in RakuAST limits the number of interned strings to 64K |
|
|
|
that feels... arbitrary ? |
10:11 |
|
nine |
I guess it's a reasonable compromise. It prevents excessive memory usage in extreme cases. 64k string constants is a lot |
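A capped intern table is simple to sketch (hypothetical code, not RakuAST's actual logic): identical strings share one object until the table hits its limit, after which new strings simply pass through un-interned, trading some deduplication for a hard memory bound.

```python
# Hypothetical capped interning, sketching the 64K-limit idea:
# dedupe until the table is full, then stop growing.

class InternTable:
    def __init__(self, limit=64 * 1024):
        self.limit = limit
        self.table = {}

    def intern(self, s):
        hit = self.table.get(s)
        if hit is not None:
            return hit           # known string: share the one object
        if len(self.table) < self.limit:
            self.table[s] = s
            return s
        return s                 # table full: no dedup, bounded memory

t = InternTable(limit=2)
a1, a2 = t.intern("alpha"), t.intern("alpha")
t.intern("beta")
t.intern("gamma")   # over the cap: returned but never stored
```

The failure mode past the cap is graceful: duplicates cost memory again, but nothing breaks.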
10:17 |
|