00:00 vrurg joined 01:35 librasteve_ left
ShimmerFairy m: say "\x[1193F]A".chars # should be "1", since 2016 03:00
camelia 2
ShimmerFairy It's suddenly become apparent to me that, if I manage the Unicode upgrade well enough, all of Moar's Unicode support ought to get a once-over. 03:01
It's kinda fun to work through the grapheme code and find situations where it doesn't work right. For instance, "A\c[ZWJ]🧀" is misinterpreted as a single grapheme, because the implementation of rule GB11 is too broad. 06:58
lizmat ++ShimmerFairy yet again :-) 08:58
ShimmerFairy I've decided that unfortunately the grapheme breaker function needs to be completely rewritten. It was written for a world where you only needed one codepoint behind and ahead of the possible break point, but nowadays we have a number of rules that depend on more context. The current function only just manages RI grapheme state, but bolting additional stateful checks on would be awkward. 09:04
Perhaps the people who worked on the original function could work it in, but to me at least the design doesn't fit the current ruleset anymore. 09:05
lizmat hmmm... I hope that's not going to be too detrimental to decoding efficiency 09:08
ShimmerFairy I think it should be fine, since the state machine approach I'm trying to write up right now would let you skip rule checks that can't possibly be true. For a first pass the "one ahead/behind only" rules are mostly handled in a single state, but I think it could be broken down further to skip more checks on each run. 09:12
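For illustration only, here is a minimal Python sketch of the kind of state machine described above. It is not MoarVM's code. The state names, the `gcb` helper, and the hard-coded code point ranges are all assumptions made for this example, and only rules GB9, GB11 and GB12/GB13 of UAX #29 are modelled (no CR/LF, Control, Hangul, Prepend, SpacingMark or InCB handling). The point is that the state records the extra context, so the GB11 and RI checks only run when they can possibly apply.

```python
ZWJ = 0x200D

def gcb(cp):
    """Coarse break class from a hypothetical hard-coded table (not the real UCD)."""
    if cp == ZWJ:
        return "ZWJ"
    if 0x1F1E6 <= cp <= 0x1F1FF:
        return "RI"        # Regional_Indicator
    if 0x0300 <= cp <= 0x036F or 0xFE00 <= cp <= 0xFE0F or 0x1F3FB <= cp <= 0x1F3FF:
        return "Extend"    # combining marks, variation selectors, emoji modifiers
    if 0x1F300 <= cp <= 0x1FAFF:
        return "ExtPict"   # rough stand-in for Extended_Pictographic
    return "Other"

def graphemes(text):
    """Split text into clusters using a small state machine.

    States: START, BASE, PICT (seen ExtPict Extend*),
    PICT_ZWJ (seen ExtPict Extend* ZWJ), RI_ODD (odd number of RIs in this run).
    """
    clusters = []
    state = "START"
    for ch in text:
        cls = gcb(ord(ch))
        # Decide whether this code point joins the previous cluster.
        if state == "START":
            join = False
        elif cls in ("Extend", "ZWJ"):
            join = True                                  # GB9
        elif state == "PICT_ZWJ" and cls == "ExtPict":
            join = True                                  # GB11, only after ExtPict Extend* ZWJ
        elif state == "RI_ODD" and cls == "RI":
            join = True                                  # GB12/GB13, pairs only
        else:
            join = False                                 # GB999
        # Compute the next state from the current state and class.
        if cls == "ExtPict":
            state = "PICT"
        elif cls == "Extend":
            state = "PICT" if state == "PICT" else "BASE"
        elif cls == "ZWJ":
            state = "PICT_ZWJ" if state == "PICT" else "BASE"
        elif cls == "RI":
            state = "BASE" if state == "RI_ODD" else "RI_ODD"
        else:
            state = "BASE"
        if join:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

With this sketch, `graphemes("A\u200D\U0001F9C0")` yields two clusters, because the ZWJ was reached from BASE rather than PICT, while an emoji joined to another emoji by ZWJ stays one cluster.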
lizmat that sounds good: more power to ya! 09:24
ShimmerFairy Out of curiosity, is there an established way of profiling NFG string handling? I figured keeping track of how long 'make stresstest' takes would be informative, but if there's a better method I'll use it instead. 10:28
10:59 librasteve_ joined
lizmat timo might know 11:30
disbot6 <jubilatious1_98524> m: say "\x[1193F]".chars; 17:24
<Raku eval> 1
timo it seemed to me like we already had something that can do more than one ahead and behind with some state kept, especially for the regional indicators handling that wants multiple-of-two codes 17:30
disbot6 <jubilatious1_98524> m: say "\x[1193F]"; 17:33
<Raku eval> 𑤿
timo the "does a string need re-checking after concat" check may be more interesting? 17:34
disbot6 <jubilatious1_98524> I don't know if \x[1193F] is a free-standing character or not. 17:36
timo trying to get something from unicode.org and it's taking ... a minute? 17:43
looks like 1193F is InCB=None and Grapheme_Extend is No, but Grapheme_Cluster_Break is Prepend 17:47
so with it being a Prepend that means we should never break after it (except of course at end-of-text) 17:48
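As a rough illustration of that Prepend rule (GB9b in UAX #29), here is a small Python sketch. The `PREPEND` set and the `count_clusters` name are hypothetical and hard-code only U+1193F. Every boundary that does not follow a Prepend is treated as a break, so the sketch ignores all other rules, including the Control/CR/LF exceptions.

```python
# Hypothetical one-entry table; real data comes from GraphemeBreakProperty.txt.
PREPEND = {0x1193F}

def count_clusters(text):
    """Count clusters when GB9b (no break after Prepend) is the only joining rule."""
    count = 0
    prev_is_prepend = False
    for ch in text:
        # A new cluster starts unless the previous code point was a Prepend.
        if not prev_is_prepend:
            count += 1
        prev_is_prepend = ord(ch) in PREPEND
    return count
```

Under these assumptions `count_clusters("\U0001193FA")` gives 1, matching the expected result from the start of the log, and the lone character also counts as 1 because end-of-text always closes the cluster.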
disbot6 <jubilatious1_98524> Amazing! 17:51
<jubilatious1_98524> m: say Unicode.version; 17:56
<Raku eval> v15.0
<jubilatious1_98524> m: say "A\c[ZWJ]🧀".chars 18:06
<Raku eval> 1
timo you think that's not right? 18:16
19:08 patrickb left, patrickb joined 19:25 vrurg_ joined, linkable6 left, notable6 left, linkable6 joined, sugarbeet left 19:26 sugarbeet joined, bloatable6 left, benchable6 left, tellable6 left 19:27 vrurg left 19:29 notable6 joined