|
00:00
vrurg joined
01:35
librasteve_ left
|
|||
| ShimmerFairy | m: say "\x[1193F]A".chars # should be "1", since 2016 | 03:00 | |
| camelia | 2 | ||
| ShimmerFairy | It's suddenly become apparent to me that, if I manage the Unicode upgrade well enough, all of Moar's Unicode support ought to get a once-over. | 03:01 | |
| It's kinda fun to work through the grapheme code and find situations where it doesn't work right. For instance, "A\c[ZWJ]🧀" is misinterpreted as a single grapheme, because the implementation of rule GB11 is too broad. | 06:58 | ||
| lizmat | ++ShimmerFairy yet again :-) | 08:58 | |
| ShimmerFairy | I've decided that unfortunately the grapheme breaker function needs to be completely rewritten. It was written for a world where you only needed one codepoint behind and ahead of the possible break point, but nowadays we have a number of rules that depend on more context. The current function only just manages RI grapheme state, but bolting additional stateful checks on would be awkward. | 09:04 | |
| Perhaps the people who worked on the original function could work it in, but to me at least the design doesn't fit the current ruleset anymore. | 09:05 | ||
| lizmat | hmmm... I hope that's not going to be too detrimental to decoding efficiency | 09:08 | |
| ShimmerFairy | I think it should be fine, since the state machine approach I'm trying to write up right now would let you skip rule checks that can't possibly be true. For a first pass the "one ahead/behind only" rules are mostly handled in a single state, but I think it could be broken down further to skip more checks on each run. | 09:12 | |
| lizmat | that sounds good: more power to ya! | 09:24 | |
| ShimmerFairy | Out of curiosity, is there an established way of profiling NFG string handling? I figured keeping track of how long 'make stresstest' takes would be informative, but if there's a better method I'll use it instead. | 10:28 | |
|
10:59
librasteve_ joined
|
|||
| lizmat | timo might know | 11:30 | |
| disbot6 | <jubilatious1_98524> m: say "\x[1193F]".chars; | 17:24 | |
| <Raku eval> 1 | |||
| timo | it seemed to me like we already had something that can do more than one ahead and behind with some state kept, especially for the regional indicators handling that wants multiple-of-two codes | 17:30 | |
| disbot6 | <jubilatious1_98524> m: say "\x[1193F]"; | 17:33 | |
| <Raku eval> 𑤿 | |||
| timo | the "does a string need re-checking after concat" check may be more interesting? | 17:34 | |
| disbot6 | <jubilatious1_98524> I don't know if \x[1193F] is a free-standing character or not. | 17:36 | |
| timo | trying to get something from unicode.org and it's taking ... a minute? | 17:43 | |
| looks like 1193F is InCB=None and Grapheme_Extend is No, but Grapheme_Cluster_Break is Prepend | 17:47 | ||
| so with it being a Prepend that means we should never break after it (except of course at end-of-text) | 17:48 | ||
| disbot6 | <jubilatious1_98524> Amazing! | 17:51 | |
| <jubilatious1_98524> m: say Unicode.version; | 17:56 | ||
| <Raku eval> v15.0 | |||
| <jubilatious1_98524> m: say "A\c[ZWJ]🧀".chars | 18:06 | ||
| <Raku eval> 1 | |||
| timo | you think that's not right? | 18:16 | |
|
19:08
patrickb left,
patrickb joined
19:25
vrurg_ joined,
linkable6 left,
notable6 left,
linkable6 joined,
sugarbeet left
19:26
sugarbeet joined,
bloatable6 left,
benchable6 left,
tellable6 left
19:27
vrurg left
19:29
notable6 joined
|
|||