| ShimmerFairy | I've decided to do some historical digging to see why MoarVM decided against ICU back in the day, so far it seems people were very unhappy with ICU back in 2005/06, though possibly a lot of that had to do with the fact that it used to be bundled in with Parrot and needed to be built with it (and thus as an external lib would've been less annoying?). | 12:04 | |
| I'm doing this because figuring out what to do with MoarVM's unicode support will depend on how relevant the old objections to ICU are nowadays. | 12:05 | ||
| (btw, I've noticed that this nifty irclogs website Raku has doesn't include #parrot (and any related) channels, which it arguably should) | 12:09 | ||
| Alright, having combed through all of #perl6, it seems that the historical objection to ICU amounts to "couldn't get it to build on Windows", which is probably a lot easier to do nowadays (not to mention a fork of ICU has been integrated into Windows since 2017). | 13:00 | ||
| The only explicit mention of something wrong with ICU is the lack of "NFG support" (i.e. a struct/class that models an NFG string), but you don't *need* that functionality to be tightly integrated into a Unicode library, so that complaint baffles me. | 13:02 | ||
| nemokosch | I thought NFG was specifically made up by Perl 6 rather than Unicode itself | 13:04 | |
| timo | maybe it means that when you use ICU, any string that wants to be in NFG can't use any of the ICU string functions because ICU has no concept of NFG? | 13:07 | |
| ShimmerFairy | It is, but it's also nothing more than using 2 (two) standard Unicode algorithms to accomplish a particular task. It's the sort of thing you could easily build on top of a Unicode library; there's no need to build it *into* the library. | ||
| But yeah, in terms of actual functionality, I can't find any old mention of something that ICU fails to do that the compiler devs needed to compensate for. Sure it would be nice for your backing Unicode library to have a grapheme string built in, but it's ultimately just an application of Unicode to a task, not a modification of Unicode itself. | 13:14 | ||
| The true objection seems to be purely about having to build it on Windows machines (and in the mid-2000s, some bristling that ICU requires you to have a C++ compiler available, horror of horrors, back when everyone still thought supporting pre-standard C systems was good and useful.) | 13:16 | ||
| Also, Larry was consistently of the opinion that Perl 5's integrated Unicode support was a perfectly fine approach, no need to do things differently. | 13:17 | ||
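ShimmerFairy's point above — that NFG is just two standard Unicode algorithms (canonical normalization plus grapheme cluster segmentation) layered on top of a Unicode library — can be illustrated with a deliberately simplified sketch. This is not MoarVM's implementation: it approximates a grapheme cluster as a base character plus trailing combining marks (real UAX #29 segmentation also handles Hangul, ZWJ/emoji sequences, CRLF, and more) and hands out negative integers as synthetic codes:

```python
import unicodedata
from itertools import count

def nfg(text: str, table: dict, synthetic) -> list[int]:
    """Toy NFG: NFC-normalize, cluster base + combining marks, then map each
    cluster to a single integer (negative = synthetic multi-codepoint cluster)."""
    clusters: list[str] = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch          # attach combining mark to its base
        else:
            clusters.append(ch)
    out = []
    for c in clusters:
        if len(c) == 1:
            out.append(ord(c))          # a lone code point stands for itself
        else:
            if c not in table:
                table[c] = next(synthetic)   # invent a synthetic code once
            out.append(table[c])
    return out

table, synth = {}, count(-1, -1)
# e + combining acute composes to U+00E9 under NFC; x + COMBINING LONG STROKE
# OVERLAY (U+0336) has no precomposed form, so it gets a synthetic code.
codes = nfg("e\u0301x\u0336", table, synth)
print(codes)   # → [233, -1]
```

Once every grapheme is a single integer, indexing and comparison become fixed-width operations, which is the O(1) property discussed below.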
| Nicholas | (I'm not here) - implementation of NFG ought to be O(1) on various tasks, whereas handling graphemes as NFC or NFD is O(n) | 13:27 | |
| also is ICU still UTF-16, or does it offer proper UTF-32 APIs? | |||
| as in, implementation of NFG *is* O(1) because the synthetic code points created for graphemes are single integers | 13:28 | ||
| nemokosch | it is O(1), however, the creation of these strings is O(N) | ||
| so this is a red herring | |||
| Nicholas | it may or may not be a red herring. Once you have the strings, various regex, er rule, operations can be O(1) whereas with NFC they are O(n). This is a question of trade offs | 13:30 | |
| nemokosch | one thing is sure: once you decide to make everything NFG-processed, all your strings will be slow | 13:31 | |
| the mere creation of them | 13:32 | ||
| it would still be a valid and rather simple tradeoff to only build the NFG version lazily, at the first operation that needs it | |||
| Nicholas | yes, that follows. But unless you also want to normalise strings, then U+00E9 is not equal to U+0065 U+0301, even if the human assumed this | 13:34 | |
| timo | we have 8bit string storage as well in moarvm though | ||
| Nicholas | (I hope I have my example correct. it's trying to be é expressed in NFC vs decomposed | ||
| nemokosch | however, ICU would at least do NFC | ||
| Nicholas | and humans don't care how the computer represents what they typed) | 13:35 | |
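Nicholas's U+00E9 example can be checked directly. This uses Python's stdlib `unicodedata` purely to illustrate the point (nothing MoarVM-specific): the two spellings differ as raw code point sequences, but normalizing to NFC makes them compare equal, which is what the human expects.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point (the NFC form)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (the NFD form)

raw_equal = precomposed == decomposed                                # False
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed  # True
print(raw_equal, nfc_equal, len(precomposed), len(decomposed))
```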
| nemokosch | anyway, what ShimmerFairy did is a very refreshing precedent | 13:36 | |
| Nicholas | From limited reading of the scrollback, my memory was also that Larry was of the opinion that Unicode support was supposed to be a key feature of "the artist then known as Perl 6", so the intent was to be ahead of the curve, not relying on a third party library. | 13:37 | |
| nemokosch | figuring out the original design considerations and seeing how they apply | ||
| Nicholas | that design consideration at the time would mean that ICU was "lagging". These days, I think that "rate of change of Unicode" has slowed down, and ICU is mostly there | 13:38 | |
| nemokosch | for example, being ahead of the curve, I think it's fair to say now that this simply didn't happen | ||
| Nicholas | well, *is* there. Unicode is mostly adding more damn emojis | ||
| nemokosch | anyway, it's great to see that some people are both enthusiastic and able to work with Unicode | 13:48 | |
| ShimmerFairy | I can't speak to how good or bad ICU was as an implementation of Unicode back in the day, all I can say is that nobody in #perl6 ever complained that it wasn't letting them implement a feature. | ||
| nemokosch | for myself, I'm rather curious about why the Zig compiler toolchain failed, but even that is well outside my comfort zone for sure | ||
| ShimmerFairy | Ultimately, I get the vague feeling that back then, people thought other languages had poor Unicode support because good tools didn't exist (thus we have to make our own), when I think the truth is that adding Unicode support to an existing language is not trivial. | 13:51 | |
| Maybe back in the early early days of (non-6) Perl, it truly was the case that nobody had a good Unicode library on hand to hook into, but perhaps even in the mid-2000s that was already no longer true. | 13:52 | ||
| nemokosch | the Python 3 "fiasco" would certainly give that impression to us mere mortals | ||
| Nicholas | unicode-org.github.io/icu-docs/api...ng_8h.html -- ICU uses 16-bit Unicode (UTF-16) in the form of arrays of UChar code units. UTF-16 encodes each Unicode code point with either one or two UChar code units. (This is the default form of Unicode, and a forward-compatible extension of the original, fixed-width form that was known as UCS-2. UTF-16 superseded UCS-2 with Unicode 2.0 in 1996.) | 13:54 | |
| This is still UTF-16. This smells. | |||
| (it might smell less bad than the alternatives, but UTF-16 has downsides of UTF-8, downsides of UTF-32, and some of its own) | 13:55 | ||
| timo | love me a good surrogate pair | ||
| Nicholas | I'm not saying "don't do it" but UTF-16 has its own pain when you'd rather be storing your NFC or NFD code points in something fixed width | 13:56 | |
| ShimmerFairy | Yeah, it's not great, and I wish that ICU had more complete support for UChar32 * strings, but at the end of the day the UTF that underlies a higher-level string type shouldn't matter to the public interface. | ||
| Nicholas | I believe for most users, index operations on strings aren't a performance issue | ||
| but if you're thinking about regular expressions (at least, historically for how the past 20 to 30 years went) you think of implementation details in terms of index location | 13:57 | ||
| and for UTF-16 storage that's O(N) for NFC text | |||
| and if you're thinking graphemes you have O(N) conversion from graphemes to NFC or NFD | |||
| and another O(n) from NFC code points to UTF-16 representation | 13:58 | ||
| (in the general case)(does the general case matter?) | |||
| *those* would be the two reasons for not wanting the ICU *code*. The data structures - totally | |||
| the trade off here clearly is that unmaintained or stale/laggy custom Unicode support | 13:59 | ||
| sucks more than not-ultimately performant current Unicode support from a third party library | |||
| but O(n) atop O(n) is quadratic. So there's a quadratic trap here. Whether it ever springs is really the question | 14:00 | ||
| ShimmerFairy | While indexing an arbitrary codepoint in a non-UTF-32 string would be more expensive than in a UTF-32 string, for things like regexes what you'd actually be doing is keeping track of where you are in the string, and then asking to move forward/back by some number of codepoints. And both UTF-8 and UTF-16 are designed to make that quick and easy, if not as much so as UTF-32. | 14:01 | |
| timo | i don't think we ever got a good iterator caching implementation for our regex engine on parrot, so if there was ever "a unicode character" in the core setting, compile times suddenly became hours to days :) | ||
| Nicholas | I once screwed up the Perl 5 "caching" for UTF-8 to code point offsets | ||
| lizmat | ShimmerFairy: if you have some logs of #parrot somewhere, I could probably integrate them into the website | 14:02 | |
| Nicholas | it matters a lot, it turns out, how that caching worked | ||
| lizmat | colabti.org/ircloggy/ doesn't provide them | ||
| Nicholas | "recent place" and "recent offset from that place" were far more performant than 2 "most recent places" | ||
| this is a decade ago. I forget the details. But the exact form of the caching was an important crutch. And things were only performant because a cache hid O(n) behaviour | 14:03 | ||
| there are traps here. They might be avoidable. But they exist | |||
| nemokosch | another thing to keep in mind is that Raku's regexes are not just different syntax-wise. Not sure if that creates new obstacles but it's worth noting | 14:05 | |
| ShimmerFairy | Hah, apparently the only #parrot log I have on hand is one time where I joined, asked a question, got no response, and left 20 minutes later. | ||
| timo | to be fair, only waiting 20 minutes is quite short | 14:06 | |
| Nicholas | if raku's default "boundary" character maps to a grapheme boundary, rather than a code point boundary, then performance of graphemes is going to matter. If that can be faked/maintained with some amount of caching that's cool. But it ought to be tested, and at some sort of scale | ||
| and I need to be somewhere else, so I need to go AFK and really be "not here" | |||
| timo | see you later Nicholas :) | 14:07 | |
| Nicholas | I hope what I brain dumped was useful. I'm not sure what the right trade offs are, but I hope I added some useful data to help make them better | ||
| lizmat | Nicholas++ | ||
| ShimmerFairy | It's good to get the input. I'm still not the biggest fan of ICU myself, but I find it fascinating that the historical objections to it were really entirely about building it on Windows (and being a wrinkle in the old desire for a pure ANSI C project). Makes it harder to decide if we should keep not using it. | 14:08 | |
| Nicholas | I believe it was also "UTF-16" as a conversion mismatch | 14:09 | |
| ShimmerFairy | A point in favor of moving away from internal 32-bit codepoints, by the way, is that across the programming world people usually really don't like wasting four bytes per character, especially when it's mostly ASCII text. In that sense UTF-16 is a decent compromise, being somewhat space-efficient and somewhat fixed-size. | 14:12 | |
| lizmat | note that MoarVM nowadays uses 8-bit representation if it can | 14:13 | |
| timo right ? | |||
| timo | another note: the crlf grapheme, being a synthetic, can give you 32-bit storage when you don't expect it | ||
| lizmat | so "foo" would be 8-bit, and "foo\n" would be 32bit | 14:14 | |
| timo | i'd have to double-check, I think only \r\n gets a synthetic, not \n or \r on their own | 14:15 | |
| in theory, we have a strands data structure that would allow us to store only small pieces of a string at the higher byte size, for strings where the ratio is favourable | 14:16 | ||
| it has a performance impact to have to go the extra indirection step, but it can be worth it if you save a lot of memory that no longer needs to go into the cpu cache | |||
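The 8/16/32-bit per-string storage choice lizmat and timo describe is the same trick CPython uses (PEP 393 compact strings), which makes the memory stakes easy to demonstrate. The exact byte counts below are implementation details that vary across Python versions, so only the ordering is meaningful: one wide character forces the whole string to the wider representation.

```python
import sys

ascii_s = "a" * 100                # 1 byte per character (Latin-1 range)
wide_s  = "\u0100" + "a" * 99      # one char above U+00FF forces 2 bytes/char
wider_s = "\U0001F600" + "a" * 99  # one astral char forces 4 bytes/char

sizes = [sys.getsizeof(s) for s in (ascii_s, wide_s, wider_s)]
print(sizes)  # strictly increasing: the widest character sets the string's width
```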
| Nicholas | IIRC Swift uses UTF-8 for storage, as the trade offs for that were better. (storage size/cache hits vs tight CPU bound code to do O(n) calculations that had few to no cache misses) | ||
| and at one level, the logical trade was fixed width storage in whichever worked best of 7 bit, 8 bit, 16 bit and 32 bit | 14:17 | ||
| ShimmerFairy | (Another related note is that every UTF has a max of 4 bytes to the codepoint, so picking UTF-8 or UTF-16 will never give you larger byte sequences than the same text in UTF-32.) | |||
| timo | i thought you need to go AFK :) | ||
| Nicholas | (7 bit had the advantage that we can be sure that it's all ASCII, we don't need to care whether something later wanted UTF-8 or some legacy 8 bit encoding) | 14:18 | |
| I fail. | |||
| timo | i'm also AFK for a bit now | ||
| Nicholas | The coffee is still AFK. I will crack soon. | ||
| and that "7 bit" thing is something I remember Dan saying | |||
| about the *only* thing I remember from back then. | 14:19 | ||
| right, I've cracked. ENOCOFFEE | |||
| ShimmerFairy | While ICU runs on UTF-16, there are some attempts to make it easier to use other forms, like UText. Problem is, from a quick glance none of the things they offer are compatible with much of ICU, and I think they'd all require some amount of C++ glue to make functional. (e.g. despite what UText is meant to do, they have yet to bother providing a version of it for UTF-32 text) | 14:20 | |
| Overall, I think what I perhaps want to do on this topic is go ahead with that personal experiment to put ICU into MoarVM, because there's no known reason it can't work. Either I'll end up with a functional copy of MoarVM, or I'll discover firsthand how bad it is for this virtual machine. | 14:23 | ||
| lizmat | ++ShimmerFairy | 14:25 | |
| ShimmerFairy | (I *do* still like the idea of factoring things out into a separate library, but I've learned over the years that I have a really bad habit of writing my own code when I don't need to, and thus to distrust those sorts of impulses) | ||
| timo | i'd be interested to know if ICU has something to reduce the size of the parts of the library that contain things like the names of unicode characters | ||
| or any other kind of compression | 14:26 | ||
| ShimmerFairy | A quick look tells me that libicudata.so.78.2, the largest of its libraries, is approx. 32MiB | 14:27 | |
| timo | uhhhh, libmoar.so is "just" 27 megs big | 14:28 | |
| ShimmerFairy | I should note that ICU is definitely a library I would not want to bundle with MoarVM under 3rdparty/, and I also would expect it to be dynamically linked to the one your system already has installed, at least in POSIX land. If we have to consider static linking scenarios, then that changes things considerably. | 14:33 | |
| timo | right, i wouldn't suggest statically linking it | ||
| lizmat | wouldn't dynlinking and versioning issues bring a lot of potential turmoil ? | 14:34 | |
| ShimmerFairy | in the case of ICU, yes unfortunately. The library's ABI version is just the version of the project, so every major ICU update means programs have to be recompiled to link to the new library. However, at least on systems like Linux, the 'moarvm' package would only be as difficult to install as any other ICU-using package, like Qt, or Firefox without its bundled version of ICU. | 14:40 | |
| timo | so it's expected and fine to have multiple libicu versions on a system at the same time? | 14:44 | |
| based on which packages already recompiled against the new version? | |||
| timo | we will have to start bundling libicu with the binary releases we have on rakudo.org | 14:51 | |
| ShimmerFairy | Message got eaten: I can't speak to that, because I use Gentoo, so whenever ICU is updated the package manager just rebuilds everything that needs to be, and thus I only have one set of ICU libraries in /usr/lib64 | ||
| Just to be clear, I don't object to the idea of pursuing factoring out our own library first, and saving ICU as plan b, I just know that I can be too eager to take the "do it myself" approach, so I'm hesitant to push that idea in a group project. | 14:53 | ||
| timo | fair | 14:57 | |
| librasteve | ICU in MoarVM seems like a very good idea to me - even though there is likely a lot of 6.d code out there that depends on 6.d Unicode, I would be very happy to make Raku "6.f" fully ICU centric even at the cost of tweaking the regex design | 15:59 | |
| lizmat | brrr... librasteve well volunteered :-) | ||
| timo | by "the regex design" you mean the implementation? | 16:10 | |
| ShimmerFairy | Switching MoarVM over to ICU shouldn't cause any breaking changes in Raku, nor NQP I would think. The most you should see is that referencing properties by name would be less precise, and more properties would be supported (since I think our UCD script still doesn't actually incorporate all the properties there are; I haphazardly added a couple that I thought I'd need when it came to updating the grapheme rules). | 17:59 | |
| librasteve | timo: ShimmerFairy mentions that "referencing properties by name" would be less precise (and I assume some other data-dependent results) so I guess this is (strictly speaking) a breaking change | 19:23 | |
| personally I think that that would be a price worth paying for being able to state that Raku is 100% UTS#18 compliant | |||
| also they mention that UTS#18 no longer requires regex to be able to support :i fully - so I guess this is a change to the intent of the Raku regex design (even if in practice it was never possible to achieve that) | 19:25 | ||
| anyway I am far from an expert in this - so my opinion is mostly around my perception of the trade offs between the costs of moving to ICU vs the benefits of having Raku on the official Unicode toolchain | 19:27 | ||
| ShimmerFairy | That wouldn't be a breaking change, because no existing code would break. Current Raku is, at least in some cases, kinda picky about exact string matches for property names, when Unicode in practice recommends a loose matching procedure. | ||
| m: say " ".uniprop($_) for ("White_Space", "Whitespace", "whitespace", "WhiteSpace", "White Space") # these should all return the same true value, under loose matching | 19:28 | ||
| camelia | True 0 True True 0 | ||
| ShimmerFairy | (to be clear you obviously don't *need* ICU to fix this issue, but iiuc it would come for "free" with using ICU to query properties) | 19:40 | |
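The loose matching procedure ShimmerFairy refers to is specified in UAX #44 (rule UAX44-LM3): when comparing property names, ignore case, whitespace, hyphens and underscores, and an initial "is" prefix. A minimal sketch of such a normalization key (simplified; the real rule has a few documented exceptions) shows how all the spellings from the `uniprop` example collapse to one name:

```python
def loose_key(name: str) -> str:
    # UAX44-LM3-style key: drop whitespace/'-'/'_', lowercase, strip "is" prefix
    key = "".join(ch for ch in name.lower() if ch not in " \t-_")
    return key[2:] if key.startswith("is") else key

aliases = ["White_Space", "Whitespace", "whitespace", "WhiteSpace",
           "White Space", "is_White_Space"]
keys = {loose_key(a) for a in aliases}
print(keys)   # all six spellings collapse to a single key
```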
| I have to go for the day now, but of course I'll still be thinking about all this. | 19:43 | ||
| lizmat | fwiw, I've been able to obtain a copy of the #parrot #parrotsketch logs in a MariaDB database... so now I only need the tuits for conversion | 20:02 | |
| japhb | ++lizmat | 20:03 | |