ShimmerFairy I've decided to do some historical digging to see why MoarVM decided against ICU back in the day, so far it seems people were very unhappy with ICU back in 2005/06, though possibly a lot of that had to do with the fact that it used to be bundled in with Parrot and needed to be built with it (and thus as an external lib would've been less annoying?). 12:04
I'm doing this because figuring out what to do with MoarVM's unicode support will depend on how relevant the old objections to ICU are nowadays. 12:05
(btw, I've noticed that this nifty irclogs website Raku has doesn't include #parrot (and any related) channels, which it arguably should) 12:09
Alright, having combed through all of #perl6, it seems that the historical objection to ICU amounts to "couldn't get it to build on Windows", which is probably a lot easier to do nowadays (not to mention a fork of ICU has been integrated into Windows since 2017). 13:00
The only explicit mention of something wrong with ICU is the lack of "NFG support" (i.e. a struct/class that models an NFG string), but you don't *need* that functionality to be tightly integrated into a Unicode library, so that complaint baffles me. 13:02
nemokosch I thought NFG was specifically made up by Perl 6 rather than Unicode itself 13:04
timo maybe it means that when you use ICU, any string that wants to be in NFG can't use any of the ICU string functions because ICU has no concept of NFG? 13:07
ShimmerFairy It is, but it's also nothing more than using 2 (two) standard Unicode algorithms to accomplish a particular task. It's the sort of thing you could easily build on top of a Unicode library; there's no need to build it *into* the library.
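The point that NFG is "just two standard Unicode algorithms" can be sketched in a few lines. The following is an illustrative toy, not MoarVM's actual implementation: it combines NFC normalization with a deliberately simplified grapheme segmentation (base character plus combining marks, rather than the full UAX #29 rules), and assigns negative "synthetic" integers to multi-codepoint graphemes so that indexing the result is O(1).

```python
# Illustrative NFG-style encoding built from two standard Unicode steps:
# NFC normalization + (simplified) grapheme segmentation.
import unicodedata

def nfg_encode(text, synthetics=None):
    """Return a list of ints: real code points for single-codepoint
    graphemes, negative 'synthetic' ids for multi-codepoint graphemes."""
    if synthetics is None:
        synthetics = {}
    text = unicodedata.normalize("NFC", text)
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch          # attach combining mark to its base
        else:
            clusters.append(ch)
    out = []
    for cluster in clusters:
        if len(cluster) == 1:
            out.append(ord(cluster))    # plain code point
        else:
            # one negative integer per distinct multi-codepoint grapheme
            out.append(synthetics.setdefault(cluster, -1 - len(synthetics)))
    return out, synthetics

# "é" precomposed vs "e" + COMBINING ACUTE: NFC unifies them, so both
# encode to the single code point U+00E9.
a, _ = nfg_encode("\u00e9")
b, _ = nfg_encode("e\u0301")
assert a == b == [0xE9]

# "x" + COMBINING ACUTE has no precomposed form, so it stays a
# two-codepoint grapheme and gets a synthetic id instead.
enc, syn = nfg_encode("ax\u0301a")
assert enc == [97, -1, 97]
```

Once every grapheme is a single integer, operations like "the nth grapheme" are a plain array index, which is the O(1) property discussed below.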
But yeah, in terms of actual functionality, I can't find any old mention of something that ICU fails to do that the compiler devs needed to compensate for. Sure it would be nice for your backing Unicode library to have a grapheme string built in, but it's ultimately just an application of Unicode to a task, not a modification of Unicode itself. 13:14
The true objection seems to be purely about having to build it on Windows machines (and in the mid-2000s, some bristling that ICU requires you to have a C++ compiler available, horror of horrors, back when everyone still thought supporting pre-standard C systems was good and useful.) 13:16
Also, Larry was consistently of the opinion that Perl 5's integrated Unicode support was a perfectly fine approach, no need to do things differently. 13:17
Nicholas (I'm not here) - implementation of NFG ought to be O(1) on various tasks, whereas handling graphemes as NFC or NFD is O(n) 13:27
also is ICU still UTF-16, or does it offer proper UTF-32 APIs?
as in, implementation of NFG *is* O(1) because the synthetic code points created for graphemes are single integers 13:28
nemokosch it is O(1), however, the creation of these strings is O(N)
so this is a red herring
Nicholas it may or may not be a red herring. Once you have the strings, various regex, er rule, operations can be O(1) whereas with NFC they are O(n). This is a question of trade offs 13:30
nemokosch one thing is sure: once you decide to make everything NFG-processed, all your strings will be slow 13:31
the mere creation of them 13:32
it would still be a valid and rather simple tradeoff to only build the NFG version at the first operation of a certain kind
Nicholas yes, that follows. But unless you also want to normalise strings, then U+00E9 is not equal to U+0065 U+0301, even if the human assumed this 13:34
timo we have 8bit string storage as well in moarvm though
Nicholas (I hope I have my example correct. it's trying to be é expressed in NFC vs decomposed
nemokosch however, ICU would at least do NFC
Nicholas and humans don't care how the computer represents what they typed) 13:35
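Nicholas's example is easy to verify with Python's stdlib `unicodedata` module: U+00E9 and U+0065 U+0301 are different code-point sequences that compare unequal until one side is normalized.

```python
# Why normalization is needed before comparing: the same visible "é"
# can be one code point (NFC) or two (NFD).
import unicodedata

nfc = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, precomposed
nfd = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

assert nfc != nfd                                   # raw comparison fails
assert unicodedata.normalize("NFC", nfd) == nfc     # compose to one code point
assert unicodedata.normalize("NFD", nfc) == nfd     # decompose to two
```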
nemokosch anyway, what ShimmerFairy did is a very refreshing precedent 13:36
Nicholas From limited reading of the scrollback, my memory was also that Larry was of the opinion that Unicode support was supposed to be a key feature of "the artist then known as Perl 6", so the intent was to be ahead of the curve, not relying on a third party library. 13:37
nemokosch figuring out the original design considerations and see how they apply
Nicholas that design consideration at the time would mean that ICU was "lagging". These days, I think that "rate of change of Unicode" has slowed down, and ICU is mostly there 13:38
nemokosch being ahead of the curve, for example: I think it's fair to say now that this simply didn't happen
Nicholas well, *is* there. Unicode is mostly adding more damn emojis
nemokosch anyway, it's great to see that some people are both enthusiastic and able to work with Unicode 13:48
ShimmerFairy I can't speak to how good or bad ICU was as an implementation of Unicode back in the day, all I can say is that nobody in #perl6 ever complained that it wasn't letting them implement a feature.
nemokosch for myself, I'm rather curious about why the Zig compiler toolchain failed, but even that is well outside my comfort zone for sure
ShimmerFairy Ultimately, I get the vague feeling that back then, people thought other languages had poor Unicode support because good tools didn't exist (thus we have to make our own), when I think the truth is that adding Unicode support to an existing language is not trivial. 13:51
Maybe back in the early early days of (non-6) Perl, it truly was the case that nobody had a good Unicode library on hand to hook into, but perhaps even in the mid-2000s that was already no longer true. 13:52
nemokosch the Python 3 "fiasco" would certainly give us mere mortals that impression
Nicholas unicode-org.github.io/icu-docs/api...ng_8h.html -- ICU uses 16-bit Unicode (UTF-16) in the form of arrays of UChar code units. UTF-16 encodes each Unicode code point with either one or two UChar code units. (This is the default form of Unicode, and a forward-compatible extension of the original, fixed-width form that was known as UCS-2. UTF-16 superseded UCS-2 with Unicode 2.0 in 1996.) 13:54
This is still UTF-16. This smells.
(it might smell less bad than the alternatives, but UTF-16 has the downsides of UTF-8, the downsides of UTF-32, and some all its own) 13:55
timo love me a good surrogate pair
Nicholas I'm not saying "don't do it" but UTF-16 has its own pain when you'd rather be storing your NFC or NFD code points in something fixed width 13:56
ShimmerFairy Yeah, it's not great, and I wish that ICU had more complete support for UChar32 * strings, but at the end of the day the UTF that underlies a higher-level string type shouldn't matter to the public interface.
Nicholas I believe for most users, index operations on strings aren't a performance issue
but if you're thinking about regular expressions (at least, historically for how the past 20 to 30 years went) you think of implementation details in terms of index location 13:57
and for UTF-16 storage that's O(N) for NFC text
and if you're thinking graphemes you have O(N) conversion from graphemes to NFC or NFD
and another O(n) from NFC code points to UTF-16 representation 13:58
(in the general case)(does the general case matter?)
*that* would be the two reasons for not wanting the ICU *code*. The data structures - totally
the trade off here clearly is that unmaintained or stale/laggy custom Unicode support 13:59
sucks more than not-ultimately performant current Unicode support from a third party library
but O(n) atop O(n) is quadratic. So there's a quadratic trap here. Whether it ever springs is really the question 14:00
ShimmerFairy While indexing an arbitrary codepoint in a non-UTF-32 string would be more expensive than in a UTF-32 string, for things like regexes what you'd actually be doing is keeping track of where you are in the string, and then asking to move forward/back by some number of codepoints. And both UTF-8 and UTF-16 are designed to make that quick and easy, if not as much so as UTF-32. 14:01
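The "move forward by N codepoints" operation that makes this workable can be sketched directly: in UTF-8 the lead byte alone encodes the sequence length, so stepping forward never requires decoding or scanning backwards. A minimal illustration (toy code, not how any particular engine implements it):

```python
# Stepping through UTF-8 by code points: the lead byte's high bits give
# the sequence length, so each step is a single table-free lookup.
def utf8_seq_len(lead: int) -> int:
    if lead < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if lead < 0xC0:
        raise ValueError("continuation byte is not a lead byte")
    if lead < 0xE0:
        return 2                      # 110xxxxx
    if lead < 0xF0:
        return 3                      # 1110xxxx
    return 4                          # 11110xxx

def advance(buf: bytes, pos: int, n: int) -> int:
    """Byte offset after stepping forward n code points from pos."""
    for _ in range(n):
        pos += utf8_seq_len(buf[pos])
    return pos

# "a" (1 byte), "é" (2), "€" (3), "😀" (4): offsets 0, 1, 3, 6, end 10.
s = "a\u00e9\u20ac\U0001F600".encode("utf-8")
assert advance(s, 0, 1) == 1
assert advance(s, 0, 2) == 3
assert advance(s, 0, 3) == 6
assert advance(s, 0, 4) == 10
```

Each step is O(1) in the length of the string, though random access to "codepoint i" from scratch is still O(i), which is where the caching Nicholas mentions comes in.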
timo i don't think we ever got a good iterator caching implementation for our regex engine on parrot, so if there was ever "a unicode character" in the core setting, compile times suddenly became hours to days :)
Nicholas I once screwed up the Perl 5 "caching" for UTF-8 to code point offsets
lizmat ShimmerFairy: if you have some logs of #parrot somewhere, I could probably integrate them into the websote
*site 14:02
Nicholas it matters a lot, it turns out, how that caching worked
lizmat colabti.org/ircloggy/ doesn't provide them
Nicholas "recent place" and "recent offset from that place" were far more performant than 2 "most recent places"
this is a decade ago. I forget the details. But the exact form of the caching was an important crutch. And things were only performant because a cache hid O(n) behaviour 14:03
there are traps here. They might be avoidable. But they exist
nemokosch another thing to keep in mind is that Raku's regexes are not just different syntax-wise. Not sure if that creates new obstacles but it's worth noting 14:05
ShimmerFairy Hah, apparently the only #parrot log I have on hand is one time where I joined, asked a question, got no response, and left 20 minutes later.
timo to be fair, only waiting 20 minutes is quite short 14:06
Nicholas if raku's default "boundary" character maps to a grapheme boundary, rather than a code point boundary, then performance of graphemes is going to matter. If that can be faked/maintained with some amount of caching that's cool. But it ought to be tested, and at some sort of scale
and I need to be somewhere else, so I need to go AFK and really be "not here"
timo see you later Nicholas :) 14:07
Nicholas I hope what I brain dumped was useful. I'm not sure what the right trade offs are, but I hope I added some useful data to help make them better
lizmat Nicholas++
ShimmerFairy It's good to get the input. I'm still not the biggest fan of ICU myself, but I find it fascinating that the historical objections to it were really entirely about building it on Windows (and being a wrinkle in the old desire for a pure ANSI C project). Makes it harder to decide if we should keep not using it. 14:08
Nicholas I believe it was also "UTF-16" as a conversion mismatch 14:09
ShimmerFairy A point in favor of moving away from internal 32-bit codepoints, by the way, is that across the programming world people usually really don't like wasting four bytes per character, especially when it's mostly ASCII text. In that sense UTF-16 is a decent compromise, being somewhat space-efficient and somewhat fixed-size. 14:12
lizmat note that MoarVM nowadays uses 8-bit representation if it can 14:13
timo right ?
timo another note: the crlf grapheme, being a synthetic, can give you 32bit storage when you don't expect it
lizmat so "foo" would be 8-bit, and "foo\n" would be 32bit 14:14
timo i'd have to double-check, I think only \r\n gets a synthetic, not \n or \r on their own 14:15
in theory, we have a strands data structure that would allow us to store only small pieces of a string at the higher byte size, for strings where the ratio is favourable 14:16
it has a performance impact to have to go the extra indirection step, but it can be worth it if you save a lot of memory that no longer needs to go into the cpu cache
Nicholas IIRC Swift uses UTF-8 for storage, as the trade offs for that were better. (storage size/cache hits vs tight CPU bound code to do O(n) calculations that had few to no cache misses)
and at one level, the logical trade was fixed width storage in whichever worked best of 7 bit, 8 bit, 16 bit and 32 bit 14:17
ShimmerFairy (Another related note is that every UTF has a max of 4 bytes to the codepoint, so picking something other than UTF-16 and UTF-8 will never give you larger byte sequences than the same text in UTF-32.)
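ShimmerFairy's observation about encoded sizes is easy to check empirically. A quick sketch comparing the three UTFs (using little-endian encodings without BOMs so the byte counts are pure payload):

```python
# Every UTF encodes a code point in at most 4 bytes, so UTF-8 and
# UTF-16 never produce more bytes than UTF-32 for the same text.
samples = ("foo", "na\u00efve", "\U0001F98B\U0001F98B")  # mostly-ASCII, accented, emoji

for text in samples:
    u8  = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    u32 = len(text.encode("utf-32-le"))
    assert u8 <= u32 and u16 <= u32
    print(f"{text!r}: utf-8={u8}  utf-16={u16}  utf-32={u32}")
```

For "foo" that's 3 vs 6 vs 12 bytes, which is the 4x ASCII overhead being complained about; only for text dominated by supplementary-plane characters (like the emoji pair, 8 bytes in all three) does the gap close.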
timo i thought you need to go AFK :)
Nicholas (7 bit had the advantage that we can be sure that it's all ASCII, we don't need to care whether something later wanted UTF-8 or some legacy 8 bit encoding) 14:18
I fail.
timo i'm also AFK for a bit now
Nicholas The coffee is still AFK. I will crack soon.
and that "7 bit" thing is something I remember Dan saying
about the *only* thing I remember from back then. 14:19
right, I've cracked. ENOCOFFEE
ShimmerFairy While ICU runs on UTF-16, there are some attempts to make it easier to use other forms, like UText. Problem is, from a quick glance none of the things they offer are compatible with much of ICU, and I think they'd all require some amount of C++ glue to make functional. (e.g. despite what UText is meant to do, they have yet to bother providing a version of it for UTF-32 text) 14:20
Overall, I think what I perhaps want to do on this topic is go ahead with that personal experiment to put ICU into MoarVM, because there's no known reason it can't work. Either I'll end up with a functional copy of MoarVM, or I'll discover firsthand how bad it is for this virtual machine. 14:23
lizmat ++ShimmerFairy 14:25
ShimmerFairy (I *do* still like the idea of factoring things out into a separate library, but I've learned over the years that I have a really bad habit of writing my own code when I don't need to, and thus to distrust those sorts of impulses)
timo i'd be interested to know if ICU has something to reduce the size of the parts of the library that contain things like the names of unicode characters
or any other kind of compression 14:26
ShimmerFairy A quick look tells me that libicudata.so.78.2, the largest of its libraries, is approx. 32MiB 14:27
timo uhhhh, libmoar.so is "just" 27 megs big 14:28
ShimmerFairy I should note that ICU is definitely a library I would not want to bundle with MoarVM under 3rdparty/, and I also would expect it to be dynamically linked to the one your system already has installed, at least in POSIX land. If we have to consider static linking scenarios, then that changes things considerably. 14:33
timo right, i wouldn't suggest statically linking it
lizmat wouldn't dynlinking and versioning issues bring a lot of potential turmoil ? 14:34
ShimmerFairy in the case of ICU, yes unfortunately. The library's ABI version is just the version of the project, so every major ICU update means programs have to be recompiled to link to the new library. However, at least on systems like Linux, the 'moarvm' package would only be as difficult to install as any other ICU-using package, like Qt, or Firefox without its bundled version of ICU. 14:40
timo so it's expected and fine to have multiple libicu versions on a system at the same time? 14:44
based on which packages already recompiled against the new version?
14:50 ShimmerFairy left, ShimmerFairy joined
timo we will have to start bundling libicu with the binary releases we have on rakudo.org 14:51
ShimmerFairy Message got eaten: I can't speak to that, because I use Gentoo, so whenever ICU is updated the package manager just rebuilds everything that needs to be, and thus I only have one set of ICU libraries in /usr/lib64
Just to be clear, I don't object to the idea of pursuing factoring out our own library first, and saving ICU as plan b, I just know that I can be too eager to take the "do it myself" approach, so I'm hesitant to push that idea in a group project. 14:53
timo fair 14:57
librasteve ICU in MoarVM seems like a very good idea to me - even though there is likely a lot of 6.d code out there that depends on 6.d Unicode, I would be very happy to make Raku "6.f" fully ICU-centric even at the cost of tweaking the regex design 15:59
lizmat brrr... librasteve well volunteered :-)
timo by "the regex design" you mean the implementation? 16:10
ShimmerFairy Switching MoarVM over to ICU shouldn't cause any breaking changes in Raku, nor NQP I would think. The most you should see is that referencing properties by name would be less precise, and more properties would be supported (since I think our UCD script still doesn't actually incorporate all the properties there are; I haphazardly added a couple that I thought I'd need when it came to updating the grapheme rules). 17:59
librasteve timo: ShimmerFairy mentions that "referencing properties by name" would be less precise (and I assume some other data-dependent results) so I guess this is (strictly speaking) a breaking change 19:23
personally I think that would be a price worth paying for being able to state that Raku is 100% UTS#18 compliant
also they mention that UTS#18 no longer requires a regex engine to support :i fully - so I guess this is a change to the intent of the Raku regex design (even if in practice it was never possible to achieve that) 19:25
anyway I am far from an expert in this - so my opinion is mostly around my perception of the trade-offs between the costs of moving to ICU vs the benefits of having Raku on the official Unicode toolchain 19:27
ShimmerFairy That wouldn't be a breaking change, because no existing code would break. Current Raku is, at least in some cases, kinda picky about exact string matches for property names, when Unicode in practice recommends a loose matching procedure.
m: say " ".uniprop($_) for ("White_Space", "Whitespace", "whitespace", "WhiteSpace", "White Space") # these should all return the same true value, under loose matching 19:28
camelia True
0
True
True
0
ShimmerFairy (to be clear you obviously don't *need* ICU to fix this issue, but iiuc it would come for "free" with using ICU to query properties) 19:40
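The loose matching Unicode recommends (UAX #44, rule LM3) is simple to sketch: property names compare equal if they differ only in case, whitespace, hyphens, and underscores. This toy version (a simplification; the full rule also ignores a leading "is" prefix, which is omitted here) shows why all five spellings above ought to resolve to the same property:

```python
# Sketch of UAX #44 loose matching for property names: ignore case,
# whitespace, hyphens, and underscores when comparing.
def loose_key(name: str) -> str:
    return "".join(c for c in name.lower() if c not in " \t-_")

# Tiny demo table mapping loose keys to canonical property names.
CANONICAL = {loose_key("White_Space"): "White_Space"}

def lookup(name: str):
    return CANONICAL.get(loose_key(name))

for variant in ("White_Space", "Whitespace", "whitespace",
                "WhiteSpace", "White Space"):
    assert lookup(variant) == "White_Space"
```

Under this scheme, "Whitespace" and "White Space" no longer fall through to a failed lookup the way they do in the camelia output above.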
I have to go for the day now, but of course I'll still be thinking about all this. 19:43
lizmat fwiw, I've been able to obtain a copy of the #parrot #parrotsketch logs in a MariaDB database... so now I only need the tuits for conversion 20:02
japhb ++lizmat 20:03