ShimmerFairy I've decided to do some historical digging to see why MoarVM decided against ICU back in the day, so far it seems people were very unhappy with ICU back in 2005/06, though possibly a lot of that had to do with the fact that it used to be bundled in with Parrot and needed to be built with it (and thus as an external lib would've been less annoying?). 12:04
I'm doing this because figuring out what to do with MoarVM's unicode support will depend on how relevant the old objections to ICU are nowadays. 12:05
(btw, I've noticed that this nifty irclogs website Raku has doesn't include #parrot (and any related) channels, which it arguably should) 12:09
Alright, having combed through all of #perl6, it seems that the historical objection to ICU amounts to "couldn't get it to build on Windows", which is probably a lot easier to do nowadays (not to mention a fork of ICU has been integrated into Windows since 2017). 13:00
The only explicit mention of something wrong with ICU is the lack of "NFG support" (i.e. a struct/class that models an NFG string), but you don't *need* that functionality to be tightly integrated into a Unicode library, so that complaint baffles me. 13:02
nemokosch I thought NFG was specifically made up by Perl 6 rather than Unicode itself 13:04
timo maybe it means that when you use ICU, any string that wants to be in NFG can't use any of the ICU string functions because ICU has no concept of NFG? 13:07
ShimmerFairy It is, but it's also nothing more than using 2 (two) standard Unicode algorithms to accomplish a particular task. It's the sort of thing you could easily build on top of a Unicode library; there's no need to build it *into* the library.
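The point that NFG is "just two standard Unicode algorithms" can be sketched in a few lines. The following is an illustrative toy, not MoarVM's actual implementation: it combines NFC normalization with a deliberately simplified grapheme segmentation (base character plus combining marks, rather than the full UAX #29 rules), and assigns negative "synthetic" integers to multi-codepoint graphemes so that indexing the result is O(1).

```python
# Illustrative NFG-style encoding built from two standard Unicode steps:
# NFC normalization + (simplified) grapheme segmentation.
import unicodedata

def nfg_encode(text, synthetics=None):
    """Return a list of ints: real code points for single-codepoint
    graphemes, negative 'synthetic' ids for multi-codepoint graphemes."""
    if synthetics is None:
        synthetics = {}
    text = unicodedata.normalize("NFC", text)
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch          # attach combining mark to its base
        else:
            clusters.append(ch)
    out = []
    for cluster in clusters:
        if len(cluster) == 1:
            out.append(ord(cluster))    # plain code point
        else:
            # one negative integer per distinct multi-codepoint grapheme
            out.append(synthetics.setdefault(cluster, -1 - len(synthetics)))
    return out, synthetics

# "é" precomposed vs "e" + COMBINING ACUTE: NFC unifies them, so both
# encode to the single code point U+00E9.
a, _ = nfg_encode("\u00e9")
b, _ = nfg_encode("e\u0301")
assert a == b == [0xE9]

# "x" + COMBINING ACUTE has no precomposed form, so it stays a
# two-codepoint grapheme and gets a synthetic id instead.
enc, syn = nfg_encode("ax\u0301a")
assert enc == [97, -1, 97]
```

Once every grapheme is a single integer, operations like "the nth grapheme" are a plain array index, which is the O(1) property discussed below.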
But yeah, in terms of actual functionality, I can't find any old mention of something that ICU fails to do that the compiler devs needed to compensate for. Sure it would be nice for your backing Unicode library to have a grapheme string built in, but it's ultimately just an application of Unicode to a task, not a modification of Unicode itself. 13:14
The true objection seems to be purely about having to build it on Windows machines (and in the mid-2000s, some bristling that ICU requires you to have a C++ compiler available, horror of horrors, back when everyone still thought supporting pre-standard C systems was good and useful.) 13:16
Also, Larry was consistently of the opinion that Perl 5's integrated Unicode support was a perfectly fine approach, no need to do things differently. 13:17
Nicholas (I'm not here) - implementation of NFG ought to be O(1) on various tasks, whereas handling graphemes as NFC or NFD is O(n) 13:27
also is ICU still UTF-16, or does it offer proper UTF-32 APIs?
as in, implementation of NFG *is* O(1) because the synthetic code points created for graphemes are single integers 13:28
nemokosch it is O(1), however, the creation of these strings is O(N)
so this is a red herring
Nicholas it may or may not be a red herring. Once you have the strings, various regex, er rule, operations can be O(1) whereas with NFC they are O(n). This is a question of trade offs 13:30
nemokosch one thing is sure: once you decide to make everything NFG-processed, all your strings will be slow 13:31
the mere creation of them 13:32
it would still be a valid and rather simple tradeoff to only build the NFG version at the first operation of a certain kind
Nicholas yes, that follows. But unless you also want to normalise strings, then U+00E9 is not equal to U+0065 U+0301, even if the human assumed this 13:34
timo we have 8bit string storage as well in moarvm though
Nicholas (I hope I have my example correct. it's trying to be é expressed in NFC vs decomposed
nemokosch however, ICU would at least do NFC
Nicholas and humans don't care how the computer represents what they typed) 13:35
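Nicholas's example is easy to verify with Python's stdlib `unicodedata` module: U+00E9 and U+0065 U+0301 are different code-point sequences that compare unequal until one side is normalized.

```python
# Why normalization is needed before comparing: the same visible "é"
# can be one code point (NFC) or two (NFD).
import unicodedata

nfc = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, precomposed
nfd = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

assert nfc != nfd                                   # raw comparison fails
assert unicodedata.normalize("NFC", nfd) == nfc     # compose to one code point
assert unicodedata.normalize("NFD", nfc) == nfd     # decompose to two
```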
nemokosch anyway, what ShimmerFairy did is a very refreshing precedent 13:36
Nicholas From limited reading of the scrollback, my memory was also that Larry was of the opinion that Unicode support was supposed to be a key feature of "the artist then known as Perl 6", so the intent was to be ahead of the curve, not relying on a third party library. 13:37
nemokosch figuring out the original design considerations and see how they apply
Nicholas that design consideration at the time would mean that ICU was "lagging". These days, I think that "rate of change of Unicode" has slowed down, and ICU is mostly there 13:38
nemokosch being ahead of the curve, for example: I think it's fair to say now that this simply didn't happen
Nicholas well, *is* there. Unicode is mostly adding more damn emojis
nemokosch anyway, it's great to see that some people are both enthusiastic and able to work with Unicode 13:48
ShimmerFairy I can't speak to how good or bad ICU was as an implementation of Unicode back in the day, all I can say is that nobody in #perl6 ever complained that it wasn't letting them implement a feature.
nemokosch for myself, I'm rather curious about why the Zig compiler toolchain failed, but even that is well outside my comfort zone for sure
ShimmerFairy Ultimately, I get the vague feeling that back then, people thought other languages had poor Unicode support because good tools didn't exist (thus we have to make our own), when I think the truth is that adding Unicode support to an existing language is not trivial. 13:51
Maybe back in the early early days of (non-6) Perl, it truly was the case that nobody had a good Unicode library on hand to hook into, but perhaps even in the mid-2000s that was already no longer true. 13:52
nemokosch the Python 3 "fiasco" would certainly give us mere mortals that impression
Nicholas unicode-org.github.io/icu-docs/api...ng_8h.html -- ICU uses 16-bit Unicode (UTF-16) in the form of arrays of UChar code units. UTF-16 encodes each Unicode code point with either one or two UChar code units. (This is the default form of Unicode, and a forward-compatible extension of the original, fixed-width form that was known as UCS-2. UTF-16 superseded UCS-2 with Unicode 2.0 in 1996.) 13:54
This is still UTF-16. This smells.
(it might smell less bad than the alternatives, but UTF-16 has the downsides of UTF-8, the downsides of UTF-32, and some all its own) 13:55
timo love me a good surrogate pair
Nicholas I'm not saying "don't do it" but UTF-16 has its own pain when you'd rather be storing your NFC or NFD code points in something fixed width 13:56
ShimmerFairy Yeah, it's not great, and I wish that ICU had more complete support for UChar32 * strings, but at the end of the day the UTF that underlies a higher-level string type shouldn't matter to the public interface.
Nicholas I believe for most users, index operations on strings aren't a performance issue
but if you're thinking about regular expressions (at least, historically for how the past 20 to 30 years went) you think of implementation details in terms of index location 13:57
and for UTF-16 storage that's O(N) for NFC text
and if you're thinking graphemes you have O(N) conversion from graphemes to NFC or NFD
and another O(n) from NFC code points to UTF-16 representation 13:58
(in the general case)(does the general case matter?)
*that* would be the two reasons for not wanting the ICU *code*. The data structures - totally
the trade off here clearly is that unmaintained or stale/laggy custom Unicode support 13:59
sucks more than not-ultimately performant current Unicode support from a third party library
but O(n) atop O(n) is quadratic. So there's a quadratic trap here. Whether it ever springs is really the question 14:00
ShimmerFairy While indexing an arbitrary codepoint in a non-UTF-32 string would be more expensive than in a UTF-32 string, for things like regexes what you'd actually be doing is keeping track of where you are in the string, and then asking to move forward/back by some number of codepoints. And both UTF-8 and UTF-16 are designed to make that quick and easy, if not as much so as UTF-32. 14:01
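The "move forward by N codepoints" operation that makes this workable can be sketched directly: in UTF-8 the lead byte alone encodes the sequence length, so stepping forward never requires decoding or scanning backwards. A minimal illustration (toy code, not how any particular engine implements it):

```python
# Stepping through UTF-8 by code points: the lead byte's high bits give
# the sequence length, so each step is a single table-free lookup.
def utf8_seq_len(lead: int) -> int:
    if lead < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if lead < 0xC0:
        raise ValueError("continuation byte is not a lead byte")
    if lead < 0xE0:
        return 2                      # 110xxxxx
    if lead < 0xF0:
        return 3                      # 1110xxxx
    return 4                          # 11110xxx

def advance(buf: bytes, pos: int, n: int) -> int:
    """Byte offset after stepping forward n code points from pos."""
    for _ in range(n):
        pos += utf8_seq_len(buf[pos])
    return pos

# "a" (1 byte), "é" (2), "€" (3), "😀" (4): offsets 0, 1, 3, 6, end 10.
s = "a\u00e9\u20ac\U0001F600".encode("utf-8")
assert advance(s, 0, 1) == 1
assert advance(s, 0, 2) == 3
assert advance(s, 0, 3) == 6
assert advance(s, 0, 4) == 10
```

Each step is O(1) in the length of the string, though random access to "codepoint i" from scratch is still O(i), which is where the caching Nicholas mentions comes in.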
timo i don't think we ever got a good iterator caching implementation for our regex engine on parrot, so if there was ever "a unicode character" in the core setting, compile times suddenly became hours to days :)
Nicholas I once screwed up the Perl 5 "caching" for UTF-8 to code point offsets
lizmat ShimmerFairy: if you have some logs of #parrot somewhere, I could probably integrate them into the websote
*site 14:02
Nicholas it matters a lot, it turns out, how that caching worked
lizmat colabti.org/ircloggy/ doesn't provide them
Nicholas "recent place" and "recent offset from that place" were far more performant than 2 "most recent places"
this is a decade ago. I forget the details. But the exact form of the caching was an important crutch. And things were only performant because a cache hid O(n) behaviour 14:03
there are traps here. They might be avoidable. But they exist
nemokosch another thing to keep in mind is that Raku's regexes are not just different syntax-wise. Not sure if that creates new obstacles but it's worth noting 14:05
ShimmerFairy Hah, apparently the only #parrot log I have on hand is one time where I joined, asked a question, got no response, and left 20 minutes later.
timo to be fair, only waiting 20 minutes is quite short 14:06
Nicholas if raku's default "boundary" character maps to a grapheme boundary, rather than a code point boundary, then performance of graphemes is going to matter. If that can be faked/maintained with some amount of caching that's cool. But it ought to be tested, and at some sort of scale
and I need to be somewhere else, so I need to go AFK and really be "not here"
timo see you later Nicholas :) 14:07
Nicholas I hope what I brain dumped was useful. I'm not sure what the right trade offs are, but I hope I added some useful data to help make them better
lizmat Nicholas++
ShimmerFairy It's good to get the input. I'm still not the biggest fan of ICU myself, but I find it fascinating that the historical objections to it were really entirely about building it on Windows (and being a wrinkle in the old desire for a pure ANSI C project). Makes it harder to decide if we should keep not using it. 14:08
Nicholas I believe it was also "UTF-16" as a conversion mismatch 14:09
ShimmerFairy A point in favor of moving away from internal 32-bit codepoints, by the way, is that across the programming world people usually really don't like wasting four bytes per character, especially when it's mostly ASCII text. In that sense UTF-16 is a decent compromise, being somewhat space-efficient and somewhat fixed-size. 14:12
lizmat note that MoarVM nowadays uses 8-bit representation if it can 14:13
timo right ?
timo another note: the crlf grapheme, being a synthetic, can give you 32bit storage when you don't expect it
lizmat so "foo" would be 8-bit, and "foo\n" would be 32bit 14:14
timo i'd have to double-check, I think only \r\n gets a synthetic, not \n or \r on their own 14:15
in theory, we have a strands data structure that would allow us to store only small pieces of a string at the higher byte size, for strings where the ratio is favourable 14:16
it has a performance impact to have to go the extra indirection step, but it can be worth it if you save a lot of memory that no longer needs to go into the cpu cache
Nicholas IIRC Swift uses UTF-8 for storage, as the trade offs for that were better. (storage size/cache hits vs tight CPU bound code to do O(n) calculations that had few to no cache misses)
and at one level, the logical trade was fixed width storage in whichever worked best of 7 bit, 8 bit, 16 bit and 32 bit 14:17
ShimmerFairy (Another related note is that every UTF has a max of 4 bytes to the codepoint, so picking something other than UTF-16 and UTF-8 will never give you larger byte sequences than the same text in UTF-32.)
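ShimmerFairy's observation about encoded sizes is easy to check empirically. A quick sketch comparing the three UTFs (using little-endian encodings without BOMs so the byte counts are pure payload):

```python
# Every UTF encodes a code point in at most 4 bytes, so UTF-8 and
# UTF-16 never produce more bytes than UTF-32 for the same text.
samples = ("foo", "na\u00efve", "\U0001F98B\U0001F98B")  # mostly-ASCII, accented, emoji

for text in samples:
    u8  = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    u32 = len(text.encode("utf-32-le"))
    assert u8 <= u32 and u16 <= u32
    print(f"{text!r}: utf-8={u8}  utf-16={u16}  utf-32={u32}")
```

For "foo" that's 3 vs 6 vs 12 bytes, which is the 4x ASCII overhead being complained about; only for text dominated by supplementary-plane characters (like the emoji pair, 8 bytes in all three) does the gap close.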
timo i thought you need to go AFK :)
Nicholas (7 bit had the advantage that we can be sure that it's all ASCII, we don't need to care whether something later wanted UTF-8 or some legacy 8 bit encoding) 14:18
I fail.
timo i'm also AFK for a bit now
Nicholas The coffee is still AFK. I will crack soon.
and that "7 bit" thing is something I remember Dan saying
about the *only* thing I remember from back then. 14:19
right, I've cracked. ENOCOFFEE
ShimmerFairy While ICU runs on UTF-16, there are some attempts to make it easier to use other forms, like UText. Problem is, from a quick glance none of the things they offer are compatible with much of ICU, and I think they'd all require some amount of C++ glue to make functional. (e.g. despite what UText is meant to do, they have yet to bother providing a version of it for UTF-32 text) 14:20
Overall, I think what I perhaps want to do on this topic is go ahead with that personal experiment to put ICU into MoarVM, because there's no known reason it can't work. Either I'll end up with a functional copy of MoarVM, or I'll discover firsthand how bad it is for this virtual machine. 14:23
lizmat ++ShimmerFairy 14:25
ShimmerFairy (I *do* still like the idea of factoring things out into a separate library, but I've learned over the years that I have a really bad habit of writing my own code when I don't need to, and thus to distrust those sorts of impulses)
timo i'd be interested to know if ICU has something to reduce the size of the parts of the library that contain things like the names of unicode characters
or any other kind of compression 14:26
ShimmerFairy A quick look tells me that libicudata.so.78.2, the largest of its libraries, is approx. 32MiB 14:27
timo uhhhh, libmoar.so is "just" 27 megs big 14:28
ShimmerFairy I should note that ICU is definitely a library I would not want to bundle with MoarVM under 3rdparty/, and I also would expect it to be dynamically linked to the one your system already has installed, at least in POSIX land. If we have to consider static linking scenarios, then that changes things considerably. 14:33
timo right, i wouldn't suggest statically linking it
lizmat wouldn't dynlinking and versioning issues bring a lot of potential turmoil ? 14:34
ShimmerFairy in the case of ICU, yes unfortunately. The library's ABI version is just the version of the project, so every major ICU update means programs have to be recompiled to link to the new library. However, at least on systems like Linux, the 'moarvm' package would only be as difficult to install as any other ICU-using package, like Qt, or Firefox without its bundled version of ICU. 14:40
timo so it's expected and fine to have multiple libicu versions on a system at the same time? 14:44
based on which packages already recompiled against the new version?
14:50 ShimmerFairy left, ShimmerFairy joined
timo we will have to start bundling libicu with the binary releases we have on rakudo.org 14:51
ShimmerFairy Message got eaten: I can't speak to that, because I use Gentoo, so whenever ICU is updated the package manager just rebuilds everything that needs to be, and thus I only have one set of ICU libraries in /usr/lib64
Just to be clear, I don't object to the idea of pursuing factoring out our own library first, and saving ICU as plan b, I just know that I can be too eager to take the "do it myself" approach, so I'm hesitant to push that idea in a group project. 14:53
timo fair 14:57
librasteve ICU in MoarVM seems like a very good idea to me - even though there is likely a lot of 6.d code out there that depends on 6.d Unicode, I would be very happy to make Raku "6.f" fully ICU-centric even at the cost of tweaking the regex design 15:59
lizmat brrr... librasteve well volunteered :-)
timo by "the regex design" you mean the implementation? 16:10
ShimmerFairy Switching MoarVM over to ICU shouldn't cause any breaking changes in Raku, nor NQP I would think. The most you should see is that referencing properties by name would be less precise, and more properties would be supported (since I think our UCD script still doesn't actually incorporate all the properties there are; I haphazardly added a couple that I thought I'd need when it came to updating the grapheme rules). 17:59
librasteve timo: ShimmerFairy mentions that "referencing properties by name" would be less precise (and I assume some other data-dependent results) so I guess this is (strictly speaking) a breaking change 19:23
personally I think that would be a price worth paying for being able to state that Raku is 100% UTS#18 compliant
also they mention that UTS#18 no longer requires a regex engine to support :i fully - so I guess this is a change to the intent of the Raku regex design (even if in practice it was never possible to achieve that) 19:25
anyway I am far from an expert in this - so my opinion is mostly around my perception of the trade-offs between the costs of moving to ICU vs the benefits of having Raku on the official Unicode toolchain 19:27
ShimmerFairy That wouldn't be a breaking change, because no existing code would break. Current Raku is, at least in some cases, kinda picky about exact string matches for property names, when Unicode in practice recommends a loose matching procedure.
m: say " ".uniprop($_) for ("White_Space", "Whitespace", "whitespace", "WhiteSpace", "White Space") # these should all return the same true value, under loose matching 19:28
camelia True
0
True
True
0
ShimmerFairy (to be clear you obviously don't *need* ICU to fix this issue, but iiuc it would come for "free" with using ICU to query properties) 19:40
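The loose matching Unicode recommends (UAX #44, rule LM3) is simple to sketch: property names compare equal if they differ only in case, whitespace, hyphens, and underscores. This toy version (a simplification; the full rule also ignores a leading "is" prefix, which is omitted here) shows why all five spellings above ought to resolve to the same property:

```python
# Sketch of UAX #44 loose matching for property names: ignore case,
# whitespace, hyphens, and underscores when comparing.
def loose_key(name: str) -> str:
    return "".join(c for c in name.lower() if c not in " \t-_")

# Tiny demo table mapping loose keys to canonical property names.
CANONICAL = {loose_key("White_Space"): "White_Space"}

def lookup(name: str):
    return CANONICAL.get(loose_key(name))

for variant in ("White_Space", "Whitespace", "whitespace",
                "WhiteSpace", "White Space"):
    assert lookup(variant) == "White_Space"
```

Under this scheme, "Whitespace" and "White Space" no longer fall through to a failed lookup the way they do in the camelia output above.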
I have to go for the day now, but of course I'll still be thinking about all this. 19:43
lizmat fwiw, I've been able to obtain a copy of the #parrot #parrotsketch logs in a MariaDB database... so now I only need the tuits for conversion 20:02
japhb ++lizmat 20:03