06:04 kjp joined
08:35 librasteve_ joined
lizmat do we actually have a tutorial / documentation on how encoding / decoding actually works in MoarVM 10:40
in light of adding new encodings ?
possibly from module land 10:43
ShimmerFairy A quick check of the docs directory says no, and I imagine there isn't any. MoarVM has been languishing with a very paltry selection of alternate encodings for its entire existence. 11:26
I will say I'm not a fan of adding more encodings to MoarVM at this precise moment, because it's a very hardcoded affair, and I don't think it would scale well at all. (Honestly, I wonder if we shouldn't switch to an external library that's already done the work of implementing the world's encodings.) 11:31
lizmat but a library that does NFG ? 12:05
fwiw... I guess I'll check the pure Raku road a bit then
ShimmerFairy NFG is just NFC + grapheme clusters, nothing that would be difficult to ask an external Unicode library for help with. (But I don't mean to suggest that swapping to external Unicode support would be trivial.) 12:13
Just to be clear, adding new encodings ought to be totally doable, I just get nervous about how many `case MVM_encoding_type_foo:`s there'll be if we gave MoarVM a more reasonable selection of encodings to support. 12:15
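The worry about a growing pile of `case MVM_encoding_type_foo:` labels can be illustrated with the usual alternative, a lookup table of decoder functions. This is only a minimal sketch: `decode_fn`, `encoding_entry`, `find_encoding` and both decoders are hypothetical names made up for illustration, not MoarVM's actual internals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical decoder signature: turn `len` input bytes into code points,
 * writing at most `cap` of them to `out`; returns how many were written. */
typedef size_t (*decode_fn)(const unsigned char *in, size_t len,
                            unsigned int *out, size_t cap);

/* Latin-1 maps every byte to the code point with the same value. */
static size_t decode_latin1(const unsigned char *in, size_t len,
                            unsigned int *out, size_t cap) {
    assert(in != NULL && out != NULL);
    size_t n = len < cap ? len : cap;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
    return n;
}

/* ASCII does the same, but replaces bytes >= 0x80 with U+FFFD. */
static size_t decode_ascii(const unsigned char *in, size_t len,
                           unsigned int *out, size_t cap) {
    assert(in != NULL && out != NULL);
    size_t n = len < cap ? len : cap;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] < 0x80 ? in[i] : 0xFFFDu;
    return n;
}

/* One table row per encoding, so adding an encoding means adding a row
 * instead of touching every switch statement. */
typedef struct {
    const char *name;
    decode_fn   decode;
} encoding_entry;

static const encoding_entry encodings[] = {
    { "ascii",      decode_ascii  },
    { "iso-8859-1", decode_latin1 },
};

/* Linear search by name; returns NULL for an unknown encoding. */
static const encoding_entry *find_encoding(const char *name) {
    for (size_t i = 0; i < sizeof encodings / sizeof encodings[0]; i++)
        if (strcmp(encodings[i].name, name) == 0)
            return &encodings[i];
    return NULL;
}
```

A caller would look an entry up once with `find_encoding("iso-8859-1")` and then call `entry->decode(...)`, which keeps the dispatch in one place.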
lizmat understood... 12:18
fwiw I don't think additional encodings would really require the type of speed that MoarVM supplies
ShimmerFairy I don't know what the motivation for doing it ourselves was when MoarVM started, but I think it's worth reconsidering, since Unicode (+ the world of other encodings) is so big. ICU is the obvious option, and unlike when MoarVM started it's actually under the control of Unicode itself now. 12:25
lizmat so, do you have an inkling of an idea how much effort it would be to use ICU in MoarVM ? 12:27
ShimmerFairy It wouldn't solve every possible Unicode issue (IIRC anybody who wants the Unihan properties is still on their own), but it would mean less work on the MoarVM side overall. Now that I'm thinking of it, maybe I ought to experiment with swapping it in sometime, just to see. 12:28
lizmat ++ShimmerFairy 12:33
ShimmerFairy I've only used it on occasion myself, and I'm not super familiar with MoarVM code in general, but I would first guess that it only changes how things are implemented; ideally you wouldn't see any NQP/Raku code break, though some things might behave differently.
One example of that last point: currently in Rakudo querying properties requires some pretty precise spelling, when Unicode actually recommends a way to loosely match them. If the ICU interface to properties handles that automatically (which I think it does), then you suddenly get less fiddly /<:Whitespace>/ and the like. 12:34
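For reference, the loose matching recommended for property names is rule UAX44-LM3: ignore case, whitespace, underscores, hyphens, and an initial "is" prefix. A rough sketch of that comparison in plain C follows; `loose_key` and `loose_equal` are hypothetical helpers, it is ASCII-only, it truncates names longer than 127 bytes, and it says nothing about how ICU or MoarVM actually implement this.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Write a normalized key for `s` into `buf`: lowercased, with whitespace,
 * '_' and '-' dropped, and one leading "is" removed afterwards. */
static void loose_key(const char *s, char *buf, size_t cap) {
    size_t n = 0;
    assert(s != NULL && buf != NULL && cap > 0);
    for (; *s != '\0' && n + 1 < cap; s++) {
        unsigned char c = (unsigned char)*s;
        if (isspace(c) || c == '_' || c == '-')
            continue;
        buf[n++] = (char)tolower(c);
    }
    buf[n] = '\0';
    if (n >= 2 && buf[0] == 'i' && buf[1] == 's')
        memmove(buf, buf + 2, n - 1);   /* n - 1 bytes includes the NUL */
}

/* True when two property names are equal under the loose rule above. */
static bool loose_equal(const char *a, const char *b) {
    char ka[128], kb[128];
    loose_key(a, ka, sizeof ka);
    loose_key(b, kb, sizeof kb);
    return strcmp(ka, kb) == 0;
}
```

Under this rule `Whitespace`, `White_Space` and `isWhite Space` all compare equal, which is the kind of leniency being described.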
lizmat ah, and a whole set of lookup hashes in the rakudo core could go then :-) 12:40
ShimmerFairy I don't know if Rakudo itself does anything with processing property names, but at the very least MoarVM would be freed of a lot of stuff it has to do (and sometimes hasn't been doing). Most if not all of the UCD parsing stuff could go, for instance. 12:42
nemokosch I wonder how well it's going to integrate with Raku regexes 12:46
13:14 MasterDuke joined
MasterDuke ShimmerFairy: did you see github.com/MoarVM/MoarVM/issues/1988 ? 13:15
ShimmerFairy probably not, looking
I didn't poke at any of the ucd2c.pl code that actually writes out C code, so I'm very unfamiliar with how the resulting C files actually work. 13:20
MasterDuke i know nothing about it either, but the C looks incorrect enough that i would assume some unicode tests should fail 13:27
13:48 MasterDuke left
timo gustedt.wordpress.com/2026/02/15/d...and-clang/ i wish we didn't have to target msvc :| 14:37
ShimmerFairy That seems like a neat feature. At least it being in a *future* version of C means you're not stuck desperately waiting to be allowed to use the latest tech. Also note the future stdmchar.h header, which would give us some long-overdue functions for converting between the various UTFs, as well as the current locale's char and wchar_t. 14:47
timo we'll probably only have to wait 20 years this time for MSVC to catch up 14:48
ShimmerFairy I've got a long long history of banging my head against the wall waiting for gcc to implement some new C++ thing I want. (Luckily, they were remarkably quick with reflections for once, so that's nice.) Almost as bad as waiting for C and C++ to have good Unicode support (why I'm so excited for stdmchar.h). 14:50
It seems that GCC and Clang had a "cleanup" attribute long before "defer" was a thing. I wonder if MSVC has a similar compiler extension that could be leaned on in the meantime. 14:54
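A small sketch of the GCC/Clang `cleanup` variable attribute mentioned here, for anyone who hasn't met it: the named function is called with a pointer to the variable whenever the variable goes out of scope. `release_buffer`, `use_buffer` and `live_buffers` are made-up names for the example. The nearest MSVC extension I'm aware of is `__try`/`__finally`, which is block-shaped rather than attached to a variable, so it is not a drop-in replacement.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts buffers currently in scope, so the effect can be observed. */
static int live_buffers = 0;

/* Cleanup handler: receives a pointer to the variable leaving scope. */
static void release_buffer(char **p) {
    free(*p);            /* free(NULL) is harmless */
    live_buffers--;
}

static int use_buffer(int bail_early) {
    /* GCC/Clang extension: release_buffer(&buf) runs on every scope exit. */
    __attribute__((cleanup(release_buffer))) char *buf = malloc(64);
    live_buffers++;
    if (buf == NULL || bail_early)
        return -1;       /* cleanup runs on this early return */
    buf[0] = 'x';
    return 0;            /* and on the normal return */
}
```

Whichever return path is taken, `live_buffers` is back to zero afterwards, without an explicit `free` on each path.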
timo hey ShimmerFairy, you might have an idea for this: can we do something smarter for our temporary roots than the push and pop every time a relevant variable goes into or out of scope, with the help of the C compiler? the same mechanism that helps unroll stack frames based on where the IP currently is should theoretically be usable to turn the IP in a frame into a list of addresses relative to stack 15:08
or frame pointer that need to be rooted at arbitrary moments
but i don't have any ideas how we could get the C compiler to give that to us
apart from using the full DWARF debug info and picking out the bits we want
which seems like a lot of work both to program and to do at run time 15:09
since there's so much in there we're not interested in
ShimmerFairy I'm not familiar with the garbage collection stuff at all (had to look up what you meant by "temporary roots"), but IIUC the idea is that you want raw C pointers (always pointers?) to follow the GC-controlled objects they're pointing to? 15:21
timo right, this is only about pointers to gc-managed objects that live on the C stack in all the internal C functions moarvm has 15:23
we have the MVMROOT{,2,3,4,5,6,7,8} macros that make it kind of not so annoying to work with compared to explicitly pushing and popping 15:24
but that's really not more than syntactic sugar
i reckon that we do orders of magnitude more useless pushes and pops "just in case" we have to GC than pushes and pops that actually end up mattering
so even though scanning the stack for roots would surely be more work than just taking them out of our little temporary roots stack, we'd only do it when we actually need to do GC, and that should make a huge difference 15:25
or rather, that should give us a decent budget for doing work at that point in time
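To make the cost being discussed concrete, here is a toy model of a temporary-roots stack paired with a moving collector. Every name in it (`Obj`, `root_push`, `root_pop`, `toy_gc`, `demo`) is invented for the sketch; it is not MoarVM's code, and the "collector" just copies rooted objects into a second array and patches the rooted locals.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a GC-managed object. */
typedef struct Obj { int value; } Obj;

#define MAX_TEMP_ROOTS 16

/* The temporary roots stack holds the ADDRESSES of C locals that point at
 * managed objects, so the collector can rewrite them after moving things. */
static Obj  **temp_roots[MAX_TEMP_ROOTS];
static size_t num_temp_roots = 0;

static void root_push(Obj **slot) {
    assert(num_temp_roots < MAX_TEMP_ROOTS);
    temp_roots[num_temp_roots++] = slot;
}

static void root_pop(void) {
    assert(num_temp_roots > 0);
    num_temp_roots--;
}

/* Toy moving collector: copy each rooted object and patch the local. */
static Obj to_space[MAX_TEMP_ROOTS];

static void toy_gc(void) {
    for (size_t i = 0; i < num_temp_roots; i++) {
        to_space[i]    = **temp_roots[i];
        *temp_roots[i] = &to_space[i];
    }
}

static int demo(Obj *obj) {
    root_push(&obj);      /* paid on every call, whether or not GC happens */
    toy_gc();             /* pretend an allocation triggered a collection */
    int v = obj->value;   /* obj now points at the moved copy */
    root_pop();
    return v;
}
```

The push and pop in `demo` happen on every call, while `toy_gc` is the rare event; a scheme that finds roots from the stack only inside `toy_gc` would move that bookkeeping off the hot path.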
ShimmerFairy Since I'm far more experienced in C++, in that world I'd expect pointers to GC objects to be represented by a class which registers itself as a callback in the GC, and then that information rests with the GC for the lifetime of the "gc_pointer" object. No clue if that's possible in C, or if a C++ class that the C code gets pointers to would work. 15:28
timo: If your idea is to walk up the callstack at runtime to search for pointers, I think my starting point would be looking at how C++ exceptions are implemented, since they have to walk up the stack to find a suitable `catch`. 15:29
timo right, that's stack unwinding, the compiler generates tables of "instructions" that are keyed on the instruction pointer relative to the function's start address essentially
the thing about the "register a callback with the gc" idea is that I think it would be equivalent to what we do with the temporary roots at the moment; add it to a list when it goes "into scope", remove it from the list when it "leaves the scope" 15:32
we already had a fun run-in with unwinding tables when the windows C runtime switched the "always properly unwind stack frames" switch in their longjmp implementation on 15:35
the problem being, moarvm's JIT wasn't generating & registering these tables for the generated code, so trying to MVM_exception_throw_adhoc past a jitted frame back into the interpreter would crash 15:36
ShimmerFairy I think the key issue is that *somebody* has to know who to update when things get shuffled around. I feel like the only way to avoid costs would be to set up the lists of variables-to-update in compile time, one list for every possible callstack, and then at the start of each relevant function is a `MVM_GLOBAL_list_to_update = &this_list`. 15:37
(and if a function can be part of multiple possible callstacks, that becomes way more annoying to do) 15:38
timo I'm still hoping there's something that can really only have run-time impact in the actual event where it's necessary 15:40
I'm actually writing a blog post vaguely related to this, but it's focused on APIs that are meant to be used when you generate code at run time 15:41
ShimmerFairy Worth noting that the vaguely similar idea of C++ exceptions is one of two places in that language where you're forced to pay a runtime cost even when you don't want it. (The other being RTTI for polymorphic stuff.) 15:42
timo oh, can you elaborate on the runtime cost for c++ exceptions? 15:44
ShimmerFairy While I'm far, *far* from the kind of expert programmer who could meaningfully help with this, so far it seems to me like one of those programming tasks where you wish there was a better way, but there isn't. Copying an image to framebuffer is always gonna be an O(width * height) operation for something in the computer, for instance. 15:45
timo: I may be misremembering slightly, I'm not familiar with the low-level details. There's certainly a space cost, but apparently in the Itanium ABI runtime cost only occurs when exceptions are triggered? 15:46
timo whoa, pthread already includes an unwinder in itself for the pthread cancellation handlers feature 15:47
I am certain that an IP-keyed table similar to exception handler tables can be used to get the addresses of gc-relevant pointers only at the critical moments, the main difficulty is to find a way where I wouldn't have to do it all by hand, for example by parsing all the generated .s files the compiler spits out :D 15:51
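A sketch of what such an IP-keyed table could look like once it exists, leaving aside the hard part of getting the compiler to produce it. The `StackMapEntry` layout, the sample `demo_map` values and `stack_map_lookup` are all hypothetical; the entries are assumed to be sorted by code offset and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stack map row: for code offsets in [pc_start, pc_end),
 * the listed frame-relative slots hold pointers to GC-managed objects. */
typedef struct {
    unsigned pc_start;
    unsigned pc_end;
    int      nslots;
    int      slot_offsets[4];
} StackMapEntry;

/* Made-up sample data for one function. */
static const StackMapEntry demo_map[] = {
    { 0x10, 0x24, 1, { -8 } },
    { 0x24, 0x40, 2, { -8, -16 } },
    { 0x58, 0x60, 1, { -24 } },
};

/* Binary search for the row covering `pc`; NULL means nothing in this
 * frame needs rooting at that instruction. */
static const StackMapEntry *stack_map_lookup(const StackMapEntry *map,
                                             size_t n, unsigned pc) {
    size_t lo = 0, hi = n;
    assert(map != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < map[mid].pc_start)
            hi = mid;
        else if (pc >= map[mid].pc_end)
            lo = mid + 1;
        else
            return &map[mid];
    }
    return NULL;
}
```

The lookup only runs when a collection actually happens, once per frame on the C stack, which is where the "budget" for extra work at GC time would be spent.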
ShimmerFairy I wonder if you could manage something by putting a list of pointers on the stack, and then a marker value that the GC can detect on the stack for functions to update. Maybe something like `{ DEFINE_UPDATE_MARKER(2); void * update_list[2] = {&foo, &bar}; ... }`
timo non-precise GCs, as opposed to the precise GC we have in moarvm, literally go through the stack and look for "stuff that looks like pointers to GC managed objects" with a variety of tricks to make it more reliable 15:52
ShimmerFairy (the idea in that example being that the marker comes just before the list, and tells you how big the list is)
timo I wonder if that's difficult to do with our use case, since the scopes variables are rooted in don't correspond 1:1 with function bodies 15:54
ShimmerFairy The compiler has to map out where all the stack variables go for that function, even if its lifetime is a lot shorter than the whole function. (Though I'm more familiar with generated assembler on old computers at the moment, so maybe x86_64 assembler does it different.) I think as long as a space on the stack doesn't get reused for pointers and non-pointers, it would work. 15:58
That is to say, as long as a function with blocks like `{ GC_Pointer * foo = init_thing(); ... } ... { int bar = 42; ... }` doesn't decide to have `foo` and `bar` share the same spot on the stack. 15:59
timo on higher optimization levels, I would say it's very likely that spots get reused 16:00
I would even say that it's not uncommon for stack frames to change in size from moment to moment
after all, that's what the "push" and "pop" instructions do
and even though they are often not emitted literally in optimized code, there definitely is add and sub to the stack pointer register that I've seen 16:01
I asked on the compiler explorer discord, they may have a good idea for where I can look 16:02
ShimmerFairy Yeah, my thinking is colored by the fact that I've been staring at ancient 32-bit MIPS code for a while now, particularly code which very rarely uses a frame pointer (that is, the functions are compiled so that they can allocate all the stack they need at the start and not think about it again).
timo :D 16:04
I have a mild interest in internals of retro video games 16:06
so I have at least a very mild familiarity with stuff like that I guess?
frame pointers are expensive to keep up to date, so it wouldn't be a surprise to see that code written for devices that have clock rates in the "multiple megahertz" would not bother to keep them 16:08
without frame pointers you can still fully properly unwind your stacks as long as you have the dwarf debug info (or equivalent), it's just more work
I've seen linux distros discuss moving back to -fno-omit-frame-pointer by default because it makes profiling and debugging and core dumps and all that much nicer to work with, and I agree. the performance impact seems not too terrible either 16:09
ShimmerFairy (To be clear, just in case the terms mean different things in AMD64, in MIPS land there are no explicit stack-handling instructions, so you just decrement the sp register at function start, and re-increment it at function end. the fp (frame pointer) register only gets used when you need to allocate more stack space mid-function for some reason.) 16:12
timo that makes sense
ShimmerFairy btw, looking at what could maybe be done with ICU, I was immediately reminded that it loves working on strings in UTF-16 form, which is annoying. Unless people have been bristling at MoarVM's memory usage wrt strings, that would mean converting to UTF-16 almost any time you want to use ICU's whole-string functions. 16:20
Luckily, property lookup will accept individual codepoints just fine, so that at least wouldn't be an immediate headache.
timo I had the impression that ICU is bad and we're far ahead of the competition because we rolled our own 16:21
that's probably an opinion formed a decade ago and not thought about since then 16:22
ShimmerFairy I've not been the biggest fan of ICU in the past, but really just for "I could totally write a better library (but never will)" reasons. When I need to do Unicode stuff in C++ it gets the job done well enough. I think the only real bad thing is that it's one of those insufferable projects that pegs its SONAME to the library's version, so ABI breakage happens on every major library version update. 16:24
I'm curious as to why MoarVM avoided ICU or some other external library in the past, but in any case I think it's worth exploring. When I went to update MoarVM for the latest Unicode, I couldn't help but begin to think about MoarVM's Unicode support long term; what happens the next time nobody is around to update this one project-specific Unicode impl? 16:26
timo we should extract moarvm's unicode stuff into a library that other projects can use, maybe we'll find volunteers that way? 16:29
not 100% serious, but also not 100% joking 16:30
ShimmerFairy That wouldn't be the worst idea, actually. I've always wanted to write a decent C++ library that let you work with strings like you can in Raku (after all these years I finally wrote something of a grammar library recently), but I'd probably want to give it C bindings at some point anyway. 16:31
One point in favor of ICU is that in 2016 (i.e. a few years after MoarVM started) it was transferred over to the Unicode Consortium's control. So while I don't think it's explicitly advertised as, like, The Official Library™, it also would be fair to say it's not merely "just another Unicode library". I would think that fact makes it a more reliable place to hang your hat than any other existing library. 16:34
timo IIRC parrot was using ICU and the decision was explicitly between ICU or roll-your-own, though i'm not sure if any alternatives were out there at the time and were considered 16:35
librakunicode 16:39
ShimmerFairy There are plenty of libraries out there, but I think ICU's probably the most comprehensive.
timo: Now that you've put the idea in my head, I think factoring things out would be a worthwhile idea to consider. At any rate, I think the current state of things isn't good longterm. If you think about it, asking virtual machine devs to periodically remember to be Unicode devs is a bit fragile. 16:42
timo right, getting people who are unicode devs into our boat by offering something that is useful for not just virtual machine devs who happen to need something for unicode could be a good choice 16:43
I feel like something like this has been at least mentioned a few times in the past by at least two different people
ShimmerFairy And like I said, I've been itching for a good C++ unicode library for a long time now (ICU is Java adapted to C adapted to C++, so it's not exactly built for the language). In fact, when I was thinking of replacing `ucd2c.pl` with a Raku script, I was wondering whether I'd just be duplicating work for my hypothetical future library, which would *also* need to parse the UCD into something. 16:45
timo I wonder how much mileage we could get out of something like ucd2sqlite 16:48
put something more programming-language-agnostic in the middle
so if someone wants to build a python or whatever version the "parse ucd" parts don't need to be touched at all 16:49
ShimmerFairy tbf there is the XML version of UCD provided by Unicode. Though in my opinion the true hard part is really just designing a good API to Unicode properties, the parsing feels relatively easy in comparison. 16:50
see: www.unicode.org/reports/tr42/ 16:51
timo that could be. i haven't seriously touched the ucd2c.pl parts of the thing in a long long time 16:52
ShimmerFairy My main beef with ucd2c.pl is just that it's Perl, which I'm not familiar with and means we don't get to use Grammars to parse the files. (OTOH, Raku's forceful string normalization might affect how well it can parse things here.) 16:54
.oO(I wonder if C++26 reflections would be helpful for this sort of codegen task...) 16:56
timo do the UCD source files we use actually have unicode strings in them? I seem to recall only the latin all-uppercase names and hexadecimal codes and property short codes and full names and such 17:06
ShimmerFairy I don't think so? I remember having to think about normalization when working on one of the roast tests, but that's because you have to assemble the test codepoint sequences into strings to actually test them. 17:11
Most UCD files are kept deliberately ASCII-only, with I believe only the various Test files using × and ÷ as the extent of their non-ASCII usage. I think as long as a Raku script sticks to `Uni.new(@codes).encode("utf-8")`, there shouldn't be any trouble. 17:12
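The reason going through code points is safe is that each one is encoded on its own, with no normalization step in between. A minimal C sketch of that per-code-point UTF-8 encoding (the function name `utf8_encode` is made up here, and this is not MoarVM's encoder):

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point as UTF-8 into `out` (room for 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates and for
 * values above U+10FFFF, which have no UTF-8 form. */
static size_t utf8_encode(unsigned long cp, unsigned char *out) {
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, × (U+00D7) comes out as the two bytes C3 97, and a combining mark such as U+0308 is emitted as CC 88 regardless of what precedes it.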
21:46 kjp left
21:47 kjp joined
23:25 librasteve_ left