|
06:04
kjp joined
08:35
librasteve_ joined
|
|||
| lizmat | do we actually have a tutorial / documentation on how encoding / decoding actually works in MoarVM | 10:40 | |
| in light of adding new encodings ? | |||
| possibly from module land | 10:43 | ||
| ShimmerFairy | A quick check of the docs directory says no, and I imagine there isn't any. MoarVM has been languishing with a very paltry selection of alternate encodings for its entire existence. | 11:26 | |
| I will say I'm not a fan of adding more encodings to MoarVM at this precise moment, because it's a very hardcoded affair, and I don't think it would scale well at all. (Honestly, I wonder if we shouldn't switch to an external library that's already done the work of implementing the world's encodings.) | 11:31 | ||
| lizmat | but a library that does NFG ? | 12:05 | |
| fwiw... I guess I'll check the pure Raku road a bit then | |||
| ShimmerFairy | NFG is just NFC + grapheme clusters, nothing that would be difficult to ask an external Unicode library for help with. (But I don't mean to suggest that swapping to external Unicode support would be trivial.) | 12:13 | |
| Just to be clear, adding new encodings ought to be totally doable, I just get nervous about how many `case MVM_encoding_type_foo:`s there'll be if we gave MoarVM a more reasonable selection of encodings to support. | 12:15 | ||
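
A rough sketch of one common alternative to a growing `switch` over encoding types: a lookup table of function pointers, so adding an encoding means adding a row. This is not MoarVM's code; `EncodingEntry`, `find_encoding` and the two decoders are hypothetical names made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decodes one byte into a codepoint; returns 0 on success, -1 if invalid. */
typedef int (*decode_byte_fn)(uint8_t byte, uint32_t *cp);

static int decode_ascii(uint8_t byte, uint32_t *cp) {
    if (byte > 0x7F)
        return -1;              /* ASCII only defines 0x00..0x7F */
    *cp = byte;
    return 0;
}

static int decode_latin1(uint8_t byte, uint32_t *cp) {
    *cp = byte;                 /* Latin-1 bytes map 1:1 onto U+0000..U+00FF */
    return 0;
}

/* Hypothetical registry row: one per supported single-byte encoding. */
typedef struct {
    const char    *name;
    decode_byte_fn decode;
} EncodingEntry;

static const EncodingEntry encodings[] = {
    { "ascii",      decode_ascii  },
    { "iso-8859-1", decode_latin1 },
};

/* Linear search by name; returns NULL for an unknown encoding. */
const EncodingEntry *find_encoding(const char *name) {
    for (size_t i = 0; i < sizeof encodings / sizeof encodings[0]; i++)
        if (strcmp(encodings[i].name, name) == 0)
            return &encodings[i];
    return NULL;
}
```

A caller would do `find_encoding("iso-8859-1")->decode(byte, &cp)` after checking for `NULL`, instead of branching on an enum at every call site.
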
| lizmat | understood... | 12:18 | |
| fwiw I don't think additional encodings would really require the type of speed that MoarVM supplies | |||
| ShimmerFairy | I don't know what the motivation for doing it ourselves was when MoarVM started, but I think it's worth reconsidering, since Unicode (+ the world of other encodings) is so big. ICU is the obvious option, and unlike when MoarVM started it's actually under the control of Unicode itself now. | 12:25 | |
| lizmat | so, do you have an inkling of an idea how much effort it would be to use ICU in MoarVM ? | 12:27 | |
| ShimmerFairy | It wouldn't solve every possible Unicode issue (IIRC anybody who wants the Unihan properties is still on their own), but it would mean less work on the MoarVM side overall. Now that I'm thinking of it, maybe I ought to experiment with swapping it in sometime, just to see. | 12:28 | |
| lizmat | ++ShimmerFairy | 12:33 | |
| ShimmerFairy | I've only used it on occasion myself, and I'm not super familiar with MoarVM code in general, but I would first guess that it only changes how things are implemented; ideally you wouldn't see any NQP/Raku code break, though some things might behave differently. | ||
| One example of that last point: currently in Rakudo querying properties requires some pretty precise spelling, when Unicode actually recommends a way to loosely match them. If the ICU interface to properties handles that automatically (which I think it does), then you suddenly get less fiddly /<:Whitespace>/ and the like. | 12:34 | ||
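
For reference, the loose matching Unicode recommends for property names (UAX #44, rule LM3) ignores case, whitespace, underscores and hyphens. A minimal C sketch of that comparison follows; `property_names_match` is a made-up name, and the sketch deliberately leaves out the rule's "ignore an initial `is` prefix" part.

```c
#include <ctype.h>
#include <stdbool.h>

/* Characters that loose matching skips over entirely. */
static bool is_ignorable(unsigned char c) {
    return c == '_' || c == '-' || isspace(c);
}

/* Compares two ASCII property names, ignoring case, whitespace,
 * underscores and hyphens. Simplified: no "is" prefix handling. */
bool property_names_match(const char *a, const char *b) {
    for (;;) {
        while (*a && is_ignorable((unsigned char)*a)) a++;
        while (*b && is_ignorable((unsigned char)*b)) b++;
        if (*a == '\0' || *b == '\0')
            return *a == *b;    /* match only if both ended together */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
}
```

With this, `White_Space`, `whitespace` and `WHITE SPACE` all compare equal, which is the "less fiddly" behaviour described above.
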
| lizmat | ah, and a whole set of lookup hashes in the rakudo core could go then :-) | 12:40 | |
| ShimmerFairy | I don't know if Rakudo itself does anything with processing property names, but at the very least MoarVM would be freed of a lot of stuff it has to do (and sometimes hasn't been doing). Most if not all of the UCD parsing stuff could go, for instance. | 12:42 | |
| nemokosch | I wonder how well it's going to integrate with Raku regexes | 12:46 | |
|
13:14
MasterDuke joined
|
|||
| MasterDuke | ShimmerFairy: did you see github.com/MoarVM/MoarVM/issues/1988 ? | 13:15 | |
| ShimmerFairy | probably not, looking | ||
| I didn't poke at any of the ucd2c.pl code that actually writes out C code, so I'm very unfamiliar with how the resulting C files actually work. | 13:20 | ||
| MasterDuke | i know nothing about it either, but the C looks incorrect enough that i would assume some unicode tests should fail | 13:27 | |
|
13:48
MasterDuke left
|
|||
| timo | gustedt.wordpress.com/2026/02/15/d...and-clang/ i wish we didn't have to target msvc :| | 14:37 | |
| ShimmerFairy | That seems like a neat feature. At least it being in a *future* version of C means you're not stuck desperately waiting to be allowed to use the latest tech. Also note the future stdmchar.h header, which would give us some long-overdue functions for converting between the various UTFs, as well as the current locale's char and wchar_t. | 14:47 | |
| timo | we'll probably only have to wait 20 years this time for MSVC to catch up | 14:48 | |
| ShimmerFairy | I've got a long long history of banging my head against the wall waiting for gcc to implement some new C++ thing I want. (Luckily, they were remarkably quick with reflections for once, so that's nice.) Almost as bad as waiting for C and C++ to have good Unicode support (why I'm so excited for stdmchar.h). | 14:50 | |
| It seems that GCC and Clang had a "cleanup" attribute long before "defer" was a thing. I wonder if MSVC has a similar compiler extension that could be leaned on in the meantime. | 14:54 | ||
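
A small sketch of the GCC/Clang `cleanup` variable attribute mentioned here, which runs a handler when the variable goes out of scope. The helper names (`free_buffer`, `format_greeting`) and the `buffers_freed` counter are illustrative only, and this does not compile with MSVC.

```c
#include <stdio.h>
#include <stdlib.h>

/* Counts handler invocations, only so the effect is observable. */
static int buffers_freed = 0;

/* A cleanup handler receives a pointer to the annotated variable. */
static void free_buffer(char **slot) {
    free(*slot);                /* free(NULL) is a harmless no-op */
    *slot = NULL;
    buffers_freed++;
}

/* Returns the formatted length, or -1 on failure. */
int format_greeting(const char *name) {
    char *buf __attribute__((cleanup(free_buffer))) = malloc(64);
    if (buf == NULL)
        return -1;              /* handler still runs on this path */
    int len = snprintf(buf, 64, "hello, %s", name);
    if (len < 0 || len >= 64)
        return -1;              /* early return: buf is released here too */
    return len;                 /* ... and on the normal path */
}
```

Every exit from `format_greeting` releases `buf` without an explicit `free` at each `return`, which is roughly the convenience `defer` is meant to standardize.
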
| timo | hey ShimmerFairy, you might have an idea for this: can we do something smarter for our temporary roots than the push and pop every time a relevant variable goes into or out of scope, with the help of the C compiler? the same mechanism that helps unroll stack frames based on where the IP currently is should theoretically be usable to turn the IP in a frame into a list of addresses relative to stack | 15:08 | |
| or frame pointer that need to be rooted at arbitrary moments | |||
| but i don't have any ideas how we could get the C compiler to give that to us | |||
| apart from using the full DWARF debug info and picking out the bits we want | |||
| which seems like a lot of work both to program and to do at run time | 15:09 | ||
| since there's so much in there we're not interested in | |||
| ShimmerFairy | I'm not familiar with the garbage collection stuff at all (had to look up what you meant by "temporary roots"), but IIUC the idea is that you want raw C pointers (always pointers?) to follow the GC-controlled objects they're pointing to? | 15:21 | |
| timo | right, this is only about pointers to gc-managed objects that live on the C stack in all the internal C functions moarvm has | 15:23 | |
| we have the MVMROOT{,2,3,4,5,6,7,8} macros that make it kind of not so annoying to work with compared to explicitly pushing and popping | 15:24 | ||
| but that's really not more than syntactic sugar | |||
| i reckon that we do orders of magnitude more useless pushes and pops "just in case" we have to GC than pushes and pops that actually end up mattering | |||
| so even though scanning the stack for roots would surely be more work than just taking them out of our little temporary roots stack, we'd only do it when we actually need to do GC, and that should make a huge difference | 15:25 | ||
| or rather, that should give us a decent budget for doing work at that point in time | |||
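
To make the push/pop cost concrete, here is a toy sketch of a temporary-roots stack with a moving collector. It is not MoarVM's implementation: `Obj`, `root_push`, `root_pop` and `collect` are hypothetical, and the "collector" simply copies each rooted object and fixes up the C local that pointed at it.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_TEMP_ROOTS 64

typedef struct Obj { int payload; } Obj;

/* Addresses of C locals that currently hold object pointers. */
static Obj  **temp_roots[MAX_TEMP_ROOTS];
static size_t num_temp_roots = 0;

static void root_push(Obj **slot) {
    if (num_temp_roots == MAX_TEMP_ROOTS)
        abort();                          /* sketch: no growth strategy */
    temp_roots[num_temp_roots++] = slot;
}

static void root_pop(void) {
    if (num_temp_roots > 0)
        num_temp_roots--;
}

/* Stand-in for a moving GC: copy each rooted object into to_space and
 * update the local. No forwarding pointers, so this assumes every root
 * refers to a distinct object. */
static void collect(Obj *to_space) {
    for (size_t i = 0; i < num_temp_roots; i++) {
        to_space[i] = **temp_roots[i];
        *temp_roots[i] = &to_space[i];
    }
}

/* Returns 1 if the rooted local followed its object across a collection. */
int demo(void) {
    Obj from_space[1] = { { 42 } };
    Obj to_space[MAX_TEMP_ROOTS];
    Obj *local = &from_space[0];

    root_push(&local);                    /* paid on every call ... */
    collect(to_space);                    /* ... even though GC here is rare */
    int moved = (local == &to_space[0]) && local->payload == 42;
    root_pop();
    return moved;
}
```

The push and pop in `demo` happen whether or not `collect` ever runs, which is the "just in case" overhead being described.
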
| ShimmerFairy | Since I'm far more experienced in C++, in that world I'd expect pointers to GC objects to be represented by a class which registers itself as a callback in the GC, and then that information rests with the GC for the lifetime of the "gc_pointer" object. No clue if that's possible in C, or if a C++ class that the C code gets pointers to would work. | 15:28 | |
| timo: If your idea is to walk up the callstack at runtime to search for pointers, I think my starting point would be looking at how C++ exceptions are implemented, since they have to walk up the stack to find a suitable `catch`. | 15:29 | ||
| timo | right, that's stack unwinding, the compiler generates tables of "instructions" that are keyed on the instruction pointer relative to the function's start address essentially | ||
| the thing about the "register a callback with the gc" idea is that I think it would be equivalent to what we do with the temporary roots at the moment; add it to a list when it goes "into scope", remove it from the list when it "leaves the scope" | 15:32 | ||
| we already had a fun run-in with unwinding tables when the windows C runtime switched the "always properly unwind stack frames" switch in their longjmp implementation on | 15:35 | ||
| the problem being, moarvm's JIT wasn't generating & registering these tables for the generated code, so trying to MVM_exception_throw_adhoc past a jitted frame back into the interpreter would crash | 15:36 | ||
| ShimmerFairy | I think the key issue is that *somebody* has to know who to update when things get shuffled around. I feel like the only way to avoid costs would be to set up the lists of variables-to-update in compile time, one list for every possible callstack, and then at the start of each relevant function is a `MVM_GLOBAL_list_to_update = &this_list`. | 15:37 | |
| (and if a function can be part of multiple possible callstacks, that becomes way more annoying to do) | 15:38 | ||
| timo | I'm still hoping there's something that can really only have run-time impact in the actual event where it's necessary | 15:40 | |
| I'm actually writing a blog post vaguely related to this, but it's focused on APIs that are meant to be used when you generate code at run time | 15:41 | ||
| ShimmerFairy | Worth noting that the vaguely similar idea of C++ exceptions is one of two places in that language where you're forced to pay a runtime cost even when you don't want it. (The other being RTTI for polymorphic stuff.) | 15:42 | |
| timo | oh, can you elaborate on the runtime cost for c++ exceptions? | 15:44 | |
| ShimmerFairy | While I'm far, *far* from the kind of expert programmer who could meaningfully help with this, so far it seems to me like one of those programming tasks where you wish there was a better way, but there isn't. Copying an image to framebuffer is always gonna be an O(width * height) operation for something in the computer, for instance. | 15:45 | |
| timo: I may be misremembering slightly, I'm not familiar with the low-level details. There's certainly a space cost, but apparently in the Itanium ABI runtime cost only occurs when exceptions are triggered? | 15:46 | ||
| timo | whoa, pthread already includes an unwinder in itself for the pthread cancellation handlers feature | 15:47 | |
| I am certain that an IP-keyed table similar to exception handler tables can be used to get the addresses of gc-relevant pointers only at the critical moments, the main difficulty is to find a way where I wouldn't have to do it all by hand, for example by parsing all the generated .s files the compiler spits out :D | 15:51 | ||
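
A sketch of what such an IP-keyed table could look like, assuming some tool had already produced it: each entry covers a range of code offsets within a function and lists frame-relative offsets of live object pointers. `RootMapEntry`, `root_map_lookup` and the sample data are invented for illustration; producing the table is the hard part and is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* One row: for code offsets in [start, end), these frame-relative
 * slots hold pointers to GC-managed objects. */
typedef struct {
    uint32_t start;             /* inclusive, relative to function start */
    uint32_t end;               /* exclusive */
    uint8_t  num_slots;
    int16_t  slot_offsets[4];
} RootMapEntry;

/* Binary search over entries sorted by start with non-overlapping ranges.
 * Returns NULL when nothing needs rooting at this offset. */
const RootMapEntry *root_map_lookup(const RootMapEntry *map, size_t count,
                                    uint32_t ip_offset) {
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ip_offset < map[mid].start)
            hi = mid;
        else if (ip_offset >= map[mid].end)
            lo = mid + 1;
        else
            return &map[mid];
    }
    return NULL;
}

/* Made-up table for one imaginary function. */
const RootMapEntry example_map[] = {
    { 0x10, 0x40, 1, { -8 } },
    { 0x40, 0x58, 2, { -8, -16 } },
    { 0x70, 0x90, 1, { -24 } },
};
```

The lookup only runs when a collection actually happens, so the normal execution path would pay nothing beyond the table's size.
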
| ShimmerFairy | I wonder if you could manage something by putting a list of pointers on the stack, and then a marker value that the GC can detect on the stack for functions to update. Maybe something like `{ DEFINE_UPDATE_MARKER(2); void * update_list[2] = {&foo, &bar}; ... }` | ||
| timo | non-precise GCs, as opposed to the precise GC we have in moarvm, literally go through the stack and look for "stuff that looks like pointers to GC managed objects" with a variety of tricks to make it more reliable | 15:52 | |
| ShimmerFairy | (the idea in that example being that the marker comes just before the list, and tells you how big the list is) | ||
| timo | I wonder if that's difficult to do with our use case, since the scopes variables are rooted in don't correspond 1:1 with function bodies | 15:54 | |
| ShimmerFairy | The compiler has to map out where all the stack variables go for that function, even if its lifetime is a lot shorter than the whole function. (Though I'm more familiar with generated assembler on old computers at the moment, so maybe x86_64 assembler does it different.) I think as long as a space on the stack doesn't get reused for pointers and non-pointers, it would work. | 15:58 | |
| That is to say, as long as a function with blocks like `{ GC_Pointer * foo = init_thing(); ... } ... { int bar = 42; ... }` doesn't decide to have `foo` and `bar` share the same spot on the stack. | 15:59 | ||
| timo | on higher optimization levels, I would say it's very likely that spots get reused | 16:00 | |
| I would even say that it's not uncommon for stack frames to change in size from moment to moment | |||
| after all, that's what the "push" and "pop" instructions do | |||
| and even though they are often not emitted literally in optimized code, there definitely is add and sub to the stack pointer register that I've seen | 16:01 | ||
| I asked on the compiler explorer discord, they may have a good idea for where I can look | 16:02 | ||
| ShimmerFairy | Yeah, my thinking is colored by the fact that I've been staring at ancient 32-bit MIPS code for a while now, particularly code which very rarely uses a frame pointer (that is, the functions are compiled so that they can allocate all the stack they need at the start and not think about it again). | ||
| timo | :D | 16:04 | |
| I have a mild interest in internals of retro video games | 16:06 | ||
| so I have at least a very mild familiarity with stuff like that I guess? | |||
| frame pointers are expensive to keep up to date, so it wouldn't be a surprise to see that code written for devices that have clock rates in the "multiple megahertz" would not bother to keep them | 16:08 | ||
| without frame pointers you can still fully properly unwind your stacks as long as you have the dwarf debug info (or equivalent), it's just more work | |||
| I've seen linux distros discuss moving back to -fno-omit-frame-pointer by default because it makes profiling and debugging and core dumps and all that much nicer to work with, and I agree. the performance impact seems not too terrible either | 16:09 | ||
| ShimmerFairy | (To be clear, just in case the terms mean different things in AMD64, in MIPS land there are no explicit stack-handling instructions, so you just decrement the sp register at function start, and re-increment it at function end. the fp (frame pointer) register only gets used when you need to allocate more stack space mid-function for some reason.) | 16:12 | |
| timo | that makes sense | ||
| ShimmerFairy | btw, looking at what could maybe be done with ICU, I was immediately reminded that it loves working on strings in UTF-16 form, which is annoying. Unless people have been bristling at MoarVM's memory usage wrt strings, that would mean converting to UTF-16 almost any time you want to use ICU's whole-string functions. | 16:20 | |
| Luckily, property lookup will accept individual codepoints just fine, so that at least wouldn't be an immediate headache. | |||
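
For a sense of what the per-codepoint part of a conversion to UTF-16 involves, here is a self-contained sketch that does not use ICU; `utf16_encode` is a hypothetical helper, not an ICU or MoarVM function.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one codepoint as UTF-16. Writes 1 or 2 code units to out and
 * returns how many were written, or 0 if cp is not a Unicode scalar
 * value (a surrogate, or above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Each codepoint is cheap to convert, but doing this for a whole string before every whole-string library call is the overhead being pointed at.
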
| timo | I had the impression that ICU is bad and we're far ahead of the competition because we rolled our own | 16:21 | |
| that's probably an opinion formed a decade ago and not thought about since then | 16:22 | ||
| ShimmerFairy | I've not been the biggest fan of ICU in the past, but really just for "I could totally write a better library (but never will)" reasons. When I need to do Unicode stuff in C++ it gets the job done well enough. I think the only real bad thing is that it's one of those insufferable projects that pegs its SONAME to the library's version, so ABI breakage happens on every major library version update. | 16:24 | |
| I'm curious as to why MoarVM avoided ICU or some other external library in the past, but in any case I think it's worth exploring. When I went to update MoarVM for the latest Unicode, I couldn't help but begin to think about MoarVM's Unicode support long term; what happens the next time nobody is around to update this one project-specific Unicode impl? | 16:26 | ||
| timo | we should extract moarvm's unicode stuff into a library that other projects can use, maybe we'll find volunteers that way? | 16:29 | |
| not 100% serious, but also not 100% joking | 16:30 | ||
| ShimmerFairy | That wouldn't be the worst idea, actually. I've always wanted to write a decent C++ library that let you work with strings like you can in Raku (after all these years I finally wrote something of a grammar library recently), but I'd probably want to give it C bindings at some point anyway. | 16:31 | |
| One point in favor of ICU is that, since 2016 (i.e. a few years after MoarVM started), it was transferred over to the Unicode consortium's control. So while I don't think it's explicitly advertised as, like, The Official Library™, it also would be fair to say it's not merely "just another Unicode library". I would think that fact makes it a more reliable place to hang your hat than any other existing library. | 16:34 | ||
| timo | IIRC parrot was using ICU and the decision was explicitly between ICU or roll-your-own, though i'm not sure if any alternatives were out there at the time and were considered | 16:35 | |
| librakunicode | 16:39 | ||
| ShimmerFairy | There are plenty of libraries out there, but I think ICU's probably the most comprehensive. | ||
| timo: Now that you've put the idea in my head, I think factoring things out would be a worthwhile idea to consider. At any rate, I think the current state of things isn't good longterm. If you think about it, asking virtual machine devs to periodically remember to be Unicode devs is a bit fragile. | 16:42 | ||
| timo | right, getting people who are unicode devs into our boat by offering something that is useful for not just virtual machine devs who happen to need something for unicode could be a good choice | 16:43 | |
| I feel like something like this has been at least mentioned a few times in the past by at least two different people | |||
| ShimmerFairy | And like I said, I've been itching for a good C++ unicode library for a long time now (ICU is Java adapted to C adapted to C++, so it's not exactly built for the language). In fact, when I was thinking of replacing `ucd2c.pl` with a Raku script, I was wondering about if I'd just be duplicating work for my hypothetical future library, which would *also* need to parse the UCD into something. | 16:45 | |
| timo | I wonder how much mileage we could get out of something like ucd2sqlite | 16:48 | |
| put something more programming-language-agnostic in the middle | |||
| so if someone wants to build a python or whatever version the "parse ucd" parts don't need to be touched at all | 16:49 | ||
| ShimmerFairy | tbf there is the XML version of UCD provided by Unicode. Though in my opinion the true hard part is really just designing a good API to Unicode properties, the parsing feels relatively easy in comparison. | 16:50 | |
| see: www.unicode.org/reports/tr42/ | 16:51 | ||
| timo | that could be. i haven't seriously touched the ucd2c.pl parts of the thing in a long long time | 16:52 | |
| ShimmerFairy | My main beef with ucd2c.pl is just that it's Perl, which I'm not familiar with and means we don't get to use Grammars to parse the files. (OTOH, Raku's forceful string normalization might affect how well it can parse things here.) | 16:54 | |
| .oO(I wonder if C++26 reflections would be helpful for this sort of codegen task...) | 16:56 | ||
| timo | do the ucd source files we use actually have unicode strings in them? I seem to recall only the latin all-uppercase names and hexadecimal codes and property short codes and full names and such | 17:06 | |
| ShimmerFairy | I don't think so? I remember having to think about normalization when working on one of the roast tests, but that's because you have to assemble the test codepoint sequences into strings to actually test them. | 17:11 | |
| Most UCD files are kept deliberately ASCII-only, with I believe only the various Test files using × and ÷ as the extent of their non-ASCII usage. I think as long as a Raku script sticks to `Uni.new(@codes).encode("utf-8")`, there shouldn't be any trouble. | 17:12 | ||
|
21:46
kjp left
21:47
kjp joined
23:25
librasteve_ left
|
|||