ShimmerFairy I'm not actually super familiar with how MoarVM stores its Unicode properties, but if I ever get around to modernizing that ucd perl script I certainly will be. 06:36
[Coke] ShimmerFairy: any thoughts on making the unicode_1_name property available? I may take a stab at that so I can try to close a module issue request I gave myself. :) 13:04
ShimmerFairy So long as you don't expose it as if it were just another kind of character name (that is, .uniprop("Unicode_1_Name") but not "\c[OLD NAME]"), I think that'd be OK. I need to refresh my memory, but I'm confident that 1.0 names are not part of the character name namespace, so integrating it into that namespace in places like \c[...] needs to be thought through. 13:11
[Coke] I just want it to be available in .uniprop, honestly 13:18
ShimmerFairy Yeah, that's fine, and in general all Unicode properties ought to be available anyway (well, except for the provisional ones, but none exist in the "base" UCD properties, only in Unihan and similar areas) 13:27
timo I see that we are storing "EGYPTIAN HIEROGLYPH-13460"/* 13460 */ through "EGYPTIAN HIEROGLYPH-143FA"/* 143FA */ with their full name; we do have a mechanism to handle codepoints where the number is part of the name, so that could be added to that 15:37
lizmat does that also apply to: 15:39
m: say 0xEFFFD.chr.uniname
evalable6 <reserved-EFFFD>
lizmat ?
or is there logic to generate that uniname ? 15:40
timo same for "KHITAN SMALL SCRIPT CHARACTER-18B00"/* 18B00 */ through "KHITAN SMALL SCRIPT CHARACTER-18CD5"/* 18CD5 */ but that's a much smaller block
CONTROL, RESERVED, SURROGATE, PRIVATE-USE all use this 15:41
also CJK UNIFIED IDEOGRAPH-, CJK COMPATIBILITY IDEOGRAPH-, and TANGUT IDEOGRAPH- 15:42
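The "number is part of the name" pattern timo describes can be sketched roughly as below. This is an illustration in Python, not MoarVM's C code: the table layout and the function names `derived_name` and `derived_codepoint` are hypothetical, and only the two ranges quoted in the chat are listed.

```python
# Hypothetical (first, last, prefix) table. The two ranges are the ones quoted
# in the chat; the data structure itself is an assumption for illustration.
DERIVED_RANGES = [
    (0x13460, 0x143FA, "EGYPTIAN HIEROGLYPH-"),
    (0x18B00, 0x18CD5, "KHITAN SMALL SCRIPT CHARACTER-"),
]

def derived_name(cp):
    """Return prefix + uppercase hex of cp if cp lies in a listed range, else None."""
    for first, last, prefix in DERIVED_RANGES:
        if first <= cp <= last:
            return f"{prefix}{cp:04X}"
    return None

def derived_codepoint(name):
    """Inverse lookup: parse the hex suffix and check it lies in the range."""
    for first, last, prefix in DERIVED_RANGES:
        if name.startswith(prefix):
            suffix = name[len(prefix):]
            try:
                cp = int(suffix, 16)
            except ValueError:
                return None
            # Require the canonical spelling so the mapping stays 1:1.
            if first <= cp <= last and f"{cp:04X}" == suffix:
                return cp
    return None
```

With a table like this, none of the names in those ranges would need to be stored as full strings; both directions are computed from the code point.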
[Coke] I opened github.com/rakudo/rakudo/issues/6108 15:45
timo gist.github.com/timo/5b4c22ffeb8a5...7da5b03ec1 words in unicode character names by frequency with examples of each 15:57
so i think unicode names have just the letters A through Z, the digits 0 through 9, the dash - and a single space 16:08
leaving me with theoretically 226 byte values to encode a longer word, or choosing a few of these values as a prefix to a second byte giving 225 + 255, or 224 + 2 * 255, or 226 - x + x * 255 for x < 226? 16:11
hm, but when we create the hash for looking up characters by name we put const char * in the entries for the actual names 16:16
so then in the cases where the hash is needed we would end up with 1x the "compressed" storage and then extra space for the expanded versions of these characters? 16:17
OTOH the "compression" scheme is round-trippable; we could compress a string before looking it up in the hash then we'd just store the compressed name in there and can be using const char* into static memory still 16:19
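A minimal sketch of the round-trippable word "compression" timo is thinking out loud about, written in Python rather than MoarVM's C. The word list, the starting code value 0x80, and all function names are assumptions made for illustration; the real word choice would presumably come from a frequency list like the one in timo's gist. `capacity` just restates the arithmetic from the chat, taking the figures 226 and 255 as given there.

```python
# Illustrative word list only; a real table would be built from frequency data.
WORDS = ["LETTER", "SIGN", "WITH", "SMALL", "CAPITAL"]
FIRST_CODE = 0x80  # assumed: byte values outside A-Z, 0-9, '-' and ' '
WORD_TO_CODE = {w: FIRST_CODE + i for i, w in enumerate(WORDS)}
CODE_TO_WORD = {c: w for w, c in WORD_TO_CODE.items()}

def compress(name):
    """Replace each whole word found in the table with its one-byte code."""
    out = bytearray()
    for i, token in enumerate(name.split(" ")):
        if i:
            out.append(0x20)  # keep the separating space as a literal byte
        code = WORD_TO_CODE.get(token)
        if code is not None:
            out.append(code)
        else:
            out.extend(token.encode("ascii"))
    return bytes(out)

def decompress(data):
    """Expand code bytes back into words; other bytes are literal ASCII."""
    return "".join(CODE_TO_WORD.get(b, chr(b)) for b in data)

def capacity(spare, x):
    """Codes available if x of the spare byte values become two-byte prefixes."""
    return spare - x + x * 255

# Name lookup keyed by the compressed form, so only compressed names are stored.
BY_NAME = {compress(n): cp for n, cp in [
    ("LATIN SMALL LETTER A", 0x61),
    ("LATIN CAPITAL LETTER A", 0x41),
]}

def lookup(name):
    """Compress the query first, then look it up in the table."""
    return BY_NAME.get(compress(name))
```

Because `decompress(compress(name))` returns the original name, the lookup table never needs the expanded strings: a query is compressed first and compared against the stored compressed bytes.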
ShimmerFairy Looking at the standard, EGYPTIAN HIEROGLYPH is one of the kinds of "derived" names listed, so the fact that MoarVM doesn't encode them in the same way as the others comes down to me not knowing I should think of it when I was doing the 17.0 upgrade. 16:21
timo the hieroglyphs are kind of two sections, one where the code point number is in it, one where the mapping isn't 1:1 because there's like multiple variants of one in a row before the next number 16:22
ShimmerFairy Oh that's weird, UnicodeData.txt didn't "compress" the Egyptian hieroglyphs into ranges like the CJK ideographs are, which explains why this new kind of derived name slipped through unnoticed. 16:23
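One way a generator script could notice derived names that are spelled out entry by entry is to look for runs of consecutive code points whose name ends in "-" plus their own hex value. The sketch below is in Python and is not taken from ucd2c.pl; `collapse_derived_runs` is a hypothetical helper, and it assumes the input has already been parsed into (code point, name) pairs sorted by code point.

```python
def collapse_derived_runs(records):
    """Collapse spelled-out derived names into (first, last, prefix) ranges.

    records: iterable of (codepoint, name) pairs sorted by codepoint.
    A name counts as derived here if it ends in '-' + uppercase hex of its
    own code point; consecutive code points sharing a prefix form one run.
    """
    runs = []
    for cp, name in records:
        hexpart = f"{cp:04X}"
        if not name.endswith("-" + hexpart):
            continue  # ordinary name, nothing to collapse
        prefix = name[:-len(hexpart)]  # keeps the trailing '-'
        if runs and runs[-1][1] == cp - 1 and runs[-1][2] == prefix:
            # Extend the current run by one code point.
            runs[-1] = (runs[-1][0], cp, prefix)
        else:
            runs.append((cp, cp, prefix))
    return runs
```

Names such as "EGYPTIAN HIEROGLYPH A001" do not match the pattern and are left alone, so only the blocks whose names really embed the code point would be turned into ranges.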
timo m: .uniname.say for 0x13000..0x13010
evalable6 EGYPTIAN HIEROGLYPH A001
EGYPTIAN HIEROGLYPH A0…
timo, Full output: gist.github.com/b534312bdf6eb40c21...6cc773e732
timo m: .uniname.words.tail.say for 0x13000..0x13010
evalable6 A001
A002
A003
A004
A005
A005A
A006
A006A
A006B
A007
A008
A009
A010
A011
A012
A013
A014
timo ... hard to see but there's 006, 006A, 006B there 16:24
ShimmerFairy Actually, it looks like a few of the derived name ranges are spelled out.
Oh huh, looks like Unicode's wording is in need of updating. Rule NR2 suggests that only characters with Ideographic=True are possibly affected by its rule, but the Egyptian hieroglyphs mentioned in the relevant table don't have that property. 16:29
timo sounds like you can post a bug report to the unicode consortium! :D 16:30
ShimmerFairy I wonder, would it be worth it for ucd2c.pl to handle derived names that UnicodeData spells out? Currently it only does special stuff for the ranges that aren't already spelled out, and that just gives them all a dummy name that tells moarvm to generate the real name at runtime, I think. 16:35
timo: btw, was it mimalloc that was causing fetch issues a little while ago? Updating moar I just got the message (on a second 'git pull') of "fatal: couldn't find remote ref refs/heads/master", and turns out the mimalloc main branch is, er, "main". 16:40
timo right, we have something called "extents", MVM_NUM_UNICODE_EXTENTS counts them, and they do something in generate_codepoints_by_name
uh yeah could have been mimalloc 16:41
i wonder why it was referencing "master" in your case; is that something that requires "git submodule sync" to fix or just update? 16:42
ShimmerFairy On the second pull it fell back to grabbing the commit directly. I just ran "git submodule sync" but I don't know how to check if it solved the problem, since my copy of moar is now up-to-date. 16:43
timo git submodules still serve to stump, it seems 16:51
ShimmerFairy I just checked, and I've had this copy of the repo since at least mid-2016, so I'm willing to bet that something was just outdated, and that perhaps that sync fixed things for the future. 16:53