00:09 Ven joined 00:49 Ven joined 01:09 Ven joined 01:12 vendethiel joined 01:28 Ven joined 01:48 Ven joined 02:08 Ven joined 02:29 Ven joined 02:46 Ven joined 02:48 ilbot3 joined 03:04 Ven joined 03:24 Ven joined 03:26 lizmat joined 03:45 Ven joined 04:04 Ven joined 04:24 Ven joined 04:44 Ven joined 05:04 Ven joined 05:24 Ven joined 05:44 Ven joined 06:04 Ven joined 06:24 Ven joined 06:50 Ven joined 07:02 brrt joined 07:13 domidumont joined 07:19 domidumont joined 07:21 Ven joined 07:33 geekosaur joined
nwc10 good *, jnthn 08:11
jnthn: subtle attempt to get me to create a coffee slick - table football players send a ball ricocheting round the kitchen floor under me, to distract me 08:12
and I feel that I'll use that as a feed line (er, drink) to spam the channel with the information that we're trying to recruit coffee drinking table football players: unternehmen.geizhals.at/about/de/jobs/ 08:13
(says so in the job ads)
brrt good *, nwc10 08:17
arnsholt I'm decent at table football (for a Norwegian, anyways), but I've already got a job (and not that keen on moving to Vienna =) 08:18
I'm very good at drinking coffee, though
brrt i suck at table football and at table tennis and most table sports 08:19
nwc10 I suck too (at all these things)
brrt caffeine intake tolerance is reasonable
nwc10 I think that they tolerate me because I drink some coffee and can do Perl OK.
brrt nwc10: you're situated in vienna, then? 08:21
do they speak german all the time?
nwc10 no. because I slack at tech stuff in German
and "for some value of" German
(they have different words for some things, which they insist they are correct)
and the best bit - "so that word that I don't know, is it German German, Austrian German, or something anyone outside of Vienna would look at me funny for?" 08:22
"don't know"
(sometimes)
arnsholt =D
nwc10 Vienna is interesting, particularly coming from London and having lived in Cambridge 08:23
it's small enough to feel like Cambridge
but I think I can just see (a corner of) the OPEC HQ (if I look down a particular gap from exactly the right place in the office) 08:24
and it's the UN's number 3 city (after NY and Geneva)
so it's not even just "a capital city"
it's got some pretensions above that
brrt is vienna expensive to live in? 08:25
nwc10 not by London standards :-)
(this is hard to answer) 08:26
brrt i guess, yeah.
comparing 'costs of living' is really difficult across countries
nwc10 I forget where you'd need to go to find the numbers for the "Big Mac" index 08:27
but more useful things like the ratio of "median property cost" to "median salary"
or other stuff that tries to figure out how hard you need to run just to stand still
no idea
08:35 zakharyas joined 08:37 Ven joined
arnsholt nwc10: Big Mac index is The Economist, IIRC 08:50
nwc10 I can tell you that McDonalds here sells beer and takes Amex 08:51
which partially makes up for the fact that it's McDonalds :-)
08:54 Ven joined
brrt hehehe 09:04
arnsholt Just like Denmark! 09:08
samcv idk if you guys saw what i said in #perl6 but, we can save 1/3 of the size of the unicode name database if we compress as base 40
arnsholt (Beer at McD's, that is)
samcv which is pretty nice. saves 250KB out of a total of 787KB 09:09
arnsholt Nice!
Since you're fiddling with that kind of idea: Have you tried Huffman coding it too? 09:10
samcv nope
i mean the strings are short. so. idk
though i guess the whole thing could be compressed together but, still 09:11
nwc10 runtime readonly access (from memory shared between processes) usually for the (most) win
samcv also one thing that is annoying is that it takes about as much space to store the point indexes compared to the actual unicode data 09:12
so I need to find a way to somehow compress that
i mean maybe I could have a bunch of smaller structs? and one for each set of codepoints? 09:13
arnsholt Point indexes?
samcv so I would have to store much lower numbers and could use narrower int's. actually that's a fantastic idea. have been thinking what to do about that for a while
that map each point to a column in the unicode data struct 09:14
arnsholt Aha
samcv well it depends how many we end up having tbh
09:14 Ven joined
samcv it could be a lot and it gets expensive if it goes over the length of a short 09:15
well even short's are too big
there's so many points
either way, splitting up points with an offset so i can use the narrowest value will save a lot of space 09:16
nwc10 I'm not familiar enough with the data to know if this is a daft suggestion, but IIRC Unicode does tend to do things in 256 codepoint blocks 09:17
so is there a saving if you do some sort of "long" pointer based on the block, and then a shorter offset table for each of the 256 code points in the block, and add them together?
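A minimal sketch of the block-plus-offset idea nwc10 describes, assuming the rows referenced inside one 256-codepoint block stay within 255 of a per-block base. The names `BlockIndex` and `lookup_row` are hypothetical and are not taken from MoarVM.

```c
#include <stdint.h>

/* Hypothetical layout: one wide base per 256-codepoint block plus a
 * narrow per-codepoint offset that is added to it. */
#define BLOCK_SIZE 256

typedef struct {
    uint32_t base;                /* wide "long pointer" for the block */
    uint8_t  offset[BLOCK_SIZE];  /* narrow delta for each codepoint   */
} BlockIndex;

/* Returns the row index for a codepoint: block base plus narrow offset. */
static uint32_t lookup_row(const BlockIndex *blocks, uint32_t cp) {
    const BlockIndex *b = &blocks[cp / BLOCK_SIZE];
    return b->base + b->offset[cp % BLOCK_SIZE];
}
```

Whether this saves anything depends on the assumption holding for the real data: each block costs 4 + 256 bytes here, against 512 bytes for 256 plain 16-bit indexes.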
09:18 domidumont joined
samcv that is sort of kinda truish. but the blocks are mostly irrelevant because they basically automatically sort themself. because no row in the bitmap is identical 09:18
so storing indexes to that bitmap is the bigger issue
since there are 0xE01EF codepoints, well technically there are higher, but that's the highest named codepoint 09:19
so that fits into a 20bit integer, but if we did that for all codepoints :P 09:20
right now we _sort_ of do that 09:22
lizmat but we're basically talking about ascending integer values ?
samcv we still use up a 16bit integer * 52102
even though we do apply offsets for ranges of codepoints
yeah lizmat
so i'm thinking of just splitting it so i only have to store a short instead 09:23
and one of the reason we ONLY store 52,000 is because CJK ideographs and stuff the name is derived from its properties 09:24
well. that's not exactly why but.
but atm it's a huge if else tree 09:25
lizmat I was reminded of some search engine internals I was involved with 14+ years ago 09:26
samcv but yeah it does look like it does it by plane I think
lizmat it was able to encode an offset with about .5 bit in the end 09:27
(on average)
samcv nice
atm i think it's by block or something, it's divided up and i *think* each divided up section is deduplicated, but not the whole thing 09:28
09:29 Ven joined
jnthn morning o/ 09:37
samcv morning jnthn
09:38 domidumont joined
samcv also i'm guessing let's say I have a char *things[100] = { NULL, NULL.... }; it's going to take up the space of how many pointers? i mean it depends on how it stores where the pointers are to the pointers. because it has to store where the pointers are somewhere 09:44
jnthn (nwc10 job add) I can say that Vienna seems really nice; when I was planning a move back to central Europe a couple of years back, it was on my shortlist of options. :)
*ad
samcv if anybody knows. i'm guessing obv you can't depend on anything, but worst case an array of half null pointers, the null pointers could cost the size of 2 pointers for every NULL? 09:45
jnthn samcv: If you just have something like `static Foo *bar[] = { baz, NULL, wat, NULL };` you mean? 09:47
samcv yep
jnthn Pretty certain there's no compression of any kind on that
NULL will take as much as any other pointer in the array
samcv well i know AT LEAST it takes up the size of a pointer
but somewhere it has pointers to point to the NULL pointers 09:48
jnthn Since arrays are accessed by multiplying the element size by the index
samcv err or any pointer
ah ok. yeah
jnthn So the storage of an array is elems * sizeof(elem_type)
samcv kk, so all the pointers are all contiguous 09:49
jnthn Yeah, a C array will be contiguous in virtual memory :)
samcv so 1 pointer + the size of an arrays worth of pointers
jnthn *nod*
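The point jnthn makes can be checked with `sizeof`. This is a small sketch using a hypothetical eight-entry table, not the real names array: a NULL entry occupies exactly one pointer slot, the same as any other entry.

```c
#include <stddef.h>

/* Hypothetical table: half the entries are NULL. */
static const char *things[8] = { "A", NULL, "B", NULL, "C", NULL, "D", NULL };

/* Storage used by the array itself, excluding the string bytes. */
static size_t things_bytes(void) { return sizeof things; }

/* Number of elements, derived from the array and element sizes. */
static size_t things_elems(void) { return sizeof things / sizeof things[0]; }
```

On a typical 64-bit build `things_bytes()` is 64: eight slots of 8 bytes each, with no extra bookkeeping per NULL.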
samcv so if i compress the strings, instead of a NULL pointer for something with no name having 8 bytes, it will be stored in 2 bytes instead :) 09:58
in addition to the 1/3 size savings
10:07 dogbert17_ joined 10:08 Ven joined 10:16 brrt joined 10:28 Ven joined 10:48 Ven joined 11:08 Ven joined 11:23 Ven joined
timotimo if you have a whole lot of prefixes like that "LATIN LOWER CASE LETTER", you can have a little table of that 11:25
however
if you can't just pass around a pointer into the big table of strings
you have to malloc and free
which ... ugh
i imagine that problem would also exist if you use base40 for our strings 11:27
hmm. but most of the time we're already creating a VMString from those things 11:28
yeah, my worries are entirely unfounded. cool! 11:29
11:44 brrt joined 11:45 zakharyas joined 11:52 Ven joined
samcv hmm for some reason storing the strings in base 40 is not any smaller. at least compiled size. it must have a way of storing a char * [1000] more efficiently than many short arrays 11:59
not sure how to do it without pointers though. and having one array with all the pointers to the short arrays caused the file to be ridiculous. 12:00
:\
*file size
jnthn samcv: Which file are you checking the size of?
samcv unicode names file
jnthn Yes, I meant compiled output.
samcv as a char *unicode_names[2000] or whatever, compared to a bunch of unsigned short unicode_name_xx [] 12:01
yeah i'm talking about compiled
jnthn Yes, which compiled file did you look at?
samcv one I made?. all the file has in it is unicode names
jnthn Ah, OK
I thought maybe you'd built it into moar already
samcv that is the only thing in it. and I even removed all NULL and empty values
jnthn the size of moar wouldn't change
samcv heh 12:02
jnthn but libmoar.so would :)
timotimo how do you store the base 40 thing? C doesn't support 40 bits per array element, so you'll have to do things manually with bit masks and shifts if they are to be stored tightly
samcv timotimo, well
this is my script github.com/samcv/UCD/blob/master/l...Base40.pm6
you can store 3 characters inside two short's
err 12:03
3 characters inside 1 short
timotimo oh, base 40 is not 40 bits, duh
samcv yeah
you can even do different case if you want to get fancy
timotimo it's a bit late in the day to still be asleep inside your brain
samcv and use one of the extra characters as a shift 12:04
but c is compiling it to much bigger, but it should really be 1/3 the size in raw data
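A sketch of why three characters fit in one 16-bit value: 40^3 = 64000, which is below 65536. The alphabet and the function names below are assumptions made for illustration; they are not the ones used in samcv's Base40 script or decoder.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 40-symbol alphabet: index 0 is padding/terminator and
 * index 39 is left unused here (it could act as a shift marker). */
static const char B40_ALPHABET[40] = "\0ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -";

/* Map a character to its alphabet index (1..38). */
static int b40_index(char c) {
    for (int i = 1; i < 39; i++)
        if (B40_ALPHABET[i] == c) return i;
    return 0; /* unknown characters collapse to padding in this sketch */
}

/* Pack up to three characters into one 16-bit value. */
static uint16_t b40_pack(const char *s, size_t len) {
    uint16_t v = 0;
    for (size_t i = 0; i < 3; i++)
        v = (uint16_t)(v * 40 + (i < len ? b40_index(s[i]) : 0));
    return v;
}

/* Unpack one 16-bit value into three characters plus a terminator. */
static void b40_unpack(uint16_t v, char out[4]) {
    out[3] = '\0';
    for (int i = 2; i >= 0; i--) {
        out[i] = B40_ALPHABET[v % 40];
        v /= 40;
    }
}
```

With this alphabet "CAT" packs to 3*1600 + 1*40 + 20 = 4860, so three bytes of text become two bytes of data, which is where the one-third saving comes from.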
timotimo you're actually spelling it "short"? 12:05
i'm not sure if short is the same length everywhere
samcv possible
timotimo i'd go extra-sure and use MVMint8 or whatever
do you know of dwarfdump? i've used it in the past a few times to get the actual size of things, but i'm not sure how well it deals with arrays 12:06
samcv i have not used it before
i mean it must be storing extra pointers or things to the arrays or whatever idk how else it would be the same final size, well actually 10% bigger 12:07
and that's not making an array of pointers to these arrays
timotimo just dwarfdump path/to/libmoar.so and go through a pager. it's a firehose of info, but searching for identifiers from the code can get you where you need to be
12:07 Ven joined
samcv well i'm not compiling it into moar yet 12:07
timotimo can you show me a diff or something?
samcv just checking on it by itself
stand by 12:08
gist.github.com/ad1a2161645c2ac3b6...96245d8e7e here is names.str.c
timotimo ah, yes, it's a big'un 12:09
samcv ye that's the string one
uploading the base 40 encoded one now
gist.github.com/dd924ae9336bfb1605...956ded79ea here is that one
i am storing the number of base 40 numbers as the 1st element
timotimo ah, indeed 12:10
samcv but that should be smaller than a 64bit pointer, and i'd think that it should be smaller
since with the char * it has to store the pointers + the strings, but it must be packing the data differently? idk
i've only checked compiled size
timotimo it could be trying to ensure aligning
you could hexdump the file to see if you can spot lots and lots of null bytes
samcv yeah
maybe it's not aligning the strings but is aligning the arrays hmm 12:11
timotimo since we don't care a lot about aligned reads, you could make one big table with all the shorts in it and then assign &bigtable[offset] to all the uniname_* thingies
samcv yeah
timotimo give the C compiler less rope to hang us with
samcv ^
one big table sounds like it would work 12:12
timotimo you might be able to get the size of the C source to be a bit smaller by using 0x (if the number in decimal is longer than the number in hex)
samcv though how do i figure out where to go in this table for a specific point. i'm guessing the number of chars of everything is pretty long. wait actually i already computed this h/o 12:13
ok 272267 16 bit unsigned integers
is all the names
but then i'd have to store the index inside the big table to access them
timotimo maybe we only need a linear scan or prefix sum once when we write the code out to the .c file? 12:14
i.e. not actually store it, just compute it that one time from the lengths of string
worst case we can go through the big table and jump from one entry to the next just because at the start it has the length already written in the table 12:15
so we read 2, so we skip ahead 3, we read 6, we skip ahead 7, we read 5, we skip ahead 6, etc etc
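The read-a-length-then-skip walk timotimo describes could look roughly like this. It assumes every entry in one big table starts with its own count, and that a count of 0 marks a codepoint with no name; the function name `nth_entry` is hypothetical. The sample values are taken from the generated uniname_32 and uniname_35 arrays quoted later in the log.

```c
#include <stddef.h>
#include <stdint.h>

/* Each entry is a count followed by that many packed values; a count of
 * 0 stands for "no name" in this sketch. */
static const uint16_t bigtable[] = {
    2, 31041, 5000,
    0,
    4, 23253, 3418, 59969, 11760,
};
static const size_t bigtable_len = sizeof bigtable / sizeof bigtable[0];

/* Returns a pointer to the count word of the nth entry, or NULL. */
static const uint16_t *nth_entry(size_t n) {
    size_t pos = 0;
    while (pos < bigtable_len) {
        if (n == 0) return &bigtable[pos];
        pos += (size_t)bigtable[pos] + 1;   /* read k, skip ahead k + 1 */
        n--;
    }
    return NULL;
}
```

The walk is linear in the number of entries, so it suits a one-time pass (for example while filling a hash table) rather than repeated lookups.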
samcv i am not familiar with linear scan
hm 12:16
timotimo oh, i mean just go through all elements with a for loop
samcv yeah i mean when we want a name lookup we load a hash table anyway
timotimo oh, right, we do
samcv so could just start from 0, and store number of 16bit ints after, if we see a 0 then the char has no name
if it's not 0 then load the name 12:17
timotimo ah, we only need the list of starting points for the hash table anyway? 12:18
samcv starting? well i would think we'd start at 0 12:19
and then possibly skip certain ranges that are long enough to matter
timotimo er, i mean, the location where each name starts
like, do we need one array that gives us codepoint to position in table?
samcv well if we just go through the structure once and load the hash we don't care where it starts 12:20
no
well. i *think* we can access the name from the cp with the hash, but you would know better than me
timotimo only if we ever use the codepoint itself as a hash key
samcv atm i think we lookup the name in the array of char *'s for looking up by cp and when looking up by name use the hash
hm
timotimo sounds like we do need that 12:21
samcv we might already i'm not sure
idk at least we supply it to the macro in two places
i really don't know though
timotimo our new array of codepoint-to-name might look like unsigned short *cp_to_name[] = {&bigtable[0], &bigtable[4], &bigtable[18], &bigtable[42], &bigtable[1337], ...} 12:22
might want a macro for &bigtable[N] there ...
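A compilable sketch of that pointer table with a macro, assuming a tiny stand-in for the big table. The macro name `NAME_AT` is hypothetical. The addresses are constant expressions, so the table is filled in at build time and needs no startup initialization.

```c
#include <stdint.h>

/* Tiny stand-in for the big table: two length-prefixed entries. */
static const uint16_t bigtable[] = { 2, 31041, 5000, 4, 23253, 3418, 59969, 11760 };

/* Hypothetical macro name, standing in for the &bigtable[N] shorthand. */
#define NAME_AT(n) (&bigtable[(n)])

/* One pointer per entry; the offsets would be emitted by the generator. */
static const uint16_t *const cp_to_name[] = {
    NAME_AT(0),
    NAME_AT(3),
};
```

Each pointer costs 8 bytes on a 64-bit build, so storing plain integer offsets into the table instead of pointers would be narrower.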
samcv so store it in multiple big tables?
are we going to be able to find the index in the data structure directly or have to scan through it and load it? 12:23
timotimo we definitely can create the indices while generating the .c file
samcv imo it should be fine if we need to load a hash for it, because using \c[whatever] already has to load it
but there's so many codepoints
timotimo and we should. the less stuff we have to initialize at startup time, the better
samcv ok
so similar to how i was thinking of splitting up the indexes from cp to the bitfield? 12:24
timotimo i don't remember how you were going to do that, but ... probably?
samcv so that we don't have to store really big numbers inside it and can use narrower types?
yeah
was going to split it up into as many would fit into a 16bit uint
timotimo you mean so that the index fits into 16bit? 12:25
samcv which would halve the number of bytes needed to store the index. because it ends up being much more than the data needed to store the property values
yeah
timotimo OK. if the index always fits into a whole number, that's good
samcv we will have like 5k-10k bitfield rows, and then like
huge number of codepoints
timotimo because some codepoints share bitfield rows? 12:26
samcv err wait what am i thinking about. 10k will fit into a 16bit fine, uhm
most do
if you dedup it properly 12:27
timotimo content-addressed storage :D
samcv but yeah the names take up MUCH more space than all the other content 12:29
even without doing all these optimizations like splitting things up
timotimo oh twitter 12:30
samcv ?
timotimo someone tweets #perl6: how to use ... and Daily Tech Issues also tweets that exact thing
no matter, just some random tangent 12:31
samcv btw here is the C code that will convert from the base 40 numbers github.com/samcv/UCD/blob/master/base40decode.c 12:33
and it's nice because we can extend it later and add more letters 12:34
timotimo mhm, that looks simple enough
samcv that \n' there should prolly be a '-', but. we can remove the \0 ones and have one value be a shift
and if it sees that character in front of another it will change case or access another character
whatever we want really
timotimo right, we basically do utf8 :)
samcv hmm? 12:35
timotimo not important
samcv oh lol
timotimo tbh, it's not like utf8 at all
ok, so what's the current state of generating the .c from our list of names?
samcv in moarvm now?
or in my repo
timotimo whatever's newest with our ideas and experiments 12:36
samcv oh. well the base 40 is what we should try to go for, because 1/3 reduction in size
and figure out some way to get the data into some way that won't waste space
timotimo right 12:37
do we have code to stash all our base40 values into one big table yet?
and generate a second table that has a pointer into the big table for every codepoint?
so we can just get_chars(bigtable[codepoint], buf) or something? 12:38
samcv yeah i do
timotimo cool. but that still doesn't give us a small .o file?
samcv nope. it gets to be like 90MB
if i remove the table of pointers though, it becomes like 5% more than just an array of char *'s 12:39
timotimo oh 12:40
samcv but you can checkout the repo i have. and run UCD-download.p6, then run perl6 ./UCD-gen.p6 --less=1000 or something
timotimo don't forget if you have uniname_1, uniname_2, ... it'll also generate one entry in the symbol table for each of those
samcv and it will generate a file in ./build/names.c 12:41
gist.github.com/ad811b58480561061a...c69ded1e73
this is it with --less=2000
err i did 100. but yeah
you can run 'make' to compile both names and bitfield.c 12:43
bitfield.c, if you run the compiled file, bitfield. it will work fine
print out the property values and chars for at least like non control characters up to 100 or something
using the grapheme cluster break to figure out whether to print the character verbatim otherwise just show U+
timotimo did you hexdump the resulting file when you compile names.c? 12:44
it has a section in it that's just:
00002a10: 756e 696e 616d 655f 3000 756e 696e 616d uniname_0.uninam
00002a20: 655f 3200 756e 696e 616d 655f 3400 756e e_2.uniname_4.un
00002a30: 696e 616d 655f 3600 756e 696e 616d 655f iname_6.uniname_
00002a40: 3430 0075 6e69 6e61 6d65 5f34 3200 756e 40.uniname_42.un
00002a50: 696e 616d 655f 3434 005f 4954 4d5f 6465 iname_44._ITM_de
00002a60: 7265 6769 7374 6572 544d 436c 6f6e 6554 registerTMCloneT
00002a70: 6162 6c65 0075 6e69 6e61 6d65 5f31 3800 able.uniname_18.
samcv heh
timotimo 00002a80: 756e 696e 616d 655f 3132 0075 6e69 6e61 uniname_12.unina
00002a90: 6d65 5f31 3400 756e 696e 616d 655f 3539 me_14.uniname_59
00002aa0: 0075 6e69 6e61 6d65 5f35 3700 756e 696e .uniname_57.unin
samcv no wonder it uses so much space
timotimo 00002ab0: 616d 655f 3136 0075 6e69 6e61 6d65 5f35 ame_16.uniname_5
00002ac0: 3100 756e 696e 616d 655f 3130 0075 6e69 1.uniname_10.uni
i expect that's where your overhead comes from
samcv well then proof it must be smaller! 12:45
well the underlying data :P
jnthn o.O 12:46
Is that debug data, or a linking table, or?
timotimo the code was:
unsigned short uniname_32[3] = {2,31041,5000};
unsigned short uniname_33[7] = {6,8963,19253,2409,24597,20858,17600};
unsigned short uniname_34[6] = {5,28055,32060,15014,59721,29240};
unsigned short uniname_35[5] = {4,23253,3418,59969,11760};
so yeah, linking data
it might go away if we put "static" in front?
ilmari const too? 12:47
timotimo doesn't
ilmari are you looking at the actual code/text segments, or debug info
timotimo 1651889 16K -rwxr-xr-x. 1 timo timo 14K Jan 17 13:43 names* 12:48
1669101 16K -rwxr-xr-x. 1 timo timo 16K Jan 17 13:47 staticnames*
this is without any flags, so shouldn't have -g, right?
-O3 doesn't make it better
OK, strip makes it go down to 8.3K
ilmari use size, not ls
samcv from how big? 12:49
which file are you testing on?
ilmari timotimo: how about const?
samcv this is the 100 line file?
timotimo i added const, it made it bigger
samcv err 100 name file
heh
timotimo samcv: i took your last gist with names.c in it
samcv this ? gist.github.com/samcv/ad811b584805...c69ded1e73
kk
timotimo precisely 12:50
samcv ok i'm going to generate 2000 names. that may be better for comparison 12:51
timotimo OK
samcv and closer to real life
100 is a little small
timotimo well, as close as unicode gets to real life :P
ilmari making it static consts takes it from 8992 to 7648 12:52
samcv i go from 198 to 125K if i strip this
<samcv> 100 is a little small
ilmari text data bss dec hex filename
2503 1408 8 3919 f4f constnames
1115 2752 72 3939 f63 names
1115 2752 72 3939 f63 staticnames
samcv gist.github.com/7c95f29c5f89460f5b...dd72c3e689
err here it is
ilmari note how it moves from data to text, so it'll be mapped shared between processes 12:53
timotimo that is desirable
ilmari text data bss dec hex filename 12:54
55983 16608 8 72599 11b97 constnames
1115 71264 256 72635 11bbb names
1115 71264 256 72635 11bbb staticnames
the 2k-name one
samcv all three of those are 2k name ones? 12:55
ilmari yes
samcv so we want static but not const?
ilmari we want both, if they're actually constant 12:56
samcv even if the size is bigger?
ilmari the total size is 200 bytes bigger, but the actual data is moved from data (unshared) to text (shared) 12:57
samcv ah ok
ilmari so you'll save 55k per process
samcv 200 bytes is not much
nice 12:58
nwc10 200 bytes should be enough for ilmari's lunch :-)
samcv i get this warning though initialization discards ‘const’ qualifier from pointer target type 12:59
timotimo right
you need to put a const after the *
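A sketch of the declarations being discussed, reusing two of the generated arrays from earlier in the log. The table name `uninames` is hypothetical. The `const` before the `*` matches the const arrays and avoids the "discards qualifier" warning; the `const` after the `*` additionally makes the pointer table itself read-only.

```c
/* Read-only name data, taken from the generated arrays quoted above. */
static const unsigned short uniname_32[] = {2, 31041, 5000};
static const unsigned short uniname_35[] = {4, 23253, 3418, 59969, 11760};

/* const before the * : the pointed-to data is read-only.
 * const after the *  : the pointer table itself is read-only. */
static const unsigned short *const uninames[] = { uniname_32, uniname_35 };
```

Running `size` on builds with and without the qualifiers, as ilmari does, is the way to confirm where the data actually lands for a given compiler and platform.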
samcv yeah. just noticed that
damn you search and replace 13:00
timotimo also, the file still contains all the uniname_* strings :(
ilmari timotimo: that's just the debug info
timotimo not when stripped, though
'k
ilmari which doesn't actually get mapped at runtie
s/tie/time/
timotimo x-wing vs runtime fighter
ilmari randomascii.wordpress.com/2017/01/...nst-there/ 13:01
timotimo neat 13:02
13:09 Ven joined
timotimo we could totally get sizes of moar and libmoar.so from statisfiable 13:09
though ... we'd probably want per-moar-commit rather than per-rakudo-commit resolution there 13:10
samcv that would be cool 13:16
timotimo i asked in #whateverable 13:17
13:26 Ven joined
ilmari nwc10: 🌯 time! 13:28
13:37 ilmari[m] joined
timotimo hm, nqp-m -e '' takes about 14000 maxresidentk, a reduction of 250k would almost be noticeable :3 13:45
but with rakudo ... you'd hardly feel it at all :(
samcv timotimo, how do I go from a unsigned short *, and iterate over the values? 13:48
do I have to do bitwise operations to do that? 13:49
i have never tried to do this before... never needed to. just used normal ints
13:53 ggoebel joined
timotimo depends 14:13
if you want to use pointer arithmetic, i.e. p++, or if you can deal with an index into the thing
i don't actually know what your current use case is, so i can't advise well 14:14
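Both options timotimo mentions are plain C and need no bitwise operations. A sketch, assuming the length-prefixed layout used for the generated name arrays (first element is the count); the function names are hypothetical.

```c
#include <stddef.h>

/* Sum the values of a length-prefixed array using an index. */
static unsigned long sum_by_index(const unsigned short *p) {
    unsigned long total = 0;
    size_t count = p[0];
    for (size_t i = 1; i <= count; i++)
        total += p[i];
    return total;
}

/* The same walk using pointer arithmetic (p++). */
static unsigned long sum_by_pointer(const unsigned short *p) {
    unsigned long total = 0;
    const unsigned short *end = p + 1 + p[0];  /* one past the last value */
    for (p = p + 1; p < end; p++)
        total += *p;
    return total;
}
```

Incrementing an `unsigned short *` advances by one element, not one byte, so the compiler handles the element width.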
samcv storing pointers to the arrays in another array 14:15
for the unicode names
timotimo ah
i'd spell that &otherarray[index]
like, as constant
actually, you could even store indices into the otherarray in an array and do an extra indirection when looking up stuff 14:23
samcv what do you mean otherarray. you mean the array which contains pointers to the arrays?
timotimo um
samcv the data array or the indices array?
timotimo i was still thinking of when i suggested to have all data in one gigantic array 14:24
samcv yeah. i think that would be decent
I should see how big that would end up
14:52 Ven joined 15:04 zakharyas joined 15:23 Ven joined
timotimo i'm listening :P 15:39
16:02 brrt joined 16:18 brrt joined 16:22 zakharyas joined 16:28 colomon joined 16:30 brrt joined 16:38 Ven joined 16:53 Ven joined 18:08 Ven joined 18:14 Geth joined 18:17 zakharyas joined 18:24 domidumont joined 18:34 domidumont joined 18:50 FROGGS joined
nine What do you people use for profiling moar? 19:09
timotimo perf for rough ideas of what's going on (with -g) and callgrind if i need more details 19:10
aaw, perf c2c will be available starting with linux 4.10, but i'm still getting 4.9 kernels 19:13
nine Is "MVM_CALLSTACK_REGIION_SIZE" really correct or is the double I a typo? 19:45
timotimo it's being used consistently at least :) 19:46
Geth MoarVM: 21d7d6e603 | (Stefan Seifert)++ | 2 files
Fix typo in MVM_CALLSTACK_REGIONS_SIZE's name

Hopefully reduces confusion and distraction for the next one to dig into this code
19:47
nine So, where were I? 19:48
timotimo were you singing death, death, death, death, devil, devil, evil, evil songs? 19:50
20:00 domidumont joined 20:10 Ven joined 20:26 Ven joined
jnthn Heh, I can to read the thing three times to spot the doubled letter :P 20:28
*I had
20:41 Ven joined 20:55 Ven joined 21:03 zakharyas joined 21:40 Ven joined 21:55 Ven joined 22:40 cygx joined
cygx so there's some weirdness happening when trying to build dyncall (and dyncallback specifically) with the latest 32-bit version of Strawberry Perl 22:42
yoleaux2 30 Dec 2016 21:41Z <samcv> cygx: i will look at it as soon as I wake up again. if i misinterpreted the bug
cygx for some reason I've yet to figure out, make thinks the C files should be compiled with C++
consequently, it will first fail to not find the dyncall header (as CFLAGS will not be picked up), and it will finally fail to link due to C++ name mangling 22:43
good night o/ 22:45