github.com/moarvm/moarvm | IRC logs at colabti.org/irclogger/irclogger_logs/moarvm
Set by AlexDaniel on 12 June 2018.
nine dogbert17, Altai-man_: I've thought about that commit for a bit when I was in bed, but I'm not sure. It's a hard hit on performance, but it also supposedly fixes the regression github.com/rakudo/rakudo/issues/3569 07:19
Geth MoarVM/master: 4 commits pushed by (Dave Lewis)++, niner++ 07:26
MasterDuke timotimo: did you mean put the code to increase the initial number of FSA bin pages in rakudo/rakudo-m.c? 08:05
timotimo MasterDuke:i think so, yeah 10:43
MasterDuke timotimo: currently github.com/rakudo/rakudo/blob/mast...build.c.in doesn't really do much other than execing moar with some arguments. think it should be a new argument to moar? maybe a mapping of bins and initial # of pages? 12:04
timotimo oh 12:07
MasterDuke btw, those numbers were on my branch. i updated the gist gist.github.com/MasterDuke17/bf4bd...d8e892e481 to include numbers on master also 12:25
re github.com/MoarVM/MoarVM/blob/mast...c.c#L5-L10 what other improvements are possible? 12:27
timotimo do hashes use the FSA already? i do believe i wrote code for that once, but not sure if it got merged 12:37
MasterDuke think so 12:38
github.com/MoarVM/MoarVM/blob/mast...Hash.c#L34 12:39
timotimo we could probably have a look at how many instances of each set of keys we have at multiple points during compilation, for example
someone, perhaps jonathan, mentioned replacing the most common hashes with classes perhaps? 12:40
MasterDuke yeah, lizmat and jnthn were chatting about that recently 12:41
lizmat yeah, but that was in the bootstrap mostly 12:42
MasterDuke but i was more thinking, are there improvements to the FSA itself. if we end up using it for more things, making it better/faster would have a bigger impact
lizmat MasterDuke: care to blog about the work you've been doing?
lizmat sure would like to understand better what all of this FSA business is :-) 12:43
timotimo the bootstrap, eh? 12:44
i wonder if there's anything we can do with QAST::Want nodes; half the time they have three children, one QAST::WVal, one literal string, and one QAST::IVal, QAST::SVal, or QAST::NVal 12:46
the other half they have one sub-tree, the "v" or "Vv" or whatever literal string, and another subtree
that's a three-item array every time 12:47
but i think we use $blerp[0] a lot of the time, and for @$blerp and such 12:48
changing all of QAST code to work with that is probably not quickly done, and the payoff isn't even assured
also, modifications to the array are often done as well 12:49
MasterDuke lizmat: honestly, i'm rather unlikely to blog. i've never blogged anything online before and i don't really like writing. what's more likely is maybe doing a presentation sometime/somewhere. i actually like public speaking even less than writing, but feel like there's more of a reward
lizmat ok, well, eh, I would take a paragraph about FSA here also :-) 12:50
timotimo if the Md would prefer, i can give a bit of an explanation
MasterDuke go for it, please 12:51
timotimo sure
ok, so we're mostly switching places that use malloc to places that use the FSA 12:52
MasterDuke if github.com/MoarVM/MoarVM/pull/689 does get merged maybe i could be persuaded to write a paragraph for the weekly
timotimo malloc has a couple of nice affordances, like every piece of allocated memory has pertinent information stored next to it, like the size of the allocated piece of memory 12:53
that way, free can know to do the right thing when you give it a pointer that had come from malloc or realloc
lizmat so far so good :-)
timotimo but in most places we keep around a "size of the allocated memory piece" anyway, for example when we're adding stuff over time and realloc whenever space runs out
there isn't a public API to get the size of a malloced piece of memory, at least to my knowledge 12:54
at the very least not portable i guess
storing the information next to the data makes it convenient to use, but it also increases the amount of memory taken up by allocated memory
and malloc also rounds up, not exactly sure how much, but probably generously 12:55
and malloc also has to handle allocations of all kinds of different sizes, and is expected to pack them efficiently in memory
lizmat still with you 12:56
timotimo the FSA has a different API that always requires you to pass the size of an allocated thing whenever you do an operation on the allocated data
so fsa_alloc takes the size just like malloc does, but free also takes the size
fsa_free in the last one there 12:57
and fsa_realloc takes the old size and the new size
lizmat hmmm... that *feels* potentially more fragile 12:58
timotimo yes, that's right
that's what makes changing from malloc to fsa a bit more difficult
lizmat otoh, you would have all memory allocation data in one place
timotimo that's right
the FSA allocates space for one class of sizes for like a hundred items at a time 12:59
lizmat but that also implies that fsa_alloc needs to be smarter with regards to memory fragmentation, no ?
timotimo the trick is that it's split into these "bins" where all objects are about the same size
lizmat not exactly the same size?
timotimo so you don't get the "where can i fit this?" problem that gives you internal fragmentation
not exactly, no, but only rounded up a little bit 13:00
lizmat I could see that everything in a bin has the same size, even if less is actually needed
turning a pointer into an index ?
timotimo that would require everything that wants to use fsa-allocated memory to be a bit more complicated
and going from an index to a usable pointer that regular code can work with may be expensive, too 13:01
so we do keep regular pointers as the API
lizmat ah, there's that of course :-)
timotimo so you can imagine that especially for very small allocations the FSA has far less overhead than using malloc
allocation into the FSA is done via a free list, i.e. whenever you free something, the free list now starts with that item and the item points at the previously first-in-line free spot and so on 13:02
if the free list is exhausted, a new page is added 13:03
on top of that, last year or so the FSA got per-thread free-lists in addition to the global free-list
along with some heuristics for situations like producer-consumer, where one thread is the one that does all the allocations and another thread is the one doing all the frees
lizmat I see, so we're at the bleeding edge of memory allocation logic
timotimo not entirely, there's still some really impressive tricks 13:04
but we're pretty far along :)
anyway, the per-thread free-lists are cool because when you're allocating and freeing the same size of object over and over, you may stay in a situation where you never need to hold a lock or do any inter-thread synchronization of any kind 13:05
MasterDuke my recent stats about number of add_page calls just running `raku -e ''` suggests to me that we need to increase the amount of memory we initially malloc to populate the FSA 13:06
timotimo not sure if i may have missed anything
MasterDuke: can you output the size of the pages we actually allocate for each bin, and how many items fit in each?
i think the numbers are different for each bin 13:07
lizmat timotimo: thanks for this exposé, pretty sure I'm not the only one who found this enlighting 13:08
timotimo sure thing!
oh! how could i forget this 13:09
the FSA also has a neat little mechanism called "free at safepoint"
with that you can free memory pieces that other threads may currently still be accessing, without causing trouble 13:10
because FSA Safepoints™ only happen outside of functions, usually
MasterDuke timotimo: github.com/MoarVM/MoarVM/blob/mast....h#L91-L92 suggests it's always 128 items per page
timotimo: gist.github.com/MasterDuke17/bf4bd...d8e892e481 updated with page sizes 13:11
timotimo could you output how often a request for 2 or 4 bytes comes in? that's where rounding up is, relatively, the biggest overhead 13:12
though in absolute terms it may not matter at all
lizmat wouldn't it make sense to up an request for <8 to 8 on 64-bit machines, with the proper alignment and such ? 13:15
MasterDuke timotimo: gist updated 13:16
timotimo that's what the FSA currently does i think
MasterDuke and remember, this is just for `raku -e ''`
timotimo OK, it literally doesn't happen ever in this example
MasterDuke i don't see any requests for 2 or 4 bytes. wonder if that would change on my branch that uses the fsa for strings 13:17
timotimo good question 13:18
anything smaller than or equal to 8 bytes could in theory have its data live where the pointer is 13:19
but that's possibly a boatload of special case code all over the place
lizmat yeah, that smells of Perl type of optimizations :-) 13:27
timotimo aye, it's fairly common in other places, like string implementations in c++ 13:28
i've seen it mostly be called "in-situ storage"
lizmat reminds me of turning in IV to an NV or a PV or a GV o :-) 13:29
nine Btw. for topics like this, the Linux kernel offers plenty of advanced techniques. It contains a range of allocators (like the slab cache), synchronization privimites and concurrency safe data structures 13:56
MasterDuke yeah, i saw some articles on phoronix recently about facebook wanting to upstream a new slab allocator that apparently is already giving them tons of memory reduction in production 14:01
MasterDuke timotimo: why doesn't this patch gist.github.com/MasterDuke17/4840f...a0d5395ee2 show a reduced number of calls to add_page? 14:37
ah, needed to also adjust alloc_limit 14:41
timotimo MasterDuke: can you try just increasing the page_size and not upping the num_pages? 15:03
MasterDuke timotimo: by increasing MVM_FSA_PAGE_ITEMS, or adding some other factor to the calculation? 15:34
timotimo the same number you're currently using to generate more than one page could be used to make a single, bigger page instead 15:36
MasterDuke timotimo: i changed it to `if (bin == 2 || bin == 5) page_size *= 24; else if (bin == 15) page_size *= 8; else if (bin <= 8) page_size *= 4;` in setup_bin and add_page. updated numbers in the gist 16:51
timotimo that looks pretty good, right? 16:52
am i wrong to expect some bin numbers to appear twice in that listing? 16:53
since they would get one huge page and then normal-sized pages?
or are the pages now always bigger, rather than just the very first one?
MasterDuke always bigger 16:54
timotimo oh, hm, ok 17:38
would you be interested in collecting the data for a core setting compilation? 17:40
MasterDuke timotimo: sure. which stats? on my branch or master or both? 18:19
timotimo both would be fine, but your branch is more interesting first maybe? 18:39
MasterDuke but unmodified otherwise, right? i.e., not my my recent changes to increase the size of pages 18:40
timotimo after would be okay i think 18:41
MasterDuke log bytes requested? or just when adding pages?
timotimo i thought "just when adding pages" for now
the other would likely take ages to make the whole output
MasterDuke true 18:43
timotimo could perhaps want to turn it into a binary format, or at least much shorter output 18:45
%d is probably a tiny bit slower than %x
MasterDuke eh, doesn't really take too much longer
timotimo ok 18:46
MasterDuke gist.github.com/MasterDuke17/0cde6...89c817ef53 18:49
oh, let me add the diff 18:50
updated 18:51
MasterDuke timotimo: see anything useful there? 21:07
timotimo it doesn't really speak to me just yet 21:14
i'm wondering if checking what lines tend to do what sizes gives us any interesting spots, like spots that only ever allocate a single size? but no idea what that could actually give us
MasterDuke fwiw, that diff against master brings down the number of instructions callgrind reports for `raku -e ''` by ~600k. from ~702,800,000 to ~702,200,000
well, if they don't already maybe they could do a setelems 21:15
guess it depends it the size they usually produce causes the array to realloc, and if so, how much is wasted? 21:16
afk for a bit
timotimo i'm thinking more about the other places where the FSA is used, but yeah, setelems would be one that has boatloads of different sizes, i'd expect 21:24
MasterDuke it definitely seems worth it to me to somehow (either bigger pages or more created at setup) increase the initial size of the FSA. 400 (or 1k on my branch) calls to add_pages just for `raku -e ''` seems like an unneeded cost to pay 22:16
timotimo after a specific size we'll end up mmapping instead of mallocing, too 22:51
i'm not entirely sure what the exact costs come down to for those two operations