Set by AlexDaniel on 12 June 2018.
brrt \o 12:58
jnthn o/ brrt 12:59
brrt ohai jnthn
I find that I'm not sure how pass-by-reference works in nativecall 13:00
i would have expected that we'd pass a pointer to the MVMRegister in args 13:01
but that doesn't appear to be how it works
jnthn I don't know, alas 13:24
nine++ probably does
Geth MoarVM/vectorization: 8 commits pushed by (Timo Paulssen)++ 14:08
timotimo lizmat: ^- here's the op i was talking about 14:09
lizmat timotimo: does it come with documentation in ops.markdown ? 14:10
timotimo not yet
my num @a = 1e0..500_000e0; my num @b = 500_000e0...1e0; my num $c = 5e0; my num @out; my $time = now; for ^500_000 { @out[$_] = @a[$_] + @b[$_] * $c; }; say now - $time; say @out[99] 14:11
evalable6 0.34655233
timotimo use nqp; my num @a = 1e0..500_000e0; my num @b = 500_000e0...1e0; my num @c = 5e0; my num @out; my $time = now; nqp::vectorapply(@b, @c, @b, 95, 1, 64); nqp::vectorapply(@a, @b, @out, 93, 0, 64); say now - $time; say @out[99]
those are roughly equivalent 14:12
because 95 is mul_n and 93 is add_n
one of them is a cross operator, the one with a 1 in between, the other is a zip operator, the one with a 0 in between
lizmat that looks pretty cool 14:13
timotimo what i'd like you to have a look at is:
make @out = @a Z+ @b X* $c turn into vectorapply calls
they currently only work for 64bit wide arrays of int and num, and if it's a cross operator the smaller one has to be a native array, too, of the right kind and size, with only one element 14:14
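[editor's sketch] The zip/cross distinction timotimo describes can be modelled in a few lines of Python. This is only a toy model inferred from the log, not MoarVM's implementation: the function name `vector_apply`, its argument order (two sources, a destination, an operation, a cross flag), and the use of Python's `operator` functions in place of the numeric opcodes 95 (mul_n) and 93 (add_n) are all assumptions made for illustration.

```python
from array import array
import operator


def vector_apply(left, right, out, op, cross):
    """Toy model of a zip/cross vector op (hypothetical, not the MoarVM op).

    cross=False: out[i] = op(left[i], right[i])   (like Z+)
    cross=True:  out[i] = op(left[i], right[0])   (like X* with one element)
    """
    if cross:
        # The log says the smaller operand must be a one-element native array.
        if len(right) != 1:
            raise ValueError("cross mode expects a one-element right operand")
        scalar = right[0]
        result = [op(x, scalar) for x in left]
    else:
        if len(left) != len(right):
            raise ValueError("zip mode expects operands of equal length")
        result = [op(x, y) for x, y in zip(left, right)]
    # Build the result first so that out may alias left (in-place use).
    out[:] = array(out.typecode, result)
    return out


# Mirrors the shape of the example in the log on a tiny input:
# first b = b X* c (in place), then out = a Z+ b.
a = array("d", [1.0, 2.0, 3.0])
b = array("d", [3.0, 2.0, 1.0])
c = array("d", [5.0])
out = array("d")
vector_apply(b, c, b, operator.mul, cross=True)
vector_apply(a, b, out, operator.add, cross=False)
```

After both calls, `out` holds `a[i] + b[i] * c` for each index, which is what the `for ^500_000` loop computes element by element.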
lizmat intriguing! :-) looks very cool
timotimo \o/ 14:15
lizmat fwiw, I was first going to take a stab at documenting the new MAIN interface and write tests for it
timotimo sure!
no hurry :)
lizmat and then I was planning to have a look at R#2360, attempting to fix nqp::p6store
synopsebot R#2360 [open]: my %*FOO is Set = <a b c> dies
timotimo the vectorapply version of that code can run 300 times and still finish a tiny bit faster than the for ^500_000 version 14:17
lizmat and before all of that, first some sun / wind / cycling&
timotimo: so you're saying that's potentially 300x as fast ?
timotimo maybe i'll figure out soon-ish why it's even faster to have $c replaced with a 500_000 element @c array and using @c[$_] as well 14:18
yeah, and potentially about 1.5kx faster than using Z+ and X* 14:19
mhhh, my num @a = 1e0..500_000e0; takes about no time at all, but my num @a = 500_000e0...1e0; takes about 10 seconds; we recently optimized special cases of ... for for loops, surely we can put that into the push_all for the ... iterator, too :) 14:24
brrt timotimo++ pretty cool work 17:08
lizmat timotimo: afaik, ... is still a gather / take combo 17:12
nine brrt: but....that should be exactly how it works? 17:32
brrt: that's also why I added a getarg op for reading the value back from the args buffer
brrt oh, really 17:37
..... so, I don't have to add a 'copy-back-to-frame' for rw arguments 17:38
that's good news
that simplifies things tremendously
nine++ 17:39
nine My initial implementation just read the value from the local with lots of assumptions about which local that might be. But that was a tiny bit too fragile ;) 17:44
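[editor's sketch] The idea nine confirms (pass a pointer to the argument slot, then read the value back from the args buffer after the call) can be illustrated with Python's `ctypes`. Everything here is a stand-in: `bump` is a hypothetical native function implemented as a ctypes callback, and the one-element array only plays the role of an args buffer. It does not show MoarVM's actual data structures or the real getarg op.

```python
import ctypes

# Hypothetical native signature: void bump(int64_t *value)
BumpProto = ctypes.CFUNCTYPE(None, ctypes.POINTER(ctypes.c_int64))


def _bump(ptr):
    # The callee writes through the pointer it was handed.
    ptr[0] = ptr[0] + 1


bump = BumpProto(_bump)


def call_with_rw_arg(initial):
    """Call bump() with a pointer into a toy args buffer, then read it back."""
    args = (ctypes.c_int64 * 1)(initial)  # toy args buffer with one slot
    slot_ptr = ctypes.cast(args, ctypes.POINTER(ctypes.c_int64))
    bump(slot_ptr)
    # Reading the slot afterwards plays the role the log gives to getarg:
    # no separate copy-back step is needed, the buffer already has the value.
    return args[0]
```

For example, `call_with_rw_arg(41)` returns the value the callee left in the slot.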
timotimo lizmat: OK! 17:46
brrt yeah, i can imagine :-) 17:47
timotimo so i'm using nine's example profile data again, and the "paths" data for one function that appears in 522 call sites was a proud ~12 megabytes, which my program took about one and a half minutes to put together into a json blob
with a whole lot of memory usage 17:48
i.e. when i tried it earlier, it tried to dump core because it reached the maximum my ram had to offer
timotimo that's not quite acceptable %) 17:48
timotimo also, it'll be interesting to build the flame graph data when there's theoretically hundreds of megabytes of data in there 17:49
timotimo brrt: you think the vectorization branch is an acceptable way forward? it's surely not optimal, but it's certainly faster than what our zip/cross ops currently can do 17:51
brrt I have totally not reviewed it 17:52
timotimo it's probably more efficient to try to do all operations on each little bunch of data?
rather than going through all data with one operation, then through all data with another
and it's surely wasteful to require intermediate arrays to be made 17:53
brrt hmmmm 17:54
timotimo though if every operation only goes from two arrays to one, i'd assume most of the time you can have at most one temporary array?
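[editor's sketch] The trade-off discussed here, one full pass per operation with a full-size temporary versus doing all operations on each small bunch of data, can be sketched in Python. The function names and the chunk size are made up for illustration, and pure Python will not show the cache benefit; the point is only the shape of the two loops and the size of the temporary.

```python
from array import array


def staged(a, b, c):
    """One pass per operation; the temporary is as large as the input."""
    tmp = array("d", (y * c for y in b))  # full-size intermediate array
    return array("d", (x + t for x, t in zip(a, tmp)))


def fused_chunked(a, b, c, chunk=4096):
    """All operations per chunk; the temporary never exceeds one chunk."""
    out = array("d", [0.0]) * len(a)
    for start in range(0, len(a), chunk):
        stop = min(start + chunk, len(a))
        tmp = [y * c for y in b[start:stop]]  # chunk-sized intermediate
        out[start:stop] = array(
            "d", (x + t for x, t in zip(a[start:stop], tmp))
        )
    return out
```

Both functions compute `a[i] + b[i] * c`; they differ only in how much intermediate data is alive at once.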
brrt in honesty you may have exceeded my expertise :-)
timotimo haha
i have no expertise either, that's why i just let the C compiler do 100% of the work
brrt scarily, I'm getting good at writing adhoc jit templates 17:57
not the most portable of skills..
dogbert11 brrt: do you have any theories as to why some spectest files fail when run with MVM_JIT_EXPR_DISABLE=1 ?
brrt dogbert11: nope, can you point me to the right ones? 17:58
dogbert11 brrt: try running: MVM_JIT_EXPR_DISABLE=1 ./perl6 t/spec/S05-mass/properties-block.t
brrt huh, that's funny 17:59
dogbert11 I thought so too. quite strange
brrt goes away with MVM_JIT_DISABLE=1 18:00
okay, I can probably figure that out
I'll put it somewhere on my todo list
dogbert11 ++brrt
.oO( we need an inverse jit bisect )
I need to fixup jit bisect anyway ...
anyway, I'll have to do all that later, afk for now :-) 18:02
timotimo oh, the cro process is still at like 3.9 gigs RSS 18:05
japhb yikes 18:27
timotimo oh lord, this can't be right 18:29
the json was being created with :pretty 18:30
that's pretty bad for a deeeeeeeply nested structure
routine-paths in 2.7811155 18:31
routine-paths json in 2.95350603: 263873 characters
^- with :!pretty
routine-paths in 2.910559
routine-paths json in 120.8043488: 13517443 characters
^- with :pretty
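[editor's sketch] Why pretty-printing hurts so much on a deeply nested structure: every line is indented in proportion to its depth, so the whitespace grows roughly with the square of the nesting depth. The Python snippet below shows the effect with the standard `json` module; the tree shape and the `nested` helper are invented for illustration and are not the profiler's real data.

```python
import json


def nested(depth):
    """Build a chain of call-tree-like nodes, one child per level."""
    node = {"name": "leaf", "children": []}
    for level in range(depth):
        node = {"name": f"frame-{level}", "children": [node]}
    return node


tree = nested(100)
compact = json.dumps(tree, separators=(",", ":"))  # like :!pretty
pretty = json.dumps(tree, indent=4)                # like :pretty

print(len(compact), len(pretty))
```

Both strings decode to the same data; only the indentation differs.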
japhb timotimo: When you're doing really serious vector/matrix/tensor operations, beyond a certain point runtime will be utterly dominated by memory hierarchy effects. Chunking large arrays so that all operations on a given set of data fit in fast caches makes a huge difference (consider e.g. multiplying a pair of 8k x 8k matrices).
timotimo japhb: sadly, that means much more work :) 18:32
japhb timotimo: Actually ... maybe not. It may be that if you want to do that sort of thing, we instead automate using one of the fast linear algebra libraries.
timotimo true 18:33
japhb Don't get me wrong, I think your current research is very useful. I was just answering your question earlier about vectorization of large volumes of data.
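[editor's sketch] The chunking japhb describes is usually called blocking or tiling. Below is a minimal pure-Python sketch of a blocked matrix multiply, assuming square matrices stored as lists of rows. The function name and block size are illustrative; in practice one would reach for an optimized linear algebra library, as japhb suggests, and pure Python will not show the cache effect itself.

```python
def matmul_blocked(a, b, n, block=64):
    """Multiply two n x n matrices tile by tile.

    Each (ii, kk, jj) step touches only block x block tiles of a, b and c,
    so in a compiled language those tiles can stay in fast cache.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

The result matches the plain triple loop; only the order in which elements are visited changes.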
timotimo alternatively, maybe the liboil compiler would actually be nice to put into moar 18:34
yeah, i think i got you right :)
TBF with the stuff i've implemented so far, i don't think matrix multiplication is particularly possible to implement 18:35
japhb timotimo: Have you looked at PDL from the Perl 5 world? 18:36
timotimo i have not
japhb It's interesting just from the point of view of the things it makes easy, and the magic it does behind the scenes to make that fast-ish. 18:37
timotimo i've looked a little into numpy 18:38
japhb But it was not trying to do true CPU vectorization, rather just able to pump large multidim arrays into optimized C routines
timotimo scipy has a thing that lets you write C++ code using some c++ library that does multidim arrays that you can slice every which way
japhb It could not, for example, hold a candle to the real C/C++ fast linear algebra stuff. Still, it beat the blazes off doing things element-wise.
timotimo last time i looked it was barely documented, barely hackable if you want very specific behaviour of the compiler, and apparently hadn't been touched in a couple of years 18:39
got a tree with 162338 nodes 18:42
routine-paths in 272.2697089
oh jeez here we go
in comparison, the stuff i pasted above had "got a tree with 5755 nodes" 18:43
routine-paths json in 90.608918: 7433700 characters 18:44
japhb m: say 162338 / 5755, 272.2697089 / 2.7811155
camelia 28.20816797.899461169
japhb m: say 162338 / 5755, ' ', 272.2697089 / 2.7811155
camelia 28.208167 97.899461169
timotimo now chrome is chugging along on the json and the react component tree
japhb Hmmm, some nonlinear effects there, but at least not O(n**2)
timotimo aye, you must imagine the call graph and we've got a set of leaf nodes 18:45
japhb Are you sorting the keys? Looks like there might be an N log N effect
(Just staring at the ratios) 18:46
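[editor's sketch] The ratio-staring can be made slightly more explicit. Using only the four numbers from the log, the Python below estimates an empirical scaling exponent and compares the observed slowdown with what pure n log n and pure n**2 growth would predict. Two data points cannot settle the complexity, so treat this as a rough sanity check rather than a measurement.

```python
import math

# Numbers taken from the log above.
small_nodes, large_nodes = 5755, 162338
small_secs, large_secs = 2.7811155, 272.2697089

node_ratio = large_nodes / small_nodes
time_ratio = large_secs / small_secs

# If time ~ nodes**p, then p = log(time_ratio) / log(node_ratio).
exponent = math.log(time_ratio) / math.log(node_ratio)

# Slowdown predicted by pure n log n and by pure n**2 growth.
nlogn_ratio = node_ratio * math.log(large_nodes) / math.log(small_nodes)
quadratic_ratio = node_ratio ** 2

print(f"observed {time_ratio:.1f}x, exponent {exponent:.2f}")
print(f"n log n predicts {nlogn_ratio:.1f}x, n**2 predicts {quadratic_ratio:.1f}x")
```

The observed slowdown lands between the two predictions, which fits japhb's reading: clearly worse than linear, well short of quadratic.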
timotimo and the code goes via the parent ids towards the known roots
japhb Ah, yeah, that would do it
timotimo i should be able to construct an sql query that picks every "current node"'s parent rather than going node-by-node 18:46
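[editor's sketch] One way to let the database walk the parent links for all starting nodes at once is a recursive common table expression. The Python/SQLite sketch below assumes a hypothetical `calls(id, parent_id, routine_id)` table; the real profiler schema and timotimo's eventual query are not shown in the log, so the table, column, and function names here are illustrative only.

```python
import sqlite3


def ancestor_paths(conn, routine_id):
    """Return {leaf_id: [leaf_id, parent, ..., root]} for one routine.

    A single recursive query advances every path by one parent per step,
    instead of issuing one lookup per node from application code.
    """
    rows = conn.execute(
        """
        WITH RECURSIVE path(leaf_id, node_id, depth) AS (
            SELECT id, id, 0 FROM calls WHERE routine_id = ?
            UNION ALL
            SELECT path.leaf_id, calls.parent_id, path.depth + 1
            FROM path JOIN calls ON calls.id = path.node_id
            WHERE calls.parent_id IS NOT NULL
        )
        SELECT leaf_id, node_id FROM path ORDER BY leaf_id, depth
        """,
        (routine_id,),
    ).fetchall()
    paths = {}
    for leaf_id, node_id in rows:
        paths.setdefault(leaf_id, []).append(node_id)
    return paths


def demo_connection():
    """Tiny in-memory call tree: 1 is the root, routine 99 is called twice."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls (id INTEGER PRIMARY KEY, parent_id INTEGER, routine_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO calls VALUES (?, ?, ?)",
        [(1, None, 10), (2, 1, 20), (3, 1, 30), (4, 2, 99), (5, 3, 99)],
    )
    return conn
```

With the demo data, `ancestor_paths(demo_connection(), 99)` yields one path per call site of routine 99, each running from the call node up to the root.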
diakopter heh portable 19:58
timotimo i'm not sure where to stop adding "vectorized" stuff. like, i think coercing an array of int to an array of num and vice versa seems very useful to have 23:29
but coercing int or num to str ... useful for sure, but not appropriate for the vectorapply op, i don't think 23:30