MasterDuke samcv: so compiling the core setting is now *not* slower with that PR? 00:42
01:00 lizmat joined 01:03 MasterDuke_ joined 01:28 leedo__ joined, avar joined, moritz joined 01:31 yoleaux joined
samcv yep 01:45
i fixed it. it was not returning the same string when it was already flat but i remedied that 01:46
MasterDuke_ ah, cool 01:48
02:01 lizmat joined
Geth MoarVM: de6b0e4b13 | (Samantha McVey)++ | src/strings/ops.c
collapse_strands with memcpy if all strands are same type 4x faster

If all the strands to collapse are of the same type (ASCII, 8bit, or 32bit) then use memcpy to collapse the strands. If they are not all the same type then we use the traditional grapheme iterator based collapsing that we previously used to collapse strands.
If it's 8bit and a repetition with only one grapheme, it will use memset to more quickly write the memory.
This is 4-4.5x faster as long as all the strands are of the same type.
02:02
MoarVM: e876f1484e | (Samantha McVey)++ (committed using GitHub Web editor) | src/strings/ops.c
Merge pull request #753 from samcv/collapse_better

collapse_strands with memcpy if all strands are same type 4x faster
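A minimal sketch of the idea in the commit message above, in C. The `strand` struct, the `storage_type` enum and this `collapse_strands` signature are hypothetical simplifications for illustration and are not MoarVM's actual types or code (ASCII and 8bit are folded into one 8-bit case here). When every strand has the same storage type, the data is block-copied with memcpy, or memset for an 8-bit single-grapheme repetition; otherwise each element is widened to 32 bits one at a time, standing in for the grapheme-iterator path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified strand: a run of code units that is
 * repeated (repetitions + 1) times. Not MoarVM's real layout. */
typedef enum { STORAGE_8, STORAGE_32 } storage_type;

typedef struct {
    storage_type type;
    const void  *data;        /* int8_t[] or int32_t[], depending on type */
    size_t       length;      /* number of graphemes in data */
    size_t       repetitions; /* extra copies beyond the first */
} strand;

static size_t unit_size(storage_type t) {
    return t == STORAGE_8 ? sizeof(int8_t) : sizeof(int32_t);
}

/* Flattens n strands into one malloc'd buffer. Sets *out_type and
 * *out_len (in graphemes). Returns NULL for an empty result or on
 * allocation failure. */
void *collapse_strands(const strand *s, size_t n,
                       storage_type *out_type, size_t *out_len) {
    size_t total = 0, i, r, j;
    int same = 1;
    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        total += s[i].length * (s[i].repetitions + 1);
        if (s[i].type != s[0].type)
            same = 0;
    }
    *out_len  = total;
    *out_type = (same && n > 0) ? s[0].type : STORAGE_32;
    if (total == 0)
        return NULL;

    if (same) {
        /* Fast path: all strands share a type, so copy raw bytes. */
        size_t usz = unit_size(s[0].type);
        char *out = malloc(total * usz), *p = out;
        if (!out)
            return NULL;
        for (i = 0; i < n; i++) {
            size_t reps = s[i].repetitions + 1;
            if (s[i].length == 0)
                continue;
            if (s[i].type == STORAGE_8 && s[i].length == 1) {
                /* One 8-bit grapheme repeated: a single memset. */
                memset(p, *(const int8_t *)s[i].data, reps);
                p += reps;
            } else {
                size_t bytes = s[i].length * usz;
                for (r = 0; r < reps; r++) {
                    memcpy(p, s[i].data, bytes);
                    p += bytes;
                }
            }
        }
        return out;
    }

    /* Slow path: mixed types, widen every element to 32 bits. */
    int32_t *out = malloc(total * sizeof(int32_t)), *q = out;
    if (!out)
        return NULL;
    for (i = 0; i < n; i++)
        for (r = 0; r <= s[i].repetitions; r++)
            for (j = 0; j < s[i].length; j++)
                *q++ = s[i].type == STORAGE_8
                     ? ((const int8_t *)s[i].data)[j]
                     : ((const int32_t *)s[i].data)[j];
    return out;
}
```

Note that the memset shortcut only applies to 8-bit storage: a one-grapheme repetition held in 32-bit storage still goes through the memcpy loop.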
MasterDuke samcv: interesting. i just tested this one-liner: `my $a = "a" x 1_000_000; for ^1000 {$a ~~ /./;}; say now - INIT now` 02:34
4.3s before your PR, 93% of the time spent in iterate_gi_into_string 02:35
5.9s after the PR, 43% in collapse_strands, 33% in __memmove_sse2_unaligned_erms, 10.6% in [email@hidden.address] 5.4% in memcpy@plt 02:36
02:56 ilbot3 joined 03:17 colomon joined
samcv MasterDuke: well it's 2x faster if it is more than one character repeated 04:18
"ab" x 1_000_000
well about 1.5x faster with the new code
interesting it takes longer afterward though
well that "a" is a 32 bit string 04:21
so it doesn't end up doing memset on it
06:28 domidumont joined 06:35 domidumont joined 06:40 brrt joined
japhb samcv: Why is it a 32-bit string? 06:47
samcv japhb: probably because it was a substring of the whole document 07:09
is my best guess
brrt good * #moarvm 07:18
also, good * japhb, samcv
jnthn: bisecting the jit issue now 07:19
07:40 lizmat joined 07:48 brrt joined 08:17 zakharyas joined 08:21 domidumont joined 09:22 zakharyas joined 09:39 brrt joined
brrt hmm, damnit, it's multithreaded? 09:39
oh, it is multiprocess 09:42
jnthn brrt++ 10:13
Yes, 'fraid so, it shows up in something using a Channel
You may or may not have luck producing a golf
brrt hmmmm 10:14
always when using a channel?
jnthn Well, the place things go wrong is (try $channel.receive) // buf8 10:21
The code in the try there is a thunk, and receive is a method call
receive is inlined into the thunk, and the thunk is inlined into the code with the try and // 10:22
And the try then fails to catch the exception
It may be that you can set up something very similar with a single-threaded program
Just my $channel = Channel.new; $channel.close;
And then trying to receive will always throw
samcv the peak memory usage during core compilation is 1.3G with or without my recent change. though total allocations is down from 13.95Gb to 13.74Gb 10:37
i wish it gave me more detailed info on peak memory usage though
11:04 domidumont joined
timotimo jnthn: we need some way to spurt/write bufs bigger than int8 or uint8 into files, otherwise our utf16 encoding is almost completely useless 11:32
jnthn timotimo: It'll just need some tweaks to the stuff behind write_fhb to support things other than 1-byte VMArrays 11:40
(So, nothing more than an NYI) 11:41
timotimo will we accidentally impose an endianness if we just split the 16 into 8 naively? 11:43
or is that why there's UTF16LE and UTF16BE encodings?
jnthn By this point we're already past encodings
But yeah, we'll impose native endian
Hm
Maybe our utf16 encoding should spit out a buf8 too, then we don't have this issue. 11:44
Or it could always spit out the correct BE/LE BOM at the start for the current platform
timotimo if the utf16 encoder spits out anything, it'd have to be the same value regardless of platform endianness, because depending on how it gets turned into 8 bit pieces by the write_fhb instruction it'll end up being the correct bom 12:33
... or something?
ilmari encoders should output bytes. full stop. 12:34
the endianness is an integral part of the encoding 12:35
lower layers should not have to know about this. I/O is streams of bytes
timotimo hum. the utf16 encoder in moar already just gives you a char *, i wonder where it gets turned into 16 bit pieces 12:37
oh, that just happens if you pass a 16-bit-per-entry VMArray to the decode call 12:38
so we'd have to either turn the utf16 type into a buffer of 8bit ints or do something different there 12:42
same with utf32, of course
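To illustrate the "encoders should output bytes" position, here is a rough C sketch of a UTF-16 encoder that writes bytes in an explicitly chosen order, so nothing downstream depends on native endianness. The names `utf16_order`, `put_unit` and `utf16_encode` are made up for this example and are not MoarVM functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { UTF16_LE, UTF16_BE } utf16_order;

/* Writes one 16-bit code unit as two bytes in the requested order. */
static size_t put_unit(uint8_t *out, uint16_t unit, utf16_order order) {
    uint8_t lo = (uint8_t)(unit & 0xFF);
    uint8_t hi = (uint8_t)(unit >> 8);
    if (order == UTF16_LE) {
        out[0] = lo;
        out[1] = hi;
    } else {
        out[0] = hi;
        out[1] = lo;
    }
    return 2;
}

/* Encodes n code points into out (capacity cap bytes).
 * Returns the number of bytes written, or 0 if a code point is
 * invalid or the buffer is too small (also 0 for empty input). */
size_t utf16_encode(const uint32_t *cps, size_t n, utf16_order order,
                    uint8_t *out, size_t cap) {
    size_t used = 0, i;
    assert(out != NULL);
    for (i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        /* Reject out-of-range values and lone surrogates. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        if (cp < 0x10000) {
            if (cap - used < 2)
                return 0;
            used += put_unit(out + used, (uint16_t)cp, order);
        } else {
            /* Supplementary plane: emit a surrogate pair. */
            if (cap - used < 4)
                return 0;
            cp -= 0x10000;
            used += put_unit(out + used, (uint16_t)(0xD800 | (cp >> 10)), order);
            used += put_unit(out + used, (uint16_t)(0xDC00 | (cp & 0x3FF)), order);
        }
    }
    return used;
}
```

Because the result is already a byte buffer, a write path that only handles 1-byte arrays can emit it unchanged, and the LE/BE choice stays inside the encoder.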
brrt hmm, i'll try it out at least 13:00
fwiw, i can try to 'beat' some information out of a single run as well, but it's just not as happy as a bisect 13:01
jnthn timotimo: We should do what ilmari is suggesting, and always have a buf8, I think 13:28
13:39 markmont joined
nwc10 imlari is suggesting a buffet‽ Om nom nom 13:42
oops, that won't highlight 13:46
ilmari: ^^
14:08 zakharyas joined 14:24 zakharyas joined 15:13 AlexDaniel joined 15:26 zakharyas joined 15:56 zakharyas joined 16:08 zakharyas joined 16:14 releasable6 joined 16:27 brrt joined
brrt yay, i golfed it 16:42
jnthn++
your advice worked
jnthn yay :) 16:43
japhb jnthn: I've been reading the current Cro docs and going through the examples. I'm *really* impressed. My stint in the world of web dev seems absolutely ancient in comparison. 16:44
brrt gist.github.com/bdw/13cb662504b3f4...acc63c56c6 16:45
jnthn I bet you can pull the first two lines out of the loop and still get it? 16:46
(might make the generated code you need to debug smaller)
japhb jnthn: Is there a FreeNode channel for Cro yet? 16:47
brrt hmm, i can try
jnthn japhb: Nice to hear. :)
japhb: Not yet, though maybe it's time... :)
brrt yep, you are correct
japhb (I don't see it in the results from alis, but alis seems to miss some already.)
jnthn: Please! :-)
brrt aye!
jnthn wonders if #cro is taken or not 16:48
brrt heh, that's a delightfully fast bisect now 16:50
japhb jnthn: Looks like it's free, I just joined and am the only person
brrt and there is a guard control inserted into the tree… let's see if it is compiled differently in any way 16:55
16:57 zakharyas1 joined 17:04 zakharyas joined 18:04 domidumont joined 18:12 zakharyas joined 19:09 evalable6 joined 19:26 robertle joined
nine I'm now reasonably sure that the remaining issue is about multi-level un-inlines but only in deopt-one cases, not for deopt-all 19:52
timotimo .o( you are crorect ) 20:06
jnthn Oh goodness, deopt /o\
nine It's not certain though, but the statistics point at this. I've seen lots of multi-level un-inlines that are harmless, but those were all deopt-all. The deopt-one cases appear in failing test files. 20:08
timotimo fascinating
nine It also fits the incredible rarity of the failures. rakudo builds fine, make test passes (with blocking and nodelay) and most spec test files pass.
Intriguingly, I could golf one of the failures down to: MVM_SPESH_BLOCKING=1 MVM_SPESH_NODELAY=1 perl6 -e '1; { my $a; }; { my Int $a; }' 20:09
timotimo if only we had a simple way/tool to run a frame once from the beginning until it deopts and the next time the unoptimized version until it hits the point it deopted into
nine Resulting in "No int multidim positional reference type registered for current HLL"
timotimo so we could compare register content and all that
could that be from version skew in rakudo's .c parts and moarvm's parts? 20:10
nine ll-exception backtrace shows the failure coming from a frame that's involved with the Multi-level un-inline 20:12
And the error goes away as soon as I leave the pointless goto entering the nested inline in 20:13
I.e. this case: github.com/MoarVM/MoarVM/blob/inli...ze.c#L2369 20:14
Another interesting point: if I don't delete the goto op but just turn it into a no_op, the error disappears. 20:18
timotimo so another case where we rely on a goto existing to know about the structure of things? 20:21
nine In this case it looks like it doesn't have to be a goto which is consistent with me being unable to find a reliance on a goto in deopt.c. 20:23
Looks more like it stumbles over the removal of an instruction, making me think more about some offset becoming incorrect. 20:24
Sooooo....when deopting an inline, wouldn't it look for the instruction calling the inlined frame? And in a nested inline, wouldn't that instruction be that goto op that eliminate_pointless_gotos tries to remove? 20:27
From what I see, uninline does not look for some annotation. It relies on the inlines table to get its information. But that table is not updated by eliminate_pointless_gotos 20:29
timotimo hm, but we only ever compute offsets at code-gen time, or at least we should 20:33
nine Offset or this mysterious deopt_idx that I haven't really found out yet what it means 20:34
20:42 lizmat joined
jnthn deopt_idx is just an index into the deopt table 20:55
Which contains mappings to locations in the original, interpreted, bytecode 20:56
nine And those mappings are created during codegen? 20:57
jnthn The original locations are written in graph.c, iirc
github.com/MoarVM/MoarVM/blob/mast...raph.c#L37 20:58
nine That's this I guess: g->deopt_addrs[2 * g->num_deopt_addrs] = deopt_target;
jnthn And yes, code-gen fills the rest in: github.com/MoarVM/MoarVM/blob/mast...raph.c#L37 20:59
nine And deopt_target is the unoptimized code I guess.
jnthn Right, it's a table of pairs
Yes, github.com/MoarVM/MoarVM/blob/mast...aph.c#L360 for example
Just passes pc - g->bytecode
Which is a relative offset from the start of the unoptimized bytecode 21:00
nine So that value is certainly still correct regardless of what we do to the optimized bytecode. 21:01
And the deopt_offset is only generated at code gen, i.e. after our optimizations. So they ought to be correct, too.
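The "table of pairs" described above can be sketched roughly as follows. This is a simplified stand-in, not MoarVM's code: the `deopt_table` struct and the three helper names are hypothetical, and only the `2 * index` layout is taken from the quoted `g->deopt_addrs[2 * g->num_deopt_addrs] = deopt_target;` line. Slot `2*i` holds the offset into the original bytecode, recorded when the graph is built; slot `2*i + 1` holds the offset into the optimized bytecode, which is only known at code-gen time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table: addrs[2*i]     = offset in original bytecode,
 *                     addrs[2*i + 1] = offset in optimized bytecode. */
typedef struct {
    int32_t *addrs;
    size_t   num;   /* number of pairs in use */
    size_t   alloc; /* number of pairs allocated */
} deopt_table;

/* Graph-building time: record the original offset and return the
 * deopt index (or -1 if allocation fails). The optimized slot is
 * left as -1 until code-gen fills it in. */
int32_t deopt_add_target(deopt_table *t, int32_t original_offset) {
    if (t->num == t->alloc) {
        size_t new_alloc = t->alloc ? t->alloc * 2 : 4;
        int32_t *grown = realloc(t->addrs, 2 * new_alloc * sizeof(int32_t));
        if (!grown)
            return -1;
        t->addrs = grown;
        t->alloc = new_alloc;
    }
    t->addrs[2 * t->num]     = original_offset;
    t->addrs[2 * t->num + 1] = -1;
    return (int32_t)t->num++;
}

/* Code-gen time: fill in where the optimized code ended up. */
void deopt_set_optimized(deopt_table *t, int32_t idx, int32_t optimized_offset) {
    assert(idx >= 0 && (size_t)idx < t->num);
    t->addrs[2 * idx + 1] = optimized_offset;
}

/* Deopt time: map a deopt index back to the unoptimized bytecode. */
int32_t deopt_original_offset(const deopt_table *t, int32_t idx) {
    assert(idx >= 0 && (size_t)idx < t->num);
    return t->addrs[2 * idx];
}
```

In this model, rewriting the optimized code can only change the second slot of a pair, which is the point nine makes: the original offset stays valid whatever happens to the optimized bytecode.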
jnthn tries to remember how this thing works 21:05
Ah, right, github.com/MoarVM/MoarVM/blob/b9a0...ine.c#L163 is used to identify the location that we return to when doing a multi-level inline 21:06
nine Oooooooh 21:08
/* -1 all the deopt targets, so we'll easily catch those that don't get
* mapped if we try to use them. Same for inlines. */
But unlike inlines, there is no code for actually checking those deopt targets.
When I add that I get MoarVM oops: Spesh: failed to fix up deopt_addr 1
But I get that even if the program would actually work...hm... 21:09
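The kind of check nine describes adding might look something like this sketch, assuming a flat array of (original offset, optimized offset) pairs whose second slot was pre-set to -1. The function name `first_unmapped_deopt` is invented for illustration; the actual check nine wrote is not shown in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scans pairs laid out as [orig0, opt0, orig1, opt1, ...].
 * Returns the index of the first pair whose optimized-offset slot
 * is still the -1 sentinel, or -1 if every pair was fixed up. */
int32_t first_unmapped_deopt(const int32_t *pairs, size_t num_pairs) {
    size_t i;
    assert(pairs != NULL || num_pairs == 0);
    for (i = 0; i < num_pairs; i++)
        if (pairs[2 * i + 1] == -1)
            return (int32_t)i;
    return -1;
}
```

Since nine sees the oops fire even when the program would otherwise work, an entry still at the sentinel is apparently not always a bug, so a check like this is better suited to diagnostics than to a hard failure.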
jnthn Hm, and also it stores and uses the deopt *index*, so the location in the optimized bytecode isn't important for this. 21:10
Is it sensitive to JIT, btw?
nine no 21:11
jnthn Hmm. 21:13
jnthn doesn't have any more guesses, alas
But hopefully those pointers helped a little 21:14
nine jnthn: does this look odd to you? gist.github.com/niner/5626227d1397...c4bcb0b5e3 21:23
jnthn Hmm, where's that "uninline expecting a goto" coming from? 21:24
nine just an additional fprintf I added
The working version does lots of deopts in frame 'MATCH' (cuid '139'), but none in postcircumfix:<{ }> 21:25
jnthn 6632 -> 144 is kinda interesting too 21:26
Oh, though I guess if we're in an inline the top index is the top frame which would be an inlinee
nine It's 6636 -> 144 in the working version 21:27
jnthn If you're seeing totally different deopts and you're using MVM_SPESH_BLOCKING, though...
Then something's odd
nine Some output from the working version: gist.github.com/niner/35accfac21f2...f702a35810 21:28
The difference between the versions should really just be the removal of the no_op 21:29
jnthn Right
It's odd it'd cause different deops
*deopts
nine I guess this riddle needs at least one more night of sleep. Thanks for the help so far :) 21:42
22:08 markmont joined 22:42 zakharyas joined 22:53 MasterDuke joined
MasterDuke samcv: what benchmark were you testing with. the one i've tried `my $a = "a" x 1_000_000; for ^1000 {$a ~~ /./;}; say now - INIT now`, is faster before your recent change, whether it's "a", "ab", or "abcd" 23:54