MasterDuke | samcv: so compiling the core setting is now *not* slower with that PR? | 00:42 | |
01:00 lizmat joined
01:03 MasterDuke_ joined
01:28 leedo__ joined, avar joined, moritz joined
01:31 yoleaux joined
samcv | yep | 01:45 | |
i fixed it. it was not returning the same string when it was already flat but i remedied that | 01:46 | ||
MasterDuke_ | ah, cool | 01:48 | |
02:01 lizmat joined
Geth | MoarVM: de6b0e4b13 | (Samantha McVey)++ | src/strings/ops.c collapse_strands with memcpy if all strands are same type 4x faster If all the strands to collapse are of the same type (ASCII, 8bit, or 32bit) then use memcpy to collapse the strands. If they are not all the same type then we use the traditional grapheme iterator based collapsing that we previously used to collapse strands. If it's 8bit and a repetition with only one grapheme, it will use memset to more quickly write the memory. This is 4-4.5x faster as long as all the strands are of the same type. | 02:02 | |
MoarVM: e876f1484e | (Samantha McVey)++ (committed using GitHub Web editor) | src/strings/ops.c Merge pull request #753 from samcv/collapse_better collapse_strands with memcpy if all strands are same type 4x faster |
MasterDuke | samcv: interesting. i just tested this one-liner: `my $a = "a" x 1_000_000; for ^1000 {$a ~~ /./;}; say now - INIT now` | 02:34 | |
4.3s before your PR, 93% of the time spent in iterate_gi_into_string | 02:35 | ||
5.9s after the PR, 43% in collapse_strands, 33% in __memmove_sse2_unaligned_erms, 10.6% in [email@hidden.address] 5.4% in memcpy@plt | 02:36 | ||
02:56 ilbot3 joined
03:17 colomon joined
samcv | MasterDuke: well it's 2x faster if it is more than one character repeated | 04:18 | |
"ab" x 1_000_000 | |||
well about 1.5x faster with the new code | |||
interesting it takes longer afterward though | |||
well that "a" is a 32 bit string | 04:21 | ||
so it doesn't end up doing memset on it | |||
06:28 domidumont joined
06:35 domidumont joined
06:40 brrt joined
japhb | samcv: Why is it a 32-bit string? | 06:47 | |
samcv | japhb: probably because it was a substring of the whole document | 07:09 | |
is my best guess | |||
brrt | good * #moarvm | 07:18 | |
also, good * japhb, samcv | |||
jnthn: bisecting the jit issue now | 07:19 | ||
07:40 lizmat joined
07:48 brrt joined
08:17 zakharyas joined
08:21 domidumont joined
09:22 zakharyas joined
09:39 brrt joined
brrt | hmm, damnit, it's multithreaded? | 09:39 | |
oh, it is multiprocess | 09:42 | ||
jnthn | brrt++ | 10:13 | |
Yes, 'fraid so, it shows up in something using a Channel | |||
You may or may not have luck producing a golf | |||
brrt | hmmmm | 10:14 | |
always when using a channel? | |||
jnthn | Well, the place things go wrong is (try $channel.receive) // buf8 | 10:21 | |
The code in the try there is a thunk, and receive is a method call | |||
receive is inlined into the thunk, and the thunk is inlined into the code with the try and // | 10:22 | ||
And the try then fails to catch the exception | |||
It may be that you can set up something very similar with a single-threaded program | |||
Just my $channel = Channel.new; $channel.close; | |||
And then trying to receive will always throw | |||
samcv | the peak memory usage during core compilation is 1.3G with or without my recent change. though total allocations is down from 13.95Gb to 13.74Gb | 10:37 | |
i wish it gave me more detailed info on peak memory usage though | |||
11:04 domidumont joined
timotimo | jnthn: we need some way to spurt/write bufs bigger than int8 or uint8 into files, otherwise our utf16 encoding is almost completely useless | 11:32 | |
jnthn | timotimo: It'll just need some tweaks to the stuff behind write_fhb to support things other than 1-byte VMArrays | 11:40 | |
(So, nothing more than an NYI) | 11:41 | ||
timotimo | will we accidentally impose an endianness if we just split the 16 into 8 naively? | 11:43 | |
or is that why there's UTF16LE and UTF16BE encodings? | |||
jnthn | By this point we're already past encodings | ||
But yeah, we'll impose native endian | |||
Hm | |||
Maybe our utf16 encoding should spit out a buf8 too, then we don't have this issue. | 11:44 | ||
Or it could always spit out the correct BE/LE BOM at the start for the current platform | |||
timotimo | if the utf16 encoder spits out anything, it'd have to be the same value regardless of platform endianness, because depending on how it gets turned into 8 bit pieces by the write_fhb instruction it'll end up being the correct bom | 12:33 | |
... or something? | |||
ilmari | encoders should output bytes. full stop. | 12:34 | |
the endianness is an integral part of the encoding | 12:35 | ||
lower layers should not have to know about this. I/O is streams of bytes | |||
timotimo | hum. the utf16 encoder in moar already just gives you a char *, i wonder where it gets turned into 16 bit pieces | 12:37 | |
oh, that just happens if you pass a 16-bit-per-entry VMArray to the decode call | 12:38 | ||
so we'd have to either turn the utf16 type into a buffer of 8bit ints or do something different there | 12:42 | ||
same with utf32, of course | |||
brrt | hmm, i'll try it out at least | 13:00 | |
fwiw, i can try to 'beat' some information out of a single run as well, but it's just not as happy as a bisect | 13:01 | ||
jnthn | timotimo: We should do what ilmari is suggesting, and always have a buf8, I think | 13:28 | |
13:39 markmont joined
nwc10 | imlari is suggesting a buffet‽ Om nom nom | 13:42 | |
oops, that won't highlight | 13:46 | ||
ilmari: ^^ | |||
14:08 zakharyas joined
14:24 zakharyas joined
15:13 AlexDaniel joined
15:26 zakharyas joined
15:56 zakharyas joined
16:08 zakharyas joined
16:14 releasable6 joined
16:27 brrt joined
brrt | yay, i golfed it | 16:42 | |
jnthn++ | |||
your advice worked | |||
jnthn | yay :) | 16:43 | |
japhb | jnthn: I've been reading the current Cro docs and going through the examples. I'm *really* impressed. My stint in the world of web dev seems absolutely ancient in comparison. | 16:44 | |
brrt | gist.github.com/bdw/13cb662504b3f4...acc63c56c6 | 16:45 | |
jnthn | I bet you can pull the first two lines out of the loop and still get it? | 16:46 | |
(might make the generated code you need to debug smaller) | |||
japhb | jnthn: Is there a FreeNode channel for Cro yet? | 16:47 | |
brrt | hmm, i can try | ||
jnthn | japhb: Nice to hear. :) | ||
japhb: Not yet, though maybe it's time... :) | |||
brrt | yep, you are correct | ||
japhb | (I don't see it in the results from alis, but alis seems to miss some already.) | ||
jnthn: Please! :-) | |||
brrt | aye! | ||
jnthn wonders if #cro is taken or not | 16:48 | ||
brrt | heh, that's a delightfully fast bisect now | 16:50 | ||
japhb | jnthn: Looks like it's free, I just joined and am the only person | ||
brrt | and there is a guard control inserted into the treeā¦ let's see if it is compiled differently in any way | 16:55 | |
16:57 zakharyas1 joined
17:04 zakharyas joined
18:04 domidumont joined
18:12 zakharyas joined
19:09 evalable6 joined
19:26 robertle joined
nine | I'm now reasonably sure that the remaining issue is about multi-level un-inlines but only in deopt-one cases, not for deopt-all | 19:52 | |
timotimo | .o( you are crorect ) | 20:06 | |
jnthn | Oh goodness, deopt /o\ | ||
nine | It's not certain though, but the statistics point at this. I've seen lots of multi-level un-inlines that are harmless, but those were all deopt-all. The deopt-one cases appear in failing test files. | 20:08 | |
timotimo | fascinating | ||
nine | It also fits the incredible rarity of the failures. rakudo builds fine, make test passes (with blocking and nodelay) and most spec test files pass. | ||
Intriguingly, I could golf one of the failures down to: MVM_SPESH_BLOCKING=1 MVM_SPESH_NODELAY=1 perl6 -e '1; { my $a; }; { my Int $a; }' | 20:09 | ||
timotimo | if only we had a simple way/tool to run a frame once from the beginning until it deopts and the next time the unoptimized version until it hits the point it deopted into | ||
nine | Resulting in "No int multidim positional reference type registered for current HLL" | ||
timotimo | so we could compare register content and all that | ||
could that be from version skew in rakudo's .c parts and moarvm's parts? | 20:10 | ||
nine | ll-exception backtrace shows the failure coming from a frame that's involved with the Multi-level un-inline | 20:12 | |
And the error goes away as soon as I leave the pointless goto entering the nested inline in | 20:13 | ||
I.e. this case: github.com/MoarVM/MoarVM/blob/inli...ze.c#L2369 | 20:14 | ||
Another interesting point: if I don't delete the goto op but just turn it into a no_op, the error disappears. | 20:18 | ||
timotimo | so another case where we rely on a goto existing to know about the structure of things? | 20:21 | |
nine | In this case it looks like it doesn't have to be a goto which is consistent with me being unable to find a reliance on a goto in deopt.c. | 20:23 | |
Looks more like it stumbles over the removal of an instruction, making me think more about some offset becoming incorrect. | 20:24 | ||
Sooooo....when deopting an inline, wouldn't it look for the instruction calling the inlined frame? And in a nested inline, wouldn't that instruction be that goto op that eliminate_pointless_gotos tries to remove? | 20:27 | ||
From what I see, uninline does not look for some annotation. It relies on the inlines table to get its information. But that table is not updated by eliminate_pointless_gotos | 20:29 | ||
timotimo | hm, but we only ever compute offsets at code-gen time, or at least we should | 20:33 | |
nine | Offset or this mysterious deopt_idx that I haven't really found out yet what it means | 20:34 | |
20:42 lizmat joined
jnthn | deopt_idx is just an index into the deopt table | 20:55 | |
Which contains mappings to locations in the original, interpreted, bytecode | 20:56 | ||
nine | And those mappings are created during codegen? | 20:57 | |
jnthn | The original locations are written in graph.c, iirc | ||
github.com/MoarVM/MoarVM/blob/mast...raph.c#L37 | 20:58 | ||
nine | That's this I guess: g->deopt_addrs[2 * g->num_deopt_addrs] = deopt_target; | ||
jnthn | And yes, code-gen fills the rest in: github.com/MoarVM/MoarVM/blob/mast...raph.c#L37 | 20:59 | |
nine | And deopt_target is the unoptimized code I guess. | ||
jnthn | Right, it's a table of pairs | ||
Yes, github.com/MoarVM/MoarVM/blob/mast...aph.c#L360 for example | |||
Just passes pc - g->bytecode | |||
Which is a relative offset from the start of the unoptimized bytecode | 21:00 | ||
nine | So that value is certainly still correct regardless of what we do to the optimized bytecode. | 21:01 | |
And the deopt_offset is only generated at code gen, i.e. after our optimizations. So they ought to be correct, too. | |||
jnthn tries to remember how this thing works | 21:05 | ||
Ah, right, github.com/MoarVM/MoarVM/blob/b9a0...ine.c#L163 is used to identify the location that we return to when doing a multi-level inline | 21:06 | ||
nine | Oooooooh | 21:08 | |
/* -1 all the deopt targets, so we'll easily catch those that don't get | |||
* mapped if we try to use them. Same for inlines. */ | |||
But unlike inlines, there is no code for actually checking those deopt targets. | ||
When I add that I get MoarVM oops: Spesh: failed to fix up deopt_addr 1 | |||
But I get that even if the program would actually work...hm... | 21:09 | ||
jnthn | Hm, and also it stores and uses the deopt *index*, so the location in the optimized bytecode isn't important for this. | 21:10 | |
Is it sensitive to JIT, btw? | |||
nine | no | 21:11 | |
jnthn | Hmm. | 21:13 | |
jnthn doesn't have any more guesses, alas | |||
But hopefully those pointers helped a little | 21:14 | ||
nine | jnthn: does this look odd to you? gist.github.com/niner/5626227d1397...c4bcb0b5e3 | 21:23 | |
jnthn | Hmm, where's that "uninline expecting a goto" coming from? | 21:24 | |
nine | just an additional fprintf I added | ||
The working version does lots of deopts in frame 'MATCH' (cuid '139'), but none in postcircumfix:<{ }> | 21:25 | ||
jnthn | 6632 -> 144 is kinda interesting too | 21:26 | |
Oh, though I guess if we're in an inline the top index is the top frame which would be an inlinee | |||
nine | It's 6636 -> 144 in the working version | 21:27 | |
jnthn | If you're seeing totally different deopts and you're using MVM_SPESH_BLOCKING, though... | ||
Then something's odd | |||
nine | Some output from the working version: gist.github.com/niner/35accfac21f2...f702a35810 | 21:28 | |
The difference between the versions should really just be the removal of the no_op | 21:29 | ||
jnthn | Right | ||
It's odd it'd cause different deops | |||
*deopts | |||
nine | I guess this riddle needs at least one more night of sleep. Thanks for the help so far :) | 21:42 | |
22:08 markmont joined
22:42 zakharyas joined
22:53 MasterDuke joined
MasterDuke | samcv: what benchmark were you testing with. the one i've tried `my $a = "a" x 1_000_000; for ^1000 {$a ~~ /./;}; say now - INIT now`, is faster before your recent change, whether it's "a", "ab", or "abcd" | 23:54 |