github.com/moarvm/moarvm | IRC logs at colabti.org/irclogger/irclogger_logs/moarvm
Set by AlexDaniel on 12 June 2018.
nine dogbert11: oh, that's interesting 06:45
nwc10 good *, * 06:47
nine That a segfault is connected to GC may (yet again) explain the seeming randomness of segfaults we see on CI
Nicholas good *, brrt 07:40
brrt good * Nicholas 07:54
dogbert11 nine: now I 'only' have to catch it in the debugger :) 09:07
sena_kun hi, folks 09:18
how is the state of the revert revert revert commit? I remember it exposed some issues we had to address before the release. Are they patched already, or should we do a re-re-re-revert, as the release is tomorrow?
also, if there are any new blockers, please share. 09:19
nine sena_kun: AFAIK the commit is still in there. Have any issues come up? 10:34
sena_kun nine, not yet, though I have no means to do a Blin run now as usual, so I was wondering if something has shown up on your (plural) side. 10:36
nine Blin would be mighty helpful... 10:40
sena_kun :/
dogbert11 now I'm running with optimizations on, an 8k nursery and the gc debug flag set to two. It has now stopped, in gdb, with 'non-AsyncTask fetched from eventloop active work list' 11:00
gist.github.com/dogbert17/e4a3993b...853d014649 11:02
nine: is it possible to make something out of this or do we need to catch things earlier? 11:03
nine dogbert11: the immediate question is: what _did_ it catch? 11:09
So, good *s work here the same as on freenode? Checked
nine Apparently a VMNull because the array slot work_idx is NULL 11:39
dogbert11 nine: (gdb) p REPR(task_obj)->ID 11:48
value has been optimized out
:(
nine Yeah, you have to get it from the source: call MVM_repr_at_pos_o(tc, tc->instance->event_loop_active, work_idx)
Or: p ((MVMArray*)(tc->instance->event_loop_active))->body.slots.o[1] 11:49
dogbert11 (gdb) p ((MVMArray*)(tc->instance->event_loop_active))->body.slots.o[1]
$1 = (MVMObject *) 0x0
tbrowder hi, working issue #1469 has led to needing a CFLAGS change for libuv that may conflict with other libs. a casual look at the build situation, confirmed by MasterDuke17, shows all objects being built with the same CFLAGS. seems to me we should compile 3rdparty libs with the same CFLAGS they use. 11:53
would require an overhaul of build but it would be more robust for future 3rdparty libs 11:54
dogbert11 nine: in case you want to try teasing the error out, here's the 'golf': gist.github.com/dogbert17/8eded7bd...02c1781405
I have also updated the Panic gist a bit, i.e. with some 'l' commands, your 'p' command and 'info threads' 11:57
nine oh a golf. That's useful!
dogbert11 more like a bogey :) 11:57
I'm running with 8k nursery and GC_DEBUG=1
nine of course it refuses to break in rr 11:59
nine OTOH use Test can be removed from the golf 12:03
nine The segfault happens because when run-one is called args[1] is NULL 12:39
The most curious thing about this is: since args[1] is a register it must not ever be NULL 12:43
dogbert11 so how can that happen? 12:45
it sounds like you've managed to repro :)
nine at SETTING::src/core.c/ThreadPoolScheduler.pm6:297 (/home/nine/rakudo/blib/CORE.c.setting.moarvm:) 12:58
That's where the call happens 12:59
And the NULL we get from nqp::shift($queue)
Added an assert in ConcBlockingQueue's shift and it triggers
dogbert11 cool 13:00
lizmat so it's shifting from the queue when it shouldn't? or another thread beat it to it ? 13:21
nine No, the whole point of ConcBlockingQueue is that it's safe to use from different threads. It's just that somehow a NULL ends up in that queue. But in both unshift and push we explicitly guard against that 13:22
lizmat so the number of elems is > 0 when the shift produces a NULL, so it really sits in the queue, is what you're saying ? 13:39
nine yes 13:48
lizmat is it clear if the value got produced by a push or an unshift ? 13:49
also: you said: "it's safe to use from different threads" 13:50
are we 200% sure of that ?
because *if* the guard in unshift / push is correct, the only other way *I* see is that another thread snatched it and thus you're looking at element #1 really, and if there is none left, that'd be a NULL ? 13:51
nine Well it's meant to be thread safe. Of course the implementation may have bugs 13:52
lizmat well, if it walks like a duck and talks like a duck (aka, push and unshift have guarded against NULL entry) 13:53
jnthn The bugs there in the past have always been about GC handling around the lock acquisitions
lizmat it can only be a duck (aka, a race on the queue.shift) 13:54
jnthn At least, those I can remember have :)
nine Well this bug seems to require a small nursery to reproduce, so maybe there's yet another GC handling issue there 13:56
Well the node got into the queue via push and it definitely had a value back then 14:02
dogbert11 (gdb) bt
#0 MVM_panic (exitCode=0, messageFormat=0x0) at src/core/exceptions.c:853
#1 0x00007ffff78d85d2 in gc_mark (tc=0x7fffe00d42e0, st=0x5555555b5178, data=0x5555576392e8, worklist=0x7fffdc1cbec0) at src/6model/reprs/MVMCode.c:48 14:03
#2 0x00007ffff7896c99 in MVM_gc_mark_collectable (tc=0x7fffe00d42e0, worklist=0x7fffdc1cbec0, new_addr=0x5555576392d0) at src/gc/collect.c:439
#3 0x00007ffff7890a40 in MVM_gc_root_add_gen2s_to_worklist (tc=0x7fffe00d42e0, worklist=0x7fffdc1cbec0) at src/gc/roots.c:349
#4 0x00007ffff7893870 in MVM_gc_collect (tc=0x7fffe00d42e0, what_to_do=1 '\001', gen=0 '\000') at src/gc/collect.c:155
#5 0x00007ffff788766f in run_gc (tc=0x7fffe00d42e0, what_to_do=1 '\001') at src/gc/orchestrate.c:443
#6 0x00007ffff78882e4 in MVM_gc_enter_from_interrupt (tc=0x7fffe00d42e0) at src/gc/orchestrate.c:728
Adding pointer %p to past fromspace to GC worklist 14:05
nine: should I do a MVM_dump_backtrace(tc) or something else 14:07
nine Can you have a look at what that collectable actually is? 14:08
dogbert11 48 MVM_gc_worklist_add(tc, worklist, &body->outer); is it body->outer we want?
nine Or even body itself since that's the one containing the outdated pointer. What code object is it? 14:11
dogbert11 (gdb) p *body 14:12
$3 = {sf = 0x55555741f070, outer = 0x7fffdc22cbb8, code_object = 0x0, name = 0x555556d1c110, state_vars = 0x0, is_static = 1, is_compiler_stub = 0}
nine name and sf->body.name are of interest 14:13
dogbert11 so how do I get an MVMString to something readable? 14:15
nine MVM_dump_string(tc, string)
dogbert11 thx
nine Or if it's not a debug build MVM_string_utf8_maybe_encode_C_string(tc, string) 14:16
dogbert11 I'll try that as well 14:17
(gdb) p MVM_string_utf8_maybe_encode_C_string(tc, body.name)
$8 = 0x7fffdc5b49b0 ""
(gdb) p MVM_string_utf8_maybe_encode_C_string(tc, body->name)
$9 = 0x7fffdc151dd0 ""
(gdb)
I'm probably doing something wrong but it seems to be the empty string 14:23
nine How on earth? It looks like we're pushing the same MVMConcBlockingQueueNode onto two different queues! A poll on the one queue sets the node's value to NULL (when it becomes the new dummy head node) and a shift on the other queue then finds the broken node 14:27
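To make the hypothesis above concrete, here is a minimal single-threaded sketch of a dummy-head linked queue in C. It is not the MoarVM code: `Node`, `Queue`, `queue_init`, `queue_push_node`, `queue_poll` and `demo_shared_node` are all hypothetical names made up for illustration, and locking is left out entirely. It only shows how one node linked into two queues can come back as NULL from the second queue after the first queue has turned it into its new dummy head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queue node: a value and a next pointer. */
typedef struct Node {
    void        *value;
    struct Node *next;
} Node;

/* Hypothetical dummy-head queue: head is a placeholder whose value is unused. */
typedef struct {
    Node *head;
    Node *tail;
} Queue;

static void queue_init(Queue *q) {
    Node *dummy = calloc(1, sizeof(Node));
    assert(dummy != NULL);
    q->head = q->tail = dummy;
}

/* Link an already allocated node at the tail. NULL values are rejected,
 * mirroring the kind of guard described for push and unshift. */
static int queue_push_node(Queue *q, Node *n) {
    if (n->value == NULL)
        return -1;
    n->next = NULL;
    q->tail->next = n;
    q->tail = n;
    return 0;
}

/* Take the first real node: it becomes the new dummy head and its value
 * is cleared. Returns NULL when the queue is empty. */
static void *queue_poll(Queue *q) {
    Node *first = q->head->next;
    void *value;
    if (first == NULL)
        return NULL;
    value = first->value;
    first->value = NULL;   /* the node is now only a placeholder */
    free(q->head);         /* release the old dummy */
    q->head = first;
    return value;
}

/* Returns 1 if queue a yields the payload and queue b, although it holds
 * a node, yields NULL because a's poll already cleared the shared node. */
static int demo_shared_node(void) {
    static int payload = 42;
    Queue a, b;
    Node *shared = calloc(1, sizeof(Node));
    void *from_a, *from_b;
    int result;

    assert(shared != NULL);
    queue_init(&a);
    queue_init(&b);
    shared->value = &payload;

    queue_push_node(&a, shared);   /* passes the NULL guard */
    queue_push_node(&b, shared);   /* same node, second queue */

    from_a = queue_poll(&a);       /* clears shared->value */
    from_b = queue_poll(&b);       /* finds the cleared node */

    result = (from_a == (void *)&payload) && (from_b == NULL);
    free(shared);                  /* both heads point here; free once */
    return result;
}
```

Calling `demo_shared_node()` from a small `main` is enough to see the effect. Both pushes pass the NULL guard, so a guard at insertion time cannot catch this kind of sharing.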
dogbert11 oops 14:29
nine It gets weirder: even after replacing the FSA with plain calloc, not freeing the nodes at all anymore and commenting out the NULL assignment, I still get NULLs in node values 14:49
dogbert11 the plot thickens, will this be a one line fix 14:56
nine I fear it will be a fix at all only when I manage to reproduce in rr. Because I'm running out of ideas. There's just no code left that would overwrite a queue node's value with NULL 14:59
dogbert11 and rr is not cooperating 15:09
tbrowder seems embarrassing to use python in our tool chain 15:39
nine feel free to change that :)
nine This just doesn't make sense. It's always the ConcBlockingQueueNode's value that suddenly turns into NULL, while its next pointer stays intact. So it's a very precise change. 16:18
It's probably not a random memory overwrite as nothing else seems to get hit and when I replace usage of the FSA with malloc that would surely change the behavior as we're talking about different memory areas. But it stays the same 16:19
But ConcBlockingQueueNodes are only used and modified in src/6model/reprs/ConcBlockingQueue.c and I already removed all setting to NULL 16:20
So what's left?
Geth MoarVM: tbrowder++ created pull request #1497:
Define _GNU_SOURCE for GNU builds
16:52
Geth MoarVM: tbrowder++ created pull request #1498:
Quell compiler warnings on Linux with gcc
19:59
tbrowder nine: see last PR, two uninitialized values giving warnings about vfork and jumps 20:01
MasterDuke just got a segfault in t/spec/S17-lowlevel/cas.t with only change being an 8k nursery 20:19
haven't been able to catch it in rr though 20:25
ran it under rr ~250 times, but never an error of any kind 20:33
dogbert11 MasterDuke: I got it as well 20:38
0x00007ffff79b9e3b in evaluate_guards (gs=0x555558c0cac8, gs=0x555558c0cac8, callsite=0x555558c0cac8, guard_offset=0x7fffeea5ab66, tc=0x7fffe00d6ea0) at src/spesh/plugin.c:85
85 outcome = STABLE(test) == gs->guards[pos].u.type;
MasterDuke interesting 20:39