00:57 librasteve_ left 03:00 sjn_ joined 03:02 sjn left 03:54 sjn_ is now known as sjn 05:03 melezhik joined 07:05 melezhik left 08:22 melezhik joined 09:36 librasteve_ joined
melezhik ugexe: I set RAKULIB to FEZ_INSTALL_TO, interestingly I see a lot of “can open file ..” errors during zef test stage , like this ( remember we have concurrency zef process from podman containers trying to install things into shared storage, and my theory is they might remove each other cache leading to such an errors ) brw.sparrowhub.io/file_view/agent.r...tUnion.log 10:13
What you think ?
www.irccloud.com/pastebin/mtdeHN4j 10:18
timo there's probably some distros that wrongly use "use lib" in their tests, or put extra libs for testing purposes in a spot that isn't right 11:58
i don't think `use lib ".";` or `use lib "test";` or something like that should ever be in tests?
12:15 lucs left 12:16 librasteve_ left
melezhik Yeah but this error only happens when we install on shared storage with multiple zefs 12:17
12:19 lucs joined
timo oh, i see 12:23
lizmat So, do we think the time is right to create our own CVE Numbering Authority for the Raku Programming Language ? 13:07
github.com/ossf/wg-vulnerability-d...project.md
timo does perl have its own CNA? do you know how they did it / are doing it? 13:18
13:18 melezhik3 joined
timo they do say that at least submitting the form for additional information is free and recommended if we are even somewhat interested 13:19
melezhik3 [Coke]: ab5tract: ok - let me reconsider my previous statement on "no such file or directory" error, as it seems to appear even on none shared storage installation - brw.sparrowhub.io/file_view/agent.r...::json.log
timo i'm not sure why for example red hat wouldn't be an okay address to ask for doing stuff for us
melezhik3 sorry meant for timo: 13:20
lizmat timo: there's a cpan-sec group that runs a CNA for Perl 13:21
melezhik3 keep it in gist gist.github.com/melezhik/7c85bafcf...2b862b5a3b as brownie reports are vanished once the server restarted
timo are stdout and stderr mis-ordered in that log? 13:22
melezhik3 more errors like that - gist.github.com/melezhik/7c85bafcf...nt-5835428 13:23
timo that is a very weird error. can we see the contents of the exact archive used for this test? does it not have that file in it? if not, why does it even try to open that file in the first place?
is it a folder that another instance of zef is deleting as we are trying to run tests inside of it? 13:24
melezhik3 in this log only stdout is shown
ab5tract: in THIS brownie round we don't use shared storage right? 13:25
timo can we reproduce the issues with ZEF_TEST_DEGREE= set to just 2, and also with just 1?
ab5tract melezhik3: that's correct
melezhik3 but you still mount files via podman volumes ?
ab5tract no
no volume instructions at all
these containers could probably go for a bit more memory 13:26
melezhik3 but there is setting ZEF_INSTALL_TO=/root/ 13:27
which is probably fine
one more error like that - gist.github.com/melezhik/7c85bafcf...2b862b5a3b 13:29
ab5tract it should be. looks like the job just finished?
ok, we should do another run without any ZEF_INSTALL_TO set 13:30
melezhik3 yeah, this is really reird
weird
ab5tract so that we are using the staging area for the installs. ugexe said that the staging is not done when ZEF_INSTALL_TO is set 13:31
melezhik3 :shrug: 🤷
ab5tract it might be that the existence of staging is specifically to avoid these kinds of erroors :)
melezhik3 how does that relate to test failures
?
ok, do you want just stop current round, and remove ZEF_INSTALL_TO at all > 13:32
?
ab5tract these aren't test failures, though. they are test files failing to exist
melezhik3 yep
ab5tract is it possible to conditionally set ZEF_INSTALL_TO? 13:33
melezhik3 yes, let me to the patch
so that is it set to none empty string it is applied 13:34
if it is empty it does not apply
will this work ?
ab5tract yup 👍
melezhik3 ok, will take few minutes to make a patch 13:35
ab5tract btw here's some thoughts I had on how SSE could apply : gist.github.com/ab5tract/c52559003...eb2ed44330
timo ah, so you did mean server-sent events 13:37
ab5tract yeah :)
melezhik3: also I think if we were to use zef's libraries directly from within the agen, we would have structured results from the installation attempt 13:39
timo yes, better than parsing stdout/stderr messages
melezhik3 ab5tract: sure, we could refactor things later, right now I more focus on performance ) 13:41
ab5tract sure 13:46
ugexe CompUnit::Repository::Staging, the core raku class for staging, simply doesn't work right with non-standard repos 14:03
14:03 librasteve_ joined
ugexe it might be more appropriate to say it only works with repositories that have names e.g. site, vendor, core, which non-standard repos do not 14:04
my recollection is vague, but essentially the repository name can get baked into the precomp files so even though the staging repo, which is initially in somewhere like /tmp when precompilation occurs, can be cp'd into the final location (usually the site directory under rakudo somewhere) without having issues with the path changing 14:11
without the names a staging repo that is precompiling into like /tmp/foo will have paths including /tmp/foo in them, but /tmp/foo gets moved into something like /home/me/rakudo/whatever/site after testing passes so that /tmp/foo that got precompiled into files would no longer exist 14:13
melezhik3 so, that means technically setting ZEF_INSTALL_TO does not work with some modules? 14:15
I guess like we see it?
ugexe no 14:16
timo i don't think that's what ugexe means
ugexe the reason zef_install_to pointing at a custom location doesn't use the staging repository is precisely so it *does* work
melezhik3 ok, may ask from initial context, why the errors occurred ? 14:17
ugexe presumably because there is "no such file or directory" 14:18
melezhik3 ?
ugexe is that not the error you saw?
melezhik3 not sure if I follow, why we don't see such an error when ZEF_INSTALL_TO is not set?
it's not just error, but the error which like I believe only occurs IF ZEF_INSTALL_TO set 14:19
timo the test files shouldn't be installed to any repository anyway. maybe it's just dumb luck that it works with one but not the other
ugexe right
if i had to guess why it doesn't work it would be because of the module authors doing bad things 14:20
for example changing the current working directory
but that is outside of raku or zefs control, and is really something anyone could figure out if they poked around
melezhik3 ok, in the next round of brownie we age going to run with ZEF_INSTALL_TO and see if errors remain for problematic modules 14:21
timo i'd maybe look at an strace -f -e trace=%file for the whole process tree 14:23
so that an unfortunate unlink could be spotted 14:24
ugexe i would be surprised if something is deleting a test file
timo me too. it'd have to be something like one process is doing `rm -r` on the extracted source archive while the other is trying to run tests inside of it 14:25
but zef uses temporary folders for each extraction if i'm not mistaken? 14:26
ugexe ah right, i would be surprised if something was deleting files specifically but it would be less surprising for something to delete the archive folder
timo yeah, it could just be the first file it's trying to open after everything has been deleted
ab5tract i think that's what's happening
timo doesn't have to mean it's just a single file that's missing
ab5tract it is that every time
(that I saw) 14:27
timo is there a way to reproduce it without the entire orchestration and such? literally just the "zef install" line perhaps?
ugexe there is github.com/ugexe/zef/blob/2c6ab9ff...kumod#L640 but $tmp points at a path like /tmp/asdfasf/MyDistribution-1.00 so for it to get deleted there have to be multiple installs trying to use that directory (which I'm not sure can happen) 14:29
14:29 melezhik3 left
timo so it'd have to be a `rm -rf /tmp/*` kind of thing that could do this ... also quite strange 14:29
ab5tract maybe not 14:30
these are containers and they have a memory limit set
so couldn't it be a flush of tmpfs due to memory constraint?
timo you mean extraction might be failing to write these files in the first place because it's a tmpfs?
ab5tract or that too
timo i don't think it'd flush, it's just error ENOSPC or something
when trying to write to the file 14:31
14:31 melezhik_ joined
ab5tract I don't know how that works in a container, to be honest. I don't know what kind of guarantees linux makes for "persistence" in tmpfs 14:33
But it seems like you want to throw out tmpfs if you're about to oom
*might want
melezhik: the rakudo we are testing with is pretty old btw 14:34
14:36 melezhik_ left
ugexe it would probably be useful to see what the file layout of e.g. /tmp is when that happens 14:36
to figure out if the file does actually exist and its the cwd changing
additionally you could try commenting out that line I linked to and see if that fixes it. if so we could just add a flag to zef to disable it 14:37
ab5tract wouldn't the dist always fail to install if it is a cwd or something?
14:39 melezhik_ joined
ugexe if there is no parallelization, yes. with parallelization i'm not sure 14:39
further using ZEF_TEST_DEGREE isn't a great idea (which is why it defaults to 1) 14:42
ab5tract oh :)
ugexe especially when the staging repo isn't used, i.e. using ZEF_INSTALL_TO
ab5tract right, we had a bad combo there.
ugexe the reason being when staging repo is not used then precompilation also occurs when the tests are run
and if tests are run in parallel you end up with rakudo trying to precompile a bunch of different distributions at the same time and things are not deterministic 14:43
melezhik_ ok, what we gonna do next round 1) disable FEZ_INSTALL_TO 2) no shared storage - which means zef concurrent installs is excluded 3) ZEF_*+DEGREE -? I am not sure what value is better to use
ab5tract I had no idea it was discouraged, just saw them in --help
melezhik_ yep - me also )
anyway look like ZEF_*+DEGREE=1 are good default ? 14:44
ugexe well its set to 1 beccause of github.com/rakudo/rakudo/issues/1920
zef already comes with sane defaults i believe
melezhik_ yeah, I mean in case we would want to override those on our side ... 14:45
ugexe fetching is fine to set to whatever. it defaults to 5 i think 14:46
melezhik_ ok 14:47
ab5tract: let then ensure that ZEF_TEST_DEGREE=1 and ZEF_FETCH_DEGREE=5 in our agents setup 14:48
ugexe another hail mary might be to try setting ZEF_CONFIG_TEMPDIR to a different directory for each agent. while that might not be the actual solution we'd want, it would help determine a better solution assuming it works 14:51
ab5tract ok, so that's an interesting idea. I could bind a single tmpfs into the containers, and then set each one to a subdirectory 14:54
My thought is that this will reduce the in-container memory pressure, even if the overall memory footprint from the host POV is the same 14:55
melezhik_ ab5tract: I'd try that next round , this round I would like to test without ZEF_INSTALL_TO 14:56
ab5tract ok 14:57
15:23 melezhik_ left, melezhik_ joined 15:37 melezhik_ left, melezhik_ joined 15:42 melezhik_ left, melezhik_ joined 15:48 melezhik_ left 15:50 melezhik_ joined 15:52 melezhik_ left
timo why was i never pointed at R#1920 it's relatively clear how it's going wrong; two threads try to run MVM_serialization_deserialize against the same SC which conflicts quite explosively 16:03
linkable6 R#1920 [open]: github.com/rakudo/rakudo/issues/1920 [SEGV] Module loading is not thread safe
ab5tract points timo towards R#1920 16:04
lizmat hopes timo can create a fix 16:05
16:06 ds7832 joined
timo the question is, what do we want to happen exactly when an SC is attempted to be deserialized a second time. the caller of nqp::deserialize passes in arrays that they expect to be filled. do we just fill them with the data from the previous run? 16:08
ab5tract maybe try that and see what negative externalities it introduces (if any)? 16:09
timo ah, the caller of the nqp::deserialize in question is actually the dependencies+deserialize frame in the precompiled file 16:17
Geth rakudo/main: b92e33cc8e | (Elizabeth Mattijsen)++ | src/core.c/Junction.rakumod
Give any() a @foo candidate

Commit 51c3d86ccf removed this candidate (which apparently made things go a lot faster for its siblings all(), one() and none(), because it broke tests. Today, it does not break any tests, so there's no reason not to have a any(@foo) candidate.
Spotted by arkiuat++
16:24
timo if we just abort the dependencies+deserialize when an "this SC is already loaded" situation is detected, we may very well return from the loadbytecode op before the other thread that's loading the CU at the same time has finished running the same dependencies+deserialize, and that could cause bad trouble as well 16:44
the SC isn't really the right hook to place a collision detection mechanism, the CU would be a better place 16:45
lizmat could we make the SC to be thread-local during the time it is being deserialized ? 16:46
timo i don't think that is sensible 16:51
especially since we lazily deserialize parts of the SC whenever they are needed, which can be minutes, hours, days after the deserialize process is started 16:52
the part that nqp::deserialize does is relatively brief, but the parts that dependencies+deserialize does after that call are important for things to work correctly 16:53
16:56 librasteve_ left
timo oh, the thing i was just implementing makes even less sense now that i think about it, since different CUs can very well have shared SC dependencies, so if one of them is already there, that doesn't mean they all have been handled by some other thread in the mean time 16:57
extra fun: if you have RAKUDO_MODULE_DEBUG=1 and do the multi-thread require mode, you'll get warnings about the `$level` state variable in Process.rakumod being uninitialized 17:02
lizmat heh 17:03
17:08 librasteve_ joined
timo i see now in the middle of the issue thread there's a few patches with locking suggestions, that's a good start, i'll have to think a bit if that's enough to get things right 17:12
ah, i was wrong again 17:17
lizmat
.oO( progressive insight )
ugexe i don't think we want to add those locks though 17:29
i want to say nine mentioned as much but i could be misremembering 17:30
(if you're talking about all the locks i demonstrated in the issue on github) 17:31
timo yeah i tentatively agree 17:37
17:38 melezhik_ joined
timo the immediate thing that explodes the code is that an SC ID is kind of process-global 17:40
(also, dependencies+deserialize grabs hllsym "GLOBAL" and works with it, which ... are we careful about that?) 17:42
17:42 melezhik_ left 17:55 melezhik_ joined 18:07 melezhik_ left, melezhik_ joined
melezhik_ . 18:19
lizmat m: dd &(42,666) # feels this should warn or do something else then just silently assume $ 18:20
camelia $(42, 666)
timo looks like it code-gens to .item, also for other stuff you can put in the parens 18:26
lizmat yeah, I guess it checks for '@', '%' and then elses to the .item case 18:28
timo i feel like instead of doing mitigation after a CU has been mapped a second time, we should just™ not attempt to load the same CU twice 18:34
normally moarvm handles that when we load CUs from a filename. then it will use the filename as key for the "already loaded" hash 18:35
but we mostly load CUs from Blob objects
probably because of the stuff that sits before the MOARVM header, like repo ids and dependencies and stuff
oh, huh. 18:38
disregard what i just said 18:40
18:43 melezhik_ left 18:44 melezhik_ joined
melezhik_ so brownie round with ZEF_INSTALL_TO unset process zero error of "failed to stat file" for 470 processed distributions 18:46
gist.github.com/melezhik/877dff68a...50d865456c
18:49 melezhik_ left
melezhik Zerro number of errors 18:51
19:11 melezhik_ joined
timo so, preventing the same file from being loaded even when using loadbytecodefh isn't hard, and it is a relatively early spot to do an abort for duplicate files. however, if we don't run the bytecode's `<load>` frame, $*CTXSAVE.ctxsave is never called, so the CompUnit::Handle instance that load-precompilation-file creates and returns won't have anything put into it when a file is loaded for the 19:21
second time
since the CTXSAVE stuff is merely a side effect, there's not really an obvious way for moarvm to remember something for the existing compunit to return or otherwise offer 19:22
so i think CompUnit::Loader will have to do that remembering by itself
19:32 melezhik_ left, melezhik_ joined
timo well, CU::L isn't meant to be instantiated 19:32
and i think if we just do locking so that we don't concurrently load the same thing twice, we will prevent crashes due to memory corruption, but if we just allow the same precomped file to be loaded twice "independently" we will end up with strange effects where things you would expect to be shared actually exist multiple times 19:35
i'm thinking this would happen with at least my-scoped things inside of compunits 19:36
lizmat: does that kind of thing ring a bell btw? a variable in a module not retaining changes when looked at from different files that require-d or use-d them, when concurrency is involved, even when there aren't crashes? 19:37
actually, can this be triggered without concurrency?
19:43 melezhik_ left, melezhik_ joined 19:47 melezhik_ left, melezhik_ joined 19:49 melezhik_ left 19:51 lizmat left
ugexe in rakudo i believe precomp files are loaded with CUR::AbsolutePath, and that should already prevent the same path from being loaded twice here - github.com/rakudo/rakudo/blob/b92e...akumod#L26 19:51
although there are edge cases with using the path like that since it would still load something twice if referenced by two different string paths i.e. foo/bar with e.g. foo/.bar foo/../foo/bar 19:53
timo i do see load-handle-for-path from CompUnit::PrecompilationRepository involved 19:56
indeed, that case we can't so trivially figure out
the SC ID inside of the CU should be a valid unique key, but I'm not sure if getting it out of the file is trivial enough for our purposes. i can generate something in the dependencies+deserialize frame that sends it out somehow, though 19:59
so the code path we have for your example code where we `require ::($module)` in multiple start blocks goes through PrecompilationRepository::load which checks if there's something loaded yet for a given name, and if it's not it goes ahead and loads it, and then stores the result. but it's only keeping the lock on the hash where these things are stored for the duration of the hash lookup and the 20:05
hash store, not the loading, so two threads tasked with loading the same thing at the same time will both see that nothing's loaded yet, go to load that thing, and store their results, with the winner clobbering the result of the loser
20:05 melezhik_ joined 20:08 melezhik_ left
ugexe i don't have time for following the code path to see if it applies here, but do note there is some file level locking when loading precompilation files as well: github.com/rakudo/rakudo/blob/b92e...akumod#L60 20:09
timo or i guess the loser clobbers the result of the winner
i think we might only use that when we precompile something and try to store the precompilation result on disk? 20:16
i mean the lock and unlock methods you linked
ugexe sounds correct 20:17
timo yeah, the issue i'm looking at right now happens when loading existing precomps, i haven't given parallel attempts to precompile the source of something a look yet in respect to this particular bit of trouble 20:18