00:57 librasteve_ left
03:00 sjn_ joined
03:02 sjn left
03:54 sjn_ is now known as sjn
05:03 melezhik joined
07:05 melezhik left
08:22 melezhik joined
09:36 librasteve_ joined
| melezhik | ugexe: I set RAKULIB to FEZ_INSTALL_TO; interestingly, I see a lot of "can open file .." errors during the zef test stage, like this (remember we have concurrent zef processes from podman containers trying to install things into shared storage, and my theory is they might remove each other's cache, leading to such errors) brw.sparrowhub.io/file_view/agent.r...tUnion.log | 10:13 | |
| What do you think? |
| www.irccloud.com/pastebin/mtdeHN4j | 10:18 | ||
| timo | there's probably some distros that wrongly use "use lib" in their tests, or put extra libs for testing purposes in a spot that isn't right | 11:58 | |
| i don't think `use lib ".";` or `use lib "test";` or something like that should ever be in tests? | |||
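A hypothetical sketch of the anti-pattern timo is describing (not taken from any particular distribution):

```raku
# t/01-basic.t -- hypothetical test file illustrating the anti-pattern
use lib '.';        # resolves relative to the current working directory, so the
                    # wrong (or no) library is found if the harness runs the test
                    # from a different directory
use lib 'xt/lib';   # ad-hoc test-only library path; same fragility
use Test;

use My::Module;     # may now resolve to an unintended copy of the module
ok My::Module.can('new'), 'module loads';
done-testing;
```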
12:15 lucs left
12:16 librasteve_ left
| melezhik | Yeah but this error only happens when we install on shared storage with multiple zefs | 12:17 | |
12:19 lucs joined
| timo | oh, i see | 12:23 | |
| lizmat | So, do we think the time is right to create our own CVE Numbering Authority for the Raku Programming Language ? | 13:07 | |
| github.com/ossf/wg-vulnerability-d...project.md | |||
| timo | does perl have its own CNA? do you know how they did it / are doing it? | 13:18 | |
13:18 melezhik3 joined
| timo | they do say that at least submitting the form for additional information is free and recommended if we are even somewhat interested | 13:19 | |
| melezhik3 | [Coke]: ab5tract: ok - let me reconsider my previous statement on the "no such file or directory" error, as it seems to appear even on a non-shared-storage installation - brw.sparrowhub.io/file_view/agent.r...::json.log | |
| timo | i'm not sure why for example red hat wouldn't be an okay address to ask for doing stuff for us | ||
| melezhik3 | sorry meant for timo: | 13:20 | |
| lizmat | timo: there's a cpan-sec group that runs a CNA for Perl | 13:21 | |
| melezhik3 | keep it in a gist gist.github.com/melezhik/7c85bafcf...2b862b5a3b as brownie reports vanish once the server is restarted | |
| timo | are stdout and stderr mis-ordered in that log? | 13:22 | |
| melezhik3 | more errors like that - gist.github.com/melezhik/7c85bafcf...nt-5835428 | 13:23 | |
| timo | that is a very weird error. can we see the contents of the exact archive used for this test? does it not have that file in it? if not, why does it even try to open that file in the first place? | ||
| is it a folder that another instance of zef is deleting as we are trying to run tests inside of it? | 13:24 | ||
| melezhik3 | in this log only stdout is shown | ||
| ab5tract: in THIS brownie round we don't use shared storage right? | 13:25 | ||
| timo | can we reproduce the issues with ZEF_TEST_DEGREE= set to just 2, and also with just 1? | ||
| ab5tract | melezhik3: that's correct | ||
| melezhik3 | but you still mount files via podman volumes ? | ||
| ab5tract | no | ||
| no volume instructions at all | |||
| these containers could probably go for a bit more memory | 13:26 | ||
| melezhik3 | but there is setting ZEF_INSTALL_TO=/root/ | 13:27 | |
| which is probably fine | |||
| one more error like that - gist.github.com/melezhik/7c85bafcf...2b862b5a3b | 13:29 | ||
| ab5tract | it should be. looks like the job just finished? | ||
| ok, we should do another run without any ZEF_INSTALL_TO set | 13:30 | ||
| melezhik3 | yeah, this is really weird |
| ab5tract | so that we are using the staging area for the installs. ugexe said that the staging is not done when ZEF_INSTALL_TO is set | 13:31 | |
| melezhik3 | :shrug: 🤷 | ||
| ab5tract | it might be that the existence of staging is specifically to avoid these kinds of errors :) |
| melezhik3 | how does that relate to test failures? |
| ok, do you want to just stop the current round and remove ZEF_INSTALL_TO entirely? | 13:32 |
| ab5tract | these aren't test failures, though. they are test files failing to exist | ||
| melezhik3 | yep | ||
| ab5tract | is it possible to conditionally set ZEF_INSTALL_TO? | 13:33 | |
| melezhik3 | yes, let me do the patch |
| so that if it is set to a non-empty string it is applied | 13:34 |
| if it is empty it does not apply | |||
| will this work ? | |||
| ab5tract | yup 👍 | ||
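A minimal sketch of what that conditional could look like on the agent side (hypothetical code; the real agent may build the zef invocation differently):

```raku
# Hypothetical agent-side helper: only pass ZEF_INSTALL_TO through to the zef
# child process when it is set to a non-empty string.
sub install-dist(Str $dist) {
    my %env = %*ENV;
    my $install-to = %env<ZEF_INSTALL_TO> // '';
    %env<ZEF_INSTALL_TO>:delete unless $install-to.chars;   # empty string => do not apply
    run 'zef', 'install', $dist, :env(%env);
}

install-dist('JSON::Fast');   # example invocation
```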
| melezhik3 | ok, it will take a few minutes to make a patch | 13:35 |
| ab5tract | btw here's some thoughts I had on how SSE could apply : gist.github.com/ab5tract/c52559003...eb2ed44330 | ||
| timo | ah, so you did mean server-sent events | 13:37 | |
| ab5tract | yeah :) | ||
| melezhik3: also I think if we were to use zef's libraries directly from within the agent, we would have structured results from the installation attempt | 13:39 |
| timo | yes, better than parsing stdout/stderr messages | ||
| melezhik3 | ab5tract: sure, we could refactor things later; right now I'm more focused on performance ) | 13:41 |
| ab5tract | sure | 13:46 | |
| ugexe | CompUnit::Repository::Staging, the core raku class for staging, simply doesn't work right with non-standard repos | 14:03 | |
14:03 librasteve_ joined
| ugexe | it might be more appropriate to say it only works with repositories that have names e.g. site, vendor, core, which non-standard repos do not | 14:04 | |
| my recollection is vague, but essentially the repository name gets baked into the precomp files, so the staging repo, which initially lives somewhere like /tmp when precompilation occurs, can be cp'd into the final location (usually the site directory under rakudo somewhere) without the path change causing issues | 14:11 |
| without the names, a staging repo precompiling into something like /tmp/foo will have paths including /tmp/foo baked into them, but /tmp/foo gets moved into something like /home/me/rakudo/whatever/site after testing passes, so the /tmp/foo that got precompiled into the files no longer exists | 14:13 |
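For reference, a small sketch of the distinction ugexe is drawing (illustrative only): named repositories are resolved by name, so their on-disk location can change, while ad-hoc repositories are addressed purely by path.

```raku
# Named repository: looked up by its name, so precomp files can refer to
# "site" rather than to whatever /tmp staging path existed at build time.
my $site = CompUnit::RepositoryRegistry.repository-for-name('site');
say $site.?prefix;

# Path-addressed repository: identified only by its path, so anything that
# recorded /tmp/foo breaks once /tmp/foo is moved to its final location.
my $custom = CompUnit::RepositoryRegistry.repository-for-spec('inst#/tmp/foo');
say $custom.prefix;
```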
| melezhik3 | so, that means technically setting ZEF_INSTALL_TO does not work with some modules? | 14:15 | |
| I guess that's what we are seeing? |
| ugexe | no | 14:16 | |
| timo | i don't think that's what ugexe means | ||
| ugexe | the reason zef_install_to pointing at a custom location doesn't use the staging repository is precisely so it *does* work | ||
| melezhik3 | ok, may I ask, going back to the initial context, why did the errors occur? | 14:17 |
| ugexe | presumably because there is "no such file or directory" | 14:18 | |
| melezhik3 | ? | ||
| ugexe | is that not the error you saw? | ||
| melezhik3 | not sure if I follow; why don't we see such an error when ZEF_INSTALL_TO is not set? |
| it's not just any error, but an error which I believe only occurs IF ZEF_INSTALL_TO is set | 14:19 |
| timo | the test files shouldn't be installed to any repository anyway. maybe it's just dumb luck that it works with one but not the other | ||
| ugexe | right | ||
| if i had to guess why it doesn't work it would be because of the module authors doing bad things | 14:20 | ||
| for example changing the current working directory | |||
| but that is outside of raku or zefs control, and is really something anyone could figure out if they poked around | |||
| melezhik3 | ok, in the next round of brownie we are going to run with ZEF_INSTALL_TO and see if errors remain for problematic modules | 14:21 |
| timo | i'd maybe look at an strace -f -e trace=%file for the whole process tree | 14:23 | |
| so that an unfortunate unlink could be spotted | 14:24 | ||
| ugexe | i would be surprised if something is deleting a test file | ||
| timo | me too. it'd have to be something like one process is doing `rm -r` on the extracted source archive while the other is trying to run tests inside of it | 14:25 | |
| but zef uses temporary folders for each extraction if i'm not mistaken? | 14:26 | ||
| ugexe | ah right, i would be surprised if something was deleting files specifically but it would be less surprising for something to delete the archive folder | ||
| timo | yeah, it could just be the first file it's trying to open after everything has been deleted | ||
| ab5tract | i think that's what's happening | ||
| timo | doesn't have to mean it's just a single file that's missing | ||
| ab5tract | it is that every time | ||
| (that I saw) | 14:27 | ||
| timo | is there a way to reproduce it without the entire orchestration and such? literally just the "zef install" line perhaps? | ||
| ugexe | there is github.com/ugexe/zef/blob/2c6ab9ff...kumod#L640 but $tmp points at a path like /tmp/asdfasf/MyDistribution-1.00 so for it to get deleted there have to be multiple installs trying to use that directory (which I'm not sure can happen) | 14:29 | |
14:29 melezhik3 left
| timo | so it'd have to be a `rm -rf /tmp/*` kind of thing that could do this ... also quite strange | 14:29 | |
| ab5tract | maybe not | 14:30 | |
| these are containers and they have a memory limit set | |||
| so couldn't it be a flush of tmpfs due to memory constraint? | |||
| timo | you mean extraction might be failing to write these files in the first place because it's a tmpfs? | ||
| ab5tract | or that too | ||
| timo | i don't think it'd flush, it's just error ENOSPC or something | ||
| when trying to write to the file | 14:31 | ||
14:31 melezhik_ joined
| ab5tract | I don't know how that works in a container, to be honest. I don't know what kind of guarantees linux makes for "persistence" in tmpfs | 14:33 | |
| But it seems like you might want to throw out tmpfs if you're about to OOM |
| melezhik: the rakudo we are testing with is pretty old btw | 14:34 | ||
14:36 melezhik_ left
| ugexe | it would probably be useful to see what the file layout of e.g. /tmp is when that happens | 14:36 | |
| to figure out if the file does actually exist and it's the cwd changing |
| additionally you could try commenting out that line I linked to and see if that fixes it. if so we could just add a flag to zef to disable it | 14:37 | ||
| ab5tract | wouldn't the dist always fail to install if it is a cwd issue or something? |
14:39 melezhik_ joined
| ugexe | if there is no parallelization, yes. with parallelization i'm not sure | 14:39 | |
| further, raising ZEF_TEST_DEGREE isn't a great idea (which is why it defaults to 1) | 14:42 |
| ab5tract | oh :) | ||
| ugexe | especially when the staging repo isn't used, i.e. using ZEF_INSTALL_TO | ||
| ab5tract | right, we had a bad combo there. | ||
| ugexe | the reason being that when the staging repo is not used, precompilation also occurs when the tests are run |
| and if tests are run in parallel you end up with rakudo trying to precompile a bunch of different distributions at the same time and things are not deterministic | 14:43 | ||
| melezhik_ | ok, what are we going to do next round: 1) disable FEZ_INSTALL_TO 2) no shared storage, which means concurrent zef installs are excluded 3) ZEF_*_DEGREE? I am not sure what value is better to use |
| ab5tract | I had no idea it was discouraged, just saw them in --help | ||
| melezhik_ | yep - me also ) | ||
| anyway, looks like ZEF_*_DEGREE=1 is a good default? | 14:44 |
| ugexe | well, it's set to 1 because of github.com/rakudo/rakudo/issues/1920 |
| zef already comes with sane defaults i believe | |||
| melezhik_ | yeah, I mean in case we would want to override those on our side ... | 14:45 | |
| ugexe | fetching is fine to set to whatever. it defaults to 5 i think | 14:46 | |
| melezhik_ | ok | 14:47 | |
| ab5tract: let's then ensure that ZEF_TEST_DEGREE=1 and ZEF_FETCH_DEGREE=5 in our agents' setup | 14:48 |
| ugexe | another hail mary might be to try setting ZEF_CONFIG_TEMPDIR to a different directory for each agent. while that might not be the actual solution we'd want, it would help determine a better solution assuming it works | 14:51 | |
| ab5tract | ok, so that's an interesting idea. I could bind a single tmpfs into the containers, and then set each one to a subdirectory | 14:54 | |
| My thought is that this will reduce the in-container memory pressure, even if the overall memory footprint from the host POV is the same | 14:55 | ||
| melezhik_ | ab5tract: I'd try that next round , this round I would like to test without ZEF_INSTALL_TO | 14:56 | |
| ab5tract | ok | 14:57 | |
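A rough sketch of ugexe's ZEF_CONFIG_TEMPDIR idea as it might look in an agent (hypothetical; AGENT_ID is an assumed per-container identifier, not something from the existing setup):

```raku
# Hypothetical: give every agent its own zef temp dir so concurrent zef runs
# never extract into (or clean up) the same directories.
my $agent-id = %*ENV<AGENT_ID> // $*PID;
my $tempdir  = "/tmp/zef-agent-$agent-id".IO;
$tempdir.mkdir unless $tempdir.d;

%*ENV<ZEF_CONFIG_TEMPDIR> = ~$tempdir;   # inherited by the zef child process
run 'zef', 'install', 'JSON::Fast';      # example invocation
```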
15:23 melezhik_ left, melezhik_ joined
15:37 melezhik_ left, melezhik_ joined
15:42 melezhik_ left, melezhik_ joined
15:48 melezhik_ left
15:50 melezhik_ joined
15:52 melezhik_ left
| timo | why was i never pointed at R#1920? it's relatively clear how it's going wrong: two threads try to run MVM_serialization_deserialize against the same SC, which conflicts quite explosively | 16:03 |
| linkable6 | R#1920 [open]: github.com/rakudo/rakudo/issues/1920 [SEGV] Module loading is not thread safe | ||
| ab5tract points timo towards R#1920 | 16:04 | ||
| lizmat hopes timo can create a fix | 16:05 | ||
16:06 ds7832 joined
| timo | the question is, what do we want to happen exactly when an SC is attempted to be deserialized a second time. the caller of nqp::deserialize passes in arrays that they expect to be filled. do we just fill them with the data from the previous run? | 16:08 | |
| ab5tract | maybe try that and see what negative externalities it introduces (if any)? | 16:09 | |
| timo | ah, the caller of the nqp::deserialize in question is actually the dependencies+deserialize frame in the precompiled file | 16:17 | |
| Geth | rakudo/main: b92e33cc8e | (Elizabeth Mattijsen)++ | src/core.c/Junction.rakumod: Give any() a @foo candidate. Commit 51c3d86ccf removed this candidate (which apparently made things go a lot faster for its siblings all(), one() and none()) because it broke tests. Today, it does not break any tests, so there's no reason not to have an any(@foo) candidate. Spotted by arkiuat++ | 16:24 |
| timo | if we just abort the dependencies+deserialize when a "this SC is already loaded" situation is detected, we may very well return from the loadbytecode op before the other thread that's loading the CU at the same time has finished running the same dependencies+deserialize, and that could cause bad trouble as well | 16:44 |
| the SC isn't really the right hook to place a collision detection mechanism, the CU would be a better place | 16:45 | ||
| lizmat | could we make the SC thread-local while it is being deserialized? | 16:46 |
| timo | i don't think that is sensible | 16:51 | |
| especially since we lazily deserialize parts of the SC whenever they are needed, which can be minutes, hours, days after the deserialize process is started | 16:52 | ||
| the part that nqp::deserialize does is relatively brief, but the parts that dependencies+deserialize does after that call are important for things to work correctly | 16:53 | ||
16:56 librasteve_ left
| timo | oh, the thing i was just implementing makes even less sense now that i think about it, since different CUs can very well have shared SC dependencies, so if one of them is already there, that doesn't mean they all have been handled by some other thread in the mean time | 16:57 | |
| extra fun: if you have RAKUDO_MODULE_DEBUG=1 and do the multi-thread require mode, you'll get warnings about the `$level` state variable in Process.rakumod being uninitialized | 17:02 | ||
| lizmat | heh | 17:03 | |
17:08 librasteve_ joined
| timo | i see now in the middle of the issue thread there's a few patches with locking suggestions, that's a good start, i'll have to think a bit if that's enough to get things right | 17:12 | |
| ah, i was wrong again | 17:17 | ||
| lizmat | .oO( progressive insight ) | |
| ugexe | i don't think we want to add those locks though | 17:29 | |
| i want to say nine mentioned as much but i could be misremembering | 17:30 | ||
| (if you're talking about all the locks i demonstrated in the issue on github) | 17:31 | ||
| timo | yeah i tentatively agree | 17:37 | |
17:38 melezhik_ joined
| timo | the immediate thing that explodes the code is that an SC ID is kind of process-global | 17:40 | |
| (also, dependencies+deserialize grabs hllsym "GLOBAL" and works with it, which ... are we careful about that?) | 17:42 | ||
17:42 melezhik_ left
17:55 melezhik_ joined
18:07 melezhik_ left, melezhik_ joined
| melezhik_ | . | 18:19 | |
| lizmat | m: dd &(42,666) # feels this should warn or do something else than just silently assume $ | 18:20 | |
| camelia | $(42, 666) | ||
| timo | looks like it code-gens to .item, also for other stuff you can put in the parens | 18:26 | |
| lizmat | yeah, I guess it checks for '@', '%' and then elses to the .item case | 18:28 | |
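A minimal illustration of what camelia showed above (the codegen detail is lizmat's guess, not verified here):

```raku
# &(...) on a plain list currently produces the same result as $(...),
# i.e. it silently falls through to .item instead of warning.
dd $(42, 666);   # $(42, 666)
dd &(42, 666);   # $(42, 666) (identical, no warning)
```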
| timo | i feel like instead of doing mitigation after a CU has been mapped a second time, we should just™ not attempt to load the same CU twice | 18:34 | |
| normally moarvm handles that when we load CUs from a filename. then it will use the filename as key for the "already loaded" hash | 18:35 | ||
| but we mostly load CUs from Blob objects | |||
| probably because of the stuff that sits before the MOARVM header, like repo ids and dependencies and stuff | |||
| oh, huh. | 18:38 | ||
| disregard what i just said | 18:40 | ||
18:43 melezhik_ left
18:44 melezhik_ joined
| melezhik_ | so the brownie round with ZEF_INSTALL_TO unset produced zero "failed to stat file" errors for 470 processed distributions | 18:46 |
| gist.github.com/melezhik/877dff68a...50d865456c | |||
18:49 melezhik_ left
| melezhik | Zero errors | 18:51 | |
19:11 melezhik_ joined
| timo | so, preventing the same file from being loaded even when using loadbytecodefh isn't hard, and it is a relatively early spot to do an abort for duplicate files. however, if we don't run the bytecode's `<load>` frame, $*CTXSAVE.ctxsave is never called, so the CompUnit::Handle instance that load-precompilation-file creates and returns won't have anything put into it when a file is loaded for the second time | 19:21 |
| since the CTXSAVE stuff is merely a side effect, there's not really an obvious way for moarvm to remember something for the existing compunit to return or otherwise offer | 19:22 | ||
| so i think CompUnit::Loader will have to do that remembering by itself | |||
19:32 melezhik_ left, melezhik_ joined
| timo | well, CU::L isn't meant to be instantiated | 19:32 | |
| and i think if we just do locking so that we don't concurrently load the same thing twice, we will prevent crashes due to memory corruption, but if we just allow the same precomped file to be loaded twice "independently" we will end up with strange effects where things you would expect to be shared actually exist multiple times | 19:35 | ||
| i'm thinking this would happen with at least my-scoped things inside of compunits | 19:36 | ||
| lizmat: does that kind of thing ring a bell btw? a variable in a module not retaining changes when looked at from different files that require-d or use-d them, when concurrency is involved, even when there aren't crashes? | 19:37 | ||
| actually, can this be triggered without concurrency? | |||
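One hypothetical way the symptom timo is asking about could show up (Counter is an invented module, purely for illustration):

```raku
# lib/Counter.rakumod (hypothetical):
#
#   unit module Counter;
#   my $count = 0;
#   our sub bump()    { ++$count }
#   our sub current() { $count }
#
# If two threads race to load the precompiled Counter and each ends up with
# its own copy, state mutated through one copy is invisible via the other:
my $module = 'Counter';
await start { require ::($module); ::('Counter::&bump')() },
      start { require ::($module); ::('Counter::&bump')() };
say ::('Counter::&current')();   # expected 2; less if duplicate copies were loaded
```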
19:43 melezhik_ left, melezhik_ joined
19:47 melezhik_ left, melezhik_ joined
19:49 melezhik_ left
19:51 lizmat left
| ugexe | in rakudo i believe precomp files are loaded with CUR::AbsolutePath, and that should already prevent the same path from being loaded twice here - github.com/rakudo/rakudo/blob/b92e...akumod#L26 | 19:51 | |
| although there are edge cases with using the path like that, since it would still load something twice if referenced by two different string paths, e.g. foo/bar vs foo/./bar or foo/../foo/bar | 19:53 |
| timo | i do see load-handle-for-path from CompUnit::PrecompilationRepository involved | 19:56 | |
| indeed, that case we can't so trivially figure out | |||
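A tiny sketch of that edge case (illustrative; `.resolve` is just one way to canonicalize and has its own caveats around symlinks and not-yet-existing paths):

```raku
# Two different strings naming the same file defeat a cache keyed on the raw
# path string.
my $a = 'foo/bar';
my $b = 'foo/../foo/bar';
say $a eq $b;                                 # False: looks like two distinct entries
say $a.IO.resolve.Str eq $b.IO.resolve.Str;   # True once canonicalized (given foo/ exists)
```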
| the SC ID inside of the CU should be a valid unique key, but I'm not sure if getting it out of the file is trivial enough for our purposes. i can generate something in the dependencies+deserialize frame that sends it out somehow, though | 19:59 | ||
| so the code path we have for your example code where we `require ::($module)` in multiple start blocks goes through PrecompilationRepository::load which checks if there's something loaded yet for a given name, and if it's not it goes ahead and loads it, and then stores the result. but it's only keeping the lock on the hash where these things are stored for the duration of the hash lookup and the hash store, not the loading, so two threads tasked with loading the same thing at the same time will both see that nothing's loaded yet, go to load that thing, and store their results, with the winner clobbering the result of the loser | 20:05 |
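In sketch form, the shape of the race timo describes versus one way to close it (this is not the actual PrecompilationRepository code, and holding a lock across the load has its own hazards, as discussed above):

```raku
my %loaded;
my $lock = Lock.new;

# Racy shape: the lock covers only the lookup and the store, not the load
# itself, so two threads can both see "nothing loaded yet" and both load.
sub load-racy($id, &load-it) {
    with $lock.protect({ %loaded{$id} }) { return $_ }
    my $handle = load-it();                      # both threads can reach this
    $lock.protect({ %loaded{$id} = $handle });   # second writer clobbers the first
    $handle
}

# Safer shape: hold the lock across the whole check-load-store sequence so the
# second thread waits for, and then reuses, the first thread's handle.
sub load-once($id, &load-it) {
    $lock.protect({ %loaded{$id} //= load-it() })
}
```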
20:05 melezhik_ joined
20:08 melezhik_ left
| ugexe | i don't have time for following the code path to see if it applies here, but do note there is some file level locking when loading precompilation files as well: github.com/rakudo/rakudo/blob/b92e...akumod#L60 | 20:09 | |
| timo | or i guess the loser clobbers the result of the winner | ||
| i think we might only use that when we precompile something and try to store the precompilation result on disk? | 20:16 | ||
| i mean the lock and unlock methods you linked | |||
| ugexe | sounds correct | 20:17 | |
| timo | yeah, the issue i'm looking at right now happens when loading existing precomps, i haven't given parallel attempts to precompile the source of something a look yet in respect to this particular bit of trouble | 20:18 | |