03:01 librasteve_ left
05:13 melezhik joined
| melezhik | [Coke]: ab5tract: the issue is reproduced on my agent as well - gist.github.com/melezhik/c9414b605...5a6f678c21 - investigating. Interesting: it’s a 599 HTTP error on a specific HTTP method; however, I reiterate that the agent web server is alive | 06:01 | |
07:36 sjn left
| melezhik | I’ve read up on 599 errors; it implies resource exhaustion | 08:19 | |
| usercontent.irccloud-cdn.com/file/...058887.JPG | 08:21 | ||
09:40 disbot5 left
09:41 disbot6 joined
10:32 melezhik left
11:25 melezhik joined
| melezhik | I have gathered a piece of the log here for clarity - gist.github.com/melezhik/a0fc8908f...2d3737f44b | 11:25 | |
| So in the end we have this important bit: “599 : Malformed Status-Line:” | 11:26 | ||
| Maybe it’s not that the web server is unavailable, as I thought initially; it just fails to handle these weird Acme:: non-ASCII-symbol URI requests? | 11:27 | ||
| If this is the case, it’s easy to fix | |||
| Maybe 🤔 I just need to add Acme::ಠ_ಠ to the skip list? | 11:30 | ||
| timo | you would expect that to have to be urlencoded in order to be sent over http | 11:32 | |
| melezhik | Yep. | 11:33 | |
| This will also work | |||
| lizmat | fwiw, that module exists exactly for this purpose :-) | 11:37 | |
| melezhik | lizmat: ++ | 11:43 | |
| Here we go - github.com/melezhik/brownie/commit...1e5b788ce5 | |||
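A minimal sketch of the percent-encoding timo suggests, assuming a hand-rolled helper (`uri-escape` is hypothetical; brownie's actual change is in the commit linked above):

```raku
# Hypothetical helper, not brownie's actual code: percent-encode
# every byte outside the RFC 3986 unreserved set, so a module name
# like "Acme::ಠ_ಠ" yields a well-formed request line.
sub uri-escape(Str $s --> Str) {
    $s.subst(
        / <-[ A..Z a..z 0..9 \- . _ ~ ]> /,
        { .Str.encode('utf8').list.map({ sprintf '%%%02X', $_ }).join },
        :g,
    );
}

say uri-escape('Acme::ಠ_ಠ');   # Acme%3A%3A%E0%B2%A0_%E0%B2%A0
```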
| [Coke]: ab5tract: please update the agent from the last commit and start it. I believe I have fixed the 599 error issue, but we need to check | 13:38 | ||
| [Coke] | rebuilding the agent, just in case. | 13:45 | |
| running, will check in later. | 13:46 | ||
| melezhik | [Coke]: ++ | 14:14 | |
| [Coke] | 2025-11-25T14:31:47.763186Z --- [agent] neither crontab nor scm setup found, consider manual start, SKIP ... | 14:34 | |
| came back, web site down | |||
| container still running | |||
14:40 melezhik_ joined
| melezhik_ | . | 14:40 | |
| [Coke] | there is no 599 in ~/.sparky/*.log | 14:42 | |
| melezhik_ | yep 👍 | ||
| started another round of 200 tests | 14:43 | ||
| [Coke]: what is the name of the agent you run? | |||
| I only see wonder-thunder, which is, I guess, ab5tract’s agent | 14:44 | ||
| and practical-euclid which is mine | |||
| brw.sparrowhub.io/builds | |||
| ab5tract | seems to be working | 14:48 | |
| melezhik_: are there still issues with job id collisions? | 14:49 | ||
| melezhik_ | what do you mean ? | ||
| ab5tract | brownie used to stop working because two jobs with the same id were generated | 14:50 | |
| melezhik_ | are you talking about agent? | 14:51 | |
| ab5tract | I don't recall a collision in the agent IDs. those were stable in a run. doesn't sparky work by creating job files with random numbers as names? | 14:52 | |
| melezhik_ | if not specified (which is often the case), the job id is generated as in github.com/melezhik/sparky-job-api...akumod#L13 | 14:53 | |
| I am not sure what collision you're talking about, but this thing is random enough | 14:54 | ||
| ab5tract | Maybe I'm misremembering, but I thought that at one point you modified the generation function due to collisions in generated job IDs | 14:55 | |
| melezhik_ | so when a job file is attached (via an HTTP POST call) to a job, the URI of the file has a job_id | ||
| ah ... ok | |||
| we are talking about job names (aka project names), which are not job ids | 14:56 | ||
| so yeah, parallelization in Sparky relies on the fact that it always runs jobs from different projects in parallel | |||
| ab5tract | ah, there's the confusion :) | ||
| melezhik_ | and from the other side (the same idea), if we run a job for the same project many times, the run requests will be placed in a queue for this project | 14:57 | |
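A hedged sketch of that scheduling rule (illustrative only, not Sparky's actual code; the `submit` helper is hypothetical): one serial queue per project, with different projects draining in parallel:

```raku
# One Channel per project, drained serially by its own worker:
# repeated runs of the same project queue up, while jobs for
# different projects proceed in parallel. Assumes a single-threaded
# submitter, as described above for the master job.
my %queues;

sub submit(Str $project, &job) {
    my $ch = %queues{$project} //= do {
        my $c = Channel.new;
        start { react { whenever $c -> &j { j() } } };
        $c;
    };
    $ch.send(&job);
}

submit('proj-a', { say 'a: first run' });
submit('proj-a', { say 'a: queued behind the first' });
submit('proj-b', { say 'b: runs in parallel with a' });
sleep 1;   # give the demo workers time to drain
```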
14:57 sjn joined
| melezhik_ | this is why randomly generating project names is not very good, as the default Raku random INT generator is not perfect; this is why at some point I changed to Linux epochs | 14:58 | |
| github.com/melezhik/brownie/blob/6...wfile#L136 | 15:02 | ||
15:03 sjn left
| ugexe | is that really any better? | 15:08 | |
| i.e. can't two agents be created at the same time? | |||
| melezhik_ | yep, I can guarantee a random name within an agent, given that the master job spawning those child jobs runs in a single thread | 15:09 | |
| yeah, this is job name for a specific agent, not agent name in global pool | 15:10 | ||
| ugexe | i generally use timestamp + process id | ||
| melezhik_ | but this ( I agree ) would not guarantee if those project names were in global | ||
| yeah - pretty much what I use for Sparky job ids (random string + PID ) | 15:11 | ||
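A hedged sketch of the two id schemes just mentioned (the helper names are hypothetical; the real generators live in sparky-job-api and the brownie sparrowfile linked above):

```raku
# ugexe's scheme: epoch timestamp + process id.
sub id-epoch-pid(--> Str) { "{time}-{$*PID}" }

# melezhik's scheme: random lowercase string + PID.
sub id-random-pid(Int $len = 20 --> Str) {
    ('a'..'z').roll($len).join ~ '-' ~ $*PID
}

say id-epoch-pid();    # e.g. 1764082800-4242
say id-random-pid();   # e.g. qwhzmtplvanrdskcbiof-4242
```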
| ab5tract: wonder-thunder looks good ) | 15:12 | ||
15:24 sjn joined
| ab5tract | is the run over? | 15:25 | |
| melezhik_ | not yet | 15:26 | |
| tests total: 200 | finished tests: 191 | 15:27 | ||
| but it is finishing | 15:28 | ||
| ping from agent: wonder-thunder-21935026, version: 0.0.24, jobs-run-cnt: 3, max-threads: 6 | |||
| only 3 jobs now on wonder-thunder | |||
| ab5tract | we're winding down | ||
| melezhik_ | finished 200 modules - brw.sparrowhub.io/report/brw-orch/241 | 15:36 | |
| running 1000 now | |||
| still don't see [Coke]'s agent | 15:37 | ||
| ab5tract | do I need to re-start the agent? | ||
| melezhik_ | no | ||
| ab5tract | cool :) | ||
| melezhik_ | agents just tap in. Even though I run a new round on the o10r, agents don't need a restart | 15:38 | |
| so far so good, no 599 errors ) | |||
| btw, what about CPU load on your agent, ab5tract: ? | 15:39 | ||
| [Coke] | melezhik_: "cokebot" | ||
| melezhik_ | [Coke]: yeah, I don't see pings from it | ||
| [Coke] | melezhik_: you never saw anything from cokebot today? | 15:40 | |
| I'll restart it. | |||
| melezhik_ | what does localhost:4000/builds say? | ||
| [Coke] | re-run | ||
| melezhik_ | once I saw cokebot , but not sure maybe yesterday | ||
| ok, now I see the ping | 15:42 | ||
| ping from agent: cokebot-41962464, version: 0.0.24, jobs-run-cnt: 0, max-threads: 4 | |||
| [Coke] | looks like one job succeeded? | 15:51 | |
| again, is something like "mjzprhtxdwavynfblsqi" a kind of GUID? (should we use actual guids?) | |||
| melezhik_ | if you go to localhost:4000/builds you will see how many jobs are done on your agent | 15:53 | |
| where do you get mjzprhtxdwavynfblsqi ? | |||
| to see stats from all agents, one needs to go to the o10r page - brw.sparrowhub.io/project/brw-orch/builds | 15:54 | ||
| ab5tract | [Coke]: yeah, that thingy is a sort of GUID .. I think it's the same thing that melezhik was just clarifying for me (job name?) but I'm not 100% sure | 15:55 | |
| melezhik_ | yes, this is job id | 15:56 | |
| ab5tract | and job id is different than job name :) | 15:57 | |
| melezhik_ | but again, to see which jobs are finished and how many, it's better to use the UI | ||
| yes ) | |||
| a job id is a kind of ID for a job run | |||
| ab5tract | melezhik_: I do think it would be helpful for you to work on a script that we can run locally that will generate the output you want | 15:58 | |
| that will make our side of the equation more pleasant | |||
| melezhik_ | so a job, or project, or sparky project, is a Raku scenario that gets run; it could be run either by cron, or triggered by SCM changes, or from some "parent" job | ||
| ab5tract | I don't want to deny you logs, but I'm also terribly lazy and easily distracted to boot | ||
| melezhik_ | so when a job is run, it's assigned a JOB ID to track this run (aka build) | 15:59 | |
| also, a build has an internal ID, which is an INT | |||
| ab5tract | I think it also might be helpful to always call the project-style job a project; then the "job job" can just be job | ||
| melezhik_ | ok, fair enough ) | ||
| ab5tract | eg, "a project can be launched from a parent project" | ||
| probably this will need some tweaking in the logs | 16:00 | ||
| melezhik_ | sorry, I have lived with the terminology on my own for a long time, and I think for newcomers it's not very easy to grasp ), sorry | ||
| "a project can be launched from a parent project" - yep | 16:01 | ||
| ab5tract | no worries, just want to understand what's going on as best as possible :) | ||
| melezhik_ | yeah | 16:02 | |
| ab5tract | so with the new terminology, the phrase "job name" becomes the same as "job id" | ||
| melezhik_ | jobs / builds are explained in 2 places - github.com/melezhik/sparky/tree/master - basic intro | ||
| ab5tract | because the old meaning of "job name" is now only referred to as "project name" | ||
| melezhik_ | and github.com/melezhik/sparky/blob/ma...job_api.md - in depth explanation of job ids, parent / child jobs , etc | 16:03 | |
| ||| TESTS STAT: time: 27m | tests total: 1000 | finished tests: 341 | sent to queue: dist=0 / redist=30 | agents cnt: 3 | 16:04 | ||
| ab5tract | not bad! | ||
| melezhik_ | bearing in mind that 200 tests had been done in previous runs, we have 141 tests done in 27 min | ||
| which is also good | |||
| it's about 5 tests per minute, 300 tests per hour, 1800 tests per 6 hours; but I guess the more tests agents finish, the more cache they have, and the faster further tests finish | 16:06 | ||
| so this could be even faster than this naive approximation | 16:07 | ||
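A quick sanity check of that estimate, using the figures from the log above:

```raku
my $done    = 341 - 200;   # tests finished in this run so far
my $minutes = 27;
say ($done / $minutes).round(0.1);   # 5.2  tests per minute
say ($done / $minutes * 60).round;   # 313  tests per hour
```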
| also my agent only runs in 2 threads | |||
| I guess ab5tract's agent runs in 6 threads | |||
| and [Coke]'s in 4 threads | |||
| ab5tract | should do, yeah | 16:08 | |
| melezhik_ | [Coke]: I guess you get those GUIDs by tailing the container logs (which is the sparkyd log), and this is fine. Those are just finished job ids; another way to see things happening is localhost:4000/builds. Hopefully this helps | 16:12 | |
| you can also go to any job by clicking its ID ref and get the job report, and even job artifacts, which are some useful files, like raku env data, etc | 16:13 | ||
| but it's up to you if you prefer to get things from the command line | |||
| I think in future I can create a helpful script to get pretty much the same info by running docker / podman exec /some/util | 16:14 | ||
| for terminal users ) | |||
| sparky was initially designed with the idea of the UI as a dashboard, and the agent runs as a sparky job | 16:15 | ||
| [Coke] | web app dead again | 16:19 | |
| melezhik_ | ab5tract: will "docker|podman exec agent stat" work for making the "equation more pleasant"? )) | ||
| so `curl localhost:4000` does not respond from the host machine? | 16:20 | ||
| but the container is running, right? | 16:21 | ||
| I can only suggest that the kernel kills the process at some point, but I may be wrong | |||
| it's possible to run the web app with `docker exec -it sparman worker_ui start` | 16:22 | ||
| [Coke] | Yup, and with the base image, I can't run dmesg to figure out why | ||
| melezhik_ | can you run `sparman worker_ui status` from within the container? | 16:23 | |
| [Coke] | AHA | ||
| docker run --privileged -e "container=docker" --rm -it --name agent -p 4000:4000 -e BRW_AGENT_NAME_PREFIX=cokebot agent | 16:24 | ||
| ^^ that enables dmesg. | |||
| melezhik_ | ok | ||
| [Coke] | [ 7399.502740] Out of memory: Killed process 18529 (raku) total-vm:1929832kB, anon-rss:552116kB, file-rss:420kB, shmem-rss:0kB, UID:0 pgtables:1700kB oom_score_adj:0 | ||
| melezhik_ | yeah, expected | ||
| sparky web, which is a cro web app, is quite demanding on RAM | |||
| I guess it needs at least about 2-4 GB RAM | 16:25 | ||
| 4 ideally | |||
| ab5tract | :O | ||
| melezhik_ | but it does not leak | ||
| ab5tract | there was one of these running in each agent when we were doing multiple agents | ||
| no wonder the computer kept freezing lol | |||
| melezhik_ | Oh, sorry ))) | ||
| [Coke] | found the memory settings in docker desktop on mac | 16:26 | |
| ok, restarted in priv'd mode with 20GB memory 4GB swap | 16:27 | ||
| (will probably start impacting my actual laptop if it gets greedy. :) | 16:28 | ||
| Might want to have two running instances, one for the web app, one for the testing. | |||
| (because some of the tests are themselves memory hungry, and you don't want them to clobber the webapp) | 16:29 | ||
| ok, will leave *this* running for a bit and see if we do better. | 16:30 | ||
16:30 melezhik_ left
16:40 melezhik_ joined
| melezhik_ | . | 16:41 | |
| [Coke] | correct me if I'm wrong, but there is no effort to order the jobs with least to most dependencies? | 16:42 | |
| so I could get as my first four jobs things like Cro and Red, which are just going to take a LONG time to test. | |||
| AND you're testing all the dependencies as well? | 16:43 | ||
| if we're not saving the fact that we've tested the dependencies, we can at least install the deps with --/test so we're not spending time running those tests 2x. | 16:44 | ||
| (e.g. my first job this run was "Collection", which has a huge # of deps) | 16:45 | ||
16:47 melezhik_ left, melezhik_ joined
| [Coke] | Or maybe have some more logic in terms of which groups of modules are handed out - so instead of X unrelated modules, maybe we get some with shared dependencies for a speedup? | 16:49 | |
| Would also be nice if we had a way to analyze the logs to determine what was being tested *when* the OOM killer happens, so that we can narrow the list of suspected modules to perhaps add to the skiplist | 16:51 | ||
| looks like one job completed (with an install failure on Color::Named) | 16:52 | ||
| melezhik_ | [Coke]: another way to see what jobs are completed is by looking at the centralized orchestrator (we call it the o10r) dashboard at brw.sparrowhub.io/builds | 16:53 | |
| you will find $agent_name.$job_number.report jobs here | 16:54 | ||
| for example brw.sparrowhub.io/project/cokebot-9...094.report | |||
| "correct me if I'm wrong, but there is no effort to order the jobs with least to most dependencies?" yes we - can traverse modules "river" by modules sorted by reverse dependencies count | 16:55 | ||
| currently this is commented - github.com/melezhik/brownie/blob/6...wfile#L151 | 16:56 | ||
| but if we think this is more optimal way to test modules it's easy to enable | 16:57 | ||
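A hedged sketch of that ordering (the data shape and counts are made up for illustration; the real candidate list comes from the brownie sparrowfile linked above):

```raku
# Sort candidates by reverse-dependency count, most-depended-on
# first, so widely shared modules are tested, and cached, early.
my @modules =
    { name => 'Cro',        rev-deps => 120 },
    { name => 'Red',        rev-deps =>  45 },
    { name => 'Collection', rev-deps =>   3 },
;

my @ordered = @modules.sort(-> %m { -%m<rev-deps> });
say @ordered.map(*<name>).join(', ');   # Cro, Red, Collection
```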
| "Would also be nice if we had a way to analyze the logs to determine what was being tested *when* the OOM killer happens" sure, if or cause RAM is eaten by module test not by sparky web itself | |||
| "so I could get as my first four jobs things like Cro and Red, which are just going to take a LONG time to test." - yeah, but bear in mind once those LONG list of decencies is installed, the further modules tests are going to be presumably faster given that some or many dependencies are already met | 16:58 | ||
| so "strategically" if we run tests on distributed pool of agents it works for good | 16:59 | ||
| because eventually all agents will build up cache of dependencies which makes it faster to install further modules, but this is just a hypothesis, I might be wrong here | 17:01 | ||
| "so we're not spending time running those tests 2x." we never run tests twice for that matter, as zef would just skip installing dependencies ( and so running tests ) if they are already installed | 17:02 | ||
17:36 melezhik_ left, melezhik_ joined
17:42 melezhik_ left
18:12 melezhik left
18:22 melezhik_ joined
| melezhik_ | tests total: 1000 | finished tests: 625 | 18:22 | |
| so far | |||
| I made some performance optimizations to skip modules already tested by other agents; this should give some speedup as well | 18:23 | ||
18:27 melezhik_ left
18:33 melezhik_ joined
| melezhik_ | to be more accurate: it synchronizes such information across all agents more frequently and faster | 18:36 | |
18:40 melezhik_ left
18:44 melezhik_ joined
18:48 melezhik_ left
19:09 melezhik joined
19:15 melezhik_ joined
19:19 melezhik_ left
| [Coke] | melezhik: if module A depends on module B, tests for B and A are run. What then happens when module B comes up in the list? Instant success because it was already installed? Or do you run the tests again at that point? | 20:28 | |
| releasable6: next | 21:01 | ||
| releasable6 | [Coke], Next release in ≈24 days and ≈21 hours. There are no known blockers. 0 out of 29 commits logged | ||
| [Coke], Details: gist.github.com/0e24e397ec24021522...37af2bee02 | |||
21:10 melezhik left