03:01 librasteve_ left 05:13 melezhik joined
melezhik [Coke]: ab5tract: the issue is reproduced on my agent as well gist.github.com/melezhik/c9414b605...5a6f678c21 , investigating. Interesting: it's a 599 HTTP error on a specific HTTP method; however, when iterating, that agent's web server is alive 06:01
07:36 sjn left
melezhik I've read up on 599 errors; it implies resource exhaustion 08:19
usercontent.irccloud-cdn.com/file/...058887.JPG 08:21
09:40 disbot5 left 09:41 disbot6 joined 10:32 melezhik left 11:25 melezhik joined
melezhik I have gathered a piece of the log here for clarity - gist.github.com/melezhik/a0fc8908f...2d3737f44b 11:25
So in the end we have this important bit “599 : Malformed Status-Line:” 11:26
Maybe it's not that the web server is unavailable, as I thought initially; it just fails to handle these weird Acme:: non-ASCII-symbol URI requests? 11:27
If this is the case - it's easy to fix
Maybe 🤔 I just need to add Acme::ಠ_ಠ to the skip list? 11:30
timo you would expect that to have to be urlencoded in order to be sent over http 11:32
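For illustration only (a sketch of timo's suggestion, not the fix that was eventually committed): a hand-rolled percent-encoder in Raku that turns a non-ASCII module name into a plain-ASCII URI component.

```raku
# hedged sketch: percent-encode every byte outside the URI "unreserved" set,
# so names like Acme::ಠ_ಠ survive being put into an HTTP request line
sub percent-encode(Str $s --> Str) {
    $s.encode('utf8').list.map(-> $byte {
        my $c = $byte.chr;
        $c ~~ /<[A..Z a..z 0..9 \- . _ ~]>/ ?? $c !! sprintf('%%%02X', $byte)
    }).join
}

say percent-encode('Acme::ಠ_ಠ');   # Acme%3A%3A%E0%B2%A0_%E0%B2%A0
```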
melezhik Yep. 11:33
This will also work
lizmat fwiw, that module exists exactly for this purpose :-) 11:37
melezhik lizmat: ++ 11:43
Here we go - github.com/melezhik/brownie/commit...1e5b788ce5
[Coke]: ab5tract: please update the agent from the last commit and start it. I believe I have fixed the 599 error issue, but we need to check 13:38
[Coke] rebuilding the agent, just in case. 13:45
running, will check in later. 13:46
melezhik [Coke]: ++ 14:14
[Coke] 2025-11-25T14:31:47.763186Z --- [agent] neither crontab nor scm setup found, consider manual start, SKIP ... 14:34
came back, web site down
container still running
14:40 melezhik_ joined
melezhik_ . 14:40
[Coke] there is no 599 in ~/.sparky/*.log 14:42
melezhik_ yep 👍
started another round of 200 tests 14:43
[Coke]: what is the name of the agent you run?
I only see wonder-thunder, which I guess is ab5tract's agent 14:44
and practical-euclid which is mine
brw.sparrowhub.io/builds
ab5tract seems to be working 14:48
melezhik_: are there still issues with job id collisions? 14:49
melezhik_ what do you mean ?
ab5tract brownie used to stop working because two jobs with the same id were generated 14:50
melezhik_ are you talking about agent? 14:51
ab5tract I don't recall a collision in the agent IDs. those were stable in a run. doesn't sparky work by creating job files with random numbers as names? 14:52
melezhik_ if not specified (which is often the case), the job id is generated as - github.com/melezhik/sparky-job-api...akumod#L13 14:53
I am not sure what collision you're talking about, but this thing is random enough 14:54
ab5tract Maybe I'm misremembering, but I thought that at one point you modified the generation function due to collisions in generated job IDs 14:55
melezhik_ so when a job file is attached (via an HTTP POST call) to a job, the URI of the file has a job_id
ah ... ok
we are talking about job names (aka project names), which are not job ids 14:56
so yeah, parallelization in Sparky relies on the fact that it always runs jobs from different projects in parallel
ab5tract ah, there's the confusion :)
melezhik_ and from the other side (the same idea), if we run a job for the same project many times, the run requests will be placed in a queue for this project 14:57
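A toy model of the scheduling rule described above (hypothetical code, not Sparky's implementation): one queue per project, so runs of the same project are serialized while different projects proceed in parallel.

```raku
# hedged sketch: one Channel and one worker per project
my %queues;
sub enqueue(Str $project, &work) {
    my $q = %queues{$project} //= do {
        my $c = Channel.new;
        start { react { whenever $c { .() } } }   # single worker per project
        $c;
    };
    $q.send(&work);
}

enqueue('proj-a', { say 'proj-a run 1' });
enqueue('proj-a', { say 'proj-a run 2' });   # queued behind run 1
enqueue('proj-b', { say 'proj-b run 1' });   # runs in parallel with proj-a
sleep 1;                                     # let the toy workers drain
```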
14:57 sjn joined
melezhik_ this is why randomly generating project names is not very good, as the default Raku random INT generator is not perfect; this is why at some point I changed to Linux epochs 14:58
github.com/melezhik/brownie/blob/6...wfile#L136 15:02
15:03 sjn left
ugexe is that really any better? 15:08
i.e. can't two agents be created at the same time
melezhik_ yep, I guarantee unique random names within an agent, given that the master job spawning those child jobs runs in a single thread 15:09
yeah, this is the job name for a specific agent, not an agent name in the global pool 15:10
ugexe i generally use timestamp + process id
melezhik_ but this (I agree) would not guarantee uniqueness if those project names were in a global pool
yeah - pretty much what I use for Sparky job ids (random string + PID ) 15:11
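A hedged sketch of that naming scheme (not Sparky's actual generator): epoch seconds plus PID plus a short random suffix collides far less often than a bare random integer.

```raku
# hypothetical helper: job/project name from epoch + PID + random suffix
sub gen-job-name(Str $prefix = 'job') {
    my $suffix = ('a'..'z').roll(4).join;
    "$prefix-{time}-{$*PID}-$suffix"
}

say gen-job-name();   # e.g. job-1764080000-12345-kqzt
```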
ab5tract: wonder-thunder looks good ) 15:12
15:24 sjn joined
ab5tract is the run over? 15:25
melezhik_ not yet 15:26
tests total: 200 | finished tests: 191 15:27
but it is finishing 15:28
ping from agent: wonder-thunder-21935026, version: 0.0.24, jobs-run-cnt: 3, max-threads: 6
only 3 jobs now on wonder-thunder
ab5tract we're winding down
melezhik_ finished 200 modules - brw.sparrowhub.io/report/brw-orch/241 15:36
run 1000 now
still don't see [Coke]'s agent 15:37
ab5tract do I need to re-start the agent?
melezhik_ no
ab5tract cool :)
melezhik_ agents just tap in. Even though I run a new round on o10r, agents don't need to restart 15:38
so far so good, no 599 errors )
what about CPU load btw on your agent ab5tract: ? 15:39
[Coke] melezhik_: "cokebot"
melezhik_ [Coke]: yeah, I don't see pings from it
[Coke] melezhik_: you never saw anything from cokebot today? 15:40
I'll restart it.
melezhik_ what does localhost:4000/builds say?
[Coke] re-run
melezhik_ I saw cokebot once, but I'm not sure, maybe yesterday
ok, now I see the ping 15:42
ping from agent: cokebot-41962464, version: 0.0.24, jobs-run-cnt: 0, max-threads: 4
[Coke] looks like one job succeeded? 15:51
again, is something like "mjzprhtxdwavynfblsqi" a kind of GUID? (should we use actual guids?)
melezhik_ if you go to localhost:4000/builds you will see how many jobs are done on your agent 15:53
where do you get mjzprhtxdwavynfblsqi ?
to see stats from all agents one needs to go to the o10r page - brw.sparrowhub.io/project/brw-orch/builds 15:54
ab5tract [Coke]: yeah, that thingy is a sort of GUID .. I think it's the same thing that melezhik was just clarifying for me (job name?) but I'm not 100% sure 15:55
melezhik_ yes, this is job id 15:56
ab5tract and job id is different than job name :) 15:57
melezhik_ but again, to see what jobs are finished and how many, it's better to use the UI
yes )
a job id is kind of an ID for a job run
ab5tract melezhik_: I do think it would be helpful for you to work on a script that we can run locally that will generate the output you want 15:58
that will make our side of the equation more pleasant
melezhik_ so a job, or project, or sparky project, is a Raku scenario that gets run; it could be run either by cron, triggered by SCM changes, or from some "parent" job
ab5tract I don't want to deny you logs, but I'm also terribly lazy and easily distracted to boot
melezhik_ so when a job is run, it's assigned a JOB ID to track this run (aka build) 15:59
also, a build has an internal ID, which is an INT
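A hypothetical illustration of the terminology (names and values invented, not Sparky's internals): a project is the Raku scenario, each run of it is a build, tracked by a string job id plus an internal integer build id.

```raku
# hedged sketch of the vocabulary only
class Build {
    has Str $.project;    # project name (aka job name)
    has Str $.job-id;     # random-ish string id for this particular run
    has Int $.build-id;   # internal integer id of the build
}

say Build.new(:project<some-project>, :job-id<mjzprhtxdwavynfblsqi>, :build-id(42));
```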
ab5tract I think it also might be helpful to always call the project style job a project, then the "job job" can just be job
melezhik_ ok, fair enough )
ab5tract eg, "a project can be launched from a parent project"
probably this will need some tweaking in the logs 16:00
melezhik_ sorry, I've lived with the terminology on my own for a long time, and I think for newcomers it's not very easy to grasp ), sorry
"a project can be launched from a parent project" - yep 16:01
ab5tract no worries, just want to understand what's going on as best as possible :)
melezhik_ yeah 16:02
ab5tract so with the new terminology, the phrase "job name" becomes the same as "job id"
melezhik_ jobs / builds are explained in 2 places - github.com/melezhik/sparky/tree/master - basic intro
ab5tract because the old meaning of "job name" is now only referred to as "project name"
melezhik_ and github.com/melezhik/sparky/blob/ma...job_api.md - an in-depth explanation of job ids, parent/child jobs, etc 16:03
||| TESTS STAT: time: 27m | tests total: 1000 | finished tests: 341 | sent to queue: dist=0 / redist=30 | agents cnt: 3 16:04
ab5tract not bad!
melezhik_ bearing in mind that 200 tests have been done in previous runs, we have 141 tests done in 27 min
which is also good
it's about 5 tests per minute, 300 tests per hour, 1800 tests per 6 hours; but I guess the more tests agents finish, the more cache they have, and the faster further tests finish 16:06
so this could be even faster than this naive approximation 16:07
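A quick check of that arithmetic:

```raku
my $done = 341 - 200;             # tests finished in this 27-minute window
my $mins = 27;
say ($done / $mins).round(0.1);   # ≈ 5.2 tests per minute
say ($done * 60 / $mins).round;   # ≈ 313 tests per hour
```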
also my agent only runs in 2 threads
I guess ab5tract's agent runs in 6 threads
and [Coke] in 4 threads
ab5tract should do, yeah 16:08
melezhik_ [Coke]: I guess you get those GUIDs by tailing the container logs (which is the sparkyd log), and this is fine. Those are just finished job ids; another way to see things happening is localhost:4000/builds. Hopefully this helps 16:12
you can also go to any job by clicking the ID ref and get the job report, and even job artifacts - which are some useful files, like raku env data, etc 16:13
but it's up to you if you prefer to get things from command line
I think in the future I can create a helpful script to get pretty much the same info by running docker / podman exec /some/util 16:14
for terminal users )
it's that sparky was initially designed with the idea of the UI as a dashboard, and the agent runs as a sparky job 16:15
[Coke] web app dead again 16:19
melezhik_ ab5tract: will "docker|podman exec agent stat" work as making "equation more pleasant" ? ))
so `curl localhost:4000` does not respond from host machine ? 16:20
but container is running, right ? 16:21
I can only suggest that the kernel kills the process at some point, but I may be wrong
it's possible to run the web app by `docker exec -it sparman worker_ui start` 16:22
[Coke] Yup, and with the base image, I can't run dmesg to figure out why
melezhik_ can you say `sparman worker_ui status` from within container ? 16:23
[Coke] AHA
docker run --privileged -e "container=docker" --rm -it --name agent -p 4000:4000 -e BRW_AGENT_NAME_PREFIX=cokebot agent 16:24
^^ that enables dmesg.
melezhik_ ok
[Coke] [ 7399.502740] Out of memory: Killed process 18529 (raku) total-vm:1929832kB, anon-rss:552116kB, file-rss:420kB, shmem-rss:0kB, UID:0 pgtables:1700kB oom_score_adj:0
melezhik_ yeah, expected
sparky web, which is a cro web app, is quite demanding on RAM
I guess at least it needs about 2-4 GB RAM 16:25
4 ideally
ab5tract :O
melezhik_ but it does not leak
ab5tract there was one of these running in each agent when we were doing multiple agents
no wonder the computer kept freezing lol
melezhik_ Oh, sorry )))
[Coke] found the memory settings in docker desktop on ma... 16:26
mac
ok, restarted in priv'd mode with 20GB memory 4GB swap 16:27
(will probably start impacting my actual laptop if it gets greedy. :) 16:28
Might want to have two running instances, one for the web app, one for the testing.
(because some of the tests are themselves memory hungry, and you don't want them to clobber the webapp) 16:29
ok, will leave *this* running for a bit and see if we do better. 16:30
16:30 melezhik_ left 16:40 melezhik_ joined
melezhik_ . 16:41
[Coke] correct me if I'm wrong, but there is no effort to order the jobs with least to most dependencies? 16:42
so I could get as my first four jobs things like Cro and Red, which are just going to take a LONG time to test.
AND you're testing all the dependencies as well? 16:43
if we're not saving the fact that we've tested the dependencies, we can at least install the deps with --/test so we're not spending time running those tests 2x. 16:44
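A hedged sketch of [Coke]'s suggestion (not brownie's current behaviour), driving zef's --/test and --deps-only flags from Raku: install a module's dependencies with tests skipped, then run tests only for the module itself.

```raku
# hypothetical helper, assuming zef is on PATH
sub test-one-module(Str $module) {
    run 'zef', 'install', '--/test', '--deps-only', $module;  # deps: no tests
    run 'zef', 'install', $module;                            # target: with its tests
}

test-one-module('Collection');
```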
(e.g. my first job this run was Colletion, which has a huge # of deps) 16:45
"Collection"
16:47 melezhik_ left, melezhik_ joined
[Coke] Or maybe have some more logic in terms of which groups of modules are handed out - so instead of X unrelated modules, maybe we get some with shared dependencies for a speedup? 16:49
Would also be nice if we had a way to analyze the logs to determine what was being tested *when* the OOM killer happens, so that we can narrow the list of suspected modules to perhaps add to the skiplist 16:51
looks like one job completed (with an install failure on Color::Named) 16:52
melezhik_ [Coke]: another way to see what jobs are completed is by looking at the centralized orchestrator (we call it o10r) dashboard at brw.sparrowhub.io/builds 16:53
you will find $agent_name.$job_number.report jobs here 16:54
for example brw.sparrowhub.io/project/cokebot-9...094.report
"correct me if I'm wrong, but there is no effort to order the jobs with least to most dependencies?" yes - we can traverse the modules "river" by sorting modules by reverse dependency count 16:55
currently this is commented out - github.com/melezhik/brownie/blob/6...wfile#L151 16:56
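For illustration, a hypothetical version of that ordering (the real logic lives in the sparrowfile linked above; the counts here are invented): sort candidates by reverse-dependency count so heavily depended-on "river" modules are tested first.

```raku
# hedged sketch with made-up numbers
my %rev-deps = 'JSON::Fast' => 412, 'Cro::HTTP' => 38, 'Acme::ಠ_ಠ' => 0;
my @ordered  = %rev-deps.sort(-*.value).map(*.key);
say @ordered;   # [JSON::Fast Cro::HTTP Acme::ಠ_ಠ]
```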
but if we think this is a more optimal way to test modules, it's easy to enable 16:57
"Would also be nice if we had a way to analyze the logs to determine what was being tested *when* the OOM killer happens" sure, if of course the RAM is eaten by the module test and not by sparky web itself
"so I could get as my first four jobs things like Cro and Red, which are just going to take a LONG time to test." - yeah, but bear in mind that once that LONG list of dependencies is installed, further module tests are presumably going to be faster, given that some or many dependencies are already met 16:58
so "strategically", if we run tests on a distributed pool of agents, it works out for the good 16:59
because eventually all agents will build up a cache of dependencies, which makes it faster to install further modules; but this is just a hypothesis, I might be wrong here 17:01
"so we're not spending time running those tests 2x." we never run tests twice for that matter, as zef would just skip installing dependencies ( and so running tests ) if they are already installed 17:02
17:36 melezhik_ left, melezhik_ joined 17:42 melezhik_ left 18:12 melezhik left 18:22 melezhik_ joined
melezhik_ tests total: 1000 | finished tests: 625 18:22
so far
I made a performance optimization to skip modules already tested by other agents; this should give some speedup as well 18:23
18:27 melezhik_ left 18:33 melezhik_ joined
melezhik_ to be more accurate, it synchronizes such information across all agents more frequently and faster 18:36
18:40 melezhik_ left 18:44 melezhik_ joined 18:48 melezhik_ left 19:09 melezhik joined 19:15 melezhik_ joined 19:19 melezhik_ left
[Coke] melezhik: if module A depends on module B: tests for B and A are run. What then happens when module B comes up in the list? Instant success because it was already installed? Or do you run the tests again at that point? 20:28
releasable6: next 21:01
releasable6 [Coke], Next release in ≈24 days and ≈21 hours. There are no known blockers. 0 out of 29 commits logged
[Coke], Details: gist.github.com/0e24e397ec24021522...37af2bee02
21:10 melezhik left