lizmat timo: #masakism logs now also online 00:20
sleep&
jubilatious1_98524 Hi @shimmerfairy , I nudged an ICU conversation on the old perl6-users mailing list, and got 13 replies (latest 2024). Here take a look: www.nntp.perl.org/group/perl.perl6...g9241.html 04:55
lizmat m: .say with try 56792.chr # sorta expected to have that say Nil 11:41
camelia Error encoding UTF-8 string: could not encode Unicode Surrogate codepoint 56792 (0xDDD8)
in block <unit> at <tmp> line 1
lizmat actually, it doesn't say anything at all :-)
timo: for reading large (2GB+) files, would it make sense to presize the Buf, (and then resize to what's used) and then have nqp::readfh start adding bytes *after* what is already in there? 12:57
timo it makes sense that the 56792.chr works but the say doesn't; after all, our strings are capable of more than just™ utf8, so it can represent a lone surrogate codepoint ... or maybe we don't want that to be possible at all is what you're suggesting? 13:33
patrickb I'm puzzled again. Looking at github.com/MoarVM/MoarVM/blob/main...ode.c#L764 The frame deserializer only reads the debug_locals data if the debug server is active, but it does not advance `pos` when the debug server isn't active. 13:36
Won't this mess up all following deserialization? (Or is it fine simply because it's the last part of the frame data?) 13:37
timo it's the last part; the next frame's position will be read from the table of all frames' positions again 13:38
patrickb Ah. Understood. Then it's fine. Thanks for confirming! 13:39
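[Editor's note: a hypothetical sketch of the pattern timo describes (names are illustrative, not MoarVM's actual structures). Because every frame's start offset comes from a shared table, a deserializer that stops reading early inside frame N, e.g. skipping trailing debug_locals data when no debug server is attached, cannot desynchronize frame N+1.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each frame's start offset comes from a table, so leftover unread
 * bytes at the tail of one frame record are simply skipped when the
 * next frame's reader is positioned. */
typedef struct {
    const uint8_t  *data;
    const uint32_t *frame_offsets;  /* one entry per frame */
    size_t          num_frames;
} BytecodeBlob;

static const uint8_t *frame_reader(const BytecodeBlob *b, size_t i) {
    assert(i < b->num_frames);
    /* never derived from where the previous frame's read stopped */
    return b->data + b->frame_offsets[i];
}
```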
timo no problem!
actually, a few months ago I hacked together an imhex pattern definition file for moarvm files, which you may find interesting to look at. also there's lizmat's module that can read these files and give statistics and such 13:40
lizmat MoarVM::Profile you mean ?
timo no i think it's MoarVM::Bytecode? 13:42
lizmat ah, ok, yes
Q: given enough RAM, should we be able to slurp a 2G+ file ? 13:48
timo slurp to string, right? 13:53
lizmat nope... binary 13:54
sorry
timo yeah that should surely work. do we still grow buffers only linearly after a specific size?
lizmat well, I'm working on optimizing that 13:55
timo ISTR a pull request or maybe just a branch with experiments for that
lizmat yeah, that got merged, but it also needs some rework 13:56
timo what would be really nice is if we could mmap a file into a Buf, but that's actually a different operation from slurping, since then changes to the file will cause changes in memory as well
lizmat well, I'll leave that for a later exercise
timo you can already get that with NativeCall and a CArray or CPointer 13:58
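[Editor's note: a sketch of why the linear-growth question above matters, not MoarVM's actual policy. Growing a buffer geometrically gives amortized O(1) appends; switching to a fixed increment past some size makes appending N bytes cost O(N²) in copying, which is why a multi-GB slurp would crawl. The growth factors here are illustrative.]

```c
#include <assert.h>
#include <stdint.h>

/* Two growth policies and the number of reallocations each needs
 * to reach a target capacity from a small starting size. */
static uint64_t grow_geometric(uint64_t cap) { return cap + cap / 2; } /* x1.5 */
static uint64_t grow_linear(uint64_t cap)    { return cap + 4096;     }

static unsigned steps_to_reach(uint64_t target, uint64_t cap,
                               uint64_t (*grow)(uint64_t)) {
    unsigned n = 0;
    while (cap < target) { cap = grow(cap); n++; }
    return n;
}
```

Reaching 1 GiB from 4 KiB takes a few dozen geometric steps but hundreds of thousands of linear ones, and each step copies the whole buffer so far.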
lizmat % rl '"m5.sql".IO.slurp'
initial read of 1048576: 1048576
taking slow path for 2417660750
sub slurp-PIO-uncertain(Mu \PIO, int $size)
trying to read 1879048192 bytes
appending 538612558 to 1879048192
trying to read 1879048192 bytes
MoarVM panic: Memory allocation failed; could not allocate 18446744066204519736 bytes
timo well, that's not great! :) 13:59
lizmat feels like that allocation number is... way too big ? 14:00
timo yes, negative number
I recall looking at some code paths where we made sure to use the biggest number type for file size information the OS / C library can give us
m: say 18446744066204519736.base(16)
camelia FFFFFFFE40AA4D38
lizmat m: my int $ = 18446744066204519736 14:01
camelia Cannot unbox 64 bit wide bigint into native integer. Did you mix int and Int or literals?
in block <unit> at <tmp> line 1
lizmat anyways, will contemplate while doing some gardening& 14:03
timo maybe check with strace if you can spot where the number may come from 14:05
patrickb When a raku process (my debugger UI) stops and doesn't perform any cleanup on shutdown, a Proc::Async child (the moar under debug) might first receive a SIGHUP (causing moar to try to shutdown) and then a SIGINT (causing it to die immediately)? I'm asking because when shutting the UI down I receive one last lone byte from moar: 83. I'd guess it's the start of a ThreadEnded notification. Does that sound plausible? 15:39
Byte 83 is the introducer of a fixed-width map with 3 elements. 15:40
timo I can only recommend rr to record the whole process tree so you can figure things like that out reliably :) 15:58
sigint shouldn't kill moar immediately, but we also don't catch it by ourselves so we really don't do very much between receiving sigint and exiting
lizmat an update on the MoarVM panic: Memory allocation failed; could not allocate 18446744066204519736 bytes error 16:16
if I *don't* do the $blob.append($part) I don't get an error 16:17
so it feels like the nqp::splice of large buffers is doing something wonky
aha: looks like it fails when such a large blob is being returned from a sub 16:46
lizmat frobnicating& 16:48
timo if we are adding newly read blobs to an existing blob one by one, if we don't otherwise have GC pressure (did not verify) we might keep a lot of smaller blobs around before gc can toss them out? 17:34
we don't really have a Blob.join or something right? but ... we probably could ...
actually, i don't think appending a native array with the splice op is bad?
rakudo -e '"bla/bla/bigfile.bin".IO.slurp(:bin).elems.say' → 2188252818 17:36
2.1G big
lizmat hmmmm that's a good point 18:02
right... I forgot the :bin in my slurp test 18:03
so it does try to decode the 2.4G blob
*dies 18:04
timo > MoarVM panic: Memory allocation failed; could not allocate 18446744065282693704 bytes 18:11
ah yes
lizmat gist.github.com/lizmat/905751e5c76...c22a8c9d9c some decoding memory usage
timo #3 MVM_string_utf8_decode (tc=0x5ec4c020080, result_type=<optimized out>, utf8=0x7ffe6a010000 "Rar!\032\a", 18:13
lizmat note that in the end, even the ticker gets interrupted (for about 24 ticks)
timo bytes=2188252818) at src/strings/utf8.c:241
241 MVMGrapheme32 *buffer = MVM_malloc(sizeof(MVMGrapheme32) * bufsize);
"ticker gets interrupted" could be a surprisingly long GC pause? 18:14
MVMint32 bufsize = bytes ... yeah that can too easily overflow haha
lizmat yeah, about 2.4 seconds worth ?
so appending bufs for more than 32-bit ints is wonky 18:15
? 18:17
timo we should note that a MVMString has a num_graphs attribute that is a 32bit integer too 18:18
lizmat aaahhh I guess that's where the decoding of larger blobs dies 18:19
would it work if it would make separate strands ?
timo there is however no upper limit to how many bytes you can decode to still fit into that because we might be creating some enormous composed graphemes
no the string that has the strands in it still has to have the total length in that attribute
lizmat ok, so slurping my sample file will just not work 18:20
so I guess we will need Blob.decode to check for max length and give a less LTA error message
timo the first thing we do is generate a buffer that's big enough to have 4 bytes for every 1 byte of the input 18:21
that's the line where the panic happens for us now
lizmat yeah, but the buf is not 4611686016551129934 bytes 18:22
timo correct
lizmat so the value it shows is bogus
timo but we go through a 32bit integer on the way to the malloc call
that makes it go negative and then back up to unsigned 64bit which makes it huge 18:24
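[Editor's note: a sketch of the two-step failure just described, under the assumption that MVMGrapheme32 is a 32-bit type. A >2 GB byte count stored in a 32-bit int wraps negative (step 1, technically implementation-defined in C but wrapping on common platforms); multiplying it by a `size_t` then sign-extends it to an enormous unsigned value (step 2), which is what `malloc` gets asked for. The function name is hypothetical and the exact panic number from the log is not reproduced here, since the real code computes the buffer size with extra terms.]

```c
#include <stdint.h>
#include <stddef.h>

/* Step 1: the >2GB count wraps negative in a 32-bit int.
 * Step 2: size_t * int32 sign-extends the negative value to a
 * huge unsigned 64-bit number before the multiply. */
static size_t bogus_alloc_size(uint64_t bytes) {
    int32_t bufsize = (int32_t)bytes;          /* wraps negative */
    return sizeof(int32_t) * (size_t)bufsize;  /* sign-extends huge */
}
```

With the 2417660750-byte file from the log, the request malloc sees is on the order of 1.8e19 bytes, the same pathology as the panic above.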
lizmat ok, I guess that makes sense... but still LTA :-)
timo yes, for sure
rakudo -e 'say "x" x (2**32 + 1)' 18:25
Repeat count (4294967297) cannot be greater than max allowed number of graphemes 4294967295
lizmat m: say "x" x (2**32) 18:48
camelia Repeat count (4294967296) cannot be greater than max allowed number of graphemes 4294967295
in block <unit> at <tmp> line 1
lizmat that's uint32 18:49
some handling appears to conk out at half of that because of signed int32 ?
timo could be
lizmat hehe, looks like .slurp(:bin) on a 2G file currently doesn't even work 18:50
timo i think it might just be really slow because the buffer grows linearly
lizmat Reading from filehandle failed: Invalid argument 18:51
timo you have local changes?
lizmat this is on 2026.01
timo could be a bug on macos only? 18:52
lizmat it's because nqp::readfh gets a too large value, exceeding max positive value on an int32 18:53
timo as i showed above I can slurp(:bin) a file that's 2188252818 bytes big just fine
you do "path".IO.slurp(:bin)? 18:54
lizmat ah, maybe af30c7bed30b725a124876addeb1303da97ce7cf is to blame
timo right, i'm on 2025.12-8-ga42e10a59 still i think
oh, i see 18:55
i see 64bits all over the place in the implementation of read_fhb that implements nqp::readfh 18:58
lizmat 0.47 to slurp a 2.4G file
try an nqp::readfh with 0x080000000 as size 18:59
timo the read function we use for the actual reading on the other hand returns signed int64 for the number of bytes read, but takes an unsigned int for the number of bytes to read
According to POSIX.1, if count is greater than SSIZE_MAX, the result is implementation-defined; see NOTES 19:00
for the upper limit on Linux.
lizmat % r 'use nqp; nqp::readfh(nqp::open("m5.sql","r"),Buf.new,0x080000000)'
Reading from filehandle failed: Invalid argument
m: say my int32 $ = 0x080000000 19:01
camelia -2147483648
lizmat m: say my int32 $ = 0x070000000
camelia 1879048192
timo that file has to actually be big enough?
lizmat nope 19:02
timo no error on my machine
lizmat hmmm... ok lemme try on a linux box 19:03
timo [pid 138197] <... read resumed>, "QFI\373\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\24\200\0\0\0"..., 2147483648) = 2147479552
m: say 2147483648.base(16)
camelia 80000000
timo On Linux, read() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.) 19:04
can you check the man page on your system `man 2 read` what it says?
lizmat The read() and pread() call may also return the following error: 19:07
[EINVAL] The value provided for nbyte exceeds INT_MAX.
confirmed it's not an issue on Linux
timo ok, so on mac the maximum is INT_MAX, which i think is 32bit integers? and on linux the maximum is SSIZE_MAX which is 64bit on 64bit systems?
lizmat I guess
no idea what INT_MAX is on MacOS 19:08
timo i wonder if limits.h has something
in lldb you should be able to print(INT_MAX)
lizmat google says:
INT_MAX is a macro that represents the maximum value of the upper limit of the integer data type in C/C++. The value of INT_MAX is: INT_MAX = 2147483647 (for 32-bit Integers)
timo though of course a sufficiently wild platform could have different definitions :P 19:09
(we do not target platforms like that)
lizmat yeah... so the underlying issue is read on MacOS being 32bit bound
timo if there's no standardised #define available we can either probe for that in the Configure.pl or we just always cap to INT_MAX for all systems, or we try once with the value passed and if we get EINVAL we reduce the size we attempt to read 19:12
lizmat I think capping for INT_MAX for now is ok :-) 19:13
timo we don't do anything inside the read_bytes function of syncfile to handle having read less than was requested, do all users of nqp::readfh deal with smaller actual results? 19:14
i guess they already have to
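[Editor's note: a sketch of the fix agreed on above, assuming a POSIX-style read call. Every request is capped at INT_MAX so macOS's read() never sees nbyte > INT_MAX (which returns EINVAL there), and the loop handles short results, since Linux caps a single read at 0x7ffff000 and any read may return fewer bytes than requested. `raw_read_fn`, `read_fully`, and `fake_read` are illustrative names, not MoarVM's API.]

```c
#include <limits.h>
#include <stdint.h>
#include <stddef.h>

typedef int64_t (*raw_read_fn)(void *ctx, void *buf, size_t n);

/* Read `wanted` bytes by issuing capped requests and looping on
 * short results; returns bytes read, or a negative error. */
static int64_t read_fully(raw_read_fn raw_read, void *ctx,
                          uint8_t *buf, uint64_t wanted) {
    uint64_t got = 0;
    while (got < wanted) {
        uint64_t chunk = wanted - got;
        if (chunk > INT_MAX) chunk = INT_MAX;   /* the cap */
        int64_t r = raw_read(ctx, buf + got, (size_t)chunk);
        if (r < 0)  return r;                   /* error */
        if (r == 0) break;                      /* EOF: short result */
        got += (uint64_t)r;
    }
    return (int64_t)got;
}

/* A fake reader for testing: hands out at most 1000 bytes per call
 * from a dwindling counter, mimicking short reads and EOF. */
static int64_t fake_read(void *ctx, void *buf, size_t n) {
    (void)buf;
    uint64_t *remaining = ctx;
    uint64_t give = n < 1000 ? n : 1000;
    if (give > *remaining) give = *remaining;
    *remaining -= give;
    return (int64_t)give;
}
```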
lizmat I guess I'll check them after I commit this 19:15
anyways, one probably shouldn't be slurping files of that size anyway 19:16
timo it's quite possibly not what you actually want to do, yeah 19:17