| lizmat | timo: #masakism logs now also online | 00:20 | |
| sleep& | |||
| jubilatious1_98524 | Hi @shimmerfairy , I nudged an ICU conversation on the old perl6-users mailing list, and got 13 replies (latest 2024). Here take a look: www.nntp.perl.org/group/perl.perl6...g9241.html | 04:55 | |
| lizmat | m: .say with try 56792.chr # sorta expected to have that say Nil | 11:41 | |
| camelia | Error encoding UTF-8 string: could not encode Unicode Surrogate codepoint 56792 (0xDDD8) in block <unit> at <tmp> line 1 |
||
| lizmat | actually, I expected it to not say anything at all :-) | ||
| timo: for reading large (2GB+) files, would it make sense to presize the Buf (and then resize it to what's used) and then have nqp::readfh start adding bytes *after* what is already in there? | 12:57 | ||
| timo | it makes sense that the 56792.chr works but the say doesn't; after all, our strings are capable of more than just™ utf8, so it can represent a lone surrogate codepoint ... or maybe we don't want that to be possible at all is what you're suggesting? | 13:33 | |
| patrickb | I'm puzzled again. Looking at github.com/MoarVM/MoarVM/blob/main...ode.c#L764 The frame deserializer only reads the debug_locals data if the debug server is active, but it does not advance `pos` when the debug server isn't active. | 13:36 | |
| Won't this mess up all following deserialization? (Or is it fine simply because it's the last part of the frame data?) | 13:37 | ||
| timo | it's the last part; the next frame's position will be read from the table of all frames' positions again | 13:38 | |
| patrickb | Ah. Understood. Then it's fine. Thanks for confirming! | 13:39 | |
| timo | no problem! | ||
| actually, a few months ago I hacked together an imhex pattern definition file for moarvm files, which you may find interesting to look at. also there's lizmat's module that can read these files and give statistics and such | 13:40 | ||
| lizmat | MoarVM::Profile you mean ? | ||
| timo | no i think it's MoarVM::Bytecode? | 13:42 | |
| lizmat | ah, ok, yes | ||
| Q: given enough RAM, should we be able to slurp a 2G+ file ? | 13:48 | ||
| timo | slurp to string, right? | 13:53 | |
| lizmat | nope... binary | 13:54 | |
| sorry | |||
| timo | yeah that should surely work. do we still grow buffers only linearly after a specific size? | ||
| lizmat | well, I'm working on optimizing that | 13:55 | |
| timo | ISTR a pull request or maybe just a branch with experiments for that | ||
| lizmat | yeah, that got merged, but it also needs some rework | 13:56 | |
| timo | what would be really nice is if we could mmap a file into a Buf, but that's actually a different operation from slurping, since then changes to the file will cause changes in memory as well | ||
| lizmat | well, I'll leave that for a later exercise | ||
| timo | you can already get that with NativeCall and a CArray or CPointer | 13:58 | |
| lizmat | % rl '"m5.sql".IO.slurp' | ||
| initial read of 1048576: 1048576 | |||
| taking slow path for 2417660750 | |||
| sub slurp-PIO-uncertain(Mu \PIO, int $size) | |||
| trying to read 1879048192 bytes | |||
| appending 538612558 to 1879048192 | |||
| trying to read 1879048192 bytes | |||
| MoarVM panic: Memory allocation failed; could not allocate 18446744066204519736 bytes | |||
| timo | well, that's not great! :) | 13:59 | |
| lizmat | feels like that allocation number is... way too big ? | 14:00 | |
| timo | yes, negative number | ||
| I recall looking at some code paths where we made sure to use the biggest number type for file size information the OS / C library can give us | |||
| m: say 18446744066204519736.base(16) | |||
| camelia | FFFFFFFE40AA4D38 | ||
| lizmat | m: my int $ = 18446744066204519736 | 14:01 | |
| camelia | Cannot unbox 64 bit wide bigint into native integer. Did you mix int and Int or literals? in block <unit> at <tmp> line 1 |
||
| lizmat | anyways, will contemplate while doing some gardening& | 14:03 | |
| timo | maybe check with strace if you can spot where the number may come from | 14:05 | |
| patrickb | When a raku process (my debugger UI) stops and doesn't perform any cleanup on shutdown, a Proc::Async child (the moar under debug) might first receive a SIGHUP (causing moar to try to shut down) and then a SIGINT (causing it to die immediately)? I'm asking because when shutting the UI down I receive a single last byte, 83, from moar. I'd guess it's the start of a ThreadEnded notification. Does that sound plausible? | 15:39 | |
| Byte 83 is the introducer of a fixed-width map with 3 elements. | 15:40 | ||
| timo | I can only recommend rr to record the whole process tree so you can figure things like that out reliably :) | 15:58 | |
| sigint shouldn't kill moar immediately, but we also don't catch it by ourselves so we really don't do very much between receiving sigint and exiting | |||
| lizmat | an update on the MoarVM panic: Memory allocation failed; could not allocate 18446744066204519736 bytes error | 16:16 | |
| if I *don't* do the $blob.append($part) I don't get an error | 16:17 | ||
| so it feels like the nqp::splice of large buffers is doing something wonky | |||
| aha: looks like it fails when such a large blob is being returned from a sub | 16:46 | ||
| | [Coke]_ joined | 16:48 | |
| lizmat | frobnicating& | 16:48 | |
| | [Coke] left | 16:51 | |
| timo | if we are adding newly read blobs to an existing blob one by one, and we don't otherwise have GC pressure (did not verify), we might keep a lot of smaller blobs around before GC can toss them out? | 17:34 | |
| we don't really have a Blob.join or something right? but ... we probably could ... | |||
| actually, i don't think appending a native array with the splice op is bad? | |||
| rakudo -e '"bla/bla/bigfile.bin".IO.slurp(:bin).elems.say' → 2188252818 | 17:36 | ||
| 2.1G big | |||
| lizmat | hmmmm that's a good point | 18:02 | |
| right... I forgot the :bin in my slurp test | 18:03 | ||
| so it dies trying to decode the 2.4G blob | 18:04 | ||
| timo | > MoarVM panic: Memory allocation failed; could not allocate 18446744065282693704 bytes | 18:11 | |
| ah yes | |||
| lizmat | gist.github.com/lizmat/905751e5c76...c22a8c9d9c some decoding memory usage | ||
| timo | #3 MVM_string_utf8_decode (tc=0x5ec4c020080, result_type=<optimized out>, utf8=0x7ffe6a010000 "Rar!\032\a", | 18:13 | |
| lizmat | note that in the end, even the ticker gets interrupted (for about 24 ticks) | ||
| timo | bytes=2188252818) at src/strings/utf8.c:241 | ||
| 241 MVMGrapheme32 *buffer = MVM_malloc(sizeof(MVMGrapheme32) * bufsize); | |||
| "ticker gets interrupted" could be a surprisingly long GC pause? | 18:14 | ||
| MVMint32 bufsize = bytes ... yeah that can too easily overflow haha | |||
| lizmat | yeah, about 2.4 seconds worth ? | ||
| so appending bufs larger than a 32-bit int can hold is wonky | 18:15 | ||
| ? | 18:17 | ||
| timo | we should note that a MVMString has a num_graphs attribute that is a 32bit integer too | 18:18 | |
| lizmat | aaahhh I guess that's what the decoding of larger blobs dies on | 18:19 | |
| would it work if it would make separate strands ? | |||
| timo | there is however no upper limit to how many bytes you can decode to still fit into that because we might be creating some enormous composed graphemes | ||
| no the string that has the strands in it still has to have the total length in that attribute | |||
| lizmat | ok, so slurping my sample file will just not work | 18:20 | |
| so I guess we will need Blob.decode to check for max length and give a less LTA error message | |||
| timo | the first thing we do is generate a buffer that's big enough to have 4 bytes for every 1 byte of the input | 18:21 | |
| that's the line where the panic happens for us now | |||
| lizmat | yeah, but the buf is not 4611686016551129934 bytes | 18:22 | |
| timo | correct | ||
| lizmat | so the value it shows is bogus | ||
| timo | but we go through a 32bit integer on the way to the malloc call | ||
| that makes it go negative and then back up to unsigned 64bit which makes it huge | 18:24 | ||
| lizmat | ok, I guess that makes sense... but still LTA :-) | ||
| timo | yes, for sure | ||
| rakudo -e 'say "x" x (2**32 + 1)' | 18:25 | ||
| Repeat count (4294967297) cannot be greater than max allowed number of graphemes 4294967295 | |||
| | [Coke]_ is now known as [Coke] | 18:38 | |
| lizmat | m: say "x" x (2**32) | 18:48 | |
| camelia | Repeat count (4294967296) cannot be greater than max allowed number of graphemes 4294967295 in block <unit> at <tmp> line 1 |
||
| lizmat | that's uint32 | 18:49 | |
| some handling appears to conk out at half of that because of signed int32 ? | |||
| timo | could be | ||
| lizmat | hehe, looks like .slurp(:bin) on a 2G file currently doesn't even work | 18:50 | |
| timo | i think it might just be really slow because the buffer grows linearly | ||
| lizmat | Reading from filehandle failed: Invalid argument | 18:51 | |
| timo | you have local changes? | ||
| lizmat | this is on 2026.01 | ||
| timo | could be a bug on macos only? | 18:52 | |
| lizmat | it's because nqp::readfh gets too large a value, exceeding the max positive value of an int32 | 18:53 | |
| timo | as i showed above I can slurp(:bin) a file that's 2188252818 bytes big just fine | ||
| you do "path".IO.slurp(:bin)? | 18:54 | ||
| lizmat | ah, maybe af30c7bed30b725a124876addeb1303da97ce7cf is to blame | ||
| timo | right, i'm on 2025.12-8-ga42e10a59 still i think | ||
| oh, i see | 18:55 | ||
| i see 64bits all over the place in the implementation of read_fhb that implements nqp::readfh | 18:58 | ||
| lizmat | 0.47 to slurp a 2.4G file | ||
| try an nqp::readfh with 0x080000000 as size | 18:59 | ||
| timo | the read function we use for the actual reading on the other hand returns signed int64 for the number of bytes read, but takes an unsigned int for the number of bytes to read | ||
| According to POSIX.1, if count is greater than SSIZE_MAX, the result is implementation-defined; see NOTES | 19:00 | ||
| for the upper limit on Linux. | |||
| lizmat | % r 'use nqp; nqp::readfh(nqp::open("m5.sql","r"),Buf.new,0x080000000)' | ||
| Reading from filehandle failed: Invalid argument | |||
| m: say my int32 $ = 0x080000000 | 19:01 | ||
| camelia | -2147483648 | ||
| lizmat | m: say my int32 $ = 0x070000000 | ||
| camelia | 1879048192 | ||
| timo | that file has to actually be big enough? | ||
| lizmat | nope | 19:02 | |
| timo | no error on my machine | ||
| lizmat | hmmm... ok lemme try on a linux box | 19:03 | |
| timo | [pid 138197] <... read resumed>, "QFI\373\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\24\200\0\0\0"..., 2147483648) = 2147479552 | ||
| m: say 2147483648.base(16) | |||
| camelia | 80000000 | ||
| timo | On Linux, read() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.) | 19:04 | |
| can you check the man page on your system `man 2 read` what it says? | |||
| lizmat | The read() and pread() call may also return the following error: | 19:07 | |
| [EINVAL] The value provided for nbyte exceeds INT_MAX. | |||
| confirmed it's not an issue on Linux | |||
| timo | ok, so on mac the maximum is INT_MAX, which i think is 32bit integers? and on linux the maximum is SSIZE_MAX which is 64bit on 64bit systems? | ||
| lizmat | I guess | ||
| no idea what INT_MAX is on MacOS | 19:08 | ||
| timo | i wonder if limits.h has something | ||
| in lldb you should be able to print(INT_MAX) | |||
| lizmat | google says: | ||
| INT_MAX is a macro that represents the maximum value of the upper limit of the integer data type in C/C++. The value of INT_MAX is: INT_MAX = 2147483647 (for 32-bit Integers) | |||
| timo | though of course a sufficiently wild platform could have different definitions :P | 19:09 | |
| (we do not target platforms like that) | |||
| lizmat | yeah... so the underlying issue is read on MacOS being 32bit bound | ||
| timo | if there's no standardised #define available we can either probe for that in the Configure.pl or we just always cap to INT_MAX for all systems, or we try once with the value passed and if we get EINVAL we reduce the size we attempt to read | 19:12 | |
| lizmat | I think capping for INT_MAX for now is ok :-) | 19:13 | |
| timo | we don't do anything inside the read_bytes function of syncfile to handle having read less than was requested, do all users of nqp::readfh deal with smaller actual results? | 19:14 | |
| i guess they already have to | |||
| lizmat | I guess I'll check them after I commit this | 19:15 | |
| anyways, one probably shouldn't be slurping files of that size anyway | 19:16 | ||
| timo | it's quite possibly not what you actually want to do, yeah | 19:17 | |
| | patrickb left | 20:27 | |
| | patrickb joined | 20:41 | |