I know that, but I've already explained why it's irrelevant. There's no correspondence between the parameter passed to read and the amount that needs to be read from disk, so saying that reading 8192 bytes from disk at a time is a good idea doesn't make requesting 8192 characters from read a good idea.
Take, for example, text consisting entirely of ASCII characters save for one character with a 3-byte encoding. read(..., 8192) requires reading 8194 bytes from disk. So asking for 8190 characters would have been a better choice if reading 8192 bytes from disk is optimal, as you claim.
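That arithmetic is easy to check. The sketch below (the scratch-file setup and the choice of U+20AC as the 3-byte character are mine, not from the thread) writes 8191 ASCII characters plus one EURO SIGN through a UTF-8 layer, then asks read for 8192 characters:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# 8191 ASCII characters plus one EURO SIGN (U+20AC, 3 bytes in UTF-8)
# occupy 8194 bytes on disk, yet asking read() for 8192 *characters*
# consumes the entire file.
my ($out, $path) = tempfile();
binmode $out, ':encoding(UTF-8)';
print {$out} ('a' x 8191), "\x{20AC}";
close $out;

my $bytes_on_disk = -s $path;                 # 8194 bytes
open my $in, '<:encoding(UTF-8)', $path or die "$path: $!";
my $chars_read = read $in, my $buf, 8192;     # 8192 characters
close $in;
unlink $path;

print "$bytes_on_disk bytes on disk, $chars_read characters read\n";
```

So a character count handed to read simply does not map onto any particular number of disk bytes once a variable-width encoding is involved.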
The only time one might be able to claim that providing a size of 8192 to read is a good choice is when reading text in a fixed-width encoding (so not UTF-8 or UTF-16LE). These days, that would mostly be binary data, but using read at all to slurp a binary file is surely slower than using sysread. So even then, read(..., 8192) would be suboptimal.
In fact, even with text files, using sysread and decoding afterwards is probably faster than using read with an encoding layer if you're interested in slurping the whole file.
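A minimal sketch of that "slurp raw, decode once" approach (the helper name slurp_decoded and the 1 MiB chunk size are my choices): pull all the bytes in with sysread, then run the UTF-8 decode in a single pass instead of letting an encoding layer decode chunk by chunk.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Slurp a file's raw bytes with sysread(), then decode once.
sub slurp_decoded {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "$path: $!";
    my $bytes = '';
    # Append up to 1 MiB per call; sysread returns 0 at EOF.
    1 while sysread $fh, $bytes, 1 << 20, length $bytes;
    close $fh;
    return decode('UTF-8', $bytes, FB_CROAK);   # die on malformed input
}
```

FB_CROAK makes malformed input fatal; a forgiving slurper might prefer Encode's default substitution behaviour instead.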
Your statements about performance seem quite uninformed.
I know that, but I've already explained why it's irrelevant. There's no correspondence between the parameter passed to read and the amount that needs to be read from disk, so saying that reading 8192 bytes from disk at a time is a good idea doesn't make requesting 8192 characters from read a good idea.
Well, "binmode()" appears explicitly in the title of this node. Given that, my assumption that we are talking about binary data is not completely unfounded!
I was indeed quite surprised by your code at Re^8: Error binmode() on unopened filehandle. I knew that your conclusion, "I used 8*1024 because read reads in 8 KiB chunks anyway.", was incorrect, but at the time I just wanted to hit the basics for other readers. I suspect part of the problem is 100_000 vs 1000000. There are also typically limits to the size of the STDIN pipe.

Now I supply code that you can run on an actual disk file. Your test code is not representative of a real-world example. read() can indeed read more than 8192 bytes!
use strict;
use warnings;

# run on Windows 10 Home Edition
my $file = 'COVID19-Death02Apr.jpg';    # any big file
open my $fh, '<', $file or die "$!";
binmode $fh;

my $data;
my $num_read = read $fh, $data, 3 * 8192;
print $num_read;    # 24576 -- just fine!
__END__
open my $fh, '<', $file or die "$!";
binmode $fh;
The following open is exactly equivalent to the two lines above:
open my $fh, '<:raw', $file or die "$!";
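One quick way to see the equivalence (a sketch of mine, using the running script $0 as a conveniently always-present file) is to compare the PerlIO layer stacks of both handles:

```perl
use strict;
use warnings;

# Open the same file both ways and compare the resulting layer stacks.
open my $plain, '<', $0 or die $!;
binmode $plain;                        # strip any text-translation layers
open my $raw, '<:raw', $0 or die $!;   # ask for a binary stream up front

my $plain_layers = join ' ', PerlIO::get_layers($plain);
my $raw_layers   = join ' ', PerlIO::get_layers($raw);
print "$plain_layers\n$raw_layers\n";  # e.g. "unix perlio" twice on Linux
```

PerlIO::get_layers is built in, so no module needs to be loaded for this check.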
I have not done any benchmarking of read() using the :raw layer vs sysread(). I believe that the more direct sysread() will be faster, but by how much I do not know (nor whether we would be measuring CPU time or wall-clock time). read() adds an additional level of buffering even in :raw mode (or so I suspect).

However, most UNIX versions also copy data to a system area before queuing the disk write for the hardware, i.e. the absolute memory pointer that the disk subsystem gets will not be in user memory space. An extra copy may not matter much. The execution time of a file copy with minimal processing will be dominated by the disk subsystem's ability to produce the next blocks and write them. There are mechanical motions involved, and some extra CPU time may or may not matter much depending upon what is done and how.
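A starting point for that benchmark, if anyone wants to run it (the 8 MiB scratch file and 1 MiB chunk size are my stand-ins for a real data file; note that repeated runs mostly measure OS-cached reads, not disk mechanics):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Build a throw-away 8 MiB file to slurp.
my ($tmp, $file) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} 'x' x (8 * 1024 * 1024);
close $tmp;

# Negative count means "run each sub for about 1 CPU-second".
my $table = cmpthese(-1, {
    'read'    => sub {
        open my $fh, '<:raw', $file or die $!;
        my $buf = '';
        1 while read $fh, $buf, 1 << 20, length $buf;
        close $fh;
    },
    'sysread' => sub {
        open my $fh, '<:raw', $file or die $!;
        my $buf = '';
        1 while sysread $fh, $buf, 1 << 20, length $buf;
        close $fh;
    },
});
```

cmpthese prints a relative-rate chart, so the read-vs-sysread gap shows up directly as a percentage.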