Isn't fork() a good idea when we have big directories?
Your bottleneck is - assuming you run the code directly on the machine that holds the disks - the disk system. There is an upper limit on the bytes/sec you can read through the disk interface (in the case of SATA-III, about 6 GBit/s, or about 600 MByte/s; see Serial_ATA). Your disk is usually much slower. On top of that come seek times. The disk has to literally search for the directory data on the platters: an arm carrying the read heads has to be moved across the surface of the disk, which typically takes a few milliseconds per read access. Even assuming a fast disk, you will spend on the order of 1 second per 1000 directories, maybe less, maybe more, just waiting for seeks.
Normally, the operating system (and the disk) caches parts of the disk. But if you traverse the directory tree of the entire disk, or large parts of it, you will read more data than any cache can hold. Especially on the first traversal, your caches are "cold", i.e. the data has not yet been read from disk. If you have insanely large amounts of RAM, your OS may cache and read ahead a little during the scan. But in general, it will not.
SSDs avoid the seek time, because nothing has to be moved. But you still have to read the data. NVMe SSDs are attached via PCIe at about 4 GByte/s (PCIe 3.0, 4 lanes). That's roughly 10 times faster than SATA-III, but few SSDs can actually deliver that speed, and even fewer can sustain it without overheating.
Now, what happens when you distribute the load over, say, a thousand processes forked from the main process?
Right, each process gets 1/1000 of the available bandwidth. So instead of one process reading 600 MByte/s from SATA-III, you have 1000 processes each reading 0.6 MByte/s. Except you don't even get that: switching between 1000 processes has significant overhead, you force the disk to seek even more, you waste RAM on processes that won't help you instead of using it for caching, and, as explained before, your SATA-III disk cannot actually deliver 600 MByte/s in the first place. So things become significantly WORSE. Feel free to replace 1000 with any other integer > 1.
Now, networking: running your code on a computer that is not directly connected to the disks. Gigabit ethernet has a theoretical limit of 1 GBit/s = 125 MByte/s, easily saturated by a single SATA-III interface. NVMe won't help you at all. The practical limit is lower, around 50 to 75 %, especially if more than two computers share the same network. Switching to the more expensive 10 GBit/s ethernet raises the limit to about 1.25 GByte/s, enough for a single SATA-III interface but not much more. Throw in NVMe or a second SATA-III interface and you are again saturating the network. Forking new processes won't help you: the network interface is saturated, and you cannot get more data through it.
Other people working on the same machine. Guess what happens. They also need the disk. They take away bandwidth and cause more seek times. Plus, they also need the CPU, slowing down your process(es). Again, forking won't help you.
VNC. I like VNC, but it either needs a lot of network bandwidth to transport bitmap images of the remote screen, or it needs a lot of CPU time and memory to compress the bitmap images. If your code does not run on the machine connected to the disks, VNC steals bandwidth, memory, and CPU, even if only other people use VNC. Forking won't help you here.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)