Re^7: Finding files recursively

by holli (Abbot)
on Aug 05, 2019 at 21:38 UTC [id://11103995]


in reply to Re^6: Finding files recursively
in thread Finding files recursively

I would expect more than a 4% speedup. You mentioned other users. Are you running this on some kind of shared network drive? If so, then THAT is your bottleneck. It's hard to say whether parallelization will speed things up without knowing more about the directory structure.


holli

You can lead your users to water, but alas, you cannot drown them.

Replies are listed 'Best First'.
Re^8: Finding files recursively
by ovedpo15 (Pilgrim) on Aug 06, 2019 at 07:01 UTC
    I tried a few tests; it always shows a 10-15 min difference. We use VNC, so other users also use the machine, but that should not affect the search time much. Isn't fork() a good idea when we have big directories?
      Isn't fork() a good idea when we have big directories?

      Your bottleneck is - assuming you are running the code directly on the machine - the disk system. There is an upper limit to the bytes/sec you can read through the disk interface (in the case of SATA-III, about 6 GBit/s, or about 600 MByte/s; see Serial_ATA). Your disk is usually much slower. Plus, there are seek times. The disk has to literally search for the directory on the disk: an arm carrying the read heads has to be moved over the surface of the platter, and that takes some time, typically a few milliseconds per read access. Assuming a fast disk, you will need about 1 sec per 1000 directories, maybe less, maybe more, just waiting on seek time.
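
      To get a feel for the numbers on your own hardware, here is a minimal sketch (the path is a placeholder) that does nothing but walk one tree and report how many directories per second it actually managed:

      #!/usr/bin/perl
      # Minimal sketch: walk one tree and report directories per second.
      # Run it with cold caches (e.g. right after a reboot) for a realistic number.
      use strict;
      use warnings;
      use File::Find;
      use Time::HiRes qw(gettimeofday tv_interval);

      my $root = shift // '/some/large/tree';   # placeholder path
      my $dirs = 0;
      my $t0   = [gettimeofday];
      find(sub { $dirs++ if -d }, $root);
      my $elapsed = tv_interval($t0) || 1;
      printf "%d directories in %.1f s (%.0f dirs/s)\n", $dirs, $elapsed, $dirs / $elapsed;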

      Normally, the operating system (and the disk) caches some parts of the disk. But if you traverse the directory of the entire disk, or large parts of it, you will read more data than any cache will hold. Especially when traversing for the first time, your caches are "cold", i.e. have not yet read the data from disk. If you have insanely large amounts of RAM, your OS may have cached and read ahead a little bit during the scan. But generally, it did not.
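
      If you want to repeat a measurement with cold caches without rebooting, Linux lets you drop the page cache explicitly (root required). A rough sketch of the idea, using the standard /proc interface:

      # Drop cached data between two runs so the second run is as "cold" as the first
      # (Linux only; writing to drop_caches requires root).
      system('sync');                                        # flush dirty pages first
      open my $fh, '>', '/proc/sys/vm/drop_caches' or die "need root: $!";
      print {$fh} "3\n";                                     # 3 = page cache + dentries + inodes
      close $fh;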

      SSDs avoid the seek time, because nothing has to be moved. But you still have to read the data. NVMe SSDs can be accessed at PCIe speeds of about 4 GByte/s (PCIe 3.0, 4 lanes). That's about 10 times faster than SATA-III, but SSDs rarely deliver that speed, and even fewer can do so continuously without overheating.

      Now, what happens when you distribute the load over, say a thousand processes forked from the main process?

      Right, each process gets 1/1000 of the available bandwidth. So instead of reading 600 MByte/s from SATA-III into one process, you are reading 1000 x 0.6 MByte/s into 1000 processes. Well, you are not. Switching between 1000 processes has significant overhead, you are forcing the disk to seek even more, you are wasting RAM on processes that won't help you instead of using it for caching, and as explained before, your SATA-III disk won't be able to deliver 600 MByte/s to work with anyway. So things become significantly WORSE. Feel free to replace 1000 with any other positive integer > 1.
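
      If you fork at all, keep the worker count tiny. A hedged sketch (Parallel::ForkManager is a CPAN module, and the subtree list is made up for illustration) of capping the number of workers instead of spawning hundreds of processes:

      #!/usr/bin/perl
      # Sketch only: one worker per independent subtree, not one per directory.
      use strict;
      use warnings;
      use File::Find;
      use Parallel::ForkManager;

      my @roots = ('/data1', '/data2');          # hypothetical: one subtree per physical disk
      my $pm    = Parallel::ForkManager->new(scalar @roots);

      for my $root (@roots) {
          $pm->start and next;                   # parent: schedule the next subtree
          find(sub { print "$File::Find::name\n" if $_ eq 'secret.file' }, $root);
          $pm->finish;                           # child exits here
      }
      $pm->wait_all_children;

      Whether even that helps depends entirely on the subtrees really living on different devices; on a single spindle it only adds seeks.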

      Now, networking: running your code on a computer not directly connected to the disks. Gigabit ethernet has a theoretical limit of 1 GBit/s, i.e. about 125 MByte/s. That is easily saturated by a single SATA-III interface, and NVMe won't help you at all. The practical limit is lower, at about 50 to 75 %, especially if you use more than two computers in the same network. Switching to the more expensive 10 GBit/s ethernet limits you to about 1.25 GByte/s, enough for barely two SATA-III interfaces at full speed. Throw in NVMe or a second SATA-III interface and you are again saturating the network interface. Forking new processes won't help you. The network interface is saturated. You cannot get more data through it.
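
      As a quick sanity check of what your link can actually deliver, the Linux kernel reports the negotiated speed per interface in sysfs (the interface name eth0 below is a placeholder):

      # Read the negotiated link speed (in MBit/s) from sysfs; eth0 is a placeholder name.
      open my $fh, '<', '/sys/class/net/eth0/speed' or die "no such interface? $!";
      chomp(my $mbit = <$fh>);
      printf "link: %d MBit/s ~ %.0f MByte/s raw\n", $mbit, $mbit / 8;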

      Other people working on the same machine. Guess what happens. They also need the disk. They take away bandwidth and cause more seek times. Plus, they also need the CPU, slowing down your process(es). Again, forking won't help you.

      VNC. I like VNC, but it either needs a lot of network bandwidth to transport bitmap images of the remote screen, or it needs a lot of CPU time and memory to compress the bitmap images. If your code does not run on the machine connected to the disks, VNC steals bandwidth, memory, and CPU, even if only other people use VNC. Forking won't help you here.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      Are you processing the files you found in any way you don't show us? I'm still looking for an explanation for the measly speedup you're experiencing. Time for a reality check. How long does it take to run this?
      find /where/the/secret/files/are -name secret.file 1>secret-files.dat 2>/dev/null
      Also, how stressed is the server? Please try to find out about the CPU-load and the IO-load.
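
      If you don't already have a favourite tool for that, something like this (Linux, with iostat from the sysstat package assumed) shows the load average and a few seconds of per-device utilization while your script runs:

      # Rough sketch: print the load average, then a few seconds of extended disk stats.
      # A %util column near 100 means the disk, not the CPU, is the bottleneck.
      open my $load, '<', '/proc/loadavg' or die $!;
      print 'loadavg: ', scalar <$load>;
      system('iostat', '-x', '1', '3');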


      holli

      You can lead your users to water, but alas, you cannot drown them.

      Fork for as many physically different disks as you have. Not partitions, not directories.

        Fork for as many physically different disks as you have. Not partitions, not directories.

        When using RAIDs and/or LVM, that rule is a little bit too simple:

        In case of hardware RAIDs (expensive disk controller with dedicated CPU, dedicated RAM, perhaps battery backup), treat each RAID volume as a single disk. The number of physical disks is irrelevant in this case.

        In case of software RAIDs (e.g. Linux MD driver, ZFS), things may become complicated. In the simplest case, each RAID volume is composed of several disks containing no partition or just a single partition, and you can treat each RAID volume as a single disk. If you spread several RAID volumes over several disks (e.g. a /boot RAID-1 using the first partition of each disk and a /data or root RAID-5 using the second partition of each disk), you need to treat the two RAIDs as a single disk. For more advanced setups, things get successively more complex.

        In case of fake RAIDs (a cheap disk controller with no CPU, no RAM, just a boot ROM, implementing a BIOS-level software RAID), the hardware RAID rules apply if the fake RAID allows only RAIDs of entire disks. If the fake RAID allows partitioning the disks into several RAIDs (I've never seen that), the software RAID rules apply.

        If you use LVM on top of the RAID, or even just on top of bare disks, you need to treat all disks (physical or RAID volumes) shared by an LVM set as a single disk.
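
        To count the truly independent devices under a given mount point, you can walk the device stack from the filesystem down to the disks. A rough sketch (Linux only; findmnt and lsblk from util-linux are assumed):

        #!/usr/bin/perl
        # Map a mount point to the physical disk(s) underneath it, through LVM/RAID/partitions.
        use strict;
        use warnings;

        my $mountpoint = shift // '/';                      # path to inspect
        chomp(my $source = qx{findmnt -n -o SOURCE $mountpoint});
        die "no filesystem found for $mountpoint\n" unless $source;

        # lsblk -s walks the dependencies bottom-up: lvm -> raid -> part -> disk.
        my @disks = grep { /\bdisk\s*$/ } qx{lsblk -s -n -o NAME,TYPE $source};
        print "$mountpoint sits on:\n", @disks;

        The number of distinct disks printed is a reasonable upper bound for how many workers are worth forking.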

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
