Isn't fork() a good idea when we have big directories?
Your bottleneck is - assuming you run the code directly on the machine that holds the disks - the disk system. There is an upper limit on the bytes/sec you can read through the disk interface (in the case of SATA-III, about 6 GBit/s, or about 600 MByte/s; see Serial_ATA). Your disk is usually much slower. On top of that come seek times. The disk has to literally search for the directory data on the platters: an arm carrying the read heads has to be moved across the surface of the disk, which typically takes a few milliseconds per read access. Even assuming a fast disk, you will spend on the order of 1 second per 1000 directories, maybe less, maybe more, just waiting for seeks.
Normally, the operating system (and the disk) caches parts of the disk. But if you traverse the directory tree of the entire disk, or large parts of it, you will read more data than any cache can hold. Especially on the first traversal, your caches are "cold", i.e. the data has not yet been read from disk. If you have insanely large amounts of RAM, your OS may cache and read ahead a little during the scan. But in general, it will not.
SSDs avoid the seek time, because nothing has to be moved. But you still have to read the data. NVMe SSDs are attached via PCIe at about 4 GByte/s (PCIe 3.0, 4 lanes). That's roughly 10 times faster than SATA-III, but few SSDs can actually deliver that speed, and even fewer can sustain it without overheating.
Now, what happens when you distribute the load over, say, a thousand processes forked from the main process?
Right, each process gets 1/1000 of the available bandwidth. So instead of one process reading 600 MByte/s from SATA-III, you have 1000 processes each reading 0.6 MByte/s. Except you don't even get that: switching between 1000 processes has significant overhead, you force the disk to seek even more, you waste RAM on processes that won't help you instead of using it for caching, and, as explained before, your SATA-III disk cannot actually deliver 600 MByte/s in the first place. So things become significantly WORSE. Feel free to replace 1000 with any other integer > 1.
Now, networking: running your code on a computer that is not directly connected to the disks. Gigabit ethernet has a theoretical limit of 1 GBit/s = 125 MByte/s, easily saturated by a single SATA-III interface. NVMe won't help you at all. The practical limit is lower, around 50 to 75 %, especially if more than two computers share the same network. Switching to the more expensive 10 GBit/s ethernet raises the limit to about 1.25 GByte/s, enough for a single SATA-III interface but not much more. Throw in NVMe or a second SATA-III interface and you are again saturating the network. Forking new processes won't help you: the network interface is saturated, and you cannot get more data through it.
Other people working on the same machine. Guess what happens. They also need the disk. They take away bandwidth and cause more seek times. Plus, they also need the CPU, slowing down your process(es). Again, forking won't help you.
VNC. I like VNC, but it either needs a lot of network bandwidth to transport bitmap images of the remote screen, or it needs a lot of CPU time and memory to compress the bitmap images. If your code does not run on the machine connected to the disks, VNC steals bandwidth, memory, and CPU, even if only other people use VNC. Forking won't help you here.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)