If file sizes match, just compare hashes (Digest::MD5, Digest::MurmurHash).
I don't know anything about MurmurHash, but MD5 has a higher chance of collisions than more modern hash digests. While, in all likelihood, this will not be a problem for this kind of usage, I would still go the recommended path of using something like SHA-256 or, even better, SHA3-512.
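As a rough sketch of the SHA-256 route: the snippet below assumes a Perl recent enough to ship Digest::SHA (it has been a core module since 5.10; older installs would need it from CPAN). The helper name `file_sha256` is made up for illustration.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: return the SHA-256 hex digest of one file.
sub file_sha256 {
    my ($path) = @_;
    my $sha = Digest::SHA->new(256);   # 256 selects SHA-256
    $sha->addfile($path, 'b');         # 'b' reads the file in binary mode
    return $sha->hexdigest;            # 64 hex characters
}
```

Two files with equal sizes and equal `file_sha256` results can then be treated as duplicates for practical purposes.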
See also:
"For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it gets food on the table."
| [reply] |
Thank you for all the links! I am very glad. I do not have Linux, but I will try to install one to find the Perl scripts. :)
I think, in order to calculate the MD5 hash, we have to read the entire file. But if we're going to read the entire file, then why not just compare every byte? It would be less work. | [reply] |
"Thank you for all the links! I am very glad. I do not have Linux, but I will try to install one to find the Perl scripts. :)"
If you're on Windows or Mac you should also be able to install modules. What issue do you have?
"But if we're going to read the entire file, then why not just compare every byte? It would be less work."
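If there were only two matching files, a direct comparison would be quicker. Since you don't know that this is going to be the case, a hash makes more sense from a performance perspective: with several files of the same size, each file is read once to compute its hash, instead of being re-read for every pairwise comparison.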
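To make that concrete, here is a minimal sketch of the size-then-hash idea. It assumes Digest::SHA is available, and `find_duplicates` is a hypothetical name, not something from an existing script.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: given a list of paths, return groups (array refs)
# of files that share both size and SHA-256 digest.
sub find_duplicates {
    my (@paths) = @_;

    # Step 1: bucket by size. A file with a unique size cannot have a duplicate.
    my %by_size;
    for my $path (@paths) {
        next unless -f $path;
        my $size = (stat $path)[7];
        push @{ $by_size{$size} }, $path;
    }

    my @groups;
    for my $same_size (values %by_size) {
        next unless @$same_size > 1;

        # Step 2: each remaining candidate is read once to compute its digest.
        my %by_digest;
        for my $path (@$same_size) {
            my $digest = Digest::SHA->new(256)->addfile($path, 'b')->hexdigest;
            push @{ $by_digest{$digest} }, $path;
        }
        push @groups, grep { @$_ > 1 } values %by_digest;
    }
    return @groups;
}
```

Calling `find_duplicates(glob '*.jpg')` would return one array ref per set of identical files. What to do with each set (keep the first, delete the rest) is left out on purpose.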
| [reply] |
"If you're on Windows or Mac you should also be able to install modules. What issue do you have?"
I'm using tinyperl on Windows XP. I am not looking for modules but ready-to-use scripts that do various things such as what I am working on right now, the "duplicate file remover."
This project is more like a programming exercise for me. So, even if I find that someone has already written this script, I still want to finish writing my own. But it would be neat to see more working scripts. I will probably get a copy of Linux and install it on my computer, because I want those Perl scripts, especially if they are old, because tinyperl is kind of old.
I like Strawberry Perl, but I like tinyperl better, because it is tiny. Lol. It only takes me 10 seconds to install on a new computer. It's very convenient, and doesn't take up much space.
Btw, JPG photos are not like random binary files. I think it is safe to assume that if the sizes match and the first 70000 bytes match, then the photos are the same. Why? JPG photos are so special that even if you try to take two shots of the clear blue sky, you're going to end up with two different files. If you zoom in, there is not a single pixel that is the same! Also, JPG files have a header that can contain the name of the camera, the precise date & time the photo was taken and sometimes even the GPS location. If you edit a photo and change just one pixel and save it, the entire file changes and all the header info changes. So, the chances of having two different JPG photos whose sizes match and whose first 70000 bytes match are vanishingly small.
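For illustration, the prefix check described above might look roughly like this. `same_prefix` is a made-up name, and the 70000 default is just the figure from this post. The sketch sticks to core features that should also exist in an older Perl such as the one tinyperl is based on. Note that it would report two files as equal if they differ only after the limit.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if both files have the same size and their
# first $limit bytes are identical, false (0) otherwise.
sub same_prefix {
    my ($path_a, $path_b, $limit) = @_;
    $limit = 70_000 unless defined $limit;

    open my $fh_a, '<', $path_a or die "Cannot open $path_a: $!";
    open my $fh_b, '<', $path_b or die "Cannot open $path_b: $!";
    binmode $fh_a;
    binmode $fh_b;

    # Different sizes: not duplicates, no need to read anything.
    return 0 unless (stat $fh_a)[7] == (stat $fh_b)[7];

    # Read at most $limit bytes from each file (fewer if the file is shorter).
    my ($buf_a, $buf_b) = ('', '');
    my $got_a = read $fh_a, $buf_a, $limit;
    my $got_b = read $fh_b, $buf_b, $limit;
    die "Read error: $!" unless defined $got_a && defined $got_b;

    return $buf_a eq $buf_b ? 1 : 0;
}
```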
| [reply] |