SHA-256? What do you all think of this?

by erichansen1836 (Sexton)
on May 23, 2019 at 16:37 UTC

erichansen1836 has asked for the wisdom of the Perl Monks concerning the following question:

I calculate it will take 114 minutes to run this CHECKSUM script on a 127-GIG file on a Windows 7 Home Premium laptop, at 5 seconds per 100-million-byte chunk read in and output, for a total of 1375 chunks of 100 million bytes each. Any faster ideas?

"NIST acknowledges that the work of Prof. Xiaoyun Wang constitutes a practical collision attack on SHA-1. Therefore, NIST encourages the rapid adoption of the SHA-2 hash functions (e.g. SHA-256) for applications requiring strong collision resistance, such as digital signatures."

https://crypto.stackexchange.com/questions/47809/why-havent-any-sha-256-collisions-been-found-yet

use IO::Handle;
use Digest::SHA qw(sha256_hex);
use Fcntl;
use File::Basename;

$AWD=dirname($0);   #-- application working directory path
$DWD="C:\\Users\\Eric\\Documents\\FlatFiles";
$DataFile="KJV_Bible_SDBM_528_31102_8369.dat";

$ret=0;
open(OUT,"> $AWD\\ShawTestOutput.txt") || do {$ret=1;};
if (! $ret) {
    OUT->autoflush(1);   #-- flush the output buffer each print
}
else {
    die "crashed on open output\n";
}

sysopen(IN, "$DWD\\$DataFile", O_RDONLY) || die "Open error on input file\n";

$digest;
print "Processing a 127-GIG file in (1375) large chunks...\n";
$i=0;
#-- read in 100 million bytes at a time of 137_434_512_864 total bytes
while ($nbr_bytes=sysread(IN,$buffer,100_000_000)) {
    $i++;
    print "Outputting ($i) of (1375) records\n";
    $fnum=sprintf("%4.0f",$i);
    $fnbr_bytes=sprintf("%9.0f",$nbr_bytes);
    undef $digest;
    $digest = sha256_hex($buffer);
    print OUT "$fnum|$fnbr_bytes|$digest\n";
}
close(IN);
close(OUT);
print "process complete\n";
sleep 5;
#print $digest1 eq $digest2 ?
#    "whew!\n" : "oops!\n";

Replies are listed 'Best First'.
Re: SHA-256? What do you all think of this?
by hippo (Bishop) on May 24, 2019 at 08:11 UTC
    Any faster ideas?

    Here are a few (other than changing the algorithm):

    • Don't have a line in your code which says sleep 5; if you are trying to get it to run faster.
    • Don't set autoflush on OUT. Specifically set it off.
    • Don't have lines in your code which are essentially no-ops such as $digest;
    • Don't run this on Windows 7 Home Premium Laptop. Either buy or rent top-end hardware or go cloud but use an O/S which isn't going to be doing 1000 other things in the background and slowing everything down.
    • Benchmark and then optimise for chunk size. Did you just pick 100MB at random? (See the sketch after this list.)
    • Profile and concentrate on the bottlenecks.

    Should be enough there to get you started.
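
    To make the chunk-size point concrete, here is a minimal benchmarking sketch. It uses Time::HiRes for the timing; the file path is the one from the original post, and sampling only the first ~2 GB per run is an assumption to keep each run short:

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);
    use Digest::SHA qw(sha256_hex);
    use Fcntl;

    my $file  = 'C:\\Users\\Eric\\Documents\\FlatFiles\\KJV_Bible_SDBM_528_31102_8369.dat';
    my $limit = 2_000_000_000;    # assumption: sample only the first ~2 GB per run

    for my $chunk (1_000_000, 10_000_000, 100_000_000) {
        sysopen(my $in, $file, O_RDONLY) or die "open: $!";
        my ($read, $buffer) = (0, '');
        my $t0 = [gettimeofday];
        while ($read < $limit) {
            my $n = sysread($in, $buffer, $chunk) or last;
            $read += $n;
            sha256_hex($buffer);    # include the digest cost in the timing
        }
        close $in;
        printf "chunk %11d bytes: %6.2f s for %d bytes\n",
               $chunk, tv_interval($t0), $read;
    }

    Whichever chunk size wins on your particular disk and RAM is the one to use in the real run.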

      Thanks everyone for the great ideas. I will try them once I study up on threading a bit, to see if I can get one process to pass each chunk it reads to another process that digests it, while the first process reads the next chunk of data.

      hippo asked: Benchmark and then optimise for chunk size. Did you just pick 100MB at random?

      Yes, at random. But I got an idea last night for code changes.

      I am now running the main script as a compiled (.pl to .exe) application, kicked off as a detached background process. What the main script does is read in 1 copy of the Bible at a time (8369 copies of the Bible, 127-GIG total) and create a digital signature for each copy, stored in a persistent SDBM binary hash file of key/value pairs. But after I ran this I discovered that I had only duplicated the same digital signature 8369 times. What was I thinking? No problem, as this just confirmed that the digital signature never changes on the same data.

      I ran the background process, monitoring it from TASK MANAGER, and increased its PRIORITY from NORMAL to HIGH there. I could have done that through the Win32::Process module instead. Actual CPU time was about 1 hour, but the process took just under 2 hours in real time.

      What I have done is hard-code the original digital signature for 1 copy of the Bible within my Perl database GUI user-interface application program. Whenever the end-user switches copies of the Bible (1-8369), my application will take less than 1 second to recalculate the digital signature for the currently selected copy of the Bible and compare it to the original. (Note: this DB user-interface application uses random-access indexing into each of the 8369 copies of the Bible, and their 1189 chapters each, via byte offsets stored in persistent SDBM files of key/value pairs tied to in-memory hash tables at run-time.) If the 127-GIG data file (8369 copies of the Bible) has been tampered with or corrupted unintentionally, and that tampering/corruption affected the currently selected copy of the Bible, then a different digital signature will be generated, which will lock the user out of that copy and present them the message:

      "Please contact your Database Administrator and let them know that you have encountered tampering or corruption within the copy of the Bible which you have just selected. Please use the Edit-->Select sub-menu to choose a different copy of the Bible to navigate."

      My idea of storing the ORIGINAL digital signatures (within an SDBM file) for chunks (sectors?) of data within a large static flat file (data warehouse) could be applied to a situation unlike my Bible database, one where the data differs between sectors and so generates a different digital signature per sector. Whenever a user randomly accesses a single record (or group of contiguous records), the associated sector could be read and a new digital signature for that sector checked against the original signature. If they differ, the user would be locked out of that sector.
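
      A minimal sketch of that sector-check idea (the sector size, file paths, and key format below are assumptions inferred from the figures in this thread, not the application's actual code):

      use strict;
      use warnings;
      use Fcntl;
      use SDBM_File;
      use Digest::SHA qw(sha256_hex);

      #-- assumed values: one copy of the Bible per sector, paths as above
      my $SECTOR_SIZE = 16_421_856;
      my $data_file   = 'C:\\Users\\Eric\\Documents\\FlatFiles\\KJV_Bible_SDBM_528_31102_8369.dat';
      my $sdbm_file   = 'C:\\Users\\Eric\\Documents\\FlatFiles\\KJV_Bible_8369_DigiSign';

      #-- returns 1 if sector $n still matches its original signature, else 0
      sub sector_ok {
          my ($n) = @_;

          tie my %orig, 'SDBM_File', $sdbm_file, O_RDONLY, 0666
              or die "Can't tie SDBM file: $!";

          sysopen(my $in, $data_file, O_RDONLY) or die "open: $!";
          #-- offsets past 2 GB need a 64-bit build of Perl
          sysseek($in, ($n - 1) * $SECTOR_SIZE, 0) or die "seek: $!";
          sysread($in, my $buffer, $SECTOR_SIZE)  or die "read: $!";
          close $in;

          my $key = sprintf "%4.0f", $n;    #-- same key format the loader script used
          my $ok  = sha256_hex($buffer) eq ($orig{$key} // '');
          untie %orig;
          return $ok;
      }

      print sector_ok(42) ? "sector clean\n" : "lock the user out\n";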

      I have also hard-coded within my Bible navigation software the ORIGINAL size of the Read Only Bible database, for comparison to the current size:

      # Deny read/write to other users or processes. This is NOT an advisory lock.
      # note: sopen() replaces the use of sysopen() when using Win32::SharedFileOpen
      sopen(IN, "$DWD\\$DataFile", O_RDONLY | O_RANDOM, SH_DENYRW) || do {
          Win32::GUI::MessageBox($W1,$ErrStr,"KJV Bible Navigator - Error",16,);
          return 1;
      };
      my $size = sysseek(IN,0,2);   #-- byte position at bottom/end of file (no systell function)
      if ($size != 137_434_512_864) {
          close(IN);
          Win32::GUI::MessageBox($W1,"Size mismatch on Flat File: ($DWD\\$DataFile)",
              "KJV Bible Navigator - Error",16,);
          return 1;
      }

      Calling script:

      use Win32::Process;
      use File::Basename;

      $AWD=dirname($0);

      #-- works with ShawTest.exe but does not work on ShawTest.pl
      #-- perhaps it would have to be run as: "perl.exe","perl ShawTest.pl" ??
      #-- ShawTest.pl compiled to .exe with the IndigoSTAR PERL application compiler
      $ret=Win32::Process::Create($POBJ,"$AWD\\ShawTest.exe","ShawTest",0,DETACHED_PROCESS,".");
      $ID=$POBJ->GetProcessID();
      print "($ret) pid = $ID, $AWD \n";
      sleep 5;
      exit;
      CALLED SCRIPT:
      use IO::Handle;
      use Digest::SHA qw(sha256_hex);
      use Fcntl;
      use File::Basename;
      use SDBM_File;

      $AWD=dirname($0);   #-- application working directory path
      $DWD="C:\\Users\\Eric\\Documents\\FlatFiles";
      $DataFile="KJV_Bible_SDBM_528_31102_8369.dat";   #-- 528 byte records * 31102 verses * 8369 Bibles
      $SdbmFile="KJV_Bible_8369_DigiSign";

      #-- open the 127-GIG Bible database for reading
      sysopen(IN, "$DWD\\$DataFile", O_RDONLY) || die "Open error on input file\n";

      %DigitalSignatures;
      keys %DigitalSignatures = 8369;
      tie( %DigitalSignatures, "SDBM_File", "$DWD\\$SdbmFile", O_WRONLY | O_CREAT, 0666 );
      unless ( tied %DigitalSignatures ) {
          close(IN);
          die "Can't tie hash table to SDBM Files:\n$DWD\\$SdbmFile (.pag and .dir) $!";
      }

      $ret=0;
      open(OUT,"> $AWD\\ShawTestOutput.txt") || do {$ret=1;};
      if (! $ret) {
          OUT->autoflush(1);   #-- flush the output buffer each print
      }
      else {
          close(IN);
          untie(%DigitalSignatures);
          die "crashed on open output of $AWD\\ShawTestOutput.txt\n";
      }

      #-- Read in 16_421_856 bytes (at a time) of 137_434_512_864 total bytes.
      #-- This process estimated to take 114 minutes on a Windows 7 Home Premium laptop.
      #-- Based upon a previous test where it took:
      #-- approx. 5 seconds to (input, digest, output) each 100 million byte string read in,
      #-- i.e. (5 seconds * 1375)/60 = 114.5 minutes for a 127-GIG flat "text" file, containing
      #-- 528 byte fixed-length records, with no CR/LF (newline,\n) record separators.
      #--
      #-- Now, that same flat file, containing 8369 complete copies of the KJV Bible, will be read in
      #-- and a digital signature created for each of the 8369 copies of the Bible.
      #-- These digital signatures will then be loaded into an SDBM file/store of key/value
      #-- pairs which will be tied at run-time to an in-memory hash table for lookup
      #-- and verification of each copy of the Bible accessed by the end-user within the
      #-- database GUI user-interface application. Each time an end-user selects a different
      #-- copy of the Bible to view (1 of 8369 copies), the DB interface application will
      #-- create a new digital signature for the single copy of the Bible selected, and compare
      #-- that signature against the ORIGINAL signature held in the SDBM file key/val store,
      #-- tied to an in-memory hash table. There will be 8369 ORIGINAL digital signatures held
      #-- within the binary SDBM file. It is estimated to take around 0.8 seconds to create the
      #-- digital signature anew each time the end-user selects a different copy of the Bible to
      #-- view. That seems like a reasonable amount of time to expect the end-user to wait
      #-- patiently for a digital signature comparison test (ORIGINAL vs. CURRENT signatures).
      #-- This 0.8 seconds will take place during the time a TREEVIEW widget is being loaded with
      #-- the 66 Books of the KJV Bible, and their 1189 total chapters, for the single copy of the
      #-- Bible just selected.

      #-- Will run this script as a Windows(tm) O/S detached background process.
      #-- First, we compiled ShawTest.pl to ShawTest.exe
      #--
      #-- Perl2Exe V26.10 2018-01-31 Copyright (c) 1997-2018 IndigoSTAR Software
      #--
      #-- This is an evaluation version of Perl2Exe, which may be used for 30 days.
      #-- For more information see the attached pxman.htm file,
      #-- or visit http://www.indigostar.com
      #-- Generating ShawTest.exe

      $digest;
      # print "Processing a 127-GIG file in (8369) large chunks...\n";
      $i=0;
      while ($nbr_bytes=sysread(IN,$buffer,16_421_856)) {
          $i++;
          #print "Outputting ($i) of (8369) records\n";   #-- about every 0.8 seconds
          $fnum=sprintf("%4.0f",$i);
          $fnbr_bytes=sprintf("%8.0f",$nbr_bytes);
          undef $digest;
          $digest = sha256_hex($buffer);
          print OUT "$fnum|$fnbr_bytes=$digest\n";
          $DigitalSignatures{$fnum}=$digest;   #-- store each digital signature in a binary SDBM file.
          #-- note: each key will be 4 bytes long, and each value 64 bytes long - total 68 bytes.
      }
      close(IN);
      close(OUT);
      untie(%DigitalSignatures);
      exit;

        Checksumming is a good thing to have. But databases that are already implemented got there first: Postgres, for instance, has the --data-checksums option on initdb. (In PostgreSQL 12, expected fall 2019, the setting can even be changed after the database has been initialized.)

Re: SHA-256? What do you all think of this?
by RMGir (Prior) on May 23, 2019 at 18:03 UTC
    I suspect it's going to be strongly I/O bound, with the sha calculation not adding much time, but you could use a thread pool and hand off the read data blocks to a thread for the SHA calculation/file update steps...

    It might be worth profiling it first (or just adding some printf's with timestamps) to see if anything but the reads (or possibly the sha calculation, if that's what takes all the time) takes significant time, though.

    Given how large your reads are, you're not going to want more than a few blocks in progress at a time, though, or you'll be soaking up too much RAM.
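
    A minimal sketch of such a hand-off using the core threads and Thread::Queue modules (the file name is a placeholder; note that enqueueing copies each 100 MB block between threads, which is part of the cost being weighed here):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;    # limit and end need a reasonably recent Thread::Queue (3.01+)
    use Digest::SHA qw(sha256_hex);
    use Fcntl;

    my $queue = Thread::Queue->new;
    $queue->limit = 3;    # at most 3 blocks (~300 MB) buffered at once

    # Worker: digest blocks as they arrive; dequeue returns undef after end().
    my $worker = threads->create(sub {
        while (defined(my $item = $queue->dequeue)) {
            my ($i, $block) = @$item;
            printf "%4d|%s\n", $i, sha256_hex($block);
        }
    });

    # Reader: the main thread pumps 100 MB chunks into the queue.
    sysopen(my $in, 'bigfile.dat', O_RDONLY) or die "open: $!";   # placeholder file name
    my $i = 0;
    while (sysread($in, my $buffer, 100_000_000)) {
        $queue->enqueue([ ++$i, $buffer ]);
    }
    close $in;
    $queue->end;    # no more items; drained dequeues return undef
    $worker->join;

    The limit is what keeps memory bounded: enqueue blocks once 3 items are queued, so the reader can never run far ahead of the digester.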


    Mike

      Okay I'll try that and see if it speeds things up a bit/byte.

      “The world went and got itself in a big damn hurry.” -- "The SHA-Shank Redemption"

      #-- Output snippet:

      1|100000000|d04e015247bf7b97c7687a48ef450c4155947b5772b10f45255b5d6cdd3c1e57
      2|100000000|b36e8bd3719da62b29fb662f4ad0b7ceacdf0216f0459161bd55547e2d08f708
      3|100000000|22c8d79642a7489885546dce83fe79f5d93027205de254f42a3269e3e72ed0e1
      4|100000000|38dffbb0997badf1ffcfda22dfa10402609982ba037ad00d293e0fb9d218b1ce
      5|100000000|bb0be91d5a969f41d54caf23de5c4d9595dae526f27029d7de8b76f7cb814207
      6|100000000|3f633fb8533b98a88912ecb5538e9cadd388668df9be2cf9bbb55edd3fe77d89
      7|100000000|4a5c6f0ddd358163fff067e58eac88cf8dc3979a104f7b10a45f2faface1b5a7
      8|100000000|03ea2a1990d9ebf76219f998437f8d125a96da592b41729f2189e80c93b68d2c
      9|100000000|909899bd2583f374fbd9afb3cbf0b6295fd6be18ad06ddf03a203fcdf4b4d4b3
      10|100000000|61f975350639d5282c3ea05e31b2cfc54c3014efcd2493cf238db3b6515d5b5e
      11|100000000|50101cd70c068b421d73b85c54c2d51962cf1257ffa4d156ee24106c6bcfd198
      12|100000000|d03abf48f5a1fce2343b96f15b1974e4c840fc24330a91c896df2d08ef483124
      13|100000000|5cf64d35570d0d9304d9b42ba2f5c5d134305da521217b4ce23563187b7dc296
      14|100000000|c180d599cd64077550c026a77a6be517be9313b1680331bca5bcfa6aa021ec0f
      15|100000000|1c41af9bf0f5b997553fb523c3a7c13ae28a428ccac931895acc35e68e979dd5
Re: SHA-256? What do you all think of this?
by bliako (Monsignor) on May 23, 2019 at 22:44 UTC

    If I understand correctly you have a large text which you break into 100mbyte chunks and then calculate the checksum on each chunk independently, i.e. you are not trying to create 1 checksum for the whole 127gbyte text.

    You can save some time by setting up a pipeline: it starts by reading 1 chunk. When that is done, the checksum calculation begins in another thread while the current thread reads the 2nd chunk in parallel. The savings depend on the ratio of the time spent reading from disk to the time spent calculating the checksum. The big objection here is that shared memory between threads in Perl is not efficient (perhaps someone can teach me otherwise) and you end up wasting more time in locking or in duplicating data between the threads. The alternative is for the reader thread to pass data via a pipe to the calculating thread.

    Also, it is worth investigating Digest::SHA's ability to add data from a stream as it becomes available over the pipe and whether some calculations can be done before the full chunk becomes available. I do not know about this.
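
    For what it's worth, Digest::SHA does have an incremental OO interface (new/add/hexdigest), so data can be fed to it piecewise as it arrives; a minimal sketch, with the input file as a placeholder:

    use strict;
    use warnings;
    use Digest::SHA;

    # Feed data to the digest piecewise; Digest::SHA keeps the running state,
    # so the full chunk never has to sit in a single 100 MB buffer.
    my $sha = Digest::SHA->new(256);
    open(my $fh, '<:raw', 'chunk.dat') or die "open: $!";   # placeholder input
    while (read($fh, my $buf, 1_048_576)) {    # 1 MB at a time
        $sha->add($buf);
    }
    close $fh;
    print $sha->hexdigest, "\n";

    # Digest::SHA can also do the read loop itself:
    #   my $hex = Digest::SHA->new(256)->addfile('chunk.dat', 'b')->hexdigest;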

    If the problem is indeed IO-bound then compressing the files to, say, half the size will reduce IO time and increase calculation time (decompressing+calculating). If you can make that ratio 50/50 then you have a nice candidate for a pipeline to halve your overall time (and increase power consumption). You should consider this only if you intend to repeat these experiments in the future (see the paragraph below); otherwise you will end up with both a longer time and a bigger electricity bill.

    Lastly, if you are thinking of doing similar experiments using the same data, perhaps one can split the file (unix split --bytes=100000000 file.dat) in advance and move the pieces onto different physical disks permanently. The cost of split+move can be worth it if you intend to do these (or similar) calculations/experiments repeatedly. Suppose you move them to 3 different disks: then you can parallelise the process over 3 threads and benchmark what your OS and hardware get you against the theoretical 2/3 savings. With this setup you save time on every experiment you make in the future, but I doubt anyone has 3 physically distinct disks in a laptop.

    bw, bliako

Re: SHA-256? What do you all think of this?
by daxim (Curate) on May 24, 2019 at 10:56 UTC

      Use SHA-3. http://valerieaurora.org/hash.html

      Nice table of statistics. Thanks.

      SHA-2 is still unbroken, but SHA-3 looks promising for now.

      SHA-2 had a similarly STRONG record to SHA-3 at the same number of years after its inception.

      I used SHA-256, but there are also SHA-384 and SHA-512 in the SHA-2 family.
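
      All of these are available from the same module; a minimal example (the input string is just a placeholder):

      use strict;
      use warnings;
      use Digest::SHA qw(sha256_hex sha384_hex sha512_hex);

      my $data = "placeholder input";
      print "SHA-256: ", sha256_hex($data), "\n";    #  64 hex chars
      print "SHA-384: ", sha384_hex($data), "\n";    #  96 hex chars
      print "SHA-512: ", sha512_hex($data), "\n";    # 128 hex chars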

Re: SHA-256? What do you all think of this?
by karlgoethebier (Abbot) on May 24, 2019 at 19:17 UTC

    I already asked myself four years ago what strange kind of bible studies you are doing on your Windows 7 lap.

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

      "I already asked myself four years ago what strange kind of bible studies you are doing on your Windows 7 lap."

      Hilarious !!

      Hey, the KJV Bible makes for great test data: logically partitioned into 66 Books, and 1189 Chapters (and their verses). In the Public Domain. Easily obtainable. Easily verifiable. And it was easy to make 8369 copies of it to populate a 128-GIG test file to try out the random access indexing capabilities, and verify the results.

      No I am not some MAD SCIENTIST performing vivisections upon poor defenseless Scripture verses.
