Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Flat File Database (random access indexing)

by erichansen1836 (Sexton)
on May 04, 2019 at 14:16 UTC ( #1233353=CUFP: print w/replies, xml ) Need Help??

UPDATE Monday June 3, 2019... I tested filling a 128-GIG flat file (containing 137_438_953_472 bytes) with 68,430 complete copies of the KJV Bible, at 2_008_451 bytes/Bible. I was able to get 8 times as many records (over 2 billion records) when switching to variable-length delimited records where I additionally compacted the data by 50% more by storing the text as random seeded Word-to-Code key mapped text using 1, 2, and 3 character codes (1/4 million of them to randomly choose from) generated solely from the character set {A-Z, a-z, 0-9}. Additionally, I used SHA-2 (256) digital signature generation on 1 copy of the Bible (2_008_451 bytes) to compare against real-time digital signature generation each time a different copy of the Bible is accessed through the DB GUI front-end. A sector/block of 31,102 records, 2_008_451 bytes (a complete copy of the Bible) is read in by one READ STATEMENT through the front-end, and a digital signature generated on-the-fly to compare against the original. If a different signature is produced, an auto-correct of the database occurs from a backup image for the effected sector/block of records only. This happens faster than you can blink.

UPDATE Monday May 6, 2019... I tested a 114-GIG Flat File for random access of its records and it works on Windows 7 Home Premium O/S with a NTFS (NT File System) using 64-bit ActiveState ActivePerl for Windows version/release 5.26.1. and the File I/O syntax: sysopen, sysseek, $tell=sysseek(FH,0,1), sysread, syswrite, and close. The Flat File had 7500 copies of the Bible in it (7500 copies of the Bible * 31,102 verses/Bible copy, and at a fixed-length record size of 528 bytes, and over 233 million records). I used a single 2-GIG SDBM index file to index the records of the 114-GIG Flat File. I would have indexed a full 128-GIG Flat File, but I hit the 2-GIG limit with the single SDBM file used for indexing. A 2nd SDBM file would be needed to index the records in a 128-GIG Flat File from the 114-GIG mark to the 128-GIG mark i.e. Bible copy 7501 through 8369.

###################################################################### +###### #-- This PERL script creates the KVJ Bible indexing for 7500 complete +copies #-- of the Bible contained within a single 114-GIG Flat File with 528 +byte #-- fixed-length records. ###################################################################### +###### #-- #-- INPUT: #-- #-- 66 Books, 1189 Chapters, 31102 Verses #-- DUMP 1 (edited) format: #-- "1|Genesis|Ge" #-- ... #-- "66|Revelation|Re" #-- ================== #-- DUMP 2 (unedited) format: #-- "1|1|31" #-- ... #-- "66|22|21" #-- ================== #-- DUMP 3 (unedited) format: #-- "1|50" #-- ... #-- "66|22 #-- #-- I used the Abingdon's/Strong's names & abbreviations for Bible Boo +ks #-- ###################################################################### +########## use File::Basename; use SDBM_File; use Fcntl; print "Processing... Please wait..." . "\n"; $cwd=dirname($0); %BibleIDX=(); keys %BibleIDX = 8917567; #-- ((1189 * 7500) = 8917500) + 66 + 1 = +8917567 tie( %BibleIDX, "SDBM_File", '.\KJV_Bible_SDBM_528_31102_7500', O_RDWR +|O_CREAT, 0666 ); $total_books=0; $total_chps=0; $total_vers=0; ###################################################################### +########## open(IN, "< $cwd\\KJV_BIBLE_DUMP_1_edited.txt"); #-- unloaded from + MS-Access while (<IN>) { chomp; #-- remove CR/LF ($bk_nbr, $name, $abbrev)=split(/\|/, $_); $BibleIDX{$bk_nbr} = $name . "," . $abbrev; #-- 1=Genesis,Ge $total_books++; } close(IN); ###################################################################### +########## @nbr_chapters=(); open(IN2, "< $cwd\\KJV_BIBLE_DUMP_2_unedited.txt"); #-- unloaded from + MS-Access while (<IN2>) { chomp; #-- remove CR/LF push @nbr_chapters, $_; } close(IN2); ###################################################################### +########## open(IN3, "< $cwd\\KJV_BIBLE_DUMP_3_unedited.txt"); #-- unloaded from + MS-Access while (<IN3>) { chomp; #-- remove CR/LF ($bk_nbr, $nbr_chps)=split(/\|/, $_); $BibleIDX{$bk_nbr} = $BibleIDX{$bk_nbr} . "," . $nbr_chps; $total_chps+=$nbr_chps; } close(IN3); ###################################################################### +########## $offset=0; for ($tran_nbr=1; $tran_nbr<=7500; $tran_nbr++) { print "Indexing the chapters for Bible copy number $tran_nbr of 750 +0\n"; foreach $chapter (@nbr_chapters) { ($bk_nbr, $chp_nbr, $nbr_verses)=split(/\|/, $chapter); $key=$tran_nbr . "|" . $bk_nbr . "|" . $chp_nbr; if ($key eq "1|1|1") { $BibleIDX{$key} = $offset . "," . $nbr_verses; #-- 1|1| +1 = 0,31 $nextoffset=(528 * $nbr_verses); } else { $offset = $nextoffset; $BibleIDX{$key} = $offset . "," . $nbr_verses; $nextoffset=$offset + (528 * $nbr_verses); } $total_vers+=$nbr_verses; } } ###################################################################### +########## $total_books+=(66 * 7499); $total_chps+=(1189 * 7499); print "Process complete:" . "\n\n"; print $total_books . "\n"; print $total_chps . "\n"; print $total_vers . "\n"; $BibleIDX{"STATS"} = "(1 to 7500) " . $total_books . "," . $total_chps + . "," . $total_vers; print $BibleIDX{"STATS"} . "\n"; exit; ###################################################################### +########## END { untie(%BibleIDX); close(IN); close(IN2); close(IN3); sleep 5; }
###################################################################### +################# #-- This PERL script creates a 114-GIG Flat File containing 233+ milli +on Bible verses, #-- for 7500 complete copies of the KJV Bible (31,102 verses/copy). #-- #-- One copy of the Bible is used as input, and is of variable-length +record format, #-- with CR/LF (newline) record terminator. Output file will be format +ted to 528 byte #-- fixed-length records, with no record terminator. #-- #-- Fixed-length records are used to demonstrate capacity requirements + for a READ/WRITE #-- database where the records are edited in-place at reliable byte of +fset locations. #-- #-- A similar READ ONLY database with variable-length records would ha +ve the capacity #-- to hold 4 times as many Bible verses i.e. almost 1 Billion records +/verses. #-- #-- This Flat File is being created for RANDOM ACCESS of its records, +where the byte #-- offset locations are stored persistently in a PERL SDBM database f +ile of key/value #-- pairs tied to an in-memory PERL program hash table at run-time. #-- #-- Tested on Windows 7 Home Premium with NT File System (NTFS). #-- Tested with ActiveState ActivePerl for Windows version/release 5.2 +6.1 #-- Tested using unbuffered File I/O syntax: sysopen, sysseek, sysrea +d, syswrite, #-- and close. ###################################################################### +################# use File::Basename; use Fcntl; $cwd=dirname($0); #-- this application program directory print "Processing... Please wait..." . "\n"; $total_verses = 0; @nbr_verses = (); open(IN, "< $cwd\\BibleVerses1Copy.txt") or do {print "Error on open input: $!\n"; sleep 5; die}; while (<IN>) { chomp; #-- remove CR/LF push @nbr_verses, $_; #-- load 31,102 verses to an array in memor +y } sysopen(OUT,"$cwd\\KJV_BIBLE_SDBM_528_31102_7500.dat", O_WRONLY|O_CREA +T) or do {print "Error on open output: $!\n"; sleep 5; die}; $tell = sysseek(OUT, 0, 0); #-- top of file for ($tran_nbr=1; $tran_nbr<=7500; $tran_nbr++) { print "$tran_nbr of 7500\n"; foreach $verse (@nbr_verses) { $total_verses++; $line=sprintf("%-528s", $verse); syswrite(OUT, $line); } } print "Total verses written = $total_verses \n"; exit; END { close(IN); close(OUT); sleep 5; }
###################################################################### +########################### # This PERL script prints out the verses from the last chapter of the +KJV Bible # i.e. Book of Revelation(book 66), chapter 22, verses 1-21. It does + this by performing # Random Access to the last chapter of the 7500th copy of the Bible in + a 114-GIG Flat File using # a SDBM file of key/value pairs containing the record byte offsets to + the first verse within # each of the 1189 chapters within each of the 7500 copies of the Bibl +e. # # This program works with ActivePerl (5.26.1), but not ActivePerl (5.6 +.1) # Any similar application built (using flat files > 4-GIG in size) mus +t use (5.26.1) for # Joint Database Technology (JDT) i.e. # Flat Files with fixed-length records indexed by DBM Files of key/val +ue pairs. # # Continue to use ActivePerl (5.6.1) for ODBC/MS-Jet 4.x (MS-Access Dr +iver) SQL databases as it # does not work on ActivePerl (5.26.1), at least not on Windows 7 Home + Premium O/S. # # TESTING... Windows 7 Home Premium (NT File System - NTFS)... # ###################################################################### +########################### use File::Basename; use IO::Handle; use SDBM_File; use Fcntl; $cwd=dirname($0); $tr=7500; $bk=66; $chp=22; $key=$tr . "|" . $bk . "|" . $chp; $outfile="$cwd\\KJVBibleSDBM_528_31102_7500_Report_" . $tr. "_" . $bk +. "_" . $chp . ".txt"; open(OUT,"> $outfile"); OUT->autoflush(1); sysopen(IN,"$cwd\\KJV_BIBLE_SDBM_528_31102_7500.dat", O_RDONLY); $size = sysseek(IN,0,2); #-- bottom of file print OUT "File Size: " . $size . "\n"; tie( %BibleIDX, "SDBM_File", 'KJV_Bible_SDBM_528_31102_7500', O_RDONLY +, 0444 ); #-- (.pag, .dir) unless (tied %BibleIDX) { print "Error tying SDBM hash files: [$cwd\\KJV_Bible_SDBM_528_3110 +2_7500(.pag && .dir)]\n"; sleep 5; die; } print OUT "INDEX FILE STATS: ", $BibleIDX{"STATS"}, "\n"; ($offset, $nbr_verses) = split(/,/,$BibleIDX{$key}); for ($j=1;$j<=$nbr_verses;$j++) { if ($j == 1) { $pos=sysseek(IN,$offset,0); $tell=sysseek(IN, 0, 1); print OUT "pos=($pos), tell=($tell)\n\n" } sysread(IN,$rec,528); #-- or we could read (528 * $nbr_verses) an +d unpack to an array $tell=sysseek(IN, 0, 1); #-- 0 bytes from current position $tell -= 528; $verse = sprintf("%s", $rec); #-- necessary because $rec initializ +ed to 528 length w/padded spaces. #$verse =~ s/^\s+//; #-- if we had needed to remove any leading sp +aces $verse=~s/ *$//; #-- works on $verse but not on $rec to remove + trailing spaces $verse=~s/\s+$//; #-- works on $verse but not on $rec to remove + trailing spaces ($name,$abbrev)=split(/\,/,$BibleIDX{$bk}); $line=$tell . "|" . $tr . "|" . $bk . "|" . $name . "|" . $abbrev +. "|" . $chp . ":" . $j . "|" . $verse . "\n\n"; print OUT $line; } exit; END { close(IN); close(OUT); untie( %BibleIDX ); print "BYE\n"; sleep 3; }

Replies are listed 'Best First'.
Re: Flat File Database (random access indexing)
by bliako (Parson) on May 04, 2019 at 14:48 UTC

    though octal, 0666 may cost your software papal accreditation.

    seriously, what is the reason for using flat file databases as opposed to, say, SQLite which is a file database?

        Along with all the other associated root nodes :)

        Thanks. That sums it up.

Re: Flat File Database (random access indexing)
by jdporter (Canon) on May 06, 2019 at 14:21 UTC

    Please stop posting this crap. We get it. Enough.

    I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
      TO: jdporter... Others will understand this who have experience with this sort of thing i.e. NoSQL. It is not for everyone i.e. lovers of SQL. Being able to use multiple 128-GIG flat files for a large relational database with random access indexing to billions of records should make this a MUCH more attractive data warehousing alternative for some folks. With the advent of ActiveState ActivePerl 5.26.1, a truly significant change has been made. PERL 5.6.1 was limited to flat file indexing to 4-GIG. But with PERL 5.26.1, 128-GIG flat files can be random access indexed, and perhaps even larger flat files?

        What you don't seem to be getting is that you've preached your gospel enough, and more than enough. It's become spam.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://1233353]
Approved by NetWallah
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (9)
As of 2020-04-04 22:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    The most amusing oxymoron is:
















    Results (32 votes). Check out past polls.

    Notices?