Unless you have a second disk on which to write the output file, I wouldn't use two files.
Besides the huge intermediate storage requirement, which can be a problem on smaller systems, writing to a disk (and having the OS locate space to write to) at the same time as you are reading from that disk can severely slow down the process, especially if the disk is more than, say, 50% full and/or fragmented.
I'd use a two-pass process.
- Pass 1 reads the file and records the start and end offsets of each record to be deleted.
If the kill list is small, load it all into a hash.
If it's too big for memory, load the kill list into the hash in chunks and write the positions to another file.
The second pass only needs the positions one at a time (once they are sorted).
- Pass 2 sorts the positions and then opens two filehandles to the data file: one for reading, one for writing. It then reads via the former and writes to the latter, overwriting the sections that need deleting. Finally, it truncates the file (via the write handle) to its final length.
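The same two-pass scheme works for a real on-disk file, where truncate does work. Here is a minimal Python sketch of the idea (not the Perl demo below): it assumes newline-separated records for simplicity, and `delete_records` is a name invented for illustration:

```python
def delete_records(path, kills):
    """Remove newline-separated records in 'kills' from the file at
    'path', in place: one scanning pass, one compaction pass."""
    # Pass 1: scan once, recording the (start, end) byte offsets of
    # every record that is to be deleted.
    ranges = []
    with open(path, 'rb') as fh:
        last = fh.tell()
        for line in iter(fh.readline, b''):
            if line.rstrip(b'\n') in kills:
                ranges.append((last, fh.tell()))
            last = fh.tell()
    if not ranges:
        return
    ranges.sort()
    # Pass 2: two handles on the same file -- one reading, one writing.
    # The write position never passes the read position, so no data that
    # still needs reading is ever overwritten.
    with open(path, 'rb') as rfh, open(path, 'r+b') as wfh:
        wpos, rpos = ranges[0]
        wfh.seek(wpos)          # write from the start of the first hole
        rfh.seek(rpos)          # read from just past it
        for start, end in ranges[1:]:
            wfh.write(rfh.read(start - rpos))  # shift survivors up
            rfh.seek(end)                      # skip over the next hole
            rpos = end
        for chunk in iter(lambda: rfh.read(8192), b''):
            wfh.write(chunk)    # copy the remaining tail
        wfh.truncate()          # drop the dead bytes at the end
```

Only the kill-range list is held in memory; the records themselves are streamed from the read handle to the write handle in place.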
This is a little tricky to envisage, so a diagram might help.
The records to be deleted: A,J,O,R,Y,Z
The initial state of the big file: L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B
The big file with the killed records removed (gaps shown): L, E,M,X,T, G,U,W, Q, C,K, P,V,I,D,H,N,S,F,B
The desired result after 'compaction': L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B
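As a sanity check, the diagram's end state can be reproduced by simply filtering the kill set out of the record list (a Python one-liner, purely illustrative):

```python
records = "L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B".split(',')
kills = set("AJORYZ")
# Keep only the survivors and re-join on the record separator.
print(','.join(r for r in records if r not in kills))
# -> L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B
```

Of course, this builds the whole result in memory, which is exactly what the in-place compaction below avoids for a genuinely big file.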
The following code uses a ramfile with single-character records and ',' as the record separator to demonstrate the logic required:
#! perl -slw
use strict;
use List::Util qw[ shuffle ];

## Dummy up a hash of lines to kill.
my %kills = map{ $_ => undef } ( shuffle( 'A' .. 'Z' ) )[ 0 .. 5 ];
print sort keys %kills;

## And a 'big file'
my $bigfile = join ',', shuffle 'A' .. 'Z';
print $bigfile;

## Scan the bigfile recording the file positions
## where records are to be deleted.
open my $fhRead, '<', \$bigfile;
{
    local $/ = ',';
    my $lastPos = 0;
    while( <$fhRead> ) {
        chomp;
        ## Store the range (start & end) of each record to delete
        $kills{ $_ } = [ $lastPos, tell $fhRead ] if exists $kills{ $_ };
        $lastPos = tell $fhRead;
    }
}

## Sort the ranges into ascending order
my @posns = sort{ $a->[ 0 ] <=> $b->[ 0 ] } values %kills;

## Open a second, write handle to the file
open my $fhWrite, '+<', \$bigfile;
{
    local $/ = ','; local $\;
    ## Move the write and read file pointers to the start and end
    ## of the first doomed record respectively
    my( $w1, $r1 ) = @{ shift @posns };
    seek $fhWrite, $w1, 0;
    seek $fhRead,  $r1, 0;
    while( @posns ) {
        ## Get the next pair of positions
        my( $w2, $r2 ) = @{ shift @posns };
        ## 'Move' the records up to the start of the next record to be deleted
        print $fhWrite scalar <$fhRead> while tell( $fhRead ) < $w2;
        ## Advance the read head over the section to be removed.
        seek $fhRead, $r2, 0;
    }
    ## Copy the remaining records over
    print $fhWrite $_ while <$fhRead>;
    ## Truncate the write filehandle.
    # truncate $fhWrite, tell $fhWrite;
    ## truncate doesn't work on ramfiles!
    $bigfile = substr $bigfile, 0, tell $fhWrite;
}
close $fhRead;
close $fhWrite;
print $bigfile;
print $bigfile;
__END__
C:\test>junk
HMNPRS
L,R,J,Z,P,H,M,K,T,A,D,Q,B,Y,S,E,W,F,U,C,I,X,G,O,N,V
L,J,Z,K,T,A,D,Q,B,Y,E,W,F,U,C,I,X,G,O,V
C:\test>junk
ABGKOT
G,O,Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,A,B,L,N,T,K,F
Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,L,N,F
C:\test>junk
AKLPTZ
J,F,Q,S,E,X,I,B,U,K,R,A,M,W,Z,G,D,L,H,C,N,Y,O,V,T,P
J,F,Q,S,E,X,I,B,U,R,M,W,G,D,H,C,N,Y,O,V,
C:\test>junk
AJORYZ
L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B
L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B