Faster file read, text search and replace

sabas has asked for the wisdom of the Perl Monks concerning the following question:

I have a large XML file that has 946,388,628 lines. I wrote a simple .pl script to read and count the lines, but it took a very long time just to read the whole file, with no logic added beyond reading each line and counting it. Is there a way I can speed up the process in PERL? (I am new to PERL.) I am planning to search each line for a certain "old string" and replace it with a "new string".

 

$ARGV[0] or die "ERROR: No file for 114 lines";
$ARGV[1] or die "ERROR: No file for 114 lines";

open my $bigfile,"<",$ARGV[0] or die "ERROR: Could not open big file $ARGV[0]:$!";
open my $outfile,">",$ARGV[1] or die "Error: Could not open output file $ARGV[0]:$!";

$datestring = localtime();
print $outfile "Processing started...at $datestring\n";
print "Processing started...at $datestring\n";

my $lctr = 0;

while (my $line = <$bigfile>) {
    chomp $line;
    $lctr++;
}
$datestring = localtime();
print $outfile "Processing Ended...at $datestring\n";

print $outfile "Total Lines read in $ARGV[0] = $lctr";

close $bigfile;
close $outfile;

Replies are listed 'Best First'.
Re: Faster file read, text search and replace by hippo (Bishop) on Feb 13, 2018 at 23:06 UTC
"Is there a way I can speed up the process in PERL"

Without changing the approach you can speed it up by losing the unused `$line` scalar and the pointless chomp. Change that loop to: `while (<$bigfile>) { $lctr++; }`

That should buy you a few percent. Beyond that it would be better not to process the file line by line but rather block by block with a variable (i.e. tunable) block size. Maybe start with 16MB or so. Then just count the newlines in each block once it is in memory.

BTW, did you spot the bug on this line? `open my $outfile,">",$ARGV[1] or die "Error: Could not open output file $ARGV[0]:$!";`
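The block-by-block counting suggested above could look roughly like the following. This is a minimal sketch, not hippo's own code: the sub name `count_newlines` and its `$block_size` parameter are made up for illustration, and the 16 MB default is just the starting point mentioned in the reply.

```perl
use strict;
use warnings;

# Count newlines by reading fixed-size blocks instead of individual lines.
# count_newlines and $block_size are illustrative names, not from the thread;
# 16 MB is only the suggested starting value and is meant to be tuned.
sub count_newlines {
    my ($path, $block_size) = @_;
    $block_size ||= 16 * 1024 * 1024;

    # :raw avoids any layer processing; we only care about "\n" bytes.
    open my $fh, '<:raw', $path or die "Could not open $path: $!";

    my $count = 0;
    my $buf;
    while (1) {
        my $got = read($fh, $buf, $block_size);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;              # end of file
        $count += ($buf =~ tr/\n//);    # tr counts matches without changing $buf
    }
    close $fh;
    return $count;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $n = count_newlines($ARGV[0]);
    print "Total newlines in $ARGV[0] = $n\n";
}
```

Note that this counts newline characters, so a final line with no trailing newline is not counted, whereas the line-by-line loop would count it. Whether the block approach is actually faster on a given machine is something to measure with a few different block sizes.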
Re^2: Faster file read, text search and replace by sabas (Acolyte) on Feb 28, 2018 at 20:08 UTC
Yes, I saw the bug: $ARGV[0] should be $ARGV[1].
Re: Faster file read, text search and replace by NetWallah (Canon) on Feb 14, 2018 at 04:56 UTC
An XML file larger than ~500 MB is indicative of a poorly designed application system. The reason is that XML files are typically serialized/processed after being read into memory, and at over 500 MB, memory demands start to enter the region where they need special treatment for resource allocation.

Consider loading the XML file into a database that can manage memory much better, while providing structured access. Something like a SQLite UI with an XML plug-in could help.

Python is a racist language what with its dependence on white space!
Re^2: Faster file read, text search and replace by Jenda (Abbot) on Feb 14, 2018 at 11:58 UTC
While I agree about the poorly designed system, reading whole XML files into memory is more often than not poor design as well. Whether or not the file is (already) huge, if you do not have to, do not load the whole file into a huge maze of interconnected objects, but rather process it in chunks. XML::Twig or XML::Rules make that fairly easy to do.

Jenda
Enoch was right! Enjoy the last years of Rome.
Re: Faster file read, text search and replace by Jenda (Abbot) on Feb 14, 2018 at 12:18 UTC
It's Perl (the language) or perl (the "interpreter"), not PERL.

And no, there's nothing on the Perl side that can make this quicker. The IO costs will greatly outweigh anything you can do on the Perl side. The data should not be in the XML format. It's one of the least space-efficient ways to store data, and when reading and writing are involved, space equals speed.

If you can't change the way you store the data, you might at least store it compressed and then decompress as you read and compress as you write. While it will mean more work for the CPU, the IO costs ought to be much lower. See PerlIO::gzip and PerlIO::via::Bzip2.

Also ... making changes to an XML file without the use of a module that actually understands the format is dangerous. Sooner or later you run into problems with encoding, entities or comments. I'm not saying you may never ever do it ... if it's a one-time transformation of a known XML and the changes are simple enough, go ahead ... but do be careful.

Jenda
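As a rough illustration of the "decompress as you read, compress as you write" idea combined with the simple string replacement from the original question, here is a minimal sketch. It deliberately swaps in the core modules IO::Uncompress::Gunzip and IO::Compress::Gzip instead of the PerlIO::gzip layer named above, so that it runs on a stock perl. The sub name `replace_in_gzip` and its arguments are hypothetical, and it treats the file as plain text, so all of the caveats above about encodings, entities and comments still apply.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Gzip     qw($GzipError);

# Stream a gzip-compressed file line by line, replace a literal string,
# and write the result gzip-compressed. Returns the number of replacements.
# replace_in_gzip is an illustrative name, not something from the thread.
sub replace_in_gzip {
    my ($in_path, $out_path, $old, $new) = @_;

    my $in = IO::Uncompress::Gunzip->new($in_path)
        or die "Could not read $in_path: $GunzipError";
    my $out = IO::Compress::Gzip->new($out_path)
        or die "Could not write $out_path: $GzipError";

    my $pattern = quotemeta $old;    # match $old as literal text, not as a regex
    my $changed = 0;
    while (defined(my $line = $in->getline)) {
        my $n = ($line =~ s/$pattern/$new/g);   # number of substitutions, or false
        $changed += $n if $n;
        $out->print($line);
    }
    $in->close;
    $out->close or die "Could not finish $out_path: $GzipError";
    return $changed;
}

# Run only when called as: script.pl in.gz out.gz "old string" "new string"
if (@ARGV == 4) {
    my $n = replace_in_gzip(@ARGV);
    print "Replacements made: $n\n";
}
```

Because the replacement works line by line, an "old string" that spans a line break will not be found. Whether compression pays off depends on how the CPU cost compares with the disk throughput on the machine in question.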
Re: Faster file read, text search and replace by haukex (Archbishop) on Feb 13, 2018 at 22:28 UTC
Could you tell us a bit more, like how long it took to process this file, and exactly what the "old string" and "new string" are?
Re: Faster file read, text search and replace by Cristoforo (Curate) on Feb 14, 2018 at 20:29 UTC
An example that edits an XML file was recently discussed here.

