Extract individual lines and block of text from large files

wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I’m a Perl novice who is attempting to extract individual lines of text AND a block of text from very large text files. Memory constraints suggest reading the files one line at a time. However, extracting blocks of text requires reading multiple lines rather than a single line. One solution that I have discovered is to read the file into an array (Tie::File, or something simpler maybe … the simpler the better for the novice here). One problem with this solution is identifying the information that I need within the individual elements of the array (it seems to require a technique to read sequentially within the elements, extract the info from them, move to the next element, etc., which is way over my head). I’ve included my code below, along with a brief sample of text to simulate my input data. My code includes comments to identify the beginning and ending points for the block of text I wish to accumulate (i.e., <FILENAME>v144610_ex21.htm and </DOCUMENT>, respectively). I apologize for my incompetence in advance. I am grateful for any suggestions. Thank you.

*********** Sample text ***************
CONFORMED PERIOD OF REPORT:    20081231     <------ individual line I want
FILED AS OF DATE:        20090331     <------ individual line I want
DATE AS OF CHANGE:        20090331     <------ individual line I want

CENTRAL INDEX KEY:        0000786368    <------ individual line I want

    FORM TYPE:        10-K    <------ individual line I want

Whole buncha text here …………….

</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm      <-----------My starting point
<TEXT>
<html>
      *************  BODY OF TEXT I WISH TO EXTRACT ****************
</html>
</TEXT>
</DOCUMENT>               <----------- My ending point

**********End of sample text ***********



#!/usr/bin/perl -w
use strict;
use warnings;
use File::stat;
use lib "c:/strawberry/perl/site/lib";

#Specify the directory containing the files that you want to read;
my $files_dir = 'E:\research\audit fee models\filings\Test';

#Specify the directory containing the results/output;
my $write_dir =  'E:\research\audit fee models\filings\filenames\filenames.txt';

#Open the directory containing the files you plan to read;
opendir(my $dir_handle, $files_dir) or die "Can't open directory $!";

#Initialize the variable names.
my $file_count = 0;
my $line_count=0;
my $cik=-99;
my $form_type="";
my $form="";
my $report_date=-99;
my $htm="";
my $url="";
my $slash='/';
my $line_count=0;

#Loop for reading each file in the input directory;

while (my $filename = readdir($dir_handle))  {
next unless -f $files_dir.'/'.$filename;
print "Processing $filename\n";

#Open the input file;
open my $FH_IN, '<',$files_dir.'/'.$filename or die "Can't open $filename";

#Within the file loop, read each line of the current file;
while (my $line = <$FH_IN>) {     
next unless -f $files_dir.'/'.$filename;

 if ($line_count > 500000) { last;}

#Begin extracting header type data from the file;

  if($line=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1; $cik =~ s/^0+//;}

  if($line=~m/^\s*FORM\s*TYPE:\s*(10k.*$)/im || ($line=~m/^\s*FORM\s*TYPE:\s*(10-k.*$)/im))
     {$form_type=$1;}
  if($line=~m/^\s*CONFORMED\s*PERIOD\s*OF\s*REPORT:\s*(\d*)/m){$report_date=$1;}

#End of header type information;

#Begin block text accumulation;

#This REGEX identifies the starting point of the text I wish to accumulate;

  if($line=~m/^\s*<FILENAME>(.*?)(ex21)(.*?)(.htm$)/igm ||
     $line=~m/^\s*<FILENAME>(.*?)(EX-21)(.*?)(.htm$)/igm ||
     $line=~m/^\s*<FILENAME>(.*?)(ex21)(.*?)(.htm$)/igm   ||
     $line=~m/^\s*<FILENAME>(.*?)(EX-21)(.*?)(.htm$)/igm)        
         {$htm=join('',$1,$2,$3,$4);    }
     
#Something seemingly here that accumulates text, using PUSH, or whatever;
     
         
#This is the ending point of the text I wish to accumulate;        
if($line=~m/^\s*</DOCUMENT>/igm;

#End block text accumulation;         
 
#Update line counter;

++$line_count;

 }

Replies are listed 'Best First'.
Re: Extract individual lines and block of text from large files by tangent (Parson) on Apr 06, 2016 at 21:39 UTC
You could add an inner loop:

{
    $htm = join('',$1,$2,$3,$4);
    while ( my $htm_line = <$FH_IN> ) {
        last if $htm_line =~ m/^\s*<\/DOCUMENT>/i;
        $htm .= $htm_line;
    }
}

Note that you need to escape the '/' in the DOCUMENT tag.
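
The inner-loop idea can be tried without the original filings. The following is only a minimal sketch: the sub name `extract_block`, the single combined filename pattern, and the in-memory sample are assumptions made for illustration, not part of the thread. It reads from a string through an in-memory filehandle, which stands in for `$FH_IN`.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the matched filename and the lines that
# follow the <FILENAME> line, up to (not including) the next </DOCUMENT>.
sub extract_block {
    my ($fh) = @_;
    while ( my $line = <$fh> ) {
        # One case-insensitive pattern covers both ex21 and EX-21.
        if ( $line =~ m/^\s*<FILENAME>\s*(\S*ex-?21\S*\.htm)\s*$/i ) {
            my $name = $1;
            my $body = '';
            # Inner loop: keep reading the same handle until the end tag.
            while ( my $inner = <$fh> ) {
                last if $inner =~ m/^\s*<\/DOCUMENT>/i;
                $body .= $inner;
            }
            return ( $name, $body );
        }
    }
    return;    # no matching exhibit found
}

my $sample = <<'END';
<DOCUMENT>
<TYPE>EX-21
<FILENAME>v144610_ex21.htm
<TEXT>
<html>
body text
</html>
</TEXT>
</DOCUMENT>
trailing line
END

# Open the string as a file so no real filing is needed.
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my ( $name, $body ) = extract_block($fh);
close $fh;
print "$name\n$body";
```

Because the inner loop consumes lines from the same handle, the outer loop resumes right after the `</DOCUMENT>` line; only the block being accumulated is held in memory.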
Re^2: Extract individual lines and block of text from large files by wrkrbeee (Scribe) on Apr 06, 2016 at 21:45 UTC
Thank you!
Re: Extract individual lines and block of text from large files by choroba (Cardinal) on Apr 06, 2016 at 20:28 UTC
Crossposted to StackOverflow. It's considered polite to inform about crossposting so that people not attending both sites don't waste their time hacking a solution for a problem already solved at the other end of the internet.

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^2: Extract individual lines and block of text from large files by wrkrbeee (Scribe) on Apr 06, 2016 at 20:36 UTC
Withdrawn from StackOverflow.
Re^3: Extract individual lines and block of text from large files by choroba (Cardinal) on Apr 06, 2016 at 20:40 UTC
That wasn't the point. Crossposting is not bad if announced.
Re^4: Extract individual lines and block of text from large files by wrkrbeee (Scribe) on Apr 06, 2016 at 20:43 UTC
Re: Extract individual lines and block of text from large files (xml) by Anonymous Monk on Apr 06, 2016 at 19:56 UTC
Sample data also goes in code tags. For XML you want one of XML::Twig or XML::LibXML; both can do well to manage memory and both come with Strawberry. See http://xmltwig.org/article/ways_to_rome/ways_to_rome.html and Re: question about lookaheads and threatexpert/html parsing.
Re: Extract individual lines and block of text from large files by Marshall (Canon) on Apr 07, 2016 at 14:27 UTC
I'm not sure that I completely understand the context of your problem, but here are a few suggestions. Make a more general expression for the 'NAME: DATE' lines and capture into a hash. You have 4 regex lines to describe the filename; perhaps just one different regex would suffice? When you find the place after the FILENAME where the data capture is supposed to happen, it sometimes works out well to call a subroutine to finish that job off. Here is some code...

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
$Data::Dumper::Sortkeys =1;

my %hash;

# Collect every line from <html> through </html> into $hash{DATA}
sub get_html
{
   while (<DATA>)
   {
      if (/<html>/ .. /<\/html>/)  ##see [id://525392]
      {
         chomp;
         push @{$hash{DATA}},$_;
      }
   }
}

while (<DATA>)
{
   # Generic 'NAME: value' header line
   if ( (my ($name, $date) = m/^\s*([\w ]+):\s+([\w-]+)/))
   {
      $hash{$name}=$date;
   }
   # Start of the EX-21 exhibit: hand the rest of the input to get_html()
   if (my ($filename) = m/^\s*<FILENAME>\s*(\w+ex(-)?21.*.htm)/i)
   {
      $hash{FILENAME}=$filename;
      get_html();
   }
}
print Dumper \%hash;

=Prints: *******
$VAR1 = {
          'CENTRAL INDEX KEY' => '0000786368',
          'CONFORMED PERIOD OF REPORT' => '20081231',
          'DATA' => [
                      '<html>',
                      'blah ',
                      'smore blah',
                      'blahblah',
                      ' BODY OF TEXT I WISH TO EXTRACT *',
                      '</html>'
                    ],
          'DATE AS OF CHANGE' => '20090331',
          'FILED AS OF DATE' => '20090331',
          'FILENAME' => 'v144610_ex21.htm',
          'FORM TYPE' => '10-K'
        };
=cut

__DATA__
******* Sample text ***********
CONFORMED PERIOD OF REPORT:    20081231     ------ individual line I want
FILED AS OF DATE:        20090331     ------ individual line I want
DATE AS OF CHANGE:        20090331     ------ individual line I want

CENTRAL INDEX KEY:        0000786368    ------ individual line I want

    FORM TYPE:        10-K    ------ individual line I want

Whole buncha text here …………….

</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm      -----------My starting point
<TEXT>
<html>
blah 
smore blah
blahblah
 BODY OF TEXT I WISH TO EXTRACT *
</html>
</TEXT>
</DOCUMENT>               ----------- My ending point

******End of sample text *********
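
The two ideas in this reply (a general 'NAME: value' capture into a hash, and a separate capture step once the FILENAME line is seen) can also be combined into a single pass over a lexical filehandle, so the same sub could be called for each file in a directory. What follows is a sketch under stated assumptions: the name `parse_filing`, the header pattern, and the sample data are hypothetical, and a plain flag variable is swapped in for the range operator used above.

```perl
use strict;
use warnings;

# Hypothetical helper: one pass over a filehandle, one line read at a time.
sub parse_filing {
    my ($fh) = @_;
    my %info     = ( DATA => [] );
    my $in_block = 0;    # true between the <FILENAME> line and </DOCUMENT>

    while ( my $line = <$fh> ) {
        chomp $line;
        if ($in_block) {
            # m{...} delimiters avoid having to escape the '/' in the tag.
            if ( $line =~ m{^\s*</DOCUMENT>}i ) { $in_block = 0; next; }
            push @{ $info{DATA} }, $line;
        }
        elsif ( $line =~ m{^\s*<FILENAME>\s*(\S*ex-?21\S*\.htm)}i ) {
            $info{FILENAME} = $1;
            $in_block = 1;
        }
        elsif ( $line =~ m/^\s*([A-Z][A-Z ]+):\s+(\S+)/ ) {
            $info{$1} = $2;    # header lines such as "FORM TYPE: 10-K"
        }
    }
    return \%info;
}

my $sample = <<'END';
CENTRAL INDEX KEY:        0000786368
    FORM TYPE:        10-K
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<FILENAME>v144610_ex21.htm
<TEXT>
<html>
body text
</html>
</TEXT>
</DOCUMENT>
END

# Open the string as a file so no real filing is needed.
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my $info = parse_filing($fh);
close $fh;

print "$_ => $info->{$_}\n" for grep { $_ ne 'DATA' } sort keys %$info;
print "$_\n" for @{ $info->{DATA} };
```

Passing the filehandle as an argument, rather than reading a global handle inside the sub, is a design choice for this sketch: it keeps the parsing logic independent of where the lines come from.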

