Hi Perl Monks, I’m a Perl novice who is attempting to extract individual lines of text AND a block of text from very large text files. Memory constraints suggest reading the files one line at a time. However, extracting a block of text requires reading multiple lines rather than a single line. One solution that I have discovered is to read the file into an array (Tie::File, or maybe something simpler … the simpler the better for the novice here). One problem with this solution is identifying the information that I need within the individual elements of the array. It seems to require a technique to read sequentially within the elements, extract the info from each element, move to the next element, and so on, which is way over my head. I’ve included my code below, along with a brief sample of text to simulate my input data. My code includes comments to identify the beginning and ending points for the block of text I wish to accumulate (i.e., <FILENAME>v144610_ex21.htm and </DOCUMENT>, respectively). I apologize for my incompetence in advance. I am grateful for any suggestions. Thank you.

*********** Sample text ***************
CONFORMED PERIOD OF REPORT:    20081231     <------ individual line I want
FILED AS OF DATE:        20090331     <------ individual line I want
DATE AS OF CHANGE:        20090331     <------ individual line I want

CENTRAL INDEX KEY:        0000786368    <------ individual line I want

    FORM TYPE:        10-K    <------ individual line I want

Whole buncha text here …………….

</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm      <----------- My starting point
<TEXT>
<html>
      *************  BODY OF TEXT I WISH TO EXTRACT ****************
</html>
</TEXT>
</DOCUMENT>               <----------- My ending point

**********End of sample text ***********



#!/usr/bin/perl -w
use strict;
use warnings;
use File::stat;
use lib "c:/strawberry/perl/site/lib";

#Specify the directory containing the files that you want to read;
my $files_dir = 'E:\research\audit fee models\filings\Test';

#Specify the directory containing the results/output;
my $write_dir = 'E:\research\audit fee models\filings\filenames\filenames.txt';

#Open the directory containing the files you plan to read;
opendir(my $dir_handle, $files_dir) or die "Can't open directory $!";

#Initialize the variable names.
my $file_count = 0;
my $line_count=0;
my $cik=-99;
my $form_type="";
my $form="";
my $report_date=-99;
my $htm="";
my $url="";
my $slash='/';

#Loop for reading each file in the input directory;

while (my $filename = readdir($dir_handle)) {
    my $path = $files_dir.'/'.$filename;
    next unless -f $path;
    print "Processing $filename\n";

    #Open the input file;
    open my $FH_IN, '<', $path or die "Can't open $filename: $!";

    #Per-file state: reset the line counter and the block accumulator;
    $line_count = 0;
    my $in_block = 0;    #true while between the starting and ending points
    my @block;           #accumulated lines of the block of text

    #Within the file loop, read each line of the current file;
    while (my $line = <$FH_IN>) {

        if ($line_count > 500000) { last; }

        #Begin extracting header type data from the file;

        if ($line =~ m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/) { $cik = $1; $cik =~ s/^0+//; }

        if ($line =~ m/^\s*FORM\s*TYPE:\s*(10-?k.*$)/i) { $form_type = $1; }

        if ($line =~ m/^\s*CONFORMED\s*PERIOD\s*OF\s*REPORT:\s*(\d*)/) { $report_date = $1; }

        #End of header type information;

        #Begin block text accumulation;

        #This REGEX identifies the starting point of the text I wish to accumulate;
        if ($line =~ m/^\s*<FILENAME>(.*?ex-?21.*?\.htm)\s*$/i) {
            $htm      = $1;
            $in_block = 1;
            @block    = ();
        }

        #Accumulate every line while inside the block;
        push @block, $line if $in_block;

        #This is the ending point of the text I wish to accumulate;
        #@block now holds the complete block, ready to be written out;
        if ($in_block && $line =~ m{^\s*</DOCUMENT>}i) {
            $in_block = 0;
        }

        #End block text accumulation;

        #Update line counter;
        ++$line_count;
    }
    close $FH_IN;
}
closedir($dir_handle);
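
For comparison, here is a minimal, self-contained sketch of the same idea (grab single header lines and accumulate one block while still reading one line at a time), using Perl's range operator (`..`) instead of an explicit flag. The helper name `extract_filing`, the hash keys, and the in-memory sample are assumptions made for illustration only; the patterns are guesses based on the sample text above and may need adjusting for real filings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads a filehandle one line at a time and returns
# the header fields plus the lines of every EX-21 block it encounters.
sub extract_filing {
    my ($fh) = @_;
    my (%header, @block);

    while (my $line = <$fh>) {
        # Single-line header fields.
        if ($line =~ /^\s*CENTRAL\s+INDEX\s+KEY:\s*0*(\d+)/) {
            $header{cik} = $1;            # leading zeros are dropped
        }
        elsif ($line =~ /^\s*FORM\s+TYPE:\s*(\S+)/) {
            $header{form_type} = $1;
        }
        elsif ($line =~ /^\s*CONFORMED\s+PERIOD\s+OF\s+REPORT:\s*(\d+)/) {
            $header{report_date} = $1;
        }

        # The range operator is true from the line matching the first
        # pattern through the line matching the second pattern, inclusive.
        if ($line =~ /^\s*<FILENAME>.*ex-?21.*\.htm/i
                .. $line =~ m{^\s*</DOCUMENT>}i) {
            push @block, $line;
        }
    }
    return (\%header, \@block);
}

# Simulated input (an assumption), opened as an in-memory filehandle.
my $sample = <<'END';
CONFORMED PERIOD OF REPORT:    20081231
FILED AS OF DATE:        20090331
CENTRAL INDEX KEY:        0000786368
    FORM TYPE:        10-K
Whole buncha text here
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm
<TEXT>
<html>
BODY OF TEXT I WISH TO EXTRACT
</html>
</TEXT>
</DOCUMENT>
Trailing text that should be ignored
END

open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my ($header, $block) = extract_filing($fh);
close $fh;

print "CIK: $header->{cik}\n";
print "Form type: $header->{form_type}\n";
print "Report date: $header->{report_date}\n";
print "Block lines: ", scalar(@$block), "\n";
print @$block;
```

Only the lines of the block are ever held in memory, so this should scale to large files as long as the block itself is modest in size. One caveat: the range operator keeps its on/off state between calls, so a file that ends before its closing `</DOCUMENT>` would leave it switched on for the next file.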

In reply to Extract individual lines and block of text from large files by wrkrbeee
