Hi Perl Monks, I’m a Perl novice attempting to extract individual lines of text AND a block of text from very large text files. Memory constraints suggest reading the files one line at a time. However, extracting a block of text requires reading multiple lines rather than a single line. One solution that I have discovered is to read the file into an array (with Tie::File, or maybe something simpler … the simpler the better for the novice here). One problem with this solution is identifying the information that I need within the individual elements of the array. It seems to require a technique to read sequentially within the elements, extract the info from them, move to the next element, and so on, which is way over my head.

I’ve included my code below, along with a brief sample of text to simulate my input data. My code includes comments that identify the beginning and ending points of the block of text I wish to accumulate (i.e., <FILENAME>v144610_ex21.htm and </DOCUMENT>, respectively). I apologize for my incompetence in advance. I am grateful for any suggestions. Thank you.
*********** Sample text ***************
CONFORMED PERIOD OF REPORT: 20081231 ------ individual line I want
FILED AS OF DATE: 20090331 ------ individual line I want
DATE AS OF CHANGE: 20090331 ------ individual line I want
CENTRAL INDEX KEY: 0000786368 ------ individual line I want
FORM TYPE: 10-K ------ individual line I want
Whole buncha text here …………….
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm -----------My starting point
<TEXT>
<html>
************* BODY OF TEXT I WISH TO EXTRACT ****************
</html>
</TEXT>
</DOCUMENT> ----------- My ending point
**********End of sample text ***********
#!/usr/bin/perl -w
use strict;
use warnings;
use File::stat;
use lib "c:/strawberry/perl/site/lib";
#Specify the directory containing the files that you want to read;
my $files_dir = 'E:\research\audit fee models\filings\Test';
#Specify the file that will receive the results/output;
my $write_dir = 'E:\research\audit fee models\filings\filenames\filenames.txt';
#Open the directory containing the files you plan to read;
opendir(my $dir_handle, $files_dir) or die "Can't open directory $!";
#Initialize the variable names.
my $file_count = 0;
my $line_count=0;
my $cik=-99;
my $form_type="";
my $form="";
my $report_date=-99;
my $htm="";
my $url="";
my $slash='/';
#Loop for reading each file in the input directory;
while (my $filename = readdir($dir_handle)) {
    next unless -f $files_dir.'/'.$filename;
    print "Processing $filename\n";
    #Open the input file;
    open my $FH_IN, '<', $files_dir.'/'.$filename or die "Can't open $filename: $!";
    #Within the file loop, read each line of the current file;
    while (my $line = <$FH_IN>) {
        if ($line_count > 500000) { last; }
        #Begin extracting header type data from the file;
        if ($line =~ m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d+)/) { $cik = $1; $cik =~ s/^0+//; }
        #One case-insensitive pattern covers both 10k... and 10-k...;
        if ($line =~ m/^\s*FORM\s*TYPE:\s*(10-?k.*)$/i) { $form_type = $1; }
        if ($line =~ m/^\s*CONFORMED\s*PERIOD\s*OF\s*REPORT:\s*(\d+)/) { $report_date = $1; }
        #End of header type information;
        #Begin block text accumulation;
        #This regex identifies the starting point of the text I wish to accumulate;
        #(case-insensitive, so it covers ex21 and EX-21; $1 is the whole file name);
        if ($line =~ m/^\s*<FILENAME>(.*?ex-?21.*?\.htm)\s*$/i) { $htm = $1; }
        #Something seemingly goes here that accumulates text, using push, or whatever;
        #This is the ending point of the text I wish to accumulate;
        if ($line =~ m{^\s*</DOCUMENT>}i) {
            #Accumulation should stop here;
        }
        #End block text accumulation;
        #Update line counter;
        ++$line_count;
    }
    close $FH_IN;
}
closedir($dir_handle);
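One way the accumulation step marked in the comments above could be filled in is with a simple state flag, so the file is still read one line at a time and only the wanted block is held in memory. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper name extract_ex21_block, the $in_block flag, and the in-memory sample data are all hypothetical and do not come from the original code. It assumes the block ends at the first </DOCUMENT> line after the matching <FILENAME> line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): reads one filing line by
# line and returns the EX-21 file name plus a reference to the lines found
# between the matching <FILENAME> line and the next </DOCUMENT> line.
sub extract_ex21_block {
    my ($fh) = @_;
    my $htm      = '';
    my $in_block = 0;    # true while we are inside the block to keep
    my @block;

    while (my $line = <$fh>) {
        if (!$in_block) {
            # Starting point: a <FILENAME> line naming an ex21/EX-21 .htm file.
            if ($line =~ m/^\s*<FILENAME>(.*?ex-?21.*?\.htm)\s*$/i) {
                $htm      = $1;
                $in_block = 1;
            }
            next;    # the starting line itself is not part of the block
        }
        # Ending point: the first </DOCUMENT> after the starting point.
        last if $line =~ m{^\s*</DOCUMENT>}i;
        push @block, $line;
    }
    return ($htm, \@block);
}

# Simulated input (an assumption for this demo), read through an in-memory
# filehandle so the sketch runs without any files on disk.
my $sample = <<'END_SAMPLE';
FORM TYPE: 10-K
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm
<TEXT>
<html>
BODY OF TEXT I WISH TO EXTRACT
</html>
</TEXT>
</DOCUMENT>
END_SAMPLE

open my $fh, '<', \$sample or die "Can't open in-memory sample: $!";
my ($htm, $block) = extract_ex21_block($fh);
close $fh;

print "Found $htm\n";
print @$block;
```

In the real script the same flag and push could live directly inside the existing inner while loop, next to the header regexes, with the flag and the array reset each time a new file is opened.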