creating array of hashes from input file

I can't think of a proper way of capturing just the data I need

You'll have to be more specific than that, we can't guess what is useful information for you. I can't even understand how you're supposed to read your data. See How do I post a question effectively? Be aware that you shouldn't expect to be given a fully functional solution, but rather advice, and maybe corrections. So you'll have to show some work.

[reply]

My approach when parsing lists depends on the nature of the list.

If I can find out from looking at a single line what kind it is, then I use regular expressions to fill out a hash and flush the record whenever a new set starts.

If I can't find out from looking at a single line what kind it is, I use counters or flags to know what line I am on.

In your case, it looks to me as if you basically have a report with some header data and then three payload lines, the issuance date, the address and the description. One ugly thing seems to be that the address and the permission can span multiple lines, but from the unrepresentative example you have posted, each item seems to be delimited by a blank line from the previous item.

Going from these assumptions, my approach would be something like the following (untested):

#!perl -w
use strict;
use Data::Dumper;

# Output a row of information
sub flush {
    my( $record ) = @_;
    if( $record->{permit} ) {
        print Dumper $record;
    };
    delete $record->{permit};
};



# This will collect all information for one entry:
my %info;
my $last_page;
my $expected_pages;
my $record_kind;

my %next_record = (
    address => 'description',
    description => undef,
);

while(<DATA>) {
    if( m!^Page (\d+) of (\d+)! ) {
        $last_page = $1;
        $expected_pages ||= $2;
        next;
    };
    if( m!^(Jan|Feb|...|Jun|...) (19\d\d|20\d\d)! ) {
        $info{ report_date } = "$2-$1";
        next;
    };

    # ... more code to skip the header left for the reader
    next if( m!MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT! );
    next if( m!^Permit Issued! );

    if( m!^(\d\d)/(\d\d)/((?:19|20)\d\d)$! ) {
        flush(\%info);
        $info{ permit } = "$3-$2-$1";
        $record_kind = 'address';
        <>; # skip empty line
        next
    };

    # Fast-forward until the next set of lines
    if( $record_kind ) {
        while( <> !~ /^\s*$/ ) {
            s!\s*$!!;
            $info{ $record_kind } .= " " . $_;
        };
        $record_kind = $next_record{ $record_kind };
    } else {
        die "Unknown line [$_] on line $.";
    };
};

warn "Uhoh - expected $expected_pages but only read up to $last_page"
    if( $expected_pages != $last_page );

__DATA__
Page 1 of 3

100 Civic Center Way
Calabasas, California 91302

7/12/2012 9:21:02AM

MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT
Jun 2012

Permit Issued

Address

Description

06/01/2012

26166 ROYMOR
DR

Upgrade panel from 100 amp to 200 amp

06/04/2012

24956 NORMANS
WAY

(6) light fixtures @ patio; (3) branch circuits; (4) electric
heaters

06/05/2012

4273 VICASA
DR

Construct 339 SF Covered Loggia
[download]

[reply]
[d/l]

                    100 Civic Center Way                              
+                        Page 1 of 3

                    Calabasas, California 91302
                                                                      
+                        7/12/2012 9:21:02AM




                      MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT
                                        Jun 2012



Permit Issued   Address                               Description


06/01/2012      26166 ROYMOR                          Upgrade panel fr
+om 100 amp to 200 amp
                DR


06/04/2012      24956 NORMANS                         (6) light fixtur
+es @ patio; (3) branch circuits; (4) electric
                                                      heaters
                WAY


06/05/2012      4273 VICASA                           Construct 339 SF
+ Covered Loggia
                DR


06/07/2012      26011 ALIZIA CANYON                   R/R (1) <100K BT
+U Furnace in garage; () <100K BTU
                                                      condenser outsid
+e (NO DUCTS TO BE CHANGED OUT)
                DR E


06/07/2012      4240 LOST HILLS                       R/R (7) windows 
+(like for like) AT LEAST ONE PANE MUST
                                                      BE TEMPERED
                RD 1503


06/08/2012      3574 ELM                              Construct Retain
+ing Wall in front of (E) retaining wall:        4
                                                      1/2' average x 3
+2 LF = approx. 144 SF
                DR


06/13/2012      4026 TOWHEE                           Construct a 460 
+SF Pool/49 SF Spa
                DR
[download]

[reply]
[d/l]

would this layout would be an easier an approach to parse

Maybe, depends on what pages 2 and 3 are like. Are you parsing many reports all formatted the same ?. Or is the 3 pages just a small sample of the real report.

#!perl
use strict;
use Data::Dumper;

my $infile = 'report1.txt';

# date address description
my $fmt   = "A16 A38 A*";
my @data  = ();
my $recno = -1;
my $flag  = 0;

open IN,'<',$infile or die "$infile $!";
while (<IN>){
  chomp;
  next unless /\S/;
  $flag = 0 if /Page \d of \d/;
  
  my ($date,$addr,$desc) = unpack $fmt,$_;
  if ( $date =~ /\d\d.\d\d.20\d\d/ ){
    $flag = 1;  
    ++$recno;  
    $data[$recno] = [ $date,$addr,$desc ];
  } elsif ($flag) {
    $data[$recno][1] .= ' '.$addr if $addr;
    $data[$recno][2] .= ' '.$desc if $desc; 
  }
}
close IN;
print scalar(@data)." records read\n";
print Dumper \@data;
[download]

[reply]
[d/l]

Here, instead of a regex I would likely use unpack with the appropriate template(s).

Append to %info until you encounter an empty line, then flush.

[reply]
[d/l]

Corion made good points, but it needs a flush(\%info); at the end.

Here is a more non form dependent single state driven method. I like state machines too! It also fixes some common excel problems noted.

#!perl -w
use strict;
use Data::Dumper;



# delimiter style , pick one 
my $delim;

# this mode produces comma delimted files, special magic needs to be a
+dded to output 

$delim=','; 

# this produces tab delimted files, fields just cant contain a tab 
#  the output needs to have a .txt filetype and be opened with excel 
# or another filetype (say .tdf) if that filetype is assigned to be op
+ened by excel 
#  i like tdf better, but you said csv 

# $delim="\t"; 


sub delimfix{ 
# excel has problems with text files 
#  something excell may think as a date    but should be text 
#    2012-12-12
#    2/4
#  something excell may think is a number but is text 
#    1e123
#  leading zeros as text 
#    00001
#  leading equals as text 
#    =x
# (leading equals-dquote, training dquote)  fixes these problems (the 
+equals function) 
# There still exists a long string problem with ="..." 
#    |001-----0|010-----0|020-----0|030-----0|040-----0|050-----0|060-
+----0|070-----0|080-----0|090-----0|101-----0|110-----0|120-----0|130
+-----0|140-----0|150-----0|160-----0|170-----0|180-----0|190-----0|20
+1-----0|210-----0|220-----0|230-----0|240-----0|250-----0|260-----0|2
+70-----0|280-----0|290-----0

  my $field=shift; 
  if ($field=~m!"!) { 
      $field=~s!"!""!g; 
  }
  if ($delim eq ',' && $field=~m!,!) { 
#     csv also has problems with commas in string    
#      but if it has a comma it doesnt have the other problems 
      return '"'.$field.'"'; 
    
  }
  return '="'.$field.'"'; 
}; 

sub infoprint {
# Output a row of information
    my($title,$record) = @_;
#    print $title." data\n";  print Dumper $record;
  my @delimline; 
  if (lc($title) eq 'top') {
      print delimfix($record->{title})."\n"; 
      print "\n"; 
      for my $f (split("\t",$record->{getheaders})) { push @delimline,
+delimfix($f)} 
      } 
  else { 
      if ($record->{permitdate}=~m!^\d+/\d+/\d+!) {
          push @delimline,$record->{permitdate};
      } else {
          push @delimline,delimfix($record->{permitdate});
          } 
      push @delimline,delimfix($record->{address});
      push @delimline,delimfix($record->{description});
  }
  print join($delim,@delimline)."\n"; 
};

# used in getheaders to add tabs 
my $linejoin=' '; 

# This will collect all information for one entry:
my %info;

# what is in %info 
my $infotype='Top'; 

my %topinfo; 
# state machine data 
my $record_kind='none';
my $last_record_blank=1; 

my %next_record = (  # state table was => now 
    none        => 'page', 
    page        => 'topaddr', 
    topaddr     => 'rundate', 
    rundate     => 'title', 
    title       => 'getheaders', 
    getheaders  => 'getheaders',        # hold state till manual reset
+ 
    founddesc   => 'permitdate',
    permitdate  => 'address',
    address     => 'description',
    description => 'permitdate', 
);


while(<DATA>) {
    chomp; 
    s!\t! !g;  # can have tabs in fields , kill them 
    if ( m!^\s*$!) { # blanks dont matter 
        $last_record_blank=1; 
        next; 
    } 
    elsif ($last_record_blank) { #  new data so go to next state 
        $record_kind = $next_record{ $record_kind };
        if ($record_kind eq 'permitdate') { # move to permitdate means
+ dump full data 
            infoprint($infotype,\%info); 
            %info=();
            $infotype='Permit'; 
        }
        $last_record_blank=0;
    } 
    if (defined($info{ $record_kind })) { 
        $info{ $record_kind } .= $linejoin . $_;
        }
    else {
        $info{ $record_kind } = $_;
        }
    if ($record_kind eq  'getheaders') { 
         if ( m!^Description! ) { 
            $record_kind='founddesc';  # manual move to next state 
            $linejoin=' '; 
            %topinfo=%info; 
            } 
         else {$linejoin="\t"; } 
    } 
};

infoprint('Permit',\%info,); 
print "------\n"; 
      print delimfix($topinfo{page})."\n"; 
      print delimfix($topinfo{topaddr})."\n"; 
      print delimfix($topinfo{rundate})."\n"; 


__DATA__
Page 1 of 3

100 Civic Center Way
Calabasas, California 91302

7/12/2012 9:21:02AM "hi"

MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT
Jun 2012

Permit Issued

Address

Description

06/01/2012

26166 ROYMOR
DR

Upgrade panel from 100 amp to 200 amp

06/04/2012

24956 NORMANS
WAY

(6) light fixtures @ patio; (3) branch circuits; (4) electric
heaters

06/05/2012

4273 VICASA
DR

Construct 339 SF Covered Loggia
[download]

[reply]
[d/l]
[select]

I played with the above so much i factored out the array of hashs to make the output file instead after putting it in, So the following is refactored again to construct the AoH, THEN make the output file from it.

#!perl -w
use strict;
use Data::Dumper;

# all fo them 
my $aoh=[]; 

# used in getheaders to add tabs 
my $linejoin=' '; 

# This will collect all information for one entry:
my $infohash={type=>'top'};

# state machine data 
my $record_kind='none';
my $last_record_blank=1; 

my %next_record = (  # state table was => now 
    none        => 'page', 
    page        => 'topaddr', 
    topaddr     => 'rundate', 
    rundate     => 'title', 
    title       => 'getheaders', 
    getheaders  => 'getheaders',        # hold state till manual reset
+ 
    founddesc   => 'permitdate',
    permitdate  => 'address',
    address     => 'description',
    description => 'permitdate', 
);


while(<DATA>) {
    chomp; 
    s!\t! !g;  # can have tabs in fields, kill them  
    if ( m!^\s*$!) { # blanks dont matter 
        $last_record_blank=1; 
        next; 
    } 
    elsif ($last_record_blank) { #  new data so go to next state 
        $record_kind = $next_record{ $record_kind };
        if ($record_kind eq 'permitdate') { # move to permitdate means
+ dump full data 
            push @$aoh,$infohash; 
            $infohash={type=>'permit'};
        }
        $last_record_blank=0;
    } 
    if (defined($infohash->{ $record_kind })) { 
        $infohash->{ $record_kind } .= $linejoin . $_;
        }
    else {
        $infohash-> { $record_kind } = $_;
        }
    if ($record_kind eq  'getheaders') { 
         if ( m!^Description! ) { 
            $record_kind='founddesc';  # manual move to next state 
            $linejoin=' '; 
            } 
         else {$linejoin="\t"; } 
    } 
};
push @$aoh,$infohash; 

print Dumper($aoh); 

#########################################################
# make csv 
#########################################################

# delimiter style , pick one 
my $delim;

# this mode produces comma delimted files, special magic needs to be a
+dded to output 

$delim=','; 

# this produces tab delimted files, fields just cant contain a tab 
#  the output needs to have a .txt filetype and be opened with excel 
# or another filetype (say .tdf) if that filetype is assigned to be op
+ened by excel 
#  i like tdf better, but you said csv 

# $delim="\t"; 


for my $hash (@$aoh) { 
    infoprint ($hash->{type},$hash); 
    } 
infoprint ('bot',@{$aoh}[0]); 


sub delimfix{ 
# excel has problems with text files 
#  something excell may think as a date    but should be text 
#    2012-12-12
#    2/4
#  something excell may think is a number but is text 
#    1e123
#  leading zeros as text 
#    00001
#  leading equals as text 
#    =x
# (leading equals-dquote, training dquote)  fixes these problems (the 
+equals function) 
# There still exists a long string problem with ="..." 
#    |001-----0|010-----0|020-----0|030-----0|040-----0|050-----0|060-
+----0|070-----0|080-----0|090-----0|101-----0|110-----0|120-----0|130
+-----0|140-----0|150-----0|160-----0|170-----0|180-----0|190-----0|20
+1-----0|210-----0|220-----0|230-----0|240-----0|250-----0|260-----0|2
+70-----0|280-----0|290-----0

  my $field=shift; 
  if ($field=~m!"!) { 
      $field=~s!"!""!g; 
  }
  if ($delim eq ',' && $field=~m!,!) { 
#     csv also has problems with commas in string    
#      but if it has a comma it doesnt have the other problems 
      return '"'.$field.'"'; 
    
  }
  return '="'.$field.'"'; 
}; 

sub infoprint {
# Output a row of information
    my($title,$record) = @_;
#    print $title." data\n";  print Dumper $record;
  my @delimline; 
  if (lc($title) eq 'top') {
      print delimfix($record->{title})."\n"; 
      print "\n"; 
      for my $f (split("\t",$record->{getheaders})) { push @delimline,
+delimfix($f)} 
      } 
  elsif (lc($title) eq 'bot') {
      print "------\n"; 
      print delimfix($record->{page})."\n"; 
      print delimfix($record->{topaddr})."\n"; 
      print delimfix($record->{rundate})."\n"; 
  }
  else { 
      if ($record->{permitdate}=~m!^\d+/\d+/\d+!) {
          push @delimline,$record->{permitdate};
      } else {
          push @delimline,delimfix($record->{permitdate});
          } 
      push @delimline,delimfix($record->{address});
      push @delimline,delimfix($record->{description});
  }
  print join($delim,@delimline)."\n"; 
};


__DATA__
Page 1 of 3

100 Civic Center Way
Calabasas, California 91302

7/12/2012 9:21:02AM "hi"

MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT
Jun 2012

Permit Issued

Address

Description

06/01/2012

26166 ROYMOR
DR

Upgrade panel from 100 amp to 200 amp

06/04/2012

24956 NORMANS
WAY

(6) light fixtures @ patio; (3) branch circuits; (4) electric
heaters

06/05/2012

4273 VICASA
DR

Construct 339 SF Covered Loggia
[download]

[reply]
[d/l]


more useful options
	PerlMonks