Regex to fix up records, some multiline fields, some not

butchie3980 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I have a batch of data dumps that are shipped to me from another site, and I want to convert the data to XML. I'm using regexes to grab fields, most of which are on one line, but sometimes a record will have a multiline field, and I'm struggling to get this working. Each record is pulled (using Tie::File) into a multi-line scalar called $currentrecord. Below is an example of the code I've tried, with sample data:

if ($currentrecord =~ m/^field2(.*)\nfield3/mi) {
    $field2data = $1;
}

Here are two examples of the data encountered:
record 1:
field1: data 1 monday
field2: data 2 monday
field3: data 3 monday

record 2:
field1: data 1 tuesday
field2: data 2 tuesday
        tuesday details line 1
        tuesday details line 2
field3: data 3 tuesday

The above approach isn't working when field2 has multiple lines. How can I catch both record styles?

UPDATE
OK, I tested all of the responses to this posting, and they were all effective. Thank you so much for your help.

Re: Regex to fix up records, some multiline fields, some not
by Athanasius (Archbishop) on Aug 20, 2013 at 09:10 UTC

You need to add the /s modifier to the regex so that . also matches newlines; without it, (.*) cannot span the continuation lines:

#! perl
use strict;
use warnings;

our $/ = '';

while (my $currentrecord = <DATA>)
{
    if ($currentrecord =~ m/^field2(.*)\nfield3/msi)
    {
        my $field2data = $1;
        print "Found \$field2data = $field2data\n";
    }
}

__DATA__
record 1:
field1: data 1 monday
field2: data 2 monday
field3: data 3 monday

record 2:
field1: data 1 tuesday
field2: data 2 tuesday
        tuesday details line 1
        tuesday details line 2
field3: data 3 tuesday

Output:

19:07 >perl 692_SoPW.pl
Found $field2data = : data 2 monday
Found $field2data = : data 2 tuesday
        tuesday details line 1
        tuesday details line 2

19:08 >

See perlre#Modifiers.
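
To see the effect of /s in isolation, here is a minimal sketch that runs the original /mi pattern and the /msi version against the same multiline record. The sub name grab_field2 and the shortened sample record are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text captured between "field2" and the
# following "field3" line, or undef when the pattern does not match.
sub grab_field2 {
    my ($record, $dot_matches_newline) = @_;
    my $re = $dot_matches_newline
        ? qr/^field2(.*)\nfield3/msi   # /s: "." also matches "\n"
        : qr/^field2(.*)\nfield3/mi;   # no /s: "." stops at "\n"
    return $record =~ $re ? $1 : undef;
}

# Shortened sample record with one continuation line.
my $record = join "\n",
    'field1: data 1 tuesday',
    'field2: data 2 tuesday',
    '        tuesday details line 1',
    'field3: data 3 tuesday',
    '';

for my $flag (0, 1) {
    my $got = grab_field2($record, $flag);
    printf "/s %s: %s\n", ($flag ? 'on ' : 'off'),
        defined $got ? "matched [$got]" : 'no match';
}
```

Without /s the capture cannot cross the line break after "data 2 tuesday", so the pattern fails on this record; with /s it captures everything up to the newline before field3.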

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Regex to fix up records, some multiline fields, some not
by Eily (Monsignor) on Aug 20, 2013 at 12:37 UTC

Instead of having one regex for each field, you can use the /g modifier to step from one field to the next, and use a lookahead, (?=EXPR), to check that what follows your field is another field rather than more data for the current one:

use Data::Dumper;

my $regex = qr/
              ^field(\d+):       # find a line starting by 'field' and capture its number
              (.*?)\n?           # find the smallest string before the next
              (?=^field\d+:|\z)  # line starting by 'field' or end of record. Rewind just before that point after the match.
            /msx; # ^ matches beginning of line, . matches \n and spaces and comments are ignored in the regex


my %result;
my $count = 1;
{ # block to limit the scope of local
  local $/ = ""; # records are separated by empty lines
  while(<DATA>)
  {
    my %hash;
    while(/$regex/g)
    {
      $hash{"field$1"} = $2;
    }
    $result{"record ".$count++} = \%hash;
  }
}
print Dumper \%result;

__DATA__
field1: data 1 monday
field2: data 2 monday
field3: data 3 monday

field1: data 1 tuesday
field2: data 2 tuesday
        tuesday details line 1
        tuesday details line 2
field3: data 3 tuesday

$VAR1 = {
          'record 1' => {
                          'field1' => ' data 1 monday',
                          'field2' => ' data 2 monday',
                          'field3' => ' data 3 monday
'
                        },
          'record 2' => {
                          'field1' => ' data 1 tuesday',
                          'field2' => ' data 2 tuesday
        tuesday details line 1
        tuesday details line 2',
                          'field3' => ' data 3 tuesday'
                        }
        };
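
The dump shows that each value keeps the space after the colon, and that a value can keep a trailing newline or embedded indentation. If that is unwanted, the values could be cleaned up before further use. This is only a sketch: the sub name tidy_value is hypothetical, and joining continuation lines with a single space is an assumption about what the data should look like.

```perl
use strict;
use warnings;

# Hypothetical helper: clean one captured field value.
# Assumption: continuation lines may be joined with a single space.
sub tidy_value {
    my ($value) = @_;
    $value =~ s/^\s+|\s+$//g;    # trim both ends (space after the colon, trailing newline)
    $value =~ s/\s*\n\s*/ /g;    # collapse each line break plus indentation to one space
    return $value;
}

print tidy_value(" data 2 tuesday\n        tuesday details line 1\n        tuesday details line 2"), "\n";
print tidy_value(" data 3 monday\n"), "\n";
```

Inside the inner loop this could be applied as $hash{"field$1"} = tidy_value($2); whether the line breaks should be preserved instead depends on the target format.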

Re: Regex to fix up records, some multiline fields, some not
by Utilitarian (Vicar) on Aug 20, 2013 at 09:13 UTC

Text::LineFold

print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."

Re: Regex to fix up records, some multiline fields, some not
by McA (Priest) on Aug 20, 2013 at 09:09 UTC

What is the rule to determine that a row is a continued line?

McA
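
Judging only by the sample data, one possible rule is that a line starting with whitespace continues the previous field. Under that assumption (it is not confirmed anywhere in the thread), a line-by-line parser is an alternative to a single regex. The sub name parse_record is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into a hash of field name => value.
# Assumption: a line starting with whitespace continues the previous field.
sub parse_record {
    my ($record) = @_;
    my (%fields, $current);
    for my $line (split /\n/, $record) {
        if ($line =~ /^(\w+):\s*(.*)$/) {
            # A new "name: value" line starts a field.
            $current = $1;
            $fields{$current} = $2;
        }
        elsif (defined $current && $line =~ /^\s+(\S.*)$/) {
            # An indented line is appended to the field being read.
            $fields{$current} .= "\n$1";
        }
        # Anything else (blank lines, a "record 1:" header) is ignored.
    }
    return \%fields;
}

my $fields = parse_record(<<'END');
field1: data 1 tuesday
field2: data 2 tuesday
        tuesday details line 1
        tuesday details line 2
field3: data 3 tuesday
END

print "$_ => [$fields->{$_}]\n" for sort keys %$fields;
```

If the real dumps mark continuation lines differently, only the elsif condition needs to change.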
