PerlMonks  

Regex to fix up records, some multiline fields, some not

by butchie3980 (Acolyte)
on Aug 20, 2013 at 08:59 UTC ( [id://1050152]=perlquestion )

butchie3980 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I have a batch of data dumps that are shipped to me from another site, and I want to convert the data to XML. I'm using regexes to grab fields, most of which are on one line, but sometimes a record will have a multiline field. I'm struggling to get this working. Each record is pulled (using Tie::File) into a multi-line scalar called $currentrecord. Below is an example of the code I've tried, with sample data:

if ($currentrecord =~ m/^field2(.*)\nfield3/mi) { $field2data = $1; }
Here are two examples of the data encountered:

record 1:
field1: data 1 monday
field2: data 2 monday
field3: data 3 monday

record 2:
field1: data 1 tuesday
field2: data 2 tuesday
tuesday details line 1
tuesday details line 2
field3: data 3 tuesday

The above approach isn't working when field2 has multiple lines. How can I catch both record styles?

UPDATE
OK, I tested all of the responses to this posting, and they were all effective. Thank you so much for your help.

Replies are listed 'Best First'.
Re: Regex to fix up records, some multiline fields, some not
by Athanasius (Archbishop) on Aug 20, 2013 at 09:10 UTC

    You need to add an /s modifier to the regex:

    #! perl
    use strict;
    use warnings;

    our $/ = '';

    while (my $currentrecord = <DATA>) {
        if ($currentrecord =~ m/^field2(.*)\nfield3/msi) {
            my $field2data = $1;
            print "Found \$field2data = $field2data\n";
        }
    }

    __DATA__
    record 1:
    field1: data 1 monday
    field2: data 2 monday
    field3: data 3 monday

    record 2:
    field1: data 1 tuesday
    field2: data 2 tuesday
    tuesday details line 1
    tuesday details line 2
    field3: data 3 tuesday

    Output:

    19:07 >perl 692_SoPW.pl
    Found $field2data = : data 2 monday
    Found $field2data = : data 2 tuesday
    tuesday details line 1
    tuesday details line 2
    19:08 >

    See perlre#Modifiers.

    Hope that helps,

    Athanasius <°(((>< contra mundum
    Iustus alius egestas vitae, eros Piratica,
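To see why the /s modifier matters here, the following is a minimal sketch that runs the same pattern with and without it. The helper name field2_data and the shortened sample strings are made up for illustration; only the pattern itself comes from the thread. Without /s, the dot stops at the first newline, so a multiline field2 can never reach the following field3 line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text captured between "field2" and the
# following "field3" line, or undef when the pattern does not match.
sub field2_data {
    my ($record, $dot_matches_newline) = @_;
    my $re = $dot_matches_newline
        ? qr/^field2(.*)\nfield3/msi    # with /s: . also matches "\n"
        : qr/^field2(.*)\nfield3/mi;    # without /s: . stops at "\n"
    return $record =~ $re ? $1 : undef;
}

# Shortened stand-ins for the two record styles in the question.
my $single = "field1: a\nfield2: b\nfield3: c\n";
my $multi  = "field1: a\nfield2: b\nmore b\nfield3: c\n";

for my $record ($single, $multi) {
    for my $s_flag (0, 1) {
        my $got = field2_data($record, $s_flag);
        printf "s=%d => %s\n", $s_flag, defined $got ? 'matched' : 'no match';
    }
}
```

One caveat with this pattern: (.*) is greedy, so under /s it runs to the last "\nfield3" in the string. That is harmless when each record holds a single field3, which is the case in the sample data.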

Re: Regex to fix up records, some multiline fields, some not
by Eily (Monsignor) on Aug 20, 2013 at 12:37 UTC

    Instead of having one regex for each field, you can use the /g modifier to step from one field to the next, and use the (?=EXPR) lookahead syntax to check that what follows your field is another field (or the end of the record) rather than more data:

    use Data::Dumper;

    my $regex = qr/
        ^field(\d+):        # find a line starting with 'field' and capture its number
        (.*?)\n?            # find the smallest string before the next
        (?=^field\d+:|\z)   # line starting with 'field' or end of record.
                            # Rewind just before that point after the match.
    /msx; # ^ matches beginning of line, . matches \n, and spaces and comments are ignored in the regex

    my %result;
    my $count = 1;
    { # block to limit the scope of local
        local $/ = ""; # records are separated by empty lines
        while (<DATA>) {
            my %hash;
            while (/$regex/g) {
                $hash{"field$1"} = $2;
            }
            $result{"record ".$count++} = \%hash;
        }
    }
    print Dumper \%result;

    __DATA__
    field1: data 1 monday
    field2: data 2 monday
    field3: data 3 monday

    field1: data 1 tuesday
    field2: data 2 tuesday
    tuesday details line 1
    tuesday details line 2
    field3: data 3 tuesday
    $VAR1 = {
              'record 1' => {
                              'field1' => ' data 1 monday',
                              'field2' => ' data 2 monday',
                              'field3' => ' data 3 monday
    '
                            },
              'record 2' => {
                              'field1' => ' data 1 tuesday',
                              'field2' => ' data 2 tuesday
    tuesday details line 1
    tuesday details line 2',
                              'field3' => ' data 3 tuesday'
                            }
            };
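The values captured above keep the space after the colon, and the last field of a record can keep a trailing newline. If that whitespace is not significant (an assumption, since the thread does not say), a variant of the same lookahead idea can trim it during the match. The sketch below is not Eily's code: the sub name parse_record is hypothetical, the field name is captured whole, and \n? is swapped for \s* so trailing whitespace stays out of the capture.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record (a multi-line string) into a hash
# of field name => value, trimming whitespace around each value.
sub parse_record {
    my ($record) = @_;
    my %fields;
    while ($record =~ /
            ^(field\d+):      # field name at the start of a line
            [ \t]*            # skip blanks after the colon
            (.*?)             # shortest possible value...
            \s*               # ...followed by optional trailing whitespace
            (?=^field\d+:|\z) # up to the next field line or end of record
        /msxg) {
        $fields{$1} = $2;
    }
    return \%fields;
}

my $record = join "\n",
    'field1: data 1 tuesday',
    'field2: data 2 tuesday',
    'tuesday details line 1',
    'tuesday details line 2',
    'field3: data 3 tuesday',
    '', '';

my $fields = parse_record($record);
print "[$_] => [$fields->{$_}]\n" for sort keys %$fields;
```

Internal newlines of a multiline field are preserved; only the whitespace at either end of each value is dropped.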

Re: Regex to fix up records, some multiline fields, some not
by Utilitarian (Vicar) on Aug 20, 2013 at 09:13 UTC
    Try the unfold method of Text::LineFold

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
Re: Regex to fix up records, some multiline fields, some not
by McA (Priest) on Aug 20, 2013 at 09:09 UTC

    What is the rule to determine that a row is a continued line?

    McA
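The thread never states the continuation rule explicitly. As a sketch only, suppose the rule is: a line starting with "fieldN:" opens a new field, and any other non-blank line continues the field opened most recently. Under that assumption the records can also be handled line by line without a multiline regex. The sub name parse_lines is hypothetical, and the lines are assumed to arrive without their line endings.

```perl
use strict;
use warnings;

# Assumed rule (not confirmed in the thread): a line that starts with
# "fieldN:" opens a new field; any other non-blank line continues the
# field opened most recently. Lines seen before the first field
# (such as "record 2:") are ignored.
sub parse_lines {
    my (@lines) = @_;
    my (%fields, $current);
    for my $line (@lines) {
        if ($line =~ /^(field\d+):\s*(.*)$/i) {
            $current = lc $1;                  # normalise the field name
            $fields{$current} = $2;
        }
        elsif (defined $current && $line =~ /\S/) {
            $fields{$current} .= "\n$line";    # continuation line
        }
    }
    return \%fields;
}

my $fields = parse_lines(
    'record 2:',
    'field1: data 1 tuesday',
    'field2: data 2 tuesday',
    'tuesday details line 1',
    'tuesday details line 2',
    'field3: data 3 tuesday',
);
print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

If continuation lines can themselves begin with something that looks like "fieldN:", this rule would misread them, so it is worth checking against the real dumps first.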
