
regex : get a paragraph, not just line

by crusty_collins (Friar)
on Mar 18, 2015 at 17:52 UTC

crusty_collins has asked for the wisdom of the Perl Monks concerning the following question:

All-knowing monks,

I have a regex problem that I can't figure out.

I need to get the following paragraph out of a file.

First I get the account number, then I get the paragraph.

Question: How do I get the full paragraph, skipping the CR and LF (\r\n)?

Code:

    use strict;
    use warnings;
    use Data::Dumper;

    my $self = {};
    while (my $line = <DATA>) {
        if( $line =~ /ACCOUNT\s+NUMBER\s+(\d+)/ ){
            $self->{ACCOUNTNUMBER} = $1;
            print "$1 \n";
        }elsif ($line =~ /^(YOUR[\s\w]+)(\d+\.\d+)(.*)/sm ){
            print "$1 $2 $3 \n";
        }
    }
    __DATA__
    ACCOUNT NUMBER 000111111111
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.

Output:

    000111111111
    YOUR LOAN PAYMENT FOR THE COMING YEAR WILL BE 0 0.00

Replies are listed 'Best First'.
Re: regex : get a paragraph, not just line
by AnomalousMonk (Archbishop) on Mar 18, 2015 at 18:44 UTC

    If your file is small enough (no more than several hundred meg and guaranteed not to grow beyond that size), slurping and searching the entire file may be useful. Code:

    use warnings;
    use strict;

    my $doc = do { local $/; <DATA>; };  # slurp entire file
    # print "<<<$doc>>> \n";  # FOR DEBUG

    my $rx_account = qr{ (?<! \d) \d{12} (?! \d) }xms;
    my $rx_amount  = qr{ (?<! \d) \d+ [.] \d\d (?! \d) }xms;

    my ($two_lines, $account, $amount) = $doc =~ m{
        ( ^ ACCOUNT \s+ NUMBER \s+ ($rx_account)
          .*?
          ^ YOUR \s+ LOAN \s+ PAYMENT .*? ($rx_amount)
        )
        }xms;

    print "two lines: <<<$two_lines>>> \n\n";
    print "account '$account' amount '$amount' \n";

    __DATA__
    ACCOUNT NUMBER 000111111111
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.

    Output:

    c:\@Work\Perl\monks\crusty_collins>perl match_multiline_1.pl
    two lines: <<<ACCOUNT NUMBER 000111111111
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00>>>

    account '000111111111' amount '00.00'

    Updates:

    1. See also File::Slurp and friends.
    2. What you have shown in the OP is not what I would think of as a "paragraph" match, but rather a multi-line match. It's difficult to see from the given data just what a paragraph would be.


    Give a man a fish:  <%-(-(-(-<

Re: regex : get a paragraph, not just line
by jeffa (Bishop) on Mar 18, 2015 at 18:45 UTC

    You only supplied 1 record in your example so it is impossible to see how the records are separated. However, this should be enough to create the hash with the keys and values you desire:

    use strict;
    use warnings;
    use Data::Dumper;

    my $data = do {local $/; <DATA>};
    my %records = $data =~ m{^ACCOUNT NUMBER\s+(\d+)\s+(.*?\d\d/\d\d/\d\d\.)}msg;
    print Dumper \%records;

    __DATA__
    ACCOUNT NUMBER 000111111111
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.
    ACCOUNT NUMBER 000222222222
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.
    ACCOUNT NUMBER 000333333333
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.

    Output:

    
    $VAR1 = {
              '000111111111' => 'YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT. 
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.',
              '000222222222' => 'YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT. 
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.',
              '000333333333' => 'YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT. 
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.'
            };
    

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: regex : get a paragraph, not just line
by sauoq (Abbot) on Mar 18, 2015 at 19:50 UTC
    One strategy, if you don't want to slurp the whole file and each record is delimited by that phrase "ACCOUNT NUMBER" is something like this:

    use strict;
    use warnings;
    use Data::Dumper;

    $/ = 'ACCOUNT NUMBER';

    my %records;
    while (<DATA>) {
        chomp;
        my ($acct) = /^\s*(\d+)/;
        next unless defined($acct);
        $records{$acct} = $/ . $_;
    }

    print Dumper \%records;

    __DATA__
    ACCOUNT NUMBER 000111111111
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.
    ACCOUNT NUMBER 000222222222
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.
    ACCOUNT NUMBER 000333333333
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.

    Output

    $VAR1 = {
              '000222222222' => 'ACCOUNT NUMBER 000222222222
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.
    ',
              '000111111111' => 'ACCOUNT NUMBER 000111111111
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.
    ',
              '000333333333' => 'ACCOUNT NUMBER 000333333333
    YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
       OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST,         00.00
       WILL GO ESCROW ACCOUNT, AND .00  WILL BE FOR
       DISCRETIONARY ITEMS THAT YOU
       CHOSE TO BE INCLUDED WITH YOUR LOAN PAYMENT.
    THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.
    '
            };

    -sauoq
    "My two cents aren't worth a dime.";
Re: regex : get a paragraph, not just line
by atcroft (Abbot) on Mar 18, 2015 at 18:07 UTC

    Untested, but perhaps something such as:

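    LOOP:
    while (my $line = <DATA>) {
        my $line2 = undef;
        if ( $line =~ m/^\s+/ ) {
            # No "my" on this inner read: a second declaration would shadow
            # the outer $line2 and leave it undefined for the check below.
            while ($line2 = <DATA>) {
                if ( $line2 =~ m/^\s+/ ) {
                    $line .= $line2;
                }
                else {
                    last;
                }
            }
        }
        # ... remaining processing here ...
        if (defined $line2) {
            # $line2 holds a line that was read ahead but not yet processed.
            $line = $line2;
            redo LOOP;
        }
    }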

    Hope that helps.

Re: regex : get a paragraph, not just line
by hdb (Monsignor) on Mar 18, 2015 at 20:21 UTC

    Another option is to set $/ to the empty string, which enables paragraph mode: each read returns a full paragraph, up to the next empty line.
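
A minimal sketch of paragraph mode, assuming the records are separated by blank lines (the original post does not say whether they are). The sample text and the helper name read_paragraphs are made up for illustration. One caveat: if a file with \r\n line endings is read without newline translation, the "blank" lines still contain a \r and are not treated as empty.

```perl
use strict;
use warnings;

# Hypothetical sample input: two records separated by a blank line.
my $text = <<'END';
ACCOUNT NUMBER 000111111111
YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
   OF WHICH 00.00 WILL BE FOR PRINCIPAL AND INTEREST.

ACCOUNT NUMBER 000222222222
YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00
   OF WHICH 00.00 WILL BE FOR PRINCIPAL AND INTEREST.
END

# read_paragraphs is a hypothetical helper, not part of any module.
# It returns one string per paragraph read from the given filehandle.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = '';        # paragraph mode: blank line(s) end a record
    my @paragraphs;
    while (my $para = <$fh>) {
        chomp $para;      # in paragraph mode, chomp strips all trailing newlines
        push @paragraphs, $para;
    }
    return @paragraphs;
}

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @paragraphs = read_paragraphs($fh);
close $fh;

for my $para (@paragraphs) {
    my ($account) = $para =~ /ACCOUNT\s+NUMBER\s+(\d+)/;
    print "account $account:\n$para\n\n";
}
```

Using local inside the helper restores $/ when the sub returns, so other reads in the program keep their normal line-by-line behaviour.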

      Which is great, if the text he wants is actually separated by empty lines. He didn't say it was. But then, he didn't say it wasn't either.

      -sauoq
      "My two cents aren't worth a dime.";
Re: regex : get a paragraph, not just line
by crusty_collins (Friar) on Mar 19, 2015 at 13:54 UTC
    Thank you so much for the excellent suggestions.

    I decided to use jeffa's suggestion because of its simplicity.

    my %records = $data =~ m{^ACCOUNT NUMBER\s+(\d+)\s+(.*?\d\d/\d\d/\d\d\.)}msg;
    Just makes sense to me. Thanks Again!
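
The captured paragraphs still contain the line breaks from the file. A minimal sketch of one way to drop the \r\n afterwards, assuming it is acceptable to collapse every run of whitespace into a single space. The pattern is jeffa's, written here with an escaped final period (\.); the sample data and the helper name parse_records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data with CRLF line endings.
my $data = join "\r\n",
    'ACCOUNT NUMBER 000111111111',
    'YOUR LOAN PAYMENT FOR THE YEAR WILL BE 00.00',
    '   OF WHICH 00.00  WILL BE FOR PRINCIPAL AND INTEREST.',
    'THE EFFECTIVE DATE OF YOUR NEW SCHEDULED PAYMENT IS 00/00/00.',
    '';

# parse_records is a hypothetical helper, not part of any module.
# It returns a hash of account number => paragraph on a single line.
sub parse_records {
    my ($text) = @_;

    # With /g in list context the match returns (account, paragraph) pairs,
    # which become the keys and values of the hash.
    my %records = $text =~ m{^ACCOUNT NUMBER\s+(\d+)\s+(.*?\d\d/\d\d/\d\d\.)}msg;

    # values() yields aliases, so the substitution edits the hash in place.
    # \s matches \r and \n as well as spaces and tabs.
    s/\s+/ /g for values %records;

    return %records;
}

my %records = parse_records($data);
print "$_ => $records{$_}\n" for sort keys %records;
```

Collapsing all whitespace also removes the column alignment inside each line; if that spacing matters, substituting only /\r?\n\s*/ with a single space would be the narrower change.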
