PerlMonks  

Regexes, stitching broken lines, and other fun stuff.

by vxp (Pilgrim)
on Jun 26, 2009 at 14:46 UTC ( [id://775066]=perlquestion )

vxp has asked for the wisdom of the Perl Monks concerning the following question:

I've a file that I need to parse. What I am trying to do, essentially, is this:

    #!/usr/bin/perl
    $file = "spg.txt";
    open(SPG, $file) or die "Couldn't open $file: $!\n";
    while (defined($line = <SPG>)) {
        $line =~ s/\s+/ /g;
        $line =~ s/^\s//g;
        my ($title, $start_date, $start_time, $end_date, $end_time, $status, $prixit) = split(/\s/, $line);
        print "$title $status\n";
    }

That'd be pretty easy, but there are lines in the file that are "broken", so to speak. Take a look at the "spg-risk-ln_cdo_leg_synthetic" line below, for instance:

    [vxp@vxp ~]$ cat spg.txt
    spg-risk-ln-box 06/24/2009 21:14 06/24/2009 22:01 IN 39693696/0
    spg-risk-Fixed_Sterling 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
    spg-risk-aaeml 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_abs_credit 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_abs_fixed 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_abs_fixed2 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_abs_flow 06/24/2009 21:14 06/24/2009 21:14 IN 39693696/1
    spg-risk-ln_aol_abs 06/24/2009 21:14 06/24/2009 21:16 IN 39693696/1
    spg-risk-ln_apcms 06/24/2009 21:14 06/24/2009 21:14 IN 39693696/1
    spg-risk-ln_bouwfonds 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_caprub 06/24/2009 21:43 06/24/2009 21:45 IN 39693696/2
    spg-risk-ln_capusd 06/24/2009 21:14 06/24/2009 21:16 IN 39693696/1
    spg-risk-ln_cdo 06/24/2009 21:14 06/24/2009 22:00 IN 39693696/0
    spg-risk-ln_cdo_leg_synthetic
     06/24/2009 21:14 06/24/2009 21:18 IN 39693696/0
    spg-risk-ln_cdo_legacy 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_cmbs 06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
    spg-risk-ln_cmbx 06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_credit_fixed 06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_credit_frn 06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_euresi 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
    spg-risk-ln_fonspa 06/24/2009 21:15 06/24/2009 21:21 IN 39693696/1
    spg-risk-ln_hyloans 06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
    spg-risk-ln_ni 06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
    spg-risk-ln_resid_rmbs 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
    spg-risk-ln_rmbs 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
    spg-risk-ln_swaps 06/24/2009 21:15 06/24/2009 21:22 IN 39693696/1
    spg-risk-ln_synresi 06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
    spg-risk-ln_synthetics 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
    spg-risk-ln_trefs 06/24/2009 21:15 06/24/2009 21:20 IN 39693696/1
    spg-risk-ln_ukpurch 06/24/2009 21:15 06/24/2009 21:19 IN 39693696/0
    spg-risk-ln_warehouse 06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
    spg-risk-ln_abs_frn 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
    spg-risk-lnliq 06/24/2009 21:14 06/24/2009 21:18 IN 39693696/1
    [vxp@vxp ~]$

Any ideas on what's needed to "fix" the file?

I can't just write a regex that matches the line starting with "spg-risk-ln_cdo_leg_synthetic" and "stitches" it to the next line. (That would involve something like checking whether there is any data after the first column; if there isn't, store the first column as a hash key, then check the next line, and if it starts with a space, assign its contents as that key's value.) That can be done technically, but it won't work here because I have thousands and thousands of these little files to parse. I can't possibly find all of these lines and write thousands and thousands of regexes. That's why I'm asking people here for a possibly "universal", so to speak, solution to this problem.

Thanks in advance! :)

Replies are listed 'Best First'.
Re: Regexes, stitching broken lines, and other fun stuff.
by ikegami (Patriarch) on Jun 26, 2009 at 14:55 UTC
        while (<SPG>) {
            s/^\s+//;
            s/\s+$//;
            if (!/\s/) {
                $_ .= <SPG>;
                redo;
            }

            my ($title, $status) = ( split /\s+/ )[0,5];
            print "$title $status\n";
        }
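
One caveat with the loop above: if the file ends with a lone title or a blank line, `<SPG>` returns undef, nothing gets appended, and `redo` never terminates. Below is a minimal sketch of the same stitching idea with an end-of-file guard. The `stitch_records` name, the seven-field check, and the in-memory sample are assumptions made for illustration and are not part of the original reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join records that were split across lines back together.
# Takes a filehandle; returns a list of [title, status] pairs.
sub stitch_records {
    my ($fh) = @_;
    my @records;
    while ( defined( my $line = <$fh> ) ) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        next if $line eq '';    # skip blank lines

        # A trimmed line with no inner whitespace holds only a title:
        # pull in following lines until the record has more fields.
        while ( $line !~ /\s/ ) {
            my $next = <$fh>;
            last unless defined $next;    # end of file: stop trying
            $next =~ s/^\s+//;
            $next =~ s/\s+$//;
            $line .= " $next" if $next ne '';
        }

        my @fields = split ' ', $line;
        if ( @fields < 7 ) {    # assumed record width, taken from the sample
            warn "Skipping incomplete record: $line\n";
            next;
        }
        push @records, [ @fields[ 0, 5 ] ];
    }
    return @records;
}

# Sample input modelled on the question's data (an assumption).
my $sample = <<'END';
spg-risk-ln_cdo 06/24/2009 21:14 06/24/2009 22:00 IN 39693696/0
spg-risk-ln_cdo_leg_synthetic
 06/24/2009 21:14 06/24/2009 21:18 IN 39693696/0
spg-risk-ln_cdo_legacy 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
END

open my $sample_fh, '<', \$sample
    or die "Couldn't open in-memory sample: $!\n";
print "$_->[0] $_->[1]\n" for stitch_records($sample_fh);
close $sample_fh;
```

For a real file, pass a handle opened with `open my $fh, '<', $file` in place of the in-memory one.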
Re: Regexes, stitching broken lines, and other fun stuff.
by suaveant (Parson) on Jun 26, 2009 at 14:56 UTC
    There are many ways to do it... you could use a single regexp
    $file =~ /^\s+spg-risk-\S+\s+.*? IN \d{8}\/\d\s*$/ms;
    That should match a "full line". Or, as you said, use state...

        my $last;
        while(<IN>) {
            if(/^\s+(spg-risk\S+)/) {
                $items{$1} = $_;
                $last = $1;
            }
            elsif($last) {
                $items{$last} .= $_;
            }
            else {
                warn "Extended line found before any items found\n";
            }
        }
    Not tested, but should give you an idea.
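
To make the single-regexp idea concrete: the pattern as posted has no captures and is not applied repeatedly, so here is a hedged sketch that adapts it. It assumes the whole file has already been read into a string, and that the date, time, and last-column shapes match the sample in the question. The `extract_records` name is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan a whole file's text for records, letting \s+ absorb the
# line break that splits a "broken" record in two.
# Returns a list of hashes with title and status keys.
sub extract_records {
    my ($text) = @_;
    my @records;
    while (
        $text =~ m{
            ^\s*(spg-risk-\S+)      # title, optionally indented
            \s+                     # whitespace, possibly a line break
            (?: \d\d/\d\d/\d{4} \s+ \d\d:\d\d \s+ ){2}   # start and end stamps
            (\S+)                   # status
            \s+ \d+/\d+             # last column, digits/digits
            [ \t]* $                # nothing else on the line
        }xmg
      )
    {
        push @records, { title => $1, status => $2 };
    }
    return @records;
}

# Sample text modelled on the question's data (an assumption).
my $text = <<'END';
spg-risk-ln_cdo 06/24/2009 21:14 06/24/2009 22:00 IN 39693696/0
spg-risk-ln_cdo_leg_synthetic
 06/24/2009 21:14 06/24/2009 21:18 IN 39693696/0
END

print "$_->{title} $_->{status}\n" for extract_records($text);
```

Anything that does not fit the pattern is silently skipped, so this approach is best paired with a count check against the number of titles in the file.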

                    - Ant
                    - Some of my best work - (1 2 3)

Re: Regexes, stitching broken lines, and other fun stuff.
by jwkrahn (Abbot) on Jun 26, 2009 at 16:51 UTC

    I would do it like this:

        #!/usr/bin/perl
        use warnings;
        use strict;

        my $file = '775066.dat';

        open my $SPG, '<', $file or die "Couldn't open $file: $!\n";

        while ( <$SPG> ) {
            my ( $title, @data ) = split;
            @data = split ' ', <$SPG> unless @data;
            print "$title $data[-2]\n";
        }
Re: Regexes, stitching broken lines, and other fun stuff.
by oko1 (Deacon) on Jun 26, 2009 at 18:44 UTC

    Another, fairly simple way: since your data is so regular, and since all the elements are delimited by whitespace, you can simply parse it as a list of sets.

        #!/usr/bin/perl -w
        use strict;

        open Spg, '<', 'spg.txt' or die "spg.txt: $!\n";
        my @list = split /\s+/, do { local $/; <Spg> };
        close Spg;

        while (@list){
            # Output 1 and 5...
            print "@list[1,5]\n";
            # ...and throw away the first 7
            splice @list, 0, 7;
        }
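
Which two indices to print depends on how the token list is laid out: `split /\s+/` returns an empty leading field when the text starts with whitespace, whereas `split ' '` discards leading whitespace. Assuming seven tokens per record as in the sample, `split ' '` puts the title at offset 0 and the status at offset 5. The sketch below works under that assumption and adds a check that the token count is a multiple of the record width. The `chunk_records` name and the sample are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Treat the whole file as a flat stream of whitespace-separated
# tokens and cut it into fixed-width records.
# Returns a list of [title, status] pairs.
sub chunk_records {
    my ( $text, $width ) = @_;
    $width = 7 unless defined $width;    # assumed fields per record

    my @tokens = split ' ', $text;       # ' ' also drops leading whitespace
    die "Token count " . scalar(@tokens) . " is not a multiple of $width\n"
        if @tokens % $width;

    my @records;
    while ( my @rec = splice @tokens, 0, $width ) {
        push @records, [ @rec[ 0, 5 ] ];    # title and status
    }
    return @records;
}

# Sample text modelled on the question's data (an assumption).
my $data = <<'END';
spg-risk-ln_cdo 06/24/2009 21:14 06/24/2009 22:00 IN 39693696/0
spg-risk-ln_cdo_leg_synthetic
 06/24/2009 21:14 06/24/2009 21:18 IN 39693696/0
END

print "$_->[0] $_->[1]\n" for chunk_records($data);
```

The multiple-of-seven check only catches files where the total is off; a record with a missing field followed by one with an extra field would still slip through.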

    --
    "Language shapes the way we think, and determines what we can think about."
    -- B. L. Whorf
Re: Regexes, stitching broken lines, and other fun stuff.
by vxp (Pilgrim) on Jun 26, 2009 at 15:06 UTC
    I knew I could count on perl monks! :D

    Thanks guys
