PerlMonks  

parsing with regex

by 2501 (Pilgrim)
on Nov 16, 2001 at 01:37 UTC ( [id://125686]=perlquestion )

2501 has asked for the wisdom of the Perl Monks concerning the following question:

I have a somewhat easy regex question, but I am very weak with regexes, so I am having a hard time juggling the things I have to get done while researching what the heck I am doing wrong. I was hoping I could get some help.
Let's say I bring a chunk of a web page into a script. The HTML basically looks like this:
<HR>
1 is good<BR>
useless data<BR>
useless data<BR>
useless data<BR>
useless data<BR>
<HR>
2 is not good <BR>
useless data<BR>
useless data<BR>
useless data<BR>
useless data<BR>
<HR>
3 is good<BR>
useless data<BR>
useless data<BR>
useless data<BR>
useless data<BR>
<HR>
4 is not good <BR>
useless data<BR>
useless data<BR>
useless data<BR>
useless data<BR>
<HR>
What I need to do is grab anything between the horizontal breaks where it says "is good" and load all occurrences of that data into some sort of data structure.
I can accomplish this task in a really ugly manner, but I figured it was about time to see the right way to do it.

Once again, I thank all of you for your time and patience.
2501

Replies are listed 'Best First'.
Re: parsing with regex
by dfog (Scribe) on Nov 16, 2001 at 03:51 UTC
    If everything is in a single variable, you could do it in one line, like this:
    #!perl
    my $Data=<<HTMLend;
    <HR>
    1 is good<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    2 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    3 is good<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    4 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    HTMLend
    @Results = grep {$_ =~ /is good/i} split (/<HR>/, $Data);
    $" = "\n\n";
    print "@Results";
    Dave
Re: parsing with regex
by Sifmole (Chaplain) on Nov 16, 2001 at 02:46 UTC
    #!/usr/bin/perl -w
    use strict;

    my %foo;
    while (<DATA>) {
        $foo{$1}++ if (/(\d+) is good/)
    }
    print "$_ :: $foo{$_} \n" foreach (keys %foo);

    __DATA__
    <HR>
    1 is good<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    2 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    3 is good<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    4 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
      Sorry, I probably wasn't clear enough.
      I would need everything between the horizontal breaks which would be:
      1 is good<BR>
      data unique to 1<BR>
      data unique to 1<BR>
      data unique to 1<BR>
      data unique to 1<BR>
      3 is good<BR>
      data unique to 3<BR>
      data unique to 3<BR>
      data unique to 3<BR>
      data unique to 3<BR>
      Thank you,
      2501
        Okay how about this?
        #!/usr/bin/perl -w
        use strict;

        my @foo;
        $/="";
        $_ = <DATA>;
        while (s/(\d+ is good.*?)<HR>//s) {
            push @foo, $1;
        }
        print $_, "\n--------------\n" foreach (@foo);

        __DATA__
        <HR>
        1 is good<BR>
        useless data<BR>
        useless data<BR>
        useless data<BR>
        useless data<BR>
        <HR>
        2 is not good <BR>
        useless data<BR>
        useless data<BR>
        useless data<BR>
        useless data<BR>
        <HR>
        3 is good<BR>
        useless data<BR>
        useless data<BR>
        useless data<BR>
        useless data<BR>
        <HR>
        4 is not good <BR>
        useless data<BR>
        useless data<BR>
        useless data<BR>
        useless data<BR>
        <HR>
Re: parsing with regex
by mitd (Curate) on Nov 16, 2001 at 09:02 UTC
    Well I'll have a go:

    #!/bin/perl -w
    use strict;

    # slurp it up
    $/='';
    my $slurp = <DATA>;

    # one nice string might as well split it
    # adding a little whitespace gobble and case protection
    my @stuff = split(/\s*<\s*[Hh][Rr]\s*>\s*/,$slurp);

    # and spit it out
    foreach (@stuff) {
        print $_,"\n";
    }

    __DATA__
    <HR>
    1 is good<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    2 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    3 is good<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    4 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>

    mitd-Made in the Dark
    'Interactive! Paper tape is interactive!
    If you don't believe me I can show you my paper cut scars!'
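
A minimal sketch, assuming the page chunk is already in a scalar, that pairs the whitespace-tolerant split above with a filter for "is good". The sub name good_sections and the sample data are hypothetical, made up for illustration; this is one possible way to do it, not a definitive answer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every chunk found
# between <HR> tags that contains the phrase "is good".
sub good_sections {
    my ($html) = @_;

    # Split on <HR> in either case, swallowing surrounding whitespace.
    my @chunks = split /\s*<\s*hr\s*>\s*/i, $html;

    # Keep only chunks with the phrase; the empty leading chunk
    # produced by the first <HR> is dropped by the same filter.
    return grep { /\bis good\b/i } @chunks;
}

# Made-up sample in the same shape as the question's data.
my $sample = <<'HTML';
<HR>
1 is good<BR>
data unique to 1<BR>
<hr>
2 is not good <BR>
data unique to 2<BR>
<HR>
3 is good<BR>
data unique to 3<BR>
<HR>
HTML

my @good = good_sections($sample);
print "$_\n--------------\n" for @good;
```

Because "is not good" never contains the exact phrase "is good", a plain match is enough to separate the two kinds of chunk in data shaped like this.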

Re: parsing with regex
by YuckFoo (Abbot) on Nov 16, 2001 at 03:39 UTC
    I'm sure someone will post an efficient regex, but in the meantime you can try this. It's still a bit ugly: the lines are joined, then split on the HR tags.

    YuckFoo

    #!/usr/bin/perl
    use strict;

    my ($line, @keep);

    for $line ((split(/<HR>\s+/s, join('', (<DATA>))))) {
        if ($line =~ m{\d+\s+is\s+good}) {
            push (@keep, $line);
        }
    }

    for $line (@keep) {
        print "$line\n";
    }

    __DATA__
    <HR>
    1 is good<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    2 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    3 is good<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
    4 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>
Re: parsing with regex
by mkmcconn (Chaplain) on Nov 17, 2001 at 01:02 UTC

    Everyone is so much quicker than me. Oh well, here's my attempt for what it's worth, constructed to handle some (by no means all) HTML-legal variations in the text.

    #!/usr/bin/perl -w
    use strict;

    $/ = '';
    my %h;
    while (<DATA>){
        while ( s/((\d+) is good.+?)<(?:hr|HR)>//s ){
            my $good = $1;
            my $key = $2;
            $good =~ s/\n?\s?<(?:BR|br).?.?>\n?/|/g;
            my @pot = split /\|/, $good;
            shift @pot;
            $h{$key} = [@pot];
        }
    }
    use Data::Dumper;
    print Data::Dumper->Dump([\%h],[qw(*h)]);

    __DATA__
    <HR>
    1 is good<BR>
    useless data<BR>useless data<BR>
    useless data <BR>useless data<BR>
    <hr>
    2 is good<BR>
    useless data<br>
    useless data<BR>
    useless data<br>
    useless data<BR>
    <hr>
    3 is not good <BR>
    useless data <br />useless data<br />useless data<BR>
    useless data<BR>
    <HR>
    4 is good<BR>
    useless data<BR>useless data<BR>
    useless data<br>useless data<BR>
    <HR>
    5 is not good <BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    useless data<BR>
    <HR>

    By the way, you asked your question very well, complete with a good data example. It's appreciated.
    (better Data::Dumper, thanks to sacked and tilly).
    mkmcconn
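
One caveat on the snippets in this thread that set $/ to an empty string: that selects paragraph mode, where a record ends at the next blank line, so a single read returns the whole input only when the input contains no blank lines (as is the case for the sample data here). Undefining $/ reads everything regardless. The sketch below shows the difference; slurp_handle and the sample text are hypothetical, invented for illustration and not taken from any reply.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything remaining on a filehandle.
sub slurp_handle {
    my ($fh) = @_;
    local $/;                 # undef: no record separator at all
    my $content = <$fh>;
    return defined $content ? $content : '';
}

# Assumed sample input with a blank line in the middle.
my $text = "<HR>\n1 is good<BR>\n\n<HR>\n3 is good<BR>\n<HR>\n";

# Paragraph mode: one read stops after the first blank line.
open my $para_fh, '<', \$text or die "cannot open in-memory handle: $!";
my $first_paragraph = do { local $/ = ''; <$para_fh> };
close $para_fh;

# Slurp mode: one read returns the whole input.
open my $all_fh, '<', \$text or die "cannot open in-memory handle: $!";
my $everything = slurp_handle($all_fh);
close $all_fh;

printf "paragraph mode read %d characters, slurp read %d\n",
    length($first_paragraph), length($everything);
```

If the fetched page might contain blank lines, the slurp form is the safer choice before splitting on the HR tags.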
