
Cropping the output of the pattern matcher

by jerrygarciuh (Curate)
on Sep 23, 2001 at 23:21 UTC ( [id://114200] : perlquestion )

jerrygarciuh has asked for the wisdom of the Perl Monks concerning the following question:

I have been playing with the code from Pattern Matching Examples and I got this working
(My most Obfu line yet :} ).
#!/usr/local/bin/perl -w
use strict;

#matches '>' then 0 or more spaces and then
#0 or more alphanumeric characters followed by '<' and 0 or more spaces
while(<>){
    print if m/>\s+\w*</|/>\w*</|/>\w*</s+//;
}

Anyway, the idea is that if I can detect text and whitespace delimited by '><' then I could pick up just the text from an HTML table and lose the tags. So the script finds the pattern OK, but it gives me the whole line in which the pattern occurs. How do I crop the data based on my delimiters '><' and return only what was between them?

Replies are listed 'Best First'.
Re: Cropping the output of the pattern matcher
by wog (Curate) on Sep 23, 2001 at 23:44 UTC
    For parsing HTML you are best off avoiding a regex. The reason for this is that HTML is not easy to parse, for example:

    <!-- > A really funky image. -->
    <img src="light.gif" alt=">>LIGHT<<" />
    <!-- was: <img src="light.jpg" alt="<light>" /> -->
    This is some text.

    Because > and < can appear in places other than as the delimiters of HTML tags, HTML parsing is probably best left to HTML::TokeParser or HTML::Parser. For your case you might also want to look at HTML::TableExtract.

    If you want to use your pattern, you can capture text using parentheses, which will place the captured text into the $<digit> variables, or into the result of the match in list context.

    Note that your regex parses very differently from how you think it does. Here is the output of -MO=Deparse on it, modified to use m// instead of // so regexes stand out:

    m/>\s+\w*</ | m/>\w*</ | m/>\w*</s + m//

    I doubt this is the way you think it parses.

    However, besides the fact that it does not compile with those delimiters, your regex needs work to match the way you document it as matching. A straightforward translation of your specification would be:

    if (/>(\s*[[:alnum:]]*)</) {
        my $matched = $1;
        # ...
    } else {
        # didn't match
    }

    (Note that \w does not match just alphanumerics (it includes _) so I did not use it there. I also suspect you defined what you want to match incorrectly. update: I also excluded the 0 or more spaces after the "<" because it will always find at least 0 spaces.)
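A minimal sketch of the \w point above, assuming plain ASCII input. The two sample strings are made up for illustration and are not from the original post:

```perl
use strict;
use warnings;

# Hypothetical sample strings: one with an underscore, one without.
my @samples = ('>foo_bar<', '>foobar<');

my (@word_hits, @alnum_hits);
for my $s (@samples) {
    # \w accepts letters, digits and the underscore
    push @word_hits, $1 if $s =~ />(\w*)</;
    # [[:alnum:]] accepts letters and digits only
    push @alnum_hits, $1 if $s =~ />([[:alnum:]]*)</;
}

print "\\w matched:          @word_hits\n";
print "[[:alnum:]] matched: @alnum_hits\n";
```

With these samples the \w pattern should capture both strings, while the [[:alnum:]] pattern should reject the one containing the underscore.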

    (update: minor rephrasing to make things make more sense.)

(jeffa) Re: Cropping the output of the pattern matcher
by jeffa (Bishop) on Sep 23, 2001 at 23:47 UTC
    My first response is HTML::TableExtract, but that's too easy. ;)

    Looks like some of your slashes are going the wrong way... but anyhoo - if you want to 'crop' the data you need to do a little syntactical trick:

    while(<>) {
        if (my ($matched) = $_ =~ />([^<]+)</) {
            print "'$matched': on line $.\n";
        }
    }

    Instead of matching something so specific (\w and \s), match everything BUT the '<' that opens the next tag: [^<]+
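A line of a table usually holds more than one cell, so here is a rough sketch that extends the same pattern with the /g modifier to collect every capture from a line at once. The /g and the sample row are additions for illustration, not part of the reply above, and the earlier caveats about parsing HTML with a regex still apply:

```perl
use strict;
use warnings;

# Hypothetical one-line table row, made up for illustration.
my $line = '<tr><td>apple</td><td>pear</td></tr>';

# In list context with /g, the match returns every captured group.
# Adjacent tags such as '><' are skipped because [^<]+ needs at least
# one character that is not '<'.
my @cells = $line =~ />([^<]+)</g;

print "$_\n" for @cells;
```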


Re: Cropping the output of the pattern matcher
by trantor (Chaplain) on Sep 23, 2001 at 23:43 UTC

    Parsing HTML may be tougher than it seems, so you might consider using HTML::Parser.

    However, considering your question:

    How do I crop the data based on my delimiters '><' and return only what was between them?

    A simple split should suffice, such as:

    my @bits_and_pieces = split /></, $line;


      I will definitely take it under advisement and look into the modules others have developed for this porpoise,
      but in the spirit of learning please advise me on using split. I am now trying your code Trantor like so:
      #!/usr/local/bin/perl -w
      use strict;

      my $line = "a whole lotta shakin >pattern here< goin on";
      my @bits_and_pieces = split /></, $line;
      print @bits_and_pieces;

      Done this way the output is the same as $line was originally:
      a whole lotta shakin >pattern here< goin on
      But if I use
      my $line = "a whole lotta shakin ><pattern here>< goin on";
      Then my output is: a whole lotta shakin pattern here goin on
      The only thing that happened was my delimiters were deleted.
      Would some kind soul explain why split did that?
      I read perlfunc:split but I'm afraid I am no wiser.
        It looks as if the delimiters were deleted, but print is being a little tricky here.

        First, you had the string:
        my $line = "a whole lotta shakin >pattern here< goin on";

        and you did
        my @bits_and_pieces = split /></, $line;

        What did that do?
        This code splits the string, using '><' as the delimiter. Because '><' does not occur anywhere in the string, @bits_and_pieces ends up with only one element:
        'a whole lotta shakin >pattern here< goin on'
        (the whole string)

        On the other hand, you then did this:
        my $line = "a whole lotta shakin ><pattern here>< goin on";
        my @bits_and_pieces = split /></, $line;

        This time perl finds '><' in the string and splits it at each occurrence.
        So @bits_and_pieces is now
        ('a whole lotta shakin ','pattern here',' goin on')

        If you print @bits_and_pieces with
        print @bits_and_pieces;
        the output is
        'a whole lotta shakin pattern here goin on'
        because print writes the elements back to back with nothing between them, so it looks as if the delimiters '><' were simply removed from the string.

        Hope this helps
        The only thing I might add to hopes' fine reply to jerrygarciuh's question about split is that changing the final print might make the point clearer:

        print "number of array elements==". scalar(@bits_and_pieces)."\n";
        print join("\n",@bits_and_pieces);
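To pull the split discussion together, here is a small sketch that runs both strings from the posts above through split and prints the pieces with a visible separator. The variable names and the '|' separator are made up for illustration. The third call uses a capturing group in the pattern, which, as perlfunc documents for split, makes the delimiters themselves come back as list elements:

```perl
use strict;
use warnings;

my $no_delim   = "a whole lotta shakin >pattern here< goin on";
my $with_delim = "a whole lotta shakin ><pattern here>< goin on";

my @unsplit = split /></, $no_delim;      # '><' never occurs: one element
my @pieces  = split /></, $with_delim;    # delimiters are dropped
my @kept    = split /(><)/, $with_delim;  # capturing group keeps them

print scalar(@unsplit), " element(s)\n";
print join('|', @pieces), "\n";
print join('|', @kept), "\n";
```

Note that none of these hands back only the text between the delimiters; for that, a capturing match like the ones earlier in the thread is the closer fit.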
Re: Cropping the output of the pattern matcher
by tachyon (Chancellor) on Sep 24, 2001 at 07:16 UTC

    Playing with regexps is fun, but to do this reliably you want to use HTML::Parser, and probably the HTML::TokeParser interface, as it is easier to understand than the raw Parser interface. Here is an example that extracts all the H tags and the text between them from a document. If you just want the text it should be obvious how to get it. If you want to extract from a table, just substitute td for the H tag list. There is an excellent tutorial on TokeParser in Tutorials.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    my $dir = "c:/windows/desktop/book/work/";
    my $file = $dir."work_introduction.htm";
    my $p = HTML::TokeParser->new($file) || die "Can't open $file: $!";

    while (my $token = $p->get_tag(qw(h1 h2 h3 h4))) {
        my $open = $token->[0];
        my $close = '/'.$open;
        my $text = $p->get_trimmed_text($close);
        print "<$open>$text<$close>\n";
    }