PerlMonks  

Pattern Search on HTML source.

by Anonymous Monk
on Dec 31, 2007 at 16:45 UTC ( [id://659770]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I have an HTML page, and in its source code I have some tags like

<!-- 1st table -->What I want 1<!-- /1st table -->more stuff...<!-- 2st movie -->What I want 2<!-- /2st movie -->...more stuff...<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff
The source could contain anywhere from 1 to more than 10 such tag pairs. I am trying to get only the content between these tags using the code shown here, but I can't get a match. Any help on that?
Here is the code I am using to do the match:
#got the html page already and I saved as .txt
my $output_file = "/temp_my_news.txt";
open(OUTPUT, "$output_file") || print "There is no file here!";
while(<OUTPUT>) {
    if ($_=~/<!-- 1nd table -->(.*?)<!-- \/1nd table -->/g) {
        print $1;
    }else{print "<br>Nothing Here!<br>";}
}
close OUTPUT;

Thanks for the Help!!!!

Replies are listed 'Best First'.
Re: Pattern Search on HTML source.
by jettero (Monsignor) on Dec 31, 2007 at 16:56 UTC

    It's notoriously poor form to try to parse html with regular expressions. You might want to look at HTML::Parser or XML::XPath or one of the million other choices.

    Your approach might otherwise work (some of the time) except that 1nd should be 1st?

    m/\Q<!--\E\s*1st table\s*\Q-->\E(.*?)\Q<!--\E\s*\/1st table\s*\Q-->\E/g

    -Paul

Re: Pattern Search on HTML source.
by sundialsvc4 (Abbot) on Dec 31, 2007 at 17:04 UTC

    I very strongly agree:   do not use regular expressions to parse HTML code. Do not treat the input that you have been given as a “string.”

    Dictum Ne Agas:   do not do a thing already done.

    Use one of the many HTML-parsing tools already mentioned to transform the HTML input into a data-structure consisting of Perl hashes and lists. Then, navigate through the structure. (You can even find assistance, in so-called “XPath expressions,” to do that part without having to write custom code.)

Re: Pattern Search on HTML source.
by ww (Archbishop) on Dec 31, 2007 at 17:09 UTC

    Although you have correctly made your regex non-greedy, it "can't get a match" because 1nd (in the regex) does not match 1st (in the data):

    my $output = "<!-- 1st table -->What I want 1<!-- /1st table -->more stuff...<!-- 2st movie -->What I want 2<!-- /2st movie -->...more stuff...<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff";
    if ($output =~ /<!-- 1st table -->(.*?)<!-- \/1st table -->/g) {
        print $1;
    } else {
        print "Nothing Here!";
    }
    cheerfully spits out
    perl 23.pl
    What I want 1
    However
    1. Your /g isn't doing what you think. You've specified a single set of tags. /g will find the content between them if they're repeated, but it won't find "2st [sic] movie".
    2. Your pseudo-html makes no sense: tables without rows or data cells?
    3. Using LWP or similar, if you aren't already, could save you the trouble of saving the source data as a text file.
    4. It's a tad peculiar to name the input FH in your code as "OUTPUT"
    5. and, if you're going to parse html, use a module. There are just too many ways to go wrong while rolling your own.
      The problem is that the tags have something like:

      my $output = "<!-- 1st table --> What I want 1<!-- /1st table -->more stuff...<!-- 2st movie --> What I want 2<!-- /2st movie -->...more stuff...<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff";


      With a carriage return or something like that in it, I can't get it to match.
        am: use the download link beneath the code to capture it rather than copy-pasting... or remove the newlines from what you copy-pasted until you have the $output as a single line in your editor.

        and... updating the previous: I realized, belatedly, that you appear to want to capture the contents of all the tag pairs, rather than just the first. Sorry, the code I posted captures only the first, and so far I haven't worked out a simple (aka "elegant") and understandable way to do them all with a regex. Cf. the advice to use an HTML parser or (new suggestion) a module designed to deal with matching pairs. Perhaps wiser monks will offer more particular suggestions.
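For readers following along, here is a minimal sketch of one way a single regex might pick up every pair, assuming each opening comment is closed by a comment with the same label prefixed by a slash, and that labels never contain angle brackets. The sub name extract_marked_sections is hypothetical, not something from the thread. The backreference \1 ties each closing comment to its own opening label, and /s lets the content span lines.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [label, content] pairs, one for
# every <!-- LABEL -->...<!-- /LABEL --> block found in $html.
sub extract_marked_sections {
    my ($html) = @_;
    my @sections;
    # $1 = label (no angle brackets), $2 = content up to the matching close.
    # /g walks through the string pair by pair; /s lets "." match newlines.
    while ($html =~ m{<!--\s*([^<>]+?)\s*-->(.*?)<!--\s*/\1\s*-->}gs) {
        push @sections, [$1, $2];
    }
    return @sections;
}

my $output = "<!-- 1st table -->What I want 1<!-- /1st table -->more stuff..."
           . "<!-- 2st movie -->What I want 2<!-- /2st movie -->...more stuff..."
           . "<!-- 3st movie -->What I want 3<!-- /3st movie -->...more stuff";

for my $section (extract_marked_sections($output)) {
    my ($label, $content) = @$section;
    print "$label: $content\n";
}
```

This is still a regex over markup, so the earlier warnings apply: it is only as reliable as the assumption that the comment markers are well formed and never nested.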

Re: Pattern Search on HTML source.
by NetWallah (Canon) on Dec 31, 2007 at 17:12 UTC
    While strongly agreeing with the previous two posts (use an HTML parser module), one of the issues you may run into with your code is described in Death to Dot Star!.

    For a quick hack, I'd suggest using

    ([^<]+)
    instead.
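A small sketch of how that substitution might look in practice, assuming the wanted text itself never contains a "<" character. The sub name first_table_text and the sample string are made up for illustration. Because a negated character class matches newlines too, no extra modifier is needed for multi-line content.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the "1st table" markers,
# or undef (empty list in list context) when there is no match.
sub first_table_text {
    my ($html) = @_;
    # [^<]+ matches any run of characters other than "<", newlines included,
    # so the capture cannot run past the start of the next tag or comment.
    return unless $html =~ m{<!-- 1st table -->([^<]+)<!-- /1st table -->};
    my $text = $1;
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace and newlines
    return $text;
}

my $output = "<!-- 1st table -->\nWhat I want 1\n<!-- /1st table -->more stuff...";
my $text   = first_table_text($output);
print defined $text ? "$text\n" : "Nothing Here!\n";
```

The trade-off is that any markup inside the wanted content (a <b> tag, say) makes the match fail outright, which is one more reason the parser advice above stands.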

         "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom

      Why doesn't something like this work?

      if ( $line_test=~/<!-- 1st table -->\s*(.*?)\s*<!-- \/1st table -->/g) {print $1;}


      That is, when the content has a carriage return or newline in it.
        Try adding an "s" modifier, along with your "g".

        s

        Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
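To make the difference concrete, here is a short sketch that runs the same pattern with and without /s. The sample string is invented for illustration, with newlines placed inside the wanted content.

```perl
use strict;
use warnings;

my $line_test = "<!-- 1st table -->\nWhat I\nwant 1\n<!-- /1st table -->more stuff";

# Without /s, "." refuses to match the newline inside the content,
# so the pattern cannot reach the closing comment and the match fails.
my $without_s = $line_test =~ m{<!-- 1st table -->\s*(.*?)\s*<!-- /1st table -->} ? 1 : 0;

# With /s, "." also matches newlines, so the content may span lines.
my ($content) = $line_test =~ m{<!-- 1st table -->\s*(.*?)\s*<!-- /1st table -->}s;

print "without /s: ", ($without_s ? "match" : "no match"), "\n";
print "with /s: $content\n" if defined $content;
```

Note that /s only helps when the whole text is in one string. A while (<FH>) loop hands the regex one line at a time, so the file would have to be read in one go first, for example by localizing $/ before reading.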

             "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom

Re: Pattern Search on HTML source.
by dwm042 (Priest) on Dec 31, 2007 at 18:49 UTC
    I've had success looking at HTML tags and their content, on an as-needed basis, by grabbing pages with LWP::UserAgent and parsing the result with HTML::TreeBuilder or HTML::TableExtract. With the TreeBuilder module you get a tree of your HTML returned, and you can walk down the tree and choose what you want to extract. TableExtract is a more specific parser just for tables.

    For docs on using HTML::TreeBuilder, there is the very nice HTML::Tree::Scanning.

    Update: TreeBuilder docs.
Re: Pattern Search on HTML source.
by Popcorn Dave (Abbot) on Jan 02, 2008 at 00:39 UTC
    If you're only trying to get the text between the tags, have you considered using HTML::Strip? Or do you need to know what tags the text is between? If so, have you looked at using HTML::TokeParser?

    One of those should probably do the trick for you.


    Revolution. Today, 3 O'Clock. Meet behind the monkey bars.

    I would love to change the world, but they won't give me the source code

Node Type: perlquestion [id://659770]
Approved by Corion