PerlMonks  

how to use regular expressions read some string from a htm file

by weihe (Initiate)
on Aug 02, 2006 at 05:04 UTC ( [id://565156]=perlquestion )

weihe has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: how to use regular expressions read some string from a htm file
by GrandFather (Saint) on Aug 02, 2006 at 05:13 UTC

    You don't. You use HTML::TreeBuilder or a similar module. Life is too short to bother reinventing that particular wheel. It is hard to write regexen that parse markup because there are many special cases to handle, such as white space. Try something like:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $str = <<'STR';
    <html><head><title>my page></title></head>
    <body>
    <table><tr><td>
    <a href="http://mysite/bbsui.jsp?id=dxpwd">dxpwd</a>
    </td><td>
    <a href="http://mysite/bbsui.jsp?id=jimeth">jimeth</a>
    </td><td>
    <a href="http://mysite/bbsui.jsp?id=jone28">jone28</a>
    </td><td>
    <a href="http://mysite/bbsui.jsp?id=25528">25528</a>
    </td></tr>
    </body></html>
    STR

    my $tree = HTML::TreeBuilder->new;
    $tree->parse ($str);
    print $_->attr ('href') . "\n" for $tree->find ('a');

    Prints:

    http://mysite/bbsui.jsp?id=dxpwd
    http://mysite/bbsui.jsp?id=jimeth
    http://mysite/bbsui.jsp?id=jone28
    http://mysite/bbsui.jsp?id=25528

    DWIM is Perl's answer to Gödel
Re: how to use regular expressions read some string from a htm file
by Zaxo (Archbishop) on Aug 02, 2006 at 05:10 UTC

    Don't use a regular expression; use HTML::LinkExtor.

    It will drive you mad to try this with regexen.

    After Compline,
    Zaxo
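
A minimal sketch of the HTML::LinkExtor approach suggested above. The sub name `extract_hrefs` is made up for illustration, and the sample markup is a shortened version of the table in GrandFather's reply. It relies on the documented behaviour that, when no callback is given, `links` returns array references of the form `[$tag, attr => url, ...]`.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the href of every <a> element in an HTML string.
sub extract_hrefs {
    my ($html) = @_;

    my $parser = HTML::LinkExtor->new;   # no callback: links are collected internally
    $parser->parse($html);
    $parser->eof;                        # signal end of input so buffered text is flushed

    my @hrefs;
    for my $link ($parser->links) {      # each entry is [$tag, attr => url, ...]
        my ($tag, %attr) = @$link;
        push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
    }
    return @hrefs;
}

# Shortened sample markup taken from GrandFather's reply.
my $html = <<'HTML';
<table><tr>
<td><a href="http://mysite/bbsui.jsp?id=dxpwd">dxpwd</a></td>
<td><a href="http://mysite/bbsui.jsp?id=jimeth">jimeth</a></td>
</tr></table>
HTML

print "$_\n" for extract_hrefs($html);
```

HTML::LinkExtor also reports links from other elements (for example `img src`), which is why the loop checks the tag name before keeping the URL.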

Re: how to use regular expressions read some string from a htm file
by reneeb (Chaplain) on Aug 02, 2006 at 05:48 UTC
    You can use HTML::Parser:
    #! /usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    my @links;
    my $string = qq~<a href="url1">linktext1</a> Ein anderer Text <a href="url2">linktext2</a> text~;

    my $p = HTML::Parser->new();
    $p->handler(start => \&start_handler, "tagname,attr,self");
    $p->parse($string);

    foreach my $link (@links) {
        print "Linktext: ", $link->[1], "\tURL: ", $link->[0], "\n";
    }

    sub start_handler {
        return if (shift ne 'a');
        my ($class) = shift->{href};
        my $self = shift;
        my $text;
        $self->handler(text => sub { $text = shift; }, "dtext");
        $self->handler(end  => sub { push(@links, [$class, $text]) if (shift eq 'a') }, "tagname");
    }
Re: how to use regular expressions read some string from a htm file
by rsriram (Hermit) on Aug 02, 2006 at 05:34 UTC

    Hi, it is smarter to use modules rather than regular expressions when working with HTML files. But if you insist on using a regex, try this:

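    # Pass the HTML file name as the first command-line argument.
    open (F1, '<', $ARGV[0]) || die ("Can't open the file $ARGV[0]. $!\n");
    while (<F1>)
    {
       # Print every href on the line, not just the first match.
       print "$1\n" while ($_ =~ /<a href="([^"]+)">/g);
    }
    close F1;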

    In the above script, F1 is the filehandle for the HTML file, whose name is taken from the first command-line argument ($ARGV[0]).
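
To make the other replies' warning concrete, here is a small sketch that runs the pattern from rsriram's script against a few anchor variants. The variants and the helper name `matches` are invented for illustration; each is an ordinary way of writing the same link.

```perl
use strict;
use warnings;

# The pattern used in the script above.
my $re = qr/<a href="([^"]+)">/;

# Hypothetical test strings: the same link written six different ways.
my @variants = (
    '<a href="page.html">x</a>',               # the exact form the pattern expects
    '<a  href="page.html">x</a>',              # two spaces after the tag name
    "<a href='page.html'>x</a>",               # single-quoted attribute value
    '<a class="nav" href="page.html">x</a>',   # another attribute before href
    '<A HREF="page.html">x</A>',               # upper-case tag and attribute
    '<a href="page.html" title="t">x</a>',     # another attribute after href
);

# Return 1 if the pattern finds a link in the string, 0 otherwise.
sub matches {
    my ($s) = @_;
    return $s =~ $re ? 1 : 0;
}

printf "%-42s %s\n", $_, (matches($_) ? 'matched' : 'missed') for @variants;
```

Only the first variant is matched. The pattern could be loosened to cover each case in turn, but that is the wheel the parsing modules have already built.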

Re: how do i get special string from source of text file
by gellyfish (Monsignor) on Aug 02, 2006 at 08:09 UTC

    I'd suggest using HTML::LinkExtor to extract the URLs from the <a /> elements and then throwing away the ones you don't want afterwards. However, as you don't say how to distinguish between the ones you do want and the ones you don't, I'm not going to guess and give you an example.

    /J\
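
A rough sketch of the extract-then-discard idea, for what it is worth. The selection rule here is purely an assumption, since the question gives none: it keeps links whose `id` parameter is alphabetic only. The names `wanted_links` and `$WANTED` are likewise invented. This version uses the callback form of HTML::LinkExtor, where the callback receives the tag name followed by the link attributes as key/value pairs.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Assumed criterion (not from the thread): the URL ends in an "id"
# query parameter made up of letters only.
my $WANTED = qr/[?&]id=[A-Za-z]+\z/;

# Hypothetical helper: collect every <a href>, then keep those matching $wanted.
sub wanted_links {
    my ($html, $wanted) = @_;

    my @all;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;           # tag name, then link attributes
        push @all, $attr{href} if $tag eq 'a' && defined $attr{href};
    });
    $parser->parse($html);
    $parser->eof;

    return grep { $_ =~ $wanted } @all;  # throw away the ones we don't want
}

# The four links from GrandFather's sample table.
my $html = join '', map { qq{<a href="http://mysite/bbsui.jsp?id=$_">$_</a>\n} }
    qw(dxpwd jimeth jone28 25528);

print "$_\n" for wanted_links($html, $WANTED);
```

With this assumed rule only the `dxpwd` and `jimeth` links survive; swapping in a different pattern changes the selection without touching the extraction step.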

Re: how to use regular expressions read some string from a htm file
by planetscape (Chancellor) on Aug 02, 2006 at 17:45 UTC

    As many others have pointed out, you don't.

    In addition to the other excellent examples above, you could also use mech-dump, which comes with WWW::Mechanize, e.g.:

    mech-dump --links http://www.perlmonks.org

    You would still, of course, need to do some post-processing to get just the links you want, but so far you have posted no criteria for deciding which ones those are.

    HTH,

    planetscape
