PerlMonks  

Regex Exercise

by deprecated (Priest)
on Mar 16, 2001 at 21:38 UTC ([id://64957]=perlquestion)

deprecated has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks...

I am writing a small script to hashify an HTML table. The table is large, but completely homogeneous (thank goodness). So, without further ado, I give you the HTML:

<tr><td><b><a href=i386/zh-xcin-2.3.04.tgz-long.html>zh-xcin-2.3.04.tgz</a></b></td><td>&nbsp&nbsp&nbsp <i>chinese input utility for X </i></td><td>[ <a href=ftp://ftp.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 1</a> ]</td><td> [ <a href=ftp://ftp1.usa.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 2</a> ]</td></tr>
So, for simplicity, I zapped the \n and \r characters that were lurking in there, leaving one big brick of HTML (which I will spare all of you; nobody ever said HTML was pretty). So I have the following code:
my @fields = split '<tr><td><b>', $input;
foreach my $field (@fields) {
    # what i really wanted to do was...
    # (undef, $names{$1}) =~ m// but that didnt work either
    # so I added the $foo and $bar.
    my ($foo, $bar) = $field =~ m!^<a href=.*>(.*)</a></b></td><td>&nbsp{3}<i>(.*)</i>.*$!x;
    $names{$foo} = $bar;
    print "$foo == $bar\n";
}
If I print $field I do get my HTML, so I know $field is okay... I think the problem is the regex. In fact, I'm 90% sure it's the regex. But where is it wrong, given the data? It looks fine to me.

Thanks
brother dep.

--
transcending "coolness" is what makes us cool.

Replies are listed 'Best First'.
Re: Regex Exercise
by japhy (Canon) on Mar 16, 2001 at 21:44 UTC
    The problem is &nbsp{3} matches the string "&nbsppp", not "&nbsp&nbsp&nbsp".

    japhy -- Perl and Regex Hacker
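
The point above can be checked directly. This is a minimal sketch, not part of the original reply; the variable names are made up for illustration. It shows that a quantifier such as {3} binds only to the single atom before it, and that a non-capturing group widens its reach to the whole entity.

```perl
use strict;
use warnings;

my $cell = '&nbsp&nbsp&nbsp';

# {3} applies only to the 'p', so the pattern means "&nbs" followed by "ppp".
my $ungrouped = ($cell     =~ /^&nbsp{3}$/) ? 1 : 0;
my $literal   = ('&nbsppp' =~ /^&nbsp{3}$/) ? 1 : 0;

# Wrapping the entity in (?:...) makes {3} repeat the whole "&nbsp".
my $grouped   = ($cell =~ /^(?:&nbsp){3}$/) ? 1 : 0;

print "ungrouped pattern vs. three entities: $ungrouped\n";   # 0
print "ungrouped pattern vs. '&nbsppp':      $literal\n";     # 1
print "grouped pattern vs. three entities:   $grouped\n";     # 1
```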
Re: Regex Exercise
by Malkavian (Friar) on Mar 16, 2001 at 21:54 UTC
    Perhaps the line:
    m!^<a href=[^>]+>([^<]+)</a></b></td><td>(?:&nbsp){3}<i>([^<]+)</i>.*$!x;

    may help?
    (It's untested, and I'm not that great at regex either. :) )

    Malk.

    Update: Oops, I forgot the capturing brackets; now added back in.
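
Besides grouping the entity, the suggestion above swaps the greedy .* for negated character classes. The sketch below illustrates why that can matter; the two-link input string is invented for this example and is not taken from the table in the question.

```perl
use strict;
use warnings;

# Hypothetical input with two links, invented for this illustration.
my $html = '<a href=one.html>One</a> <a href=two.html>Two</a>';

# Greedy .* runs to the end of the string and backtracks only as far as it
# must, so the capture ends up holding the text of the LAST link.
my ($greedy) = $html =~ m{<a href=.*>(.*)</a>};

# [^>]+ cannot cross the end of the opening tag and [^<]+ cannot cross the
# start of the closing tag, so the capture is the text of the FIRST link.
my ($bounded) = $html =~ m{<a href=[^>]+>([^<]+)</a>};

print "greedy:  $greedy\n";    # greedy:  Two
print "bounded: $bounded\n";   # bounded: One
```

On a single row like the one in the question, both forms may happen to capture the same text; the bounded form just makes that less dependent on what follows in the string.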
Re: Regex Exercise
by gryphon (Abbot) on Mar 16, 2001 at 22:32 UTC

    Greetings deprecated,

    Well, this isn't the best or most compact regex in the world, but I've tested this, and it appears to work in the trials I've done. Give it a try. Someone with additional regex experience should be able to shorten my match string somewhat, I suspect.

    use strict;

    my $input = "<tr><td><b><a href=i386/zh-xcin-2.3.04.tgz-long.html>zh-xcin-2.3.04.tgz</a></b></td><td>&nbsp&nbsp&nbsp<i>chinese input utility for X</i></td><td>[ <a href=ftp://ftp.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 1</a> ]</td><td>[ <a href=ftp://ftp1.usa.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 2</a> ]</td></tr>";

    my %data;
    my @fields = split '<tr><td><b>', $input;
    shift @fields;

    foreach my $field (@fields) {
        ($data{fileurl}, $data{filename}, $data{description}, $data{ftp1}, $data{ftp2}) =
            $field =~ m#^<a href=(.*?)>(.*?)</a></b></td><td>&nbsp&nbsp&nbsp<i>(.*?)</i></td><td>\[ <a href=(.*?)>.*?</a> ]</td><td>\[ <a href=(.*?)>.*#;
        print "$2 == $3\n";
    }

    Yeah, I know. It's a bit clunky. Given additional known constants for your specific situation, you may be able to streamline this a bit better than me. Anyway, good luck!

    -Gryphon.
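
Pulling the replies together, here is one possible sketch of the name-to-description hash the question was after. It is not code from the thread. It assumes each row follows the sample's shape, tolerates optional whitespace and an optional ';' after each entity (both assumptions), and walks the input with a global match instead of split. The second row is a made-up package added only so the loop has more than one row to process.

```perl
use strict;
use warnings;

# First row is trimmed from the sample in the question; the second row is
# hypothetical, invented for illustration.
my $input = join '',
    '<tr><td><b><a href=i386/zh-xcin-2.3.04.tgz-long.html>zh-xcin-2.3.04.tgz</a></b></td>',
    '<td>&nbsp&nbsp&nbsp <i>chinese input utility for X </i></td></tr>',
    '<tr><td><b><a href=i386/example-1.0.tgz-long.html>example-1.0.tgz</a></b></td>',
    '<td>&nbsp&nbsp&nbsp<i>hypothetical second package</i></td></tr>';

my %names;

# With /x, whitespace in the pattern is ignored, so \s is spelled out where
# the data may contain a space. With /g in a while loop, each pass picks up
# the next row.
while ($input =~ m{
        <tr><td><b>
        <a \s+ href=[^>]+> ([^<]+) </a></b></td>   # first capture: file name
        \s* <td> (?:&nbsp;?){3} \s*                # three entities in a row
        <i> \s* ([^<]*?) \s* </i>                  # second capture: description
    }gx) {
    $names{$1} = $2;
}

print "$_ == $names{$_}\n" for sort keys %names;
```

Rows that do not match are silently skipped here, so comparing the number of keys against the expected row count is a cheap sanity check. For HTML less regular than this table, a real parser is usually the safer choice.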
