Re: Scraping HTML: orthodoxy and reality
by BrowserUk (Patriarch) on Jul 08, 2003 at 11:13 UTC
Amen grinder++. I wholeheartedly agree.
I came to much the same conclusion in Being a heretic and going against the party line, after having been using Perl for only a relatively short time. My experiences since have done little to change my mind.
Back in that old post I tried to make a distinction between the need to parse HTML and the need to extract something that just happens to be embedded within stuff that happens to be HTML. This distinction was roundly set upon as being wrong. I still hold with this distinction.
The dictionary definition of parse is
- To break (a sentence) down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
- To describe (a word) by stating its part of speech, form, and syntactical relationships in a sentence.
- To examine closely or subject to detailed analysis, especially by breaking up into components: “What are we missing by parsing the behavior of chimpanzees into the conventional categories recognized largely from our own behavior?” (Stephen Jay Gould).
- To make sense of; comprehend: I simply couldn't parse what you just said.
Whilst the dictionary definition of extract is:
- To draw or pull out, often with great force or effort: extract a wisdom tooth; used tweezers to extract the splinter.
- To obtain despite resistance: extract a promise.
- To obtain from a substance by chemical or mechanical action, as by pressure, distillation, or evaporation.
- To remove for separate consideration or publication; excerpt.
- To derive or obtain (information, for example) from a source.
- To deduce (a principle or doctrine); construe (a meaning).
- To derive (pleasure or comfort) from an experience.
- Mathematics. To determine or calculate (the root of a number).
From my perspective, when the need is to locate and capture one or more pieces of information from within any amount or structure of other stuff, without regard to the structural or semantic positioning of those pieces within the overall structure, the term extraction is more applicable than parsing. If I need to understand the structure, derive semantic meaning from the structure or verify its correctness, then I need to parse; otherwise I just need to extract. After all, Practical Extraction and Reporting is what that Language was first designed to do.
My final, and strongest, argument lies in a simple premise. If the information I was after was embedded amongst a lot of Arabic, Greek or Chinese, then no one would expect me to find and use a module that understood those languages just to extract the bits I needed.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] |
Ironically the distinction that you draw is the same one that I use to argue against using regular expressions for parsing problems.
Regular expressions are designed as a tool for locating specific patterns in a sea of stuff. (Well until Perl 6 that is...) Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it. Parsing is a lot more work, but for structured text is going to give much more robust solutions. For instance you avoid different kinds of data being mistaken for each other.
The problem is that people are used to using regular expressions for text manipulation, and then set out to solve what is really a parsing problem with regular expressions. Then they fail (and may or may not realize it). This happens so routinely that the knee-jerk response is that virtually anything which can be done with parsing should be, rather than with regular expressions. And indeed this is good advice to give to someone who doesn't understand the parsing wheels - if only to avoid the problem of all problems looking like nails for the one hammer (regexps) that you have.
However the two kinds of problems are different and do overlap. Where they do overlap, it isn't necessarily obvious which is more practical. It isn't even necessarily obvious from the problem specification - sometimes you need to make a guess about how the code will evolve to know that...
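A tiny example of the overlap going wrong: a pattern-matching pass happily captures data that a structure-aware pass would know is inert. The <price> element and its values are invented here purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The second <price> lives inside an HTML comment, so structurally it
# is not data at all -- but a naive pattern can't tell.
my $html = '<price>42</price> <!-- <price>99</price> -->';

# Pure pattern matching: grab every <price> in sight.
my @naive = $html =~ m{<price>(\d+)</price>}g;

# A crude nod to structure: remove comments first, then match.
(my $stripped = $html) =~ s{<!--.*?-->}{}gs;
my @aware = $stripped =~ m{<price>(\d+)</price>}g;

print "naive: @naive\n";   # naive: 42 99
print "aware: @aware\n";   # aware: 42
```

A real parser generalizes that comment-stripping step to every structural rule at once, which is exactly the robustness described above.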
| [reply] |
Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it.
Parsing typically has two phases though: the first is tokenization and the second parse tree generation (I'm sure there is a better term but I forget what it is). These phases more often than not occur in sync, but they need not. Either way, regexes are perfectly suited to tokenization.
I learned the most about regexes from writing a regex tokenizer and parser. I learned a lot more from the tokenizer than from the parser tho. :-) Writing regexes to tokenize regexes is a fun head trip. (Incidentally the whole idea was to be able to use regexes to specify and generate random test data.)
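The tokenization half of that split really is a natural fit for regexes: a single \G-anchored alternation peels off one token per iteration. This is a minimal sketch over an invented toy arithmetic grammar, not demerphq's actual regex tokenizer:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tokenize a tiny arithmetic language with one anchored regex.
sub tokenize {
    my $src = shift;
    my @tokens;
    while ($src =~ m{ \G \s* (?:
                          (\d+)         # NUM
                        | ([a-z]\w*)    # IDENT
                        | ([-+*/()])    # OP
                      ) }gcx) {
        push @tokens,
              defined $1 ? [ NUM   => $1 ]
            : defined $2 ? [ IDENT => $2 ]
            :              [ OP    => $3 ];
    }
    return @tokens;
}

my @t = tokenize('foo + 42 * (bar - 1)');
print join(' ', map { "$_->[0]:$_->[1]" } @t), "\n";
# IDENT:foo OP:+ NUM:42 OP:* OP:( IDENT:bar OP:- NUM:1 OP:)
```

The /c modifier keeps pos() where the last match left it, so an unrecognized character simply ends the loop rather than restarting the scan.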
---
demerphq
<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
| [reply] [d/l] |
Re: Scraping HTML: orthodoxy and reality
by PodMaster (Abbot) on Jul 08, 2003 at 08:08 UTC
After seeing
HP200LX:: on cpan,
I suggest you stick it in HP::4600::Status(Scrape)? (or something like HP::Printer::4600, or whatever somewhat corresponds to the HP naming convention ;) and
suggest to the author of HP200LX:: to rename his HP::200:: yada yada.
As for your notes on HTML scraping reality, check out YAPE::HTML; it's regex based.
MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!" | I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README). | ** The third rule of perl club is a statement of fact: pod is sexy. |
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by gjb (Vicar) on Jul 08, 2003 at 13:37 UTC
Although I agree that choosing between a regexp approach and a context-free grammar approach depends on the problem at hand, I'd like to stress that halley made a very important point:
Rules are meant to be broken, but you have to understand them before you can break them... safely.
Although a lot of Monks will know the distinction between a regular language and a context-free language (and I'm sure grinder and BrowserUK do), I'm rather sure that some don't. In the latter case, unfortunately, those Monks simply don't know the rules and have lots of opportunity to mess up.
I'd like to paraphrase: "a little thinking is a dangerous thing" if the process is not supported by a proper amount of background knowledge.
It is possible to approximate a context-free grammar with a regular expression; a nice survey article about that has been written by Mark-Jan Nederhof. There are several good books about formal languages, but I'd particularly recommend Sipser's since it is well written and nice to read.
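The flavour of such approximations is easy to show in Perl: iterate a parenthesis pattern out to a fixed depth and you get a regular expression that accepts nesting up to that depth and nothing deeper. This is only a sketch of the general trick, not Nederhof's actual construction:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a regex matching balanced parens nested at most $d levels deep.
sub depth_re {
    my $d  = shift;
    my $re = qr/[^()]*/;                              # depth 0: no parens
    $re = qr/[^()]*(?:\($re\)[^()]*)*/ for 1 .. $d;   # wrap once per level
    return qr/^$re$/;
}

my $depth2 = depth_re(2);

print "(a(b)c)"    =~ $depth2 ? "ok\n" : "no\n";   # nested 2 deep: ok
print "(a(b(c))d)" =~ $depth2 ? "ok\n" : "no\n";   # nested 3 deep: no
```

A true regular language cannot count unbounded nesting, so every extra level costs another wrap of the pattern; that is the trade-off between the two language classes in miniature.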
Conclusion: even if you know the rules, but don't understand them, don't try and break them. More importantly: try and understand the rules you're following.
Just my 2 cents, -gjb-
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by halley (Prior) on Jul 08, 2003 at 13:03 UTC
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by mojotoad (Monsignor) on Jul 08, 2003 at 15:22 UTC
I am of course biased, but...
So I brought HTML::TreeBuilder to bear on the task. It wasn't quite as easy. It was no simple matter to find a reliable part in the tree from whence to direct my search. The HTML contains deeply nested tables, with a high degree of repetition for each kit and cartridge. The various pieces of information are scattered in different elements, and collecting and collating them made for some pretty ugly code.
From a logical standpoint, HTML::TableExtract seems to be a perfect choice for this. It might not be a perfect choice for efficiency, for which you seem to have a requirement (though I'm not sure what total run time you were shooting for...how often is this tool going to be run?)
Could you give an example HTML page and some numbers such as how many of them you are expected to handle and how often? For purposes of discussion, let's say your parallel fetch more or less delivers all of the pages simultaneously.
(Despite my bias, I am not automatically anti-regexp parsing and I see both sides of that particular scuffle)
Matt | [reply] |
Well, having never used it, I'd be very interested in seeing how you'd do this with HTML::TableExtract. Here's an example page: http://grinder.perlmonk.org/hp4600/.
There are 6 printers today, and we'll probably be adding another 4 or so in the future.
As a general rule I really don't care about performance, but this is a rare case where I have to do something about it. The reason is that I want to be able to call this from mod_perl, so every tenth of a second is vital (in terms of human perception noticing lag in loading/rendering a page). It's for a small population of users (5 or so), and mod_perl is reverse proxied through lightweight Apache processes, so I'm not worried about machine resources.
I can't do anything about the time the printer takes to respond, but I do need the extraction to be as fast as possible to make up lost ground. There is always Plan B, which would be to cache the results via cron once or twice an hour; it's not as if the users drain one cartridge per day. I already do this for other status pages where the information is very expensive to calculate. People know the data aren't always fresh up to the minute but they can deal with that (especially since I always label the age of the information being presented).
I'll be very interested in seeing what you come up with. And if someone wants to show what a sub-classed HTML::Parser solution looks like, I think we'd have a really good real-life tutorial.
update: here's the proof-of-concept code as it stands today, as a yardstick to go by. The end result is a hash of hashes, C M Y and K are the colour cartridges and X and F are the transfer and fuser kits, respectively. These will mutate into something like HP::4600::Kit and HP::4600::Cartridge.
This code implements jeffa's observation of grepping the array for definedness, which indeed simplifies the problem considerably. Thanks jeffa!
#! /usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Request;

my @cartridge = qw/ name part percent remain coverage low serial printed /;
my @kit       = qw/ name part percent remain /;

for my $host( @ARGV ) {
    my $url = qq{http://$host/hp/device/this.LCDispatcher?dispatch=html&cat=0&pos=2};
    my $response = LWP::UserAgent->new->request( HTTP::Request->new( GET => $url ));
    if( !$response->is_success ) {
        warn "$host: couldn't get $url: ", $response->status_line, "\n";
        next;
    }
    $_ = $response->content;
    my (@s) = grep { defined $_ } m{
        (?:
            >           # closing tag
            ([^<]+)     # text (name of part, e.g. q/BLACK CARTRIDGE/)
            <br>
            ([^<]+)     # part number (e.g. q/HP Part Number: HP C9724A/)
            </font>\s+</td>\s*<td[^>]+><font[^>]+>
            (\d+)       # percent remaining
        )
        |
        (?:
            (?:
                (?:
                    Pages\sRemaining    # different text values
                  | Low\sReached
                  | Serial\sNumber
                  | Pages\sprinted\swith\sthis\ssupply
                )
                :
                \s*</font></p>\s*</td>\s*<td[^>]*>\s*<p[^>]*><font[^>]*>\s*   # separated by this
              |
                Based\son\shistorical\s\S+\spage\scoverage\sof\s   # or just this, within a <td>
            )
            (\w+)       # and the value we want
        )
    }gx;

    my %res;
    @{$res{K}}{@cartridge} = @s[ 0.. 7];
    @{$res{X}}{@kit}       = @s[ 8..11];
    @{$res{C}}{@cartridge} = @s[12..19];
    @{$res{F}}{@kit}       = @s[20..23];
    @{$res{M}}{@cartridge} = @s[24..31];
    @{$res{Y}}{@cartridge} = @s[32..39];

    print <<END_STATS;
$host
Xfer $res{X}{percent}% $res{X}{remain}
Fuse $res{F}{percent}% $res{F}{remain}
C $res{C}{percent}% cover=$res{C}{coverage}% left=$res{C}{remain} printed=$res{C}{printed}
M $res{M}{percent}% cover=$res{M}{coverage}% left=$res{M}{remain} printed=$res{M}{printed}
Y $res{Y}{percent}% cover=$res{Y}{coverage}% left=$res{Y}{remain} printed=$res{Y}{printed}
K $res{K}{percent}% cover=$res{K}{coverage}% left=$res{K}{remain} printed=$res{K}{printed}
END_STATS
}
_____________________________________________ Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.
| [reply] [d/l] |
Here's a quick example, just to give you an idea. I apologize for the crufty code.
This solution is still vulnerable to layout changes from the printer manufacturer. I really don't like having to use depth and count with HTML::TableExtract for this very reason -- if the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain.
HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware.
I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree.
Enjoy,
Matt
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use Data::Dumper;
use HTML::TableExtract;

my $depth  = 0;
my $count  = 0;
my $ddepth = 3;

my $html = get('http://grinder.perlmonk.org/hp4600/');
my %Device;

my $te = HTML::TableExtract->new;
$te->parse($html);
foreach my $ts ($te->table_states) {
    process_detail($ts) if $ts->depth == $ddepth;
    process_main($ts)   if $ts->depth == $depth && $ts->count == $count;
}

# Clean up the empty spots
@{$Device{stats}} = grep(defined, @{$Device{stats}});
print Dumper(\%Device);
exit;

sub process_main {
    my $ts = shift;
    my($host, $model) = _scrub(($ts->rows)[1]);
    $Device{host}  = $host;
    $Device{model} = $model;
}

sub process_detail {
    $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_);
}

sub _proc_detail_name {
    my $ts = shift;
    my($name, $part, $pct) = _scrub(($ts->rows)[0]);
    $part =~ s/.*:\s+//;
    $Device{stats}[$ts->count] =
        { name => $name, part => $part, pct => $pct };
}

sub _proc_detail_stats {
    my $ts = shift;
    my $i  = $ts->count - 1;
    @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printed)}
        = (map(_scrub($_), $ts->rows))[1,2,4,6,8];
}

sub _scrub {
    grep(!/^\s*$/s, map(split(/(\r|\n)+/, $_), @{shift()}));
}
| [reply] [d/l] |
Re: Scraping HTML: orthodoxy and reality
by hsmyers (Canon) on Jul 08, 2003 at 13:12 UTC
In almost any situation, there are those who prefer maxims to thinking---ignore them. With specific regard to the odd web scrape, there are a couple of things that can be a problem. - The need to handle nesting.
- The need to survive arbitrary changes in the source.
The first can be handled by tight expression bounding or by using something like Text::DelimMatch or Text::Balanced. You manage the second by vigilance---patch it when needed.
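For the nesting case, Text::Balanced (a core module since Perl 5.8) is worth a concrete look: its extract_tagged walks past nested occurrences of the same tag, which is precisely where a simple regex gives up. The <div> fragment here is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $html = '<div>outer <div>inner</div> text</div><p>after</p>';

# extract_tagged returns the balanced match and the remainder.
my ($match, $remainder) = extract_tagged($html, '<div>', '</div>');

print "match:     $match\n";
print "remainder: $remainder\n";
```

A naive m{<div>(.*?)</div>} against the same string would stop at the first </div>, splitting the outer element in half.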
As for those who knee-jerk instead of thinking: since many of them have no experience writing parsers (formal or ad hoc), they fail to see that the regex approach is a form of parsing, just without the overhead of dealing with things that don't matter.
--hsm
"Never try to teach a pig to sing...it wastes your time and it annoys the pig." | [reply] |
(jeffa) Re: Scraping HTML: orthodoxy and reality
by jeffa (Bishop) on Jul 08, 2003 at 14:27 UTC
my (@s) = grep $_, m{
big regex here
}gx;
Sure, it would be better not to have undefs in there in
the first place, and you said you know where they will be,
but why worry when grep is at your disposal. Just a
thought, and thanks for the post. :)
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
| [reply] [d/l] |
Re: Scraping HTML: orthodoxy and reality
by chunlou (Curate) on Jul 08, 2003 at 19:17 UTC
"Parse" vs "extract" or "regular language" vs "context free," etc. are indeed important distinctions to be made, as pointed out by some monks. Parsing data is a (more or less) mechanical process; extracting info is a human (A.I.) process.
Suppose you want to extract info by paragraph. Consider the following text fragment:
________________________________________
Look at the table below...
Could you behold the secret this unfolds?
A bit more, a bit more, irrelevant thought, a new paragraph...
________________________________________
You might see either two or three paragraphs (if you consider "Look... unfolds?" as one paragraph). Now, let's look at the html of the above text fragment:
<p>Look at the table below...
<table border="1"><tr><td><p>ho ho ho...</p></td></tr></table><br><br>
Could you behold the secret this unfolds?<br><br>
A bit more, a bit more, irrelevant thought, a new paragraph...</p>
A parser might only see one paragraph between the <p> and </p> tags. There is a <p></p> pair in the table. Is it a paragraph? A parser might ask.
Suppose the parser takes into consideration that some people use <br><br> to denote the end of a paragraph. "Look..." and "Could..." might be considered two paragraphs. What about "A bit..."? Or are "Look..." and the table two paragraphs?
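To make the ambiguity concrete, here is one syntactic heuristic in Perl -- drop tables, then treat <p> boundaries and <br><br> runs as breaks -- applied to the fragment above. It is only one defensible reading, which is rather the point:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = <<'HTML';
<p>Look at the table below...
<table border="1"><tr><td><p>ho ho ho...</p></td></tr></table><br><br>
Could you behold the secret this unfolds?<br><br>
A bit more, a bit more, irrelevant thought, a new paragraph...</p>
HTML

$html =~ s{<table.*?</table>}{}gis;      # a table is not a paragraph (says who?)
my @paras = grep { /\S/ }
            split m{</?p[^>]*>|(?:<br\s*/?>\s*){2,}}i, $html;
s/^\s+|\s+$//g for @paras;

print scalar @paras, " paragraphs\n";    # this heuristic says 3
```

Change either rule -- count the table's <p></p>, or stop honouring <br><br> -- and you get a different paragraph count from the same markup, which is exactly the semantic gap being described.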
Humans can read semantically; machines mostly syntactically. That's why extracting info is not the same problem as parsing data. | [reply] [d/l] [select] |
I applaud your attempt to clarify, though I don't necessarily agree with your example. For me, the difference between parsing something and extracting something is that when I parse something, I am interested in the whole something. When I extract something, I am only interested in a small part of the whole.
That's basically why using an HTML parser when all I want is an extraction, is so wasteful. I go to all the trouble of analysing (sometimes validating) and capturing the structure of the HTML only to throw all that effort away as soon as I have captured the bit I want.
It's a bit like carefully dismantling, labelling and boxing an entire stately home, brick by brick, when the only part of value is the Adam fireplace in the drawing room. If the rest of the place is simply going to be discarded, there really isn't any point in doing the extra work unless some conservation or restoration is intended. In code terms, that means I am going to modify, reconstruct or otherwise utilise the structure I've spent the effort capturing. In a large majority of screen-scraping tasks, the structure is simply discarded.
The argument for using a parser is that semantic knowledge gained from the structure is useful when used to locate the bits of data required, and that using the structural clues is more reliable (less brittle) when the page is updated than a dumb match by regex. The problem I have found is that when the page changes, the structure is just as likely to change in ways that require the script to be re-worked as is a match-by-context regex.
For every example one way, there is a counter example the other, and I would never criticise anyone who chooses to use a parser for this purpose. I just wish that they would give my (and others') informed and cognisant choice to do otherwise the same respect.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] |
What you said was logically sound. Apparently, "parse" and "extract" mean something very specific to you, whereas I merely used them in a loosey-goosey way.
But I do make distinction between "data" and "info." This is data: "1I2N! 2r2U 1E! 1R 1f"; this is info: "FIRE! RUN!".
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by ff (Hermit) on Jul 09, 2003 at 03:18 UTC
Gosh, I wish I even knew the difference between those HTML:: modules and how to put them to work! Given the examples in this thread, soon I'll have more than a hammer to do my scraping. :-)
In the meantime: when I go to the web page from the link via my IE browser and do a Ctl-A and Ctl-C and then paste the text into a Notepad screen, this particular output is quite comprehensible to my HTML-untrained eye (vs the HTML stuff), e.g.
impse400 (I3C) / 172.17.8.182
hp color LaserJet 4600
Information
<snip much miscellaneous info>
For highest print quality always use genuine Hewlett-Packard supplies.
BLACK CARTRIDGE
HP Part Number: HP C9720A 73%
Estimated Pages Remaining:
11025
(Based on historical black page coverage of 2%)
Low Reached:
NO
Serial Number:
35860
Pages printed with this supply:
4078
TRANSFER KIT
HP Part Number: HP C9724A 87%
Estimated Pages Remaining:
103856
Etc.
With my regex sledgehammer it would be straightforward to process this data. Oftentimes, when I look at the "pure text" version of a web page there aren't nearly as many nice hooks for sorting things out. But this is THIS case, and my question is: might there be a tool which emulates this action of select/copy/paste of a web page to automate the production of such text for follow-on regex processing? | [reply] [d/l] |
You could probably automate the C&P from your favorite browser (under Win32 at least) or use one of the console browsers (Lynx etc.) under *nix.
The question is, what would you have achieved? Not only would you have used a parser (the one built into the browser), but you would have also used its rendering engine, spawned a new process and gone through some form of IPC, whether it's a pipe or the clipboard. And you would still need to apply a regex to the result.
If you're going to use a parser, then you might as well use one of the many available to you via CPAN and avoid all that additional overhead. :)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] |
For sure, this approach is expensive cpu-wise, etc., but if I need a solution that works right away then "module fetches/renders HTML into text", combined with regex processing that at least I know how to do, IS a solution. Sure, per RBFuller, "... if the solution is not beautiful, I know it is wrong" but if those cycles won't be used for anything else, who cares? This bear of little brain would have his program done.
So, assuming that efficiency doesn't matter, I'm still fishing for something like building the $html object via a LWP 'get' as above and then turning it into text that I can examine with regexen. (However, since this is turning a golden object into lead, I'll do some more digging as you suggest, like re-reading this thread's Data::Dumper/HTML::TableExtract example! :-)
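One CPAN route to that "rendered text" without driving a browser is HTML::TreeBuilder to parse plus HTML::FormatText to render, then regexen over the result. Both are CPAN modules, not core; the tag soup below is an invented stand-in for the printer page, and the regex is illustrative only:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN dist: HTML-Tree
use HTML::FormatText;    # CPAN dist: HTML-Format

# Invented stand-in for the printer's status page.
my $html = <<'HTML';
<table><tr>
  <td><font>BLACK CARTRIDGE<br>HP Part Number: HP C9720A</font></td>
  <td><font>73</font></td>
</tr></table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 78)
                           ->format($tree);
$tree->delete;

# The rendered text resembles the Ctrl-A/Ctrl-C paste, so a plain
# regex can pick out the interesting bits.
my ($part, $pct) = $text =~ /HP Part Number:\s+(\S+\s+\S+).*?(\d+)/s;
print "$part at $pct%\n";
```

For the real page you would feed new_from_content the result of an LWP get, exactly as in the grep/Dumper example elsewhere in this thread.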
| [reply] |
Well, /s?he/ had the right idea. One wonders why HP can't just produce simpler HTML, or even provide a port for text output... ok, may be that's a little silly, but all of this to get a few numbers.
<grumble>Isn't the real problem here the obsession that some designer at HP has with producing beautiful output, good HTML practice be damned? Why do Dreamweaver jockeys have to make my life hard!!! AHHHHH!</grumble>
10POKE53280,A:POKE53281,A
20?"C64 RULES ";
30A=A+1:IFA=16THENA=0:GOTO10
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by John M. Dlugosz (Monsignor) on Jul 10, 2003 at 18:21 UTC
OK, you convinced me to use regex instead of a parser for my program. This avoids the problem of re-formatting the parse tree to resemble the original input (I can modify the found lines in-place easily), and I can live with "parser" limitations and simply not write goofy stuff in my HTML.
—John | [reply] |