
Screen-scraping using XSH - O'Reilly Animal lister

by merlyn (Sage)
on Oct 22, 2003 at 21:32 UTC (#301381, sourcecode)
Category: HTML Utility
Author/Contact Info: merlyn
Description: Using the XSH language, screen-scrape O'Reilly's "Animals" page and generate a new XML document that lists the animals alphabetically, each followed by the books whose covers use that animal.

From a forthcoming Linux Magazine column of mine.

The output looks like:

<root>
  ..
  <cover>
    <animal>Turtle</animal>
    <book>Using and Managing PPP</book>
  </cover>
  <cover>
    <animal>Victoria crowned pigeons</animal>
    <book>lex &amp; yacc</book>
  </cover>
  <cover>
    <animal>Wall creepers</animal>
    <book>Transact-SQL Programming </book>
  </cover>
  <cover>
    <animal>Wallaby &amp; joey</animal>
    <book>Enterprise JavaBeans</book>
    <book>WebLogic Server 6.1 Workbook for Enterprise Javabeans</book>
    <book>WebSphere 4.0 AEs Workbook for Enterprise Javabeans</book>
  </cover>
  <cover>
    <animal>Warriors</animal>
    <book>Security Warrior</book>
  </cover>
  <cover>
    <animal>Weasel</animal>
    <book>Web Design in a Nutshell</book>
  </cover>
  ..
</root>

use XML::XSH;
xsh <<'END_XSH';

recovering 1; # for broken entity recovery (a frequent HTML problem)
quiet; # avoid tracing of open
open HTML animals = "";
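foreach {1..2} {
  # the "Book Title" header sits in column 1 or 2, so try both positions
  foreach //table[not(.//table)
                  and contains(tr[1]/td[$__], "Book Title")
                 ]/tr[position() > 1] {
    # pwd;
    $cover = string(td[last()]);        # animal: last cell of the row
    $subject = string(td[last() - 1]);  # book title: next-to-last cell
    eval { push @{$cover{$cover}}, $subject; }
  }
}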

create t1 root;
foreach {sort keys %cover} {
  ## print "animal $__";
  insert element cover into /root;
  cd /root/cover[last()];
  insert element animal into .;
  insert text $__ into animal;
  foreach {sort @{$cover{$__}}} {
    ## print " book $__";
    insert element book into .;
    insert text $__ into book[last()];
  }
}
quiet; # avoid final message from ls
ls /;
END_XSH
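
For readers who don't have XSH handy, here is a rough sketch of the same two passes (pick the innermost tables whose header row mentions "Book Title", then group books by animal and emit sorted XML). It is written in Python with only the standard library, not in XSH, and it is not the column's code. The SAMPLE markup, the "No." column, and the function names scrape and build_xml are made up for illustration. It also assumes well-formed markup (the XSH script relies on libxml2's recovering HTML parser instead) and trims whitespace from cells, which the XSH version does not.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed stand-in for the scraped page (not the real markup).
SAMPLE = """
<html><body>
<table><tr><td>
  <table>
    <tr><td>Book Title</td><td>Animal</td></tr>
    <tr><td>Web Design in a Nutshell</td><td>Weasel</td></tr>
    <tr><td>Security Warrior</td><td>Warriors</td></tr>
    <tr><td>Enterprise JavaBeans</td><td>Wallaby &amp; joey</td></tr>
  </table>
</td></tr></table>
<table>
  <tr><td>No.</td><td>Book Title</td><td>Animal</td></tr>
  <tr><td>1</td><td>WebLogic Server 6.1 Workbook for Enterprise Javabeans</td><td>Wallaby &amp; joey</td></tr>
</table>
</body></html>
"""


def scrape(html_text):
    """Return {animal: [book, ...]} from tables with a "Book Title" header."""
    root = ET.fromstring(html_text)
    books_by_animal = defaultdict(list)
    for table in root.iter("table"):
        # Like not(.//table): skip layout tables that wrap other tables.
        if table.find(".//table") is not None:
            continue
        rows = table.findall("tr")
        if not rows:
            continue
        header = ["".join(td.itertext()) for td in rows[0].findall("td")]
        # Like the {1..2} loop: the header may be in the first or second column.
        if not any("Book Title" in cell for cell in header[:2]):
            continue
        for row in rows[1:]:
            cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
            if len(cells) < 2:
                continue
            # Last cell is the animal, next-to-last cell is the book title.
            books_by_animal[cells[-1]].append(cells[-2])
    return books_by_animal


def build_xml(books_by_animal):
    """Build <root><cover><animal/><book/>...</cover>...</root>, sorted."""
    root = ET.Element("root")
    for animal in sorted(books_by_animal):
        cover = ET.SubElement(root, "cover")
        ET.SubElement(cover, "animal").text = animal
        for book in sorted(books_by_animal[animal]):
            ET.SubElement(cover, "book").text = book
    return root


print(ET.tostring(build_xml(scrape(SAMPLE)), encoding="unicode"))
```

The dictionary of lists plays the role of the %cover hash of arrays in the XSH script. Sorting happens only at output time, as in the original.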