PerlMonks  

Screen-scraping using XSH - O'Reilly Animal lister

by merlyn (Sage)
on Oct 22, 2003 at 21:32 UTC
Category: HTML Utility
Author/Contact Info merlyn
Description: Using the XSH language, screen-scrape O'Reilly's "Animals" page and generate a new XML file that lists the animals alphabetically, each followed by the books whose covers use that animal.

From a forthcoming Linux Magazine column of mine.

The output looks like:

<root>
  ..
  <cover>
    <animal>Turtle</animal>
    <book>Using and Managing PPP</book>
  </cover>
  <cover>
    <animal>Victoria crowned pigeons</animal>
    <book>lex &amp; yacc</book>
  </cover>
  <cover>
    <animal>Wall creepers</animal>
    <book>Transact-SQL Programming </book>
  </cover>
  <cover>
    <animal>Wallaby &amp; joey</animal>
    <book>Enterprise JavaBeans</book>
    <book>WebLogic Server 6.1 Workbook for Enterprise Javabeans</book>
    <book>WebSphere 4.0 AEs Workbook for Enterprise Javabeans</book>
  </cover>
  <cover>
    <animal>Warriors</animal>
    <book>Security Warrior</book>
  </cover>
  <cover>
    <animal>Weasel</animal>
    <book>Web Design in a Nutshell</book>
  </cover>
  ..
</root>

#!/usr/bin/perl

use XML::XSH;
xsh <<'END_XSH';

recovering 1; # for broken entity recovery (a frequent HTML problem)
quiet; # avoid tracing of open
open HTML animals = "http://www.oreilly.com/animals.html";
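# Try header columns 1 and 2 for the Book Title heading
foreach {1..2} {
  # Innermost tables only, skipping the header row
  foreach //table[not(.//table)
                  and contains(tr[1]/td[$__], "Book Title")
                 ]/tr[position() > 1] {
    # pwd;
    $cover = string(td[last()]); # last cell: the cover animal
    $subject = string(td[last() - 1]); # next-to-last cell: the book title
    eval { push @{$cover{$cover}}, $subject; }
  }
}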
create t1 root;
foreach {sort keys %cover} {
  ## print "animal $__";
  insert element cover into /root;
  cd /root/cover[last()];
  insert element animal into .;
  insert text $__ into animal;
  foreach {sort @{$cover{$__}}} {
    ## print " book $__";
    insert element book into .;
    insert text $__ into book[last()];
  }
}
quiet; # avoid final message from ls
ls /;
END_XSH
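
The grouping step in the script (pushing each book onto a per-animal list, then walking the sorted keys) can also be tried in plain Perl without XSH or a network connection. The following is only a minimal sketch of that idea, not the column's code: the sample rows, `xml_escape`, and `covers_to_xml` are hypothetical names made up for illustration, and the XML is built by string concatenation rather than through a DOM.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample rows: [book title, cover animal], standing in for
# the table cells that the XSH script extracts from the page.
my @rows = (
    [ 'Web Design in a Nutshell', 'Weasel' ],
    [ 'lex & yacc',               'Victoria crowned pigeons' ],
    [ 'Enterprise JavaBeans',     'Wallaby & joey' ],
    [ 'Security Warrior',         'Warriors' ],
);

# Escape the characters that are unsafe in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Group book titles by animal (a hash of arrays), then render the
# animals and their books in sorted order.
sub covers_to_xml {
    my (@pairs) = @_;
    my %cover;
    for my $pair (@pairs) {
        my ($book, $animal) = @$pair;
        push @{ $cover{$animal} }, $book;    # autovivifies the list
    }
    my $xml = "<root>\n";
    for my $animal (sort keys %cover) {
        $xml .= "  <cover>\n";
        $xml .= "    <animal>" . xml_escape($animal) . "</animal>\n";
        $xml .= "    <book>" . xml_escape($_) . "</book>\n"
            for sort @{ $cover{$animal} };
        $xml .= "  </cover>\n";
    }
    return $xml . "</root>\n";
}

print covers_to_xml(@rows);
```

As in the XSH version, the plain `sort` compares strings by character code, so uppercase names sort before lowercase ones; a case-insensitive comparison would need an explicit sort block.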
