Re: Scraping HTML: orthodoxy and reality

in reply to Scraping HTML: orthodoxy and reality

Gosh, I wish I even knew the difference between those HTML:: modules and how to put them to work! Given the examples in this thread, soon I'll have more than a hammer to do my scraping. :-)

In the meantime: when I open the linked web page in IE, press Ctrl-A and Ctrl-C, and paste the text into Notepad, the output is quite comprehensible to my HTML-untrained eye (compared with the raw HTML), e.g.

     
  impse400 (I3C) / 172.17.8.182
hp color LaserJet 4600 
 
 
  
 Information       

<snip much miscellaneous info>
  
  
For highest print quality always use genuine Hewlett-Packard supplies.
+ 
  
  
 BLACK CARTRIDGE
HP Part Number: HP C9720A  73%
 
    
 
 
Estimated Pages Remaining: 
 11025 
 
(Based on historical black page coverage of 2%) 
 
Low Reached: 
 NO 
 
Serial Number: 
 35860 
 
Pages printed with this supply: 
 4078 
 
 
  
 TRANSFER KIT
HP Part Number: HP C9724A  87%
 
  
 
Estimated Pages Remaining: 
 103856 
 

Etc.

With my regex sledgehammer it would be straightforward to process this data. Often the "pure text" version of a web page has far fewer nice hooks for sorting things out, but it works in THIS case. So my question is: is there a tool that emulates this select/copy/paste of a web page, so that the production of such text can be automated for follow-on regex processing?
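
For what it's worth, here is a rough sketch of what such a select/copy/paste emulation plus follow-on regex pass might look like. It is written in Python using only the standard library, not with any of the HTML:: modules discussed in this thread. The names html_to_text, parse_supplies, BLOCK_TAGS and SUPPLY_RE are made up for illustration, the list of tags treated as line breaks is a guess, and the regex assumes the text comes out in the same order as the paste above (name, part number with percentage, pages remaining). A real browser's copy/paste will not match this line for line.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags start or end a visual line in the rendered page.
BLOCK_TAGS = {"p", "div", "br", "hr", "table", "tr", "td", "th", "li",
              "h1", "h2", "h3", "h4"}
# Content inside these tags is never shown to the reader, so drop it.
SKIP_TAGS = {"script", "style"}


class TextDump(HTMLParser):
    """Collect the visible text of a page, roughly like select-all/copy."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text, one non-empty line per block, spaces squeezed."""
    parser = TextDump()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# Assumed layout, taken from the pasted sample: an upper-case supply name,
# then the part number and percentage, then the pages-remaining figure.
SUPPLY_RE = re.compile(
    r"^(?P<name>[A-Z][A-Z ]+)\n"
    r"HP Part Number: (?P<part>HP \S+)\s+(?P<pct>\d+)%\n"
    r"Estimated Pages Remaining:\s*(?P<pages>\d+)",
    re.MULTILINE,
)


def parse_supplies(text):
    """Pull one record per supply out of the text produced by html_to_text."""
    return [
        {
            "name": m["name"].strip(),
            "part": m["part"],
            "percent": int(m["pct"]),
            "pages_remaining": int(m["pages"]),
        }
        for m in SUPPLY_RE.finditer(text)
    ]
```

Feeding the printer's status page to html_to_text and the result to parse_supplies should give one dictionary per supply, provided the page really is laid out the way the paste suggests. Text-mode browsers can also do the first half from the command line (lynx has a -dump option for this), which may be all the "tool" that is needed.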
