http://qs321.pair.com?node_id=272552


in reply to Scraping HTML: orthodoxy and reality

Gosh, I wish I even knew the difference between those HTML:: modules and how to put them to work! Given the examples in this thread, soon I'll have more than a hammer to do my scraping. :-)

In the meantime: when I open the linked web page in IE, press Ctrl-A and Ctrl-C, and paste the text into a Notepad window, the output is quite comprehensible to my HTML-untrained eye (unlike the raw HTML), e.g.

impse400 (I3C) / 172.17.8.182 hp color LaserJet 4600 Information
<snip much miscellaneous info>
For highest print quality always use genuine Hewlett-Packard supplies.
+ BLACK CARTRIDGE HP Part Number: HP C9720A 73%
Estimated Pages Remaining: 11025 (Based on historical black page coverage of 2%)
Low Reached: NO
Serial Number: 35860
Pages printed with this supply: 4078
TRANSFER KIT HP Part Number: HP C9724A 87%
Estimated Pages Remaining: 103856
Etc.

With my regex sledgehammer it would be straightforward to process this data. Often, when I look at the "pure text" version of a web page, there aren't nearly as many nice hooks for sorting things out. But this is THIS case, and my question is: is there a tool that emulates this select/copy/paste of a web page, so that such text can be produced automatically for follow-on regex processing?
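
To make the question concrete, here is a rough sketch of the kind of thing I have in mind. It is written in Python using only the standard library (`html.parser`) rather than any of the HTML:: modules, and it only approximates what a browser's copy/paste does. The names `TextDumper`, `html_to_text`, and `supplies`, as well as the `SAMPLE` markup, are made up for illustration; I have not looked at the printer's real HTML, so the tag layout is an assumption.

```python
import re
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Collect visible text, roughly like select-all/copy/paste in a browser."""

    SKIP = {"script", "style"}   # contents are never shown to the reader
    BREAK = {"br", "p", "div", "tr", "li", "table", "h1", "h2", "h3"}
    CELL = {"td", "th"}          # keep neighbouring table cells apart

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAK:
            self.parts.append("\n")
        elif tag in self.CELL:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BREAK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text, one non-empty line per block element."""
    parser = TextDumper()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop blank lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# The "regex sledgehammer": \s+ between fields means it does not care
# whether the fields ended up on one line or several.
SUPPLY_RE = re.compile(
    r"(?P<name>[A-Z][A-Z ]+?)\s+HP Part Number: (?P<part>HP \w+)\s+"
    r"(?P<pct>\d+)%\s+Estimated Pages Remaining: (?P<pages>\d+)"
)


def supplies(text):
    """Return (name, part number, percent left, pages left) per supply."""
    return [
        (m.group("name"), m.group("part"), int(m.group("pct")), int(m.group("pages")))
        for m in SUPPLY_RE.finditer(text)
    ]


# Hypothetical markup, NOT the printer's actual page.
SAMPLE = """
<html><head><style>td { color: red }</style></head><body>
<table>
<tr><td><b>BLACK CARTRIDGE</b><br>HP Part Number: HP C9720A</td><td>73%</td></tr>
<tr><td>Estimated Pages Remaining:</td><td>11025</td></tr>
<tr><td><b>TRANSFER KIT</b><br>HP Part Number: HP C9724A</td><td>87%</td></tr>
<tr><td>Estimated Pages Remaining:</td><td>103856</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    for row in supplies(html_to_text(SAMPLE)):
        print(row)
```

The same `supplies` function can be pointed at text pasted by hand from the browser, since the pattern only relies on the field labels and not on where the line breaks fall.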