Dearest Monks
I am writing a couple of web-scraping tools to help with my job search. I already have something working, but I am still missing a nice pure-Perl solution that formats a web page as plain text, so that if an announcement is removed for any reason, I still have a copy of its contents.
Hence the question: is there anything like lynx -dump in Perl? I dug into CPAN for about half an hour and tried html2text, but it didn't really do a good job...
For the few of you who don't know what lynx is and what it does:
NAME
    lynx - a general purpose distributed information browser for
    the World Wide Web
...
DESCRIPTION
    Lynx is a fully-featured World Wide Web (WWW) client for
    users running cursor-addressable, character-cell display
    devices (e.g., vt100 terminals, vt100 emulators running on
    Windows 95/NT or Macintoshes, or any other "curses-oriented"
    display).
...
OPTIONS
    ...
    -dump  dumps the formatted output of the default document or
           one specified on the command line to standard output.
           This can be used in the following way:

           lynx -dump http://www.subir.com/lynx.html
Thanks a lot in advance for your help.
Ciao! --bronto
In theory, there is no difference between theory and practice. In practice, there is.