PerlMonks  

about retrieving and parsing html without writing on disk

by limner (Novice)
on Apr 09, 2018 at 21:26 UTC

limner has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all brother monks,

I've successfully written a Perl script that retrieves an HTML page, parses it and,
at the end, prepares a logfile from the HTML page.

To do this, the program currently does the following:

1) unlink the file from disk, if it exists
2) retrieve the correct HTML page into memory
3) write the HTML page to disk under a standard filename (file.html)
4) read the file on disk (file.html) and parse it
5) write the logfile to disk

What I would like to do is avoid writing "file.html" to disk and work only
in RAM: retrieve the page, NOT write it to disk, and parse it in memory.

The following are the program lines that do this:
$nomefile="file.html";             ### name of temporary filename
unlink $nomefile;                  ### remove the file
$url="http://www.sitename.com/pagespecial.html";
$mech->get($url);
$mech->save_content($nomefile);    ### Instr i would like to change

use WWW::Mechanize;
use HTML::TableExtract;
use HTML::Entities;
use Text::Unidecode;

$user_agent='Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13';
my $mech = WWW::Mechanize->new(agent => $user_agent);

my $headers = ['col1', 'col2', 'col3', 'col4', 'col5'];
my $table_extract = HTML::TableExtract->new(headers => $headers);
$table_extract->parse_file($nomefile);    ### Inst i would like to change
my ($table) = $table_extract->tables;

Everything works as I want, but this way, every time I parse a page
I remove and rewrite file.html in order to parse it.

How can I do everything in memory without writing the file?
Thanks, Limner

Replies are listed 'Best First'.
Re: about retrieving and parsing html without writing on disk
by LanX (Cardinal) on Apr 09, 2018 at 22:15 UTC
    hmm, I'm too busy to install the modules, but it's at least possible to open a variable for reading and writing.

    open my $fh , "<", \$cache

    so if you can operate with filehandles instead of files this should work.
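
A minimal sketch of the in-memory filehandle idea, using only core Perl. The contents of `$cache` are invented for illustration; in the real script they would be the fetched page.

```perl
use strict;
use warnings;

# Hypothetical page content; in the real script this would come from the fetch step.
my $cache = "<html>\n<body>\n<p>hello</p>\n</body>\n</html>\n";

# Opening a reference to a scalar gives a read handle backed by memory,
# not by a file on disk.
open my $fh, '<', \$cache or die "Cannot open in-memory handle: $!";

my @lines = <$fh>;    # read it exactly like an ordinary file
close $fh;

printf "read %d lines from memory\n", scalar @lines;
```

Any routine that accepts a filehandle rather than a filename can be handed `$fh` in the same way.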

    update

    HTML::Parser allows ->parse_file($fh) and even ->parse($string)

    update

    Maybe have a look at $string = $mech->content(...) from WWW::Mechanize
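
Since HTML::TableExtract is a subclass of HTML::Parser, the two hints can probably be combined: take the page as a string and hand it straight to `parse`. The sketch below is untested against the real site; the HTML and the three column names are invented, and `$html` stands in for what `$mech->content` would return.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Stand-in for $mech->content; the markup below is invented for illustration.
my $html = <<'HTML';
<html><body>
<table>
  <tr><th>col1</th><th>col2</th><th>col3</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
</table>
</body></html>
HTML

my $te = HTML::TableExtract->new(headers => ['col1', 'col2', 'col3']);
$te->parse($html);    # parse the string directly, no temporary file
$te->eof;             # tell the parser the document is complete

# Collect the rows of every table whose header row matched.
my @rows;
for my $table ($te->tables) {
    push @rows, $table->rows;
}

print join(',', map { defined $_ ? $_ : '' } @$_), "\n" for @rows;
```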

    Cheers Rolf

      Maybe have a look at $string = $mech->content(...) from WWW::Mechanize

      and maybe at HTTP::Response as well, because

      $mech->get( $uri )

      returns an object of that type.

        Good note for checking $response->code and such. Along those lines, for the OP, if you use WWW::Mechanize remember that it fails hard, dies, on any non-success responses, 400s and 500s, unless you set autocheck => 0. You also have access to the response object from the mech object with $mech->response so you don't necessarily need a new variable for it.
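
A rough sketch of that advice, assuming `autocheck => 0` and a hypothetical helper named `fetch_html` (not part of WWW::Mechanize). It only touches the network when a URL is passed on the command line.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: returns the page body as a string, or nothing on failure.
sub fetch_html {
    my ($mech, $url) = @_;

    $mech->get($url);                  # does not die, because autocheck is off
    my $response = $mech->response;    # HTTP::Response of the last request

    unless ($mech->success) {
        warn sprintf "GET %s failed: %s\n", $url,
            $response ? $response->status_line : 'no response';
        return;
    }
    return $mech->content;             # page body, never written to disk
}

my $mech = WWW::Mechanize->new(autocheck => 0);

# Only fetch when a URL is given as the first argument.
if (my $url = shift @ARGV) {
    my $html = fetch_html($mech, $url);
    print defined $html
        ? length($html) . " characters fetched\n"
        : "nothing fetched\n";
}
```

With `autocheck` left at its default, the `unless` branch would never run because `get` would die first; wrapping the call in `eval` would be the alternative.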

Re: about retrieving and parsing html without writing on disk
by marto (Cardinal) on Apr 11, 2018 at 09:24 UTC
Re: about retrieving and parsing html without writing on disk
by learnedbyerror (Monk) on Apr 15, 2018 at 19:03 UTC

    The short answer is yes, you can. I don't use the exact parsing utilities that you are using, but I routinely use WWW::Mechanize to fetch pages and parse the content in memory.

    Something like the following should work for you. NOTE: I did not test this exact code

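    use HTML::TableExtract;
    use WWW::Mechanize;

    my $user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13';
    my $url = "http://www.sitename.com/pagespecial.html";    # URL from the original question

    # autocheck => 0 makes failed requests return instead of die
    my $mech = WWW::Mechanize->new( autocheck => 0, agent => $user_agent );
    $mech->get($url);

    if ( $mech->success ) {
        my $html_string = $mech->content;    # page body as a string, nothing written to disk
        my $headers = ['col1', 'col2', 'col3', 'col4', 'col5'];
        my $te = HTML::TableExtract->new( headers => $headers );
        $te->parse($html_string);            # parse the string instead of a file
        $te->eof;                            # signal end of document to the parser
        my @tables = $te->tables;
    }
    ...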

    lbe
