PerlMonks  

Problem while using WWW::Mechanize module for getting html

by yujong_lee (Novice)
on Mar 04, 2020 at 10:37 UTC (#11113758)

yujong_lee has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I'm a student studying programming. I have experience programming in C and Perl, and I know most of the syntax and concepts of both languages. However, I have no experience building my own project, and I have only basic knowledge of networks.

So I started my first project using WWW::Mechanize. My goal is to get a list of titles and URLs from a bulletin board for a given period. To start with, I tried to get the HTML from this site: http://hiphople.com/kboard (it's a Korean site).

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get( "http://hiphople.com/kboard" );
print $mech->content();

But the output is ���w�Ʊ0��tN��ӆR#�... WWW::Mechanize uses UTF-8 by default, and the target site's HTML header says it uses UTF-8 too, so it's not an encoding problem. I then found that the target site uses gzip (I saw it in the HTTP response header).

To solve this, I first tried WWW::Mechanize::GZip. But its documentation says "If the webserver does not support gzip-compression, no decompression will be made." I guess the http://hiphople.com/kboard web server does not support gzip compression, because it doesn't work.

So I tried to decompress the response myself, without relying on the web server. The code below is my attempt to do that.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use IO::Uncompress::Gunzip qw(gunzip);
my $mech = WWW::Mechanize->new( autocheck => 1 );
my $responce = $mech->get( "http://hiphople.com/kboard" );
my $output = "file1.txt";
gunzip $responce => $output;

But the result is "IO::Uncompress::Gunzip::gunzip: illegal input parameter". I guess it's because $responce is not in .gz format after passing through Mechanize, and that's all I can guess. I don't know what to do.

So this is what I have run into while trying to get a simple HTML file from the site I want. Getting an HTML file is the first step of my project, and it has been hard to achieve. Can anyone help me?

P.S. My English is not that great, so I'm afraid this was difficult for you to read. Sorry for that.

Replies are listed 'Best First'.
Re: Problem while using WWW::Mechanize module for getting html
by marto (Archbishop) on Mar 04, 2020 at 10:49 UTC

    Welcome! When I run your program I get the page content dumped, including some Korean text.

    Google Translate tells me that the Korean text means "We go through security procedures to prevent automatic registration", so they likely don't want you to scrape their site.

    What is a "wide character"? explains why we see the warning printed above, and what to do about it.
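    As a rough illustration of the usual fixes for a "Wide character" warning, here is a minimal sketch that needs no network access. The helper name to_utf8_bytes and the two sample Hangul characters are made up for this example; either encode the string to bytes yourself, or put an encoding layer on the output handle.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a string of Perl characters into UTF-8 bytes
# that can be printed without a "Wide character" warning.
sub to_utf8_bytes {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

# Two sample Hangul characters, written as code points.
my $chars = "\x{c790}\x{b3d9}";
my $bytes = to_utf8_bytes($chars);

printf "%d characters, %d bytes\n", length $chars, length $bytes;

# Alternative: add an encoding layer to STDOUT once,
# then print character strings directly.
binmode STDOUT, ':encoding(UTF-8)';
print "$chars\n";
```

    Which of the two you pick is mostly a matter of taste; the layer on STDOUT is convenient when the whole script prints text.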

Re: Problem while using WWW::Mechanize module for getting html
by cavac (Curate) on Mar 04, 2020 at 14:43 UTC

    GZip transfer encoding depends on the Client sending an "Accept-Encoding" header in the request which has to contain the string "gzip". (Other compression schemes like bzip2 are also possible).

    If the server supports gzip and the client has requested it, the server *may* decide to send the BODY of the response compressed as a gzip stream (depending on things like if the file is compressible and if the server wants to spend CPU resources to reduce network load at this point in time). To do this, it adds a "Content-Encoding" header in the response with the value set to "gzip".

    From what I remember, ye olde WWW::Mechanize doesn't send any Accept-Encoding header, which is what gets it into trouble sometimes. Let me quote from RFC7231, page 41, Chapter "5.3.4 Accept-Encoding", sub-paragraph 1:

    If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent.

    Here is the link: https://tools.ietf.org/html/rfc7231#page-41

    This is what can get WWW::Mechanize in trouble, because the server MAY decide to use gzip, bzip2 or whatever in the reply. If you use WWW::Mechanize::GZip, which *does* send the correct header, the server is only allowed to either send uncompressed or gzip compressed, and WWW::Mechanize::GZip understands both as far as i remember. It's just the more reliable option.
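    To see what such a request header looks like, here is a small sketch that only builds a request and prints it, without sending anything. It assumes the HTTP::Message distribution is installed (it comes along with LWP and WWW::Mechanize); the helper name build_request is made up for this example.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Message;

# Hypothetical helper: build (but do not send) a GET request that
# advertises the content-codings this Perl installation can decode.
sub build_request {
    my ($url) = @_;

    # In scalar context, decodable() returns a comma-separated string
    # of the encodings that decoded_content() can handle locally.
    my $decodable = HTTP::Message::decodable();

    return HTTP::Request->new(
        GET => $url,
        [ 'Accept-Encoding' => $decodable ],
    );
}

my $req = build_request('http://hiphople.com/kboard');
print $req->as_string;
```

    The exact list of encodings printed depends on which decompression modules are installed on your machine.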

    BTW, this kind of compression in transit (the Content-Encoding above) isn't the same as a "file format", so you won't download a .gz file and unzip it. Instead, the content just gets gzipped on the server side for sending over the network, then it gets automatically decompressed by the client library before it gets handed (uncompressed) to the client. This is just to speed up the transfer. In practice, your script should not even notice (or care) that this compression magic is going on in the background to save network bandwidth.

    perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'
Re: Problem while using WWW::Mechanize module for getting html
by NERDVANA (Beadle) on Mar 05, 2020 at 01:18 UTC
    You probably want $mech->response->decoded_content.

    $mech->content is a shortcut for $mech->response->content which is the raw bytes returned from the web server. See HTTP::Response for details; decoded_content applies any character encoding declared in the headers.

    —EDIT—

    So actually, $mech->content should be giving you decoded_content. I think I misunderstood the problem.

    When I run your example program, I get the same result as marto. It prints HTML, and there is no gzip data that I can see, and only a little bit of Korean text that could possibly run into encoding issues. If it displays garbled gzip bytes for you, then there must be some problem with your Perl environment, or a locale setting that is breaking things somehow. Are you accessing it through an HTTP proxy?
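    To make the difference between raw and decoded content concrete, here is a sketch that builds a gzip-compressed HTTP::Response by hand, so it runs without touching the network. The helper name fake_gzip_response and the sample HTML are invented for this example; it shows HTTP::Response's own content and decoded_content methods, not what any particular server sends.

```perl
use strict;
use warnings;
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical helper: build a fake gzip-compressed response by hand.
sub fake_gzip_response {
    my ($html) = @_;

    # Compress from one in-memory buffer into another.
    gzip(\$html => \my $compressed)
        or die "gzip failed: $GzipError";

    return HTTP::Response->new(
        200, 'OK',
        [
            'Content-Type'     => 'text/html; charset=utf-8',
            'Content-Encoding' => 'gzip',
        ],
        $compressed,
    );
}

my $res = fake_gzip_response('<html><body>hello</body></html>');

# content() is the body as transferred; decoded_content() undoes the
# Content-Encoding and then the charset from Content-Type.
printf "raw content:     %d bytes (gzip data)\n", length $res->content;
printf "decoded content: %s\n", $res->decoded_content;
```

    If printing the raw content of a real response shows binary junk while decoded_content shows HTML, the body really did arrive compressed.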

Re: Problem while using WWW::Mechanize module for getting html
by pmqs (Friar) on Mar 05, 2020 at 08:47 UTC

    The other contributors to this thread have suggested better ways to achieve what you need, but I thought I'd point out a couple problems with how you are using the gunzip function.

    Here is a quick reminder of the relevant part of your code.
    my $responce = $mech->get( "http://hiphople.com/kboard" );
    my $output = "file1.txt";
    gunzip $responce => $output;
    The first problem is you need to get the payload data from the response object. That can be done with $responce->content

    The second issue is how you are using gunzip. If the first parameter is a standard scalar variable, gunzip assumes it is a filename. That isn't what you have. You have an in-memory buffer. To get gunzip to read that buffer, you pass a scalar reference. In this case you do that by prefixing $responce with a backslash.

    Putting those two parts together, the gunzip line becomes this:
    gunzip \$responce->content => $output;
    When I made that change to your code, it returned this content:
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko" lang="ko"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><style type="text/css">body { width:100%; height:100%; } .wrap { position:fixed; top:50%; left:50%; margin:-185px 0 0 -315px; width:630px; height:370px; } h1 {margin: 0 0 20px; font-size: 15pt;}</style></head><body><script type="text/javascript" src="/cupid.js" ></script><script>function toNumbers(t){var e=[];return t.replace(/(..)/g,function(t){e.push(parseInt(t,16))}),e}function toHex(){for(var t=[],t=1==arguments.length&&arguments[0].constructor==Array?arguments[0]:arguments,e="",o=0;o<t.length;o++)e+=(16>t[o]?"0":"")+t[o].toString(16);return e.toLowerCase()}function getUrlParams(){var t={};return window.location.search.replace(/[?&]+([^=&]+)=([^&]*)/gi,function(e,o,r){t[o]=r}),t}var a=toNumbers("e89bdf97de8b4fd9fbb5729c74357328"),b=toNumbers("041f9966c585822241063e5f88c891d1"),c=toNumbers("b56e9ff4cacfbf3311ee9ae970a2a2c5"),now=new Date,time=now.getTime();time+=864e5,now.setTime(time),document.cookie="CUPID="+toHex(slowAES.decrypt(c,2,a,b))+"; expires="+now.toUTCString()+"; path=/",oParams=getUrlParams(),nCkattempt=0,oParams.ckattempt&&(nCkattempt=parseInt(oParams.ckattempt)),nCkattempt<3&&(location.href="http://hiphople.com/kboard?ckattempt=1");</script><div class="wrap"><div align="center"><h1>\uc790\ub3d9\ub4f1\ub85d\ubc29\uc9c0\ub97c \uc704\ud574 \ubcf4\uc548\uc808\ucc28\ub97c \uac70\uce58\uace0 \uc788\uc2b5\ub2c8\ub2e4.</h1><p>Please prove that you are human.</p><form action="/___verify" method="POST"><input type="submit" value=" OK "></form></div></div></body></html>
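    For anyone who wants to try the scalar-reference form of gunzip without hitting the site, here is a self-contained sketch. It compresses a small string first as a stand-in for $responce->content; the helper name gunzip_buffer and the sample data are made up for this example.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: decompress gzip data held in memory and
# return the result as a string.
sub gunzip_buffer {
    my ($compressed) = @_;

    # Scalar refs on both sides: read from a buffer, write to a buffer.
    # A plain scalar would be treated as a filename instead.
    gunzip(\$compressed => \my $plain)
        or die "gunzip failed: $GunzipError";

    return $plain;
}

# Stand-in for $responce->content: compress a small string ourselves.
my $original = "<html><body>hello</body></html>\n";
gzip(\$original => \my $payload)
    or die "gzip failed: $GzipError";

print gunzip_buffer($payload);
```

    Checking the return value and $GunzipError is worth the extra line, since gunzip returns false on failure rather than dying on its own.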
Re: Problem while using WWW::Mechanize module for getting html
by Aldebaran (Deacon) on Apr 18, 2020 at 20:51 UTC
    So I started my first project using WWW::Mechanize. My goal is to get a list of titles and URLs from a bulletin board for a given period. To start with, I tried to get the HTML from this site: http://hiphople.com/kboard (it's a Korean site).

    This is interesting as I am working in this same area today. A great tool that WM has is its mech-dump script. It helps to see what you're up against:

    $ mech-dump -all http://hiphople.com/kboard >1.kboard.txt
    $ cat 1.kboard.txt | more
    --> Headers:
    Cache-Control: no-cache
    Connection: close
    Date: Sat, 18 Apr 2020 19:35:46 GMT
    Server: nginx
    Vary: Accept-Encoding
    Content-Encoding: gzip
    Content-Type: text/html
    Expires: Thu, 01 Jan 1970 00:00:01 GMT
    Client-Date: Sat, 18 Apr 2020 19:35:46 GMT
    Client-Peer: 1.234.1.230:80
    Client-Response-Num: 1
    Client-Transfer-Encoding: chunked
    --> Forms:
    POST http://hiphople.com/___verify
    <NONAME>= OK (submit)
    --> Links:
    --> Images:
    $

    Pretty sparse for mech-dump...it would seem that there is almost nothing there according to WM. When you set a browser on it, you see that it is loaded with javascript, and that's when this tale changes. You might have a look at your own documentation by imitating this:

    $ locate Mechanize.pm
    /usr/local/share/perl/5.26.1/WWW/Mechanize.pm
    $ cd /usr/local/share/perl/5.26.1/WWW
    $ ls
    Mechanize  Mechanize.pm
    $ cd Mechanize/
    $ pwd
    /usr/local/share/perl/5.26.1/WWW/Mechanize
    $ ls
    Chrome     Cookbook.pod  FAQ.pod  Image.pm  Pluggable     Plugin
    Chrome.pm  Examples.pod  GZip.pm  Link.pm   Pluggable.pm
    $ perldoc FAQ.pod | more
    NAME
        WWW::Mechanize::FAQ - Frequently Asked Questions about WWW::Mechanize
    VERSION
        version 1.96
    How to get help with WWW::Mechanize
        If your question isn't answered here in the FAQ, please turn to the
        communities at:
        * StackOverflow
          <https://stackoverflow.com/questions/tagged/www-mechanize>
        * #lwp on irc.perl.org
        * <http://perlmonks.org>
        * The libwww-perl mailing list at <http://lists.perl.org>
    JavaScript
      I have this web page that has JavaScript on it, and my Mech program doesn't work.
        That's because WWW::Mechanize doesn't operate on the JavaScript. It only
    $

    In my opinion, there are 2 things you need to do in order to get Perl to operate on this site for translation to English. 1) You need to use something that can make heads or tails of JavaScript, like WMC (WWW::Mechanize::Chrome) or WMF (WWW::Mechanize::Firefox). 2) To use the site as they intend, you need to log in. Do you have an account with them?

    P.S My English is not that great.

    You did better than great; you did fine.

    Best Wishes,

Node Type: perlquestion [id://11113758]
Approved by marto
Front-paged by haukex