PerlMonks
Reading PerlMonks offline

by zjunior (Beadle)
on Mar 30, 2002 at 00:22 UTC ( [id://155366]=monkdiscuss )

Hello Monks. Perhaps this idea will sound odd; sorry in advance.

I access the net via modem, and I read PM a lot. Many of the nodes here (most, actually) aren't meant to be merely read. They need to be studied, word by word. Savored. I'm not joking or exaggerating; this is serious.

The problem is that my connection time doesn't allow me to stay and study in the Monastery as long as I would like as I begin my Perl journey. This got me thinking about writing a sort of parser for the nodes. It would work like this: I spot a node and decide it deserves study; I give its ID to the proposed program, and it fetches the node and shows it to me, or saves it for later use.

I know some monks (myself included) have already used the browser's "Save Page..." feature, or hit the "p" key in lynx, to save some pearls. But that isn't enough for me.

So I come to you, fellow monks, seeking clarity. Do you have any tips for reading, studying, saving, and organizing nodes and their respective follow-ups for personal use? Does PM have an XML interface or something similar to simplify this? Or will you just think, "That's your problem. Go get a T1." :) ?

-- zjunior

Edited Sun Mar 31 22:17:28 2002 (UTC) by footpad

Replies are listed 'Best First'.
Re: Reading PerlMonks offline
by Zaxo (Archbishop) on Mar 30, 2002 at 00:56 UTC

    There are some practical approaches you can try, but I see two basic difficulties with your core idea.

    ...I look to a node and discover that it needs to be studied. I give the ID to the supost program, and it fetches the node and show me,...

    That suggests a solution based on heuristics about authors and titles and the tree of replies, but prevents decisions based on the content.

    The larger problem is that you want machine evaluation of human expression -- understanding of speech. That holy grail has been drifting just out of reach for a very long time.

    PM does have XML feeds, including 'newest nodes', so heuristics like I mentioned are possible in limited bandwidth.
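    Assuming the feeds work like ordinary nodes with displaytype=xml appended (as discussed elsewhere in this thread), building the fetch URL for such a feed might look like the sketch below. The node title 'Newest Nodes' and the helper name are assumptions for illustration; URI ships alongside LWP.

    use strict;
    use warnings;
    use URI;

    # Hypothetical sketch: build the URL for an XML view of a node
    # by title. displaytype=xml asks PerlMonks for machine-readable
    # output, which keeps the download small on a modem link.
    sub xml_feed_url {
        my ($node_title) = @_;
        my $uri = URI->new('http://perlmonks.org/index.pl');
        $uri->query_form( node => $node_title, displaytype => 'xml' );
        return $uri->as_string;
    }

    print xml_feed_url('Newest Nodes'), "\n";
    # http://perlmonks.org/index.pl?node=Newest+Nodes&displaytype=xml

    The resulting URL can then be handed to LWP (or even wget) and the small XML response filtered by author or title before deciding which full nodes to fetch.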

    After Compline,
    Zaxo

Re: Reading PerlMonks offline
by zengargoyle (Deacon) on Mar 30, 2002 at 00:53 UTC

    Set up a Squid Web Proxy Cache or something similar. Configure it so that it caches www.perlmonks.org forever.

    Browse as normal, every page you see will be cached.

    Clean it up every once in a while.
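    A minimal squid.conf fragment for this approach might look like the following. This is a sketch, not a tested config: directive spellings vary across Squid versions, the refresh times are arbitrary, and the ACL name is made up.

    # Mark perlmonks.org traffic so it can be cached aggressively.
    acl perlmonks dstdomain .perlmonks.org
    cache allow perlmonks

    # Keep matched objects for up to a year (525600 minutes), ignoring
    # server freshness hints -- fine for archiving, wrong for live data.
    refresh_pattern -i perlmonks\.org 525600 100% 525600 override-expire override-lastmod

    Pages already in the cache are then served locally, so re-reading a node costs no connection time at all.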

(jeffa) Re: Reading PerlMonks offline
by jeffa (Bishop) on Mar 30, 2002 at 19:04 UTC
    Hello. I wrote this script a few months ago - it is ugly and nasty and doesn't even grab the author's name. But it works ;) Just pipe the output to a file and load that file in your browser.
    use LWP;
    use HTTP::Request::Common;
    use strict;

    use constant URL => 'http://perlmonks.org/index.pl';

    $|++;
    my $node = shift;
    print "USAGE: $0 [node_id]\n" and exit unless $node;

    my $ua = LWP::UserAgent->new;
    $ua->agent('node_grabber/1.0 (' . $ua->agent . ')');

    my $request  = POST(URL, Content => [ node_id => $node ]);
    my $response = $ua->request($request) or die "can't download id $node";
    my $html     = $response->content();

    my ($date)  = $html =~ /a>\s*on\s*([^<]+?)<\/font/i;
    my ($title) = $html =~ /<title>([^<]+)<\/title>/;

    my $chunk;
    my @ends = (
        qr|<BR>\s*<hr\s*\/>|,
        qr|<BR>\s+<BR><font size=2><I>|,
        qr|<CENTER>\s+back to <A HREF="|,
        qr|<BR>go see more <A HREF="|,
        qr|<center><TABLE width=|,
    );
    foreach (@ends) {
        ($chunk) = $html =~ /<INPUT type=hidden name=op value=vote>(.*)(?:$_)/ms;
        last if $chunk;
    }
    unless ($chunk) {
        ($chunk) = $html =~ /<!--\s+-->(.*)(?:<!--\s+-->)/ms;
    }
    print "couldn't parse it :(\n" and exit unless $chunk;
    print "$title [$date]\n", $chunk;
    I originally tried to solve this problem with HTML::TokeParser but failed miserably. So, I just used some very fragile regexes instead. If anyone feels the need to improve this, please be my guest. :)

    P.S. an RSS feed that grabbed just a node sure would be nice. :)

    UPDATE (April 2, 2002): Kanji just /msg'ed me about displaytype=xml (I could have sworn that was broken some time ago). This makes the above code not only overkill, but just plain silly. I'll be working on a replacement this morning. And there it is below ---> ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      This is soooo much better. It should last a lot longer without maintenance as well.
      use LWP;
      use XML::Simple;
      use HTTP::Request::Common;
      use strict;

      use constant URL => 'http://perlmonks.org/index.pl';

      my $node = shift;
      print "USAGE: $0 [node_id]\n" and exit unless $node;

      my $ua = LWP::UserAgent->new;
      $ua->agent('node_xml_grabber/1.0 (' . $ua->agent . ')');

      my $request  = POST(URL, Content => [
          node_id     => $node,
          displaytype => 'xml',
      ]);
      my $response = $ua->request($request) or die "no download id $node";
      my $content  = $response->content();

      my $xml  = XMLin($content) or die "xml error!";
      my $date = scalar localtime($xml->{ucreatetime}->{content});

      print <<EOF;
      $xml->{title}->{content}
      by [$xml->{author_user}->{content}] on $date

      $xml->{doctext}->{content}
      EOF
      Big thanks to Kanji :)

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
