Some light PerlMonks reading by the campfire

by hacker (Priest)
on Feb 18, 2007 at 17:02 UTC ( [id://600714] : perlmeditation )

OK, now that I've got your attention: what I'm wondering is the best way to capture a large chunk of PM for reading "offline", where there is no net connectivity (think WAY offline, as in no power for hundreds of miles), for potentially months at a time.

I'm going to start small, just to see if it's even feasible, and then expand it. I've done some similar projects in this area over the last few years, which have been quite successful.

I also looked around the monastery here, and found these somewhat-relevant nodes:

There are quite a few useful replies in there, some of them referencing ThePen (which is down as I type this). Some talk about spidering the site, others about converting from XML to HTML, and others about just pulling a database dump and reusing that.

Ideally, the node tables and replies would be dumped to some form of XML, as Wikimedia projects do. They have a tool called mwdumper (written in Java) that will take the XML export and pump it back into MySQL. I just did this for the latest Wikipedia database this weekend: it was over 4.5 million separate rows and took 20 hours to import, whew!

But it doesn't have to be that complex... even just the XML dumps, with some sort of linking to each of the replies, would be perfect.
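
To make the "XML dumps with linking to the replies" idea concrete, here is a minimal sketch of turning such a dump into an offline-readable thread outline. Everything about the dump format is an assumption for illustration: the `<nodes>` root, the `<node>` element, and its `id`, `parent`, `author`, and `title` attributes are hypothetical, and the real PerlMonks XML views may look quite different. The sample data is made up as well.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical dump format and made-up sample data; the real XML may differ.
SAMPLE_DUMP = """
<nodes>
  <node id="1" author="alice" title="Root post">root text</node>
  <node id="2" parent="1" author="bob" title="Re: Root post">first reply</node>
  <node id="3" parent="1" author="carol" title="Re: Root post">second reply</node>
  <node id="4" parent="2" author="alice" title="Re^2: Root post">nested reply</node>
</nodes>
"""

def load_nodes(xml_text):
    """Parse the dump and return (nodes_by_id, child_ids_by_parent_id)."""
    root = ET.fromstring(xml_text)
    nodes = {}
    children = defaultdict(list)
    for elem in root.iter("node"):
        node_id = elem.get("id")
        parent_id = elem.get("parent")  # None for a top-level post
        nodes[node_id] = {
            "title": elem.get("title", ""),
            "author": elem.get("author", ""),
            "parent": parent_id,
            "text": (elem.text or "").strip(),
        }
        if parent_id is not None:
            children[parent_id].append(node_id)
    return nodes, children

def render_thread(node_id, nodes, children, depth=0):
    """Return the thread rooted at node_id as indented plain-text lines."""
    node = nodes[node_id]
    indent = "    " * depth
    lines = ["{}{} (by {})".format(indent, node["title"], node["author"])]
    for child_id in children.get(node_id, []):
        lines.extend(render_thread(child_id, nodes, children, depth + 1))
    return lines

nodes, children = load_nodes(SAMPLE_DUMP)
print("\n".join(render_thread("1", nodes, children)))
```

Keeping a parent-to-children index alongside the nodes is what provides the "linking": any reply can be reached from its root without network access, and the same index could just as well drive static HTML pages instead of plain text.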

Now I can also spider ThePen during off-hours (when it comes back online) and store the plain HTML that way, but that introduces load, latency, bandwidth issues and so on. I'd rather avoid that strain on someone else's server, because I know what it's like when someone does it to my public servers.
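
If spidering does end up being the fallback, the load concern can be reduced by fetching one page at a time with a pause between requests. The following is only a sketch under stated assumptions: the function names, the 10-second delay, and the User-Agent string are all hypothetical choices, not anything ThePen or PerlMonks prescribes, and a real crawler should also honor the site's robots.txt and any stated crawling policy.

```python
import time
import urllib.request

def http_get(url, user_agent="offline-reader-sketch/0.1"):
    """Fetch one URL and return its body as bytes (hypothetical User-Agent)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()

def polite_fetch(urls, fetch=http_get, delay_seconds=10.0, sleep=time.sleep):
    """Fetch urls sequentially, pausing delay_seconds between requests.

    Returns a dict mapping each URL to its body. A URL that has already
    been fetched is skipped, so no page is requested twice in one run.
    """
    pages = {}
    for url in urls:
        if url in pages:
            continue
        if pages:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages
```

The `fetch` and `sleep` parameters are injectable so the pacing logic can be exercised without touching the network; calling `polite_fetch(list_of_urls)` with the defaults performs real HTTP requests spaced 10 seconds apart.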

Has there been any movement on the implementation of "nodeballs" in PM yet? The Everything Engine powering PerlMonks supports it, so I guess it's just a matter of a consensus, a vote, and enabling it?

What say ye?

Replies are listed 'Best First'.
Re: Some light PerlMonks reading by the campfire
by Corion (Patriarch) on Feb 18, 2007 at 17:53 UTC

    I think the best/most usable solution for you would be to get a MySQL dump of the node table. This circumvents all the XML trickery and other problems that arise from the transfer to and from XML. If you are really keen on getting XML, g0n maintains an XML mirror of most newer nodes.

    Perlmonks is "based" on the Everything Engine but the developments have diverged into different directions and the two engines are basically incompatible, so I'm not sure that nodeballs could be enabled here easily.

Re: Some light PerlMonks reading by the campfire
by bart (Canon) on Feb 18, 2007 at 18:14 UTC
    Just two references you seem to have overlooked:
    1., "Perlmonks with some bits missing"
    2. katterbox, a Java client that contained an offline browser when I last looked, which wasn't recently. The project has been discontinued, but it might still work.
Re: Some light PerlMonks reading by the campfire
by planetscape (Chancellor) on Feb 19, 2007 at 08:26 UTC
Re: Some light PerlMonks reading by the campfire
by wolfger (Deacon) on Feb 22, 2007 at 13:36 UTC
    no power for hundreds of miles, for potentially months at a time

    What the heck is a Perl monger doing in a condition like that??!?

      What the heck is a Perl monger doing in a condition like that??!?

      Teaching other neophytes the benefits of Perl and Linux, of course.. can't become a Master Jedi without becoming a young Padawan first, now can we?
