PerlMonks  

Okay, here's the background: there's a website that I use, which in general is quite good and very useful. It's called Lang-8. The basic idea is, you write journal entries in the language you're studying, and native speakers post comments and corrections. In turn, you post comments and corrections to entries they've written in your native language. The idea is good, and the site has a lot of really useful features.

One feature it doesn't have, unfortunately, is a really good search capability...

In particular, I wanted to be able to search through the comments and corrections I've made in the past. People coming to English from the same linguistic background tend to make some of the same mistakes (e.g., Japanese people seem to have trouble learning the correct use of the English phrase "after all", which, admittedly, is somewhat idiomatic). So several times I've run into situations where I remembered having explained a particular thing in some detail before, with examples. Being the lazy person that I am, I wanted to have a look at that previous explanation and possibly copy and paste some or all of it in response to someone else who was asking about the same thing, or who made the same mistake.

So I wanted to search my past corrections and comments, but the site doesn't seem to have a way to do that. I can search my own journal entries, but that doesn't solve my problem. I thought about Google's site-specific search, but privacy features prevent most of the journal entries, and the comments on them, from being visible to the world; Google, from the site's perspective, is the world.

So I used my virtue of laziness to create a way to quickly search through my past comments and corrections...

#!/usr/bin/perl
# -*- cperl -*-

use strict;
use warnings;
use Data::Dumper;
use WWW::Mechanize;
use HTML::TreeBuilder;

my $email = 'username' . '@' . 'example.net';
my $pass  = 'censored';

# The strings to look for come from the command line.
my (@substring) = @ARGV;
if (scalar @substring) {
  print "Looking for " . @substring . " strings.\n";
} else {
  die "You must specify one or more strings to look for.\n";
}

# Log in to the site.
my $mech = WWW::Mechanize->new();
$mech->get('http://lang-8.com/login');
$mech->submit_form(form_number => 2,
                   fields      => {username => $email, password => $pass,});

# Collect the URLs of the journal entries listed on each page of
# journals/joined, stopping at the first page that lists none.
my ($page, $done, @pagetosearch) = (1, 0);
while (not $done) {
  print "Fetching page $page...\n";
  $mech->get("http://lang-8.com/journals/joined?page=$page");
  my $content = $mech->content();
  # Dump the page to a temporary file and parse it from there.
  open OUT, '>', 'tempfile.html';
  print OUT Dumper($content);
  close OUT;
  my $tree = HTML::TreeBuilder->new();
  $tree->parse_file('tempfile.html');
  my @entry = $tree->look_down('_tag' => 'h3', "class" => 'journal_title',);
  my @url   = map { $_->look_down('_tag' => 'a')->attr('href'); } @entry;
  if (scalar @url) {
    print " * Found " . @url . " journal entries.\n";
    push @pagetosearch, @url;
    sleep 1;    # be polite to the server between index pages
    ++$page;
  } else {
    ++$done;
  }
}

# Fetch each entry and print every run of text (between tags) that
# contains one of the search strings.
for my $url (@pagetosearch) {
  print "Checking $url\n";
  $mech->get($url);
  my $content = $mech->content();
  for my $str (@substring) {
    my @match = $content =~ /([^<>]*${str}[^<>]*)/sg;
    print " * Found $str: $_\n" for @match;
  }
  select undef, undef, undef, 0.1;    # sleep for a tenth of a second
}
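The matching step at the end of the script can be tried on its own, without logging in anywhere. What follows is only a minimal sketch under a few assumptions: the helper name find_text_matches and the sample HTML are made up for illustration, and the pattern wraps the search string in \Q...\E, which the script above does not do. In the script the string is interpolated as a regular expression, so a search term such as "(approx.)" would be read as a pattern rather than as literal text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return every run
# of text free of angle brackets in $html that contains $str literally.
sub find_text_matches {
    my ($html, $str) = @_;
    # \Q...\E escapes regex metacharacters in $str, so it matches literally.
    # [^<>]* widens the match to the surrounding text, stopping at < or >.
    return $html =~ /([^<>]*\Q$str\E[^<>]*)/sg;
}

# Made-up page content, for illustration only.
my $html = '<p>We went out after all.</p>'
         . '<p>Price: $5 (approx.)</p>'
         . '<a href="x.html">after</a>';

for my $str ('after all', '(approx.)') {
    print " * Found $str: $_\n" for find_text_matches($html, $str);
}
```

Note that the pattern only knows about angle brackets, so text inside a tag (an attribute value, say) counts as a match too. For a quick personal search tool that is probably acceptable, since a few false positives cost less than a proper parse of every page.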

One screenful of easy code, and my computer is pointing me right to my previous explanation. The first time I used it, it saved me more time than it took to write, and I know I'll be using this one again and again and again.

-- 
We're working on a multi-year set of freely redistributable Vacation Bible School materials.

In reply to Targetted Web Searching on the Client Side: A Little Programming Knowledge Can Save a Lot of Time by jonadab
