Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

A Tagcloud For Cory Doctorow

by McD (Chaplain)
on Aug 28, 2008 at 01:32 UTC ( [id://707360]=CUFP: print w/replies, xml ) Need Help??

Cory Doctorow is the well known editor of BoingBoing, a sci-fi author, and an all around digital luminary. Recently, he posted a call for help visualizing notes he's been collecting for his next book. You can read the full post here, but in a nutshell: he's got a text file full of notes, separated by blank lines. Each note has a series of tags on the end, each tag is indicated by an "@" prefix.

Here's a sample note that's been tagged with the "foo," "bar," and "baz" tags:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. @foo @bar @baz
The actual requirement is given in the post:
"I'm looking for something that'll parse out the tags at the end of the lines and then make a tag-cloud out of them, and let me click on tags to retrieve them, as well as searching the fulltext of all the notes."

Naturally, I immediately thought about how easy this would be in Perl. Parsing the tags is right in the wheelhouse. A quick bit of surfing taught me what a tag cloud was, and lead me to the HTML::TagCloud module. And I've always wanted an excuse to play with HTTP::Server::Simple.

The comments following the post on BoingBoing quickly filled with helpful voices, but none of the answers offered seemed to quite fit the bill. After watching for a day, I was pretty sure the answer to the question was "no such tool exists" - so I took a crack at writing one, leveraging the modules I'd found.

Knowing that Cory uses a Linux desktop with Perl readily at hand, I sent him an early draft of the script to see if it was the sort of thing he was looking for. After a little polishing, we arrived at the code below, which I'm delighted to say seems to be just what he wanted. I share it here in the hopes that someone else might find it useful some day.

On the technical side, the script is pretty simple: parse the notes file into some data structures. Use those data structures to build a Tag Cloud, then run a simple web server who's only function is to display that tag cloud and/or search results. The code is slightly backwards to read, since the web server class is defined up front - the main logic really starts at "BEGIN HERE".

#!/usr/bin/perl -w use strict; use HTML::TagCloud; use HTML::Entities; # Dirt-simple web server - displays the tag cloud, and the set of all # notes that match a given tag, if provided. Also accepts requests to # search the notes, showing highlighted results. { package MyWebServer; use HTTP::Server::Simple::CGI; use base qw(HTTP::Server::Simple::CGI); sub handle_request { my $self = shift; my $cgi = shift; return if !ref $cgi; # Print out the headers, html, the tag cloud, and the search form. print "HTTP/1.0 200 OK\r\n", $cgi->header, $cgi->start_html("Tag C +loud"); print $main::cloud->html_and_css(); print $cgi->start_form(), "<p align=\"right\"> Search all notes for: ", $cgi->textfield('search_string'), $cgi->submit(-value => 'Search'), $cgi->end_form(), "<br><i>(search is case-insensitive)</i><p>"; print "<hr>"; # Now do something interesting with your params, if any. my $tag = $cgi->param('tag'); my $search_string = $cgi->param('search_string'); if ($search_string) { # Display search results my $output; # Perform same HTML encoding on the search string that we did on # the notes, so that searching for things like "<" will work. $search_string = HTML::Entities::encode($search_string); print $cgi->h1("Notes that match \"$search_string\""); # A little ugly: We're going to grep thru @all_notes looking for # a match - but we need to strip the HTML markup (which we've # added to turn tags into links) out of the notes before checkin +g # for a match, so that you don't match inside the HTML markup # while searching. Also, you need to use a temp var, because # otherwise grep will modify $_. Finally, use \Q (quotemeta) +- # we don't want full patterns here, too much risk foreach (grep {my $t; ($t=$_) =~ s/<.*?>//g; $t =~ /\Q$search_string/i} @main::all_notes) { # We want to highlight the match in yellow, but not change the # saved copy of the note - so we work on a copy, $output. # # Regex to (roughly) match an HTML tag: <.*?> # # This s/// matches either an entire tag, or our search # string. The replacement bit is executed (/e): if $2 (our # search string) has matched, wrap it in yellow. # Otherwise, $1 (a tag) is what matched, and it gets # replaced with itself. # # The /e is used instead of just saying "$1$2" (with $2 wrappe +d # in yellow) because that produces endless warnings about use # of undefined values - because only one of the two alternates # is ever defined in the replacement bit. ($output = $_) =~ s{(<.*?>)|($search_string)} {$2 ? "<b><FONT style=\"BACKGROUND-COLOR: yellow\">$2</FONT></b>" : $1 +}eig; print $output, "<p>"; } } elsif ($tag) { # Display notes that match "$tag" print $cgi->h1("Notes tagged with \"$tag\""); foreach my $ref (@{$main::lines{$tag}}) { print $$ref, "<p>"; } } print $cgi->end_html; } } # End of web server class ############ # Begin Here ############ # Parse the notes file, locating tags at the end of entries and # building up two data structures. # # Both of these structures collect "notes," references in %lines and # the actual scalar in @all_notes, which contains a note ready for # display in our HTML output. First, these notes have had HTML # elements encoded to simplify processing and make it harder to do # nasty things to the user's browser. Then the tags at the end of the # lines have been turned into links, same as are used in the tag # cloud, to enhance navigation. # # %lines # foo => [ # "note ref (tagged with foo)", # "another note ref (tagged with foo)", # ... # ] # # @all_notes - arrary of the set of all notes refered to in %lines - # in other words, every note found. Used in searching. our %lines; our @all_notes; # URL used in constructing tag-links my $url = '?tag='; # Parse notes file { local $/ = "\n\n"; # Double-newline separates input records while (<>) { # Need a copy of the "note" to work on and refer to, and we need # it with HTML chars like <, >, etc, escaped to "&lt;", "&gt;", # etc. my $this_line = HTML::Entities::encode($_); # Pop words off the end of the note, processing them as tags as # long as they start with "@". Keep a list of these tags so tha +t # we can wrap them in href's when we're done picking them out. my @words = split; my @tags = (); # tags found at the end of this note while (my $tag = pop @words) { last if $tag !~ /^\@/; # Not a tag, bail $tag =~ s/^@//; # Trim the "@" push @tags, $tag; push (@{$lines{$tag}}, \$this_line); } foreach my $tag (@tags) { # Greedy match in $1 insures that $2 will be the last instance # of $tag in the note - in other words, the one on the end with # the "@" prefix. And we know that each $tag was parsed off +the # end of this note, insuring this works. $this_line =~ s|(.*)\b($tag)\b|$1<a href="$url$2">$2</a>|; } push @all_notes, $this_line; } } # Build tag cloud our $cloud = HTML::TagCloud->new(levels => 24); foreach my $tag (keys %lines) { $cloud->add($tag, $url.$tag, scalar @{$lines{$tag}}); } # Start an instance of MyWebServer on port 8080, only bind to localhos +t my $pid = MyWebServer->new(8080); $pid->host('localhost'); $pid->run(); # Copyright (c) 2008, Dan McDonald. All Rights Reserved. # This program is free software. It may be used, redistributed # and/or modified under the terms of the Perl Artistic License # (see http://www.perl.com/perl/misc/Artistic.html)

Replies are listed 'Best First'.
Re: A Tagcloud For Cory Doctorow
by jwkrahn (Abbot) on Aug 28, 2008 at 13:50 UTC

    As an alternative to:

    my @words = split; my @tags = (); # tags found at the end of this note while (my $tag = pop @words) { last if $tag !~ /^\@/; # Not a tag, bail $tag =~ s/^@//; # Trim the "@" push @tags, $tag; push (@{$lines{$tag}}, \$this_line); }

    You could do it like this:

    my @tags; # tags found at the end of this note for my $tag ( reverse split ) { last unless $tag =~ s/^\@//; # Not a tag, bail push @tags, $tag; push @{ $lines{ $tag } }, \$this_line; }
      Yup, that's much cleaner. Nice!
Re: A Tagcloud For Cory Doctorow
by lwicks (Friar) on Aug 28, 2008 at 09:05 UTC
    Very cool, lets get the cape wearing, zeplin flying, Cory converted to being a Perl devotee! ;-)

    Downloaded and tested on a similarly structured text file (different purpose) I have here and it worked perfect.

    Kia Kaha, Kia Toa, Kia Manawanui!
    Be Strong, Be Brave, Be perservering!

Re: A Tagcloud For Cory Doctorow
by Anonymous Monk on Aug 28, 2008 at 03:59 UTC
    minor suggestion, the # Parse notes file loop, I would write as
    # Parse notes file { local $/ = "\n\n"; # Double-newline separates input records while (<>) { my $this_line = ""; # escape everything before @tags $this_line = HTML::Entities::encode("$1") . "$2" if /^([^@]*)(.*)/s; # url encode and html escape tags $this_line =~ s| \@( # capture to $1 [^\s@]+ # more flexible than \w+ )\b | push @{$lines{"$1"}}, \$this_line; # maybe rework following using URI my $oneEH = HTML::Entities::encode($1); my $oneE = CGI::escapeHTML(CGI::escape($1)); qq~<a href="$url$oneE">$oneEH</a>~; |gex; push @all_notes, $this_line; } }
      Hmmm.
      # escape everything before @tags $this_line = HTML::Entities::encode("$1") . "$2" if /^([^@]*)(.*)/s;
      ...only works if you assume that '@'-signs are exclusively found in tags, not the body of the notes.

      I like the one-pass s///g for both building up %lines and linkifying tags. I used split and pop because everyone in the BoingBoing comments was saying "Use a regex! This is perfect for regexes!"

      I loves me a good regex as much as the next monk, but there's a tendency to want to use them for everything once you grasp the power of 'em. I decided to use something simpler in the note parsing deliberately to counter that.

      I still managed to have some regex fun in the search string highlighting code. :-)

      Hi, I'm an end user. I'm located near the very, very end. I do legal research (working on a JD) and this would be EXACTLY what I need. The problem is that I don't understand how I would use this code to do what I need. At the risk of sounding incompetent (which I am), how would I use this on a windows (XP) machine as an executable with a nice GUI interface?
Re: A Tagcloud For Cory Doctorow
by rhenley (Initiate) on May 10, 2009 at 03:51 UTC
    I just read Cory's article about this and had to try it out. Looks great! I've seen these tag clouds used a few times before without having any idea what they were supposed to be. One question: is there any way to have it reread the "notes" file periodically without having to stop/start the script? For the curious, the article is located at http://www.locusmag.com/Perspectives/2009/05/cory-doctorow-extreme-geek.html
      One question: is there any way to have it reread the "notes" file periodically without having to stop/start the script?

      Sure. There are two changes needed: replace the "while (<>) ..." with more robust file handling, and add some code to re-read the file(s) provided when they change.

      Not huge changes, but I guess I didn't envision this being a particularly long-lived script. I was thinking more start it up, work for a few hours, shut it down.
Re: A Tagcloud For Cory Doctorow
by zecg (Initiate) on May 10, 2009 at 13:51 UTC
    A very useful thing. I'm new to Perl, how would one go about making this thing properly display utf-8 characters in notes and tags? I presume it's a problem with HTTP::server::simple?
      A very useful thing.

      I'm glad you think so!

      I'm new to Perl,

      Welcome! Hope you enjoy it.

      ...how would one go about making this thing properly display utf-8 characters in notes and tags? I presume it's a problem with HTTP::server::simple?

      THAT is an excellent question. I suspect, assuming you've got a reasonably recent version of Perl installed, that it's pretty simple. I'll go read up on the right way to specify utf8 file I/O and make sure that HTTP::server::simple can set up the proper headers, etc. Gimmie a few hours.

        Really useful indeed! Any luck with the unicode thing yet? I'd love to use this for my German language studies....
Re: A Tagcloud For Cory Doctorow
by monk0338gne83 (Initiate) on Nov 18, 2009 at 21:13 UTC

    thanks so much for this, i used your tag cloud religiously when studying for oral exams.

    i'm trying to integrate the tagcloud into my desktop so that it's always there in the background (I'm on a mac and used WebDesktop to do this).

    is there any way to format the html output so that the text background is transparent and the text itself remains opaque? this way the tags will just float above my desktop image… thanks!

      Glad you find it useful!

      is there any way to format the html output so that the text background is transparent and the text itself remains opaque?
      Transparency isn't a function of the HTML, it's a function of the web browser - either WebDesktop, or Firefox, or Safari, or whatever. So what you're really looking for is a web browser with a transparent window, ideally WebDesktop with a transparent window.

      I'm not sure how to accomplish that, but a quick search turned up a couple of OS X utilities that claim to make your windows transparent. You might experiment with something like CrystalClear Interface or Afloat and see if that gets you closer to what you want.

      Good luck!

Re: A Tagcloud For Cory Doctorow
by Anonymous Monk on May 10, 2009 at 15:20 UTC
    where does the file with your notes go?
      It's probably simplest to put your notes file wherever you have your Perl script (or your Perl script wherever you have your notes file...however you want to think of it). You have to specify the name of your notes file when you run the Perl script. I use Windows, so at the command prompt I type 'perl tagcloud.pl notes.txt'. If you want to put your notes file in one place and your script in another, just type the full path to your notes file on the command line (for example, 'perl tagcloud.pl c:\notes\notes.txt').
        thanks so much.
        is the syntax the same for a file on a server?
      Can anyone explain how to run this on a Mac?
        To start with, you have to make sure your perl installation supports the modules that the scripts uses.
        I'm doing this from Windows but in theory it should be the same for Mac.

        - Open a terminal session.
        - Run:
        ppm list
        - This will show you all of the modules that are installed.
        - If any of:
        HTML::TagCloud
        HTML::Entities
        HTTP::Server::Simple::CGI

        are missing use the command:

        ppm install <module_name_from_above>

        to install the module.

        - HTTP::Server::Simple::CGI doesn't seem to be available via ppm - I guess because it is still developmental. I had to download direct from cpan and manually install.*

        Then, when all installed quite simply:
        perl -w tagcloud.pl notes.txt


        * Manual HTTP::Server::Simple::CGI install.
        - Download from: http://search.cpan.org/dist/HTTP-Server-Simple/lib/HTTP/Server/Simple/CGI.pm
        perl Makefile.PL
        make
        make install


Re: A Tagcloud For Cory Doctorow
by Anonymous Monk on May 11, 2009 at 19:38 UTC
    There are a few line wraps in the display above, indicated by red plus signs, that need to be cleaned up if you cut and paste this.
Re: A Tagcloud For Cory Doctorow
by Anonymous Monk on Jun 04, 2009 at 19:10 UTC
    Nicely done. Works like a dream.

    It is lacking one thing, I think. What about notes where tags collide. For example, maybe I want to find all the notes that have @foo AND @baz tags.

    I tried using the search box by typing:
    @foo @baz
    but it searches for "@foo @baz". So unless these tags are added in exactly that order, the search is empty.

    Am I missing something obvious or is this perl script unable to find tag collisions? Is there a fix? Workaround?

    Have a nice day,
    Thorarin Bjarnason
      Nicely done. Works like a dream.

      Thanks! Glad you like it.

      It is lacking one thing, I think. What about notes where tags collide. For example, maybe I want to find all the notes that have @foo AND @baz tags.

      The word "collide" threw me there for a minute, but I think I understand - search for notes with two or more tags in common. The search function as implemented is a simple string match in the notes, you're thinking more along the lines of a "search engine" search. Interesting.

      Is there a fix? Workaround?

      You could change the code to make the search more in line with what you're envisioning. It would be a bit more plumbing to separate the search string into discreet terms, and then change the matching (and highlighting) code to work with a set of terms and logical conditions (AND and OR) rather than just a straight match/no match. That's a bit of work, actually.

      But there is a very simple, if ugly, workaround. First, you need to remove the \Q from the pattern matching loop here:

      # A little ugly: We're going to grep thru @all_notes looking for # a match - but we need to strip the HTML markup (which we've # added to turn tags into links) out of the notes before checkin +g # for a match, so that you don't match inside the HTML markup # while searching. Also, you need to use a temp var, because # otherwise grep will modify $_. Finally, use \Q (quotemeta) +- # we don't want full patterns here, too much risk foreach (grep {my $t; ($t=$_) =~ s/<.*?>//g; $t =~ /\Q$search_string/i} @main::all_notes) {
      (Note the comment about \Q being put in there explicitly because I thought regex searches would be too risky. We're taking the safety off, watch where you aim this thing...)

      Anyway, change that line to look like this:

      $t =~ /$search_string/i}

      ...and now you can do what you want by using the techniques outlined in The Perl Cookbook, Chapter 6.17, "Expressing AND, OR, and NOT in a Single Pattern." To search for @foo AND @bar, you'd search for:

      @foo.*@bar|@bar.*@foo

      To search for @foo OR @bar, you'd use:

      @foo|@bar

      Cavaet - if you're searching for strings not found in tags, search result highlighting will be screwy with this approach.

      So that's a quick and dirty fix. Can't say I'm in love with it, but it does do what you want, and making the search box more robust would be a good deal more work.
      Have a nice day

      Thanks, I already am. You too. :-)
        Wow - you are fast!

        This will do for me. Thanks you!

        Thorarin Bjarnason

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://707360]
Approved by GrandFather
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (3)
As of 2024-04-19 22:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found