A Tagcloud For Cory Doctorow

Cory Doctorow is the well known editor of BoingBoing, a sci-fi author, and an all around digital luminary. Recently, he posted a call for help visualizing notes he's been collecting for his next book. You can read the full post here, but in a nutshell: he's got a text file full of notes, separated by blank lines. Each note has a series of tags on the end, each tag is indicated by an "@" prefix.

Here's a sample note that's been tagged with the "foo," "bar," and "baz" tags:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. @foo @bar @baz

The actual requirement is given in the post:

"I'm looking for something that'll parse out the tags at the end of the lines and then make a tag-cloud out of them, and let me click on tags to retrieve them, as well as searching the fulltext of all the notes."

Naturally, I immediately thought about how easy this would be in Perl. Parsing the tags is right in the wheelhouse. A quick bit of surfing taught me what a tag cloud was, and lead me to the HTML::TagCloud module. And I've always wanted an excuse to play with HTTP::Server::Simple.

The comments following the post on BoingBoing quickly filled with helpful voices, but none of the answers offered seemed to quite fit the bill. After watching for a day, I was pretty sure the answer to the question was "no such tool exists" - so I took a crack at writing one, leveraging the modules I'd found.

Knowing that Cory uses a Linux desktop with Perl readily at hand, I sent him an early draft of the script to see if it was the sort of thing he was looking for. After a little polishing, we arrived at the code below, which I'm delighted to say seems to be just what he wanted. I share it here in the hopes that someone else might find it useful some day.

On the technical side, the script is pretty simple: parse the notes file into some data structures. Use those data structures to build a Tag Cloud, then run a simple web server who's only function is to display that tag cloud and/or search results. The code is slightly backwards to read, since the web server class is defined up front - the main logic really starts at "BEGIN HERE".

#!/usr/bin/perl -w
use strict;

use HTML::TagCloud;
use HTML::Entities;

# Dirt-simple web server - displays the tag cloud, and the set of all
# notes that match a given tag, if provided.  Also accepts requests to
# search the notes, showing highlighted results.
{
  package MyWebServer;

  use HTTP::Server::Simple::CGI;
  use base qw(HTTP::Server::Simple::CGI);
 
  sub handle_request {
    my $self = shift;
    my $cgi  = shift;
    return if !ref $cgi;

    # Print out the headers, html, the tag cloud, and the search form.
    print "HTTP/1.0 200 OK\r\n", $cgi->header, $cgi->start_html("Tag C
+loud");
    print $main::cloud->html_and_css();

    print $cgi->start_form(),
      "<p align=\"right\"> Search all notes for: ",
    $cgi->textfield('search_string'),
      $cgi->submit(-value => 'Search'), $cgi->end_form(),
        "<br><i>(search is case-insensitive)</i><p>";

    print "<hr>";

    # Now do something interesting with your params, if any.

    my $tag = $cgi->param('tag');
    my $search_string = $cgi->param('search_string');

    if ($search_string) {    # Display search results
      my $output;

      # Perform same HTML encoding on the search string that we did on
      # the notes, so that searching for things like "<" will work.
      $search_string = HTML::Entities::encode($search_string);

      print $cgi->h1("Notes that match \"$search_string\"");

      # A little ugly: We're going to grep thru @all_notes looking for
      # a match - but we need to strip the HTML markup (which we've
      # added to turn tags into links) out of the notes before checkin
+g
      # for a match, so that you don't match inside the HTML markup
      # while searching.  Also, you need to use a temp var, because
      # otherwise grep will modify $_.    Finally, use \Q (quotemeta) 
+-
      # we don't want full patterns here, too much risk
      foreach (grep {my $t; ($t=$_) =~ s/<.*?>//g;
                     $t =~ /\Q$search_string/i}
               @main::all_notes) {

        # We want to highlight the match in yellow, but not change the
        # saved copy of the note - so we work on a copy, $output.
        #
        # Regex to (roughly) match an HTML tag:     <.*?>
        #
        # This s/// matches either an entire tag, or our search
        # string.  The replacement bit is executed (/e): if $2 (our
        # search string) has matched, wrap it in yellow.
        # Otherwise, $1 (a tag) is what matched, and it gets
        # replaced with itself.
        #
        # The /e is used instead of just saying "$1$2" (with $2 wrappe
+d
        # in yellow) because that produces endless warnings about use
        # of undefined values - because only one of the two alternates
        # is ever defined in the replacement bit.

        ($output = $_) =~ s{(<.*?>)|($search_string)}
{$2 ? "<b><FONT style=\"BACKGROUND-COLOR: yellow\">$2</FONT></b>" : $1
+}eig;
        print $output, "<p>";
      }

    } elsif ($tag) {        # Display notes that match "$tag"
      print $cgi->h1("Notes tagged with \"$tag\"");
      foreach my $ref (@{$main::lines{$tag}}) {
        print $$ref, "<p>";
      }
    }
    print $cgi->end_html;
  }
} # End of web server class
 
############
# Begin Here
############

# Parse the notes file, locating tags at the end of entries and
# building up two data structures.
#
# Both of these structures collect "notes," references in %lines and
# the actual scalar in @all_notes, which contains a note ready for
# display in our HTML output.  First, these notes have had HTML
# elements encoded to simplify processing and make it harder to do
# nasty things to the user's browser.  Then the tags at the end of the
# lines have been turned into links, same as are used in the tag
# cloud, to enhance navigation.
#
# %lines
#  foo => [
#           "note ref (tagged with foo)",
#           "another note ref (tagged with foo)",
#            ...
#          ]
#
# @all_notes - arrary of the set of all notes refered to in %lines -
# in other words, every note found.     Used in searching.

our %lines;
our @all_notes;

# URL used in constructing tag-links
my $url = '?tag=';

# Parse notes file
{
  local $/ = "\n\n"; # Double-newline separates input records
  while (<>) {
    # Need a copy of the "note" to work on and refer to, and we need
    # it with HTML chars like <, >, etc, escaped to "&lt;", "&gt;",
    # etc.
    my $this_line = HTML::Entities::encode($_);

    # Pop words off the end of the note, processing them as tags as
    # long as they start with "@".    Keep a list of these tags so tha
+t
    # we can wrap them in href's when we're done picking them out.
    my @words = split;
    my @tags = (); # tags found at the end of this note
    while (my $tag = pop @words) {
      last if $tag !~ /^\@/; # Not a tag, bail
      $tag =~ s/^@//; # Trim the "@"
      push @tags, $tag;
      push (@{$lines{$tag}}, \$this_line);
    }

    foreach my $tag (@tags) {
      # Greedy match in $1 insures that $2 will be the last instance
      # of $tag in the note - in other words, the one on the end with
      # the "@" prefix.     And we know that each $tag was parsed off 
+the
      # end of this note, insuring this works.
      $this_line =~ s|(.*)\b($tag)\b|$1<a href="$url$2">$2</a>|;
    }

    push @all_notes, $this_line;
  }
}

# Build tag cloud 
our $cloud = HTML::TagCloud->new(levels => 24);
foreach my $tag (keys %lines) {
  $cloud->add($tag, $url.$tag, scalar @{$lines{$tag}});
}

# Start an instance of MyWebServer on port 8080, only bind to localhos
+t
my $pid = MyWebServer->new(8080);
$pid->host('localhost');
$pid->run();

#            Copyright (c) 2008, Dan McDonald. All Rights Reserved.
#        This program is free software. It may be used, redistributed
#        and/or modified under the terms of the Perl Artistic License
#             (see http://www.perl.com/perl/misc/Artistic.html)
[download]

Comment on A Tagcloud For Cory Doctorow Download Code

Replies are listed 'Best First'.
Re: A Tagcloud For Cory Doctorow by jwkrahn (Abbot) on Aug 28, 2008 at 13:50 UTC
As an alternative to: `my @words = split; my @tags = (); # tags found at the end of this note while (my $tag = pop @words) { last if $tag !~ /^\@/; # Not a tag, bail $tag =~ s/^@//; # Trim the "@" push @tags, $tag; push (@{$lines{$tag}}, \$this_line); }` [download] You could do it like this: `my @tags; # tags found at the end of this note for my $tag ( reverse split ) { last unless $tag =~ s/^\@//; # Not a tag, bail push @tags, $tag; push @{ $lines{ $tag } }, \$this_line; }` [download]	[reply] [d/l] [select]
Re^2: A Tagcloud For Cory Doctorow by McD (Chaplain) on Aug 28, 2008 at 14:01 UTC
Yup, that's much cleaner. Nice!	[reply]
Re: A Tagcloud For Cory Doctorow by lwicks (Friar) on Aug 28, 2008 at 09:05 UTC
Very cool, lets get the cape wearing, zeplin flying, Cory converted to being a Perl devotee! ;-) Downloaded and tested on a similarly structured text file (different purpose) I have here and it worked perfect. Kia Kaha, Kia Toa, Kia Manawanui! Be Strong, Be Brave, Be perservering!	[reply]
Re: A Tagcloud For Cory Doctorow by Anonymous Monk on Aug 28, 2008 at 03:59 UTC
minor suggestion, the # Parse notes file loop, I would write as # Parse notes file { local $/ = "\n\n"; # Double-newline separates input records while (<>) { my $this_line = ""; # escape everything before @tags $this_line = HTML::Entities::encode("$1") . "$2" if /^([^@])(.)/s; # url encode and html escape tags $this_line =~ s\| \@( # capture to $1 [^\s@]+ # more flexible than \w+ )\b \| push @{$lines{"$1"}}, \$this_line; # maybe rework following using URI my $oneEH = HTML::Entities::encode($1); my $oneE = CGI::escapeHTML(CGI::escape($1)); qq~<a href="$url$oneE">$oneEH</a>~; \|gex; push @all_notes, $this_line; } } [download]	[reply] [d/l]
Re^2: A Tagcloud For Cory Doctorow by McD (Chaplain) on Aug 28, 2008 at 13:03 UTC
Hmmm. `# escape everything before @tags $this_line = HTML::Entities::encode("$1") . "$2" if /^([^@])(.)/s;` [download] ...only works if you assume that '@'-signs are exclusively found in tags, not the body of the notes. I like the one-pass s///g for both building up %lines and linkifying tags. I used split and pop because everyone in the BoingBoing comments was saying "Use a regex! This is perfect for regexes!" I loves me a good regex as much as the next monk, but there's a tendency to want to use them for everything once you grasp the power of 'em. I decided to use something simpler in the note parsing deliberately to counter that. I still managed to have some regex fun in the search string highlighting code. :-)	[reply] [d/l]
Re^2: A Tagcloud For Cory Doctorow by Anonymous Monk on Jun 18, 2009 at 22:51 UTC
Hi, I'm an end user. I'm located near the very, very end. I do legal research (working on a JD) and this would be EXACTLY what I need. The problem is that I don't understand how I would use this code to do what I need. At the risk of sounding incompetent (which I am), how would I use this on a windows (XP) machine as an executable with a nice GUI interface?	[reply]
Re: A Tagcloud For Cory Doctorow by rhenley (Initiate) on May 10, 2009 at 03:51 UTC
I just read Cory's article about this and had to try it out. Looks great! I've seen these tag clouds used a few times before without having any idea what they were supposed to be. One question: is there any way to have it reread the "notes" file periodically without having to stop/start the script? For the curious, the article is located at http://www.locusmag.com/Perspectives/2009/05/cory-doctorow-extreme-geek.html	[reply]
Re^2: A Tagcloud For Cory Doctorow by McD (Chaplain) on May 10, 2009 at 18:55 UTC
One question: is there any way to have it reread the "notes" file periodically without having to stop/start the script? Sure. There are two changes needed: replace the "while (<>) ..." with more robust file handling, and add some code to re-read the file(s) provided when they change. Not huge changes, but I guess I didn't envision this being a particularly long-lived script. I was thinking more start it up, work for a few hours, shut it down.	[reply]
Re: A Tagcloud For Cory Doctorow by zecg (Initiate) on May 10, 2009 at 13:51 UTC
A very useful thing. I'm new to Perl, how would one go about making this thing properly display utf-8 characters in notes and tags? I presume it's a problem with HTTP::server::simple?	[reply]
Re^2: A Tagcloud For Cory Doctorow by McD (Chaplain) on May 10, 2009 at 19:01 UTC
A very useful thing. I'm glad you think so! I'm new to Perl, Welcome! Hope you enjoy it. ...how would one go about making this thing properly display utf-8 characters in notes and tags? I presume it's a problem with HTTP::server::simple? THAT is an excellent question. I suspect, assuming you've got a reasonably recent version of Perl installed, that it's pretty simple. I'll go read up on the right way to specify utf8 file I/O and make sure that HTTP::server::simple can set up the proper headers, etc. Gimmie a few hours.	[reply]
Re^3: A Tagcloud For Cory Doctorow by paulskin (Initiate) on May 13, 2009 at 09:39 UTC
Really useful indeed! Any luck with the unicode thing yet? I'd love to use this for my German language studies....	[reply]
Re^4: A Tagcloud For Cory Doctorow by McD (Chaplain) on May 13, 2009 at 18:38 UTC
Re: A Tagcloud For Cory Doctorow by monk0338gne83 (Initiate) on Nov 18, 2009 at 21:13 UTC
thanks so much for this, i used your tag cloud religiously when studying for oral exams. i'm trying to integrate the tagcloud into my desktop so that it's always there in the background (I'm on a mac and used WebDesktop to do this). is there any way to format the html output so that the text background is transparent and the text itself remains opaque? this way the tags will just float above my desktop image� thanks!	[reply]
Re^2: A Tagcloud For Cory Doctorow by McD (Chaplain) on Nov 30, 2009 at 15:00 UTC
Glad you find it useful! is there any way to format the html output so that the text background is transparent and the text itself remains opaque? Transparency isn't a function of the HTML, it's a function of the web browser - either WebDesktop, or Firefox, or Safari, or whatever. So what you're really looking for is a web browser with a transparent window, ideally WebDesktop with a transparent window. I'm not sure how to accomplish that, but a quick search turned up a couple of OS X utilities that claim to make your windows transparent. You might experiment with something like CrystalClear Interface or Afloat and see if that gets you closer to what you want. Good luck!	[reply]
Re^2: A Tagcloud For Cory Doctorow by ambrus (Abbot) on Nov 30, 2009 at 16:07 UTC
See also formatting html output.	[reply]
Re: A Tagcloud For Cory Doctorow by Anonymous Monk on May 10, 2009 at 15:20 UTC
where does the file with your notes go?	[reply]
Re^2: A Tagcloud For Cory Doctorow by rhenley (Initiate) on May 10, 2009 at 16:35 UTC
It's probably simplest to put your notes file wherever you have your Perl script (or your Perl script wherever you have your notes file...however you want to think of it). You have to specify the name of your notes file when you run the Perl script. I use Windows, so at the command prompt I type 'perl tagcloud.pl notes.txt'. If you want to put your notes file in one place and your script in another, just type the full path to your notes file on the command line (for example, 'perl tagcloud.pl c:\notes\notes.txt').	[reply]
Re^3: A Tagcloud For Cory Doctorow by Anonymous Monk on May 10, 2009 at 16:56 UTC
thanks so much.	[reply]
Re^3: A Tagcloud For Cory Doctorow by Anonymous Monk on May 10, 2009 at 21:34 UTC
is the syntax the same for a file on a server?	[reply]
Re^4: A Tagcloud For Cory Doctorow by rhenley (Initiate) on May 11, 2009 at 03:16 UTC
Re^2: A Tagcloud For Cory Doctorow by Anonymous Monk on May 11, 2009 at 02:33 UTC
Can anyone explain how to run this on a Mac?	[reply]
Re^3: A Tagcloud For Cory Doctorow by gnuchu (Novice) on May 11, 2009 at 10:26 UTC
To start with, you have to make sure your perl installation supports the modules that the scripts uses. I'm doing this from Windows but in theory it should be the same for Mac. - Open a terminal session. - Run: ppm list - This will show you all of the modules that are installed. - If any of: HTML::TagCloud HTML::Entities HTTP::Server::Simple::CGI are missing use the command: ppm install <module_name_from_above> to install the module. - HTTP::Server::Simple::CGI doesn't seem to be available via ppm - I guess because it is still developmental. I had to download direct from cpan and manually install.* Then, when all installed quite simply: perl -w tagcloud.pl notes.txt * Manual HTTP::Server::Simple::CGI install. - Download from: http://search.cpan.org/dist/HTTP-Server-Simple/lib/HTTP/Server/Simple/CGI.pm perl Makefile.PL make make install	[reply]
Re^4: A Tagcloud For Cory Doctorow by McD (Chaplain) on May 20, 2009 at 17:19 UTC
Re^5: A Tagcloud For Cory Doctorow by Anonymous Monk on May 23, 2009 at 17:41 UTC
Some notes below your chosen depth have not been shown here
Re^4: A Tagcloud For Cory Doctorow by Anonymous Monk on May 23, 2009 at 17:35 UTC
Re: A Tagcloud For Cory Doctorow by Anonymous Monk on May 11, 2009 at 19:38 UTC
There are a few line wraps in the display above, indicated by red plus signs, that need to be cleaned up if you cut and paste this.	[reply]
Re: A Tagcloud For Cory Doctorow by Anonymous Monk on Jun 04, 2009 at 19:10 UTC
Nicely done. Works like a dream. It is lacking one thing, I think. What about notes where tags collide. For example, maybe I want to find all the notes that have @foo AND @baz tags. I tried using the search box by typing: @foo @baz but it searches for "@foo @baz". So unless these tags are added in exactly that order, the search is empty. Am I missing something obvious or is this perl script unable to find tag collisions? Is there a fix? Workaround? Have a nice day, Thorarin Bjarnason	[reply]
Re^2: A Tagcloud For Cory Doctorow by McD (Chaplain) on Jun 05, 2009 at 13:08 UTC
Nicely done. Works like a dream. Thanks! Glad you like it. It is lacking one thing, I think. What about notes where tags collide. For example, maybe I want to find all the notes that have @foo AND @baz tags. The word "collide" threw me there for a minute, but I think I understand - search for notes with two or more tags in common. The search function as implemented is a simple string match in the notes, you're thinking more along the lines of a "search engine" search. Interesting. Is there a fix? Workaround? You could change the code to make the search more in line with what you're envisioning. It would be a bit more plumbing to separate the search string into discreet terms, and then change the matching (and highlighting) code to work with a set of terms and logical conditions (AND and OR) rather than just a straight match/no match. That's a bit of work, actually. But there is a very simple, if ugly, workaround. First, you need to remove the `\Q` from the pattern matching loop here: # A little ugly: We're going to grep thru @all_notes looking for # a match - but we need to strip the HTML markup (which we've # added to turn tags into links) out of the notes before checkin +g # for a match, so that you don't match inside the HTML markup # while searching. Also, you need to use a temp var, because # otherwise grep will modify $_. Finally, use \Q (quotemeta) +- # we don't want full patterns here, too much risk foreach (grep {my $t; ($t=$_) =~ s/<.?>//g; $t =~ /\Q$search_string/i} @main::all_notes) { [download] (Note the comment about `\Q` being put in there explicitly because I thought regex searches would be too risky. We're taking the safety off, watch where you aim this thing...) Anyway, change that line to look like this: `$t =~ /$search_string/i}` ...and now you can do what you want by using the techniques outlined in The Perl Cookbook, Chapter 6.17, "Expressing AND, OR, and NOT in a Single Pattern." To search for @foo AND @bar, you'd search for: `@foo.@bar\|@bar.@foo` To search for @foo OR @bar, you'd use: `@foo\|@bar` Cavaet - if you're searching for strings not found in tags, search result highlighting will be screwy with this approach. So that's a quick and dirty fix. Can't say I'm in love with it, but it does do what you want, and making the search box more robust would be a good deal more work. Have a nice day* Thanks, I already am. You too. :-)	[reply] [d/l]
Re^3: A Tagcloud For Cory Doctorow by Anonymous Monk on Jun 05, 2009 at 15:58 UTC
Wow - you are fast! This will do for me. Thanks you! Thorarin Bjarnason	[reply]