Cory Doctorow is the well known editor of BoingBoing, a sci-fi author, and an all around digital luminary. Recently, he posted a call for help visualizing notes he's been collecting for his next book. You can read the full post here, but in a nutshell: he's got a text file full of notes, separated by blank lines. Each note has a series of tags on the end, each tag is indicated by an "@" prefix.
Here's a sample note that's been tagged with the "foo," "bar," and "baz" tags:Lorem ipsum dolor sit amet, consectetuer adipiscing elit. @foo @bar @bazThe actual requirement is given in the post:
"I'm looking for something that'll parse out the tags at the end of the lines and then make a tag-cloud out of them, and let me click on tags to retrieve them, as well as searching the fulltext of all the notes."
Naturally, I immediately thought about how easy this would be in Perl. Parsing the tags is right in the wheelhouse. A quick bit of surfing taught me what a tag cloud was, and lead me to the HTML::TagCloud module. And I've always wanted an excuse to play with HTTP::Server::Simple.
The comments following the post on BoingBoing quickly filled with helpful voices, but none of the answers offered seemed to quite fit the bill. After watching for a day, I was pretty sure the answer to the question was "no such tool exists" - so I took a crack at writing one, leveraging the modules I'd found.
Knowing that Cory uses a Linux desktop with Perl readily at hand, I sent him an early draft of the script to see if it was the sort of thing he was looking for. After a little polishing, we arrived at the code below, which I'm delighted to say seems to be just what he wanted. I share it here in the hopes that someone else might find it useful some day.
On the technical side, the script is pretty simple: parse the notes file into some data structures. Use those data structures to build a Tag Cloud, then run a simple web server who's only function is to display that tag cloud and/or search results. The code is slightly backwards to read, since the web server class is defined up front - the main logic really starts at "BEGIN HERE".
#!/usr/bin/perl -w use strict; use HTML::TagCloud; use HTML::Entities; # Dirt-simple web server - displays the tag cloud, and the set of all # notes that match a given tag, if provided. Also accepts requests to # search the notes, showing highlighted results. { package MyWebServer; use HTTP::Server::Simple::CGI; use base qw(HTTP::Server::Simple::CGI); sub handle_request { my $self = shift; my $cgi = shift; return if !ref $cgi; # Print out the headers, html, the tag cloud, and the search form. print "HTTP/1.0 200 OK\r\n", $cgi->header, $cgi->start_html("Tag C +loud"); print $main::cloud->html_and_css(); print $cgi->start_form(), "<p align=\"right\"> Search all notes for: ", $cgi->textfield('search_string'), $cgi->submit(-value => 'Search'), $cgi->end_form(), "<br><i>(search is case-insensitive)</i><p>"; print "<hr>"; # Now do something interesting with your params, if any. my $tag = $cgi->param('tag'); my $search_string = $cgi->param('search_string'); if ($search_string) { # Display search results my $output; # Perform same HTML encoding on the search string that we did on # the notes, so that searching for things like "<" will work. $search_string = HTML::Entities::encode($search_string); print $cgi->h1("Notes that match \"$search_string\""); # A little ugly: We're going to grep thru @all_notes looking for # a match - but we need to strip the HTML markup (which we've # added to turn tags into links) out of the notes before checkin +g # for a match, so that you don't match inside the HTML markup # while searching. Also, you need to use a temp var, because # otherwise grep will modify $_. Finally, use \Q (quotemeta) +- # we don't want full patterns here, too much risk foreach (grep {my $t; ($t=$_) =~ s/<.*?>//g; $t =~ /\Q$search_string/i} @main::all_notes) { # We want to highlight the match in yellow, but not change the # saved copy of the note - so we work on a copy, $output. # # Regex to (roughly) match an HTML tag: <.*?> # # This s/// matches either an entire tag, or our search # string. The replacement bit is executed (/e): if $2 (our # search string) has matched, wrap it in yellow. # Otherwise, $1 (a tag) is what matched, and it gets # replaced with itself. # # The /e is used instead of just saying "$1$2" (with $2 wrappe +d # in yellow) because that produces endless warnings about use # of undefined values - because only one of the two alternates # is ever defined in the replacement bit. ($output = $_) =~ s{(<.*?>)|($search_string)} {$2 ? "<b><FONT style=\"BACKGROUND-COLOR: yellow\">$2</FONT></b>" : $1 +}eig; print $output, "<p>"; } } elsif ($tag) { # Display notes that match "$tag" print $cgi->h1("Notes tagged with \"$tag\""); foreach my $ref (@{$main::lines{$tag}}) { print $$ref, "<p>"; } } print $cgi->end_html; } } # End of web server class ############ # Begin Here ############ # Parse the notes file, locating tags at the end of entries and # building up two data structures. # # Both of these structures collect "notes," references in %lines and # the actual scalar in @all_notes, which contains a note ready for # display in our HTML output. First, these notes have had HTML # elements encoded to simplify processing and make it harder to do # nasty things to the user's browser. Then the tags at the end of the # lines have been turned into links, same as are used in the tag # cloud, to enhance navigation. # # %lines # foo => [ # "note ref (tagged with foo)", # "another note ref (tagged with foo)", # ... # ] # # @all_notes - arrary of the set of all notes refered to in %lines - # in other words, every note found. Used in searching. our %lines; our @all_notes; # URL used in constructing tag-links my $url = '?tag='; # Parse notes file { local $/ = "\n\n"; # Double-newline separates input records while (<>) { # Need a copy of the "note" to work on and refer to, and we need # it with HTML chars like <, >, etc, escaped to "<", ">", # etc. my $this_line = HTML::Entities::encode($_); # Pop words off the end of the note, processing them as tags as # long as they start with "@". Keep a list of these tags so tha +t # we can wrap them in href's when we're done picking them out. my @words = split; my @tags = (); # tags found at the end of this note while (my $tag = pop @words) { last if $tag !~ /^\@/; # Not a tag, bail $tag =~ s/^@//; # Trim the "@" push @tags, $tag; push (@{$lines{$tag}}, \$this_line); } foreach my $tag (@tags) { # Greedy match in $1 insures that $2 will be the last instance # of $tag in the note - in other words, the one on the end with # the "@" prefix. And we know that each $tag was parsed off +the # end of this note, insuring this works. $this_line =~ s|(.*)\b($tag)\b|$1<a href="$url$2">$2</a>|; } push @all_notes, $this_line; } } # Build tag cloud our $cloud = HTML::TagCloud->new(levels => 24); foreach my $tag (keys %lines) { $cloud->add($tag, $url.$tag, scalar @{$lines{$tag}}); } # Start an instance of MyWebServer on port 8080, only bind to localhos +t my $pid = MyWebServer->new(8080); $pid->host('localhost'); $pid->run(); # Copyright (c) 2008, Dan McDonald. All Rights Reserved. # This program is free software. It may be used, redistributed # and/or modified under the terms of the Perl Artistic License # (see http://www.perl.com/perl/misc/Artistic.html)
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: A Tagcloud For Cory Doctorow
by jwkrahn (Abbot) on Aug 28, 2008 at 13:50 UTC | |
by McD (Chaplain) on Aug 28, 2008 at 14:01 UTC | |
|
Re: A Tagcloud For Cory Doctorow
by lwicks (Friar) on Aug 28, 2008 at 09:05 UTC | |
|
Re: A Tagcloud For Cory Doctorow
by Anonymous Monk on Aug 28, 2008 at 03:59 UTC | |
by McD (Chaplain) on Aug 28, 2008 at 13:03 UTC | |
by Anonymous Monk on Jun 18, 2009 at 22:51 UTC | |
|
Re: A Tagcloud For Cory Doctorow
by rhenley (Initiate) on May 10, 2009 at 03:51 UTC | |
by McD (Chaplain) on May 10, 2009 at 18:55 UTC | |
|
Re: A Tagcloud For Cory Doctorow
by zecg (Initiate) on May 10, 2009 at 13:51 UTC | |
by McD (Chaplain) on May 10, 2009 at 19:01 UTC | |
by paulskin (Initiate) on May 13, 2009 at 09:39 UTC | |
by McD (Chaplain) on May 13, 2009 at 18:38 UTC | |
|
Re: A Tagcloud For Cory Doctorow
by monk0338gne83 (Initiate) on Nov 18, 2009 at 21:13 UTC | |
by McD (Chaplain) on Nov 30, 2009 at 15:00 UTC | |
by ambrus (Abbot) on Nov 30, 2009 at 16:07 UTC | |
|
Re: A Tagcloud For Cory Doctorow
by Anonymous Monk on May 10, 2009 at 15:20 UTC | |
by rhenley (Initiate) on May 10, 2009 at 16:35 UTC | |
by Anonymous Monk on May 10, 2009 at 16:56 UTC | |
by Anonymous Monk on May 10, 2009 at 21:34 UTC | |
by rhenley (Initiate) on May 11, 2009 at 03:16 UTC | |
by Anonymous Monk on May 11, 2009 at 02:33 UTC | |
by gnuchu (Novice) on May 11, 2009 at 10:26 UTC | |
by McD (Chaplain) on May 20, 2009 at 17:19 UTC | |
by Anonymous Monk on May 23, 2009 at 17:41 UTC | |
| |
by Anonymous Monk on May 23, 2009 at 17:35 UTC | |
|
Re: A Tagcloud For Cory Doctorow
by Anonymous Monk on May 11, 2009 at 19:38 UTC | |
|
Re: A Tagcloud For Cory Doctorow
by Anonymous Monk on Jun 04, 2009 at 19:10 UTC | |
by McD (Chaplain) on Jun 05, 2009 at 13:08 UTC | |
by Anonymous Monk on Jun 05, 2009 at 15:58 UTC |