Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

RFC: Snippets Permuted Index without Stopwords

by clp (Friar)
on Oct 16, 2009 at 20:34 UTC ( [id://801644]=perlmeditation: print w/replies, xml ) Need Help??

Hi,

I started with the program to make the Snippets Permuted Index from jdporter.

My goal was to make minimal modifications to it, to prevent stopwords from being used as index terms.

Stopwords (eg, many articles, prepositions, and conjunctions) can clutter an index, and make it much larger without being more useful.

I added a new command line option, -nostopwords;
two new subroutines (one with a list of the stopwords);
modified a few other lines;
and added a short POD section.

The stopword processing seems OK to me. If you see any improvements, or bugs or problems with it, please let me know (search the source for 'stopword' to find the areas that I changed). I am less inclined to change other parts of the code, which seems to be fine.

I might post this in the Code Catacombs eventually, or provide a patch to jdporter's program that includes this new functionality. I think he updates the permuted index page occasionally, and perhaps this would be a useful format.

Thanks for looking at my first code for perlmonks.
clp

#!/usr/bin/perl # generate_snip_indx_nostop.pl clpoda 2009.08.30 # Derived from generate_snippets_index.pl use Getopt::Long; use LWP::Simple; use Data::Dumper; $Data::Dumper::Indent=1; use HTML::Entities; use strict; use warnings; my $DEBUG = 0; my( $get_new, $by_titles, $permuted, $nostopwords )=(0)x99; GetOptions( 'getnew|new!' => \$get_new, 'titles|by_titles!' => \$by_titles, 'permuted!' => \$permuted, 'nostopwords!' => \$nostopwords, ); $get_new + $by_titles + $permuted + $nostopwords == 1 or die "Usage: $0 [-getnew] [-titles] [-permuted] [-nostopwords] (exa +ctly one)\n"; my @stopwords; read_stopwords (); # the files we refer to: my $snippets_data_file = 'snippets.pld'; # read/written by get_new; re +ad by the others my $last_date_file = 'snippets_last_date.pld'; # read/written by get_n +ew. # if title is not provided, the id will be used as the title. # XXX this really should be made to get the title from the xml record. sub get_one_snippet { my( $id, $title ) = @_; warn "\tgetting $id\n"; local $_ = get( "http://perlmonks.org/?displaytype=xml;node_id=$id +" ); my( $author_id, $author_name ) = m#<author id="(\d+)">\s*(.*)</aut +hor>#; $author_id == 52855 and return(); my( $date ) = m#<node id=.* created="([^"]*)" updated=#; my( $desc ) = m#<field name="snippetdesc">\s*(.*?)</field>#s; $desc =~ s/\r//g; # any other cleanup needed? { id => $id, date => $date, author_id => $author_id, author_name => $author_name, desc => decode_entities($desc), title => decode_entities($title), } } my $snips = do $snippets_data_file; # load the snippets base (ref to A +oH) if ( $get_new ) { my %snips = map { ( $_->{'id'} => $_ ) } @$snips; # key by id warn scalar(keys %snips)." snippets already in local cache.\n"; my $last_date = do $last_date_file; my( $year, $mon, $mday ) = @{$last_date}{qw( year mon mday )}; # back the date up by a day or three # sure, we could use a module for this, but that would be # way heavier-weight than we need. Fuzzy is fine. if ( $mday > 1 ) { $mday--; } elsif ( $mon > 1 ) { $mon--; $mday = $mon == 2 ? 28 : 30; } else { $year--; $mon = 12; $mday = 31; } my $query = sprintf "yr=%04d;mo=%02d;dy=%02d", $year, $mon, $mday; warn "Querying for snippets posted since $query ...\n"; $query = "node_id=3989;nf=0;$query;re=N;Sn;go=Search;as_user=961"; + # anonymonk my $html = get( "http://www.perlmonks.org/bare/?".$query ); $html =~ /\(searched [.0-9]+% of DB\)/ or die "Failed to get good result from Super Search!"; my $n_added = 0; for my $tr ( $html =~ m#<tr>.*?</tr>#sg ) { local $_ = $tr; s/[\r\n]+//g; /<td><a href="\?node_id=1980">Snippet<\/a><\/td>/ or next; my( $y, $m, $d ) = m/<td>(\d\d\d\d)?(\d\d)?(\d\d)<\/td>/; my( $author_id, $author_name, $id, $title, ) = m/<td><a href="\?node_id=(\d+)">(.*?)<\/a><\/td>\s*<td><a href +="\?node_id=(\d+)">(.*?)<\/a><\/td>/; if ( $author_id == 52855 ) { warn "Skipping $id, owned by $author_name\n"; next; } if ( $snips{$id} ) { warn "Skipping $id, already got it.\n"; next; } my $r = get_one_snippet( $id, $title ); unless ( defined $r ) { warn "Bogus! No valid snippet data returned for $id ". qq("$title" by $author_name ($author_id)\n); next; } warn "Adding new snippet $id, posted on $y-$m-$d\n"; $snips{$id} = $r; $n_added++; } print $n_added ? "Added $n_added new snippets.\n" : "No new snippe +ts found.\n"; @$snips = values %snips; open F, ">", $snippets_data_file or die "write $snippets_data_file + - $!"; print F Dumper($snips); close F; my @now = gmtime; $last_date->{'year'} = $now[5] + 1900; $last_date->{'mon'} = $now[4] + 1; $last_date->{'mday'} = $now[3]; open F, ">", $last_date_file or die "write $last_date_file - $!"; print F Dumper($last_date); close F; exit 0; } =pod =begin comment typical: { 'id' => 248201, 'title' => 'Quick and Dirty Seti@home Server Status', 'author_name' => 'Mr. Muskrat', 'author_id' => 155876, 'date' => '2003-04-04 17:14:39', 'desc' => 'Perform a quick and dirty check of the Seti@home server +s.', } =end comment =cut sub header { my( $title, $ref_id, $ref_title ) = @_; my( $year, $mon, $day ) = (gmtime)[5,4,3]; my $date = sprintf "%04d-%02d-%02d", $year+1900, $mon+1, $day; <<EOF <html><head><title>$title</title></head><body> <h1>$title</h1> <p>See <a href="http://perlmonks.org/?node_id=$ref_id">$ref_title</a>. +</p> <p>Generated on $date</p> EOF } sub linkId { my( $id, $title ) = @_; if (0) { return defined $title ? "[id://$id|$title]" : "[id://$id]" } defined $title or $title = $id; qq(<a href="http://perlmonks.org/?node_id=$id">$title</a>) } # by title/id if ( $by_titles ) { print header('PerlMonks Snippets - By Title',619683,'New Snippets +Index'), "<table>\n"; for ( sort { $a->{'title'} cmp $b->{'title'} or $a->{'id'} <=> $b->{'id'} } @$snips ) { my $d = $_->{'date'}; $d =~ s/ .*//; # strip off time part print '<tr><td>', linkId( $_->{'id'}, $_->{'title'} ), '</td><td>', linkId( $_->{'author_id'}, $_->{'author_name'} ), "</td><td>$d</td></tr>\n" } print "</table></body></html>\n"; exit 0; } # permuted title index if ( $permuted or $nostopwords ) { print header('PerlMonks Snippets - Permuted Titles',619691,'Snippe +ts Permuted Index'), "<pre>\n"; my @permut; for my $s ( @$snips ) { local $_ = $s->{'title'}; /untitled node/ and next; print "Analyzing input string .$_. \n" if $DEBUG; while ( /\b[\w]/g ) { my $p = pos; my $l = substr $_, 0, $p-1; my $r = substr $_, $p-1; # # Skip a record where a stopword is the index term # and if this cmd line option was specified: 'nostopwords' next if ( $nostopwords && stopword_record_found($r) ); # push @permut, [ $l, $r, $s ]; } } my $limit = 40; my $lmax = 0; for ( @permut ) { $lmax < length($_->[0]) and $lmax = length($_->[0]); } $lmax > $limit and $lmax = $limit; for my $p ( sort { lc($a->[1]) cmp lc($b->[1]) or $a->[2]{'id'} <=> $b->[2]{'id'} or lc($a->[0]) cmp lc($b->[0]) } @permut ) { my $l = $p->[0]; my $r = $p->[1]; my $id = $p->[2]{'id'}; #$l =~ /./ and $l = "[id://$id|$l]"; #$r =~ /./ and $r = "[id://$id|$r]"; #print qq(<tr><td align=right>$l</td><td>$r</td></tr>\n); if ( length($l) > $lmax ) { my $chop = length($l) - $lmax; # how many to chop substr( $l, 0, $chop ) = ''; } else { print ' ' x ( $lmax - length($l) ); } print linkId( $id, $l.$r ),"\n"; } print "</pre></body></html>\n"; exit 0; } # </pre></body></html> sub stopword_record_found { my @string = split (' ', shift); foreach my $w (@stopwords) { #D print "Now testing stopword .$w.\n"; if ( (lc $string[0]) eq $w ) { # A stopword was found. print "Stopword was found; input word .$string[0]. and sto +pword .$w. .\n" if $DEBUG; return 1; } } # Stopword was not found. #D print "Stopword was not found; returning 0 for word .$string[0] +. .\n"; return 0; } sub read_stopwords { # Terms removed from stopword list. # always # off # on # only #D my @stopwords2; @stopwords=qw( a about above according across actually adj after afterwards again against all almost alone along already also although among amongst an and another any anyhow anyone anything anywhere are aren't around as at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both but by can can't cannot caption co company corp corporation could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything everywhere except few fifty first five for former formerly forty found four from further had has hasn't have haven't he he'd he'll he's hence her here here's hereafter hereby herein hereupon hers herself him himself his how however hundred i i'd i'll i'm i've ie if in inc indeed instead into is isn't it it's its itself last later latter latterly least less let let's like likely ltd made make makes many maybe me meantime meanwhile might million miss more moreover most mostly mr mrs much must my myself namely neither never nevertheless next nine ninety no nobody none nonetheless noone nor not nothing now nowhere of often once one one's onto or other others otherwise our ours ourselves out over overall own per perhaps rather recent recently same seem seemed seeming seems seven seventy several she she'd she'll she's should shouldn't since six sixty so some somehow someone something sometime sometimes somewhere still stop such taking ten than that that'll that's that've the their them themselves then thence there there'd there'll there're there's there've thereafter thereby therefore therein thereupon these they they'd they'll they're they've thirty this those though thousand three through throughout thru thus to together too toward towards trillion twenty two under unless unlike unlikely until up upon us used using very via ve was wasn't we we'd we'll we're we've well were weren't what what'll what's what've whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who'd who'll who's whoever whole whom whomever whose why will with within without won't would wouldn't yeah yes yet you you'd you'll you're you've your yours yourself yourselves 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ); } =head1 NAME generate_snip_indx_nostop - Generate permuted index of titles of code snippets from perlmonks.org, ignoring stopwords. =head1 USAGE generate_snip_indx_nostop [-getnew] [-titles] [-permuted] [-nostopwo +rds] (exactly one) # Make a permuted index without showing any stopwords in the index pos +ition: generate_snip_indx_nostop -nostopwords =head1 OPTIONS One and only one of the following options must be specified: -getnew -titles -permuted -nostopwords B<-getnew> Retrieve metadata about code snippets from perlmonks.org that does not yet exist on the local machine. B<-titles> Print a list of titles of code snippets, HTML format. B<-permuted> Print a list of titles of code snippets in the permuted format, using each word from each title as an index term, HTML format. B<-nostopwords> Print a permuted index that does not use any stopword as an index term, HTML format. =head1 DESCRIPTION To make a permuted index with no stopwords, first get a list of titles from Code Snippets section of the perlmonks.org s +ite, using the -getnew option: generate_snip_indx_nostop -getnew Next, make a permuted index without stopwords, using the -nostopwords option: generate_snip_indx_nostop -nostopwords =head1 DEPENDENCIES The get_new function requires access to the perlmonks.org web site (using LWP::Simple), to get titles of the code snippets. =head1 BUGS AND LIMITATIONS B<A title containing angle brackets is not handled correctly.> =over The angle brackets and any text inside them in the title of a snippet, do not appear in the output. (But angle brackets used in the corresponding URL for the title are handled correctly). Any text to the right of such a term is shifted to the left in the permuted index output line, causing the wrong text to be placed in the index term's location for that line. If other index terms for that line follow the angle brackets, those lines in the permuted index will also be misaligned by the number of missing spaces in the index term plus the two angle brackets. Eg, the title "Search for <n>th occurrence of regex" will produce entries shifted left by three spaces in the permuted index for these terms: <n>th, occurrence, and regex. Idea: Consider encoding angle brackets so they appear properly on the permuted index web page. =back =head1 SEE ALSO Snippets Permuted Index at Perlmonks: L<http://perlmonks.org/?node_id=619691> =head1 ACKNOWLEDGEMENTS The Perlmonks permuted index program was provided by jdporter. The original list of stopwords was also from Perlmonks: Removing Stopwords from a String, L<http://perlmonks.org/?node_id=50257> =head1 AUTHOR C. Poda, clppm at poda.net =head1 LICENSE AND COPYRIGHT This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See L<perlartistic>. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. =cut

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://801644]
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (7)
As of 2024-04-18 02:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found