Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
I had to upgrade the search capability of a web site this week and it needed to be done quickly. I started by looking at what the current code did vs. what the site really needed it to do. The code (inherited) was horrible. Here is the "search":
sub search_biography { my $search_text=shift; my @result; my $table; my $sth=$dbh->prepare("select distinct hm_index,first_name,last_name, category from Biography where first_name like '%$search_text%' or last_name like '%$search_text%' or bio_text like '%$search_text%' or occupation like '%$search_text%' or description like '%$search_text%'" ); $sth->execute; while(my @row=$sth->fetchrow) { push @results , "blah - was long HTML string" } return (\@result); }
This nice code would match 'across' if you looked for 'Ross', I doubt if that is what they wanted. I felt there had to be something already available to do this on CPAN and so began my search.

It took me some searching at CPAN but I finally stumbled across the DBIx::FullTextSearch module, but it was late at night and the doucmentation didn't seem to make much sense so I called it a day.

Well that sleep did me good, I was finally able to make some sense out of the documentation and make a plan on how to convert what I had over.

The documentation still doesn't give a clear (at least to my mind) path to implementing a solution, that is understandable since it isn't exactly an "everyday" item. This is my attempt at explaining the steps I took to shoehorn it into an existing application without too much pain. The documentation talks about a 'frontend' and a 'backend', the frontend is the information to be indexed. In this example we are going to use a database for both the backend and the frontend. See the docs for more options.

Phase One - Tests

One thing I have learned the hard way is that you should test your code outside of the web environment before you add an additional layer of complexity on to it, aka "mod_perl/<your framework/template here>". With that in mind I made two scripts, full_text_search_create.pl and full_text_search.pl

The first and most important script is the full_text_search_create.pl script since it actually creates the indexes that are used by the second one.

In my case I was lucky when it came to the tables because they were already properly setup with Primary keys so it removed a step from the process. If you attempting to work on a table that doesn't have a primary key, I suggest you do that before you attempt using this module.

Here is our first script

full_text_search_create.pl
use DBIx::FullTextSearch; use DBI; use strict; # the data source, that is the tables to be indexed my $dbh = DBI->connect('dbi:mysql:database','user','password'); my $table_info = { 'DBIx_biography' => { table_name => 'Biography', column_name => 'bio_text', }, 'DBIx_glossary' => { table_name => 'Glossary', column_name => 'glossary_definatio +n', }, 'DBIx_timeline' => { table_name => 'Timeline', column_name => 'event_description' +, }, 'DBIx_quotes' => { table_name => 'Quotes', column_name => 'quote', }, }; my $table_check = $dbh->prepare('show tables'); $table_check->execute; my $tables; while ( my ($table) = $table_check->fetchrow() ) { $tables{$table} = 1; } foreach my $front_end_table ( keys %{$table_info} ) { if ($tables{$front_end_table}) { $fts = DBIx::FullTextSearch->open($dbh, $front_end_table); $fts->empty; } else { $fts = DBIx::FullTextSearch->create($dbh, $front_end_table, frontend => 'table', # type of frontend # table to be indexed table_name => $table_info->{$front_end_tabl +e}{'table_name'}, # column that has the info to be indexed column_name => $table_info->{$front_end_table}{'column_nam +e'}, backend => 'column', # type of storage for the back +end # you really should set these up, but I won't get # into that in this writeup #stoplist => 'sl_en', # from example #stemmer => 'en-us' # from example ); } my $ids = $dbh->prepare(qq!SELECT hm_index FROM $table_info->{$front_end_table}{'tabl +e_name'}!); $ids->execute() or die "$DBI::errstr problem"; while ( my ($id) = $ids->fetchrow ) { $fts->index_document($id); } }


In my table structures each table has the same id name so I was able to avoid having to add that into the hash, but it wouldn't be so bad to include it anyway just allow for future growth.

The above code will create a new index or erase an existing one and recreate it. This makes it simple to run anytime there is a major update or in a nightly cron job.

The new names are created inside of the database you connected to in your DBI connect. I named them DBIx_<table_name> so I could spot them easily and it is also unlikely to conflict with existing table names.

For every index you want to create there are three tables: <front_end_table_name> <front_end_table_name>_data <front_end_table_name>_words These are all created automatically by the create statement and no action is required on your end.

Once the indexes are created we need to be able to test our create from the command line to see if it is returning the correct results.

full_text_search.pl
# full_text_search.pl use strict; use DBIx::FullTextSearch; use DBI; # the data source, that is the tables to be indexed my $dbh = DBI->connect('dbi:mysql:database','user','password'); my $front_end_table = shift @ARGV; my $fts = DBIx::FullTextSearch->open($dbh, $front_end_table); my $table = shift @ARGV; # this is a mapping of the SQL statements for each table, I was able # to get these directly from some existing code and then just # replaced my where conditional. This minimized the amount of # refactoring and time required to swap things out. my $table_column = { Biography => 'distinct hm_index,first_name,last_name,category', Quotes => qq!distinct hm_index,quote_stamp,quote, date_format(birth_year,'%Y'), date_format(death_year,'%Y'), name,bio_index,media_index,category!, Timeline => qq!hm_index,event_title,event_brief,event_description +, timeline_stamp,date_format(event_date, '%Y')!, Glossary => qq!hm_index,glossary_title,glossary_defination!, }; my @docs = $fts->contains(@ARGV); my $doc = join(',',@docs); if ($doc) { my $info = $dbh->prepare(qq!SELECT $table_column->{$table} FROM $table WHERE hm_index IN($doc)! ); $info->execute; while ( my @row = $info->fetchrow ) { # this is just for testing so it doesn't need # to be pretty :) print "'$row[1]' '$row[2]'\n"; } } else { print "no matches for '", join(' ',@ARGV) , "'"; }


The second program expects a minimum of 3 command line arguments, something along the lines of:
full_text_search.pl DBIx_biography Biography trs80
Where:
DBIx_biography = front end table Biography = table that contains our primary key, this is the table that we will get our final mathes from trs80 = what to look for


You could also do:
full_text_search.pl DBIx_biography Biography trs80 jeffa Ovid
and it will search for all three. It gives fairly ugly/simple results, but it just an index/search test interface that isn't intended for end users.

Beyond this things become more specific to your own site and how you want to render the results, but hopefully this revealed some of the magic in this module.


Update(1): Made some minor verbiage adjustments and added code tags on commandline for running the program

In reply to Using DBIx::FullTextSearch by trs80

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (2)
As of 2024-04-20 13:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found