(tye)Re: Newest Super Search

The only modules are DBI, CGI, and Everything. Most of the actual code is included below.

This is a pretty specialized situation. For example, we can't allow any single query to run very long since MySQL was designed assuming some aspects of the threading model (light-weight processes) that aren't present on FreeBSD. So a single long-running query can nearly lock up access to the database for everyone.

Another MySQL quirk is that even when you use DBI, once you submit a query MySQL does the whole dang thing before it lets you get the first matching record back. If it weren't for this quirk, I could work around the previous problem (in many cases) by terminating the long-running query after I'd fetched enough records (or taken too much time). But I can't, so I have to ensure that each query will finish fairly quickly in all cases.

And we tried using MySQL's "full-text search" features. I pretty much hated the results (can't search for 3-letter words, for example). But worse, it still had "worst case" situations where a single search could end up locking up the whole site.

I remember my first run-ins with full-text search (in the '80s). It sounds so nice but in practice I find that it rarely works very well. It usually takes lots and lots of practice to learn how to do a decent search that gives you back something you are looking for but not buried amid 4000 things you don't want.

Well, I haven't noticed it get much better. Sure, there are some "words" that are unique enough that searching on them works great. But way too much of the time you don't get that lucky. I find that I almost always prefer searching an index than trying my luck at "full-text search".

Except, of course, for Google! The miracle of Google is how they sort the results. Trying to imitate Google's ingenious sorting technique (what we know of it) seemed like too much work for me to pull off in my spare time.

Now, we can use Google. But I've tried using Google against PerlMonks (directly and through "thepen") and it just doesn't work that well.

And lots of times I don't want to search for "words". Even Google doesn't do a very good job if you want to search for parts of "words". It does pretty good at search for "phrases". But throw lots of punctuation in and it and I often don't agree on what a "word" is. And Perl code certain often has lots of punctuation in it.

So I realy liked the original PerlMonks simple and Super searches. They were pretty much just substring searches that required "and" (match all of the criteria listed). I found that I was successful with them much more frequently than I am with full-text searches. But, they are resource intensive.

So I've heard quite a few people's suggestions on ways to build a word lists and lists of "stop" words that you aren't allowed to search for, etc. to roll our own full-text search. Or some canned full-text search to use. Well, I encourage anyone who wants to to pursue such. We've even added XML tickers so external search engines can be built more easily. But I doubt I will find them very useful. (BTW, my quick glance at DBIx::FullTextSearch makes me think it wants to create its own tables so you might want to try making an external search out of it -- if it turns out nice, we can probably find a place to host it.)

Anyway, in trying to find out why the site was locking up, I made changes to "mytop" (like "top" but for MySQL queries -- it shows the current queries against database and how long they've been running) and did lots of watching and fixing. And I've come to understand some of what makes a query slow in MySQL.

I kind of like MySQL's query optimizer. It is quite simple (compared to some of the ones I deal with) and so it is much easier to predict what it is going to do. I spent years writing database code where we didn't use SQL, we would seek to a point in an index and read forward or backward from there. So I think of ways to get results efficiently at a low level. But sometimes it is dang hard to write SQL that convinces the optimizer to do what I thought up (so, somewhat paradoxically, a query that I could perform very efficiently will be executed very inefficiently by the SQL server).

So, the basic lesson in this case is that you want to make a query that can find all of the records that it needs by searching a fairly small range of one index and then set a reasonable LIMIT. Then you have to tell SQL to order the records based on that index (that makes it much more likely that the optimizer will actually choose to use the index that you gave to it on a silver platter).

Of course, many queries can't be fullfilled that way so I end up findng the matches by doing a sequence of such queries.

If you don't specify a small range, then you could spend too much time searching a large number of records (and the optimizer also "understands" this and so becomes more likely to ignore your preferred index) which would lock up MySQL on FreeBSD (a joke user someone made for me when I started blaming all site problems on this combination). And you have to set a reasonable LIMIT or else you could spend too much time sending the matching information across (with the same result). And since you specified the sort order, you can also efficiently continue where you left off (without having to use something like MySQL's "LIMIT 50,150" which probably requires the first part of the search basically be repeated).

So that is what the newest Super Search does. And, just in case I don't know the MySQL optimizer as well as I think I do, it asks the optimizer to explain its plan before letting MySQL try to perform the query. If it doesn't decide to use an index that requires only a fairly small number of records to be read, then I won't run the query.

I've made some fairly minor changes to the code and removed some code that is just for future features in hopes of it being slightly easier to understand. Some changes are things I planned to do to clean up the code but I just did them quickly here and haven't tested. So if you see a syntax error, it is probably just a typo. /:

Also, writing code for Everything makes it hard to write utility subroutines so this isn't really "factored" like I would normally write code. Okay, enough excuses. ;)

Update: I forgot one excuse. (: The CGI parameters names were chosen to be very short (but still somewhat mnemonic) because I plan to add a feature where you can cut'n'paste a URL that performs the search that you have crafted for repeating it later or referring to it in a node, etc.

- tye (but my friends call me "Tye")

[%
    my $html= '';
    my @errors;

    my @types= q(
        Wi  perlquestion            SoPW        Seekers of Perl Wisdom
        D   monkdiscuss             PMD         PM Discussions
        Ob  obfuscated              Obfu        Obfuscation
        CU  CUFP                    CUFP        Cool Uses For Perl
        CC  sourcecode              Code        Code Catacombs
        CQ  categorized_question    CatQ        Categorized Questions
        CA  categorized_answer      CatA        Categorized Answers
        Hlp sitefaqlet              Help        Monk Help

        Tu  perltutorial            Tut         Tutorial
        U   user                    User
        Po  poem                    Poem
        Cr  perlcraft               Craft
        Sn  snippet                 Snippet
        N   perlnews                News
        Q   quest                   Quest
        Pol poll                    Poll

        M   perlmeditation          Med         Meditations
        SP  scratchpad              SPad        Scratch Pad
        MR  modulereview            ModRev      Module Review
        BR  bookreview              BkRev       Book Review
        pPd perlman                 perlman     Perl Manpage
        pFn perlfunc                perlfunc    Perl Function
        pFq perlfaq_nodetype        perlfaq     Perl FAQ

    ) =~ /(\S.*\S)/g;

    my( %abbr, %desc, %typeId, %link );
    for(  @types  ) {
        my( $abbr, $type, $link, $desc )= split " ", $_, 4;
        $type =~ tr/_/ /;
        my $id= getId( getType($type) );
        $typeId{$type}= $id;
        $abbr{$type}= $abbr;
        $desc{$type}= $desc || $link;
        $link{ $id }= $link;
        $_= $type;
    }

    my %typeTable= qw(
        snippet         snippet
        bookreview      review
        modulereview    review
        sourcecode      sourcecode
        poll            polls
    );
    my %fieldOfTable= (
        snippet => [qw( snippetdesc snippetcode )],
        review => [qw( itemdescription usercomment doctext )],
        sourcecode => [qw( codedescription doctext )],
        polls => ['choices'],
        #user => [qw( scratchpad )],    # Needs to change
    );

    my @sects;
    my $sects= do {
        my $negSects= ( $q->param("xs") )[-1] ? 1 : 0;
        my %checked;
        @checked{keys %abbr}= map {
            ( ()= $q->param($abbr{$_}) ) ? 1 : 0;
        } keys %abbr;
        @sects= grep $negSects != $checked{$_}, keys %abbr;
        @sects= @types   if  ! @sects;
        join ", ", map $typeId{$_}, @sects;
    };

    my @criteria;

    my @users= grep length, $q->param("a");
    if(  1 == @users  &&  $users[0] =~ m#^(\s*\[[^\]]+\])+\s*$#g  ) {
        @users= $users[0] =~ m#\[([^\]]+)\]#g;
    }
    for my $user (  @users  ) {
        my $type = "user";
        my $reason = "does not exist";
        my $U;
        if(  $user !~ m#^id://(\d+)$#  ) {
            $U = getNode( $user, "user" );
        } else {
            ( $type, $user ) = ( "node ID", $1 );
            $U = getNodeById( $user );
            if(  $U  &&  "user" ne $U->{type}{title}  ) {
               undef $U;
               $reason = "is not a user";
            }
        }
        if(  $U  ) {
            $user= getId($U);
        } else {
            $user= 0;
            push @errors, qq[\u$type "] . $query->escapeHTML($user)
                . qq[" $reason.<br />];
        }
    }
    @users= grep $_, @users;
    my $negAuthor= ( $q->param("xa") )[-1] ? 1 : 0;
    $negAuthor= $negAuthor ? " NOT" : "";
    if(  @users  ) {
        push @criteria, "n.author_user$negAuthor IN ( "
            . join( ", ", @users ) . " )";
    }

    my $replies= ( $q->param("re") )[-1];
    $q->param( "re", $replies );
    my $xRoots= ()= $q->param("xr");
    my $note= getId( getType("note") );

    push @criteria, do {
        if(  "N" eq $replies  ) {           # No replies:
            push @errors,
                "No root nodes and no replies means no search.<br />"
              if  $xRoots;
            $xRoots
              ? "n.node_id = 0"                  # Find nothing!
              : "n.type_nodetype IN ( $sects )"; # Just sel. roots
        } elsif(  "A" eq $replies           # All replies (same as
              ||  @sects == @types  ) {     #  re.s from all sect.s):
            $xRoots
              ? "n.type_nodetype = $note"              #Just all re.s
              : "n.type_nodetype IN ( $note, $sects )";#^ + sel. roots
        } else {                            # Replies from sel. sects:
            $q->param( "re", undef );
            my $c= "( n.type_nodetype = $note"
                . " AND root.type_nodetype IN ( $sects ) )";
            $xRoots
              ? $c                                         # Sel. re.s
              : "( n.type_nodetype IN ( $sects ) OR $c )"; # ^ + roots
        }
    };

    # ( Head Body ) + ( Includes Excludes ) + ( Terms Seperator )

    my $getTerms= sub {
        my( $textParam, $sepParam )= @_;
        my $str= $q->param( $textParam );
        my $sep= $q->param( $sepParam );
        $sep =~ s/^\s*//;
        $sep =~ s/\s*$//;
        $sep= " "   if  ! length $sep;
        $q->param( $sepParam, $sep );
        my @terms= grep length, split /\Q$sep/, $str;
        $q->param( $textParam, join $sep, @terms );
        return @terms;
    };
    my @headHas= $getTerms->( "HIT", "HIS" );
    my @headLacks= $getTerms->( "HET", "HES" );
    my @bodyHas= $getTerms->( "BIT", "BIS" );
    my @bodyLacks= $getTerms->( "BET", "BES" );

    my( @tables, @fields );
    push @tables, 'note',
        "left join node as root on root.node_id=root_node";

    if(  @bodyHas  ||  @bodyLacks  ) {
        my( %tables, %fields );
        push @sects, 'note'
            unless  'N' eq $replies;
        for my $type (  @sects  ) {
            if(  $typeTable{$type}  ) {
                ++$tables{ $typeTable{$type} };
                ++$fields{$_}
                  for  @{  $fieldOfTable{ $typeTable{$type} }  };
            } else {
                ++$tables{document};
                ++$fields{doctext};
            }
        }
        push @tables, keys %tables;
        push @fields, keys %fields;
    }

    my $tables= "node as n";
    for my $table (  @tables  ) {
        if(  $table =~ / /  ) {
            $tables .= "\n$table";
        } else {
            $tables .= "\nleft join $table on ${table}_id=n.node_id";
        }
    }

    if(  @headHas  ) {
        push @criteria, map {
            my $quoted= $_;
            $quoted =~ s#\\#\\\\#g;     # MySQL bug
            $quoted =~ s#(['%_\\\[\]])#\\$1#g;
            "n.title LIKE '%$quoted%'";
        } @headHas;
    }

    if(  @headLacks  ) {
        push @criteria, map {
            my $quoted= $_;
            $quoted =~ s#\\#\\\\#g;     # MySQL bug
            $quoted =~ s#(['%_\\\[\]])#\\$1#g;
            "n.title NOT LIKE '%$quoted%'";
        } @headLacks;
    }

    if(  @bodyHas  ) {
        push @criteria, map {
            my $quoted= $_;
            $quoted =~ s#\\#\\\\#g;     # MySQL bug
            $quoted =~ s#(['%_\\\[\]])#\\$1#g;
            "( " . join( " OR ", map {
                "$_ LIKE '%$quoted%'";
            } @fields ) . " )";
        } @bodyHas;
    }

    if(  @bodyLacks  ) {
        push @criteria, map {
            my $quoted= $_;
            $quoted =~ s#\\#\\\\#g;     # MySQL bug
            $quoted =~ s#(['%_\\\[\]])#\\$1#g;
            map {
                "$_ NOT LIKE '%$quoted%'";
            } @fields;
        } @bodyLacks;
    }

    my $oldFirst= ! ( $q->param("nf") )[-1];
    my $n0= $q->param("n0");
    my $doSearch=  $n0  &&  ! @errors;
    my $lastNode= $DB->sqlSelect( "max(node_id)", "node" );
    $n0 ||= $oldFirst ? 1 : $DB->sqlSelect( "max(node_id)", "node" );

    push @criteria, "n.node_id BETWEEN !TBD!";

    my $limit= 50;
    if(  $doSearch  ) {
        require Time::HiRes;
        my @matches;
        my $start= $n0;
        my $startTime= Time::HiRes::time();
        while(  1  ) {

            my( $min, $max );
            if(  $oldFirst  ) {
                ( $min, $max )= ( $n0, $n0+10000 );
                $max= 1000 * int( $max/1000 + 0.5 );
                $max= $lastNode   if  $lastNode < $max;
            } else {
                ( $min, $max )= ( $n0-10000, $n0 );
                $min= 1000 * int( $min/1000 + 0.5 );
                $min= 1   if  $min < 1;
            }
            $criteria[-1]= "n.node_id BETWEEN $min AND $max";

            my $explainTime= Time::HiRes::time();
            my $query= qq[
                SELECT n.node_id, n.title, n.type_nodetype,
                       n.author_user, n.createtime, root.type_nodetype
                FROM   $tables
                WHERE  ] . join( " AND ", @criteria ) . qq[
                ORDER BY n.node_id
                LIMIT ] . ( $limit - @matches );
            my $explain= $DB->getDatabaseHandle()->prepare(
                "EXPLAIN $query" );
            $explain->execute();
            my $rec= $explain->fetchrow_hashref();
            $explain->finish();
            my $key_used= $rec->{key};
            my $key_rows= $rec->{rows};
            my $comment= $rec->{Comment};

            $explainTime= Time::HiRes::time() - $explainTime;
            if(  3 < $explainTime  ) {
                push @errors, ( $start==$n0 ? "Q" : "Remainder of q" )
                    . qq[uery was not run;  Server is too busy ]
                    . sprintf(
                        qq[("explain" took %.2f seconds)<br />],
                        $explainTime
                      );
                last;
            }

            unless(  "PRIMARY" eq $key_used
                 or  "" ne $key_used  &&  $key_rows < 10000
            ) {
                push @errors, ( $start==$n0 ? "Q" : "Remainder of q" )
                    . "uery would not run quickly"
                    . ( $comment ? " ($comment)" : "" )
                    . ".<br />\n";
                last;
            }

            my $cursor= $DB->sqlSelectMany(
                "n.node_id as node_id, n.title as title,
                 n.type_nodetype as type_nodetype,
                 n.author_user as author_user, n.createtime as createt
+ime,
                 root.type_nodetype as root_nodetype",
                $tables,
                join( " AND ", @criteria ),
                "ORDER BY n.node_id LIMIT " . ( $limit - @matches ),
            );
            my $rec;
            while(  $rec= $cursor->fetchrow_hashref()  ) {
                push @matches, $rec;
            }
            $cursor->finish();

            if(  @matches < $limit  ) {
                $n0= 1 + $max;
            } else {
                $n0= 1 + $matches[-1]{node_id};
                last;
            }
            last   if  $lastNode < $n0;
            my $runTime= Time::HiRes::time() - $startTime;
            if(  10 < $runTime  ) {
                push @errors, ( $start==$n0 ? "Q" : "Remainder of q" )
                    . qq[uery was not run ]
                    . sprintf(
                        qq[(used %.2f seconds so far)<br />],
                        $runTime
                      );
                last;
            }
        }

        my $startDate= ( split " ", $DB->sqlSelect(
            "createtime","node","node_id=$start") )[0];
        my $endDate= ( split " ", $DB->sqlSelect(
            "createtime","node","node_id=".($n0-1)) )[0];
        my $matches= @matches;

        $html .= qq[<p><hr />
            <b>Found $matches node] . ( 1==$matches ? "" : "s" )
            . qq[</b>between IDs $start ($startDate) and ] . ($n0-1)
            . qq[($endDate)];

        if( @bodyHas || @bodyLacks || @headHas || @headLacks ) {
            $html .= qq[<br />where ] . join qq[<br />and ],
                map {
                    my( $desc, @terms )= @$_;
                    if(  ! @terms  ) {
                        ();
                    } else {
                        $desc . join( ", ", map {
                            '"<tt>' . $q->escapeHTML($_) . '</tt>"'
                        } @terms )
                    }
                } ["any text contains all of ",@bodyHas],
                  ["no text contains any of ",@bodyLacks],
                  ["title contains all of ",@headHas],
                  ["title doesn't contain any of ",@headLacks],
        }

        if(  @users  ) {
            $html .= qq[<br />written by ]
                . ( $negAuthor ? "anyone but " : "any of " )
                . join ", ", map linkNode($_), @users;
        }

        $html .= qq[</p>\n];

        my $linkType= sub {
            my( $typeId )= @_;
            return  linkNode( $typeId, $link{$typeId} );
        };

        $html .= qq[<p><table width="100%">];
        for my $rec (  @matches  ) {
            $html .= $q->Tr(
                $q->td( ( split " ", $rec->{createtime} )[0] ),
                $q->td( linkNode($rec->{author_user}) ),
                $q->td( linkNode($rec->{node_id},$rec->{title}) ),
                $q->td(
                    $note == $rec->{type_nodetype}
                      ? "Re:" . $linkType->( $rec->{root_nodetype} )
                      : $linkType->( $rec->{type_nodetype} )
                ),
            );
        }
        $html .= qq[</table></p>\n];
    }

    $q->param( "n0", $n0 );

    if(  $doSearch  ) {
        if(  $oldFirst  &&  $n0 < $lastNode
         ||  ! $oldFirst  &&  1 < $n0  ) {
            my( $min, $max )=
                $oldFirst ? ( $n0, $lastNode ) : ( 1, $n0 );
            $html .= qq[<p>
                Press a "Search" button (below) <b>to continue</b>
                (IDs $min thru $max).
                </p>\n];
        }
        $html .= "<hr />";
    }


    $html .= '<p>' . linkNode( $NODE, "Reset search form" ) . "</p>\n"
+;

    $html .= $/ . htmlcode('openform') . $/;

    $html .= qq[<p>
        Match <b>text</b> containing ] . $q->textfield( "BIT", "", 60 
+)
        . qq[<br />(seperate strings with ]
        . $q->textfield( "BIS", " ", 2 ) . qq[ -- default is spaces)
        <br />] . $q->radio_group( "BH", [ "0", "1" ], "1", 0,
            { 0=>"Don't match -or-", 1=>"Also match" },
        ) . qq[ <b>titles</b> against above.</p>];

    $html .= $/ . $q->submit("","Search") . qq[
        Please be patient after submitting your search.\n];

    $html .= qq[<p>
        Match <b>titles</b> containing ]
        . $q->textfield( "HIT", "", 60 )
        . qq[<br />(separate strings with ]
        . $q->textfield( "HIS", " ", 2 )
        . qq[ -- default is spaces)</p>];

    $html .= qq[<p>
        ] . $q->radio_group(
            "xa", [ "0", "1" ], "0", 0, { 0=>"Match -or-", 1=>"Exclude
+" },
        ) . qq[ <b>authors</b> ] . $q->textfield( "a", "", 20 )
        . qq[<br />
        (use "&#91;one&#93; &#91;two&#93;" to list multiple authors)
        <br />(Searching by author doesn't work for Categorized
        Questions and Answers yet.)</p>];

    $html .= qq[<p>
        Search ] . $q->radio_group(
            -name=>"nf", -values=>[ "1", "0" ], -default=>"0",
            -labels=>{ 1=>"Newest first -or-", 0=>"Oldest first" },
            -disabled=>"disabled",
        ) . qq[,<br />starting at node
        ] . $q->textfield( "n0", "0", 12 ) . qq[ (]
        . ( split " ", $DB->sqlSelect(
                "createtime","node","node_id=$n0") )[0]
        . qq[).</p>];

    $html .= qq[<!-- <p> Show {10|20|50} matches per page.</p> -->];

    $html .= qq[<p>
        Search ] . $q->radio_group(
            "xs", [0,1], 0, 0, {0=>"only -or-",1=>"all but"},
        ) . qq[<br />the following <b>sections</b>:];
    $html .= qq[<ul>] . $q->table(
        map(
            "\n    "
            . $q->Tr(
                map "\n        " . $q->td(
                    $q->checkbox(
                        -name => $abbr{$types[$_]},
                        -value => "",
                        -label => $desc{$types[$_]},
                        "scratchpad" eq $types[$_]
                            ? ( -disabled => "disabled" )
                            : (),
                        )
                ), @$_
            ), map( [ $_, $_+8, $_+16 ], 0..6 ), [7,15]
        ), $/
    ) . qq[</ul>\n];

    $html .= qq[<p>
        <i>Skip</i> <b>text</b> containing ]
        . $q->textfield( "BET", "", 60 )
        . qq[<br />(seperate strings with ]
        . $q->textfield( "BES", " ", 2 )
        . qq[ -- default is spaces)<br />
        (Does not exclude based on titles)</p>];

    $html .= qq[<p>
        <i>Skip</i> <b>titles</b> containing ]
        . $q->textfield( "HET", "", 60 )
        . qq[<br />(seperate strings with ]
        . $q->textfield( "HES", " ", 2 )
        . qq[ -- default is spaces)</p>];

    $html .= qq[</p><p>\n] . $q->radio_group(
        "xr", ["0","1"], "0", 1, {
            0 => "Include <b>root</b> nodes from selected sections",
            1 => "Don't include <b>root</b> nodes",
        },
    );

    $html .= qq[</p><p>\n] . $q->radio_group(
        "re", [qw( A S N )], "S", 1, {
            A => "Include <b>replies</b> from <i>any</i> section",
            S => "Include <b>replies</b> from <i>selected</i> sections
+",
            N => "<i>Don't</i> include <b>replies</b>",
        },
    );

    $html .= qq[\n<p>] . $q->submit("","Search") . qq[
        Please be patient after submitting your search.</p>\n];

    $html .= qq[</form>\n];

    $html .= qq[\n<!-- CGI::VERSION=$CGI::VERSION -->\n];

    return "<b>@errors</b>$html";
%]
[download]

Comment on (tye)Re: Newest Super Search Download Code


We don't bite newbies here... much
	PerlMonks