New simple search

As part of the More HTML Escaping roll-out, the simple search (at the top of each page or via the "node" CGI parameter) was switched away from using MySQL's "full text search" feature. This meant that you could once again search for 3-letter and 2-letter words in node titles. That version avoided the "worst case" situation of the servers sorting through far too many matches, but it would not find anything unless every word entered matched.

I've just rolled some more improvements into the simple search. The current implementation works like this:

If an exact title match is found (after ignoring nodes that you don't have permission to read unless you have changed your user settings), then no further searching is done.

Otherwise your search string is split on whitespace resulting in a list of "words". We look for nodes that contain the greatest number of your "words" in their titles as simple substrings. Titles that match this maximal number of words are listed, newest first. That is, if you specify 5 words and there are no titles that include 4 or more of your words but there is a title that contains 3 of your words, then you will only be shown titles that contain 3 of your words.
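
A minimal sketch of the ranking described above, in Python rather than the site's actual code. The function name `simple_search`, the `(node_id, title)` pairs, the case-insensitive comparison, and the idea that a higher node id means a newer node are all assumptions made for illustration. Permission checks and the special rule for 1-character words are left out.

```python
def simple_search(query, titles):
    """Hypothetical sketch. `titles` is a list of (node_id, title) pairs;
    a higher node_id is assumed to mean a newer node."""
    wanted = query.strip().lower()

    # Step 1: an exact title match ends the search immediately.
    exact = [(nid, t) for nid, t in titles if t.lower() == wanted]
    if exact:
        return exact

    # Step 2: split on whitespace; repeated words count once (assumption).
    words = set(query.lower().split())

    best, hits = 0, []
    for nid, title in titles:
        lowered = title.lower()
        # Each word scores if it appears anywhere in the title as a substring.
        score = sum(1 for w in words if w in lowered)
        if score == 0:
            continue
        if score > best:
            best, hits = score, []      # new maximum: drop weaker matches
        if score == best:
            hits.append((nid, title))

    # Only titles matching the maximal number of words remain, newest first.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits
```

With the titles "Perl script help" (id 1), "Shell script" (id 2) and "perl one-liner script tips" (id 3), a search for "perl script foo" returns nodes 3 and 1: no title contains all three words, so the titles containing two of them are the maximal matches.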

If there are more than 500 such matches, then the oldest 500 are listed (newest first). In the future this should change to showing the newest 500 matches, but that requires a database change to work around a subtle bug in the MySQL optimizer.

Future changes to the Search results display code will probably reduce clutter by hiding most of the information about replies if a large list of matches was found.

Note that 1-character words must be surrounded by whitespace in the node title for them to match (so "/ c" finds "C Client / Perl Server incompatibility" and its replies but little else -- note that the ends of titles count as whitespace).
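
The 1-character rule could be sketched like this in Python. The helper name `word_matches` is hypothetical, and the padding trick is just one way to make the ends of a title count as whitespace; the real search may do it differently.

```python
def word_matches(word, title):
    """Hypothetical helper: does one search word match one node title?"""
    word, title = word.lower(), title.lower()
    if len(word) == 1:
        # Collapse runs of whitespace and pad both ends with a space, so the
        # start and end of the title behave like whitespace.
        padded = " " + " ".join(title.split()) + " "
        return f" {word} " in padded
    # Longer words match as plain substrings.
    return word in title
```

Under these assumptions both "/" and "c" match "C Client / Perl Server incompatibility", while "c" does not match a title such as "Client code" and "/" does not match "editor/ide".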

Also, there are no "stop words". A search for "perl script" takes about the same time as a search for something much more specific.

More flexibility will be available via Super Search when it gets rewritten (hopefully RSN).

- tye (but my friends call me "Tye")

Re: New simple search by BazB (Priest) on Jul 08, 2002 at 20:52 UTC
If it isn't already implemented (or at least on the cards), would it be possible to have the Nodereaper's nodes excluded from search results?
Re: New simple search by Aristotle (Chancellor) on Jul 08, 2002 at 18:04 UTC
Again, thanks for the volunteer work. I played with the search a bit - looks nice. Might be confusing that it matches "side" for "ide" though.
Makeshifts last the longest.
(tye)Re: New simple search by tye (Sage) on Jul 08, 2002 at 18:23 UTC
That is completely intentional; it does "substring" matches, as I said. Note that it also matches "PerlIde4.2" when you search for "ide" and "compiling", "compiler", and "compiled" when you search for "compil". In my experience, in such a simple interface where you can only offer one or the other, the substring behavior is much preferable (especially when searching titles). You'll have more control with the next "Super Search".
Update: I just experimented with a change where searching for "ide" searches for both "ide" and " ide ", which means that a node title containing " ide " gets 2 points while one that only contains something like "side" gets only 1 point. Of course, that prevents the nodes containing "editor/ide" from showing up, and I'm not about to go back to regular expressions for simple search, nor try to determine what a good "word boundary" is and then document all of that complexity just for the simple search. But with the addition of a "search more" option in Search results, this modification might be used. And it gives me some ideas for "Super Search". Thanks!
- tye (but my friends call me "Tye")
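A rough Python sketch of the points idea from the update above. The name `score_word` is made up, and treating the ends of the title as whitespace is an assumption carried over from the 1-character rule; the experiment itself may not have done that.

```python
def score_word(word, title):
    """Hypothetical scoring: 1 point for a substring hit, 1 more if the
    word also appears surrounded by whitespace."""
    word = word.lower()
    padded = f" {title.lower()} "   # assumption: title ends act as whitespace
    points = 0
    if word in padded:
        points += 1                 # plain substring, e.g. "side"
    if f" {word} " in padded:
        points += 1                 # whitespace-delimited word, e.g. " ide "
    return points
```

Here "my ide setup" scores 2 for "ide" while "side effects" and "editor/ide" each score 1, which is why a title like "editor/ide" would drop out once any 2-point title exists.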
Re2: New simple search by blakem (Monsignor) on Jul 09, 2002 at 05:32 UTC
I like the points system... Though I might tweak the definition of a full word. How about "a substring bounded by non-letters (or the ends of the string)". Something like `(^|[^A-Za-z])ide([^A-Za-z]|$)` That approach has worked well for me in my search engine attempt. Especially considering titles like 'Using CGI::Cookie with HTML::Template' -Blake
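For illustration, the "bounded by non-letters" definition above can be tried out in Python like this. The helper name `full_word_pattern` is hypothetical, and the case-insensitive flag is an added assumption that is not part of the pattern as posted.

```python
import re

def full_word_pattern(word):
    """Build a regex that matches `word` only when it is bounded by
    non-letters or by the ends of the string."""
    return re.compile(
        r"(^|[^A-Za-z])" + re.escape(word) + r"([^A-Za-z]|$)",
        re.IGNORECASE,  # assumption: title matching ignores case
    )

ide = full_word_pattern("ide")
cgi = full_word_pattern("cgi")
```

With this pattern "ide" matches "editor/ide" but not "side effects" or "PerlIde4.2", and "cgi" matches 'Using CGI::Cookie with HTML::Template' because the colons count as word boundaries.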
(tye)Re2: New simple search by tye (Sage) on Jul 09, 2002 at 13:55 UTC
Re4: New simple search by blakem (Monsignor) on Jul 09, 2002 at 19:28 UTC
Re(2): New simple search by Cirollo (Friar) on Jul 08, 2002 at 18:26 UTC
Actually, the first match for "ide" is "infanticide" :-)
