PerlMonks  

Re^2: Running SuperSearch off a fast full-text index.

by creamygoodness (Curate)
on Jun 10, 2007 at 14:30 UTC


in reply to Re: Running SuperSearch off a fast full-text index.
in thread Running SuperSearch off a fast full-text index.

Because we have access to metadata that Google's naive crawler does not, we enjoy certain advantages when building a custom search. Certainly we can offer bells and whistles on the Super Search page that Google's advanced search can't match: it can't filter by author, rank by node reputation, and so on.

I am confident that our users would find a KinoSearch-based Super Search considerably more usable than the current version, and that this would make them very happy. Programmers like to tweak tweak tweak. :) As a bonus, I also suspect that we can provide simple search results superior to what Google can offer, and certainly better than what we have now. It will be interesting to compare search results before and after we factor node rep into our ranking algorithm.
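To make "factoring node rep into our ranking algorithm" concrete, here is a minimal Python sketch of one way it could work. It is not KinoSearch's API; the function name `blended_score`, the `weight` parameter, the log damping, and the sample hits are all assumptions made up for illustration.

```python
import math

def blended_score(text_score, reputation, weight=0.2):
    """Combine a full-text relevance score with node reputation.

    Hypothetical formula: reputation is log-damped so that a few very
    high-rep nodes do not swamp textual relevance, and negative
    reputation is treated as zero.
    """
    rep_boost = math.log1p(max(reputation, 0))
    return text_score * (1.0 + weight * rep_boost)

# Made-up hits: (node title, text relevance, node reputation)
hits = [
    ("node A", 1.00, 0),
    ("node B", 0.90, 50),
    ("node C", 0.95, 3),
]

# Sort best-first by the blended score.
ranked = sorted(hits, key=lambda h: blended_score(h[1], h[2]), reverse=True)
```

With `weight=0`, the ordering falls back to pure text relevance, which would make the before-and-after comparison straightforward.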

Whether or not it is worthwhile to maintain custom indexing and search for a public site depends on the site's size and the demands of its user-base. I expect that with several hundred thousand pages and extremely sophisticated users, we're well past the threshold. My guess is that the time it takes to maintain full-text search, including an advanced search interface, will be fully justified by a collective productivity increase. :)

SEO improvements to help web search engine spiders should probably be implemented regardless, because increasing this site's visibility will help people outside the site who are seeking answers to Perl questions. However, I understand the powers-that-be have historically had good reasons for clamping down on spider access.

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

Replies are listed 'Best First'.
Re^3: Running SuperSearch off a fast full-text index.
by educated_foo (Vicar) on Jun 10, 2007 at 21:02 UTC
    A lot of these are obsoleted by a good ranking function, which will tend to pull the best hits to the top even without the additional metadata. For example, a search for "rectangular humphrey" turns up this: "I'm starting to get offers from people who want to sponsor features in my CPAN distro, KinoSearch," which is very relevant -- I didn't realize you were the author of KinoSearch, which you are also suggesting as a platform.

    I agree that node ratings, etc., can be useful, but one of Google's big lessons is that quantity can beat quality: intelligent analysis of huge amounts of generic data can beat analysis of specialized data. This is particularly visible in its approach to natural language translation, but is nearly as important in search.

      I didn't realize you were the author of KinoSearch, which you are also suggesting as a platform.

      And I, in turn, was unaware that dmitri was my sock puppet. ;)

      intelligent analysis of huge amounts of generic data can beat analysis of specialized data.

      Sure, those techniques are powerful... The brute force "did you mean" stratagem[1] is tough to top, no question!

      As for whether we'll be able to deliver an overall improvement on the PerlMonks search experience, I guess we'll just have to present something, and people can vote with their clicks.

      [1] Major search engines decide what to suggest based on search history: what most people have typed in after misspelling something. This has proven superior to algorithms based on edit distance.
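The idea in footnote [1] can be sketched in a few lines of Python. This is only an illustration of suggesting whatever users most often typed next, not how any particular search engine implements it; the `build_suggestions` function, the `min_count` threshold, and the sample session log are hypothetical.

```python
from collections import Counter, defaultdict

def build_suggestions(sessions, min_count=2):
    """Map each query to the rewrite users most often typed right after it.

    `sessions` is a hypothetical query log: each entry is the list of
    queries one user typed, in order.
    """
    follow_ups = defaultdict(Counter)
    for queries in sessions:
        # Pair each query with the one typed immediately after it.
        for first, second in zip(queries, queries[1:]):
            if first != second:
                follow_ups[first][second] += 1

    suggestions = {}
    for query, counts in follow_ups.items():
        best, seen = counts.most_common(1)[0]
        # Only suggest rewrites that enough users actually made.
        if seen >= min_count:
            suggestions[query] = best
    return suggestions

# Made-up sessions for illustration.
sessions = [
    ["kinosearhc", "kinosearch"],
    ["kinosearhc", "kinosearch", "kinosearch index"],
    ["kinosearhc", "kino search"],
    ["perl regex", "perl regex lookahead"],
]
suggestions = build_suggestions(sessions)
```

No edit distance is computed anywhere: the suggestion comes purely from observed behavior, which is why it needs a large query log to work well.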

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
