PerlMonks  

Re: Running SuperSearch off a fast full-text index.

by creamygoodness (Curate)
on Jun 10, 2007 at 16:33 UTC ( [id://620322]=note )


in reply to Running SuperSearch off a fast full-text index.

dmitri,

I've long wanted to do exactly what you've proposed, but just haven't found the cycles before now. I would be excited to collaborate with you on it.

As for hosting, for the time being I can run the app at rectangular.com... and maybe we could set up a repository at code.google.com? ;)

In addition to the indexer and search applications, we'll need a spidering app that pulls down a local copy of each PerlMonks node. tye has granted permission to spider the site, and suggested the PerlMonks XML node view for getting at the content (see What XML generators are currently available on PerlMonks? for info). Here's an XML rendering of your original post as an example.

In the initial pull, we'd iterate over each node numerically, probably saving individual XML files to the file system, 1000 nodes per directory. Some nodes will present problems — reaped nodes, for instance — but the responses will always contain sufficient information to dispatch sensibly.
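To make the "1000 nodes per directory" layout concrete, here is a minimal sketch in Python (chosen only for illustration; the post names no language for the spider). The names `node_path`, `save_node`, and the zero-padded directory naming are assumptions, not anything agreed in this thread, and fetching the XML itself is left out.

```python
from pathlib import Path

NODES_PER_DIR = 1000  # from the post: 1000 nodes per directory


def node_path(root, node_id):
    """Return the file path where the XML for node_id would be stored.

    Assumption: directories are named by node_id // 1000, zero-padded
    to six digits so that they sort in numeric order.
    """
    if node_id < 1:
        raise ValueError("node ids are assumed to be positive integers")
    bucket = node_id // NODES_PER_DIR
    return Path(root) / f"{bucket:06d}" / f"{node_id}.xml"


def save_node(root, node_id, xml_text):
    """Write one node's XML to its bucket directory, creating it if needed."""
    path = node_path(root, node_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(xml_text, encoding="utf-8")
    return path
```

With this scheme, node 620322 would land under a directory named `000620`. Reaped or otherwise problematic nodes could be saved the same way and sorted out at indexing time, since the response is said to carry enough information to dispatch on.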

Keeping the locally mirrored data up to date presents some problems, especially with regard to updated text and node rep fluctuations. These problems will be trivial to solve should the service move onto perlmonks.org directly; some of them are solvable even when running remotely, as the total volume of data is not very large. In any case, freshness issues will not have a major impact on the user experience and people will have no trouble making sensible comparisons between the old and the new.

Once we have a corpus, the indexing and search apps will present familiar challenges for us both. It will be fun to tinker with the ranking algorithms, and I expect that the extremely demanding user base will provide us with lots of high-quality feedback. :)

What say? Sound like a plan?

Cheers,

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

Re^2: Running SuperSearch off a fast full-text index.
by dmitri (Priest) on Jun 10, 2007 at 16:48 UTC
    Marvin,

    It will be an honor to work with you on this project. What shall we call it?

      - Dmitri.

      How about "MonkSearch", if it's available? Have you ever set up a code.google.com project? I haven't.

      I figure we should set things up like a standard CPAN distro. Most of the code in module files, utility scripts in bin/, yada yada. Sound good? We can call the Perl modules whatever we want at first -- it doesn't matter until there's a public API, which there may never be.

      Are you willing to play the role of lead developer on the project? I believe you're subscribed to the KinoSearch list, so you've seen how collaborations have gone there -- there's often some gory back-end stuff that falls to me. For this project, I figure I'll end up spending significant time tweaking the ranking algo in response to user feedback once we have everything in place. In anticipation of that, it would be great if you could take responsibility for most of the high-level architecture. I think we'll have a more fruitful collaboration if you own the code and I play a secondary role.

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
        I have never set one up at code.google.com before either, but here it is: http://code.google.com/p/monk-search/ -- it seems easy to use so far. What is your google ID so that I can add you?

        Standard CPAN distro sounds like a plan.

        I can certainly play the role :-)... The gory back-end stuff will have to fall to you -- although I hope to learn more about KinoSearch's inner workings by the time we have something usable.

        I will create a "brain-dump" wiki page and we can derive the list of components and stuff we need to do from there. The first order of business will of course be to decide how and what to index. We will probably also want to use KinoSearch 0.20?
