
dmitri,

I've long wanted to do exactly what you've proposed, but just haven't found the cycles before now. I would be excited to collaborate with you on it.

As for hosting, for the time being I can run the app at rectangular.com... and maybe we could set up a repository at code.google.com? ;)

In addition to the indexer and search applications, we'll need a spidering app that pulls down a local copy of each PerlMonks node. tye has granted permission to spider the site, and suggested the PerlMonks XML node view for getting at the content (see What XML generators are currently available on PerlMonks? for info). Here's an XML rendering of your original post as an example.

In the initial pull, we'd iterate over each node numerically, probably saving individual XML files to the file system, 1000 nodes per directory. Some nodes will present problems — reaped nodes, for instance — but the responses will always contain sufficient information to dispatch sensibly.
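To make the initial pull concrete, here is a rough sketch in Python of how the numeric walk and the 1000-nodes-per-directory layout might look. None of this is settled: the `displaytype=xml` URL form, the zero-padded directory names, the one-second delay, and the function names `node_path`, `fetch_node`, and `mirror` are all assumptions made for illustration.

```python
import time
import urllib.request
from pathlib import Path

NODES_PER_DIR = 1000  # from the post: 1000 nodes per directory

# Assumption: the XML node view is reachable via displaytype=xml.
URL_TEMPLATE = "https://www.perlmonks.org/?node_id={node_id};displaytype=xml"


def node_path(root, node_id):
    """Map a node id to root/<bucket>/<node_id>.xml, 1000 ids per bucket."""
    bucket = node_id // NODES_PER_DIR
    # Zero-padded bucket names are an illustrative choice, not a requirement.
    return Path(root) / f"{bucket:06d}" / f"{node_id}.xml"


def fetch_node(node_id):
    """Fetch the raw XML for one node (hypothetical URL layout)."""
    url = URL_TEMPLATE.format(node_id=node_id)
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def mirror(root, first_id, last_id, fetch=fetch_node, delay=1.0):
    """Walk node ids in order and save each response to disk."""
    for node_id in range(first_id, last_id + 1):
        path = node_path(root, node_id)
        if path.exists():
            continue  # already saved, so an interrupted pull can resume
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(node_id))
        time.sleep(delay)  # pause between requests to go easy on the site
```

Something like `mirror("nodes", 1, 5000)` would then fill `nodes/000000/` through `nodes/000005/`. Problem nodes such as reaped ones are saved as-is here; deciding what to do with them is left to a later pass over the stored XML.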

Keeping the locally mirrored data up-to-date presents some problems, especially with regard to updated text and node rep fluctuations. These problems will be trivial to solve should the service move onto perlmonks.org directly; some of them are solvable even when running remotely, as the total volume of data is not very large. In any case, freshness issues will not have a major impact on the user experience, and people will have no trouble making sensible comparisons between the old and the new.

Once we have a corpus, the indexing and search apps will present familiar challenges for us both. It will be fun to tinker with the ranking algorithms, and I expect that the extremely demanding user base will provide us with lots of high-quality feedback. :)

What say? Sound like a plan?

Cheers,

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

In reply to Re: Running SuperSearch off a fast full-text index. by creamygoodness
in thread Running SuperSearch off a fast full-text index. by dmitri
