App::MonkSearch
Sure, that'll work... Randy Kobes' CPAN search modules are under CPAN::Search::Lite, FWIW.
In this thread?
Probably not all in this thread. We might start a new thread called "MonkSearch - spider" to deal with spidering issues, for example.
I think it's important to facilitate participation by anyone in the PerlMonks community who wants to join in. The downside is that we might end up creating more messages than the PerlMonks threading model is optimized for, but this isn't that big a project and I think the volume will be manageable.
| [reply] |
You make a good point about other monks' participation. We can link to the threads we create from this thread's parent node. The only thing that baffles me is which category the threads related to the development of MonkSearch would belong to. They seem off-topic wherever we place them.
| [reply] |
I would also prefer discussion here and then a summary on the wiki.
OTOH, storing nodes locally in SQLite seems like overkill. With a good filesystem there is no reason to complicate the crawler with DBI code; just dump the files on disk.
| [reply] |
The reason I think SQLite would be useful is that if we want to separate the spider from the indexer, finding the articles to update in the index is as simple as
SELECT * FROM ARTICLES WHERE LAST_UPDATED > $LAST_TIME_I_RAN
instead of searching the filesystem. Stored on the filesystem, we will need code to
- search,
- store, and
- update
the documents. SQLite provides all of that for free. Want to move to a different machine? The database is a single file. Plus, who knows what other useful things SQLite's flexibility will allow us to do?
| [reply] |
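To make the spider/indexer split above concrete, here is a minimal sketch of the store, update, and "what changed since my last run" steps. It is written in Python with the standard sqlite3 module only so that it is self-contained; the SQL itself should carry over to DBI with DBD::SQLite. The table layout (node_id, body, last_updated) and the function names are assumptions made for illustration, not anything MonkSearch has settled on.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the article store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               node_id      INTEGER PRIMARY KEY,
               body         TEXT    NOT NULL,
               last_updated INTEGER NOT NULL  -- epoch seconds set by the spider
           )"""
    )
    return conn


def store_article(conn, node_id, body, fetched_at):
    """Spider side: INSERT OR REPLACE handles both 'store' and 'update'."""
    conn.execute(
        "INSERT OR REPLACE INTO articles (node_id, body, last_updated) "
        "VALUES (?, ?, ?)",
        (node_id, body, fetched_at),
    )
    conn.commit()


def updated_since(conn, last_run):
    """Indexer side: fetch only the articles touched after the previous run."""
    cursor = conn.execute(
        "SELECT node_id, body FROM articles "
        "WHERE last_updated > ? ORDER BY node_id",
        (last_run,),
    )
    return cursor.fetchall()
```

The timestamp is passed as a bound parameter rather than interpolated into the SQL string, which avoids quoting problems. The indexer would only need to remember the time of its previous run and pass it as last_run.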
I've worked on spiders that have used the file system, and spiders that have used databases. It's certainly cheaper to use the file system. But thinking about the size of the dataset, we can easily afford to put these records into a database. There are only c. 600,000 records, and they're small -- not even full web pages. I like the idea of using SQLite.
| [reply] |