
Lucy is very interesting in that it is a Perl port of Apache's Java-based Lucene/Solr, which YaCy is based on, I think.

My search engine, if it can actually be called that, does not use full text search, or any actual text search whatsoever, except possibly on the site description. It basically ignores all text and just focuses on structured metadata.

It functions similarly to a public library book indexing system, where the indexing code has no real relation to any of the actual words in the books it indexes. For example, all books on computer programming are represented by the code 005.

Personally, in many, if not most instances, I'm looking for a topic to read about. I don't need every word in two dozen different books on a particular topic indexed.

In the public library, books on Perl are encoded with 005.13. What kind of mad lunatic would go into a library and expect the librarian to scan through every word of every book in the entire library system to find books containing a particular word or two? Yet, on the internet, that is the status quo. The librarian just points a finger.
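A minimal sketch of that "pointing a finger" lookup, assuming a hypothetical %catalogue that maps each URL to a Dewey-style subject code (the hash, the sub name and the example.org entry are made up for illustration, not taken from my actual code). Because the codes are hierarchical, a plain prefix match on 005 also finds everything filed under 005.13:

```perl
use strict;
use warnings;

# Hypothetical catalogue: URL => Dewey-style subject code.
my %catalogue = (
    'https://www.perlmonks.org/' => '005.13',
    'https://www.perl.org/'      => '005.13',
    'https://example.org/chess'  => '794.1',
);

# Return the URLs filed under a subject code or any subdivision
# of it. No word of any page is ever examined.
sub sites_for_subject {
    my ($code) = @_;
    my @hits = sort grep { index($catalogue{$_}, $code) == 0 }
                    keys %catalogue;
    return @hits;
}

print "$_\n" for sites_for_subject('005');
```

The lookup only ever touches one short code per site, which is the whole point.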

Comparatively speaking, what are the database requirements of full text indexing vs this kind of conceptual indexing used in libraries for a hundred years, which has the added advantage of being language independent?

My "search engine" is, fundamentally, more a method for packaging and unpackaging metadata.

Everything I generally need or want to know about a website can be encapsulated in a metadata string which, more often than not, takes up less space in the database than the website's URL.
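To give a rough idea of the packaging and unpackaging, here is a sketch under assumptions: the key=value layout, the one-letter field names (s for subject, l for language, t for site type) and both sub names are hypothetical and only for illustration, not the actual format I use.

```perl
use strict;
use warnings;

# Hypothetical short field names:
#   s = subject code, l = language, t = site type.
sub pack_meta {
    my (%meta) = @_;
    return join ';', map { "$_=$meta{$_}" } sort keys %meta;
}

# Turn the packed string back into a key/value list.
sub unpack_meta {
    my ($string) = @_;
    return map { split /=/, $_, 2 } split /;/, $string;
}

my $url    = 'https://www.perlmonks.org/';
my $packed = pack_meta(s => '005.13', l => 'en', t => 'forum');
my %meta   = unpack_meta($packed);

printf "%s (%d chars) describes %s (%d chars)\n",
    $packed, length($packed), $url, length($url);
```

Even with three facets, the packed string in this toy case comes out shorter than the URL it describes.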

A text search of an entire document can sometimes be useful, but wouldn't it usually be better to first narrow the text search down to the resources within a well-defined topic area?

So, I'm interested, to some degree, in how to strip all that kind of full text searching stuff out. Or at least give it a secondary status: use rarely, and only if really needed.

But a SUBJECT (like "Perl programming": 005.13) is just one facet of a website that can be encoded. As mentioned, there are many other things that are often neglected by both website creators and search engines, or that can only be accessed through proprietary database systems. An events calendar, perhaps.
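A sketch of that two-step idea, assuming hypothetical site records that carry a subject code and a short description (the @sites list, the search sub and the example.org URLs are invented for the example). The subject filter runs first, so the word match only ever sees records already inside the topic area:

```perl
use strict;
use warnings;

# Hypothetical records: subject code plus a short description.
my @sites = (
    { url     => 'https://example.org/python-tutorial',
      subject => '005.13',
      desc    => 'Python programming tutorials' },
    { url     => 'https://example.org/python-care',
      subject => '597.96',
      desc    => 'Keeping a python as a pet' },
);

# Step 1: narrow by subject code (prefix match).
# Step 2: text search, but only within that subset.
sub search {
    my ($subject, $word) = @_;
    my @in_topic = grep { index($_->{subject}, $subject) == 0 }
                        @sites;
    return map  { $_->{url} }
           grep { $_->{desc} =~ /\Q$word\E/i } @in_topic;
}

print "$_\n" for search('005', 'python');
```

The same word, "python", shows up in both descriptions, but the subject code keeps the snake out of the programming results.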

Tom

In reply to Re^2: RFC: Peer to Peer Conceptual Search Engine by PerlGuy(Tom)
in thread RFC: Peer to Peer Conceptual Search Engine by PerlGuy(Tom)
