 
PerlMonks  

RFC: Peer to Peer Conceptual Search Engine

by PerlGuy(Tom) (Acolyte)
on Jan 28, 2020 at 10:37 UTC ( [id://11111964] )

I consider myself a very novice Perl programmer, though I've been studying and using Perl for, I don't even know how many years. 30 maybe. Still so much to learn and so little time.

I got into programming because I wanted a better search engine than any that were available, back in the day, say 1995. (I still think a better search engine is needed and possible).

But all the programmers I approached about my idea said it was impossible, or impossibly difficult.

I knew what needed to be done (or what I wanted at least, needed or not) from a user perspective, but I knew absolutely nothing about computer programming.

I thought, how hard could it be? So I started studying Perl, the best programming language around, as far as I could determine. Well, it was the only programming language whose study made me laugh and gave me joy. Perl was poetry, sometimes literally. Studying Perl made me happy and gave me hope. Best of all, it was open source.

After about ten years of reading old, heavy Perl textbooks from Books-a-Million's discount table, I finally got a web server with Perl running on a 386, typed a shebang line, and wrote a very simple CGI script, something like echo: a web page with a form field whose input the program returned, without the program mysteriously erasing itself.

Ten more years and I finally had my own basic proof-of-concept search engine up and running on a free web-hosting service, just barely, along with a primitive one-page-at-a-time web crawler.

All this, in order to prove that a search engine could find websites by parameters other than keywords: things like concepts.

The reason I needed and wanted a more capable search engine was because I had been selected to take charge of a research organization and publisher that was networking with thousands of other organizations. I had to keep abreast of what all of these other organizations were doing. Their events and activities. All of this mass of information had to be prioritized. Part of my work was to attend events. This involved knowing what events were scheduled far enough in advance so as to reserve a space or table, or just schedule a lunch meeting while in a particular region with various individuals.

Reading through thousands of organizations' fliers and newsletters and the like for the pertinent information was a never-ending task that was never completed. I wanted to automate it.

I wanted a computer to scan through all this material and find when and where events were happening around the globe, so that I could travel in a circuit and attend as many as possible. While traveling, I also wanted to be able to meet with various individuals.

Also, many of these organizations had ongoing activities, but some were run-of-the-mill, and some were high priority. Some activities could wait, some required immediate action.

All of this kind of information is now up on websites. But sorting through it all is still mostly a manual chore. Technology for extracting various kinds of metadata from a website is available, but not often used for what is really important.

What I needed in order to be effective, and to sort through the most important data and have it organized and prioritized by location, event schedule, and the like, was a search engine that could search on much more than keywords: event dates, locations, and categories such as agricultural, political, scientific, environmental, or human rights. There would need to be some means of prioritizing data by importance and urgency, and perhaps credibility and other such parameters. It would also have to cut across languages, that is, be language independent. In other words, if I want to find events relating to organic heirloom gardening around the world, I need to be able to search for that as a concept, regardless of whether the actual data is presented in a language I understand: German, French, Spanish, Portuguese, whatever.

So I incorporated all of these features into my search engine, and I think I can now demonstrate that everything I've described is possible. Oh, and of course, the search engine needs to be able to search on any or all of these parameters simultaneously.
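To make the idea concrete, here is a minimal Perl sketch of the kind of multi-parameter metadata search described above. The field names (concept, category, location, date, lang) and the concept codes are hypothetical, invented purely for illustration; the point is that a query can supply any subset of parameters, and matching on concept codes rather than text makes the search language independent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical metadata records, as a spider might store them.
my @records = (
    { concept => 'organic-heirloom-gardening', category => 'agricultural',
      location => 'DE', date => '2020-05-16', lang => 'de' },
    { concept => 'water-rights',               category => 'environmental',
      location => 'BR', date => '2020-03-02', lang => 'pt' },
);

# A query is just a hash of parameters; any or all may be given.
sub search {
    my ($query, @recs) = @_;
    return grep {
        my $rec = $_;
        # Every supplied parameter must match. The language of the
        # source page is irrelevant, because we match concept codes,
        # not words from the text.
        !grep { ($rec->{$_} // '') ne $query->{$_} } keys %$query;
    } @recs;
}

my @hits = search({ concept => 'organic-heirloom-gardening' }, @records);
print "$_->{location} $_->{date}\n" for @hits;   # DE 2020-05-16
```

Adding a second parameter, e.g. `{ concept => ..., location => 'DE' }`, simply narrows the same filter, which is how "any or all parameters simultaneously" falls out of the design.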

Now, I figure, if I can get a proof-of-concept search engine, spider, and database working as a mere self-taught, amateur Perl programmer, how much better could it be with some actual, experienced programmers working on it?

But my search engine still resides on a single server; it is therefore centralized. Ideally, I believe, it should be peer-to-peer, but I haven't learned how to do that kind of programming yet.

Recently, however, I discovered that an open source, peer-to-peer search engine has existed for about the past 15 years, unbeknownst to me. Unfortunately it is written in Java. It is fortunate, at least, that it is open source: https://YaCy.net

I don't know Java. But I learned some Perl, PHP, HTML, and CSS, and I started studying many computer languages before settling on Perl, so, how hard could it be?

I shall be studying Java while trying to reverse engineer and translate YaCy from Java into Perl (if possible), and somehow or other integrate it with my own search-engine metadata format.

Tom


Replies are listed 'Best First'.
Re: RFC: Peer to Peer Conceptual Search Engine
by soonix (Canon) on Jan 28, 2020 at 11:55 UTC
    To reverse engineer a system like YaCy, programming skills, even in both languages (Perl and Java), will not be sufficient. Much more important is knowledge of the underlying concepts (as opposed to the concepts that your search engine is to search/find) and understanding how they are connected.

    Why? Simply because the two languages differ in their concepts, and translating it 1:1 would result in a behemoth.

    Most probably the structure of YaCy is dictated (at least partially) by the structures that Java supports best, which are not necessarily those a Perl programmer would even consider. There might be similar, but not identical, libraries for both languages. And so on.

    While writing this, I stumbled over A Tagcloud For Cory Doctorow, P2P Homework and Lucy, which might be not usable, but interesting in this context.

      Lucy is very interesting in that it is a Perl port of Apache Lucene/Solr, which YaCy is based on, I think.

      My search engine, if it can actually be called that, does not use full-text search, or any actual text search whatsoever, except possibly the site description. It basically ignores all text and focuses solely on structured metadata.
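A metadata-only approach like that might look something like the following sketch. The page content and the meta field names here are invented for illustration, and the extraction uses a naive regex purely to keep the example self-contained; a real crawler would use a proper parser such as HTML::Parser or Mojo::DOM.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy page: the body text is ignored entirely; only the
# structured metadata in the head is indexed.
my $html = <<'HTML';
<html><head>
<meta name="description" content="Seed-saving workshop">
<meta name="event-date" content="2020-05-16">
<meta name="location" content="Freiburg, DE">
</head><body>Lots of free text we never look at.</body></html>
HTML

# Naive extraction for illustration only; this regex assumes
# double-quoted attributes in name-then-content order.
my %meta;
while ($html =~ /<meta\s+name="([^"]+)"\s+content="([^"]*)"/g) {
    $meta{$1} = $2;
}

print "$_: $meta{$_}\n" for sort keys %meta;
```

The resulting hash of fields, not the body text, is what would be stored and searched.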

        The closest thing I ever came across in terms of an IDEAL search engine was the custom site search for Wiser.org

        Over 100,000 groups and organizations, and unnumbered individuals, were networked and organized worldwide through this social network, which would have been impossible without its unique multifaceted search interface.

        What happened to this social network? One day, it was simply announced that the site was shutting down. All that remains, it seems, is some of the non-functional static pages archived on the Wayback Machine.

        https://en.m.wikipedia.org/wiki/Wiser.org

        Here is an Internet Archive page showing the deceptively simple search interface:

        https://web.archive.org/web/20120910002106/http://www.wiser.org/all/search?phrase=

        It had conceptual indexing of facets such as "Solutions" (to world problems, issues, and concerns), along with Organizations, Groups, People, Events, Resources, etc. These facets could be searched simultaneously by language, location, and, if desired, keyword. I really loved that search engine.

        I may be a wee bit paranoid or something, but it seems nearly every trace of the original free, open source WiserEarth API, and all its documentation, has been scrubbed from the internet, including the Internet Archive. If anyone has a tip on where it can still be found, I'd appreciate it.

        So this brings to the foreground one of the problems of centralized indexing: if a well-organized, worldwide, social-activist community becomes problematic, it is all too easy to take out a central server. Or maybe the maintainers of the site just got tired of maintaining it. Either way, something that hundreds of thousands of world-betterment groups, organizations, and individual activists really depended on vanished.

        What essentially pulled all these groups and organizations together was a database with a functional search engine geared towards real human needs.

        Tom

Re: RFC: Peer to Peer Conceptual Search Engine
by PerlGuy(Tom) (Acolyte) on Feb 01, 2020 at 07:01 UTC
    If anyone is interested in helping do something with this, I could provide the rather amateurish, taped-together spider code I've managed to get running.

    Any help or advice on how to improve this would be much appreciated.

    Tom
