in reply to Re: RFC: Peer to Peer Conceptual Search Engine
in thread RFC: Peer to Peer Conceptual Search Engine

Lucy is very interesting in that it is a Perl port of the Java Apache Lucene/Solr engine, which I think YaCy is based on.

My search engine, if it can actually be called that, does not use full-text search, or any text search whatsoever, except possibly on the site description. It basically ignores all text and focuses on structured metadata.

It functions similarly to a public library's book classification system, where the classification code has no real relation to any of the actual words in the books it indexes; for example, all books on computer programming are represented by the code 005.
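
A minimal sketch of that idea in Perl (the code table is illustrative; only 005 and 005.13 come from the discussion here):

  use strict;
  use warnings;

  # The classification code is the index key; it has no relation
  # to the words inside the resources it classifies.
  my %shelf = (
      '005'    => 'Computer programming',
      '005.13' => 'Programming languages (e.g. Perl)',
      '005.74' => 'Data files and databases',
  );

  # "Browsing the 005 shelf": a prefix match on the code, no text search.
  my $query = '005';
  for my $code (sort keys %shelf) {
      print "$code  $shelf{$code}\n" if index($code, $query) == 0;
  }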

Personally, in many, if not most, instances, I'm looking for a topic to read about. I don't need every word in two dozen different books on a particular topic indexed.

In the public library, books on Perl are shelved under 005.13. What kind of mad lunatic would go into a library and expect the librarian to scan through every word of every book in the entire library system to find books containing a particular word or two? Yet on the internet, that is the status quo. The librarian just points a finger.

Comparatively speaking, what are the database requirements of full-text indexing versus this kind of conceptual indexing, which libraries have used for a hundred years and which has the added advantage of being language-independent?
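
As a rough back-of-the-envelope comparison (every number below is an assumption for illustration, not a measurement):

  use strict;
  use warnings;

  my $sites          = 1_000_000;  # assumed number of indexed sites
  my $avg_page_bytes = 75_000;     # assumed average page size
  my $index_ratio    = 0.5;        # assumed inverted-index size as a fraction of the corpus
  my $meta_bytes     = 64;         # assumed size of one faceted metadata string

  printf "Full-text index: ~%.1f GB\n", $sites * $avg_page_bytes * $index_ratio / 1e9;
  printf "Faceted index:   ~%.2f GB\n", $sites * $meta_bytes / 1e9;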

My "search engine" is, fundamentally, more a method for packaging and unpackaging metadata.

Everything I generally ever need or want to know about a website can be encapsulated in a metadata string which, more often than not, takes up less space in the database than the website's URL.
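
For instance, here is a sketch of what packaging and unpackaging such a string might look like (the facet order, field names, and URL are hypothetical, not my actual format):

  use strict;
  use warnings;

  # Hypothetical facet order: subject code : language : region : resource type
  my %site = (
      subject  => '005.13',
      language => 'en',
      region   => 'us',
      type     => 'tutorial',
  );

  # Package the metadata into one compact string...
  my $meta = join ':', @site{qw(subject language region type)};

  # ...and unpackage it again later.
  my %unpacked;
  @unpacked{qw(subject language region type)} = split /:/, $meta;

  my $url = 'http://www.example.com/some/longer/path/to/a/page.html';
  printf "metadata: %s (%d bytes)\n", $meta, length $meta;
  printf "URL:      %s (%d bytes)\n", $url,  length $url;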

A text search of an entire document can sometimes be useful, but wouldn't it usually be better to first narrow the text search down to the resources within a well-defined topic area?

So I'm interested, to some degree, in how to strip all that full-text searching machinery out, or at least give it a secondary status: use rarely, and only if really needed.

But a SUBJECT (like "Perl programming": 005.13) is just one facet of a website that can be encoded. As mentioned, there are many other facets that are often neglected by both website creators and search engines, or that can only be accessed through proprietary database systems. An events calendar, perhaps.
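
To illustrate facet-first searching across several such facets, here is a sketch over a few made-up records (the facet layout is the same hypothetical one as above):

  use strict;
  use warnings;

  # Hypothetical records: [ packed metadata string, description ]
  my @sites = (
      [ '005.13:en:us:tutorial', 'Learning Perl one-liners'     ],
      [ '005.13:de:de:tutorial', 'Perl introduction (in German)' ],
      [ '005.74:en:uk:event',    'Database conference 2020'      ],
  );

  # Stage 1: narrow by the SUBJECT facet alone -- no document text involved.
  my @hits = grep { (split /:/, $_->[0])[0] eq '005.13' } @sites;

  # Stage 2 (secondary, optional): a keyword search within the narrowed set.
  @hits = grep { $_->[1] =~ /one-liners/i } @hits;

  print "$_->[0]  $_->[1]\n" for @hits;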

Tom

Re^3: RFC: Peer to Peer Conceptual Search Engine
by PerlGuy(Tom) (Acolyte) on Jan 28, 2020 at 22:19 UTC
    The closest thing I ever came across in terms of an IDEAL search engine was the custom site search for Wiser.org.

    Over 100,000 groups and organizations, and countless individuals worldwide, were networked and organized through this social network, which would have been impossible without its unique multifaceted search interface.

    What happened to this social network? One day it was simply announced that the site was shutting down. All that remains, it seems, is some of the non-functional static pages archived on the Wayback Machine.

    https://en.m.wikipedia.org/wiki/Wiser.org

    Here is an Internet Archive page showing the deceptively simple search interface:

    https://web.archive.org/web/20120910002106/http://www.wiser.org/all/search?phrase=

    It had conceptual indexing of facets such as "Solutions" (to world problems, issues, and concerns), along with Organizations, Groups, People, Events, Resources, etc. These facets could also be searched simultaneously by language, location, and, if desired, keyword. I really loved that search engine.

    I may be a wee bit paranoid or something, but it seems nearly every trace of the original free, open-source WiserEarth API, and all of its documentation, has been scrubbed from the internet, including from the Internet Archive. If anyone has a tip on where it can still be found, I'd appreciate it.

    So this brings to the foreground one of the problems of centralized indexing. If a well-organized, worldwide social-activist community becomes problematic, it is all too easy to take out a central server. Or maybe the maintainers of the site just got tired of maintaining it. Either way, something that hundreds of thousands of world-betterment groups, organizations, and individual activists really depended on vanished.

    What essentially pulled all these groups and organizations together was a database with a functional search engine geared towards real human needs.

    Tom

      Your "conceptual indexing" sounds a lot like what's today called "social bookmarking", which tries to apply a similiar process to webpages as used in libraries. The Wikipedia page has a section "Comparison with search engines".

      The Wiser.org search API was probably derived from (or the same as) the WiserEarth API, which is still in the Internet Archive (FAQ and Documentation).

      I don't think there's active scrubbing going on; the "normal" entropic force is strong enough already, especially if the information in question needs active maintenance.
        I misspoke. What I meant was the open-source WiserEarth platform: the program(s) that ran the site, the backend rather than the frontend.

        I did just find it (I think; it looks like it, anyway) on SourceForge:

        https://sourceforge.net/projects/wiserplatform/files/wiserplatform/

        There are, I suppose, some parallels between my program, or indexing system, and social bookmarking. But social bookmarking, in practice, generally requires some proprietary methodology on a specific platform with an inaccessible database. Delicious, after passing through different hands, has gone by the wayside.

        Along with it went 180 million bookmarks.

        Presumably, that will happen sooner or later to every such proprietary service or "black box" type database on the internet.

        What is needed, IMO, is an internet standard similar to the Dewey Decimal System used for books in public libraries.

        What I've endeavored to produce is something more along the lines of Ranganathan's Colon Classification system:

        https://en.m.wikipedia.org/wiki/Faceted_classification

        Such a faceted metadata structure is compact and concise yet comprehensive. It is sufficiently flexible and extensible to encompass everything on the internet for the foreseeable future, yet structured enough to be computer-readable, i.e., it can be easily and reliably isolated from whatever else appears in the source code of a website (using regular expressions).
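
        A sketch of that isolation step, assuming the string were embedded in a meta tag (the tag name "concept-index" and the facet layout are made up for illustration):

          use strict;
          use warnings;

          # Hypothetical page source with a faceted string embedded in a meta tag.
          my $html = '<html><head><title>Some Perl tutorial</title>'
                   . '<meta name="concept-index" content="005.13:en:us:tutorial">'
                   . '</head><body>...</body></html>';

          # The structured string can be isolated reliably, no matter
          # what else appears in the page source.
          if ($html =~ /<meta name="concept-index" content="([^"]+)"/) {
              my ($subject, $language, $region, $type) = split /:/, $1;
              print "subject=$subject language=$language region=$region type=$type\n";
          }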

        Tom