OASIS Search Engine

We are still in the process of setting things up on SourceForge, so apologies for the ascetic content and look.

An overview of the project's ideas and intents can be found below.
The project is a follow-up to our previous research work described here. That was an international research project that verified the concept and produced working GPL'ed code that can be downloaded.
The code needs cleaning up and further development, and new functionality needs to be supported. This update is in the planning stage, and the design of the new version is discussed in the Docs and Forums areas on SourceForge. We are interested in your opinion.

Why another Web search engine?

OASIS is a distributed search engine. Computational load is spread among participating nodes both at indexing and at query processing time. Distributed search has several advantages:

No single operator or organisation controls the provided content and services
Adding a new node does not require extensive computing or network resources
Different nodes can specialise in indexing specific topics or sources. For example, Web sites, library catalogues and newsgroup archives can be searched simultaneously.
If the number of nodes is sufficiently large, the total amount of computational resources available for processing a query can exceed what is available at large centralised search engines. New, cool, and more computationally intensive information retrieval techniques can thus be used.
In large-scale intranets it may be desirable to index documents in each local subnetwork separately. This makes the breadth of the search scalable from a local subnetwork to the intranet and to the Internet.

Can centralised search engines hack it?

Unlikely. The number of Web sites is growing, the number and size of stored documents grow even faster, and site contents get updated more and more often. Centralised systems cannot grow that fast, so they cover an ever-decreasing segment of the Net.

Outline of the OASIS architecture

There are two major roles, a collection and a broker. Collections store parts of the document index. Brokers receive queries from the users, choose the most appropriate collections, forward the queries to them, and merge the collections' responses into a final result set (see the diagram below).
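The broker's role described above can be sketched in a few lines of Python. This is only an illustration, not the OASIS code: the names broker_search, Collection and select are hypothetical, each collection is modelled as a plain function returning (document id, score) pairs, and the merge step simply assumes that scores from different collections are directly comparable, which a real broker would have to ensure.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a collection is a function that takes a query string
# and returns (document_id, score) pairs from its part of the index.
Collection = Callable[[str], List[Tuple[str, float]]]


def broker_search(query: str,
                  collections: Dict[str, Collection],
                  select: Callable[[str, List[str]], List[str]],
                  limit: int = 10) -> List[Tuple[str, float]]:
    """Choose collections, forward the query, and merge the responses."""
    chosen = select(query, list(collections))
    best: Dict[str, float] = {}
    for name in chosen:
        for doc_id, score in collections[name](query):
            # If two collections return the same document, keep the higher score.
            # Assumption: scores are comparable across collections.
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    # Highest score first; ties are broken by document id for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Toy usage with two made-up collections and a selector that picks all of them.
def music(query: str) -> List[Tuple[str, float]]:
    return [("m1", 0.9), ("shared", 0.4)]


def physics(query: str) -> List[Tuple[str, float]]:
    return [("p1", 0.7), ("shared", 0.6)]


def select_all(query: str, names: List[str]) -> List[str]:
    return names


results = broker_search("quantum", {"music": music, "physics": physics}, select_all)
```

In a real deployment the collections would be remote nodes, so the forwarding step would involve network calls rather than local function calls.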

The broker selects the collections for query propagation relying on the collection descriptions stored in the LDAP directory service. It is the responsibility of collections to create and update their descriptions in the specified format.

Topical collections

Distributed search is efficient only when each query is propagated to a relatively small number of nodes. When the total number of collections is large, this is possible only if the collections differ from one another in content. When each collection consists of documents belonging to a reasonably well-defined topic area, the broker can make a sensible query propagation decision. In fact, such decisions are based on term frequency statistics in collections and queries. Collections covering distinct, well-defined topics have distinctly different term usage statistics, thus allowing efficient query routing.
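To illustrate how term frequency statistics can drive routing, here is a minimal Python sketch. It is a stand-in, not the selection method OASIS actually uses: the function names, the description format (a term-to-count dictionary per collection) and the scoring rule (sum of the query terms' relative frequencies) are all assumptions made for the example.

```python
from typing import Dict, List


def score_collection(query_terms: List[str], term_counts: Dict[str, int]) -> float:
    """Sum the relative frequency of each query term in one collection."""
    total = sum(term_counts.values())
    if total == 0:
        return 0.0
    return sum(term_counts.get(term, 0) / total for term in query_terms)


def select_collections(query: str,
                       descriptions: Dict[str, Dict[str, int]],
                       k: int = 2) -> List[str]:
    """Return the names of up to k collections that best match the query."""
    terms = query.lower().split()
    scored = [(score_collection(terms, counts), name)
              for name, counts in descriptions.items()]
    # Collections that contain none of the query terms are never contacted.
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Made-up collection descriptions: term -> number of occurrences.
descriptions = {
    "astronomy": {"star": 50, "planet": 30, "orbit": 20},
    "cooking": {"recipe": 60, "oven": 40},
    "film": {"star": 10, "actor": 90},
}

chosen = select_collections("star orbit", descriptions, k=2)
# chosen is ["astronomy", "film"]; "cooking" shares no terms and is skipped.
```

The toy data shows why topical collections help: the query terms are concentrated in one collection, so the query reaches only a small fraction of the nodes.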

Indexing tool

Topical document indexes can come from multiple sources. If your site already has one or several focused topic areas, it may be sufficient to just index each area and advertise the corresponding number of collections in the directory service. The alternative is to use a Web robot to discover relevant pages on the Net.

A topic-oriented Web Crawler is a part of our project. It takes 50-200 documents relevant to the collection topic as a sample and searches the Net for documents similar to that sample. Periodic manual inspection of the documents returned by the Crawler is necessary, but it still takes far less time than a manual search.
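The idea of judging pages by their similarity to a sample can be sketched as follows. The text does not say how the Crawler measures similarity, so this Python example assumes a simple bag-of-words cosine similarity against the combined term counts of the sample; the function names and the 0.3 threshold are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def term_vector(text: str) -> Counter:
    """Count lower-cased, whitespace-separated terms in a text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def topic_profile(samples: Iterable[str]) -> Counter:
    """Combine the sample documents into one term-count profile."""
    profile: Counter = Counter()
    for text in samples:
        profile.update(term_vector(text))
    return profile


def is_relevant(page_text: str, profile: Counter, threshold: float = 0.3) -> bool:
    """Accept a page if it is similar enough to the sample profile.

    The threshold is an arbitrary illustrative value, not a recommended one.
    """
    return cosine(term_vector(page_text), profile) >= threshold


# Two tiny stand-ins for the 50-200 sample documents.
profile = topic_profile([
    "solar panels convert sunlight",
    "wind turbines and solar energy",
])

keep = is_relevant("solar energy from sunlight", profile)
skip = is_relevant("football match results", profile)
```

Pages that pass such a filter would still go through the periodic manual inspection mentioned above.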

Who might be interested in installing the OASIS software?

Web site admins can run a collection in order to provide search functionality for their sites. Their collection(s) can be used by a broker run on the same site to provide searches with scope limited to the administrator's sites. More importantly, the collection will be found by other brokers and receive queries with global scope from the users of those brokers. Note that receiving and logging queries provides a good view of the interests of the users. Site admins can also run a crawler to provide site visitors with links to pages belonging to certain subject areas (and residing on third party hosts).
System admins of large corporate intranets can run a broker and a collection in each of the local networks comprising the intranet. One-stop searches can be done, involving only a local network or, only when necessary, a larger part of the intranet or the whole Internet (using local or third party collections). Access rights of users to brokers and of brokers to collections can be set up according to the local policy. Crawlers can be used for focused harvesting of the Internet in search of documents on topics relevant to the company and specified by the administrator.
ISP and network admins can run a broker as a service to their customers and users. [It would be interesting to make use of the ISP's cache in a search engine. The search engine can increase rankings of pages already in the cache, use cache hits for estimating page quality, etc. Any ISP interested?]

Contact:

E-mail us at oasis-team@oasis-europe.org