OASIS Search Engine
We are still in the process of setting things up on SourceForge, so apologies
for the ascetic content and look.
-
An overview of the project's ideas and intents can be found below.
-
The project is a follow-up to our previous research work described
here. That was an international research project that verified the concept
and provided working GPL'ed code that can be downloaded.
-
The code needs cleaning up and further development, and new functionality
needs to be supported. This update is in the planning stage, and the design of
the new version is discussed in the Docs and Forums areas on SourceForge.
We are interested in your opinion.
Why another Web search engine?
OASIS is a distributed search engine. Computational load is spread among
participating nodes both at indexing and at query processing time. Distributed
search has several advantages:
-
No single operator or organisation controls the provided content and services
-
Adding a new node does not require extensive computing or network resources
-
Different nodes can specialise in indexing specific topics or sources.
For example, Web sites, library catalogues and newsgroup archives can be
searched simultaneously.
-
If the number of nodes is sufficiently large, the total amount of computational
resources available for processing a query can exceed that available
at large centralised search engines. New, cool, and more computationally
intensive information retrieval techniques can thus be used.
-
In large-scale intranets it may be desirable to index documents in each
local subnetwork separately. This makes the breadth of the search scalable
from a local subnetwork to the intranet and to the Internet.
Can centralised search engines hack it?
Unlikely. The number of Web sites is growing, the number and size of
stored documents grow even faster, and site contents get updated more
and more often. Centralised systems cannot grow that fast and cover an
ever-decreasing segment of the Net.
Outline of the OASIS architecture
There are two major roles, a collection and a broker. Collections
store parts of the document index. Brokers receive queries from the users,
choose the most appropriate collections, forward the queries and merge
the collections' responses into a final result set (see the diagram below).
The broker selects the collections for query propagation relying on the
collection descriptions stored in the LDAP directory service. It is the
responsibility of collections to create and update their descriptions in
the specified format.
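The broker's role described above can be sketched roughly as follows. This is only an illustration, not the OASIS code: the names Hit, broker_search and select are hypothetical, a collection is modelled as a plain function, and the scores returned by different collections are assumed to be directly comparable (which a real broker would have to ensure).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    url: str
    score: float  # assumed comparable across collections (an assumption)


# Hypothetical model: a collection answers a query with a list of hits.
Collection = Callable[[str], List[Hit]]
# Hypothetical model: a selector picks collection names for a query.
Selector = Callable[[str, List[str]], List[str]]


def broker_search(query: str,
                  collections: Dict[str, Collection],
                  select: Selector,
                  limit: int = 10) -> List[Hit]:
    """Choose collections, forward the query, merge the responses."""
    chosen = select(query, list(collections))
    best: Dict[str, Hit] = {}
    for name in chosen:
        for hit in collections[name](query):
            # Keep the highest-scoring copy of a URL seen in several collections.
            if hit.url not in best or hit.score > best[hit.url].score:
                best[hit.url] = hit
    # Rank by descending score; break ties by URL for a stable order.
    return sorted(best.values(), key=lambda h: (-h.score, h.url))[:limit]
```

In this sketch the selector is passed in as a parameter, so the way collections are chosen (for example from their directory descriptions) stays separate from forwarding and merging.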
Topical collections
Distributed search is efficient only when each query is propagated to a
relatively small number of nodes. When the total number of collections
is large, this is possible only when the collections differ in content.
When the collections consist of documents belonging to a reasonably
well-defined topic area, the broker can make a sound query propagation
decision. In fact, such decisions are based on term frequency statistics
in collections and queries. Collections covering distinct well-defined
topics have distinctly different term usage statistics, thus allowing efficient
query routing.
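A minimal sketch of routing by term statistics, assuming each collection description lists how many of its documents contain each term. The scoring rule (sum of document-frequency fractions), the description layout and the names score_collection and select_collections are assumptions made for illustration; the selection method actually used by OASIS is not specified here.

```python
from typing import Dict, List


def score_collection(query_terms: List[str],
                     doc_freq: Dict[str, int],
                     num_docs: int) -> float:
    """Sum, over query terms, the fraction of documents containing the term."""
    if num_docs == 0:
        return 0.0
    return sum(doc_freq.get(term, 0) / num_docs for term in query_terms)


def select_collections(query: str,
                       descriptions: Dict[str, dict],
                       k: int = 2) -> List[str]:
    """Return up to k collection names with the highest non-zero score."""
    terms = query.lower().split()
    scored = [
        (score_collection(terms, d["doc_freq"], d["num_docs"]), name)
        for name, d in descriptions.items()
    ]
    # Collections that contain none of the query terms are never queried.
    scored = [(s, n) for s, n in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical descriptions of three topical collections.
descriptions = {
    "astronomy": {"num_docs": 100, "doc_freq": {"telescope": 60, "star": 80}},
    "cooking": {"num_docs": 200, "doc_freq": {"recipe": 150, "star": 4}},
    "law": {"num_docs": 50, "doc_freq": {"court": 40}},
}
print(select_collections("star telescope", descriptions, k=1))
```

Because the astronomy collection uses "star" and "telescope" far more often than the others, it scores well above them for this query, which is the effect of distinct term usage described above.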
Indexing tool
Topical document indexes can come from multiple sources. If your site already
has one or several focused topic areas, it may be sufficient to just index
each area and advertise the corresponding number of collections in the
directory service. The alternative is to use a Web robot to discover
relevant pages on the Net.
A topic-oriented Web Crawler is a part of our project. It takes 50-200
documents relevant to the collection topic as examples, and searches
the Net for documents similar to those samples. Periodic manual inspection
of the documents returned by the Crawler is necessary, but it still takes
far less time than a manual search.
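One simple way to judge whether a fetched page is "similar to the sample" is cosine similarity between term-count vectors. The sketch below assumes that measure, whitespace tokenisation and a fixed threshold; these choices and the function names are illustrative only and are not taken from the Crawler itself.

```python
import math
from collections import Counter
from typing import Iterable


def term_vector(text: str) -> Counter:
    """Count lower-cased, whitespace-separated terms."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_profile(samples: Iterable[str]) -> Counter:
    """Add up the term counts of all sample documents."""
    profile: Counter = Counter()
    for sample in samples:
        profile.update(term_vector(sample))
    return profile


def is_relevant(text: str, profile: Counter, threshold: float = 0.3) -> bool:
    """Accept a page whose similarity to the profile reaches the threshold."""
    return cosine(term_vector(text), profile) >= threshold


# Two tiny stand-ins for the 50-200 sample documents.
profile = topic_profile([
    "distributed search engine index",
    "search engine query broker",
])
print(is_relevant("search engine index", profile))
print(is_relevant("chocolate cake recipe", profile))
```

Pages accepted this way would still go through the periodic manual inspection mentioned above; the threshold only controls how many candidates reach the reviewer.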
Who might be interested in installing the OASIS software?
-
Web site admins can run a collection in order to provide
search functionality for their sites. Their collection(s) can be used by
a broker run on the same site to provide searches with scope limited
to the administrator's sites. More importantly, the collection will be found
by other brokers and receive queries with global scope from the users of
those brokers. Note that receiving and logging queries provides a good view
of the interests of the users. Site admins can also run a crawler to provide
the site visitors with links to pages belonging to certain subject areas
(and residing on third party hosts).
-
System admins of large corporate intranets can run a broker
and a collection in each of the local networks comprising the intranet.
One-stop searches can be done, involving only a local network or (only
when necessary) a larger part of the intranet, or the whole Internet (using
local or third party collections). Access rights of users to brokers and
brokers to collections can be set up according to the local policy. Crawlers
can be used for focused harvesting of the Internet in search of documents
on the topics relevant to the company and specified by the administrator.
-
ISP and network admins can run a broker as a service to their
customers and users. [It would be interesting to make use of the ISP's
cache in a search engine. The search engine can increase rankings of pages
already in the cache, use cache hits for estimating page quality, etc.
Any ISP interested?]
Contact:
E-mail us at oasis-team@oasis-europe.org