The accelerating growth in the amount and variety of information has made it difficult to know what information exists, where it is located, and how to retrieve it. The information ranges from software artifacts to engineering and financial databases, and comes in different types (e.g., source code, e-mail messages, bitmaps) and representations (e.g., plain text, binary). It has to be accessed through a variety of vendor tools and locally developed applications.
There have been numerous attempts to solve the problem by implementing a wide variety of ever-changing filters for format conversion. The filters are used to generate HTML documents either off-line or at run-time through the CGI mechanism. The off-line approach requires substantial human and computing resources for the initial conversion and maintenance of information. Maintaining the repositories presents the additional dilemma of either creating new information and updating existing information in HTML, or continuously managing changing data in different formats.
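The run-time use of a conversion filter can be illustrated with a minimal sketch. It assumes a hypothetical filter table (`FILTERS`) keyed by file extension and a single plain-text filter; none of these names come from the systems discussed here, and the function only builds the text that a CGI program would write to its standard output.

```python
import html
import os

def text_to_html(raw: bytes) -> str:
    """Hypothetical filter: wrap plain text in <pre>, escaping HTML metacharacters."""
    return "<pre>" + html.escape(raw.decode("utf-8", errors="replace")) + "</pre>"

# Hypothetical filter table; every additional format needs its own entry.
FILTERS = {".txt": text_to_html}

def render_at_request_time(name: str, raw: bytes) -> str:
    """Convert a document only when it is requested, in the style of a CGI response."""
    extension = os.path.splitext(name)[1].lower()
    convert = FILTERS.get(extension)
    if convert is None:
        status, body = "415 Unsupported Media Type", "<p>No filter for this format.</p>"
    else:
        status, body = "200 OK", convert(raw)
    # A CGI response consists of header lines, a blank line, and then the body.
    return (
        f"Status: {status}\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        f"<html><body>{body}</body></html>"
    )
```

The sketch also makes the maintenance burden visible: each format that is not in the table is simply unavailable until someone writes and registers a filter for it.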
The run-time approach postpones the conversion until the information is requested and eliminates the problems with the initial conversion and maintenance of information. However, run-time conversion may not be appropriate for some document formats (FrameMaker, etc.) for the following reasons:
Harvest [bow94] provides support for extracting summaries from distributed heterogeneous information and for executing searches over these summaries. Once the resources have been identified, the responsibility for accessing them is handed over to the WWW browsers. Harvest provides efficient and flexible methods for indexing widely distributed information. MIME mappings are used to provide access to a wide variety of information, and the problems described earlier still persist.
There have been a number of other attempts to address the above concerns. The Multimedia-Oriented Repository Environment (MORE) [eic94] was designed as a set of CGI programs that operate in conjunction with a stock httpd server to provide access to a relational database that contains meta-information, which is used at run-time to retrieve physical data. The meta-information is entered into the database off-line by human librarians.
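The general idea of consulting meta-information at run-time to locate physical data can be sketched as follows. The table layout, column names, and sample rows are assumptions made for illustration only; they are not taken from MORE, and an in-memory SQLite database stands in for the relational database.

```python
import sqlite3

def build_catalog() -> sqlite3.Connection:
    """Create a hypothetical meta-information table, filled in advance (off-line)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE meta ("
        "id INTEGER PRIMARY KEY, title TEXT, media_type TEXT, location TEXT)"
    )
    conn.executemany(
        "INSERT INTO meta (title, media_type, location) VALUES (?, ?, ?)",
        [
            ("Design notes", "text/plain", "/repo/docs/design.txt"),
            ("System diagram", "image/gif", "/repo/figures/system.gif"),
        ],
    )
    return conn

def locate(conn: sqlite3.Connection, keyword: str) -> list[tuple[str, str, str]]:
    """Look up meta-information at request time; the physical data stays where it is."""
    cursor = conn.execute(
        "SELECT title, media_type, location FROM meta "
        "WHERE title LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()
```

A request handler would call `locate` and then open the returned location with whatever viewer suits the recorded media type, so only the meta-information has to be kept in the database.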
HyperG [and95] uses an object-oriented database layer to provide information structuring and link maintenance facilities in addition to fully integrated attribute and content search, a hierarchical access control scheme, support for multiple languages, interactive link editing, and point-and-click document insertion.
WebMake [bae95] introduces methods for building web structures over existing software, e.g., source and object code for software systems. A set of tools has been developed to provide a fully distributed software development environment by utilizing the CGI mechanism.
The objective of InfoHarness [shk94,shk95-1] is to provide rapid access to huge amounts of heterogeneous information in a distributed environment without any relocation, restructuring, or reformatting of data. Like the MORE system, InfoHarness uses metadata for search and retrieval of heterogeneous information. InfoHarness provides advanced search and browsing capabilities without imposing constraints on information suppliers or creators. It utilizes a stable abstract class hierarchy, which need not be modified to define terminal classes that accommodate new types of information and new indexing technologies. InfoHarness provides a number of tools for the automatic generation of metadata based on physical information.
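The role of a stable abstract hierarchy with terminal classes can be shown with a small sketch. The class names (`InformationUnit`, `TextUnit`, `BitmapUnit`) and the methods are hypothetical and are not the actual InfoHarness classes; the point is only that a new information type is supported by adding a terminal class, while the abstract base and the search code stay unchanged.

```python
from abc import ABC, abstractmethod

class InformationUnit(ABC):
    """Hypothetical abstract base: metadata plus a pointer to the physical data."""

    def __init__(self, location: str, metadata: dict[str, str]):
        self.location = location  # the data itself is never moved or reformatted
        self.metadata = metadata

    def matches(self, term: str) -> bool:
        """Search runs over metadata only, so it is the same for every type."""
        return any(term.lower() in value.lower() for value in self.metadata.values())

    @abstractmethod
    def render(self) -> str:
        """Each terminal class decides how its kind of information is presented."""

class TextUnit(InformationUnit):
    def render(self) -> str:
        return f"view text at {self.location}"

class BitmapUnit(InformationUnit):
    # A terminal class added later; the base class and search() are untouched.
    def render(self) -> str:
        return f"display image at {self.location}"

def search(units: list[InformationUnit], term: str) -> list[str]:
    """Return the presentation of every unit whose metadata mentions the term."""
    return [unit.render() for unit in units if unit.matches(term)]
```

For example, `search(units, "parser")` over a mixed list of `TextUnit` and `BitmapUnit` objects returns one rendering per matching unit without the caller knowing which terminal classes exist.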
Closely related to this effort is the work on defining IRDL [shk95-2], a declarative language for describing information sources and the desired structure of information repositories. The language offers considerable flexibility in imposing logical structure on heterogeneous physical information.
References: