Introduction

Leon Shklar and Kshitij Shah

Computer Science Department, Rutgers University, New Brunswick, NJ 08903
Bell Communications Research, 445 South Street, Morristown, NJ 07960

The accelerating explosion in the amount and variety of information has made it difficult to know what information exists, where it is located, and how to retrieve it. The information ranges from software artifacts to engineering and financial databases, and comes in different types (e.g., source code, e-mail messages, bitmaps) and representations (e.g., plain text, binary). This information has to be accessed through a variety of vendor tools and locally developed applications.

There have been numerous attempts to solve the problem by implementing a wide variety of ever-changing filters for format conversion. The filters generate HTML documents either off-line or at run-time through the CGI mechanism. The off-line approach requires substantial human and computing resources for the initial conversion and maintenance of information. Maintaining the repositories presents an additional dilemma: either new information is created and existing information updated directly in HTML, or changing data in different formats has to be managed and reconverted continuously.

The run-time approach postpones the conversion until the information is requested and eliminates the problems with the initial conversion and maintenance of information. However, run-time conversion may not be appropriate for some document formats (e.g., FrameMaker) for the following reasons:

  1. slowness of the conversion,
  2. quality of the generated HTML (may require human postprocessing).

Using MIME [bor93] helps to avoid the conversion through the use of native browsers but may require renaming the original files because the MIME type recognition mechanism relies on file extensions. Adding support for new MIME types may not be easy. Further, there is still the problem of logically linking together individual documents and of accessing arbitrarily formatted data. The WWW interface to the OMNIS system [cla95] adopts this approach. Documents that may contain images, PostScript, or other formatted information are converted on-the-fly to MIME-compliant types that are understood by the WWW browsers.
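The dependence on file extensions can be illustrated with a minimal sketch. It uses Python's standard `mimetypes` module as a stand-in for a server's extension-to-type table; the file names and the `recognize` helper are hypothetical, and the sketch does not reproduce the mechanism of any particular server or of OMNIS.

```python
import mimetypes

# A fresh MimeTypes instance starts from Python's built-in extension table,
# so the lookup below does not depend on system-wide mime.types files.
_db = mimetypes.MimeTypes()


def recognize(filename):
    """Return the MIME type inferred from the file extension, or None."""
    mime_type, _encoding = _db.guess_type(filename)
    return mime_type


if __name__ == "__main__":
    # The same PostScript report is recognized only once it carries the
    # expected extension; without it the type cannot be determined.
    for name in ("notes.txt", "report.ps", "report"):
        print(name, "->", recognize(name))
```

A file whose name lacks a known extension yields no type at all, which is why existing files may have to be renamed before they can be served this way.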

Harvest [bow94] provides support for extracting summaries from distributed heterogeneous information and for executing searches over these summaries. Once the resources have been identified, the responsibility for accessing them is handed over to the WWW browsers. Harvest provides efficient and flexible methods for indexing widely distributed information. However, MIME mappings are used to provide access to a wide variety of information, so the problems described earlier still persist.

There have been a number of other attempts to address the above concerns. The Multimedia-Oriented Repository Environment (MORE) [eic94] was designed as a set of CGI programs that operate in conjunction with a stock httpd server to provide access to a relational database. The database contains meta-information that is used at run-time to retrieve physical data. The meta-information is entered into the database off-line by human librarians.
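The general idea of consulting a relational store of meta-information at run-time to find the physical data can be sketched as follows. This is only an illustration under stated assumptions: the `meta` table, its columns, the sample rows, and the `build_catalog` and `locate` helpers are all hypothetical and are not taken from the MORE system.

```python
import sqlite3


def build_catalog():
    """Create an in-memory catalog, as a librarian might populate it off-line."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE meta (title TEXT, keywords TEXT, kind TEXT, location TEXT)"
    )
    # Hypothetical entries: each row describes one physical object.
    conn.executemany(
        "INSERT INTO meta VALUES (?, ?, ?, ?)",
        [
            ("Design notes", "design architecture", "plain text", "/docs/design.txt"),
            ("Project logo", "logo image", "bitmap", "/img/logo.gif"),
        ],
    )
    return conn


def locate(conn, keyword):
    """Return (location, kind) pairs whose keywords mention the given word."""
    rows = conn.execute(
        "SELECT location, kind FROM meta WHERE keywords LIKE ? ORDER BY location",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    catalog = build_catalog()
    print(locate(catalog, "logo"))
```

The physical files are never touched during the search; only the returned location is used to fetch the data once a match has been found.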

Hyper-G [and95] uses an object-oriented database layer to provide information structuring and link maintenance facilities in addition to fully integrated attribute and content search, a hierarchical access control scheme, support for multiple languages, interactive link editing, and point-and-click document insertion.

WebMake [bae95] introduces methods for building web structures over existing software, e.g., source and object code for software systems. A set of tools has been developed to provide a fully distributed software development environment by utilizing the CGI mechanism.

The objective of InfoHarness [shk94,shk95-1] is to provide rapid access to huge amounts of heterogeneous information in a distributed environment without any relocation, restructuring, or reformatting of data. Like the MORE system, InfoHarness uses metadata for search and retrieval of heterogeneous information. InfoHarness provides advanced search and browsing capabilities without imposing constraints on information suppliers or creators. It utilizes a stable abstract class hierarchy, which need not be modified to define terminal classes that accommodate new types of information and new indexing technologies. InfoHarness provides a number of tools for the automatic generation of metadata based on physical information.
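The role of a stable abstract hierarchy with replaceable terminal classes can be illustrated with a minimal sketch. All class and method names below (`InformationUnit`, `TextUnit`, `BitmapUnit`, `kind`, `metadata`) are hypothetical and chosen only for illustration; they are not the actual InfoHarness classes.

```python
from abc import ABC, abstractmethod


class InformationUnit(ABC):
    """Abstract base class: stays unchanged when new data types are added."""

    def __init__(self, location):
        self.location = location

    @abstractmethod
    def kind(self):
        """Name of the information type handled by a terminal class."""

    def metadata(self):
        # Shared behavior lives in the abstract layer and relies only on
        # the interface that every terminal class must implement.
        return {"location": self.location, "kind": self.kind()}


class TextUnit(InformationUnit):
    """Terminal class for plain-text data (hypothetical)."""

    def kind(self):
        return "plain text"


class BitmapUnit(InformationUnit):
    """Terminal class added later without modifying InformationUnit."""

    def kind(self):
        return "bitmap"


if __name__ == "__main__":
    for unit in (TextUnit("/docs/readme"), BitmapUnit("/img/logo")):
        print(unit.metadata())
```

Supporting a new type of information amounts to adding one more terminal class; the abstract layer and the code that depends on it are left as they are.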

Closely related to this effort is the work on defining IRDL [shk95-2], a declarative language for describing information sources and the desired structure of information repositories. The language provides considerable flexibility in imposing logical structure on heterogeneous physical information.

References:

[and95]
K. Andrews, F. Kappe, and H. Maurer. Serving Information to the Web with Hyper-G. WWW'95, April 10-14, Darmstadt, Germany.
[bae95]
M. Baentsch, G. Molter, and P. Sturm. WebMake: Integrating distributed software development in a structure-enhanced Web. WWW'95, April 10-14, Darmstadt, Germany.
[bor93]
N. Borenstein, N. Freed, and K. Moore. MIME (Multipurpose Internet Mail Extensions), Network Working Group. RFC 1521 and 1522.
[bow94]
C. Bowman, P. Danzig, D. Hardy, U. Manber, and M. Schwartz. The Harvest information discovery and access system. WWW'94, Chicago, IL, October 1994.
[cla95]
A. Clausnitzer, P. Vogel, and S. Wiesener. A WWW Interface to the OMNIS/Myriad Literature Retrieval Engine. WWW'95, April 10-14, Darmstadt, Germany.
[eic94]
D. Eichmann, T. McGregor, and D. Danley. Integrating Structured Databases Into the Web: The MORE System. WWW'94, Chicago, IL, October 1994.
[shk94]
L. Shklar, S. Thatte, H. Marcus, and A. Sheth. The "InfoHarness" Information Integration Platform. WWW'94, Chicago, IL, October 1994.
[shk95-1]
L. Shklar, A. Sheth, V. Kashyap, and K. Shah. Use of Automatically Generated Metadata for Search and Retrieval of Heterogeneous Information. Proceedings of CAiSE'95, June 12-16, 1995, Jyvaskyla, Finland, Springer-Verlag Lecture Notes in Computer Science #932, pp. 217-230.
[shk95-2]
L. Shklar, K. Shah, and C. Basu. Putting Legacy Data on the Web: A Repository Definition Language. WWW'95, April 10-14, Darmstadt, Germany.
