Computer Science Department,
Rutgers University,
New Brunswick, NJ 08903
Bell Communications Research,
445 South St., Morristown, NJ 07960
shklar@cs.rutgers.edu
A side-effect of the universal acceptance of the World-Wide Web
is an urgent need to provide Web access to the vast legacy
of existing heterogeneous information. This information ranges from
documents in a variety of proprietary (and sometimes obscure)
representation formats to engineering and financial databases, and
often may only be accessed through specialized vendor tools and locally
developed applications. Moreover, rapidly increasing sophistication in
presenting information on the Web is already forcing us to treat
ftp and gopher information sources, and even early
HTML pages as parts of the same legacy.
In this overview, our main focus is on the current state of technology rather than prospective research. We discuss existing tools, methods and architectures, but we also mention functioning prototypes. To date, the main efforts in providing Web access to legacy data have centered around the following directions:
HTML and
Java graphic
user interfaces for existing applications
[mil95, rfd95].
In Section 2, we discuss the use of alternative
technologies for building Web interfaces.
In Section 3, we discuss
methods and tools for providing direct Web access to existing heterogeneous
information.
We believe that the most promising approach is building logical data
models and using them to support all kinds of sophisticated
presentation of the original information on the World-Wide Web.
Section 4 describes current solutions to the problem
of building Web interfaces to stateful information systems. We discuss
a number of popular tricks that provide partial solutions. The emerging
mobile code systems should help us to do better than that in the near
future. In Section 5, we discuss some of
the common problems that arise when using
Data presentation on the Web is performed by the Web browsers and
controlled by
Fig. 1.
The
An interesting attempt to automate the process of building
Another way of passing information between the gateways and the browsers
is by using the Server-Side Includes
(
The most efficient way of interfacing a legacy system is to make
a gateway its own Web server. This job is greatly simplified by
the public availability of the
Fig. 2. Mobile code gateways.
The emerging
There have been numerous attempts to provide partial remedies for data
heterogeneity by implementing a variety of ever-changing
filters for format conversions. The filters are used to generate
Using the Multipurpose Internet Mail Extensions
(
The
There have been a number of attempts to create logical models of
distributed heterogeneous information and use these models to support
advanced Web presentation.
The Multimedia-Oriented Repository Environment
(
Fig. 3. Simplified Harness Architecture.
The objective of the
Closely related to this effort is the work on defining an Information
Repository Definition Language (
As in [per95], we define a legacy system
as a system that strongly resists modification and evolution.
Database management systems represent an important subset of legacy
information systems.
In Section 2, we described different ways
of interfacing independent gateway programs with the Web.
In this section, we discuss using such programs to provide Web
access to legacy information systems.
The most common problems that arise when building Web interfaces to
legacy information systems are caused by the statelessness of the
The most popular solution is to either pass state and encoded authorization
information in gateway
Even assuming there was an acceptable way of maintaining state information,
there remains a more difficult problem of supporting transactions
that span multiple requests. Once the request is serviced, the
connection between the Web server and the browser breaks.
When a follow-up request comes through, the information system's thread
associated with the previous request may not be around. Starting a new
thread and retracing earlier steps within the same transaction to bring
the thread into the right state is not always possible, and is quite
expensive when it is.
The
Fig. 4. CICS Internet Gateway.
IBM DB2 WWW Connection Version 1 is now generally available
for OS/2 and AIX operating systems via a
download
from the Web. New beta versions for Windows NT, Sun Solaris and HP-UX
are also available from the same site.
Of course, it only makes sense to install this gateway product on
a machine that is also running a Web server and has a local
According to [lin95], IBM's current work is on
providing a Web
Other
O2 Technology provides a
Oracle has announced the upcoming release of its
Oracle WebServer 2.0
Fig. 5. The Evolution of Computer-Human Interaction.
Apart from the purely technical concerns, the layout and functionality of
the
Problems with applying traditional interface design techniques to building
interactive applications on the Web are rooted in limitations of both
Fig. 6. Menus in HTML.
Consider a sample legacy application with the user interface that utilizes
a pop-down menu. Selecting an item from the menu results in an immediate
action. Pop-down menus are also available in
A successful attempt at re-engineering the graphic user interface to
provide the Web front-end for an existing application is reported in
[rfd95].
One interesting result of this effort is providing additional functionality
that was not available from the original system. This, of course,
is due to the fact that the old graphic user interface did not take
full advantage of the available functionality. To summarize, even within
the limitations imposed by the use of the current versions of
In this overview, we have discussed existing technology for building
Web interfaces to legacy information systems and for providing direct
Web access to legacy information. While
The World-Wide Web technology is advancing so rapidly that today's Web
data structures, including images, applets, and static and dynamic
HTML to build
graphic front-ends for legacy applications. We believe that the mobile
code systems should solve many of these problems.
A brief summary of the overview and our conclusions are presented in
Section 6, followed by acknowledgements.
2.0 Web Interfaces
HTML. When
presenting legacy data, the HTML
has to be generated dynamically by a gateway program that either
accesses the existing data directly or establishes a
TCP/IP connection with a legacy application.
Interaction between gateway programs and the Web servers is supported by the
Common Gateway Interface
(CGI) mechanism (fig. 1).
CGI and
NPH/CGI mechanisms.
CGI mechanism supports passing
information between the gateways and the browsers via
HTTP servers.
Output information, generated by a CGI
gateway, is passed on to an associated HTTP
server that adds an HTTP header
and passes everything to the browser (fig. 1).
Special No-Parse-Header
(NPH) gateways pass their outputs
directly to the browsers. Programmers are responsible for building
HTTP headers (fig. 1).
This provides the greatest flexibility and is not much of a chore.
With the early versions of CGI, the
NPH/CGI mechanism
was the only way for programmers to influence the treatment of information
by the Web browsers by assigning the MIME
[bor93] types, specifying error conditions, etc.
Recently, CGI programmers have been
given limited control over the generation of
HTTP headers by the Web servers.
Still, NPH/CGI gateways provide
slightly better performance and allow full control over the generation
of HTTP headers, which may be important
for passing meta-information, etc. For more information on the
CGI mechanism see
[far96].
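The difference between the two mechanisms can be illustrated with a minimal sketch. It assumes a gateway that returns an HTML page as a string; the function names and the sample page are hypothetical and are not taken from any of the cited systems. An ordinary CGI gateway emits only a partial header and relies on the server to complete the response, while an NPH gateway must build the full HTTP response itself.

```python
def cgi_response(body: str, content_type: str = "text/html") -> str:
    """Output of an ordinary CGI gateway: partial header, blank line, body.

    The HTTP server is expected to add the status line and any
    remaining headers before passing the result to the browser.
    """
    return f"Content-Type: {content_type}\r\n\r\n{body}"


def nph_response(body: str, content_type: str = "text/html",
                 status: str = "200 OK") -> str:
    """Output of an NPH gateway: a complete HTTP response.

    The programmer supplies the status line and every header, because
    the server passes the output through without parsing it.
    """
    payload = body.encode("utf-8")
    headers = [
        f"HTTP/1.0 {status}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body


if __name__ == "__main__":
    # Hypothetical page produced from a legacy record.
    page = "<html><body><p>Legacy record 42</p></body></html>"
    print(cgi_response(page))
    print(nph_response(page, status="200 OK"))
```

The NPH variant is where error conditions and meta-information would be set explicitly, for example by passing a different `status` string.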
CGI interfaces is
described in [pfi95]. The authors have designed
a declarative Interface Definition Language
(IDLE) to support the generation of
their CGI-based
translation servers. Such translation servers may be used
for building Web interfaces to stateful legacy applications
(Section 4). A prototype version of the
IDLE interpreter is available.
SSIs)
that force Web servers to parse documents and execute embedded
commands. However, letting users execute commands at the server
is a security risk, not to mention the cost of having the server parse
each requested HTML document.
Such cost may become quite significant under heavy load. Consequently,
few of the Web servers have the SSI
mechanism activated, further limiting its usefulness.
In our opinion, SSIs
are less secure, less efficient, and do not
present a viable alternative to the
CGI mechanism.
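The parsing cost mentioned above can be made concrete with a small sketch of what a server has to do for every requested document. It handles only the `echo` directive and deliberately omits `exec`, which is the source of the security risk; the function name and the rendering of unknown variables as "(none)" are assumptions of this sketch rather than a description of any particular server.

```python
import re

# Matches directives of the form <!--#echo var="NAME" -->
_ECHO = re.compile(r'<!--#echo\s+var="([^"]+)"\s*-->')


def expand_echo_directives(document: str, variables: dict) -> str:
    """Scan the whole document and substitute each echo directive.

    Every byte of the document is examined, even when it contains no
    directives at all, which is the per-request cost discussed above.
    Unknown variables are rendered as "(none)" (an assumption here).
    """
    return _ECHO.sub(
        lambda match: str(variables.get(match.group(1), "(none)")),
        document,
    )


if __name__ == "__main__":
    source = '<p>Updated: <!--#echo var="LAST_MODIFIED" --></p>'
    print(expand_echo_directives(source, {"LAST_MODIFIED": "1996-01-15"}))
```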
W3C
reference library,
which provides a general-purpose code base for implementing Web clients
and servers. The greater efficiency is due to eliminating
communications between the Web servers and application gateways.
This approach is the best choice for building commercial products though
the simplicity and flexibility of the CGI
mechanism make it preferable for research and prototyping efforts.
Java technology
[gos95] provides another viable alternative
by letting programmers establish TCP/IP
connections from a mobile program executing at the browser
(applet) directly to a legacy application
(fig. 2).
The complication is that such applets are
responsible for all communications with end-users, and considerable
effort may be required in setting up even a simple prototype.
In such an architecture, programmers are completely on their own,
because they are not taking any advantage of the World-Wide Web other
than presenting the applet's graphic user interface through the
browser.
3.0 Direct Access to Heterogeneous Information
HTML documents either off-line or at
run-time, using the
CGI mechanism. The off-line approach
requires substantial human and computing resources for the initial conversion
and maintenance of information. Maintaining the repositories presents the
additional dilemma of either creating new and updating existing information
in HTML, or
continuously managing evolving data in multiple formats.
The run-time approach helps to postpone the conversion until the information
is requested and eliminates problems with the initial conversion and
maintenance of information. However, the run-time conversion may not be
appropriate for some document formats (FrameMaker, etc.) for the following
reasons:
HTML
(may often require human post-processing).
MIME) [bor93],
supported by most Web browsers, helps to avoid data conversion
through the use of third-party presentation tools. However, it may require
renaming the original files because MIME's
type recognition mechanism relies on file extensions.
Adding support for new MIME
types often requires end users to obtain and install third-party tools.
Further, there is still a problem of logically linking together
individual documents and of accessing arbitrarily formatted data.
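The dependence on file extensions can be seen in a minimal sketch of extension-based type recognition. The table below is hypothetical and intentionally tiny; real servers and browsers load much larger mappings from configuration files.

```python
import os

# Hypothetical extension table, for illustration only.
MIME_TYPES = {
    ".html": "text/html",
    ".txt": "text/plain",
    ".gif": "image/gif",
    ".ps": "application/postscript",
}


def guess_mime_type(filename: str,
                    default: str = "application/octet-stream") -> str:
    """Pick a MIME type from the file extension alone.

    The file contents are never inspected, so a legacy file with a
    nonstandard name falls through to the default type.
    """
    extension = os.path.splitext(filename)[1].lower()
    return MIME_TYPES.get(extension, default)


if __name__ == "__main__":
    for name in ("manual.ps", "manual.v3", "README"):
        print(name, "->", guess_mime_type(name))
```

A PostScript file stored as `manual.v3` is reported as a generic byte stream, which is why renaming the original files may be required.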
OMNIS system
[cla95] has been
designed to provide access to library information that includes both
catalogs and digitized texts.
The scanned-in documents may contain images, PostScript or other formatted
information, and are stored in a database. At presentation time, the
OMNIS
gateway converts textual information to
HTML, while images are converted
to common MIME types before
being passed to the browser. This is quite feasible because
OMNIS has full
control over the format and representation of information that is stored
in its database.
Harvest [bow94]
provides support for extracting summaries
from distributed heterogeneous information and for executing searches
over these summaries.
Once the resources have been identified, the responsibility of accessing them
is handed over to the Web browsers.
Harvest
provides efficient and flexible methods of indexing widely distributed
information. MIME mappings
are used to provide access to the wide variety of information and the
problems that were described earlier still persist.
MORE) [eic94]
was designed as a set of CGI programs
that operate in conjunction with a stock
httpd server to provide
access to a relational database containing meta-information, which is
used at run-time to retrieve physical data. The meta-information is
entered into the database off-line by the human librarians.
WebMake [bae95]
introduces methods
for building Web structures over existing software, e.g. source and object
code for software systems. In WebMake,
meta-level
structural documents are used to create abstractions by
logically combining software modules or other structural documents.
A set of tools has been developed to provide a distributed
software development environment by utilizing the
CGI
mechanism. A specialized Web client is required to obtain full
access to the WebMake functionality.
HyperG [and95]
uses an object-oriented
database layer to provide information modeling and model maintenance
facilities in addition to integrated attribute and content-based search.
HyperG supports logical grouping of
documents into collections that may span multiple
HyperG servers. Special
cluster collections are used to group together related multimedia
and multi-lingual information. HyperG
uses its own HyperG Text Format
(HTF) that is converted to
HTML by the
HyperG servers when they respond to
HTTP requests.
Harness
system [shk94, ssk95]
is to provide rapid access to large amounts of
heterogeneous information in a distributed environment without any
relocation, restructuring, or reformatting of data.
Like MORE and
HyperG, the
Harness
system uses metadata for search and retrieval of heterogeneous information
(fig. 3).
It provides advanced search and browsing capabilities without
imposing constraints on information suppliers or creators.
Harness utilizes a stable abstract class
presentation hierarchy, which need not be modified to add terminal
classes that accommodate new types of information and new
indexing technologies. Harness
provides tools for the automatic generation of meta-data based on user
inputs and the analysis of existing information.
IRDL)
[shk95] - a high-level language for describing
information resources
and the desired logical structure of information repositories.
The language provides high flexibility in imposing abstractions on
heterogeneous information.
Presently, the IRDL
interpreter generates Harness
metadata entities. With the emergence of
active objects, it should become possible to perform the
direct generation of the Web data structures.
4.0 Interfacing Legacy Information Systems
HTTP protocol and servers.
HTTP's statelessness creates obvious
problems in communicating with stateful information systems and in
supporting proper authorization mechanisms.
URLs, or as
hidden fields. This, of course, is only possible with the dynamic
HTML pages. One trouble with this
solution is that it does not work with cached pages
that do not contain timely state and authorization information.
Most recent browsers either routinely re-evaluate dynamic
URLs or can be forced to always perform
the re-evaluation. Still, caching remains a problem.
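Both ways of carrying state in a dynamically generated page can be sketched as follows. The gateway path, the field names, and the helper functions are hypothetical; a real gateway would also have to protect the encoded authorization information, for example by signing or encrypting it, which this sketch does not attempt.

```python
from html import escape
from urllib.parse import parse_qs, urlencode, urlsplit


def url_with_state(base_url: str, state: dict) -> str:
    """Append the session state to a gateway URL as a query string."""
    return f"{base_url}?{urlencode(state)}"


def hidden_fields(state: dict) -> str:
    """Render the session state as hidden fields for an HTML form."""
    return "\n".join(
        f'<input type="hidden" name="{escape(str(key))}" '
        f'value="{escape(str(value))}">'
        for key, value in state.items()
    )


def state_from_url(url: str) -> dict:
    """Recover the state on the follow-up request (all values are strings)."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


if __name__ == "__main__":
    state = {"session": "a1b2", "step": 2}
    link = url_with_state("/cgi-bin/orders", state)
    print(link)
    print(hidden_fields(state))
    print(state_from_url(link))
```

Because the state travels inside the page itself, a cached copy of the page carries whatever state was current when it was generated, which is the caching problem described above.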
CGI mechanism is widely
used by DBMS vendors for building
interfaces to their products. IBM's
DB2 WWW Connection [lin95] is
a CGI gateway using the
Distributed Database Connection Service
(DDCS) to establish and maintain
connections (for the duration of a request) with
DBMS that reside on
DDCS-supported platforms.
DB2 or DDCS
installation. With all the performance and
security shortcomings of the CGI
mechanism, DB2 WWW Connection Version 1 provides enough flexibility
and is a reasonable first step in providing Web access to IBM databases.
CGI interface to
their on-line transaction processing system, known as
CICS™
(fig. 4, [lin95]),
which was designed to interact with
IBM 3270™ terminals.
The gateway captures the
3270 stream and converts it to
HTML. However, the problem of reconciling
the session-based CICS with the
stateless HTTP protocol remains.
In addition, numerous invocations of CGI
programs (one per request) may create real performance problems.
The initial versions of the
CICS Internet Gateway for AIX 2.1 and OS/2 version 2.0.1 or later
have been officially released and are available as IBM products.
DBMS vendors supply their
own gateways to the proprietary database systems (e.g.,
Oracle, Informix and O2 Technology).
Most of these products store the sign-up and state information
at the server, in their databases. This approach helps solve the
HTML page caching problem
but each access now requires an additional database query, and without
a reliable way of enforcing the sign-off procedure, new
problems of discarding and refreshing the state and authorization
information arise.
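The trade-off can be illustrated with a minimal sketch of server-side state, assuming an in-memory table in place of the vendor's database and an inactivity timeout in place of a reliable sign-off procedure. The class name, the timeout value, and the token format are all hypothetical.

```python
import time
import uuid


class SessionStore:
    """In-memory stand-in for the table a database gateway might keep."""

    def __init__(self, timeout_seconds: float = 900.0, clock=time.time):
        self._timeout = timeout_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._sessions = {}          # token -> (last_access, state)

    def sign_on(self, user: str) -> str:
        """Create a session and return the token to embed in pages."""
        token = uuid.uuid4().hex
        self._sessions[token] = (self._clock(), {"user": user})
        return token

    def lookup(self, token: str):
        """Return the state for a token, or None if unknown or expired.

        This is the additional query that every access now requires.
        """
        entry = self._sessions.get(token)
        if entry is None:
            return None
        last_access, state = entry
        now = self._clock()
        if now - last_access > self._timeout:
            # Discard state left behind by users who never signed off.
            del self._sessions[token]
            return None
        self._sessions[token] = (now, state)  # refresh on each access
        return state

    def sign_off(self, token: str) -> None:
        """Explicit sign-off; cannot be enforced from the server side."""
        self._sessions.pop(token, None)
```

Only the opaque token has to appear in generated pages, so a cached page no longer contains stale state, but choosing the timeout remains a judgment call.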
CGI
gateway for communicating with their applications. There is some confusion
in that the O2 Web Server is not an
HTTP server, but an O2 server
enhanced to establish connections with the O2
CGI gateway and to convert results of
the Object Query Language (OQL) queries to
HTML. Nevertheless, the product is
fairly robust and sports nice object visualization capabilities.
™, quite a few
steps beyond their current
CGI-based product. The new product is
supposed to include the following components:
The product is not yet available and it is impossible to say whether it
will live up to its expectations, but it is worth looking into once it
reaches the market.
™, which is their
own http server that doubles as an Oracle gateway.
™.
HTML-based server maintenance tools.
5.0 Web Front-Ends to Existing Applications
HTML front-ends to existing applications
alone may cause considerable anxiety on the part of end-users. In this
section, we comment on some of the Web-specific user interface issues.
A detailed discussion of the subject may be found in
[mil95].
HTML and the
HTTP protocol.
The HTML limitations were very
severe in version 1.0, but have
been steadily easing with the introduction of
forms,
frames,
etc. (fig. 5, [mil95]).
HTTP limitations make submitting
requests similar to running a batch job: data validation and error checking
are only available after submitting a request [mil95].
It is possible to define separate data validation requests,
but such requests are not greatly favored by the end-users
who have to wait for the data validation results to submit a real request.
With Java, much of the data validation and error checking can be
performed at the browser, which mostly eliminates the problem.
HTML, but no action would follow the
selection. To retain the functionality, the
menu must either be complemented with a submit button or replaced
with a set of hyperlinks that point to menu items
(fig. 6). Any such choice should
be based on the application context. It would not generally
suffice to simply map legacy graphical elements into the ones that
are supported by HTML.
Instead, user interfaces should be carefully analyzed and redesigned
to make the best out of the available technology.
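The two replacements for a pop-down menu can be sketched as a pair of page-generation helpers. The gateway path, the parameter name, and the menu items are hypothetical, and the markup is kept to the plain form and hyperlink elements discussed in this section.

```python
from html import escape
from urllib.parse import quote


def menu_with_submit(action: str, name: str, items: list) -> str:
    """Render a selection menu complemented with a submit button.

    Selecting an item does nothing until the button is pressed.
    """
    options = "\n".join(
        f'  <option value="{escape(value)}">{escape(label)}</option>'
        for value, label in items
    )
    return (
        f'<form action="{escape(action)}" method="get">\n'
        f'<select name="{escape(name)}">\n{options}\n</select>\n'
        '<input type="submit" value="Go">\n'
        "</form>"
    )


def menu_as_links(action: str, name: str, items: list) -> str:
    """Render the same menu as hyperlinks; one click triggers the action."""
    links = "\n".join(
        f'  <li><a href="{escape(action)}?{escape(name)}={quote(value)}">'
        f"{escape(label)}</a></li>"
        for value, label in items
    )
    return f"<ul>\n{links}\n</ul>"


if __name__ == "__main__":
    items = [("open", "Open"), ("save", "Save As")]
    print(menu_with_submit("/cgi-bin/app", "cmd", items))
    print(menu_as_links("/cgi-bin/app", "cmd", items))
```

The hyperlink form preserves the immediate action of the legacy menu at the cost of screen space, while the form variant stays compact but needs two user actions.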
HTML
and HTTP, it is possible to design
successful graphic user interfaces for the legacy applications. However,
the flexibility of the emerging mobile code systems makes them more
appropriate for the task.
6.0 Conclusions
Java and other emerging mobile code
systems provide new and exciting opportunities in presenting information
on the Web, we see them as complementary to the new developments in
the traditional HTML presentation
and the HTTP protocol.
HTML pages,
are likely to become obsolete tomorrow.
These structures represent a tremendous investment of money and talent
and cannot be recreated with every new step in the technological advance.
Consequently, methods that support cutting-edge presentation of existing
heterogeneous information have to progress as well. Similar methods
should be applicable to creating virtual Webs of the future, with
both navigation and presentation controlled by personalized
meta-information.
7.0 Acknowledgments
Many of the issues raised in this overview were discussed at the
Workshop on Web Access to Legacy Data that was organized at the
Fourth International WWW Conference'95 in Boston. I would like
to thank my workshop co-chair, Khitij Shah, and workshop presenters:
Chumki Basu, Donna Dillenberger, David Eichmann, William LeBlanc,
Yi-Jing Lin, Tim McCandless, Richard Miller, Louis Perrochon,
and Marco Ronchetti for stimulating discussions.
References