Web Access to Legacy Data

Leon Shklar

Computer Science Department, Rutgers University, New Brunswick, NJ 08903
Bell Communications Research, 445 South St., Morristown, NJ 07960

shklar@cs.rutgers.edu

1.0 Introduction

A side-effect of the universal acceptance of the World-Wide Web is an urgent need to provide Web access to the vast legacy of existing heterogeneous information. This information ranges from documents in a variety of proprietary (and sometimes obscure) representation formats to engineering and financial databases, and often may only be accessed through specialized vendor tools and locally developed applications. Moreover, rapidly increasing sophistication in presenting information on the Web is already forcing us to treat FTP and Gopher information sources, and even early HTML pages, as parts of the same legacy.

In this overview, our main focus is on the current state of technology rather than prospective research. We discuss existing tools, methods and architectures, but we also mention functioning prototypes. To date, the main efforts in providing Web access to legacy data have centered around the following directions:

  1. Advancing the Web gateway technology.
  2. Providing direct Web access to heterogeneous information [and95, eic94, ssk95].
  3. Building Web interfaces to stateful information systems [per95, pfi95, ron95].
  4. Designing HTML and Java graphic user interfaces for existing applications [mil95, rfd95].

In Section 2, we discuss the use of alternative technologies for building Web interfaces. In Section 3, we discuss methods and tools for providing direct Web access to existing heterogeneous information. We believe that the most promising approach is building logical data models and using them to support all kinds of sophisticated presentation of the original information on the World-Wide Web. Section 4 describes current solutions to the problem of building Web interfaces to stateful information systems. We discuss a number of popular tricks that provide partial solutions. The emerging mobile code systems should help us to do better than that in the near future. In Section 5, we discuss some of the common problems that arise when using HTML to build graphic front-ends for legacy applications. We believe that the mobile code systems should solve many of these problems. A brief summary of the overview and our conclusions are presented in Section 6, followed by acknowledgements.

2.0 Web Interfaces

Data presentation on the Web is performed by the Web browsers and controlled by HTML. When presenting legacy data, the HTML has to be generated dynamically by a gateway program that either accesses the existing data directly or establishes a TCP/IP connection with a legacy application. Interaction between gateway programs and the Web servers is supported by the Common Gateway Interface (CGI) mechanism (fig. 1).

Fig. 1. CGI and NPH/CGI mechanisms.

The CGI mechanism supports passing information between the gateways and the browsers via HTTP servers. Output information, generated by a CGI gateway, is passed on to an associated HTTP server that adds an HTTP header and passes everything to the browser (fig. 1). Special No-Parse-Header (NPH) gateways pass their outputs directly to the browsers. Programmers are responsible for building HTTP headers (fig. 1). This provides the greatest flexibility and is not much of a chore. With the early versions of CGI, the NPH/CGI mechanism was the only way for programmers to influence the treatment of information by the Web browsers by assigning the MIME [bor93] types, specifying error conditions, etc. Recently, CGI programmers have been given limited control over the generation of HTTP headers by the Web servers. Still, NPH/CGI gateways provide slightly better performance and allow full control over the generation of HTTP headers, which may be important for passing meta-information, etc. For more information on the CGI mechanism see [far96].
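
The difference between the two mechanisms can be illustrated with a minimal sketch in Python. The function names and the choice of headers are assumptions made for illustration only. An ordinary CGI gateway emits a few header lines and the body, leaving the status line to the HTTP server, while an NPH gateway has to emit the complete response itself.

```python
def cgi_response(body: str, content_type: str = "text/html") -> str:
    """Output of an ordinary CGI gateway (hypothetical helper).

    Only the header lines the gateway wants to set, a blank line, and the
    body. The HTTP server is expected to add the status line and the rest.
    """
    return f"Content-Type: {content_type}\r\n\r\n{body}"


def nph_response(body: str, content_type: str = "text/html",
                 status: str = "200 OK") -> str:
    """Output of an NPH gateway (hypothetical helper).

    A complete HTTP response, status line included, which the server
    passes to the browser without parsing.
    """
    headers = [
        f"HTTP/1.0 {status}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(body.encode('utf-8'))}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body
```

The extra work in the NPH case is limited to the status line and whatever headers the gateway chooses to send, which is the price of full control over the response.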

An interesting attempt to automate the process of building CGI interfaces is described in [pfi95]. The authors have designed a declarative Interface Definition Language (IDLE) to support the generation of their CGI-based translation servers. Such translation servers may be used for building Web interfaces to stateful legacy applications (Section 4). A prototype version of the IDLE interpreter is available.

Another way of passing information between the gateways and the browsers is by using Server-Side Includes (SSIs), which force Web servers to parse documents and execute embedded commands. However, letting users execute commands at the server is a security risk, not to mention the cost of having the server parse each requested HTML document. Such cost may become quite significant under heavy load. Consequently, few Web servers have the SSI mechanism activated, further limiting its usefulness. In our opinion, SSIs are less secure, less efficient, and do not present a viable alternative to the CGI mechanism.

The most efficient way of interfacing a legacy system is to make a gateway its own Web server. This job is greatly simplified by the public availability of the W3C reference library, which provides a general-purpose code base for implementing Web clients and servers. The greater efficiency is due to eliminating communications between the Web servers and application gateways. This approach is the best choice for building commercial products though the simplicity and flexibility of the CGI mechanism make it preferable for research and prototyping efforts.
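
As a rough illustration of a gateway that acts as its own Web server, the following sketch uses Python's standard http.server module rather than the W3C reference library mentioned above. The record table, the paths, and the port number are hypothetical stand-ins for a real legacy system.

```python
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the legacy system; a real gateway would call
# into the existing application or read its data files here.
LEGACY_RECORDS = {"/parts/1": "Widget, qty 12", "/parts/2": "Gear, qty 7"}


def render(path: str):
    """Return (status code, HTML page) for a request path."""
    record = LEGACY_RECORDS.get(path)
    if record is None:
        return 404, "<html><body><p>No such record</p></body></html>"
    return 200, f"<html><body><p>{escape(record)}</p></body></html>"


class GatewayHandler(BaseHTTPRequestHandler):
    """Answers HTTP requests directly, with no separate server or CGI hop."""

    def do_GET(self):
        status, html = render(self.path)
        body = html.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the gateway until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("localhost", port), GatewayHandler).serve_forever()
```

Calling serve() starts the gateway. Every request is answered inside the gateway process, so no data is handed from a Web server to a separate gateway program.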

Fig. 2. Mobile code gateways.

The emerging Java technology [gos95] provides another viable alternative by letting programmers establish TCP/IP connections from a mobile program executing at the browser (applet) directly to a legacy application (fig. 2). The complication is that such applets are responsible for all communications with end-users, and considerable effort may be required in setting up even a simple prototype. In such an architecture, programmers are completely on their own, because they are not taking any advantage of the World-Wide Web other than presenting the applet's graphic user interface through the browser.

3.0 Direct Access to Heterogeneous Information

There have been numerous attempts to provide partial remedies for data heterogeneity by implementing a variety of ever-changing filters for format conversions. The filters are used to generate HTML documents either off-line or at run-time, using the CGI mechanism. The off-line approach requires substantial human and computing resources for the initial conversion and maintenance of information. Maintaining the repositories presents the additional dilemma of either creating new and updating existing information in HTML, or continuously managing evolving data in multiple formats. The run-time approach postpones the conversion until the information is requested and eliminates problems with the initial conversion and maintenance of information. However, run-time conversion may not be appropriate for some document formats (FrameMaker, etc.) for the following reasons:

  1. slowness of the conversion,
  2. need to "tune" the filters to support local standards,
  3. low quality of generated HTML (may often require human post-processing).
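
A run-time filter of the kind discussed above can be sketched for the simplest possible input, plain text with paragraphs separated by blank lines. The function name is hypothetical, and filters for richer formats are far more involved, which is where the problems listed above come from.

```python
from html import escape


def text_to_html(text: str, title: str) -> str:
    """Convert plain text to a minimal HTML page (hypothetical filter).

    Paragraphs are assumed to be separated by blank lines; line breaks
    inside a paragraph are collapsed into single spaces.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n".join(
        "<p>" + escape(" ".join(p.split())) + "</p>" for p in paragraphs
    )
    return (
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>"
    )
```

A CGI gateway would call such a filter on each request and return its output, so the original file is never converted or stored in HTML.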

Using the Multipurpose Internet Mail Extensions (MIME) [bor93], supported by most Web browsers, helps to avoid data conversion through the use of third-party presentation tools. However, it may require renaming the original files because MIME's type recognition mechanism relies on file extensions. Adding support for new MIME types often requires end users to obtain and install third-party tools. Further, there is still a problem of logically linking together individual documents and of accessing arbitrarily formatted data.
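
The dependence of type recognition on file extensions can be seen with Python's standard mimetypes module, used here only as a convenient illustration of extension-based mapping. The file names are made up.

```python
import mimetypes


def guess(filename: str):
    """Return the MIME type inferred from the file extension, or None."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type


# A file without an extension cannot be typed this way, which is why
# legacy files may need to be renamed before they are served.
untyped = guess("chapter1")       # None
typed = guess("chapter1.html")    # "text/html"
```

Nothing in the file contents is inspected, so a legacy document stored under an arbitrary name gets no usable type until it is renamed.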

The OMNIS system [cla95] has been designed to provide access to library information that includes both catalogs and digitized texts. The scanned-in documents may contain images, PostScript or other formatted information, and are stored in a database. At presentation time, the OMNIS gateway converts textual information to HTML, while images are converted to common MIME types before being passed to the browser. This is quite feasible because OMNIS has full control over the format and representation of information that is stored in its database.

Harvest [bow94] provides support for extracting summaries from distributed heterogeneous information and for executing searches over these summaries. Once the resources have been identified, the responsibility of accessing them is handed over to the Web browsers. Harvest provides efficient and flexible methods of indexing widely distributed information. However, MIME mappings are used to provide access to the wide variety of information, so the problems described earlier still persist.

There have been a number of attempts to create logical models of distributed heterogeneous information and use these models to support advanced Web presentation. The Multimedia-Oriented Repository Environment (MORE) [eic94] was designed as a set of CGI programs that operate in conjunction with a stock httpd server to provide access to a relational database containing meta-information, which is used at run-time to retrieve physical data. The meta-information is entered into the database off-line by the human librarians.
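
The general idea of keeping meta-information in a relational database and consulting it at run-time to locate the physical data can be sketched as follows. This is not the MORE schema; the table layout, the sample rows, and the function names are assumptions made for illustration, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create a toy meta-information table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE meta (title TEXT, format TEXT, location TEXT)")
    # In the approach described above, rows like these are entered
    # off-line by librarians.
    conn.executemany(
        "INSERT INTO meta VALUES (?, ?, ?)",
        [
            ("Pump manual", "postscript", "/archive/pump.ps"),
            ("Valve drawing", "tiff", "/archive/valve.tif"),
            ("Turbine manual", "framemaker", "/archive/turbine.fm"),
        ],
    )
    return conn


def locate(conn: sqlite3.Connection, keyword: str):
    """Find the physical location and format of documents by title keyword."""
    return conn.execute(
        "SELECT title, format, location FROM meta "
        "WHERE title LIKE ? ORDER BY title",
        (f"%{keyword}%",),
    ).fetchall()
```

A gateway would run a query such as locate(conn, "manual") and then use the returned format and location to decide how to fetch and present each document.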

WebMake [bae95] introduces methods for building Web structures over existing software, e.g. source and object code for software systems. In WebMake, meta-level structural documents are used to create abstractions by logically combining software modules or other structural documents. A set of tools has been developed to provide a distributed software development environment by utilizing the CGI mechanism. A specialized Web client is required to obtain full access to the WebMake functionality.

HyperG [and95] uses an object-oriented database layer to provide information modeling and model maintenance facilities in addition to integrated attribute and content-based search. HyperG supports logical grouping of documents into collections that may span multiple HyperG servers. Special cluster collections are used to group together related multimedia and multi-lingual information. HyperG uses its own HyperG Text Format (HTF) that is converted to HTML by the HyperG servers when they respond to HTTP requests.

Fig. 3. Simplified Harness Architecture.

The objective of the Harness system [shk94, ssk95] is to provide rapid access to large amounts of heterogeneous information in a distributed environment without any relocation, restructuring, or reformatting of data. Like MORE and HyperG, the Harness system uses metadata for search and retrieval of heterogeneous information (fig. 3). It provides advanced search and browsing capabilities without imposing constraints on information suppliers or creators. Harness utilizes a stable abstract class presentation hierarchy, which need not be modified to add terminal classes that accommodate new types of information and new indexing technologies. Harness provides tools for the automatic generation of metadata based on user inputs and the analysis of existing information.

Closely related to this effort is the work on defining an Information Repository Definition Language (IRDL) [shk95] - a high-level language for describing information resources and the desired logical structure of information repositories. The language provides high flexibility in imposing abstractions on heterogeneous information. Presently, the IRDL interpreter generates Harness metadata entities. With the emergence of active objects, it should become possible to perform the direct generation of the Web data structures.

4.0 Interfacing Legacy Information Systems

As in [per95], we define a legacy system as a system that strongly resists modification and evolution. Database management systems represent an important subset of legacy information systems. In Section 2, we described different ways of interfacing independent gateway programs with the Web. In this section, we discuss using such programs to provide Web access to legacy information systems.

The most common problems that arise when building Web interfaces to legacy information systems are caused by the statelessness of the HTTP protocol and servers. HTTP's statelessness creates obvious problems in communicating with stateful information systems and in supporting proper authorization mechanisms.

The most popular solution is to pass state and encoded authorization information either in gateway URLs or as hidden form fields. This, of course, is only possible with dynamic HTML pages. One trouble with this solution is that it does not work with cached pages, which do not contain timely state and authorization information. Most recent browsers either routinely re-evaluate dynamic URLs or can be forced to always perform the re-evaluation. Still, caching remains a problem.
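
Both variants of this technique can be sketched with a few lines of Python. The gateway path and the field names are hypothetical, and the state values are left unencrypted for brevity, whereas a real gateway would encode authorization information.

```python
from html import escape
from urllib.parse import parse_qs, urlencode, urlsplit


def state_url(gateway: str, state: dict) -> str:
    """Embed the state in the query string of a gateway URL."""
    return f"{gateway}?{urlencode(state)}"


def hidden_fields(state: dict) -> str:
    """Render the state as hidden fields to be placed inside an HTML form."""
    return "\n".join(
        f'<input type="hidden" name="{escape(key)}" value="{escape(value)}">'
        for key, value in state.items()
    )


def recover_state(url: str) -> dict:
    """What the gateway does on the follow-up request: read the state back."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
```

Because the state travels inside the generated page, a page served from a cache carries whatever state was current when the page was first produced.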

Even assuming there were an acceptable way of maintaining state information, there remains a more difficult problem of supporting transactions that span multiple requests. Once the request is serviced, the connection between the Web server and the browser breaks. When a follow-up request comes through, the information system's thread associated with the previous request may not be around. Starting a new thread and retracing earlier steps within the same transaction to bring the thread into the right state is not always possible, and quite expensive when it is.

The CGI mechanism is widely used by DBMS vendors for building interfaces to their products. IBM's DB2 WWW Connection [lin95] is a CGI gateway that uses the Distributed Database Connection Service (DDCS) to establish and maintain connections (for the duration of a request) with DBMSs that reside on DDCS-supported platforms.

Fig. 4. CICS Internet Gateway.

IBM DB2 WWW Connection Version 1 is now generally available for OS/2 and AIX operating systems via a download from the Web. New beta versions for Windows NT, Sun Solaris and HP-UX are also available from the same site. Of course, it only makes sense to install this gateway product on a machine that is also running a Web server and has a local DB2 or DDCS installation. With all the performance and security shortcomings of the CGI mechanism, DB2 WWW Connection Version 1 provides enough flexibility and is a reasonable first step in providing Web access to IBM databases.

According to [lin95], IBM's current work is on providing a Web CGI interface to their on-line transaction processing system, known as CICS(TM) (fig. 4, [lin95]), which was designed to interact with IBM 3270(TM) terminals. The gateway captures the 3270 stream and converts it to HTML. However, the problems of compromising between the session-based CICS and stateless HTTP protocols remain. In addition, numerous invocations of CGI programs (one per request) may create real performance problems. The initial versions of the CICS Internet Gateway for AIX 2.1 and OS/2 version 2.0.1 or later have been officially released and are available as IBM products.

Other DBMS vendors supply their own gateways to their proprietary database systems (e.g., Oracle, Informix and O2 Technology). Most of these products store the sign-up and state information at the server, in their databases. This approach helps solve the HTML page caching problem, but each access now requires an additional database query. Moreover, without a reliable way of enforcing the sign-off procedure, new problems of discarding and refreshing the state and authorization information arise.
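
The trade-off of server-side state can be made concrete with a small sketch. An in-memory dictionary stands in for the vendor's database, and the class name, the time-out policy, and the token format are assumptions rather than a description of any of the products above. Since sign-off cannot be enforced, the sketch simply discards state that has not been used for a while.

```python
import time
import uuid


class SessionStore:
    """Toy server-side store for sign-up and state information."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions = {}  # token -> (user, time of last access)

    def sign_up(self, user: str) -> str:
        """Record a new session and return the token handed to the browser."""
        token = uuid.uuid4().hex
        self._sessions[token] = (user, self._clock())
        return token

    def lookup(self, token: str):
        """Extra lookup needed on every access; returns the user or None."""
        entry = self._sessions.get(token)
        if entry is None:
            return None
        user, last_seen = entry
        now = self._clock()
        if now - last_seen > self._ttl:
            # No explicit sign-off ever arrived, so stale state is dropped.
            del self._sessions[token]
            return None
        self._sessions[token] = (user, now)  # refresh on use
        return user
```

Only the short token has to appear in the generated pages, so a cached page no longer carries stale state, but every request pays for the lookup and the choice of time-out is necessarily a guess.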

O2 Technology provides a CGI gateway for communicating with their applications. There is some confusion in that the O2 Web Server is not an HTTP server, but an O2 server enhanced to establish connections with the O2 CGI gateway and to convert results of the Object Query Language (OQL) queries to HTML. Nevertheless, the product is fairly robust and sports nice object visualization capabilities.

Oracle has announced the upcoming release of its Oracle WebServer 2.0(TM), quite a few steps beyond their current CGI-based product. The new product is supposed to include the following components:

  1. Web Request Broker(TM), which is their own HTTP server that doubles as an Oracle gateway.
  2. A development kit for interfacing Oracle applications with the Web Request Broker(TM).
  3. HTML-based server maintenance tools.

The product is not yet available and it is impossible to say whether it will live up to its expectations, but it is worth looking into once it reaches the market.

Fig. 5. The Evolution of Computer-Human Interaction.

5.0 Web Front-Ends to Existing Applications

Apart from the purely technical concerns, the layout and functionality of the HTML front-ends to existing applications alone may cause considerable anxiety on the part of end-users. In this section, we comment on some of the Web-specific user interface issues. A detailed discussion of the subject may be found in [mil95].

Problems with applying traditional interface design techniques to building interactive applications on the Web are rooted in limitations of both HTML and the HTTP protocol. The HTML limitations were very severe in version 1.0, but have been steadily eased with the introduction of forms, frames, etc. (fig. 5, [mil95]). HTTP limitations make submitting requests similar to running a batch job: data validation and error checking are only available after submitting a request [mil95]. It is possible to define separate data validation requests, but such requests are not greatly favored by end-users, who have to wait for the data validation results before submitting a real request. With Java, much of the data validation and error checking can be performed at the browser, which mostly eliminates the problem.

Fig. 6. Menus in HTML.

Consider a sample legacy application whose user interface utilizes a pop-down menu. Selecting an item from the menu results in an immediate action. Pop-down menus are also available in HTML, but no action would follow the selection. To retain the functionality, the menu must either be complemented with a submit button or replaced with a set of hyperlinks that point to menu items (fig. 6). Any such choice should be based on the application context. It would not generally suffice to simply map legacy graphical elements into the ones that are supported by HTML. Instead, user interfaces should be carefully analyzed and redesigned to make the best of the available technology.
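
The two alternatives can be sketched as small HTML generators. The function names, the gateway path, and the parameter name used in the examples are hypothetical, and the markup is deliberately minimal.

```python
from html import escape
from urllib.parse import quote


def menu_with_submit(action: str, name: str, items: list) -> str:
    """Selection menu complemented with a submit button."""
    options = "\n".join(
        f'  <option value="{escape(item)}">{escape(item)}</option>'
        for item in items
    )
    return (
        f'<form method="get" action="{escape(action)}">\n'
        f'<select name="{escape(name)}">\n{options}\n</select>\n'
        '<input type="submit" value="Go">\n'
        "</form>"
    )


def menu_as_links(action: str, name: str, items: list) -> str:
    """Menu replaced with one hyperlink per item; a click acts immediately."""
    links = "\n".join(
        f'  <li><a href="{escape(action)}?{escape(name)}={quote(item)}">'
        f"{escape(item)}</a></li>"
        for item in items
    )
    return f"<ul>\n{links}\n</ul>"
```

The first form keeps the compact look of the original menu at the cost of an extra click, while the second restores the immediate action but takes up more room on the page.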

A successful attempt at re-engineering the graphic user interface to provide a Web front-end for an existing application is reported in [rfd95]. One interesting result of this effort is additional functionality that was not available from the original system. This, of course, is due to the fact that the old graphic user interface did not take full advantage of the available functionality. To summarize, even within the limitations imposed by the current versions of HTML and HTTP, it is possible to design successful graphic user interfaces for legacy applications. However, the flexibility of the emerging mobile code systems makes them more appropriate for the task.

6.0 Conclusions

In this overview, we have discussed existing technology for building Web interfaces to legacy information systems and for providing direct Web access to legacy information. While Java and other emerging mobile code systems provide new and exciting opportunities in presenting information on the Web, we see them as complementary to the new developments in the traditional HTML presentation and the HTTP protocol.

World-Wide Web technology is advancing so rapidly that today's Web data structures, including images, applets, and static and dynamic HTML pages, are likely to become obsolete tomorrow. These structures represent a tremendous investment of money and talent and cannot be recreated with every new step in the technological advance. Consequently, methods that support cutting-edge presentation of existing heterogeneous information have to progress as well. Similar methods should be applicable to creating virtual Webs of the future, with both navigation and presentation controlled by personalized meta-information.

7.0 Acknowledgments

Many of the issues raised in this overview were discussed at the Workshop on Web Access to Legacy Data that was organized at the Fourth International WWW Conference'95 in Boston. I would like to thank my workshop co-chair, Kshitij Shah, and the workshop presenters: Chumki Basu, Donna Dillenberger, David Eichmann, William LeBlanc, Yi-Jing Lin, Tim McCandless, Richard Miller, Louis Perrochon, and Marco Ronchetti for stimulating discussions.

References

[and95]
K. Andrews, F. Kappe, and H. Maurer. Serving Information to the Web with Hyper-G, WWW'95, Darmstadt, Germany. Computer Networks and ISDN Systems 27(6), Elsevier Science, 1995, pp. 919-926.
[bae95]
M. Baentsch, G. Molter, and P. Sturm. WebMake: Integrating distributed software development in a structure-enhanced Web, WWW'95, Darmstadt, Germany. Computer Networks and ISDN Systems 27(6), Elsevier Science, 1995, pp. 789-800.
[bor93]
N. Borenstein, N. Freed, and K. Moore. MIME (Multipurpose Internet Mail Extensions), Network Working Group. RFC 1521 and 1522.
[bow94]
C. Bowman, P. Danzig, D. Hardy, U. Manber, and M. Schwartz. The Harvest information discovery and access system, Proceedings of WWW'94, Chicago, IL. NCSA/UIUC, October 1994, pp. 763-772.
[cla95]
A. Clausnitzer, P. Vogel, and S. Wiesener. A WWW Interface to the OMNIS/Myriad Literature Retrieval Engine, WWW'95, Darmstadt, Germany. Computer Networks and ISDN Systems 27(6), Elsevier Science, 1995, pp. 1017-1026.
[eic94]
D. Eichmann, T. McGregor, and D. Danley. Integrating Structured Databases Into the Web: The MORE System, WWW'94, Zurich, Switzerland. Computer Networks and ISDN Systems 27(2), Elsevier Science, 1994.
[far96]
R. Farrell. CGI Programming with Perl 5.0. IDG Books, 1996.
[gos95]
J. Gosling and H. McGilton. The Java Language Environment: A White Paper, Sun Microsystems, Mountain View, CA, May 1995.
[lin95]
Yi-Jing Lin, Jen-Yao Chung, John Leung, Donna Dillenberger, and Nick Bowen. Web Access to IBM Legacy Systems Data, Workshop on Web Access to Legacy Data, Boston, MA, December 1995.
[mil95]
Richard Miller. Web Interface Design: Learning from our Past, Workshop on Web Access to Legacy Data, Boston, MA, December 1995.
[pfi95]
L. Perrochon and R. Fischer. IDLE: Unified W3 Access to Interactive Servers, WWW'95, Darmstadt, Germany. Computer Networks and ISDN Systems 27(6), Elsevier Science, 1995, pp. 927-938.
[per95]
Louis Perrochon. W3 "Middleware": Notions and Concepts, Workshop on Web Access to Legacy Data, Boston, MA, December 1995.
[ron95]
M. Ronchetti. Face Lift: Using WWW technology for an external re-engineering of old applications, In the Poster Proceedings of WWW'95, Darmstadt, Germany. Fraunhofer Institute for Computer Graphics, 1995, pp. 145-148.
[rfd95]
M. Ronchetti, D. Feltrin, V. D'Andrea, and G. Succi. External reengineering of "Catalogo Bibliografico Trentino": lessons learned, Workshop on Web Access to Legacy Data, Boston, MA, December 1995.
[shk94]
L. Shklar, S. Thatte, H. Marcus, and A. Sheth. The "InfoHarness" Information Integration Platform, Proceedings of WWW'94, Chicago, IL. NCSA/UIUC, October 1994, pp. 809-820.
[ssk95]
L. Shklar, A. Sheth, V. Kashyap, and K. Shah. Use of Automatically Generated Metadata for Search and Retrieval of Heterogeneous Information, CAiSE'95, Jyvaskyla, Finland. Lecture Notes in Computer Science #932, Springer-Verlag, 1995, pp. 217-230.
[shk95]
L. Shklar, K. Shah, and C. Basu. Putting Legacy Data on the Web: A Repository Definition Language, WWW'95, Darmstadt, Germany. Computer Networks and ISDN Systems 27(6), Elsevier Science, 1995, pp. 939-951.
