
Scholarly Publishing and the Fluid World Wide Web

Andrew Treloar, School of Computing and Mathematics, Deakin University, Rusden Campus, 662 Blackburn Road, Clayton, 3168, Australia. Phone +61 3 9244 7461. Fax +61 3 9244 7460.
Email: Andrew.Treloar@deakin.edu.au. Home Page: Andrew Treloar.

Version information

This version was last modified on October 30, 1995. The original version as delivered at the Asia-Pacific World Wide Web conference is also available online.
Abstract:
Scholarly publishing has traditionally taken place using the medium of print. This paper discusses the transition to electronic scholarly publishing and how the fluid nature of the Web introduces special problems. These problems include document linking, document fixity, document durability, and changing standards.
Keywords:
Electronic publishing; scholarly publishing; SGML; HTML; URL; URN; durability; fixity; standards.
Note on Citations:

As the citations need to be accessible both on the Web and in print, the following procedure has been followed. All bibliographic citations in the text are hyperlinked to an anchor in the reference list at the end of this paper. This allows the reader to check the details of a citation easily and decide whether to access the cited work. Cited works for which a URL exists can then be accessed by clicking on the text of the URL itself, which is given in full after the citation information. Selecting Go back will, in most Web browsers, return to the original location in the main text.

Introduction

Electronic Scholarly Publishing

Scholarly publishing has traditionally taken place using the technology of print. Despite the seemingly inevitable rush into the digital age, print is still the primary publishing technology for the overwhelming majority of scholarly disciplines. It is also the technology that still provides the official archival record for almost all publications. However, print publications suffer from a number of disadvantages:

For all these reasons, as soon as the available technology made it practicable, pioneering scholars began to use whatever means they could to produce and distribute their writings electronically. Such electronic publishing is sometimes referred to as epublishing, by analogy with email. For an excellent selective bibliography on the subject of scholarly electronic publishing over networks, consult [Bai95]. For an analysis of the views of academics on electronic publishing, consult [Sch94]. In roughly chronological order, the technologies adopted for epublishing on the Internet were:

The Fluid Web

New technologies tend to be used in addition to older technologies, rather than supplanting them. Thus, in the field of electronic publishing it is not unusual to find journals that were initially distributed by listserv, which then added anonymous ftp, and later perhaps gopher access. These older distribution technologies are now being augmented or (increasingly) replaced by the Web. The non-hierarchical, document-based, networked hypermedia architecture of the Web provides a much richer environment for electronic publishing on the Internet than any of the previous technologies.

Unfortunately, the Internet is an inherently impermanent medium, characterised by anarchy and chaos. While this dynamic information environment has many advantages, it poses some real problems as a publishing medium. The Web inherits all of this by using the Internet as a transport mechanism, but then adds to it some challenges of its own. The Web is the area of the Internet that is currently developing at the greatest rate. This speed of change brings with it a large degree of fluidity as technologies progress, mechanisms are developed and standards become fixed. The difficulties for scholarly publishing on the Web are particularly severe with respect to document linking, document fixity, document durability, and changing standards. This paper will consider each of these in turn and try to provide some solutions or ways forward.

Document Linking

To be useful to scholars, publications need to be readily accessible. As directory hierarchies are re-organised or servers moved, old URLs no longer work. This is of course a general problem with Internet-based electronic information systems, but the problem of broken URLs is particularly acute on the Web, where documents often contain multiple links to other documents. Breaks in the electronic spider-web of links can be extremely frustrating, and detract markedly from the feeling of immersion in a seamless information environment. A range of solutions to the problem of broken URLs are available or under development. They include manual fixes, semi-automated assistance, managing the links themselves separately from the documents, and rethinking the link mechanism altogether.

Fix it yourself?

A manual solution is to ensure that when directory hierarchies are re-organised on servers, links are placed from old locations to new locations. For Unix ftp servers this can be done with link files. On the Macintosh, aliases perform a similar function. On Web servers, it is usual and sufficient to leave a small document at the old location that points the reader to the new one.
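Such a stub can be sketched as follows (a hypothetical Python helper, purely for illustration; the META Refresh "client pull" mechanism is honoured only by some browsers, and the example URL is invented):

```python
def forwarding_stub(new_url):
    """Return a minimal HTML stub for the old location, telling the
    reader (and, in browsers that support client pull, redirecting
    them) to the document's new location."""
    template = (
        "<HTML><HEAD>\n"
        '<META HTTP-EQUIV="Refresh" CONTENT="5; URL=%(url)s">\n'
        "<TITLE>Document moved</TITLE></HEAD>\n"
        "<BODY>This document has moved to\n"
        '<A HREF="%(url)s">%(url)s</A>.</BODY></HTML>'
    )
    return template % {"url": new_url}

stub = forwarding_stub("http://www.example.edu/new/paper.html")
print(stub)
```

A stub like this is placed at every old location; the plain HREF ensures that even browsers without client pull still offer the reader a working link.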

This sort of fix works, but has a number of deficiencies:

Most importantly, this technique does not scale well to large complex hyper-linked document sets.

Semi-automated assistance would be nice...

A partial improvement would be some mechanism that helped identify links that were broken. It is technically feasible to provide some form of Web robot that would periodically walk all the links out of a site and ensure that they still work. Unfortunately, I know of no commercial vendors of Web publishing software who provide such a facility. The closest thing available is the recently introduced URL-minder service. This allows a user to register document URLs and receive an email message whenever the document moves or changes. This can be very useful, especially if one is concerned about the content of a linked-to document, as well as its location.
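The idea behind such a link-walking robot can be sketched as follows (Python used purely for illustration; the in-memory "site" stands in for real HTTP requests, and the page names are invented):

```python
import re

# A toy "site": page name -> page content. In a real robot, fetching a
# page would be an HTTP request; here a dictionary lookup stands in.
site = {
    "index.html": '<A HREF="paper.html">paper</A> <A HREF="old.html">old</A>',
    "paper.html": '<A HREF="index.html">home</A>',
    # note: "old.html" is referenced from index.html but no longer exists
}

def broken_links(site, start):
    """Walk all links reachable from `start`; return (page, target)
    pairs for links whose target cannot be fetched."""
    href = re.compile(r'HREF="([^"]+)"', re.IGNORECASE)
    seen, queue, broken = set(), [start], []
    while queue:
        page = queue.pop()
        if page in seen:
            continue
        seen.add(page)
        for target in href.findall(site.get(page, "")):
            if target not in site:
                broken.append((page, target))
            elif target not in seen:
                queue.append(target)
    return broken

print(broken_links(site, "index.html"))  # -> [('index.html', 'old.html')]
```

Run periodically against a server's own documents, a report like this would at least tell the maintainer which links need manual repair.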

Don't embed the links!

Of course, one of the reasons that the Web has difficulties with broken links is that the links are embedded in the document stream. This makes automation of link repair difficult. A number of information systems have made a conscious decision to separate documents and links. Two usefully illustrative examples are Hyper-G and PASTIME.

Hyper-G

The Hyper-G information system ([Kap95]) uses a separate database engine to maintain meta-information (including, but not restricted to, links) about documents as well as their relationships to each other. Hyper-G servers automatically maintain referential integrity for their local documents and communicate with other servers to maintain integrity for the entire system. In contrast to the Web, links are bidirectional, enabling one to find the source of a link from its destination (as well as vice versa). Hyper-G has native clients for a range of platforms, or can be accessed via a WWW to Hyper-G gateway.

PASTIME

The PASTIME project ([Thi95]) has abandoned altogether the practice of embedding fixed hyperlinks in documents. The links are generated on the fly, based on sophisticated pattern processing, as the HTML document is served to the remote client. To add or remove hyperlinks, only the pattern processing needs to be altered. Fixed hyperlinks can be entered if desired. In contrast to Hyper-G, the currency of hyperlinks is not dependent on maintaining a separate database of link information. The runtime efficiency of the pattern matching is claimed to be very high.
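A minimal sketch of this pattern-driven approach (in Python; the pattern, URL template, and sample text are invented for illustration and are not the actual PASTIME rules):

```python
import re

# Link rules live outside the documents: (pattern, URL template) pairs
# applied to the text as each page is served. Changing this table
# changes every served page without touching any stored document.
link_patterns = [
    (re.compile(r"\bBill No\. (\d+)\b"),
     '<A HREF="/bills/%s.html">Bill No. %s</A>'),
]

def serve(text):
    """Insert hyperlinks on the fly as the document goes out."""
    for pattern, template in link_patterns:
        text = pattern.sub(lambda m: template % (m.group(1), m.group(1)),
                           text)
    return text

print(serve("The debate concerned Bill No. 42 and related matters."))
```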

Names not locations

Ultimately, the most satisfactory solution will be to rethink what is meant by a link. The most appropriate model is to adapt the mechanism scholars have used to link to other documents for centuries: the bibliographic citation. At the end of this paper is the References section, which provides links to other documents in the form of standardised bibliographic citations. These citations make no reference to the location of a document; they only specify its name in some unambiguous form.

The Web equivalent, of course, is the distinction between Uniform Resource Locators (URLs) and Uniform Resource Names (URNs). These are part of a proposed wider Internet infrastructure in which URNs are used for identifying resources and URLs for locating them. As in the print world, what scholars want to link to is the contents of other identifiable documents; the locations of those documents should be irrelevant. URLs, with their dependence on a particular machine and directory path, are a transitional kludge. URNs, with their intended ability to refer to a known resource and have the system take care of locating and accessing it, are the long-term solution.

Of course, building and distributing a robust URN to URL resolution mechanism is far from trivial, although some prototype implementations are starting to appear, such as that at Saint Joseph's College in the U.S.A.
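The division of labour between names and locations can be sketched as follows (a toy Python resolver; the URN syntax and registry contents are invented, and real resolution proposals are considerably more elaborate):

```python
# A resolution service maps stable names to current locations.
# Citations carry only the URN; when a document moves, only the
# registry entry changes, never the citations.
resolver = {
    "urn:example:treloar-1995": ["http://www.deakin.edu.au/~aet/fluid.html"],
}

def resolve(urn):
    """Return the current URL(s) for a URN (empty if unknown)."""
    return resolver.get(urn, [])

# The document moves: update the registry, and every existing
# citation now resolves to the new location.
resolver["urn:example:treloar-1995"] = ["http://andrew.treloar.net/fluid.html"]
print(resolve("urn:example:treloar-1995"))
```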

As an alternative transitional solution, [Fre95] proposes using standard message broadcasting mechanisms, such as those provided by Usenet News and Open Distribution Lists services, to announce Uniform Resource Identifiers (URIs), and Uniform Resource Characteristics (URCs), as well as URNs and URLs.

Link Granularity

With respect to the granularity of the object linked to, Web publishing provides a way to improve on traditional print citations, which at best point to a specific page. The concept of a page is, of course, not particularly relevant to electronic documents, particularly given the way Web browsers can dynamically reformat text as the size of the display window changes. It would be better if the unit linked to were both smaller than a page and more closely tied to the structure of the document. Obvious alternatives for the Web are named anchors for sections, or numbered paragraphs. URLs can then point to the correct section or even paragraph of a document, rather than just to its beginning. Naturally, if individual Web documents are small, this is less important. But many documents produced at this transitional stage in the move to electronic-only publishing are still structured for printing. This document, and the others produced for this conference, are good examples: the style guide requires that submissions be a single document for ease of management.

All Web authors of long documents should bear the needs of fine-grained linking in mind and provide named anchors for others to link to. Leading by example, this document has pre-defined names for each heading. To stop URLs growing to ridiculous lengths, these names are 2 or 3 character identifiers that are unique (obviously) within this document. To see what they are for linking purposes, readers can select View Source in their browser.
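The practice can be illustrated with a small sketch (Python; the document fragment and anchor names are invented) that extracts a document's named anchors and builds the fragment URLs other authors could cite:

```python
import re

# A document exposing short named anchors on its headings, as
# recommended above.
doc = ('<H2><A NAME="dl">Document Linking</A></H2>'
       '<H2><A NAME="df">Document Fixity</A></H2>')

def anchors(html):
    """Return the NAME anchors a document exposes for fine-grained
    linking."""
    return re.findall(r'<A NAME="([^"]+)">', html, re.IGNORECASE)

base = "http://www.example.edu/paper.html"
# Fragment URLs pointing at individual sections rather than the
# document's beginning:
for name in anchors(doc):
    print("%s#%s" % (base, name))
```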

Document Fixity

In the print world, we are used to documents remaining fairly static. Journal articles, once published, are not normally updated. They become fixed in time and part of the historical record of scholarship. Monographs may appear in more than one edition, but are then clearly recognisable as new products, often with years between successive versions. Electronic documents on the Internet can change so quickly that they sport version numbers and the date they were last updated. Scholarly Web publishing has a range of possibilities with respect to fixity. At one end of the spectrum are documents that follow the print model and remain fixed once published. Somewhere in the middle are documents where only minor change is allowed. At the other end are documents that are continuously updated as more information becomes available, or as the author's views change.

Minor changes to documents

As the first step away from static documents, the High Energy Physics community has moved to a model of electronic publishing which allows for ongoing corrections and addenda. The hep-th e-print archive, which provides this facility, serves over 20,000 users from more than 60 countries and processes over 30,000 messages per day ([Gin94]). JAIR, the Journal of Artificial Intelligence Research, has also just introduced the ability to make comments on published papers and read others' comments.

A number of electronic scholarly journals are intending to add pointers from earlier published articles to related later ones on the author's behalf. The assumption is that pointers in the reverse direction (that is, to prior published work) will already have been included by the authors.

Continuously updated documents

Tony Barry from the ANU has suggested in [Bar95] that continuously updated documents might be viewed as being more like a computer database. Such continuous updating may be desirable to cope both with the broken URL problem discussed above and with the exponentially increasing amount of information online that can be linked to. The first of these problems may go away in the near future; the second is unlikely to do so anytime soon.

In [Bar94] he has also suggested that scholars should get recognition for the currency of their documents rather than their number. While attractive on the surface, I am profoundly sceptical that universities that are currently just starting to grapple with recognising the validity of electronic publications are ready for this visionary proposal. Nor am I convinced that the implications of this model for the workloads of scholars have been fully thought through. I can imagine an academic trying (and failing) to keep a number of her articles in different content areas up to date. As the number of such articles increases (as one might hope it would, particularly in a 'publish or perish' environment), the workload would increase to crippling levels. This is particularly a problem in fields undergoing rapid change.

Perhaps the only reasonable interim solution is to distinguish somehow between fixed documents (print-like) and continuously updated documents (database-like), or at least to make it clear at the top of a document into which category it falls. This approach has been used, for example, by Bailey in [Bai95]: the HTML version of that document is continuously updated, while the ASCII version is fixed and permanently archived. Public-Access Computer Systems Review (PACS-R), where these articles were published, has until this year published only in fixed ASCII. As a sign of the changing times, it now accepts articles in fixed ASCII, fixed HTML and author-updated HTML.

So which document is which?

If documents are continuously changing and evolving over time, which version should be cited? Which version is the 'publication of record' (assuming this means anything any more)?

Two solutions to the problem of permanence are currently in use on the Web.

  1. Every time the document changes, its name changes also. If the older version is replaced by the newer, then all URLs pointing to the older version break. Moving to URNs will not help in this case.
  2. Every time the document changes, the name is kept the same and the contents updated. Existing URLs will still work, although the target of the URL may have changed its content significantly. In this case, what if one scholar cites a section in a document that disappears in the next revision? The URL-minder service already mentioned will only be of limited help here. Presumably under this model, a new URN will need to be assigned, as is done with successive versions of print monographs.
An alternative solution with wider application than just the Web is discussed in [Gra95]. The nature of all digital documents means that a mechanism is needed to ensure the authenticity of a given document, or to track multiple versions of a digital original. Graham's proposed solution for such authentication and version tracking is Digital Time Stamping (DTS). With DTS, a one-way algorithm is used to generate a key that can be reproduced only from the original document. These keys would then be made public, thus ensuring that anyone could verify the version of a given document.
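A minimal illustration of the one-way-function idea behind DTS (in Python; SHA-256 is a modern stand-in for whatever one-way algorithm a real DTS service would use, and the time-stamping protocol itself is omitted):

```python
import hashlib

def version_key(text):
    """One-way digest acting as a public, verifiable fingerprint of
    one version of a document: only the exact original bytes can
    reproduce it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

v1 = version_key("Scholarly publishing has traditionally used print.")
v2 = version_key("Scholarly publishing has traditionally used print!")
# Even a one-character change yields a completely different key, so a
# published key pins down exactly one version of a document.
print(v1 == v2)  # -> False
```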

Document Durability

Document durability ([Kau93]) refers to the length of time a document is available for communicative transactions. Documents printed on paper that is not acid-free have a durability of some 100 years unless corrective action is taken. The durability of Web documents is entirely unknown, but there are no technological reasons for their life to be limited in any way, provided they are archived in some systematic way. At present there are no mechanisms to ensure that this will occur.

The three preservations

Graham ([Gra94]) divides the problem of ensuring the preservation of documents into three kinds: medium preservation, technological preservation, and intellectual preservation. Medium preservation relates to the problem of preserving the medium on which the information is recorded. This has traditionally been discussed in terms of environmental and handling concerns for storage media. It will continue to be important, but the rate of technological change is such that preserving media like 8" floppies is of little use if nothing can read them. [Les92] has suggested that attention should instead be directed to technological preservation; that is, the obsolescence of technologies is much more of a problem than the decay of storage media. In his words, for electronic information, "preservation means copying, not physical preservation." The third of Graham's preservation requirements, intellectual preservation, addresses the integrity and authenticity of the original information. In other words, what assurance do we have that what is retrieved is what was originally recorded? [Gra95] proposes the DTS technology discussed above as a solution to this problem. [Rot95] also discusses a range of mechanisms to deal with each of these three preservation problems.

Lost documents

In many ways, the digital nature of all electronic publishing can be both a strength and a weakness in the area of durability. A strength, because digital documents can easily be copied and replicated at multiple sites around the world. A weakness, because destroying a digital document is far easier than destroying a physical one. It is easy to assume that a document will exist elsewhere on the Net and that the fate of a single copy is irrelevant. Of course, there is no mechanism to prevent everyone making this assumption, causing a piece of scholarship to be lost forever. The modern analogue of the single manuscript forgotten on top of a cupboard in a mediaeval monastery may well be a forgotten directory on a rarely used hard disk somewhere in a university department. Unfortunately, it is all too easy to delete a directory without realising it - throwing away a manuscript unknowingly is somewhat harder. Given the lack of any mechanism to ensure the archiving of even print publications, it seems unlikely (although technologically relatively easy) that anything will be done about the situation for digital documents.

Changing Standards, or the ever-moving HTML target

The explosive development and adoption of the Web has been paralleled by the evolution of the HTML standard. Starting from what is now called HTML level 1, we have moved through level 2 and got nearly as far as level 3 (for most browsers, at least). At the time of writing, HTML level 3 is still not finalised, although this is not stopping the developers of Web browsers from adding proposed or likely level 3 features to their products. The W3O and others are no doubt already thinking about the sorts of things that might appear in HTML level 4.

As HTML has grown and mutated it has added a whole range of things that were not envisaged at its birth, back when the world was a simpler place. These include at least detailed layout control, tables, and equation support. The question is, should the process of accreting features to HTML continue unabated into the future?

Price-Wilkin ([Pri94a], [Pri94b]) has argued that HTML as used on the Web has a number of fundamental deficiencies as a scholarly markup language. Some of the deficiencies he lists are:

He argues that those currently coding their documents in HTML may come to regret their short-sightedness in a few years. Instead, he argues for coding complex documents in SGML and converting this into HTML on the fly for delivery on the Web as well as through other means. As it turns out, one of the characteristics of the evolution of HTML is precisely in the direction of greater SGML compliance.

Philip Greenspun, from MIT, has also written on the deficiencies of HTML ([Gre95]). His preferred solution is to make much wider use of the META tag included in HTML level 2.

HTML is certainly evolving towards full SGML compliance, but betrays its origin as a formatting language rather than a structuring language at every turn. It may not be possible to migrate entirely seamlessly towards SGML. Indeed, it may not be necessary. Many types of publishing do not require the range of features listed by Price-Wilkin. SGML to HTML gateways may only be required for particular kinds of large, complex documents.

One way out of this mess is to separate clearly the internal representation of a document from its ultimate delivery format. We already do this with computer-generated documents that are delivered in print. In the rapidly-changing world of electronic publishing, documents may be delivered in HTML, as Adobe Acrobat PDF files, as PostScript files, and in print (to name only the most obvious options). Documents may be prepared in a wide range of word or document processors for ultimate delivery in these formats. Provided the structuring and layout information retained about the document is richer than that of the ultimate delivery format, conversion is relatively simple. A number of manufacturers are already facilitating this with their products. FrameMaker 5.0 from Frame Technology Corporation and PageMaker 6.0 from Adobe both provide support for output to PDF and HTML, as well as PostScript and print. Microsoft Word's Internet Assistant provides users of the latest version of Word for Windows with the ability to save in HTML as well as the native document format. Given the rate of change in electronic delivery technologies, it is probably best to keep files in a format that can be output in a range of forms. This provides the maximum flexibility and is reasonably future-proof.
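The single-source approach can be sketched as follows (in Python; the element names and the two renderers are invented for illustration and are not taken from any of the products mentioned). One structure-rich internal representation drives every delivery format:

```python
# Internal representation: a sequence of (element, text) pairs that
# records structure, not appearance.
doc = [
    ("title", "Scholarly Publishing and the Fluid Web"),
    ("heading", "Document Linking"),
    ("para", "To be useful to scholars, publications need to be accessible."),
]

def to_html(doc):
    """Render the internal representation as HTML for Web delivery."""
    tags = {"title": "H1", "heading": "H2", "para": "P"}
    return "\n".join("<%s>%s</%s>" % (tags[k], text, tags[k])
                     for k, text in doc)

def to_text(doc):
    """Render the same representation as plain ASCII, with titles and
    headings set in capitals."""
    return "\n\n".join(text.upper() if k != "para" else text
                       for k, text in doc)

print(to_html(doc))
```

A new delivery format means writing one more renderer; the stored documents themselves never change.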

Conclusion

The Web provides an attractive and accessible environment for scholarly electronic publishing (provided the content is compatible with the limitations of HTML). It seems likely that the use of the Web in this context will increase. Care needs to be taken, however, to ensure that the potential dangers of the Web's fluidity are overcome. This paper has outlined a number of the areas of difficulty and some possible solutions.

References

[Bai95]
C. Bailey Jr., "Network-Based Electronic Publishing of Scholarly Works: A Selective Bibliography", The Public-Access Computer Systems Review, Vol. 6, No. 1, http://info.lib.uh.edu/pr/v6/n1/bail6n1.html.
[Bar94]
T. Barry, "Publishing on the Internet with World Wide Web", in Proceedings of CAUSE '94 in Australia, CAUDIT/CAUL, Melbourne.
[Bar95]
T. Barry, "Network Publishing on the Internet in Australia", in The Virtual Information Experience - Proceedings of Information Online and OnDisc '95, Information Science Section, Australian Library and Information Association, pp. 239-249.
[Fre95]
V. Freitas, "Supporting a URI Infrastructure by Message Broadcasting", in Proc. INET '95, http://inet.nttam.com/HMP/PAPER/116/abst.html.
[Gin94]
P. Ginsparg, "First Steps towards Electronic Research Communication", Computers in Physics, August.
[Gra94]
Peter S. Graham, "Intellectual Preservation: Electronic Preservation of the Third Kind", Commission on Preservation and Access, Washington, D. C., http://aultnis.rutgers.edu/texts/cpaintpres.html.
[Gra95]
Peter S. Graham, "Long-Term Intellectual Preservation", Proc. RLG Symposium on Digital Imaging Technology for Preservation, http://aultnis.rutgers.edu/texts/dps.html.
[Gre95]
P. Greenspun, "We have Chosen Shame and Will Get War", http://www-swiss.ai.mit.edu/philg/research/shame-and-war.html.
[Har90]
S. Harnad, "Scholarly Skywriting and the Prepublication Continuum of Scientific Inquiry", in Psychological Science, Vol. 1, pp. 342-343 (reprinted in Current Contents 45: 9-13, November 11, 1991), ftp://ftp.princeton.edu/pub/harnad/Harnad/harnad90.skywriting.
[Har91]
S. Harnad, "Post-Gutenberg Galaxy: The Fourth Revolution in the Means of Production of Knowledge", in The Public-Access Computer Systems Review, Vol. 2, No.1, pp. 39-53, ftp://cogsci.ecs.soton.ac.uk/pub/harnad/Harnad/harnad91.postgutenberg.
[Kap95]
F. Kappe, "Maintaining Link Consistency in Distributed Hyperwebs", in Proc. INET '95. Originally referenced at http://inet.nttam.com/HMP/PAPER/073/html/paper.html.
[Kau93]
D. S. Kaufer & K. M. Carley, Communication at a Distance - The Influence of Print on Sociocultural Organization and Change, Lawrence Erlbaum Associates.
[Les92]
M. Lesk, Preservation of New Technology: A Report of the Technology Assessment Advisory Committee to the Commission on Preservation and Access, Washington, DC: CPA. Available from the Commission at $5: 1400 16th S. NW, Suite 740, Washington, DC 20036-2217.
[Odl95]
A. Odlyzko, "Tragic loss or good riddance? The impending demise of traditional scholarly journals" in Electronic Publishing Confronts Academia: The Agenda for the Year 2000, Robin P. Peek and Gregory B. Newby, eds., MIT Press/ASIS monograph, MIT Press, ftp://netlib.att.com/netlib/att/math/odlyzko/tragic.loss.txt.
[Pri94a]
J. Price-Wilkin, "Using the World-Wide Web to Deliver Complex Electronic Documents: Implications for Libraries" in The Public-Access Computer Systems Review, Vol. 5, No. 3, pp. 5-21, gopher://info.lib.uh.edu:70/00/articles/e-journals/uhlibrary/pacsreview/v5/n3/pricewil.5n3.
[Pri94b]
J. Price-Wilkin, "A Gateway Between the World-Wide Web and PAT: Exploiting SGML Through the Web", in The Public-Access Computer Systems Review, Vol. 5, No. 7, pp. 5-27, gopher://info.lib.uh.edu:70/00/articles/e-journals/uhlibrary/pacsreview/v5/n7/pricewil.5n7.
[Rot95]
J. Rothenberg, "Ensuring the Longevity of Digital Documents", in Scientific American, January, pp. 24-29.
[Sch94]
D. Schauder, Electronic Publishing of Professional Articles: Attitudes of Academics and Implications for the Scholarly Communication Industry, Unpublished Ph. D. Dissertation, University of Melbourne.
[Thi95]
P. Thistlewaite, "Managing Large Hypermedia Information Bases: a case study involving the Australian Parliament", in Proc. AusWeb'95, http://www.scu.edu.au/ausweb95/papers/management/thistlewaite/.