Scholarly Publishing and the Fluid World Wide Web
Andrew Treloar, School of Computing and Mathematics, Deakin University, Rusden
Campus, 662 Blackburn Road, Clayton, 3168, Australia. Phone
+61 3 9244 7461. Fax +61 3 9244 7460.
Email: Andrew.Treloar@deakin.edu.au.
Home Page: Andrew Treloar.
Version information
This version was last modified on October 30, 1995. The
original version as delivered at the Asia-Pacific
World Wide Web conference is also available online.
- Abstract:
- Scholarly publishing has traditionally taken place using
the medium of print. This paper will discuss the transition
to electronic scholarly publishing and how the fluid nature
of the Web introduces special problems. These problems
include document linking, document invariance, document
durability, and changing standards.
- Keywords:
- Electronic publishing; scholarly publishing; SGML; HTML;
URL; URN; durability; fixity; standards.
Note on Citations:
As the citations need to be accessible on the Web and in
print, the following procedure has been followed. All
bibliographic citations in the text are hyper-linked to an
anchor in the reference list at the end of this paper. This
allows the reader easily to check the detail of the citation
and determine if they wish to access it. Cited works for which
there exists a URL can then be accessed by clicking on the text
of the URL itself, which will be given in full after the
citation information. Selecting Go back will, in most
Web browsers, return to the original location in the main
text.
Scholarly publishing has traditionally taken place using the
technology of print. Despite the seemingly inevitable rush into
the digital age, print is still the primary publishing
technology for the overwhelming majority of scholarly
disciplines. It is also the technology that still provides the
official archival record for almost all publications. However,
print publications suffer from a number of disadvantages:
- Journals tend to be slow to appear. [Har91] identifies the lag between writing and
publication as their major disadvantage.
- Print cannot be directly searched, leading to a large
market for secondary abstracting and indexing services.
- Print publications are limited to information that can be
represented statically in print.
- Mechanisms for hyper-linking (i.e. traditional citations)
are clumsy at best.
- Print is costly to produce, distribute and store ([Odl95]).
For all these reasons, as soon as the available technology
made it practicable, pioneering scholars began to use whatever
means they could to produce and distribute their writings
electronically. Such electronic publishing is sometimes
referred to as epublishing, by analogy with email. For an
excellent selective bibliography on the subject of scholarly
electronic publishing over networks, consult [Bai95]. For an analysis of the views of
academics on electronic publishing, consult [Sch94]. In roughly chronological order, the
technologies adopted for epublishing on the Internet were:
- listserv,
- anonymous file transfer protocol (aftp),
- gopher,
- and the Web.
New technologies tend to be used in addition to older
technologies, rather than supplanting them. Thus, in the field
of electronic publishing it is not unusual to find journals
that were initially distributed by listserv, and which then
added aftp, and later perhaps gopher access. These older
distribution technologies are now being augmented or
(increasingly) replaced by the Web. The non-hierarchical
document-based networked hypermedia architecture of the Web
provides a much richer environment for electronic publishing on
the Internet than any of the previous technologies.
Unfortunately, the Internet is an inherently impermanent
medium, characterised by anarchy and chaos. While this dynamic
information environment has many advantages, it poses some real
problems as a publishing medium. The Web inherits all of this
by using the Internet as a transport mechanism, but then adds
to it some challenges of its own. The Web is the area of the
Internet that is currently developing at the greatest rate.
This speed of change brings with it a large degree of fluidity
as technologies progress, mechanisms are developed and
standards become fixed. The difficulties for scholarly
publishing on the Web are particularly severe with respect to
document linking, document fixity, document durability, and
changing standards. This paper will consider each of these in
turn and try to provide some solutions or ways forward.
To be useful to scholars, publications need to be readily
accessible. As directory hierarchies are re-organised or
servers moved, old URLs no longer work. This is of course a
general problem with Internet-based electronic information
systems, but the problem of broken URLs is particularly acute
on the Web, where documents often contain multiple links to
other documents. Breaks in the electronic spider-web of links
can be extremely frustrating, and detract markedly from the
feeling of immersion in a seamless information environment. A
range of solutions to the problem of broken URLs are available
or under development. They include manual fixes, semi-automated
assistance, managing the links themselves separately from the
documents, and rethinking the link mechanism altogether.
A manual solution is to ensure that when directory
hierarchies are re-organised on servers, links are placed from
old locations to new locations. For Unix ftp servers this can
be done with link files. On the Macintosh, aliases perform a
similar function. On Web servers, a small document that
- indicates the file has moved
- states the new URL
- and provides a clickable link
is usual and sufficient.
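As an illustration only, such a forwarding document could be generated rather than written by hand. The sketch below is a minimal one under stated assumptions: the function name moved_notice, the page wording and the example.org addresses are all hypothetical, and nothing here is prescribed by any particular Web server.

```python
from html import escape


def moved_notice(old_path: str, new_url: str) -> str:
    """Return a small HTML page telling readers that a document has moved.

    The page does the three things listed above: it says the file has
    moved, states the new URL, and provides a clickable link to it.
    """
    safe_old = escape(old_path)
    safe_new = escape(new_url, quote=True)
    return (
        "<html><head><title>Document moved</title></head>\n"
        "<body>\n"
        f"<p>The document formerly at {safe_old} has moved.</p>\n"
        f'<p>Its new URL is <a href="{safe_new}">{safe_new}</a>.</p>\n'
        "</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical old path and new URL, for illustration only.
    print(moved_notice("/papers/old.html",
                       "http://www.example.org/papers/new.html"))
```

The output would be saved at the old location, so that existing HREFs arrive at the notice instead of an error.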
This sort of fix works, but has a number of
deficiencies:
- It requires the active involvement of server
administrators and is therefore prone to errors.
- Such link files often are given only a limited life.
After they expire, there is no way for a user to know where
to manually redirect the HREF.
- The authors of documents which point to altered locations
are not (and cannot be) notified that the target of their
HREFs has moved.
Most importantly, this technique does not scale well to large
complex hyper-linked document sets.
A partial improvement would be some mechanism that helped
identify links that were broken. It is technically feasible to
provide some form of Web
robot that would periodically walk all the links out of a
site and ensure that they still work. Unfortunately, I know of
no commercial vendors of Web publishing software who provide
such a facility. The closest thing available is the recently
introduced URL-minder
service. This allows a user to register document URLs and
receive an email message whenever the document moves or
changes. This can be very useful, especially if one is
concerned about the content of a linked-to document,
as well as its location.
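To make the idea of a link-walking robot concrete, the following is a minimal sketch using only the Python standard library. It checks the links out of a single document; a real robot would also need to walk a whole site, pause between requests and respect any robot exclusion conventions. The names LinkCollector, find_broken_links and http_is_reachable are my own, not part of any existing product.

```python
import urllib.request
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the HREF of every <a> element in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Named anchors (<a name=...>) have no HREF and are skipped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def http_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL can be fetched without an error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # Covers HTTP errors, network failures and malformed URLs.
        return False


def find_broken_links(base_url: str, html: str,
                      is_reachable: Callable[[str], bool] = http_is_reachable
                      ) -> List[str]:
    """Return the absolute URLs linked from `html` that cannot be reached."""
    collector = LinkCollector()
    collector.feed(html)
    broken = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if not is_reachable(absolute):
            broken.append(absolute)
    return broken
```

The reachability test is passed in as a parameter so that the checker can be exercised against a fixed set of known-good URLs without touching the network.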
Of course, one of the reasons that the Web has difficulties
with broken links is that the links are embedded in the
document stream. This makes automation of link repair
difficult. A number of information systems have made a
conscious decision to separate documents and links. Two
usefully illustrative examples are Hyper-G and PASTIME.
The Hyper-G information system ([Kap95])
uses a separate database engine to maintain meta-information
(including, but not restricted to, links) about documents as
well as their relationships to each other. Hyper-G servers
automatically maintain referential integrity for their local
documents and communicate with other servers to maintain
integrity for the entire system. In contrast to the Web, links
are bidirectional, enabling one to find the source of a link
from its destination (as well as vice versa). Hyper-G has
native clients for a range of platforms, or can be accessed via
a WWW to Hyper-G
gateway.
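The principle of holding links outside the documents, indexed in both directions, can be sketched in a few lines. This is not Hyper-G's actual design or interface, which [Kap95] describes; it is a toy in-memory model with hypothetical names, intended only to show why a separate link store makes referential integrity tractable.

```python
from collections import defaultdict
from typing import Dict, Set


class LinkDatabase:
    """Keep links outside documents, indexed in both directions."""

    def __init__(self) -> None:
        self._outgoing: Dict[str, Set[str]] = defaultdict(set)
        self._incoming: Dict[str, Set[str]] = defaultdict(set)

    def add_link(self, source: str, target: str) -> None:
        self._outgoing[source].add(target)
        self._incoming[target].add(source)

    def links_from(self, source: str) -> Set[str]:
        return set(self._outgoing.get(source, ()))

    def links_to(self, target: str) -> Set[str]:
        """Find the sources of links from their destination."""
        return set(self._incoming.get(target, ()))

    def move_document(self, old: str, new: str) -> None:
        """Rename a document and repair every link that involves it."""
        # Links pointing at the old name now point at the new one.
        for source in self._incoming.pop(old, set()):
            self._outgoing[source].discard(old)
            self._outgoing[source].add(new)
            self._incoming[new].add(source)
        # Links leaving the old name now leave the new one.
        for target in self._outgoing.pop(old, set()):
            self._incoming[target].discard(old)
            self._incoming[target].add(new)
            self._outgoing[new].add(target)
```

Because the incoming index records who links to a document, a move can repair every affected link in one step, which is exactly what embedded HREFs cannot offer.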
The PASTIME project ([Thi95]) has abandoned altogether the
practice of embedding fixed hyperlinks into documents. The
links are generated on the fly, based on sophisticated pattern
processing, as the HTML document is served to the remote
client. To add or remove hyperlinks only the pattern-processing
needs to be altered. Fixed hyperlinks can be entered if
desired. In contrast to Hyper-G, the currency of hyperlinks is
not dependent on maintaining a separate database of link
information. The runtime efficiency of the pattern matching is
claimed to be very high.
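The general technique of generating links at serving time from patterns can be illustrated as follows. This is a sketch of the idea only, not of PASTIME itself: the two rules, the URL templates and the example.org addresses are invented for the example, and [Thi95] should be consulted for the real pattern processing.

```python
import re

# Hypothetical rules: each pairs a text pattern with a URL template.
# Changing the links served means editing this table, not the documents.
LINK_RULES = [
    (re.compile(r"\bRFC (\d+)\b"), "http://www.example.org/rfc/rfc{0}.txt"),
    (re.compile(r"\bHansard, (\d{4})\b"), "http://www.example.org/hansard/{0}/"),
]


def add_links(text: str, rules=LINK_RULES) -> str:
    """Wrap every phrase matched by a rule in a hyperlink.

    Rules are applied one after another, so this sketch assumes that no
    rule matches text inside a link inserted by an earlier rule.
    """
    for pattern, template in rules:
        def make_anchor(match, template=template):
            url = template.format(*match.groups())
            return f'<a href="{url}">{match.group(0)}</a>'
        text = pattern.sub(make_anchor, text)
    return text


if __name__ == "__main__":
    print(add_links("See RFC 1738 for the definition of URLs."))
```

The stored document contains no HREFs at all, so there is nothing in it to go stale when a target moves.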
Ultimately, the most satisfactory solution will be to rethink
what is meant by a link. The most appropriate model is to adapt
the method used for scholarly links to other documents for
centuries - the scholarly citation. At the end of this paper is
the References section. This provides links
to other documents in the form of standardised bibliographic
citations. These citations do not make reference to the
location of the document - they only specify its
name in some unambiguous form.
The Web equivalent, of course, is the distinction between
Uniform Resource Locators (URLs)
and Uniform Resource Names (URNs). These are part of a
proposed wider Internet
infrastructure in which URNs are used for identification and
URLs for locating or finding resources. As in the print world,
what scholars want to be able to link to is the
contents of other identifiable documents - the
locations of those documents should be irrelevant.
URLs, with their dependence on a particular machine and
directory path, are a transitional kludge. URNs, with their
intended ability to refer to a known resource and have the
system take care of locating it and accessing it, are the long
term solution.
Of course, building and distributing a robust URN to URL
resolution mechanism is far from trivial, although some
prototype implementations are starting to appear, such as that
at Saint
Joseph's College in the U.S.A.
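The division of labour between names and locations can be shown with a deliberately naive resolver. It is a sketch under strong assumptions: a single in-memory table stands in for what would have to be a distributed service, and the "urn:example:..." names are placeholders, since the syntax of URNs is not yet settled.

```python
from typing import Dict, List


class UrnResolver:
    """Map location-independent names to wherever a document now lives."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, urn: str, url: str) -> None:
        """Record one more location (for example a mirror) for a name."""
        urls = self._locations.setdefault(urn, [])
        if url not in urls:
            urls.append(url)

    def relocate(self, urn: str, old_url: str, new_url: str) -> None:
        """Update a location after a move; the name itself never changes."""
        urls = self._locations[urn]
        urls[urls.index(old_url)] = new_url

    def resolve(self, urn: str) -> List[str]:
        """Return the current locations for a name (empty if unknown)."""
        return list(self._locations.get(urn, []))
```

A citing document would store only the name. When a server is re-organised, one call to relocate repairs every citation at once, because no citation ever contained the location.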
As an alternative transitional solution, [Fre95] proposes using standard message
broadcasting mechanisms, such as those provided by Usenet News
and Open Distribution Lists services, to announce
Uniform Resource Identifiers (URIs), and Uniform
Resource Characteristics (URCs), as well as URNs and
URLs.
In the area of the granularity of the object linked to, Web
publishing provides a way to improve on traditional print
citations, which point to a specific page at best. Of course,
the concept of page is not particularly relevant to electronic
documents, particularly given the way Web browsers can
dynamically reformat text as the size of the display window is
changed. It would be better if the unit linked to was both
smaller than a page, and more closely-linked to the structure
of the document. Obvious alternatives for the Web are named
anchors for sections, or numbered paragraphs. URLs can then
point to the correct section or even paragraph of a document,
rather than just to its beginning. Naturally, if individual Web
documents are small, this is less important. But many documents
produced at this transitional stage in the move to
electronic-only publishing are still structured for printing.
This document, and the others produced for this conference, are
good examples of this. The style
guide requires that submissions be a single document for
ease of management.
All Web authors of long documents should bear the needs of
fine-grained linking in mind and provide named anchors for
others to link to. Leading by example, this document has
pre-defined names for each heading. To stop URLs growing to
ridiculous lengths, these names are 2 or 3 character
identifiers that are unique (obviously) within this document.
To see what they are for linking purposes, readers can select
View Source in their browser.
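As an alternative to reading the source by eye, the named anchors in a document can be listed mechanically and turned into URLs that point at individual sections. The sketch below assumes anchors written as <a name="...">; the anchor names and the example.org address in the usage are hypothetical, not those of this paper.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urldefrag


class AnchorNames(HTMLParser):
    """Collect the NAME attribute of every named anchor in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.names: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "name" and value:
                    self.names.append(value)


def section_urls(document_url: str, html: str) -> Dict[str, str]:
    """Map each anchor name to a URL that points directly at it."""
    base, _fragment = urldefrag(document_url)  # drop any existing fragment
    parser = AnchorNames()
    parser.feed(html)
    return {name: f"{base}#{name}" for name in parser.names}


if __name__ == "__main__":
    sample = '<h2><a name="dl">Document Linking</a></h2>'
    print(section_urls("http://www.example.org/paper.html", sample))
```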
In the print world, we are used to documents remaining fairly
static. Journal articles, once published, are not normally
updated. They become fixed in time and part of the historical
record of scholarship. Monographs may appear in more than one
edition, but are then clearly recognisable as new products,
often with years between successive versions. Electronic
documents on the Internet can change so quickly that they sport
version numbers and the date they were last updated. Scholarly
Web publishing has a range of possibilities with respect to
fixity. At one end of the spectrum are documents that follow the
print model and remain fixed once published. Somewhere in the
middle are documents where only minor change is allowed. At the
other end are documents that are continuously updated as more
information becomes available, or as the author's views change.
As the first step away from static documents, the High Energy
Physics community has moved to a model of electronic publishing
which allows for ongoing corrections and addenda. The hep-th e-print archive which
provides this facility serves over 20,000 users from more than
60 countries, and processes over 30,000 messages per day ([Gin94]). JAIR,
the Journal of Artificial Intelligence Research, has
also just introduced the ability to make comments on published
papers and read others' comments.
A number of electronic scholarly journals are intending to
add pointers from earlier published articles to related later
ones on the author's behalf. The assumption is that pointers in
the reverse direction (that is, to prior published work) will
already have been included by the authors.
Tony
Barry from the ANU has suggested in [Bar95] that continuously updated documents
might be viewed as being more like a computer database. Such
continuous updating may be desirable to cope both with the
broken URL problem discussed above and the exponentially
increasing amount of information online that can be linked to.
The first of these problems may go away in the near future -
the second is unlikely to anytime soon.
In [Bar94] he has also suggested that
scholars should get recognition for the currency of
their documents rather than their number. While this idea is attractive on
the surface, I am profoundly sceptical that universities which
are only now starting to grapple with recognising the
validity of electronic publications are ready for this
visionary proposal. Nor am I convinced that the implications of
this model for the workloads of scholars have been fully
thought through. I can imagine an academic trying (and failing)
to keep a number of her articles in different content areas up
to date. As the number of such articles increases (as one might
hope they would, particularly in a 'publish or perish'
environment), the workload would increase to crippling levels.
This is particularly a problem in fields undergoing rapid
change.
Perhaps the only reasonable interim solution is to
distinguish somehow between fixed documents (print-like) and
continuously updated documents (database-like), or at least to
make it clear at the top of a document into which category it
falls. This approach has been used, for example, by Bailey in [Bai95]. The HTML version
of this document is continuously updated - the
ASCII version is fixed and permanently archived. Public-Access Computer
Systems Review (PACS-R), where these articles were
published, has until this year only published in fixed ASCII.
As a sign of the changing times, it now accepts articles in
fixed ASCII, fixed HTML and author-updated HTML.
If documents are continuously changing and evolving over time,
which version should be cited? Which version is the
'publication of record' (assuming this means anything any
more)?
Two approaches to the problem of permanence are in use on the
Web at present.
- Every time the document changes, its name changes also.
If the older version is replaced by the newer, then all URLs
pointing to the older version break. Moving to URNs will not
help in this case.
- Every time the document changes, the name is kept the
same and the contents updated. Existing URLs will still work,
although the target of the URL may have changed its content
significantly. In this case, what if one scholar cites a
section in a document that disappears in the next revision?
The URL-minder
service already mentioned will only be of limited help here.
Presumably under this model, a new URN will need to be
assigned, as is done with successive versions of print
monographs.
An alternative solution with wider application than just the
Web is discussed in [Gra95]. The nature
of all digital documents means that a mechanism is needed to
ensure the authenticity of a given document, or to track
multiple versions of a digital original. His proposed solution
for such authentication and version tracking is Digital Time
Stamping (DTS). With DTS a one-way algorithm is used to
generate a key that can be produced only by the original
document. These keys would then be made public, thus ensuring
that anyone could verify the version of a given document.
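The one-way step at the heart of this proposal can be illustrated with a cryptographic hash function. The sketch below uses SHA-256 from the Python standard library purely as an example of such a function; it is my choice, not one named in [Gra95]. A full time-stamping service would also bind each key to the time it was registered, which is not shown here.

```python
import hashlib


def fingerprint(document: bytes) -> str:
    """Return a fixed-length key derived one-way from the document.

    Any change to the document, however small, is expected to produce
    a different key, and the document cannot be recovered from the key.
    """
    return hashlib.sha256(document).hexdigest()


def matches_published(document: bytes, published_key: str) -> bool:
    """Check a copy in hand against a key that was made public earlier."""
    return fingerprint(document) == published_key


if __name__ == "__main__":
    original = b"Scholarly Publishing and the Fluid World Wide Web, version 1"
    key = fingerprint(original)           # this is what would be published
    print(matches_published(original, key))
    print(matches_published(original + b" (revised)", key))
```

Publishing only the key reveals nothing about the text, yet lets any reader confirm that the copy they hold is the version that was registered.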
Document durability ([Kau93]) refers
to the length of time the article is available for
communicative transactions. Paper documents printed on paper
that is not acid-free have a durability of some 100 years
unless corrective action is taken. The durability of Web
documents is entirely unknown, but there are no
technological reasons for their life to be limited in
any way, provided they are archived in some systematic way. At
present there are no mechanisms to ensure that this will
occur.
Graham ([Gra94]) divides the problem of
ensuring the preservation of documents into:
- medium preservation
- technology preservation
- intellectual preservation
Medium preservation relates to the problem of preserving the
medium on which the information is recorded. This has
traditionally been discussed in terms of environmental and
handling concerns for storage media. This will continue to be
important, but the rate of technological change is such that
preserving media like 8" floppies is of little use if nothing
can read them. [Les92] has suggested
that attention should instead be directed to
technological preservation. That is, the obsolescence of
technologies is much more of a problem than the decay of
storage media. In his words, for electronic information,
"preservation means copying, not physical preservation." The
third of Graham's preservation requirements, intellectual
preservation, addresses the integrity and authenticity of the
original information. In other words, what assurance do we
have that what is retrieved is what was originally recorded? [Gra95] proposes the DTS technology
discussed above as a solution to this
problem. [Rot95] also discusses a range
of mechanisms to deal with each of these three preservation
problems.
In many ways, the digital nature of all electronic publishing
can be both a strength and a weakness in the area of
durability. A strength, because digital documents can easily be
copied and replicated at multiple sites around the world. A
weakness, because destroying a digital document is far easier
than destroying a physical document. It is easy to assume that
the document will exist elsewhere on the Net and that the fate
of a single copy is irrelevant. Of course, there is no
mechanism to prevent everyone making this assumption and
causing the loss for ever of a piece of scholarship. In some
ways, the modern analogue of the single manuscript forgotten on top of
a cupboard in a mediaeval monastery may well be a forgotten
directory on a rarely used hard disk somewhere in a university
department. Unfortunately, it is all too easy to delete a
directory inadvertently - throwing away a manuscript without
realising it is somewhat harder. Given the lack of any
mechanism to ensure the archiving of print publications, it
seems unlikely (although technologically relatively easy) that
anything will be done about the situation for digital
documents.
The explosive development and adoption of the Web has been
paralleled by the evolution of the HTML standard. Starting from
what is now called HTML level 1, we have moved through level 2
and got nearly as far as level 3 (for most browsers, at least).
At the time of writing, HTML level 3 is still not finalised,
although this is not stopping the developers of Web browsers
from adding proposed or likely level 3 features to their
products. The W3O and others
are no doubt already thinking about the sorts of things that
might appear in HTML level 4.
As HTML has grown and mutated it has added a whole range of
things that were not envisaged at its birth, back when the
world was a simpler place. These include at least detailed
layout control, tables, and equation support. The question is,
should the process of accreting features to HTML continue
unabated into the future?
Price-Wilkin ([Pri94a], [Pri94b]) has argued that HTML as used on the
Web has a number of fundamental deficiencies as a scholarly
markup language. Some of those he lists are:
- The range of HTML tags available to users is still too
limited. As a consequence, authors are unable to
differentiate important elements with HTML.
- Where the author wishes to define a bounded segment of
text, such as a stanza or chapter, no tag is available for
this purpose. Instead, authors rely on dividing documents
into files representing major structural divisions.
- HTML tagging often confuses function and appearance.
- There are very few HTML tags that define structural
relationships.
He argues that those currently coding their documents in HTML
may come to regret their short-sightedness in a few years.
Instead, he argues for coding complex documents in SGML and
converting this into HTML on the fly for delivery on the Web as
well as through other means. As it turns out, one of the
characteristics of the evolution of HTML is precisely in the
direction of greater SGML compliance.
Philip Greenspun, from MIT, has also written on the
deficiencies of HTML ([Gre95]). His
preferred solution is to make much wider use of the META tag
included in HTML level 2.
HTML is certainly evolving towards full SGML compliance, but
betrays its origin as a formatting language rather than a
structuring language at every turn. It may not be possible to
migrate entirely seamlessly towards SGML. Indeed it may not be
necessary. Many types of publishing do not require the range of
features listed by Price-Wilkin. SGML to HTML gateways may
only be required for particular kinds of large complex
documents.
One way out of this mess is to clearly separate the internal
representation of a document from its ultimate delivery format.
We already do this with computer-generated documents that are
delivered in print. In the rapidly-changing world of electronic
publishing, documents may be delivered in HTML, as Adobe Acrobat PDF
files, as PostScript files, and in print (to name only the most
obvious options). Documents may be prepared in a wide range of
word or document processors for ultimate delivery using these
formats. Provided the structuring and layout
information held about the document is richer than the ultimate
delivery format requires, then conversion is relatively simple. A number
of manufacturers are already facilitating this with their
products. FrameMaker version 5.0 from Frame Technology Corporation and
PageMaker 6.0 from Adobe
both provide support for output to PDF and HTML, as well as
PostScript and print. Microsoft Word's Internet
Assistant provides users of the latest version of Word for
Windows with the ability to save in HTML as well as the native
document format. Given the rate of change in electronic
delivery technologies, it is probably best to keep files in a
format that can be output in a range of forms. This provides
the maximum flexibility and is reasonably future-proof.
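The separation of internal representation from delivery format can be sketched very simply. In the example below a document is held as a small structure of titled sections, and each delivery format is produced by its own function. The classes and function names are hypothetical, and a real system would hold a far richer structure (in SGML, for instance) than this toy does.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Section:
    title: str
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Document:
    title: str
    sections: List[Section] = field(default_factory=list)


def to_html(doc: Document) -> str:
    """Deliver the document as HTML."""
    parts = [
        f"<html><head><title>{escape(doc.title)}</title></head><body>",
        f"<h1>{escape(doc.title)}</h1>",
    ]
    for section in doc.sections:
        parts.append(f"<h2>{escape(section.title)}</h2>")
        parts.extend(f"<p>{escape(p)}</p>" for p in section.paragraphs)
    parts.append("</body></html>")
    return "\n".join(parts)


def to_plain_text(doc: Document) -> str:
    """Deliver the same document as plain ASCII text."""
    lines = [doc.title.upper(), ""]
    for section in doc.sections:
        lines += [section.title, "-" * len(section.title)]
        for paragraph in section.paragraphs:
            lines += [paragraph, ""]
    return "\n".join(lines)
```

Supporting a new delivery format then means writing one more output function, while the stored documents stay untouched.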
The Web provides an attractive and accessible environment for
scholarly electronic publishing (provided the content is
compatible with the limitations of HTML). It seems likely that
the use of the Web in this context will increase. In light of
the nature of the Web, care needs to be taken to ensure that
the potential dangers of the Web's fluidity are overcome. This
paper has outlined a number of the areas of difficulty and some
possible solutions.
- [Bai95]
- C. Bailey Jr., "Network-Based Electronic Publishing of
Scholarly Works: A Selective Bibliography", The
Public-Access Computer Systems Review, Vol. 6, Number
1, http://info.lib.uh.edu/pr/v6/n1/bail6n1.html.
- [Bar94]
- T. Barry, "Publishing on the Internet with World Wide
Web", in Proceedings of CAUSE '94 in Australia,
CAUDIT/CAUL, Melbourne.
- [Bar95]
- T. Barry, "Network Publishing on the Internet in
Australia", in The Virtual Information Experience -
Proceedings of Information Online and OnDisc '95,
Information Science Section, Australian Library and
Information Association, pp. 239-249.
- [Fre95]
- V. Freitas, "Supporting a URI Infrastructure by Message
Broadcasting", in Proc. INET '95, http://inet.nttam.com/HMP/PAPER/116/abst.html.
- [Gin94]
- P. Ginsparg, "First Steps towards Electronic Research
Communication", Computers in Physics,
August.
- [Gra94]
- Peter S. Graham, "Intellectual Preservation: Electronic
Preservation of the Third Kind", Commission on
Preservation and Access, Washington, D. C., http://aultnis.rutgers.edu/texts/cpaintpres.html.
- [Gra95]
- Peter S. Graham, "Long-Term Intellectual Preservation",
Proc. RLG Symposium on Digital Imaging Technology for
Preservation, http://aultnis.rutgers.edu/texts/dps.html.
- [Gre95]
- P. Greenspun, "We have Chosen Shame and Will Get War",
http://www-swiss.ai.mit.edu/philg/research/shame-and-war.html.
- [Har90]
- S. Harnad, "Scholarly Skywriting and the Prepublication
Continuum of Scientific Inquiry", in Psychological
Science, Vol. 1, pp. 342 - 343 (reprinted in Current
Contents 45: 9-13, November 11 1991),
ftp://ftp.princeton.edu/pub/harnad/Harnad/harnad90.skywriting.
- [Har91]
- S. Harnad, "Post-Gutenberg Galaxy: The Fourth Revolution
in the Means of Production of Knowledge", in The
Public-Access Computer Systems Review, Vol. 2, No.1,
pp. 39-53,
ftp://cogsci.ecs.soton.ac.uk/pub/harnad/Harnad/harnad91.postgutenberg.
- [Kap95]
- F. Kappe, "Maintaining Link Consistency in Distributed
Hyperwebs", in Proc. INET '95, http://inet.nttam.com/HMP/PAPER/073/html/paper.html.
- [Kau93]
- D. S. Kaufer & K. M. Carley, Communication at a
Distance - The Influence of Print on Sociocultural
Organization and Change, Lawrence Erlbaum
Associates.
- [Les92]
- M. Lesk, Preservation of New Technology: A Report
of the Technology Assessment Advisory Committee to the
Commission on Preservation and Access, Washington, DC:
CPA. Available from the Commission at $5: 1400 16th St. NW,
Suite 740, Washington, DC 20036-2217.
- [Odl95]
- A. Odlyzko, "Tragic loss or good riddance? The impending
demise of traditional scholarly journals" in Electronic
Publishing Confronts Academia: The Agenda for the Year
2000, Robin P. Peek and Gregory B. Newby, eds., MIT
Press/ASIS monograph, MIT Press,
ftp://netlib.att.com/netlib/att/math/odlyzko/tragic.loss.txt.
- [Pri94a]
- J. Price-Wilkin, "Using the World-Wide Web to Deliver
Complex Electronic Documents: Implications for Libraries" in
The Public-Access Computer Systems Review, Vol.
5, No. 3, pp. 5-21,
gopher://info.lib.uh.edu:70/00/articles/e-journals/uhlibrary/pacsreview/v5/n3/pricewil.5n3.
- [Pri94b]
- J. Price-Wilkin, "A Gateway Between the World-Wide Web
and PAT: Exploiting SGML Through the Web.", in The
Public-Access Computer Systems Review, Vol. 5, No. 7,
pp. 5-27,
gopher://info.lib.uh.edu:70/00/articles/e-journals/uhlibrary/pacsreview/v5/n7/pricewil.5n7.
- [Rot95]
- J. Rothenberg, "Ensuring the Longevity of Digital
Documents", in Scientific American, January, pp.
24 - 29.
- [Sch94]
- D. Schauder, Electronic Publishing of Professional
Articles: Attitudes of Academics and Implications for the
Scholarly Communication Industry, Unpublished Ph. D.
Dissertation, University of Melbourne.
- [Thi95]
- P. Thistlewaite, "Managing Large Hypermedia Information
Bases: a case study involving the Australian Parliament", in
Proc. AusWeb'95,
http://www.scu.edu.au/ausweb95/papers/management/thistlewaite/.