Andrew Treloar's personal website

Hypermedia Online Publishing: the Transformation of the Scholarly Journal

4.5.4 Document oriented solutions

These consist of markup languages which are focused on describing entire documents rather than individual pages. They either make no explicit reference to formatting or do not specify a particular page size. The three most significant languages are all interrelated: SGML, HTML and XML.

Standard Generalised Markup Language (SGML)

Standard Generalised Markup Language (SGML) is best thought of as a document markup definition language. Defined as ISO Standard 8879:1986, it provides a formal notation for definition of generalized markup languages [Goldfarb, 1991]. Such languages are defined as a Document Type Definition (DTD). The DTD defines the allowable tags for a particular document type, and the permissible sequence of these tags. In effect SGML is "a metalanguage, in which tag sets, as well as usage rules for these tags, can be defined" [Marcoux and Sévigny, 1997, p. 586]. A wide range of existing standard DTDs have already been created, and organisations using SGML will often have their own in-house DTD.
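The idea that a DTD fixes both the allowable tags and their permissible sequence can be sketched in a few lines. The example below is an illustration only, not an SGML implementation: the element names (article, title, author, para) and the content model are hypothetical, the sample documents are parsed with Python's XML parser rather than an SGML parser, and the content model is approximated by a regular expression over the child tag names.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model, loosely mirroring a DTD declaration such as
#   <!ELEMENT article (title, author+, para*)>
# Each model is a regular expression over space-terminated child tag names.
CONTENT_MODELS = {
    "article": r"(title )(author )+(para )*",
}

def children_conform(element):
    """Return True if the element's children appear in a permitted sequence."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # assumption of this sketch: undeclared elements are not checked
    sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, sequence) is not None

good = ET.fromstring(
    "<article><title>T</title><author>A</author><para>P</para></article>"
)
bad = ET.fromstring(
    "<article><author>A</author><title>T</title></article>"
)

good_ok = children_conform(good)  # title, author, para: permitted order
bad_ok = children_conform(bad)    # author before title: not permitted
print(good_ok, bad_ok)
```

A validating parser does considerably more than this (attributes, entities, nested content models), but the principle is the same: the tag set and its ordering rules live in the DTD, not in the document.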

The important thing to note about SGML is that it encodes the structure of the document and does not define its appearance. Appearance is handled separately by the Document Style Semantics and Specification Language (DSSSL). Because of this explicit focus on document semantics rather than expression, SGML lends itself very well to repurposing of content. An SGML-encoded document can easily be rendered for print output, delivery on the Web, or distribution on CD-ROM.

Although SGML predates the Web (and indeed was hinted at as early as 1970 [Goldfarb, 1997]), it has been taken up fairly slowly. This is because it is a fairly complex system, and its benefits are most accessible to large organisations with complex sets of technical documents. The Online Journal of Current Clinical Trials was an early e-journal user of SGML technology [Keyhani, 1995]. SGML is currently being adopted more enthusiastically by organisations wishing to reuse their content in a variety of media and forms [Marcoux and Sévigny, 1997]. Work is also proceeding apace on further development of SGML and its related standards: DSSSL, HyTime (the Hypermedia/Time-based Structuring Language) and SPDL (Standard Page Description Language) [Mason, 1997].

HyperText Markup Language (HTML)

HyperText Markup Language (HTML) is formally defined in terms of the ISO Standard Generalised Markup Language (SGML) as a specialised DTD. It provides a standardised way to create structured textual documents for delivery on the Web (and increasingly elsewhere). In the context of the WWW initiative, HTML is used to encode Web documents and embed the links that together form the web. Non-HTML documents that are pointed to lie at the periphery of the web: they cannot themselves point to anything else.

HTML defines a small but growing number of constructs which can be used to build up documents of considerable flexibility and power. All these constructs are included in the body of the document and delimited with the < and > characters. Each construct delimited in this way is called a tag. Many of the tags are paired: <X> starts a construct and </X> ends it. This system of tags plus text is similar to other superseded and current systems for marking text to control output, like Runoff, troff, and TeX. Another way of thinking about it is that HTML documents are programs, and the client programs 'run' these programs to generate the final document. The range of possible tags covers both structuring elements and a range of formatting commands. The current version of the HTML tag set, HTML 4.0, has just been released by the World Wide Web Consortium.
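The tag mechanism described above can be made concrete with a small sketch. It uses Python's standard html.parser module to list the tags in a fragment; the sample markup and the class name TagLister are invented for illustration. Note that <hr> and the unclosed <p> produce a start event but no end event, which shows that not every tag is paired.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record start tags, end tags and text in the order they are encountered."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

# Hypothetical fragment: <h1> and <b> are paired, <p> is left open, <hr> stands alone.
sample = "<h1>Title</h1><p>Some <b>bold</b> text.<hr>"

lister = TagLister()
lister.feed(sample)
lister.close()

for kind, value in lister.events:
    print(kind, value)
```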

Structuring elements govern the logical (as opposed to physical) structure of the document. The two main constructs here are inline images (referred to already) and anchors. Anchors are pieces of text which mark the beginning and/or the end of a hypertext link. They allow links inside a document or to another document. Links within documents are commonly used to provide a table of contents at the start of a long HTML document. The user can jump to a particular section by clicking on an internal hyperlink. Anchors may also be referenced in URLs, allowing links into the middle of documents. Links to another document invoke the full power of the URL mechanism. This means that a single HTML document can refer to other HTML documents on other servers, to Gopher servers, to Usenet newsgroups, FTP sites, and the like. The structuring features of HTML are much more primitive than full SGML allows and are not binding on the author of the document.
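The distinction between links within a document and links to other documents can be sketched as follows. The markup, host names and the class name AnchorCollector are hypothetical; the code relies only on Python's standard html.parser and urllib.parse modules. An anchor with a name attribute marks a destination, an href consisting only of a fragment points inside the same document, and any other href goes through the URL mechanism.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class AnchorCollector(HTMLParser):
    """Sort the anchors of a document into targets, internal and external links."""

    def __init__(self):
        super().__init__()
        self.targets = []    # named destinations, e.g. <a name="sec2">
        self.internal = []   # fragment-only links, e.g. <a href="#sec2">
        self.external = []   # (scheme, href) pairs for links to other documents

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "name" in attributes:
            self.targets.append(attributes["name"])
        href = attributes.get("href")
        if href is None:
            return
        parts = urlsplit(href)
        fragment_only = not (parts.scheme or parts.netloc or parts.path)
        if fragment_only and parts.fragment:
            self.internal.append(parts.fragment)
        else:
            self.external.append((parts.scheme, href))

# Hypothetical page with a table-of-contents link and links to other servers.
page = """
<a href="#sec2">Jump to section 2</a>
<a name="sec2">Section 2</a>
<a href="http://www.example.org/paper.html#results">Results elsewhere</a>
<a href="ftp://ftp.example.org/pub/">FTP archive</a>
<a href="gopher://gopher.example.org/">Gopher server</a>
"""

collector = AnchorCollector()
collector.feed(page)
collector.close()
schemes = [scheme for scheme, _ in collector.external]
print(collector.targets, collector.internal, schemes)
```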

Formatting commands allow the designer of an HTML document to control the layout and appearance of the text. The interpretation of HTML documents normally ignores line feeds, form feeds and carriage returns, so document formatting must be marked explicitly. This formatting includes up to six levels of headings, paragraph breaks, various types of lists including numbered and bullet points, and character highlighting: bold, italic, monospace text and the like.
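The effect of ignoring line feeds and carriage returns can be shown with a deliberately simplified sketch. It is not how a browser is implemented: the function name render_paragraphs is invented, and only the <p> tag is recognised. Line breaks in the source have no effect on the result, whereas the explicit <p> tag does.

```python
import re

def render_paragraphs(html_source):
    """Split on <p> tags, then collapse runs of whitespace inside each paragraph."""
    chunks = re.split(r"<p>", html_source, flags=re.IGNORECASE)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

# The source is wrapped arbitrarily; only the explicit <p> starts a new paragraph.
wrapped = "First line\nstill the same\r\n   paragraph.<p>Second\n\n\nparagraph."
paragraphs = render_paragraphs(wrapped)

for paragraph in paragraphs:
    print(paragraph)
```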

The original HTML was not very SGML compliant, although it was SGML-like. With each iteration of the HTML standards it is moving towards closer SGML compliance. The long-term goal is to move HTML into something that is entirely SGML-compliant. This is XML (eXtensible Markup Language).

eXtensible Markup Language (XML)

XML is best thought of as generic SGML delivered over the Web (or 'SGML-Lite'). Its design goals were to provide 80% of the benefits of SGML for 20% of its complexity. The problem is that the full SGML specification is both hard to implement and more than most Web users need. XML will enable an ISO-compliant subset of SGML to be served, received and processed on the Web. Of course, this will require upgraded servers and browsers to be able to manage documents, their associated DTDs and one or more stylesheets for display. The components of XML are DTDs, XSL, and XLL [Bray et al., 1998].

As with full SGML, the Document Type Definition (DTD) specifies the logical structure (or grammar) of the document. In particular it defines a document's elements and attributes, and the relationships among those elements and attributes. Developers can use existing DTDs or provide no DTD at all. In the latter case the XML parser will only check the document for 'well-formedness'.
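A well-formedness check without any DTD can be sketched with Python's standard xml.etree.ElementTree module, which parses XML but does not validate against a DTD. The helper name is_well_formed and the sample documents are invented for illustration. The parser accepts any element names, but rejects overlapping or unclosed elements.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# Hypothetical documents using made-up element names.
nested = "<article><title>Hypermedia</title><para>Text</para></article>"
overlapping = "<article><title>Hypermedia<para></title>Text</para></article>"
unclosed = "<article><title>Hypermedia</title>"

results = [is_well_formed(doc) for doc in (nested, overlapping, unclosed)]
print(results)
```

Checking that the elements are the right ones, in the right order, is the additional job of validation against a DTD.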

The eXtensible Style Language (XSL) specifies style sheets for XML documents. The browser can change the appearance of the document by switching the style sheet. XSL is less complex than SGML's DSSSL and provides a subset of its functionality. A mechanical mapping from DSSSL to XSL will be possible.
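The principle of changing appearance by switching the style sheet can be sketched without using XSL itself. In the example below the 'style sheets' are plain Python dictionaries mapping element names to formatting functions; this is an assumption made for illustration, as are the element names and the names SCREEN_STYLE, PRINT_STYLE and render. The document is never modified; only the style applied to it changes.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: structure only, no formatting information.
DOCUMENT = (
    "<article><title>Online Publishing</title>"
    "<para>First paragraph.</para></article>"
)

# Two stand-in "style sheets": element name -> formatting function.
SCREEN_STYLE = {
    "title": lambda text: "<h1>" + text + "</h1>",
    "para": lambda text: "<p>" + text + "</p>",
}
PRINT_STYLE = {
    "title": lambda text: text.upper(),
    "para": lambda text: "    " + text,
}

def render(xml_text, style):
    """Format each child of the root element according to the chosen style."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        formatter = style.get(child.tag, lambda text: text)
        lines.append(formatter(child.text or ""))
    return "\n".join(lines)

screen = render(DOCUMENT, SCREEN_STYLE)
printed = render(DOCUMENT, PRINT_STYLE)
print(screen)
print(printed)
```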

The eXtensible Link Language (XLL) is a significant enhancement to the linking capabilities provided by HTML, which supports a tiny fraction of all possible hypertextual links. XLL is basically a subset of HyTime (the Hypermedia/Time-based Structuring Language) and will support:

XML is best suited for applications that:

The world of the World-Wide Web will gradually make the transition from HTML encoded documents to XML-encoded documents. Support for XML is starting to appear in Web authoring tools and should appear in the next versions of Web browsers.

Last modified: Monday, 18-Sep-2017 03:29:21 AEST

© Andrew Treloar, 2001.