Six years ago, after some quarter century working in information system development, I became the latest casualty of high-tech downsizing and found myself applying for a “Document Engineer” job advertised in the newspaper.
Much to my astonishment, this led me to a new world involving SGML, XML and the new breed of IT systems that these standards enable. This column is devoted to sharing some insights gained along the way.
MX2 is a reference to a 1945 Atlantic Monthly article by then director of the Office of Scientific Research and Development, Dr. Vannevar Bush. The article covered what activity scientists could involve themselves in, now that the war was over. Bush wrote: “Instruments are at hand which, if properly developed, will give man access to and command over the inherited knowledge of the ages.” Bush described an information processing “device” which he called a “memex” and outlined its operation, which reads remarkably like a description of today’s Web. See www.theatlantic.com/unbound/flashbks/computer/bushf.htm for a copy of this article.
Memex modernized is MX, and the XML start tag is an appropriate metaphor for this column, which will cover a range of topics related to how we will “give man access to and command over the inherited knowledge of the ages.”
Let’s start with SGML, HTML and XML. Readers may be excused for being confused by the headlines about these. Is XML an extended HTML? Will XML replace SGML? Let’s begin our journey at the beginning.
SGML stands for Standard Generalized Markup Language, an ISO standard (ISO 8879:1986). It is a meta-language that specifies how a markup language or tag set can be constructed for a particular application domain. It says nothing about the specifics of the tag sets. Defining a tag set is the job of the application that will use it to carry out some useful processing.
Why do we need tag sets or a markup language? Most text-related applications use some form of markup language. For example, both WordPerfect and Word embed information in your text that directs the application to carry out processing (such as when to bold a character or start a new paragraph). This markup is proprietary and differs between applications and even between versions of the same product. SGML markup is nonproprietary. Information encoded with SGML markup is processable by any SGML-compliant application. Your information is captured in a vendor-, platform- and application-independent way. This is important for sharing your information between applications now, and of critical importance if you need to share it with an application in the distant future.
HTML stands for HyperText Markup Language. HTML is an SGML application. The familiar HTML tags are part of a standard SGML tag set designed for HTML applications. This surprises many people who think SGML and HTML are somehow worlds apart. HTML contributed to the success of the Web in two ways: it provided a means by which information can be tagged in a standard way and displayed successfully by any HTML-compliant browser, and it provided a useful implementation of hypertext, a major component of the memex described by Bush.
The weakness of HTML is that its tags are designed to describe how information transmitted over the Web should appear, and say nothing at all about what it is. Thus a financial transaction, a request for an inventory update, a grade school pupil’s Web page and the HTML version of Alice in Wonderland all use the same markup. Hence, it is very difficult to get a computer to carry out any processing beyond displaying the content on the screen, short of trying to deduce what the object is by “reading” and “understanding” the content. If we are going to carry out more useful processing over the Web, HTML is not semantically rich enough. Enter XML, the eXtensible Markup Language.
XML is a meta-language and a simplified subset of SGML. What has been removed are those parts of SGML that make SGML-compliant applications difficult to develop. Hence XML allows for the development of semantically rich tag sets and for easier development of XML-compliant applications.
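To make the difference concrete, here is a minimal Python sketch that reads the same funds transfer marked up two ways. The tag set (transfer, amount, from-account, to-account) and the account numbers are invented for this illustration and do not come from any real standard. The HTML fragment is deliberately well formed so that the same XML parser can read both.

```python
import xml.etree.ElementTree as ET

# Presentational markup: the tags only say how the text should look.
html_fragment = "<p><b>Transfer</b> 250.00 from 12345 to 67890</p>"

# Semantic markup: a hypothetical tag set invented for this sketch.
xml_fragment = """
<transfer currency="CAD">
  <amount>250.00</amount>
  <from-account>12345</from-account>
  <to-account>67890</to-account>
</transfer>
"""


def element_names(markup):
    """Return the element names found in the markup, in document order."""
    return [element.tag for element in ET.fromstring(markup).iter()]


html_names = element_names(html_fragment)
xml_names = element_names(xml_fragment)

# With semantic tags, a program can ask for the amount by name.
amount = float(ET.fromstring(xml_fragment).findtext("amount"))

print(html_names)
print(xml_names)
print(amount)
```

The names recovered from the HTML fragment tell a program only that there is a paragraph containing bold text. The names recovered from the XML fragment identify each piece of the transaction, so the amount can be requested directly instead of being guessed from the wording.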
Will XML replace SGML? Charles Goldfarb, in The XML Handbook, points out that they don’t even compete. Whereas SGML has typically been used for very large-scale commercial publishing, the XML subset is optimized for the Web and is targeted at having computers do useful things with Web transactions (like completing a banking transaction). XML, used this way, is much like a schema description for a transaction. The information might be generated by one application, transmitted over the Web, and consumed by a second application, causing a desired behaviour. For this reason you will often read that an application domain has designed both a tag set (for the semantics of the transaction) and a protocol (for the behaviour of the application).
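The generate, transmit and consume pattern can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the transfer tag set, the function names build_transfer and apply_transfer, and the account numbers are hypothetical, and the transmission step is reduced to passing a string from one function to the other.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def build_transfer(amount, source, target):
    """Producing application: serialize a transfer as XML text."""
    root = ET.Element("transfer")
    ET.SubElement(root, "amount").text = f"{amount:.2f}"
    ET.SubElement(root, "from-account").text = source
    ET.SubElement(root, "to-account").text = target
    return ET.tostring(root, encoding="unicode")


def apply_transfer(message, balances):
    """Consuming application: parse the message and update the balances."""
    root = ET.fromstring(message)
    if root.tag != "transfer":
        raise ValueError(f"unexpected message type: {root.tag}")
    amount = Decimal(root.findtext("amount"))
    balances[root.findtext("from-account")] -= amount
    balances[root.findtext("to-account")] += amount


# Hypothetical accounts; the string stands in for a message sent over the Web.
message = build_transfer(Decimal("250"), "12345", "67890")
balances = {"12345": Decimal("1000.00"), "67890": Decimal("0.00")}
apply_transfer(message, balances)

print(message)
print(balances)
```

The two functions share nothing but the tag set, which is the point: the tag set carries the semantics of the transaction, while the rule that a transfer message causes balances to change belongs to the protocol the two applications have agreed on.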
W. Hugh Chatfield, ISP, is an advisory consultant at Microstar Software in Ottawa. He can be reached at hchatfie@microstar.com.