A sample XML document

OK, I know your time is limited and you haven't read all the stuff you found on the previous page.

Representing an order status in HTML

"Why HTML?" you might be asking yourself and your question is perfectly valid. There are of course a lot of nice things about using HTML, the most important one being that you are able to get the status of your order at Amazon.com over the Internet and can view it in your Web browser.

To understand why XML is the better choice, let's first take a look at a typical order status Web page as you might find it somewhere out there:

<html>
<head>
<title>Order Status</title>
</head>
<body>
<h1>Order Status</h1>
<p>
<br>
<ol>
<li>Part No. 0815, Units 4, Unit Price 12.95 USD
<li>Part No. 0816, Units 2, Unit Price 14.50 USD
<li>Part No. 1504, Units 5, Unit Price 09.95 USD
</ol>
<p>Status: <em>Confirmed</em>
<p>Delivery Date: 08/15/1999
</body>
</html>

If you are familiar with HTML, none of this is very exciting. Neither is the fact that the HTML code above will be displayed roughly as pictured below (not nice, but we get the information we are looking for):

[Figure: the order status page as displayed in a Web browser]

"Tell me something new!" your are meanwhile thinking, and I will ask you to change your perspective from being a intelligent human being to being a simple-minded software agent, robot or other software system like a business application. Let's assume you want to automatically retrieve the order status of an order you are keeping in your database. You know how to communicate over the Internet: there is a standard called HTTP and your programmer was wise enough to include an HTTP library into your executable code. Let's further assume that you know the URL of Amazon.com, where you placed the order for your user and the Amazon's Web server is willing to return the above HTML page to you. Now it's your task as a program to find the relevant data in this code full of HTML displaying information. You might use a heuristic approach and assume (or know) that right in front of the delivery date is the text "Delivery Date: " and that the following characters up to the next space form the actual date (you of course know that the format is "mm/dd/yyyy" or do you?). Now let's sit down and pray that the Web designers of Amazon don't change their mind about the page layout and dont' tell you that they have removed the text "Delivery Date:" in favour of an image of a calendar page...

Representing an order status in XML

To make a long story short, the example above was very simple and you might argue that there are ways to cope with the problems mentioned. I will argue that there are always solutions for software problems (in the end it's only software), but that real-world data exchange between two business applications usually involves more complex data structures and data formats and leaves less room for heuristic approaches.

Or you might argue that the approach above is absolutely ridiculous and nobody in the world would ever try to HTML-"scrape" business information from Web pages. Totally wrong! There are tools out there (e.g. webMethods) that help you do exactly that. "(Business) Life always finds a way..." and if there is a need for that kind of functionality, people will build such systems (oops, has something like that ever happened at SAP ;-).

Fortunately, both you and I understand that there should be better ways than pattern matching to find data in documents.

The answer lies in marking up information or data in a way that helps identify the semantics of the data, not display properties like "format this as a table in the Web browser" or "render the following text with a 16pt font". In the example above we would obviously like markup that identifies business-relevant data like the order number to which the status belongs, the status itself or the delivery date. This is where XML comes in handy...

XML allows you to define your own set of tags (the names surrounded by "<" and ">") specifically for your kind of information. Whereas HTML is made up of a fixed set of tags (ok, there are several versions: 2.0, 3.2, 4.0), we can't and shouldn't define one language to cover all possible tag names and structures somebody might need to mark up data. Instead, XML allows you to choose tag names and structures depending on the semantics of your data (the "X" in "XML" comes from "eXtensible"): this time you want to mark up "order status" documents, the next time it's "purchase orders" or your collection of CDs.

Now let's redo the example above, this time using XML instead of HTML:

<orderStatus>
<orderNo>4711</orderNo>
<items>
<item>
<partNo>0815</partNo>
<units>4</units>
<price currency="USD">12</price>
</item>
<item>
...
</item>
...
</items>
</orderStatus>

The HTML tags have been replaced by tags that describe the content and structure of the data instead of its display properties. Now a software agent can look for the appropriate markup, e.g. the tag "<delDate>", to find the enclosed delivery date. An additional attribute such as format="mm/dd/yyyy" can even help you identify the underlying date format.
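As a rough sketch of how a software agent could do this, the following Python snippet uses the standard xml.etree.ElementTree module. The <status> and <delDate> elements do not appear in the abbreviated listing above; they are assumed here, based on the description in the text and the values from the HTML page. The function name read_order_status is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Order status document; <status> and <delDate> are assumptions (see above).
ORDER_STATUS = """<orderStatus>
  <orderNo>4711</orderNo>
  <items>
    <item>
      <partNo>0815</partNo>
      <units>4</units>
      <price currency="USD">12</price>
    </item>
  </items>
  <status>Confirmed</status>
  <delDate format="mm/dd/yyyy">08/15/1999</delDate>
</orderStatus>"""


def read_order_status(xml_text):
    """Pick the business data out of the document by tag name, not by position."""
    root = ET.fromstring(xml_text)
    del_date = root.find("delDate")  # None if the element is missing
    return {
        "orderNo": root.findtext("orderNo"),
        "status": root.findtext("status"),
        "delDate": del_date.text if del_date is not None else None,
        "format": del_date.get("format") if del_date is not None else None,
    }


print(read_order_status(ORDER_STATUS))
```

No pattern matching on display text is involved: the agent asks for the element by name and reads the date format from the attribute instead of guessing it.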

"Ok, but...", you will say now, and you are right with all your objections:

First, who keeps somebody from changing the tag names? Nobody. But there is a difference between exposing Web pages for human viewers on a Web site and placing business documents containing data on a server: One can assume that you are smart enough to get the information you want even if the layout of the Web page completely changes. Software usually isn't that smart (the AI guys might disagree here). So changing data formats without prior notice is a classical "no-no" of the data exchange domain.

Second, who says that "<delDate>" means "delivery date" and not "date of deletion"? Nobody. In fact it could well be that the same tag name represents "delivery date" in one document and "date of deletion" in another. To distinguish these cases, XML offers two additional concepts:
a) XML document type definitions (or DTDs): each document belongs to one document type. A document type defines the tag names and the allowed structure of an instance of this document type (i.e. a document). Furthermore the document type classifies the content of a document semantically.
b) XML namespaces (something I don't want to cover in more detail).
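Without going into detail, point b) can be illustrated with a small Python sketch. The two namespace URIs, the <record> document and the MEANINGS table are purely hypothetical; the sketch only shows that the same local tag name can be told apart once it is qualified by a namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this sketch.
SHIPPING_NS = "http://example.com/ns/shipping"
ARCHIVE_NS = "http://example.com/ns/archive"

DOC_A = f'<orderStatus xmlns="{SHIPPING_NS}"><delDate>08/15/1999</delDate></orderStatus>'
DOC_B = f'<record xmlns="{ARCHIVE_NS}"><delDate>08/15/1999</delDate></record>'

# ElementTree reports qualified tag names in the form "{namespace-uri}localname".
MEANINGS = {
    f"{{{SHIPPING_NS}}}delDate": "delivery date",
    f"{{{ARCHIVE_NS}}}delDate": "date of deletion",
}


def meaning_of_del_date(xml_text):
    """Return the agreed meaning of the first known delDate element, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.tag in MEANINGS:
            return MEANINGS[element.tag]
    return None  # unqualified or unknown: the agent cannot tell


print(meaning_of_del_date(DOC_A))
print(meaning_of_del_date(DOC_B))
```

Note that the table of meanings still has to be agreed on outside the documents themselves; the namespace only makes the two names distinguishable.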

To be precise, XML does not provide any means to syntactically guarantee what is only semantically agreed on: if your application interprets <delDate> as "date of deletion" even though it is meant to be "delivery date", XML can't prevent it.

Obviously, you and your communication partners have to somehow agree on the names of the tags used, their semantics and the structure of the document. Ideally, there will be industry standards for specific data structures, e.g. XML/EDI for EDI over XML, standards for electronic data exchange in eCommerce applications etc. As of today, things are still under development.

Back to the simple things in life: you probably wonder if you can still display our "order status" document for a human viewer sitting in front of a Web browser, and the answer is "yes". The magic can be done using XSL (the eXtensible Stylesheet Language), but that is another story, and all I will say about it is that using XSL and IE 5.0 you can display our XML version of the "order status" document like this:

[Figure: the XML "order status" document displayed in IE 5.0 using XSL]

Looks familiar, eh?