Tutorial on XHTML
Software Engineering Group 4

Stijn Coene
stijn.coene@vub.ac.be

"March 19, 2004"

1  Introduction

This tutorial describes the differences between XHTML and HTML 4.0.

2  Basic differences

I have split the differences into two parts. The basic differences are the ones we will run into most often. The advanced differences can also be useful, but I think they are not of crucial importance for our project.

2.1  Documents must be well-formed

Essentially this means that all elements must either have closing tags or be written in a special form (as described below), and that all the elements must nest properly.
Example

CORRECT: nested elements.

<p>here is an emphasized <em>paragraph</em>.</p>

INCORRECT: overlapping elements

<p>here is an emphasized <em>paragraph.</p></em>
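
The nesting rule can be checked mechanically with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical and only introduced for illustration. It tests well-formedness only, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The two fragments from the example above.
nested = "<p>here is an emphasized <em>paragraph</em>.</p>"
overlapping = "<p>here is an emphasized <em>paragraph.</p></em>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```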

2.2  Element and attribute names must be in lower case

XHTML documents must use lower case for all HTML element and attribute names. This difference is necessary because XML is case-sensitive: for example, <li> and <LI> are different tags.
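
To see the case sensitivity at work, here is a small sketch, again assuming Python's standard xml.etree.ElementTree module. Note that <LI> on its own is still well-formed XML; it is simply not the li element that XHTML defines.

```python
import xml.etree.ElementTree as ET

# An XML parser keeps element names exactly as they were written.
lower = ET.fromstring("<li>item</li>")
upper = ET.fromstring("<LI>item</LI>")
print(lower.tag, upper.tag)
print(lower.tag == upper.tag)

# Opening with one spelling and closing with the other is a mismatched tag.
try:
    ET.fromstring("<li>item</LI>")
    mixed_ok = True
except ET.ParseError:
    mixed_ok = False
print(mixed_ok)
```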

2.3  For non-empty elements, end tags are required

In SGML-based HTML 4, certain elements were permitted to omit the end tag, with the elements that followed implying closure. XML does not allow end tags to be omitted. All elements other than those declared in the DTD as EMPTY must have an end tag. Elements that are declared in the DTD as EMPTY can have an end tag or can use the empty-element shorthand (see section 2.6).
Example

CORRECT: terminated elements

<p>here is a paragraph.</p><p>here is another paragraph.</p>

INCORRECT: unterminated elements

<p>here is a paragraph.<p>here is another paragraph.

2.4  Attribute values must always be quoted

All attribute values must be quoted, even those which appear to be numeric.
Example

CORRECT: quoted attribute values

<td rowspan="3">

INCORRECT: unquoted attribute values

<td rowspan=3>
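
A short sketch of what an XML parser makes of both forms, assuming Python's standard xml.etree.ElementTree module. The cell content and the closing tag are added here only so that each fragment is a complete element.

```python
import xml.etree.ElementTree as ET

quoted = ET.fromstring('<td rowspan="3">cell</td>')
# Attribute values are always delivered as strings, even if they look numeric.
print(repr(quoted.get("rowspan")))

# The unquoted form is rejected by the parser.
try:
    ET.fromstring("<td rowspan=3>cell</td>")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False
print(unquoted_ok)
```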

2.5  Attribute Minimization

XML does not support attribute minimization. Attribute-value pairs must be written in full. Attribute names such as compact and checked cannot occur in elements without their value being specified.
Example

CORRECT: unminimized attributes

<dl compact="compact">

INCORRECT: minimized attributes

<dl compact>
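
The same kind of check works for minimized attributes. This is a sketch assuming Python's standard xml.etree.ElementTree module; the dt and dd children are made up to complete the list.

```python
import xml.etree.ElementTree as ET

# Written in full, the attribute is an ordinary name/value pair.
full = ET.fromstring(
    '<dl compact="compact"><dt>term</dt><dd>definition</dd></dl>'
)
print(full.attrib)

# A bare attribute name without a value is not well-formed XML.
try:
    ET.fromstring("<dl compact><dt>term</dt><dd>definition</dd></dl>")
    minimized_ok = True
except ET.ParseError:
    minimized_ok = False
print(minimized_ok)
```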

2.6  Empty Elements

Empty elements must either have an end tag or the start tag must end with />. For instance, <br/> or <hr></hr>.
Example

CORRECT: terminated empty elements

<br/><hr/>

INCORRECT: unterminated empty elements

<br><hr>
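
The two permitted spellings are equivalent for an XML parser, as the following sketch suggests. It assumes Python's standard xml.etree.ElementTree module; the surrounding div and the helper name shape are hypothetical and only there to get a single root element that is easy to compare.

```python
import xml.etree.ElementTree as ET


def shape(element):
    """Describe each child as (tag name, text content, number of children)."""
    return [(child.tag, child.text, len(child)) for child in element]


short_form = ET.fromstring("<div><br/><hr/></div>")
long_form = ET.fromstring("<div><br></br><hr></hr></div>")

# Both spellings yield the same empty elements.
print(shape(short_form))
print(shape(short_form) == shape(long_form))

# Leaving the empty elements unterminated breaks the nesting of the div.
try:
    ET.fromstring("<div><br><hr></div>")
    unterminated_ok = True
except ET.ParseError:
    unterminated_ok = False
print(unterminated_ok)
```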

3  Advanced differences

3.1  Script and Style elements

In XHTML, the script and style elements are declared as having #PCDATA content. As a result, < and & will be treated as the start of markup, and entities such as &lt; and &amp; will be recognized as entity references by the XML processor to < and & respectively. Wrapping the content of the script or style element within a CDATA marked section avoids the expansion of these entities.
<script type="text/javascript">
<![CDATA[
... unescaped script content ...
]]>
</script>

CDATA sections are recognized by the XML processor and appear as nodes in the Document Object Model; see Section 1.3 of the DOM Level 1 Recommendation [DOM] (http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html#ID-E067D597).
An alternative is to use external script and style documents.
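
The effect of a CDATA section can be observed with a DOM parser. The following is a sketch assuming Python's standard xml.dom.minidom module; the script text and the names a, b, c and run are invented for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

script_text = "if (a < b && b < c) { run(); }"

# Inside a CDATA section, < and & are plain characters.
wrapped = (
    '<script type="text/javascript"><![CDATA['
    + script_text
    + "]]></script>"
)
document = minidom.parseString(wrapped)
node = document.documentElement.firstChild

# The section shows up as its own node type in the DOM.
print(node.nodeType == minidom.Node.CDATA_SECTION_NODE)
print(node.data)

# Without the CDATA section, the same content is rejected as broken markup.
try:
    minidom.parseString(
        '<script type="text/javascript">' + script_text + "</script>"
    )
    bare_ok = True
except ExpatError:
    bare_ok = False
print(bare_ok)
```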

3.2  SGML exclusions

SGML gives the writer of a DTD the ability to exclude specific elements from being contained within an element. Such prohibitions (called "exclusions") are not possible in XML.
For example, the HTML 4 Strict DTD forbids the nesting of an 'a' element within another 'a' element to any descendant depth. It is not possible to spell out such prohibitions in XML. Even though these prohibitions cannot be defined in the DTD, certain elements should not be nested. A summary of such elements and the elements that should not be nested in them is found in the normative Element Prohibitions appendix of the XHTML 1.0 specification.

3.3  The elements with 'id' and 'name' attributes

HTML 4 defined the name attribute for the elements a, applet, form, frame, iframe, img, and map. HTML 4 also introduced the id attribute. Both of these attributes are designed to be used as fragment identifiers.
In XML, fragment identifiers are of type ID, and there can only be a single attribute of type ID per element. Therefore, in XHTML 1.0 the id attribute is defined to be of type ID. In order to ensure that XHTML 1.0 documents are well-structured XML documents, XHTML 1.0 documents MUST use the id attribute when defining fragment identifiers on the elements listed above.
Note that in XHTML 1.0, the name attribute of these elements is formally deprecated, and will be removed in a subsequent version of XHTML.
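
As an illustration of id attributes used as fragment identifiers, here is a minimal sketch assuming Python's standard xml.etree.ElementTree module. The helper find_fragment, the sample markup and the id values are hypothetical. The sketch resolves a fragment by its id and checks that no id is used twice, which is what the ID type requires.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<div>"
    '<a id="intro">Introduction</a>'
    '<a id="details">Details</a>'
    "</div>"
)


def find_fragment(root, fragment):
    """Return the first element whose id attribute equals the fragment."""
    return root.find(f".//*[@id='{fragment}']")


target = find_fragment(page, "details")
print(target.text)

# Values of type ID must be unique within one document.
ids = [el.get("id") for el in page.iter() if el.get("id") is not None]
print(len(ids) == len(set(ids)))
```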

3.4  Attributes with pre-defined value sets

HTML 4 and XHTML both have some attributes that have pre-defined and limited sets of values (e.g. the type attribute of the input element). In SGML and XML, these are called enumerated attributes. Under HTML 4, the interpretation of these values was case-insensitive, so a value of TEXT was equivalent to a value of text. Under XML, the interpretation of these values is case-sensitive, and in XHTML 1 all of these values are defined in lower-case.
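
A rough sketch of the consequence, assuming Python's standard xml.etree.ElementTree module. The standard library does not validate against a DTD, so the set INPUT_TYPES below is written out by hand from the values I believe the XHTML 1.0 DTDs list for the type attribute of input; treat it, and the helper has_valid_type, as assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed copy of the enumerated values for input/@type in XHTML 1.0.
INPUT_TYPES = {
    "text", "password", "checkbox", "radio", "submit",
    "reset", "file", "hidden", "image", "button",
}


def has_valid_type(markup: str) -> bool:
    """Compare the type attribute with the enumerated values, case-sensitively."""
    element = ET.fromstring(markup)
    return element.get("type") in INPUT_TYPES


print(has_valid_type('<input type="text"/>'))
print(has_valid_type('<input type="TEXT"/>'))
```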

3.5  Entity references as hex values

SGML and XML both permit references to characters by using hexadecimal values. In SGML these references could be made using either &#Xnn; or &#xnn;. In XML documents, you must use the lower-case version (i.e. &#xnn;).
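
A quick check of both spellings, sketched with Python's standard xml.etree.ElementTree module. Only the x has to be lower case; the hexadecimal digits themselves may be written in either case.

```python
import xml.etree.ElementTree as ET

# Lower-case x: the references are replaced by the characters they name.
lower = ET.fromstring("<p>&#x41;&#xE9;</p>")
print(lower.text)

# Upper-case X: the reference is not well-formed XML.
try:
    ET.fromstring("<p>&#X41;</p>")
    upper_ok = True
except ET.ParseError:
    upper_ok = False
print(upper_ok)
```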

