Why CDATA Matters in XML

You’ve seen it before, but you may not know what it means. Wikipedia describes CDATA as meaning “Character Data”, which makes sense. W3Schools goes one step further and points out that this text should not be parsed by the XML parser. The general idea is that when you want to include straight textual data, without encoding characters or having them interpreted by the parser, you can just wrap that data inside a CDATA section.

Needless to say, this is clearly the ugliest tag currently in existence (let’s leave room for the future though):

<![CDATA[ ... ]]>

I promised to tell you why it mattered

Yes, words in the title become a promise. You can hold me to that in the future. Why does CDATA matter? Well, I’ll actually side-step the question for a minute and show you what looks like a perfectly fine XHTML document (keep in mind that XHTML is HTML reformulated as an application of XML):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
  <title>Simple Example</title>
</head>
<body>
  <h1>Welcome</h1>
  <script type="text/javascript">
    document.write('<p>Hello, World</p>');
  </script>
</body>
</html>

Looks simple enough. We can ignore the fact that it’s not really the best way of doing things, but who cares… it’s a Hello World example, right? Well, technically this is not valid XHTML. The validator shows we’ve got a single error:

Line 11, Column 20: document type does not allow element "p" here.
document.write('<p>Hello, World</p>');

The element named above was found in a context where it is not allowed. This could mean that you have incorrectly nested elements — such as a “style” element in the “body” section instead of inside “head” — or two elements that overlap (which is not allowed).

Well, the validator hints at the cause of the error, but it is hard to understand unless you really know your XML! We are inside a “<script>” tag, we’ve got some text, and all of a sudden a “<p>” pops up! An XML parser doesn’t care that it’s in the middle of a JavaScript string; it sees another tag, and that tag doesn’t make sense there.

So, here is why CDATA matters. You want your JavaScript to be left alone by the XML parser. In HTML, the contents of a script element are already treated as plain text; in XHTML, you get the same treatment by wrapping the script in a CDATA section:

<script type="text/javascript">
  <![CDATA[
    document.write('<p>Hello, World</p>');
  ]]>
</script>

Don’t get excited yet. Yes, that passes the W3C validator, but the JavaScript fails to run. Why? Well, I actually haven’t got a clue. My guess is that the CDATA markers are not stripped out before the script reaches the JavaScript engine, where they are a syntax error. In any event, let’s check our steps… Is it valid XML? Check. Is it being served as XML? Well, actually I don’t think so.

Here is a nice resource that talks about Understanding HTML, XML, and XHTML. If you haven’t read that article, either read it now or once you’re done here; it’s important, no matter how old it is. To pull a quote:

to really send xhtml, an xhtml page must be served as xml and therefore have one of the following Content-Type’s (text/xml, application/xml, application/xhtml+xml) to a browser.

This is a simple one-liner in PHP. I added the following code to the top of the page and resent it to my browser:

<?php header('Content-type: application/xhtml+xml'); ?>
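If you want to double-check what a given header value means for the browser, here is a minimal JavaScript sketch. It only knows the three content types from the quote above, and the isServedAsXml name is made up for this example:

```javascript
// The three XML content types named in the quote above.
var XML_CONTENT_TYPES = ["text/xml", "application/xml", "application/xhtml+xml"];

// Hypothetical helper: true when a Content-Type header value is one of those types.
function isServedAsXml(contentTypeHeader) {
  // Drop parameters such as "; charset=utf-8" and normalize case before comparing.
  var mimeType = contentTypeHeader.split(";")[0].trim().toLowerCase();
  return XML_CONTENT_TYPES.indexOf(mimeType) !== -1;
}

console.log(isServedAsXml("text/html; charset=utf-8")); // false
console.log(isServedAsXml("application/xhtml+xml"));    // true
```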

Doh. Well, we’ve covered all the bases and it still doesn’t work. It’s valid, it’s sent as XHTML, now everything is left up to the browser and it doesn’t seem to work. If you know why, drop a comment. Again, my suspicion is that the browsers don’t handle XHTML and CDATA completely. However, there is a pretty nice trick that we can use to get this to work and validate (even sent as text/html):

<script type="text/javascript">
  // <![CDATA[
    document.write('<p>Hello, World</p>');
  // ]]>
</script>
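Here is a rough sketch of why the commented version survives both kinds of parser. The helper names are made up for illustration, the regular expression is only a crude stand-in for what a real XML parser does with CDATA, and new Function is used just to check that the text compiles as JavaScript (it never actually calls document.write):

```javascript
// An HTML parser treats <script> content as raw text, so nothing is removed.
function textSeenByHtmlParser(scriptSource) {
  return scriptSource;
}

// An XML parser consumes the CDATA markers and passes along only what is inside.
// (Crude approximation; a real parser handles far more than this.)
function textSeenByXmlParser(scriptSource) {
  return scriptSource.replace(/<!\[CDATA\[([\s\S]*?)\]\]>/g, "$1");
}

// new Function() only compiles the text; the script body is never executed.
function isValidJavaScript(text) {
  try {
    new Function(text);
    return true;
  } catch (e) {
    return false;
  }
}

var bare = "<![CDATA[\n  document.write('<p>Hello, World</p>');\n]]>";
var commented = "// <![CDATA[\n  document.write('<p>Hello, World</p>');\n// ]]>";

console.log(isValidJavaScript(textSeenByHtmlParser(bare)));      // false
console.log(isValidJavaScript(textSeenByXmlParser(bare)));       // true
console.log(isValidJavaScript(textSeenByHtmlParser(commented))); // true
console.log(isValidJavaScript(textSeenByXmlParser(commented)));  // true
```

The only combination that fails to compile is the bare CDATA section handed over untouched, which matches what happens when the page is sent as text/html. Compiling is not the same as running correctly, so this says nothing about what document.write does once the script executes.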

Well, there you go. A 100% valid page that runs in all browsers, that properly tells the XML parser “hey, leave these characters alone”, and it works. The problem is identifying when this is necessary. For most people, the original page, which rendered correctly but didn’t validate, would be enough. Browser developers are watching out for you and working around mistakes in HTML and XHTML. However, that isn’t always the case.

Real World Example

Here I’ll pull a real world example. Some XML specifications allow you to send XML under a different namespace as content inside an existing XML tag. Some do so in a “pseudo” way. Take a look at the Atom Publishing Protocol (commonly referred to as AtomPub or APP for short). Here is a snippet from the RFC describing the Atom Syndication Format (RFC 4287), specifically the structure for an atom:title tag with type="html" within an atom:entry:

...
<title type="html">
  Less: &lt;em> &amp;lt; &lt;/em>
</title>
...

If the value of "type" is "html", the content of the Text construct MUST NOT contain child elements and SHOULD be suitable for handling as HTML [HTML]. Any markup within MUST be escaped; for example, "<br>" as "&lt;br>". HTML markup within SHOULD be such that it could validly appear directly within an HTML <DIV> element, after unescaping. Atom Processors that display such content MAY use that markup to aid in its display.

Okay, sorry for the long setup, but we have finally arrived at the point of this post. That type="html" element cannot have child elements. The XML parser will identify child elements based on a “<” character. Assuming whatever project you are working on takes that input from the user, you would have to pass it through a filter, encoding HTML characters like ampersands, less-than and greater-than signs, the list goes on. That operation adds work and may even cause problems in itself. I ran into a situation just the other day where an ampersand belonging to an already encoded character (like the & inside of an &amp;) was causing errors by itself. The solution is to make the XML parser ignore the data by wrapping it in a CDATA section. Let’s take the above example and show how it could be done much more easily:

<title type="html">
  <![CDATA[ Less: <em> &lt; </em> ]]>
</title>

Easier to understand? You betcha. Less costly for developers? Of course. So CDATA is there to help, not hurt. Don’t look at its ugly face and think of it as a hack; look deeper and you will see its purpose and power. Okay, I admit that sounds a little corny, but it could have been worse.
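For comparison, here is a minimal sketch of both approaches in JavaScript. The function names are hypothetical. One catch worth knowing: a CDATA section cannot contain the sequence ]]>, so the sketch splits that sequence across two adjacent sections:

```javascript
// Escape the characters an XML parser treats as markup in text content.
function escapeXmlText(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so the entities added below are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap text in a CDATA section, splitting any "]]>" so it cannot end the section early.
function toCdataSection(text) {
  return "<![CDATA[" + text.replace(/\]\]>/g, "]]]]><![CDATA[>") + "]]>";
}

var html = "Less: <em> &lt; </em>";

console.log('<title type="html">' + escapeXmlText(html) + "</title>");
// <title type="html">Less: &lt;em&gt; &amp;lt; &lt;/em&gt;</title>

console.log('<title type="html">' + toCdataSection(html) + "</title>");
// <title type="html"><![CDATA[Less: <em> &lt; </em>]]></title>
```

Either string gives the XML parser the same text content. The CDATA version just keeps the original markup readable in the source.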

Side note, JavaScript

As managers everywhere throw out buzzwords like AJAX and encourage you to participate in new Web 2.0 project ideas, you’re going to end up sending and receiving XML with a server using the good old XMLHttpRequest object. Well, if encoding isn’t enough of a problem (and I’m still wrapping my head around it), you might get struck with a problem like the above case and want to make use of your knowledge of CDATA.

Well, you’re in luck. xmlDocument.createCDATASection(…) is part of the Level 2 DOM ECMAScript Spec. Use it just like createTextNode():

//
// Create an atom:title element with html content
// assume xmlDocument is already an XML Document object
// and entry is an atom:entry element in that document
//
// <entry>
//   <title type="html"><![CDATA[<em> &lt; </em>]]></title>
// </entry>
//
var ATOM_NS = "http://www.w3.org/2005/Atom";
var node = xmlDocument.createElementNS(ATOM_NS, "title");
node.setAttribute("type", "html");
var cdata = xmlDocument.createCDATASection("<em> &lt; </em>");
node.appendChild(cdata);
entry.appendChild(node);

Now all I have to learn is encoding, and how each browser deals with it differently. That is an entirely new realm that I don’t expect to cover in a single week, but I’ll report back with my findings. Until then, don’t get caught up on the little things!

2 Responses

1

Ricardo Martins on September 7, 2008 at 5:45 pm  #

“Its valid, its sent as xhtml, now everything is left up to the browser and it doesn’t seem to work. If you know why, drop a comment. Again my suspicion is that the browsers don’t properly handle XHTML and CDATA completely.”

Actually, it’s because document.write isn’t allowed by XML parsers. Since you’re sending the XHTML with the proper MIME type, browsers will use the XML parser and ignore the document.write. Ian Hickson (Hixie) explains it in detail[1].

[1] http://ln.hixie.ch/?count=1&start=1091626816

2

Joseph Pecoraro on September 7, 2008 at 5:58 pm  #

@Ricardo: Okay! That’s good to know. That shows how little testing and research I did. Thanks for the heads up and the link!
