Is it "bad practice" to be sensitive to linebreaks in XML documents?


Question

I'm generating some XML documents and when it comes to the address part I have fragments that look like this:

<Address>15 Sample St
Example Bay
Some Country</Address>

The XSLT that I have for converting this to XHTML has some funky recursive template to convert newline characters within strings to <br/> tags.

This is all working fine; but is it considered "bad practice" to rely on linebreaks within XML documents? If so, is it recommended that I do this instead?

<Address><Line>15 Sample St</Line>
<Line>Example Bay</Line>
<Line>Some Country</Line></Address>

It seems like it'd be really awkward to wrap every place where my text might span multiple lines in tags like that.
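
For comparison, the newline-to-<br/> conversion described in the question can also be sketched outside XSLT. This is only a minimal Python sketch using the standard library, not the original stylesheet; the function name `address_to_xhtml` and the `<p>` wrapper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def address_to_xhtml(address_xml: str) -> str:
    """Turn newline-separated text in <Address> into <br/>-separated XHTML."""
    address = ET.fromstring(address_xml)
    lines = (address.text or "").split("\n")

    # Hypothetical wrapper element; the real stylesheet may emit something else.
    p = ET.Element("p")
    p.text = lines[0]
    for line in lines[1:]:
        # Each later line follows a <br/>, so it is stored as that element's tail.
        br = ET.SubElement(p, "br")
        br.tail = line
    return ET.tostring(p, encoding="unicode")


result = address_to_xhtml(
    "<Address>15 Sample St\nExample Bay\nSome Country</Address>"
)
print(result)
```

Building elements rather than concatenating strings means any `&` or `<` in the address text is escaped on output.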

2008/09/26

Accepted Answer

It's generally considered bad practice to rely on linebreaks, since it's a fragile way to differentiate data. While most XML processors will preserve any whitespace you put in your XML, it's not guaranteed.

The real problem is that most applications that output your XML in a readable format treat all whitespace in an XML document as interchangeable, and might collapse those linebreaks into a single space. That's why your XSLT has to jump through such hoops to render the data properly. Using a "br" tag would vastly simplify the transform.

Another potential problem is that if you open up your XML document in an XML editor and pretty-print it, you're likely to lose those line breaks.

If you do keep using linebreaks, make sure to add an xml:space="preserve" attribute to the Address element. (You can do this in your DTD, if you're using one.)
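
As a rough illustration of how that attribute appears to a consumer, here is a minimal Python sketch using the standard-library ElementTree parser. The attribute only signals intent to downstream applications; this particular parser keeps line breaks in text content either way. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_NS = "http://www.w3.org/XML/1998/namespace"

doc = """<Address xml:space="preserve">15 Sample St
Example Bay
Some Country</Address>"""

address = ET.fromstring(doc)

# ElementTree reports xml:space under its namespace-qualified name.
space_mode = address.get(f"{{{XML_NS}}}space")

# The line breaks are still present in the element's text.
lines = address.text.split("\n")

print(space_mode)
print(lines)
```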

Some suggested reading

XML applications often seem to take a cavalier attitude toward whitespace because the rules about the places in an XML document where whitespace doesn't matter sometimes give these applications free rein to add or remove whitespace in certain places.

2008/08/10

What about using attributes to store the data, rather than text nodes:

<Address Street="15 Sample St" City="Example Bay" State="" Country="Some Country"/>

I know the use of attributes vs. text nodes is an often debated subject, but I've stuck with attributes 95% of the time, and haven't had any troubles because of it.
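
A minimal Python sketch of reading that attribute form follows, using the standard library only. One caveat worth knowing if you go this route: a literal line break inside an attribute value is replaced by a space during parsing (attribute-value normalization), so each line needs its own attribute. The "Unit 2" value is a made-up example.

```python
import xml.etree.ElementTree as ET

doc = (
    '<Address Street="15 Sample St" City="Example Bay" '
    'State="" Country="Some Country"/>'
)
address = ET.fromstring(doc)

# Keep the fields in display order and skip any that are empty (State here).
fields = ["Street", "City", "State", "Country"]
lines = [address.get(name) for name in fields if address.get(name)]
print(lines)

# A literal newline in an attribute value does not survive parsing:
# the parser normalizes it to a single space.
multi = ET.fromstring('<Address Street="15 Sample St\nUnit 2"/>')
street = multi.get("Street")
print(repr(street))
```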

2008/08/10

A few people have said that CDATA blocks will allow you to retain line breaks. This is wrong. CDATA sections only cause markup to be processed as character data; they do not change line break processing.

<Address>15 Sample St
Example Bay
Some Country</Address>

is exactly the same as

<Address><![CDATA[15 Sample St
Example Bay
Some Country]]></Address>

The only difference is how different APIs report this.
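
A quick way to check this is to parse both forms and compare the text. Here is a minimal sketch with Python's standard-library ElementTree, which happens to report the two forms identically; other APIs may expose a separate CDATA node type.

```python
import xml.etree.ElementTree as ET

plain = ET.fromstring(
    "<Address>15 Sample St\nExample Bay\nSome Country</Address>"
)
cdata = ET.fromstring(
    "<Address><![CDATA[15 Sample St\nExample Bay\nSome Country]]></Address>"
)

# Both parse to the same character data, line breaks included.
same_text = plain.text == cdata.text
print(same_text)
```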

2008/08/24

I think the only real problem is that it makes the XML harder to read, e.g.:

<Something>
    <Contains>
        <An>
            <Address>15 Sample St
Example Bay
Some Country</Address>
        </An>
    </Contains>
</Something>

If pretty XML isn't a concern, I'd probably not worry about it, so long as it's working. If pretty XML is a concern, I'd convert the explicit newlines into <br /> tags or \n before embedding them in the XML.

2008/08/10

It depends on how you're reading and writing the XML.

If the XML is being generated automatically - if newlines or explicit \n flags are being parsed into <br /> - then there's nothing to worry about. Your input likely doesn't have any other XML in it, so it's just cleaner to not mess with XML at all.

If tags are being worked with manually, it's still cleaner to just have a line break, if you ask me.

The exception is if you're using DOM to get some structure out of the XML. In that case line breaks are obviously evil because they don't represent the hierarchy properly. It sounds like the hierarchy is irrelevant for your application, though, so line breaks sound sufficient.

If the XML just looks bad (especially when automatically generated), Tidy can help, although it works better with HTML than with XML.

2008/08/10

This is probably a somewhat deceptive example, since the address is a bit non-normalized in this case. It is a reasonable trade-off, however, since address fields are difficult to normalize. If you make the line breaks carry important information, you're un-normalizing and making the post office interpret the meaning of the line break.

I would say that normally this is not a big problem, but in this case I think the Line tag is most correct, since it explicitly shows that you don't actually interpret what the lines may mean in different cultures. (Remember that most forms for entering an address have a zip code field, etc., plus address lines 1 and 2.)
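
To show how little work the Line form needs on the consuming side, here is a minimal Python sketch using only the standard library. It is an illustration rather than a replacement for the XSLT; the variable names and the plain string join are assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

doc = """<Address><Line>15 Sample St</Line>
<Line>Example Bay</Line>
<Line>Some Country</Line></Address>"""

address = ET.fromstring(doc)

# findall() returns only the <Line> children, so the whitespace between
# them (including the line breaks in the source) never matters.
lines = [line.text or "" for line in address.findall("Line")]

# Escape each line before joining so '&' or '<' in an address stays valid XHTML.
xhtml = "<br/>".join(escape(line) for line in lines)
print(xhtml)
```

Because the structure is carried by elements, pretty-printing or re-indenting the document cannot change the result.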

The awkwardness of the Line tag comes with XML in general, and has been much debated at Coding Horror: http://www.codinghorror.com/blog/archives/001139.html

2008/08/11

Source: https://stackoverflow.com/questions/7277
Licensed under CC-BY-SA with attribution
Not affiliated with Stack Overflow