Encoding Source in XML
A Sample

Eric Armstrong
6 Oct '00

Overview

This document presents a sample Java program that exercises some of the interesting problem areas described in Encoding Source in XML: A Strategic Analysis. It shows the results produced by simplistic encoding/decoding strategies, and suggests the results achievable with more sophisticated mechanisms. It also gives the DTD for the XML, and shows how the code would appear in an XML-based "outlining" utility. [4,800 words]

Note:
To make the most sense out of this document, you'll want to print two copies, or else view it in two windows, so that you can compare listings side by side.

Sample Program

Here is the sample program:

package myPkg;

// Import statements
import java.io.*;
import java.util.*;

// Local variables
String myVar;
String myInt;

/**
 * An API comment with HTML.
 * <p>
 * With a list:
 * <ul>
 *   <li>Item 1
 *   <li>Item 2
 * </ul>
 *
 * @author  eric armstrong
 * @version 1.0
 */
public class MyClass extends SomeClass
implements Serializable
{
  /**
   * The constructor.
   *
   * @param foo one arg
   * @param bar 2nd arg
   * @see #method1
   */

  public MyClass(int foo, String bar) {
    if (foo == 1) {
      // Do one thing
      myVar = bar;
    } else if (foo == 2) {
      // Do second thing
      myVar = Integer.parseint(foo);
    }
    myInt = foo;
  }

  /*
   * A normal comment that explains how
   * the method works:
   *   a. It does one thing.
   *   b. It does another.
   */
  String method1() {
    /* Commented out code
    if (myVar == "") {
      System.out.println("Unexpected error");
      System.out.println("  --No value");
    }
    */
    if (myInt >= 0
    && myInt < 5) {
      // This conditional includes && and < symbols
      // And it uses my favorite line breaks
      return "fly";
    }
    return myVar;
  }

  // A comment that contains multiple lines
  // of text, to explain how the method works.
  //   a. It does one thing.
  //   b. It does another.
  String method2() {
    // Another way of commenting out code
    // if (myVar == "") {
    //   System.out.println("Unexpected error");
    //   System.out.println("  --No value");
    // }
    return myVar;
  }
}

Input Conversion

Here is the result of the simplest possible conversion into XML. Notes are indicated in parentheses, in bold. The original text is bold to make it easier to spot. (The file is sample.xml. The DTD is xmlsource.dtd.)

<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE xmlsource SYSTEM "xmlsource.dtd">

<xmlsource>
<node><content line="1">package myPkg</content></node>                        (1)
<node><content line="3">// Import statements</content></node>
<node><content line="4">import java.io.*</content></node>
<node><content line="5">import java.util.*</content></node>
<node><content line="7">// Local variables</content></node>
<node><content line="8">String myVar</content></node>
<node><content line="9">String myInt</content></node>
<node><content line="11"><![CDATA[/**                                         (2)
  An API comment with HTML.                                                  (3) (4)
  <p>
  With a list:
  <ul>
    <li>Item 1
    <li>Item 2
  </ul>

  @author  eric armstrong
  @version 1.0
]]></content></node>
<node><content line="23"><![CDATA[public class MyClass extends SomeClass      (5)
  implements Serializable]]></content>
  <node><content line="26"><![CDATA[/**                                       (6)
    The constructor.

    @param foo one arg
    @param bar 2nd arg
    @see #method1
  ]]></content></node>
  <node><content line="34">public MyClass(int foo, String bar)</content>
    <node><content line="35">if (foo == 1)</content>
      <node><content line="36">// Do one thing</content></node>              (7)
      <node><content line="37">myVar = bar</content></node>
    </node> <!--(if)-->
    <node><content line="38">else if (foo == 2)</content>                     (8)
      <node><content line="39">// Do second thing</content></node>
      <node><content line="40">myVar = Integer.parseint(foo)</content></node>
    </node> <!--(else)-->
    <node><content line="42">myInt = foo</content></node>
  </node> <!--(Constructor)-->
  <node><content line="45"><![CDATA[/*                                        (6)
    A normal comment that explains how
    the method works:
      a. It does one thing.
      b. It does another.
  ]]></content></node>
  <node><content line="51">String method1()</content>
    <node><content line="52">/* Commented out code</content>
      <node><content line="53">if (myVar == "")</content>
        <node><content line="54">System.out.println("Unexpected error")</content></node>
        <node><content line="55">System.out.println("  --No value")</content></node>
      </node> <!--(if)-->
    </node> <!--(comment)-->
    <node><content line="58"><![CDATA[                                        (6)
      if (myInt >= 0
      && myInt < 5)
    ]]></content>
      <node><content line="60"><![CDATA[//                                    (6)
        This conditional includes && and < symbols
        And it uses my favorite line breaks
      ]]></content></node>
      <node><content line="62">return "fly"</content></node>
    </node> <!--(if)-->
    <node><content line="64">return myVar</content></node>
  </node> <!--(method1)-->
  <node><content line="67"><![CDATA[// A comment that contains multiple lines
    of text, to explain how the method works.
    a. It does one thing.
    b. It does another.
  ]]></content></node>
  <node><content line="71">String method2()</content>
    <node><content line="72"><![CDATA[// Another way of commenting out code
      if (myVar == "") {                                                     (9)
        System.out.println("Unexpected error");
        System.out.println("  --No value");
      }
    ]]></content></node>
    <node><content line="77">return myVar</content></node>
  </node> <!--(method2)-->
</node> <!--(MyClass)-->
</xmlsource>

Notes:

  1. The semi-colons are not necessary in the XML, since the extent of each node's contents is clearly demarcated. Also: The line numbers are not needed if the compiler can process XML, but are necessary if a standard compiler is to be used.

    The editor has several choices for handling line numbers, none of which are wonderful. One choice is to keep all line numbers up to date as the file is edited. Ugly. Another is to only update them when converting to text. But that means the "goto line" function is useless much of the time. The third option is for the "goto line" function to invoke the conversion-to-text routines, discarding the results but updating line numbers as it goes. That is probably the only viable option. For greater efficiency, the editor could keep a "dirty" flag on each node. It can then start at the topmost level, skipping over whole sections as long as they haven't changed. But, given a compiler that processes XML, this whole mess can be avoided.
     
  2. A CDATA section is needed for any entry that has a NL, <, or & character. I've added a NL after the start-comment marker, too, whenever the comment spans multiple lines. That makes the XML version more readable, but it requires the editor to undo that action for the best display. (More on that later. Odds are it won't make sense to add those NLs, which means it will be pretty well impossible to use the XML version without an XML editor.)
     
  3. At this stage, the XML is going to be horribly unreadable very quickly, without an XML-based editor to make sense of it. To make it easier to read, I'm indenting succeeding lines of the text. In actual practice, though, those lines would be flush left. [Note: It's easy to conclude that "plain text" is therefore superior, because it is more "readable". But reading XML without an XML editor is like looking at a plain text file with a binary editor, or a word processing document with a normal editor -- when the display tool doesn't match the encoding, the results are less than readable.]
     
  4. If you imagine the file displayed in a hierarchical XML editor, like an outliner, successive lines are displayed with the indentation shown here, and the result is more useful. However, the simplistic comment-becomes-a-CDATA-section encoding used here still impairs the utility of such editors. Imagine if your directory tree had a directory with a 12-line name! The sense of hierarchy would diminish considerably.

    There are two solutions. One is to build hierarchical editors that are capable of displaying only the first line of each entry. That is a useful tool in a variety of situations. (Doug Engelbart's NLS system provided that capability in 1968, and many outliners built in the 80's did so, as well.) The other alternative is to do a better job of encoding such comments in XML. There are a couple of options, which we'll explore later.
     
  5. This entry has a NL, so it gets a CDATA section.
     
  6. Again, I've added a NL character for readability, so the XML-version is easier to compare to the original source. It's tempting to prescribe that it should be there. But once again, that precludes the use of a generic editor.
     
  7. Since we are parsing the file, we know that the if statement is contained in the constructor, so we can create the appropriate hierarchy. Note that the comment for a block of code could contain the code under it -- but the conversion program probably does not want to be responsible for doing that. Here, we're assuming a simplistic conversion. In the next section, we'll see what the XML looks like after editing to produce more aggressive indenting.
     
  8. As noted, the else-clause needs to be treated separately from the if-statement it is part of. That turns out to be nice when there are multiple else-ifs -- they can easily be reordered in a hierarchical editor. Note that the closing brace that starts this line (} else ...) must be removed. The editor would probably need some indication that tells it to preserve that style when converting back to text. On the other hand, The Elements of Java Style (one great book!) sets the standard as having the } and the else on different lines. (It does the same with respect to } catch, and similar instances.) With that style, the issues surrounding a leading close-brace disappear. (Although I was vacillating before, I am now totally sold on that standard as a result.)
     
  9. Here is the one place where braces and semi-colons occur in the XML, because they are part of the comment.
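
The CDATA rule from note 2 is mechanical enough to sketch in Java. This is only an illustration, not the converter itself: the class name ContentEncoder and its methods are hypothetical, and the extra check for a literal ]]> is an added assumption (the note mentions only NL, <, and &).

```java
final class ContentEncoder {

    /** True when the text holds a newline, '<', or '&' (the rule in note 2). */
    static boolean needsCdata(String text) {
        return text.indexOf('\n') >= 0
            || text.indexOf('<') >= 0
            || text.indexOf('&') >= 0
            || text.contains("]]>"); // assumption: "]]>" is not legal in plain XML text either
    }

    /** Builds the <content> element for one source entry. */
    static String encodeContent(int line, String text) {
        String body = text;
        if (needsCdata(text)) {
            // A literal "]]>" would end the section early, so split it across two sections.
            body = "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
        }
        return "<content line=\"" + line + "\">" + body + "</content>";
    }

    public static void main(String[] args) {
        System.out.println(encodeContent(1, "package myPkg"));
        System.out.println(encodeContent(58, "if (myInt >= 0\n&& myInt < 5)"));
    }
}
```

Entries such as package myPkg pass through untouched, while the two-line conditional is wrapped because it contains a NL, an &, and a <.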

The Hierarchy-Editing View

In an XML-based hierarchy editor that understands the node/content pairing (as described in Design Notes for an XML Editor), the sample source code would appear like this, where [+] indicates a node with children and [-] indicates a node that has none:

[-] package myPkg
[-] // Import statements 
[-] import java.io.*
[-] import java.util.*
[-] // Local variables   (1)
[-] String myVar
[-] String myInt
[-] /** An API comment with HTML. (2)
    <p>
    With a list:
    <ul>
      <li>Item 1
      <li>Item 2
    </ul>
   
    @author  eric armstrong
    @version 1.0
[+] public class MyClass extends SomeClass
    implements Serializable
    [-] /** The constructor.
   
        @param foo one arg
        @param bar 2nd arg
        @see #method1
    [+] public MyClass(int foo, String bar)
        [+] if (foo == 1)
            [-] // Do one thing
            [-] myVar = bar
        [+] else if (foo == 2)
            [-] // Do second thing
            [-] myVar = Integer.parseint(foo)
        [-] myInt = foo
    [-] /* A normal comment that explains how
        the method works:
          a. It does one thing.
          b. It does another.
    [+] String method1()
        [+] /* Commented out code
            [+] if (myVar == "")
                [-] System.out.println("Unexpected error")
                [-] System.out.println("  --No value")
        [+] if (myInt >= 0 
            && myInt < 5)
            [-] // This conditional includes && and < symbols 
                And it uses my favorite line breaks
            [-] return "fly"  
        [-] return myVar 
    [-] // A comment that contains multiple lines
        of text, to explain how the method works.
        a. It does one thing.
        b. It does another.
    [+] String method2()
        [-] // Another way of commenting out code
            if (myVar == "") {
              System.out.println("Unexpected error");
              System.out.println("  --No value");
            }
        [-] return myVar

Notes

  1. Color-coding would make a huge difference in the appearance of this text. It's almost impossible to think of editing any other way these days, even with a plain-text editor. Having comments in a different color from source code makes a huge difference in readability. When you combine the advantages of coloring with the ability to collapse, expand and easily rearrange outline entries, the result is a terrific editing experience. (And it will get better when we add more hierarchy in the next section.)
     
  2. Here's how the comment should appear in a generic editor. So putting a NL after the comment-start marker won't work. If the XML has a NL after the comment mark, then the editor has to know to remove it. That makes it more difficult to edit the source with a generic XML-editor. But the idea is to take advantage of the growing market for XML-aware editors, so we don't want to force the developer to use an editor that builds in any explicit understanding of our format.

Given that in the XML representation of the source code, comment-continuation lines will be flush left, and that the first line will start far over to the right (since adding a NL is inadvisable), it seems clear that the XML is going to be pretty well unreadable, when viewed in a plain text editor. It is so ugly, in fact, that the concept begins to seem like an unfortunate idea. However, it is worth recalling that in the days when text editors were first coming into existence, readable text would have looked very weird in a binary editor that showed 40x20 blocks of characters. (Looking at this HTML document in a plain text editor isn't all that much fun, either.)

Finally, note that the structure shown can only be displayed by an editor that understands the need to hide the <node> elements and show only the <content> elements. If it weren't for the need to introduce the <content> elements, the displayed structure could be presented by any XML editor. XML requires those <content> elements, though, as described in Shortcomings of XML for Source Encoding, so the editor has to understand that <node> elements should remain invisible in order to produce the correct display.
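
To make the node/content pairing concrete, here is a minimal Java sketch of how a viewer might turn the XML into the [+]/[-] outline shown above. It uses the standard DOM parser that ships with Java. The class name OutlineView, the four-space indent, and the decision to trim each content line are assumptions made for the example, not a description of any existing editor.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class OutlineView {

    /** Parses the XML and returns the outline, hiding <node> and showing <content>. */
    static String render(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            renderChildren(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the XML", e);
        }
    }

    private static void renderChildren(Element parent, int depth, StringBuilder out) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && "node".equals(n.getNodeName())) {
                renderNode((Element) n, depth, out);
            }
        }
    }

    private static void renderNode(Element node, int depth, StringBuilder out) {
        String content = "";
        boolean hasChildren = false;
        for (Node n = node.getFirstChild(); n != null; n = n.getNextSibling()) {
            if ("content".equals(n.getNodeName())) {
                content = n.getTextContent().trim();
            } else if ("node".equals(n.getNodeName())) {
                hasChildren = true;
            }
        }
        String indent = spaces(depth * 4);
        String[] lines = content.split("\n");
        // [+] marks a node with children, [-] a node with none.
        out.append(indent).append(hasChildren ? "[+] " : "[-] ").append(lines[0].trim()).append('\n');
        // Continuation lines of a multi-line entry line up under the first line's text.
        for (int i = 1; i < lines.length; i++) {
            out.append(indent).append("    ").append(lines[i].trim()).append('\n');
        }
        renderChildren(node, depth + 1, out);
    }

    private static String spaces(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(' ');
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<xmlsource><node><content>if (foo == 1)</content>"
                + "<node><content>// Do one thing</content></node>"
                + "<node><content>myVar = bar</content></node></node></xmlsource>";
        System.out.print(render(xml));
    }
}
```

The only format-specific knowledge in the sketch is the pair of element names, which is exactly the understanding a generic XML editor lacks.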

Adding More Hierarchical Nesting

At the moment, that version of the file is a lot less readable than the plain text version, for two reasons. First, the comments are taking up too much space in this display. Second, we have lost many vertical whitespace separators, without gaining a lot of benefit from the hierarchical capabilities of XML.

We can fix both those problems by adding more hierarchical nesting. Here is what the XML might look like after some editing to take advantage of XML's tree structure. (It might also be possible to generate this structure with a very intelligent input converter):

[-] package myPkg
[+] // Import statements (1)
    [-] import java.io.*
    [-] import java.util.*
[+] // Local variables   (2)
    [-] String myVar
    [-] String myInt
[+] /** An API comment with HTML.             (1)
    [-] <p>                                   (2)
    [+] With a list:<ul>
        [-] <li>Item 1
        [-] <li>Item 2
        [-] </ul> 
    [+] <!--tags-->  
        [-] @author  eric armstrong
        [-] @version 1.0
[+] public class MyClass extends SomeClass    
    implements Serializable
    [+] /** The constructor.
        [+] <!--tags-->  
            [-] @param foo one arg
            [-] @param bar 2nd arg
            [-] @see #method1
    [+] public MyClass(int foo, String bar)
        [+] if (foo == 1)
            [-] // Do one thing
            [-] myVar = bar
        [+] else if (foo == 2)
            [-] // Do second thing
            [-] myVar = Integer.parseint(foo)
        [-] myInt = foo
    [+] /* A normal comment that explains how
        the method works:
        [-] a. It does one thing.
        [-] b. It does another.
    [+] String method1()
        [+] /* Commented out code
            [+] if (myVar == "")
                [-] System.out.println("Unexpected error")
                [-] System.out.println("  --No value")
        [+] if (myInt >= 0 
            && myInt < 5)
            [-] // This conditional includes && and < symbols 
                And it uses my favorite line breaks
            [-] return "fly"  
        [-] return myVar 
    [-] // A comment that contains multiple lines   (3)
        of text, to explain how the method works.
         a. It does one thing.
         b. It does another.
    [+] String method2()
        [-] // Another way of commenting out code
            if (myVar == "") {
              System.out.println("Unexpected error");
              System.out.println("  --No value");
            }
        [-] return myVar

Notes

  1. Note that the comment-headers and the method-headers are now side by side. You can expand the one you are interested in, or expand both if you really want to look at them. But comments and methods you are not interested in can be hidden.
     
  2. Since everything under a /* or /** heading is commented out, the comment-entries can be indented any way you see fit.
     
  3. The // comment ends at the end of the node. Entries under it are not automatically commented out. So this comment either needs to remain as is, or be converted to a /* comment so that the explanatory list can be turned into a list of subnodes.

Examining the Benefits of Hierarchy

At this point, it is worth noting that this example doesn't even begin to demonstrate the real advantages of hierarchical structuring. Let's take a quick look.

Collapse/Expand, Drag and Drop

In the first place, you're looking at it on the printed page or in an HTML browser, rather than in a hierarchical editor. So you have to imagine that you are viewing the code in the equivalent of your directory browser. Imagine having the ability to collapse and expand. Imagine being able to see all the top-most entries at the same time, on one screen. You can then expand the parts you want to see, and drag things to new locations easily. Then you begin to get the idea.

Sectioning

But more importantly, this is a small example that focuses on converting a typical example of existing code. Because it is small, it doesn't show how the hierarchical structure makes larger sections more manageable. Suppose the class implements several interfaces, for example. In plain source, one typically creates a "block header" like this:

//========================================================
// AnInterface
//========================================================

The methods that implement that interface then follow that header, and a new block header is created for a different interface, or another logical "block" of code. But in a hierarchical editor, all of those methods can be tucked away under the interface header. It is then easy to view the list of interface headers, and expand the one you want.

The larger the class gets, the more valuable this feature becomes. (Although it is desirable to make small classes, it is frequently impractical to do so. A hierarchical structure does the next best thing -- it makes the class seem smaller.)

Literate Style

Putting method definitions under a comment-heading is one step towards a more literate coding style, a concept pioneered by Donald Knuth. Here is an example of an if statement coded in such a style:

[+] // If we are in the viable range, return the value
[+] // Otherwise, return a placeholder

That's the collapsed version. Here's the expanded version:

[+] // If we are in the viable range, return the value
    [+] if (a >= 0 && a < 5)
        [-] int i = indexArray[a]
        [-] return value[i]
[+] // Otherwise, return a placeholder
    [+] else
        [-] return "*****"
Of course, this example doesn't rise to the level of Knuth's concept, where you can invoke a function by giving its literate name. But it begins to show how the program's detail can be condensed out of sight, so that it "reads". It is then easy to acquire a high-level understanding of the code, because a high-level view is available.

Creating Readable Text

The lack of vertical white space is most apparent when a file is printed out and displayed as a whole, as above. When editing in a small window, the indentation and ability to collapse and expand provide sufficient clues to the logical structure that the missing vertical whitespace goes unnoticed. In fact, it becomes easier to manipulate the file, because you spend less time scrolling.

When printed, a "Smart Spacing" heuristic can be used to provide the logical breaks that improve readability. With that heuristic (first defined in the StreamLine outliner in the mid-80's), an extra line is added whenever the next line is "outdented" relative to the current line -- that is, whenever the next item represents an outer outline level.
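
The rule is simple enough to sketch in a few lines of Java. This is an illustration of the heuristic as stated, not StreamLine's code: the class name SmartSpacing and the parallel-array input (one depth and one text per outline entry) are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

final class SmartSpacing {

    /**
     * Prints outline entries with the given indent width, adding a blank
     * line whenever the next entry sits at an outer (smaller) depth.
     */
    static String format(int[] depths, String[] texts, int indentWidth) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            out.add(spaces(depths[i] * indentWidth) + texts[i]);
            boolean nextIsOutdented = i + 1 < texts.length && depths[i + 1] < depths[i];
            if (nextIsOutdented) {
                out.add("");
            }
        }
        return String.join("\n", out);
    }

    private static String spaces(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(' ');
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] depths = {0, 1, 1, 0, 1, 1};
        String[] texts = {
            "// Import statements", "import java.io.*", "import java.util.*",
            "// Local variables", "String myVar", "String myInt"
        };
        System.out.println(format(depths, texts, 2));
    }
}
```

Entries at the same or a deeper level never trigger a break, which is why a comment and the method that follows it at the same level stay together.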

Using that heuristic produces a printed version of the code that looks like this, using a 2-space indent (characters added by the pretty printer are shown in bold):

package myPkg
                                              (1)
// Import statements 
  import java.io.*
  import java.util.*

// Local variables   
  String myVar
  String myInt

/** An API comment with HTML.                  (2)     
 *  <p>                                          
 *  With a list:<ul>
 *    <li>Item 1
 *    <li>Item 2
 *    </ul> 
 *
 *  <!--tags-->  
 *    @author  eric armstrong
 *    @version 1.0

public class MyClass extends SomeClass    
implements Serializable
  /** The constructor.
   *  <!--tags-->  
   *    @param foo one arg
   *    @param bar 2nd arg
   *    @see #method1

  public MyClass(int foo, String bar)
    if (foo == 1)
      // Do one thing
      myVar = bar

    else if (foo == 2)
      // Do second thing
      myVar = Integer.parseint(foo)

    myInt = foo

  /* A normal comment that explains how       (3)
   * the method works:
   *   a. It does one thing.
   *   b. It does another.

  String method1()
    /* Commented out code
     *   if (myVar == "")
     *     System.out.println("Unexpected error")
     *     System.out.println("  --No value")

    if (myInt >= 0 
    && myInt < 5)
      // This conditional includes && and < symbols  (4)
      // And it uses my favorite line breaks
      return "fly"  

    return myVar 

  // A comment that contains multiple lines   
  // of text, to explain how the method works.
  //   a. It does one thing.
  //   b. It does another.
  String method2()
    // Another way of commenting out code
    // if (myVar == "") {
    //   System.out.println("Unexpected error");
    //   System.out.println("  --No value");
    // }
    return myVar

Notes

  1. As long as the pretty-printer understands //, /*, and /**, it might as well understand "package" and add an extra NL after that statement. Just for aesthetics.
     
  2. The leading * characters are added on subsequent lines for readability. But braces and semi-colons don't need to be added unless the output is intended for compilation. (We'll get to that in a moment.)
     
  3. The pretty-printer can compensate for the leading //, /* or /** characters and do the extra-indentation necessary to make the following text line up.
     
  4. Here, the pretty-printer has added the // characters at the beginning of subsequent lines.

Just to spell it out, it's worth noting that while the editor can be a generic XML-editing tool, the process of converting to and from text is necessarily specific to the programming language. On input, the converter has to parse the source code. On output, it needs to add semi-colons and braces it removed (discussed in the next section), and it needs to add the comment-continuation marks shown above. (If you remove them, it becomes difficult to distinguish comments from code, as shown by the comment-sections in method1() and method2().)
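
As a rough sketch of the output side, the following Java adds continuation marks to a multi-line comment entry. The class name CommentMarks, the closeBlock flag, and the exact prefix widths (chosen so the text lines up, as in the listing above) are assumptions for illustration.

```java
final class CommentMarks {

    /**
     * Adds continuation marks to the second and later lines of a comment entry.
     * When closeBlock is true, a closing marker is appended to block comments.
     */
    static String mark(String content, boolean closeBlock) {
        String[] lines = content.split("\n");
        String first = lines[0];
        String prefix;
        boolean block;
        if (first.startsWith("/**")) {
            prefix = " *  ";   // one extra space so text lines up under "/** "
            block = true;
        } else if (first.startsWith("/*")) {
            prefix = " * ";
            block = true;
        } else if (first.startsWith("//")) {
            prefix = "// ";
            block = false;
        } else {
            return content;    // not a comment: leave it alone
        }
        StringBuilder out = new StringBuilder(first);
        for (int i = 1; i < lines.length; i++) {
            out.append('\n').append(prefix).append(lines[i]);
        }
        if (block && closeBlock) {
            out.append("\n */");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mark("// This conditional includes && and < symbols\n"
                + "And it uses my favorite line breaks", false));
        System.out.println(mark("/* A normal comment that explains how\nthe method works:", true));
    }
}
```

For readable output the closing marker can be left off; for compilable output it has to be added.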

Creating Compilable Text

Here's what that code might look like when converted to text for the purpose of compilation (additional characters added at this stage are shown in bold):

package myPkg;
                             
// Import statements 
  import java.io.*;
  import java.util.*;

// Local variables   
  String myVar;
  String myInt;

/** An API comment with HTML.             
 *  <p>                                          
 *  With a list:<ul>
 *    <li>Item 1
 *    <li>Item 2
 *    </ul> 
 *
 *  <!--tags-->  
 *    @author  eric armstrong
 *    @version 1.0
 */
public class MyClass extends SomeClass    
implements Serializable {
  /** The constructor.
   *  <!--tags-->  
   *    @param foo one arg
   *    @param bar 2nd arg
   *    @see #method1
   */
  public MyClass(int foo, String bar) {
    if (foo == 1) {
      // Do one thing
      myVar = bar;
    }
    else if (foo == 2) {
      // Do second thing
      myVar = Integer.parseint(foo);
    }
    myInt = foo;
  }
                                                (1)
  /* A normal comment that explains how       
   * the method works:
   *   a. It does one thing.
   *   b. It does another.
   */
  String method1() {
    /* Commented out code
     *   if (myVar == "")                   (2)
     *       System.out.println("Unexpected error")
     *       System.out.println("  --No value")
     */
    if (myInt >= 0 
    && myInt < 5) {
      // This conditional includes && and < symbols  
      // And it uses my favorite line breaks
      return "fly";  
    }
    return myVar; 
  }

  // A comment that contains multiple lines   
  // of text, to explain how the method works.
  //   a. It does one thing.
  //   b. It does another.
  String method2() {
    // Another way of commenting out code
    // if (myVar == "") {
    //   System.out.println("Unexpected error");
    //   System.out.println("  --No value");
    // }
    return myVar;
  }
}

Notes

  1. When a closing brace is added, to end a class or method, an extra blank line is added, as well. (Extra lines are not added when closing other language statements.)
     
  2. These lines are not recognized as "code", so braces and semicolons are not added. To preserve them, the // comment style would be used, as shown in method2.
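
The brace-and-semicolon step can be sketched as a small recursive emitter. This is a simplified illustration under stated assumptions: the SourceNode class is hypothetical, comments are treated as single-line leaves, any non-comment entry with children is assumed to open a block, and the extra blank line after a class or method (note 1) is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class SourceNode {
    final String text;
    final List<SourceNode> children = new ArrayList<>();

    SourceNode(String text) {
        this.text = text;
    }

    /** Adds a child and returns this node, so calls can be chained. */
    SourceNode add(SourceNode child) {
        children.add(child);
        return this;
    }

    private static boolean isComment(String text) {
        return text.startsWith("//") || text.startsWith("/*");
    }

    private void emit(int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth * 2; i++) indent.append(' ');
        if (isComment(text)) {
            out.append(indent).append(text).append('\n');          // comments get nothing added
        } else if (children.isEmpty()) {
            out.append(indent).append(text).append(";\n");         // leaf statement: add semicolon
        } else {
            out.append(indent).append(text).append(" {\n");        // parent: open a block
            for (SourceNode child : children) {
                child.emit(depth + 1, out);
            }
            out.append(indent).append("}\n");                      // and close it
        }
    }

    String toSource() {
        StringBuilder out = new StringBuilder();
        emit(0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        SourceNode ifNode = new SourceNode("if (foo == 1)")
                .add(new SourceNode("// Do one thing"))
                .add(new SourceNode("myVar = bar"));
        System.out.print(ifNode.toSource());
    }
}
```

Because every block closes on its own line, an else entry that follows comes out with the } and the else on different lines, which matches the style discussed earlier. A method with an empty body would wrongly get a semicolon here, so a real converter would need more than "has children" to decide.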

At this point, the output looks very similar to that espoused by The Elements of Java Style (a highly recommended manual). The only issues are the indentation of the import statements, global variables, and the parameter tags in the documentation comments. We'll take that up next.

Refining the Plain Text Output

The compilable version of the XML hierarchy includes additional indentation of the import statements, global variables, and the parameter tags in the documentation comments. The hierarchical version is much easier to use when that indentation is present, so the indentation needs to be preserved when outputting to text (as long as that text is going to be read back in again, perhaps after having been edited by someone else). On the other hand, preserving the indentation on output still leaves several open issues.

The only way to produce a plain text version that has the indentation characteristics that are normal to plain text, and which allows reconstructing the intelligent nesting of the XML version, is to add nesting indicators. For example, /* start */ and /* end */ flags might demarcate a block of code for the input converter.

Here is an example that shows how it would work for a block of import statements:

// Import statements /* start */
import java.io.*;
import java.util.*;
/* end */

Note that this strategy works for simple structures like import blocks, but may run into difficulties when deeply nested structures are processed. Such structures will have multiple /* end */ markers in a row. Those markers will not only be intrusive when you are reading the plain source, but they will also make it more difficult to preserve the structure during edits. The hierarchy might then be damaged as a result.
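
A minimal Java sketch of how an input converter might read those markers shows both the mechanism and the brittleness. The class name MarkerNesting is hypothetical, and the sketch assumes the start marker always ends its line and the end marker always stands alone.

```java
final class MarkerNesting {
    static final String START = "/* start */";
    static final String END = "/* end */";

    /** Returns the lines re-indented by nesting depth, with the markers removed. */
    static String nest(String source) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (String raw : source.split("\n")) {
            String line = raw.trim();
            if (line.equals(END)) {
                if (depth == 0) {
                    throw new IllegalArgumentException("unmatched " + END);
                }
                depth--;
            } else if (line.endsWith(START)) {
                // The text before the marker becomes the heading of a new level.
                String heading = line.substring(0, line.length() - START.length()).trim();
                append(out, depth, heading);
                depth++;
            } else if (!line.isEmpty()) {
                append(out, depth, line);
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("missing " + END);
        }
        return out.toString();
    }

    private static void append(StringBuilder out, int depth, String text) {
        for (int i = 0; i < depth * 2; i++) out.append(' ');
        out.append(text).append('\n');
    }

    public static void main(String[] args) {
        System.out.print(nest("// Import statements /* start */\n"
                + "import java.io.*;\n"
                + "import java.util.*;\n"
                + "/* end */"));
    }
}
```

A single deleted or duplicated /* end */ line is enough to make the whole file fail to convert, or worse, to convert with the wrong nesting when the counts still happen to balance.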

The resulting system could therefore be fairly brittle, so perhaps the use of /* start */ and /* end */ markers isn't a good idea. But if we throw that idea away, we are left without a mechanism for converting the useful hierarchical nesting in XML into a text form that the plain-text contingent can live with.

Improving Input of /** Comments

As mentioned in the Strategy document, there is more than one way to handle /** comments. In fact, there are three.

The simplest method (used so far in this document) is simply to convert them to CDATA sections. That saves having to deal with the vagaries of HTML tags, but produces nodes that take up a lot of vertical space, which undermines the collapse/expand benefits of having the source code in an XML hierarchy. The only way to compensate for that is to manually edit the file. As can be seen from the examples, the result is rather different from the original text.

A second option is to use a parser that is fully aware of HTML and Javadoc tags. Such a parser could produce an intelligently nested version of the comments (using XHTML) that was contained inside the <content> element. Taking this step requires extending the DTD to recognize XHTML elements. New elements then need to be defined that correspond to Javadoc tags, and those need to be added to the <content> definition as well. (Namespaces should be used to keep the xhtml and javadoc tags fully separated.)

That option requires a good HTML-to-XML converter. Otherwise, the many valid HTML tag sequences that are not well-formed as XML will generate parser errors. For example: <p> is fine by itself in HTML, but in XML it must be either <p></p> or <p/>. Another result of using that option is that code will be changed when converting to text. So the <p> in the example shown might become <p/>. Such conversions could conceivably become problematic downstream when the code was processed by Javadoc.

An intermediate option is to do enough processing to identify levels in the hierarchy, using text indentation and HTML tags as clues, but use standard <node><content> tags to encompass them. Here is one possible result of encoding the class comment with a strategy like that:

<node><content line="11">/**
  An API comment with HTML.</content>
  <node><content line="13"><![CDATA[<p>]]></content></node>
  <node><content line="14">With a list:</content></node>
  <node><content line="15"><![CDATA[<ul>]]></content>                  (1)
    <node><content line="16"><![CDATA[<li>Item 1]]></content></node>
    <node><content line="17"><![CDATA[<li>Item 2]]></content></node>
    <node><content line="18"><![CDATA[</ul>]]></content></node>        (2)
  </node> <!--(ul)-->
  <node><content line="20">@author  eric armstrong</content></node>   (3)
  <node><content line="21">@version 1.0</content></node>
</node> <!--(/**)-->

Notes:

  1. The <ul> tag could also be combined with the text above it to make an even more reasonable hierarchy.
     
  2. Another difficulty with this strategy is shown here. The </ul> tag occurs in its own node, contained within the <ul> node that initiates it. It's a fairly small issue, though. In return for living with the presence of the <ul> and </ul> tags in the text displayed in the XML tree, you get the advantages of hierarchy (collapse/expand, fast drag) without having to do full HTML parsing.
     
  3. With a bit more processing, the Javadoc tags might also be nested inside a section-heading constructed for that purpose. (An exercise for future activity.)

With that mechanism, plain source could be converted into a fairly reasonable hierarchy without manual editing. It still wouldn't indent the import statements, for example, but it would allow the really long comments to be condensed.
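
Here is a minimal Java sketch of that intermediate strategy, reduced to its simplest form. It uses only the <ul> and </ul> tags as nesting clues (text indentation is ignored), and it prints an indented outline rather than XML. The class name CommentNesting and the two-space indent are assumptions for the example.

```java
final class CommentNesting {

    /** Turns a documentation comment into an indented outline, nesting on <ul>. */
    static String toOutline(String comment) {
        StringBuilder out = new StringBuilder();
        boolean headingDone = false;
        int depth = 1; // everything after the heading nests at least one level
        for (String raw : comment.split("\n")) {
            String line = raw.trim();
            if (line.equals("*/")) {
                continue;                            // closing marker carries no text
            }
            if (line.startsWith("/**")) {
                line = line.substring(3).trim();     // drop the start marker
            } else if (line.startsWith("*")) {
                line = line.substring(1).trim();     // drop the continuation mark
            }
            if (line.isEmpty()) {
                continue;
            }
            if (!headingDone) {
                append(out, 0, "/** " + line);       // first text line becomes the heading
                headingDone = true;
            } else if (line.startsWith("<ul>")) {
                append(out, depth, line);
                depth++;                             // list items nest under the <ul> entry
            } else if (line.startsWith("</ul>")) {
                append(out, depth, line);            // the close tag stays inside the list
                depth = Math.max(1, depth - 1);
            } else {
                append(out, depth, line);
            }
        }
        return out.toString();
    }

    private static void append(StringBuilder out, int depth, String text) {
        for (int i = 0; i < depth * 2; i++) out.append(' ');
        out.append(text).append('\n');
    }

    public static void main(String[] args) {
        System.out.print(toOutline("/**\n * An API comment with HTML.\n * <p>\n * With a list:\n"
                + " * <ul>\n * <li>Item 1\n * <li>Item 2\n * </ul>\n *\n"
                + " * @author eric armstrong\n * @version 1.0\n */"));
    }
}
```

As in the encoding above, the </ul> tag ends up inside the list it closes. No HTML parsing is involved, so unbalanced tags simply produce odd nesting instead of an error.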

Summary

In the ideal world, I suspect that the XML version of the source should only be converted to plain text for the sake of compilation, and then only until XML-aware compilers exist. The XML form allows for great tree-based differencing, so that format should supplant CVS and its various plain-text cousins.

With that scenario, the conversion from plain text to XML is a one-time affair, and virtually any conversion that works is acceptable, since it can be modified into a more nested form over time. Better conversion utilities can be constructed over time, as well, to minimize the amount of manual effort needed to produce well-nested structures.

If round-trip conversion is anticipated, from XML to text and then back again, then there does not seem to be any encoding option which really "works". If the XML is left in a state that lets plain text editors use it without undue concern for structure modifications, then the XML version becomes less useful. On the other hand, if the XML version is modified to produce better structuring, then the plain text version is less readable and more easily damaged.

In particular, the ability to tuck code away under comments, in an effort to generate more readable, "literate" code, is something that works really well in XML, but not in plain text. If "round trip" processing is the norm, then either one of the main benefits of hierarchical structuring must go unused, or else the plain text version will be very weird.

The conclusion seems to be that an effective system will generate XML from plain text, and then remain as an XML-based system from that point forward. Although occasional exports to plain text form might be valuable, the concept of seriously integrating plain-text users with XML users seems ill-advised.