Eric Armstrong
This document represents a preliminary outline of the data structures for the collaborative document system known as OHS -- Open Hyperdocument System. The data structures described here are represented in XML, since that is most likely the format we will use for interchange.
Note:
Since we have not yet worked through the use cases, this document is necessarily speculative. It also needs to be checked against the requirements document, since I have undoubtedly overlooked a few things.
ToDo:
Work through use cases.
Check against requirements document.
The data structure design has several main goals. The first is to serve as a "normal form" for documents that come from multiple sources, such as:
The second goal is to make the design of the data structures as simple as possible. Real simplicity is obviously impossible, however. The large number of interacting requirements makes it unattainable. What is attainable, though, is regularity. It is hoped that the entire system can be built up from a small number of "atomic" nodes, each of which has a regular, consistent structure. With that in mind, we start by defining a "template" for nodes in the system.
The third goal, and possibly the most important, is to provide the foundation for the capabilities and functionality defined in the Requirements document.
Here is the template for the basic nodes in the system (explanations follow):
<nodetype>
<CONTENT>...</CONTENT>
<DATA>...</DATA>
<LISTS>...</LISTS>
</nodetype>
The nodetype value is one of several predefined possibilities that are defined for the system. The initial set of node types consists of:
- VERNODE (version node)
- Since the system must support versioning, some mechanism is needed for indirect linking. In some cases, of course, a link might be created that points directly to an existing node. But in cases where the most recent version of a node is the target of the link, the link must point to a "virtual" node -- a version node -- which contains a link to the real information. When pursuing a link, the system will automatically step through version nodes to retrieve the most recent version of the information.
Note:
It is likely that there will be a large number of indirect links of this type. The result, if printed, will be a system that defies comprehension by the human mind. Computers, on the other hand, will be able to handle the complexity quite easily. While computers will be unable to comprehend the content of the nodes in the foreseeable future, they will be able to present that content to human users in a way that makes the information interactively usable. The result is a man/machine symbiosis intended to augment human thinking in ways that will increase our ability to collaboratively investigate and solve complex problems.
Design Note:
The version node could simply contain a pointer to the current version, or it could contain a list of pointers to all versions, with the most recent at the top. If it contains a single pointer, then each node must chain to its previous version. If it contains a list of pointers, then each node must link back to the version node, which must be consulted when a link to the previous version is created. The second alternative requires more computation when visiting a previous version of a node, but it creates a simpler "tree" structure.
Implication:
The NLS system demonstrated that a link must be able to include extra information that determines the view of the target information -- which subcategories to display, how many levels, and so forth. In addition, the link must be able to specify the version it points to. (No version => latest version is desired)
- INFNODE (information node)
- Information nodes are the primary content carriers in the system. Every paragraph and heading in a document converts to a single information node in the system, under the principle that (in theory, at least) each paragraph expresses one main idea.
- STRLIST (structure node)
- This node serves as the list header for a node's substructure. For a heading in a document like this, the heading is represented by an INFNODE. The text of the heading is the heading's CONTENT. One of the LISTS under that heading contains the introductory paragraphs and subheadings that are directly under that heading. The head of that (substructure) list is a STRLIST. (The separation between CONTENT, DATA, and LISTS will be discussed in the next section, along with the need for distinguishing them.)
Q: Do the contents of the STRLIST need to be STRNODES, or can they be INFNODES?? (STRNODES may be desirable. See discussion of styles in the next to last section.)
- RTGNODE (rating node)
- A rating node captures the sum of evaluations under it. Its DATA segment contains the average of ratings below it, as well as the number of entries. For the list header, the CONTENT section is empty. For entries in the rating list, the CONTENT section could contain a single-paragraph explanation for the rating. Deeper discussions would require a structure list. The CONTENT section could then be used for a one-paragraph summary.
Q: Alternative: The INFNODE adds RATING_AVERAGE and RATING_NUMBER elements to its DATA section. RTGNODEs are then used to add a rating to a node. The DATA section of a RTGNODE then contains a RATING element, only.
- CATNODE (category node)
- A category node defines an information category in the system. For an IBIS-style conversation, for example, categories include question, alternative, argument:for, and argument:against. Since a node can be categorized multiple ways, one of a node's LISTS delineates the categories to which it belongs. That list will consist of pointers to category nodes (or possibly the virtual place holder for that node). Similarly, the category node will contain a list of pointers to nodes that belong to that category, allowing for fast search and compare operations.
- CATLIST (category list)
- The category list is the header for the list of categories an INFNODE belongs to. That list contains pointers to CATNODES.
Q: Is this nodetype necessary, or can it be eliminated by using INFNODEs for category definitions and using CATNODE as the list header?? What will the pointers look like, and do they need a separate node type??
- RELNODE / RELLIST (relation node/list)
- The ability to define relations in the system requires a node that encapsulates the relation. Since a node can be part of multiple relationships, a list is required. A RELNODE will be very similar to a CATNODE, except that some relationships are one-way. That implies a need for two LISTS under a RELNODE -- a "from" list, and a "to" list.
Q: How will this work, exactly? (Details TBD)
- STYNODE (style node)
- This is a highly provisional concept. But when we investigate the concept of "structure" under a node, we begin to see that the subheadings and lists under a heading, for example, can really use some rudimentary style information. In a very real sense, such "style" attributes capture important information. When a numbered list is used, for example, it indicates a seriality: item #3 follows item #2, which follows #1. A bullet list indicates a "parallel" construction, where the items can be considered in any order. Although such information is typically thought of as "formatting", it also represents valuable substantive information. (In addition, the ability to serve as a normal form for HTML/xHTML documents and DocBook articles may well require some kind of style-related capabilities.)
Q: Can we agree that this is needed? (TBD)
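The version-node alternatives in the VERNODE design note above can be illustrated with a small Python sketch. It is not part of the OHS design: the `InfoNode` and `VersionNode` classes, their field names, and the `resolve` function are all hypothetical. The sketch models the second alternative (the version node holds a list of pointers, most recent first) and treats a link with no version as a request for the latest version.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InfoNode:
    """Hypothetical stand-in for an INFNODE: one version of some content."""
    node_id: str
    content: str


@dataclass
class VersionNode:
    """Hypothetical stand-in for a VERNODE (list-of-pointers alternative)."""
    node_id: str
    versions: list = field(default_factory=list)  # most recent first

    def add_version(self, node: InfoNode) -> None:
        # New versions go at the top of the list.
        self.versions.insert(0, node)


def resolve(target, version: Optional[int] = None) -> InfoNode:
    """Follow a link target, stepping through a version node if present.

    version=None means "latest"; otherwise versions count from 1 (oldest).
    """
    if isinstance(target, InfoNode):
        return target  # direct link to a concrete node
    if not target.versions:
        raise LookupError(f"version node {target.node_id} has no versions")
    if version is None:
        return target.versions[0]
    if not 1 <= version <= len(target.versions):
        raise LookupError(f"no version {version} of {target.node_id}")
    # versions[-1] is version 1, versions[-2] is version 2, and so on.
    return target.versions[-version]


vernode = VersionNode("v1")
vernode.add_version(InfoNode("n1", "first draft"))
vernode.add_version(InfoNode("n2", "second draft"))
latest = resolve(vernode)              # no version given: latest is wanted
original = resolve(vernode, version=1)
```

With a list of pointers, reaching an earlier version costs an index lookup in the version node rather than a walk along a chain of nodes.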
The next section discusses the reason for segmenting the node into content, data, and lists sections.
The structure of the basic node has been divided into 3 sections: CONTENT, DATA, and LISTS. This section explains why. But first, a word about the basic node attributes.
At a bare minimum, a node needs an ID attribute so that other nodes can link to it unambiguously: <nodetype id="identifying-value">. The ID value must be unique within the system. Note, however, that it is not globally unique. Since we intend to implement a distributed document object model (DDOM), that value will be shared across every system that has a copy of the node.
Note:
Element attributes are reserved for invisible, non-extensible aspects of nodes. The ID attribute, for example, is one which is used internally in the system, but is never exposed to the user, nor can the user add new attributes. Information which is represented in some visible form, which can be manipulated by the user, or which can be added by the user, is encoded in the DATA section of a node.
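As a rough illustration of the node template and the ID attribute, the following Python sketch builds an empty node with the standard `xml.etree.ElementTree` module. The `make_node` helper and the ID value `n1` are assumptions made for the example; the design does not yet say how IDs are generated.

```python
import xml.etree.ElementTree as ET


def make_node(nodetype: str, node_id: str, content: str = "") -> ET.Element:
    """Build a node with the three sections from the template.

    The id is stored as an attribute (invisible to the user); anything
    the user can see or change would go into CONTENT or DATA instead.
    """
    node = ET.Element(nodetype, {"id": node_id})
    ET.SubElement(node, "CONTENT").text = content
    ET.SubElement(node, "DATA")
    ET.SubElement(node, "LISTS")
    return node


# Hypothetical example: a heading represented as an INFNODE.
heading = make_node("INFNODE", "n1", "An Important Heading")
print(ET.tostring(heading, encoding="unicode"))
```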
Now, at last, we get to the discussion of the node's structure. The need for the separation into content and structure arises from a major deficiency in XML, for our purposes. While XML is wonderful in many ways, it has no mechanism for distinguishing "tags which convey style information" from "tags which represent structure". For example, consider the following hypothetical segment of document markup:
<h2>An <i>Important</i> Heading
<p>An introductory paragraph</p>
<h3>A SubHeading
<p>A paragraph under the subheading</p>
</h3>
</h2>
This segment represents the kind of thing you would like to do with XML. However, XML's validation mechanisms don't allow you to restrict an XML document to a form like this.
The things to note about this format are:
It is the last observation in that list which is impossible to describe in XML. There is no way to say "text and style tags are allowed up until the first structure tag is seen, and none thereafter". There is not, in fact, any way to distinguish style tags from structure tags at all.
The inability to make such restrictions means that the "mixed content model" (text plus tags) which can be defined in XML would allow text and style tags to occur between the paragraph element (<p>...</p>) and the <h3> element, for example. Since that text would not be enclosed in any structure, it would be completely ambiguous. There is no way for a program to know what to do with it. Although programs could be required to detect such errors, that requirement defeats the automatic verification that constitutes one of XML's major advantages.
The solution to the problem is to introduce another layer of structure: a CONTENT layer. That is the solution used by DocBook when defining its section elements. In DocBook, a section element contains a title element, like this:
<SECT1>
<TITLE>...title text here...</TITLE>
<SECT2>
<TITLE>...</TITLE>
...
</SECT2>
</SECT1>
Rather than having the text of the section heading belong to the section element directly, DocBook introduces the TITLE tag to hold it. That addition distinguishes the content of the heading from the structure under it, and allows the structure to be automatically validated for correctness.
The cost, however, is one of program complexity -- especially with respect to editors. The added element prevents a "natural" mapping of the structure to a display. Without that added element, you could simply display the XML as a tree and allow it to be edited. The result would be an outline-version of the document, with various inline elements like bold and italic adding a bit of style.
However, the addition of the extra element prevents any such natural mapping. The editor must now understand that the content of a <SECT1> element is, in reality, in its TITLE element. And when the text is edited, it must know to store the change in the TITLE element, rather than as part of the SECT1 element.
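The extra knowledge an editor needs can be made concrete with a short Python sketch. The helper names `get_heading` and `set_heading` are hypothetical; the point is only that both must know to look inside TITLE rather than at the section element itself.

```python
import xml.etree.ElementTree as ET


def get_heading(section: ET.Element) -> str:
    """Return the heading text, which lives in the TITLE child."""
    title = section.find("TITLE")  # direct child only
    return "".join(title.itertext()) if title is not None else ""


def set_heading(section: ET.Element, text: str) -> None:
    """Store edited heading text in TITLE, not in the section element."""
    title = section.find("TITLE")
    if title is None:
        title = ET.Element("TITLE")
        section.insert(0, title)
    for child in list(title):  # drop old inline markup
        title.remove(child)
    title.text = text


sect = ET.fromstring(
    "<SECT1><TITLE>Old <i>title</i></TITLE>"
    "<SECT2><TITLE>Sub</TITLE></SECT2></SECT1>"
)
before = get_heading(sect)
set_heading(sect, "New title")
after = get_heading(sect)
```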
The reason for taking the trouble to explain all this is to observe that if a mechanism similar to XML were found or constructed, that allowed structure tags to be distinguished from content tags, then editors could operate on the documents much more naturally and easily, without having to understand specific semantics like SECT1/TITLE.
Note:
We could always define a type attribute: <tag type="style"> or <tag type="struct">. We could then require that implementing programs verify that after the first node of type struct is seen, no more text or style tags are allowed. We could define the type so that it defaults to "style". That would prevent it from having to be specified as part of <b> and <i> tags, for example. However, that still shifts the burden of validation to every program that attempts to interact with the system, rather than allowing XML mechanisms to do it automatically.
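To show what that program-level check might involve, here is a minimal Python sketch. It assumes the hypothetical `type` attribute described in the note, defaulting to "style", and the function name is invented for the example. It is exactly the kind of validation every implementing program would have to repeat.

```python
import xml.etree.ElementTree as ET


def is_struct(elem: ET.Element) -> bool:
    # Assumed convention from the note: type defaults to "style".
    return elem.get("type", "style") == "struct"


def content_precedes_structure(elem: ET.Element) -> bool:
    """True if, in every structure element, all text and style tags
    come before the first structure tag."""
    seen_struct = False
    for child in elem:
        if is_struct(child):
            seen_struct = True
            if not content_precedes_structure(child):
                return False
        elif seen_struct:
            return False  # style tag after a structure tag
        # Text following a child is stored in child.tail.
        if seen_struct and child.tail and child.tail.strip():
            return False  # stray text after a structure tag
    return True


GOOD = """
<h2 type="struct">An <i>Important</i> Heading
  <p type="struct">An introductory paragraph</p>
  <h3 type="struct">A SubHeading
    <p type="struct">A paragraph under the subheading</p>
  </h3>
</h2>
""".strip()

BAD = (
    '<h2 type="struct">Heading<p type="struct">Intro</p>'
    'stray text<h3 type="struct">Sub</h3></h2>'
)

good_ok = content_precedes_structure(ET.fromstring(GOOD))
bad_ok = content_precedes_structure(ET.fromstring(BAD))
```

The second document is well-formed XML and would pass a mixed-content declaration, yet the text between the paragraph and the subheading belongs to no structure.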
An unstated assumption in the foregoing analysis is that the CONTENT section will be able to contain various style elements like bold and italic (<b>, <i>). Such tags are important because they impart information, as well as formatting that makes the text more readable.
Note:
In some cases, italics means italics. For example, in this sentence italic formatting is added for emphasis, so the text reads as you would hear it if I spoke it. In other cases, italics is a way of imparting information. For example the first use of a new term is typically italicized, in order to indicate that the current paragraph supplies a definition. In such cases, it undoubtedly makes sense to use XML's capacity to define new tags. For example, a <gls> or <def> tag might be defined to highlight glossary terms (definitions) in those paragraphs where a definition (or part of a definition) is supplied.
Q: How to add new tags to the system, how to set default rendering (e.g. "def" => italics), and how to customize the rendering in the client browser.
Elements like bold and italics are defined in xHTML's Document Type Definition (DTD) as inline tags. That definition forms the basis for the "mixed content" (text and tags) definition of CONTENT, as it would be defined in a DTD for the system:
[ToDo: Insert the full list of inline tags from the
slide showing HTML style vs. structure tags]
[ToDo: Compare with the list of inline tags defined in the xHTML DTD.]
(PCDATA | inline)*
Note:
"PCDATA" means "parsed character data". In other words, it is text -- data that is going to be read and parsed (inspected) for tags and other entities, unlike pure "CDATA", which is never parsed. (CDATA is like HTML's "preformatted" element -- <pre> -- only more so: nothing inside of a CDATA section is ever interpreted. So while <b> in an HTML <pre> element would cause the text to be bold, in an XML CDATA section, it causes nothing, and would simply be displayed as "<b>".)
The DATA element provides a location for storing the data associated with a node. For a rating node, for example, data elements would include the <average> and <number> elements, containing the average rating value and the number of ratings it was constructed from. (More sophisticated systems might include ratings of individuals, which would then weight the results of their evaluations when constructing the "average". That is left for a future exercise.)
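A minimal Python sketch of how a rating node's DATA section might be filled in follows. The `update_rating_header` function, the numeric rating scale, and the use of a plain unweighted mean are assumptions for illustration; only the <average> and <number> element names come from the description above.

```python
import xml.etree.ElementTree as ET


def update_rating_header(header: ET.Element, ratings: list) -> None:
    """Store the unweighted average and the count in the DATA section."""
    data = header.find("DATA")
    if data is None:
        data = ET.SubElement(header, "DATA")
    # Replace any previously stored values.
    for tag in ("average", "number"):
        old = data.find(tag)
        if old is not None:
            data.remove(old)
    average = sum(ratings) / len(ratings) if ratings else 0.0
    ET.SubElement(data, "average").text = str(average)
    ET.SubElement(data, "number").text = str(len(ratings))


header = ET.fromstring('<RTGNODE id="r1"><CONTENT/><DATA/><LISTS/></RTGNODE>')
update_rating_header(header, [4, 5, 3])
```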
Data for a virtual node could include links to the previous and following versions of the document. That information needs to be stored somewhere -- the virtual node seems to be the natural place to put it.
Q: Are CONTENT and DATA both required, or can one be eliminated?? (For some reason, in my original notes it seemed necessary to have both. At the moment, I don't see the justification for that, but I'm loath to remove the distinction until I'm 100% sure it's not necessary.)
The lists element contains all of the sublists associated with a node. Those lists include:
Having opened the Pandora's box of style considerations with inline tags in CONTENT elements, it makes sense to consider the potential for defining styles for the nodes that constitute substructure. As previously noted, there is some level at which "style" constitutes information: whether a list of items is ordered or unordered (bulleted), or whether it is a list of headings or a list of plain paragraphs. On the other hand, there is also a sense in which structure-style is definitely format-related. Format considerations include the size of the font to use, the particular bullet for an unordered list, or the kind of enumeration for an ordered list (numbers, alphabetics, roman numerals, etc.)
There would seem to be a good case for preserving the information-content of styles in the system, while allowing the actual formatting to be determined when the information is displayed or printed. The primary reason for allowing the actual format to be determined at "run time" (when displayed or printed) is the fact that the system as presently conceived is far more dynamic than the document systems we are used to. In an HTML document, for example, an H2 element is placed under an H1, and there it sits. The author can declare it "H2", because it sits under an "H1" -- the levels are static.
In the proposed system, however, "levels" are much more dynamic. One document may include whole sections of another document. The heading may now be at level 3 or level 4 in the new context, instead of at level 2. The ability to reuse nodes therefore defines a "dynamic context" for a node, rather than the static context defined by a traditional document. That being the case, it is unwise to specify the format for a node as, for example, h2. Instead, the node can be identified simply as a heading. When actually displayed, the heading can be displayed using the font and size appropriate for its current context.
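The idea of a dynamic context can be sketched in Python as follows. The `Node` class and `render` function are hypothetical; the sketch only shows a node marked as a heading, with its level derived from nesting depth when it is displayed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node: text plus a kind, with no stored heading level."""
    text: str
    kind: str = "paragraph"  # "heading" or "paragraph"
    children: list = field(default_factory=list)


def render(node: Node, depth: int = 1) -> list:
    """Return display lines, deriving each heading level from its depth."""
    label = f"h{depth}" if node.kind == "heading" else "p"
    lines = [f"{label}: {node.text}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# The same section is reused, unchanged, in two documents.
shared = Node("Data Structures", "heading", [Node("Nodes are regular.")])
doc_a = Node("Design Notes", "heading", [shared])
doc_b = Node("Handbook", "heading", [Node("Part One", "heading", [shared])])

lines_a = render(doc_a)  # shared heading appears at level 2
lines_b = render(doc_b)  # shared heading appears at level 3
```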
Q: Which of several possible methods should be used for encoding the style information: some combination of new element types, new list types, and new attributes? One possibility is to use STRNODE as the header for a list of INFNODES. The STRNODE can then encode the type of entries in that list: paragraphs, ordered list, bullet list, or heading. In keeping with the principle that information which affects the display is encoded as data, that information would be stored in the DATA section of the STRNODE. The STRLIST, in turn, would consist of a list of STRNODES.
Note:
The ability to dynamically dictate display format implies some form of stylesheet capability. A stylesheet would let you specify the font sizes for headings at different levels, the bullet types for unordered lists at different levels, and the numbering style for ordered lists at different levels. So, for example, a bullet list might use dots at the first level, and diamonds for bullet entries under it. Numbered lists might use numbers at the first level, then lowercase alphabetics, then lowercase roman numerals.
Q: How/where are stylesheets defined and used??
It is clear that the system needs to take advantage of XML's ability to "include" referenced nodes inline, rather than merely linking to them. XML's XInclude mechanism can be used for that purpose.
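As a small demonstration of inclusion, Python's standard library offers basic XInclude processing in `xml.etree.ElementInclude`. In the sketch below, the in-memory `FRAGMENTS` store, the `node-42.xml` reference, and the custom `loader` are assumptions made for the example; a real system would resolve references against its repository.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical store standing in for the node repository.
FRAGMENTS = {
    "node-42.xml": '<INFNODE id="42"><CONTENT>Shared paragraph</CONTENT></INFNODE>',
}


def loader(href, parse, encoding=None):
    """Resolve an include reference from the in-memory store."""
    if parse == "xml":
        return ET.fromstring(FRAGMENTS[href])
    return FRAGMENTS[href]  # parse="text": return the raw string


doc = ET.fromstring(
    '<STRLIST xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="node-42.xml"/>'
    "</STRLIST>"
)

# Replace each xi:include element, in place, with the referenced node.
ElementInclude.include(doc, loader=loader)
included = doc[0]
```

After processing, the list contains the referenced node itself rather than a pointer to it, which is the distinction between inclusion and linking.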
Q: It may be that the ability to specify inclusion vs. linking eliminates the need for typed links. Or it may be that the category information contained in the target of a link is sufficient for typing a link. Or it may be that typed links are really needed. For example, a typed link could distinguish a document pointer from a pointer to an author's home page. Typed links could be color coded or shifted around on the page. In addition, using typed links could give the user control over the display. Links of one type might show up like HTML links. Another type (say "comments") could be displayed inline, as though they were part of the original document. Or they could be made to appear in a separate window that was automatically synchronized with the content window. Such dynamic, interactive control would be enabled by typed links.
Adding a concept of structural style to the system starts a path towards the slippery slope that HTML struggled with. At the outset, it contained only lists and headings and such. But the need for tables soon became apparent. (A need for frames, forms, and a dozen other things also seemed necessary for HTML, but one hopes that the system can avoid those complexities. It has enough of its own.)
Like hierarchical structures, tables are familiar ways of organizing information. Any system that hopes to improve mankind's ability to deal with complexity must make allowance for tables, in one form or another. Interestingly, the capabilities of the system defined thus far allow an interesting table-facility to be implemented as a display mechanism, without adding any new structuring information to the system.
The existence of categories provides a wonderful opportunity to define dynamic tables. In essence, every column in the table represents a category. The first column in the table is a list of information nodes. Cells in the remaining columns are filled with the nodes that are linked-to or in a relation with the first-column nodes.
Q: Can links be used for this?? If so, typed links are needed to distinguish simple references from links that participate in the table. Or is it better to use relations??
Using categories, lists, and relations (or links) would allow tables to be constructed "on the fly", from information stored in the repository. In essence, such a table would be a "view" of the system.
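A table built as a view can be sketched in a few lines of Python. Everything here is hypothetical: the node IDs, the `CATEGORIES` and `RELATIONS` data, and the `table_view` function are illustrations, and the sketch assumes relations rather than links connect the first-column nodes to the others.

```python
from collections import defaultdict

# Hypothetical repository data: node id -> set of categories,
# and one-way (from, to) relations between nodes.
CATEGORIES = {
    "q1": {"question"},
    "a1": {"alternative"},
    "a2": {"alternative"},
    "f1": {"argument:for"},
}
RELATIONS = [("q1", "a1"), ("q1", "a2"), ("q1", "f1")]


def table_view(rows, columns, categories, relations):
    """Build one row per first-column node. Each remaining cell lists
    the related nodes that belong to that column's category."""
    related = defaultdict(list)
    for source, target in relations:
        related[source].append(target)
    table = []
    for node in rows:
        cells = [
            [t for t in related[node] if col in categories.get(t, ())]
            for col in columns
        ]
        table.append([node] + cells)
    return table


view = table_view(["q1"], ["alternative", "argument:for"], CATEGORIES, RELATIONS)
```

Nothing about the table is stored in the repository; choosing different rows or columns produces a different view of the same nodes.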
Q: How to enshrine that view as a "document", and transmit it to others so they can see the organization you perceive.
A: Use the NLS technique -- include a specification of the "view" as part of a link. Extend view parameters to include link type as "inline", "visible", or "invisible". (An inline link includes the linked material in the current document view when it is displayed. A visible link shows up as underlined and in color, as in HTML. An "invisible" link changes the cursor shape to indicate a link is present, but is otherwise consistent with the text that surrounds it.)
Using category-based tables would make it possible, for example, to view source code and comments on that code side by side. Similarly, documents could be viewed next to the comments and ratings they received.
Q: The information could be stored in separately scrolling
windows that synchronize with each other as needed. On the other hand, a single-window
system would allow for the kind of color-coded background and threading that
Ted Nelson displayed at the colloquium. In that vision, he had different background
colors for blocks of text, with a "thread" (a line) of that color
linking related items. The first node in your document might have a gray background,
for example. The 3 comments it received would be adjacent to it, also with a
gray background. A gray line would then link the two blocks. The next node in
the document might be in red, etc.