TREX - Tree Regular Expressions for XML
Language Specification

Author:
    James Clark (Thai Open Source Software Center) <jjc@thaiopensource.com>
Date:
    2001-02-13

Copyright © 2001 Thai Open Source Software Center Ltd

Table of contents

1 Introduction
2 Data Model
3 Primitive Syntax
4 Primitive Semantics
5 Restrictions
  5.1 Recursion
  5.2 Strings
  5.3 Non-restrictions
6 Full Syntax
7 Full Semantics
  7.1 Inclusion
  7.2 Namespace Expansion
  7.3 Transformation Rules
  7.4 Resolving Conflicts
8 TREX pattern for TREX patterns

1 Introduction

This document is the definitive specification of the TREX language.

The primary concept in TREX is the pattern. An unordered collection of attributes and an ordered sequence of elements and characters are matched jointly against a pattern with respect to an environment. An environment is a mapping from names to patterns together with a possibly null reference to a parent environment. The result of matching is true or false.

An XML document is valid with respect to a TREX pattern if an empty collection of attributes and a sequence containing just the document element of the document matches the pattern with respect to an empty environment.

TREX also has the concept of a name-class, which is a set of expanded-names.

2 Data Model

An element is a triple <name, attributes, children>, where name is an expanded-name, attributes is a unordered collection of zero or more attributes and children is an ordered sequence of zero or more elements or characters. An attribute is a pair <name, value>, where name is an expanded-name and value is a sequence of zero or more characters. An expanded-name is a pair <namespace URI, local name>, where namespace URI is a string containing a URI reference and local name is a string matching the NCName production of the XML Recommendation.

3 Primitive Syntax

We first of all define a primitive syntax, which allows only a subset of what the full syntax allows, but which can express everything that the full syntax can.

The primitive syntax is as follows:

pattern ::= <t:element> name-class pattern </t:element>
| <t:attribute> name-class pattern </t:attribute>
| <t:anyString/>
| <t:string white-space> string </t:string>
| <t:data type="NCName" ns="namespaceURI"/>
| <QName t:role="datatype" attributes> anything </QName>
| <t:empty/>
| <t:notAllowed/>
| <t:oneOrMore> pattern </t:oneOrMore>
| <t:group> pattern pattern </t:group>
| <t:choice> pattern pattern </t:choice>
| <t:interleave> pattern pattern </t:interleave>
| <t:concur> pattern pattern </t:concur>
| <t:ref name="NCName" parent/>
| <t:grammar> <t:start> pattern </t:start> definition* </t:grammar>
definition ::= <t:define name="NCName"> pattern </t:define>
name-class ::= <t:name ns="namespaceURI"> NCName </t:name>
| <t:anyName/>
| <t:nsName ns="namespaceURI"/>
| <t:choice> name-class name-class </t:choice>
| <t:difference> name-class name-class </t:difference>
white-space ::= whiteSpace="preserve"
| whiteSpace="normalize"
parent ::= parent="true"
| parent="false"

The above syntax assumes a namespace declations of xmlns:t="http://www.thaiopensource.com/trex" .

4 Primitive Semantics

We now define the semantics of the primitive syntax.

We use the following:

First, we define M.

M [[<t:element> name-class pattern </t:element>]] (a, c, e) if and only if

M [[<t:attribute> name-class pattern </t:attribute>]] (a, c, e) if and only if

M [[<t:anyString/>]] (a, c, e) if and only if

M [[<t:string whiteSpace="preserve"> string </t:string>]](a, c, e) if and only if

M [[<t:string whiteSpace="normalize"> string </t:string>]](a, c, e) if and only if

M [[<t:data type="NCName" ns="namespaceURI"/>]](a, c, e) if and only if

M [[<QName t:role="datatype" attributes> anything </QName>]](a, c, e) if and only if

M [[<t:empty/>]] (a, c, e) if and only if

M [[<t:notAllowed/>]] (a, c, e) is always false

M [[<t:choice> pattern1 pattern2 </t:choice>]] (a, c, e) if and only if

M [[<t:concur> pattern1 pattern2 </t:concur>]] (a, c, e) if and only if

M [[<t:oneOrMore> pattern </t:oneOrMore>]] (a, c, e) if and only if

M [[<t:group> pattern1 pattern2 </t:group>]](a, c, e) if and only if there exist a1, a2, c1, c2 such that

M [[<t:interleave> pattern1 pattern2 </t:interleave>]](a, c, e) if and only if there exist a1, a2, c1, c2 such that

M [[<t:ref parent="false" name="name"/>]](a, c, e) if and only if

M [[<t:ref parent="true" name="name"/>]](a, c, e) if and only if

M [[<t:grammar> <t:start> pattern </t:start> definition* </t:grammar>]](a, c, e) if and only if

Next, we define C.

C [[<t:name ns="namespaceURI"> NCName </t:name>]](n) if and only if

C [[<t:anyName/>]](n) is always true

C [[<t:nsName ns="namespaceURI"/>]](n) if and only if

C [[<t:choice> name-class1 name-class2 </t:choice>]](n) if and only if

C [[<t:difference> name-class1 name-class2 </t:difference>]](n) if and only if

5 Restrictions

5.1 Recursion

Recursion is restricted. Informally, recursive references are permitted only within an element. In other words, within any definition there must be an element enclosing any reference to that definition.

We write prohibit(e, NCName, n) to denote an environment that is the same as e except that NCName is bound to the pattern <t:prohibit>n</t:prohibit>.

We write R [[p]](e,d) to mean that pattern p has no illegal recursion with respect to environment e at element depth d. We can then define R as follows:

The restriction on recursion is simply that for the outermost pattern p it must be the case that R [[p]]({}, 0).

5.2 Strings

TODO Need to describe restriction on sequencing strings.

5.3 Non-restrictions

Apart from as above, there is no restriction on ambiguity/lookahead/determinism.

6 Full Syntax

Next, we define the full syntax.

pattern ::= <t:element> name-class pattern+ </t:element>
| <t:element name="QName"> pattern+ </t:element>
| <t:attribute> name-class pattern? </t:attribute>
| <t:attribute name="QName" global?> pattern? </t:attribute>
| <t:anyString/>
| <t:string white-space?> string </t:string>
| <t:data type="NCName"/>
| <QName t:role="datatype" attributes> anything </QName>
| <t:empty/>
| <t:notAllowed/>
| <t:oneOrMore> pattern+ </t:oneOrMore>
| <t:zeroOrMore> pattern+ </t:zeroOrMore>
| <t:optional> pattern+ </t:optional>
| <t:mixed> pattern+ </t:mixed>
| <t:group> pattern+ </t:group>
| <t:choice> pattern+ </t:choice>
| <t:interleave> pattern+ </t:interleave>
| <t:concur> pattern+ </t:concur>
| <t:ref name="NCName" parent?/>
| <t:grammar> grammar-element* </t:grammar>
| <t:include href="URIReference"/>
grammar-element ::= <t:define name="NCName" combine?> pattern+ </t:define>
| <t:start combine?> pattern+ </t:start>
| <t:start name="NCName" combine?> pattern+ </t:start>
| <t:include href="URIReference"/>
name-class ::= <t:name> QName </t:name>
| <t:anyName/>
| <t:nsName/>
| <t:choice> name-class+ </t:choice>
| <t:difference> name-class+ </t:difference>
| <t:not> name-class </t:not>
global ::= global="true"
| global="false"
white-space ::= whiteSpace="preserve"
| whiteSpace="normalize"
parent ::= parent="true"
| parent="false"
combine ::= combine="replace"
| combine="choice"
| combine="group"
| combine="interleave"
| combine="concur"

In addition, any element may have an ns attribute.

It is an error if a grammar element contains multiple grammar-elements with the same name.

7 Full Semantics

Now, we define the semantics of the full syntax by showing how to transform the full syntax into the primitive syntax.

7.1 Inclusion

The first stage of the transformation expands include elements.

When an include element appears as a pattern, it is replaced by the element referenced by the URI reference contained in the href attribute. The referenced element must specify a pattern. If an ns attribute is present on the include element but no ns attribute is present on the referenced element, then, before the include element is replaced, the ns attribute is added to the referenced element.

When an include element occurs as a child of a grammar element, then the element referenced by the URI reference contained in the href attribute must also be a grammar element, and the include element is replaced by the children of the referenced grammar element. If an ns attribute is present on the include element or on the referenced grammar element, then, before the include element is replaced, an ns attribute is added to each child element of the referenced grammar element that does not have an ns attribute; the value of the added ns attribute is equal to the value of the ns attribute on the referenced grammar element if it has one, and otherwise is equal to the value of the ns attribute on the include element.

It is an error if, before the replacement of any include elements, a grammar element has two or more child elements with the same name or has two or more start child elements, even if those child elements have combine attributes.

If the replacement of an include element also contains include elements, then those elements are replaced recursively. It is an error a pattern included by a some URI directly or indirectly includes a pattern with that same URI; relative URIs must be resolved to absolute URIs before testing whether they are the same URI.

7.2 Namespace Expansion

Namespace expansion makes namespaces explicit on all elements that specify names.

First, an ns attribute is added to any of the following categories of element if they do not already have an ns attribute:

The value of the added ns attribute is equal to the value of the ns attribute of the nearest ancestor that has an ns attribute, or the empty string if there is no such ancestor. Note that ancestor means ancestor in the tree after inclusions have been resolved. Thus, ns attributes inherit into included patterns.

Next, for

the prefix together with the colon is removed and an ns attribute is added, replacing any existing ns attribute. The value of the added ns attribute is equal to the namespace URI declared by the in-scope namespace declarations for that prefix. It is an error if there is no declaration for that prefix. Note that the in-scope namespace declarations of an element are not changed when it is included. Thus namespace declarations are not inherited into included patterns.

Next, an ns="" attribute is added to any attribute element with a name attribute that does not yet have an ns attribute.

Finally, any ns attribute on an element that is not one of

is removed. Also any global attribute on an attribute element is removed.

7.3 Transformation Rules

The following rules specify how to transform a pattern using the full syntax into a pattern using the primitive syntax.

<t:mixed> pattern <t:mixed> => <t:interleave> <t:anyString/> pattern </t:interleave>

<t:optional> pattern </t:optional> => <t:choice> <t:empty/> pattern </t:choice>

<t:zeroOrMore> pattern </t:zeroOrMore> => <t:optional> <t:oneOrMore> pattern </t:oneOrMore> </t:optional>

<t:group> pattern </t:group> => pattern

<t:interleave> pattern </t:interleave> => pattern

<t:choice> pattern </t:choice> => pattern

<t:concur> pattern </t:concur> => pattern

<t:element> name-class pattern1 pattern2+ </t:element> => <t:element> name-class <t:group> pattern1 pattern2+ </t:group> </t:element>

<t:oneOrMore> pattern1 pattern2+ </t:oneOrMore> => <t:oneOrMore> <t:group> pattern1 pattern2+ </t:group> </t:oneOrMore>

<t:optional> pattern1 pattern2+ </t:optional> => <t:optional> <t:group> pattern1 pattern2+ </t:group> </t:optional>

<t:zeroOrMore> pattern1 pattern2+ </t:zeroOrMore> => <t:zeroOrMore> <t:group> pattern1 pattern2+ </t:group> </t:zeroOrMore>

<t:mixed> pattern1 pattern2+ </t:mixed> => <t:mixed> <t:group> pattern1 pattern2+ </t:group> </t:mixed>

<t:not> pattern1 pattern2+ </t:not> => <t:not> <t:group> pattern1 pattern2+ </t:group> </t:not>

<t:start> pattern1 pattern2+ </t:start> => <t:start> <t:group> pattern1 pattern2+ </t:group> </t:start>

<t:define> pattern1 pattern2+ </t:define> => <t:define> <t:group> pattern1 pattern2+ </t:group> </t:define>

<t:attribute> name-class <t:/attribute> => <t:attribute> name-class <t:anyString/> <t:/attribute>

<t:start name="name"> pattern+ <t:/start> => <t:start> pattern+ <t:/start> <t:define name="name"> pattern+ <t:/define>

definitions1 <t:start> pattern <t:/start> definitions2 => <t:start> pattern <t:/start> definitions1 definitions2

<t:element name="NCName" ns="ns"> pattern+ <t:/element> => <t:element> <t:name ns="ns">NCName<t:/name> pattern+ <t:/element>

<t:attribute name="NCName" ns="ns"> pattern? <t:/attribute> => <t:attribute> <t:name ns="ns">NCName<t:/name> pattern? <t:/attribute>

<t:group> pattern1 pattern2 pattern3+ </t:group> => <t:group> <t:group> pattern1 pattern2 </t:group> pattern3+ </t:group>

<t:choice> pattern1 pattern2 pattern3+ </t:choice> => <t:choice> <t:choice> pattern1 pattern2 </t:choice> pattern3+ </t:choice>

<t:interleave> pattern1 pattern2 pattern3+ </t:interleave> => <t:interleave> <t:interleave> pattern1 pattern2 </t:interleave> pattern3+ </t:interleave>

<t:concur> pattern1 pattern2 pattern3+ </t:concur> => <t:concur> <t:concur> pattern1 pattern2 </t:concur> pattern3+ </t:concur>

<t:string> string </t:string> => <t:string whiteSpace="normalize"> string </t:string>

<t:ref name="NCName"/> => <t:ref parent="false" name="NCName"/>

<t:choice> name-class1 name-class2 name-class3+ </t:choice> => <t:choice> <t:choice> name-class1 name-class2 </t:choice> name-class3+ </t:choice>

<t:difference> name-class1 name-class2 name-class3+ </t:difference> => <t:difference> <t:difference> name-class1 name-class2 </t:difference> name-class3+ </t:difference>

<t:choice> name-class </t:choice> => name-class

<t:difference> name-class </t:difference> => name-class

<t:not> name-class </t:not> => <t:difference> <t:anyName/> name-class </t:difference>

7.4 Resolving Conflicts

Finally, conflicting define and start elements are combined. The combine attribute controls how the elements are combined. A define element conflicts with a preceding define element if it has the same name. A start element conflicts with any preceding start element. Each define and start child element is examined in turn. If the current element conflicts with a preceding element, then if the current element does not have a combine attribute, it is an error; otherwise, it is combined with the preceding element with which it conflicts according to the value of the combine attribute. If the value of the combine attribute is replace, then the current element replaces the conflicting element. Otherwise, the value of the combine attribute is the name of the element to use to combine the pattern contained in the current element with the pattern contained in the conflicting element. In other words,

<t:define name="x"> p1</t:define>

and

<t:define name="x" combine="e"> p2</t:define>

will be combined into

<t:define name="x"> <t:e> p1 p2 </t:e> </t:define>

If there is a combine attribute, but there is no conflicting preceding definition, then the combine attribute is ignored.

8 TREX pattern for TREX patterns

<grammar ns="http://www.thaiopensource.com/trex"
         xmlns="http://www.thaiopensource.com/trex">
  <start name="pattern">
    <choice>
      <element name="element">
	<choice>
	  <attribute name="name"/>
	  <ref name="open-name-class"/>
	</choice>
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="attribute">
	<ref name="common-atts"/>
	<choice>
	  <group>
	    <attribute name="name"/>
	    <optional>
	      <attribute name="global">
		<choice>
		  <string>true</string>
		  <string>false</string>
		</choice>
	      </attribute>
	    </optional>
	  </group>
	  <ref name="open-name-class"/>
	</choice>
        <interleave>
          <ref name="other"/>
          <optional>
            <ref name="pattern"/>
          </optional>
        </interleave>
      </element>
      <element name="group">
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="interleave">
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="choice">
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="concur">
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="optional">
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="zeroOrMore">
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="oneOrMore">
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="mixed">
        <ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="ref">
	<attribute name="name"/>
        <optional>
          <attribute name="parent">
            <choice>
              <string>true</string>
              <string>true</string>
            </choice>
          </attribute>
        </optional>
        <ref name="common-atts"/>
      </element>
      <element name="empty">
        <ref name="common-atts"/>
	<ref name="other"/>
      </element>
      <element name="anyString">
        <ref name="common-atts"/>
	<ref name="other"/>
      </element>
      <element name="string">
        <optional>
          <attribute name="whiteSpace">
	    <choice>
	      <string>preserve</string>
	      <string>normalize</string>
	    </choice>
          </attribute>
        </optional>
        <ref name="common-atts"/>
        <mixed>
          <ref name="other"/>
        </mixed>
      </element>
      <element name="data">
        <attribute name="type"/>
        <ref name="common-atts"/>
        <ref name="other"/>
      </element>
      <ref name="anonymous-datatype"/>
      <element name="notAllowed">
        <ref name="common-atts"/>
	<ref name="other"/>
      </element>
      <element name="include">
	<attribute name="href"/>
        <ref name="common-atts"/>
	<ref name="other"/>
      </element>
      <element name="grammar">
	<ref name="common-atts"/>
        <interleave>
          <ref name="other"/>
   	  <oneOrMore>
            <ref name="grammar-element"/>
	  </oneOrMore>
        </interleave>
      </element>
    </choice>
  </start>

  <define name="grammar-element">
    <choice>
      <element name="define">
	<attribute name="name"/>
        <ref name="combine-att"/>
	<ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="start">
	<optional>
	  <attribute name="name"/>
	</optional>
        <ref name="combine-att"/>
	<ref name="common-atts"/>
	<ref name="open-patterns"/>
      </element>
      <element name="include">
	<attribute name="href"/>
      </element>
    </choice>
  </define>

  <define name="anonymous-datatype">
    <element>
      <not>
	<nsName/>
      </not>
      <attribute name="role" global="true">
	<string>datatype</string>
      </attribute>
      <zeroOrMore>
	<choice>
	  <attribute>
	    <not>
	      <nsName/>
	    </not>
	  </attribute>
	  <ref name="any"/>
	  <anyString/>
	</choice>
      </zeroOrMore>
    </element>
  </define>

  <define name="combine-att">
    <optional>
      <attribute name="combine">
        <choice>
          <string>replace</string>
          <string>choice</string>
          <string>group</string>
          <string>interleave</string>
          <string>concur</string>
        </choice>
      </attribute>
    </optional>
  </define>
  
  <define name="open-patterns">
    <interleave>
      <ref name="other"/>
      <oneOrMore>
        <ref name="pattern"/>
      </oneOrMore>
    </interleave>
  </define>

  <define name="name-class">
    <choice>
      <element name="name">
        <ref name="common-atts"/>
        <mixed>
          <ref name="other"/>
        </mixed>
      </element>
      <element name="anyName">
        <ref name="common-atts"/>
	<ref name="other"/>
      </element>
      <element name="nsName">
        <ref name="common-atts"/>
	<ref name="other"/>
      </element>
      <element name="not">
        <ref name="common-atts"/>
	<ref name="open-name-class"/>
      </element>
      <element name="difference">
        <ref name="common-atts"/>
        <oneOrMore>
	  <ref name="open-name-class"/>
        </oneOrMore>
      </element>
      <element name="choice">
        <ref name="common-atts"/>
        <oneOrMore>
	  <ref name="open-name-class"/>
        </oneOrMore>
      </element>
    </choice>
  </define>

  <define name="open-name-class">
    <interleave>
      <ref name="other"/>
      <ref name="name-class"/>
    </interleave>
  </define>

  <define name="common-atts">
    <optional>
      <attribute name="ns"/>
    </optional>
    <zeroOrMore>
      <attribute>
        <not>
          <choice>
            <nsName/>
            <nsName ns=""/>
          </choice>
        </not>
      </attribute>
    </zeroOrMore>
  </define>

  <define name="other">
    <zeroOrMore>
      <element>
        <not>
          <nsName/>
        </not>
        <zeroOrMore>
          <choice>
            <attribute>
              <not>
                <nsName/>
              </not>
            </attribute>
            <anyString/>
            <ref name="any"/>
          </choice>
        </zeroOrMore>
      </element>
    </zeroOrMore>
  </define>

  <define name="any">
    <element>
      <anyName/>
      <zeroOrMore>
        <choice>
          <attribute>
            <anyName/>
          </attribute>
          <anyString/>
          <ref name="any"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>

</grammar>