Extensible Dynamic Binary XML, Client/Server Binary XML Format (XDBX)
Version 1.0 (July 14, 2010)

Permission to copy and display the Extensible Dynamic Binary XML, Client/Server Binary XML Format (XDBX) (the "Specification"), in any medium without fee or royalty is hereby granted by IBM (collectively, the "Authors"), provided that you include the following on ALL copies of the Specification, or portions thereof, that you make:

1. A link or URL to the Specification at one of the Authors websites.
2. The copyright notice as shown in the Specification.

The Authors each agree to grant you a royalty-free license, under reasonable, non-discriminatory terms and conditions to their respective patents that they deem necessary to implement the Specification.

THE SPECIFICATION IS PROVIDED "AS IS," AND THE AUTHORS MAKE NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE; THAT THE CONTENTS OF THE SPECIFICATION ARE SUITABLE FOR ANY PURPOSE; NOR THAT THE IMPLEMENTATION OF SUCH CONTENTS WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.

THE AUTHORS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATING TO ANY USE OR DISTRIBUTION OF THE SPECIFICATION.

The name and trademarks of the Authors may NOT be used in any manner, including advertising or publicity pertaining to the Specification or its contents without specific, written prior permission. Title to copyright in the Specification will at all times remain with the Authors.

No other rights are granted by implication, estoppel or otherwise.

© Copyright IBM Corporation 2010.

Abstract

The solution that is presented in this document allows an encoder to produce the binary XML format using one or more of a set of attributes. The encoder can choose which attributes to include based on knowledge of the receiver.
The receiver that reads the binary XML format can inspect the format header to determine the attributes with which it is encoded. This can be purely informational, or allow the receiver the opportunity to optimize its configuration to more efficiently process the attributes contained in the format.

Table of Contents

1 Motivation
2 Encoding overview
3 Format Header
3.1 Layout of the format header
3.2 XDBX Major Version
3.3 Encoding Flags
3.3.1 Document Type
3.3.2 StringID Flags
3.3.3 Valid Flag
3.4 Example of a Format header
4 Format Content
4.1 Conventions
4.1.1 How Values and Lengths are Encoded
4.2 Encoding of Single Documents and Sequences
4.3 Encoding of XML Declarations
4.4 Encoding of Elements
4.5 Encoding of Attributes
4.6 Encoding of Namespace Mappings
4.7 Encoding of Text
4.8 Encoding of Comments
4.9 Encoding of Processing Instructions
4.10 Encoding of Other Information
4.11 Reserved Values for Tags
5 Format details
5.1 Encoding Single Documents and Sequences
5.2 StringIDs
5.2.1 Examples of StringID Usage
5.3 StringID Notes
5.4 Text Notes
5.4.1 White Space
5.5 XML Declaration Tag Notes
5.6 DTD and DOCTYPE
5.7 Namespace Notes
5.8 Hint Tag Notes
5.9 Empty Sequence
5.10 Escaping of Characters
5.11 Private Extensions
5.12 Reserved Tags
6 Examples
6.1 Example 1 – Default encoding
6.2 Example 2 – Sequence
6.3 Example 3 – StringIDs
6.4 Example 4 – Namespaces with StringIDs
6.5 Example 5 – Mixed Content
6.6 Example 6 – White Space
Appendix A Complete XDBX BNF

1 Motivation

Binary serialization of XML is desirable because it allows encoding of XML data in a smaller and more efficient form than textual XML format. The binary XML format is more efficient for various reasons. These include:

• Multiple occurrences of repeated text are condensed through the use of StringIDs. StringIDs are integer identifiers that replace text strings.
• When a parser processes data in a pretokenized format, the parser does not need to search for as many token delimiters in the content, or handle as many edge cases.
• All values are prefixed with their length. When the parser has length information, it does not need to search for the ends of element names or values.
• All entity references are expanded in binary XML format. The XML parser does not need to expand entity references.

The binary XML format has the following disadvantages:

• Loss of XML interoperability. Data that is in a proprietary format can be used only on systems that have the software to decode it.
• The encoder must do extra processing to:
  o Perform validation
  o Perform well-formedness checking
  o Resolve all entity references
  o Identify repeated tags for replacement with StringIDs

This binary XML format is not intended as a replacement for XML. It can provide better performance than XML when it is used in the implementation of some APIs. In general, the benefits of the binary XML format outweigh the disadvantages. The additional processing time that the encoder requires is usually less than the processing time that is used for parsing an XML document, especially when the XML document must be parsed more than once.

2 Encoding overview

This binary XML representation contains a format header followed by a number of tags. The format header has encoding attributes which give the receiver some useful properties of the binary XML. The following characteristics of the binary encoding are constant, regardless of the source document or how the binary encoding is performed:

• All text is encoded as UTF-8.
• All entity references in the source document are replaced by their values.
• Line breaks are normalized.
• Attributes are normalized.
• Where applicable, data is encoded in big-endian format.

The binary XML format is made up of various tokens (tags) and values.
When binary XML format is viewed with a standard text editor or as ASCII in a debugger, the tags display as single ASCII characters. This aids debugging and makes the binary XML format more human-readable.

3 Format Header

3.1 Layout of the format header

The binary XML format contains a header with information about how the format was constructed. The header information allows the parser to configure itself in order to process the message most efficiently.

To identify the format and its attributes, the following scheme is used for the first set of bytes of the document:

(2 bytes) – Binary XML document identifier ("magic number")
(1 byte) – Header length (not including magic number or the length byte itself)
(1 byte) – XDBX major version
(4 byte Integer) – Encoding flags

The "magic number" will always be this value in binary: 11001010 00111011

DocumentContent follows the Header. HeaderLength determines the length of the Header.

BNF
XDBX ::= Header DocumentContent
Header ::= DocIdentifier HeaderLength MajorVersion EncodingFlags HeaderFill
DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */
HeaderLength ::= #x5
MajorVersion ::= #x1
EncodingFlags ::= FourBytes
HeaderFill ::= Byte*
FourBytes ::= Byte Byte Byte Byte
Byte ::= [#x0-#xFF]

3.2 XDBX Major Version

There is just one major version of XDBX, identified by the XDBX major version value of 0x01 (version 1). In this version, the HeaderLength must be at least 5. XDBX version 1 streams contain any of the following tags: 'e', 'X', 'x', 'z', 'a', 'Y', 'y', 'b', 'm', 'T', 'U', 'C', 'W', 'V', 'L', 'D', 't', 'I', 'Z', '@', 'd', 'P', 'c', 'H'.

The set of tags that an XDBX encoder generates is implementation defined. However, an XDBX encoder must assign a valid XDBX major version number to each generated stream, and ensure that each stream contains only tags that are allowed for that XDBX major version.
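As a rough illustration of the header layout in section 3.1, the following Python sketch splits a byte stream into its header fields and the document content. The function name `parse_header`, the returned tuple shape, and the error messages are illustrative assumptions and not part of this specification; the flags field is read as an unsigned integer purely to keep bit tests simple.

```python
import struct

MAGIC = b"\xCA\x3B"  # DocIdentifier from the BNF (binary 11001010 00111011)


def parse_header(data: bytes):
    """Return (major_version, flags, header_fill, document_content).

    Hypothetical helper: the name and return shape are assumptions.
    """
    if len(data) < 3 or data[:2] != MAGIC:
        raise ValueError("not an XDBX stream: bad or missing magic number")
    header_len = data[2]  # excludes the magic number and the length byte
    if header_len < 5:
        raise ValueError("HeaderLength must be at least 5 in version 1")
    header = data[3:3 + header_len]
    if len(header) < header_len:
        raise ValueError("truncated header")
    major_version = header[0]
    # Four-byte big-endian flags field; read unsigned here for convenience.
    (flags,) = struct.unpack(">I", header[1:5])
    header_fill = header[5:]          # HeaderFill bytes, if any
    content = data[3 + header_len:]   # DocumentContent follows the Header
    return major_version, flags, header_fill, content


if __name__ == "__main__":
    stream = bytes([0xCA, 0x3B, 0x05, 0x01, 0x00, 0x00, 0x00, 0x02]) + b"Z"
    print(parse_header(stream))
```

Because HeaderLength is read from the stream rather than hard-coded, a reader written this way skips any HeaderFill bytes without needing to understand them.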
An XDBX decoder is required to fully support the tag set assigned to an implementation-defined XDBX major version level. It must be able to decode all valid tags from the corresponding tag set. However, XDBX decoders can reject XDBX streams that are identified by an XDBX major version that is higher than the version that the decoder supports.

3.3 Encoding Flags

The format for encoding flags allows for future expansion. Encoding flags, or features, can be added as needed. The header consists of indicators that signal to a processor how the format is encoded. Each encoding flag is a bit in a four-byte integer field in the header.

The following encoding flags can be used in the binary XML format. Each encoding flag is listed along with its value in the four-byte integer header field.

3.3.1 Document Type

This attribute indicates whether the binary stream represents one complete well-formed XML document or a sequence of items, as defined by the XQuery 1.0 specification.

• XML document (Value: x00000000)
• XML sequence (Value: x00000001)

3.3.2 StringID Flags

The flags that are associated with stringIDs are:

• StringID flag
• Dense stringIDs used

3.3.2.1 StringID Flag (required)

This encoding flag (x00000002) must be set.

3.3.2.2 Dense StringIDs Used Flag

Certain implementations might require the stringIDs that are used in the binary XML to be small numbers so that they can be used as indexes in an array (as opposed to a hash table). When specified (x00000020), this encoding flag notifies the receiver that the stringIDs are small numbers. In general, small numbers are monotonically increasing numbers. The stringID value 0 (zero) is reserved.

3.3.3 Valid Flag

When specified (x00000080), this encoding flag notifies the receiver that the XML document or sequence of items conforms to a schema. This may have been determined by the use of a validating XML parser, or by construction from objects that are associated with a schema.
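The flag values listed in section 3.3 can be tested with simple bit masks. The sketch below is one possible way for a receiver to interpret the four-byte field; the constant names and the `describe_flags` helper are hypothetical, chosen only for this illustration.

```python
# Flag values taken from section 3.3; the constant names are assumptions.
SEQUENCE = 0x00000001          # unset means a single XML document
STRING_ID = 0x00000002         # required in this version
DENSE_STRING_IDS = 0x00000020
VALID = 0x00000080


def describe_flags(flags: int) -> dict:
    """Summarize the encoding flags of an XDBX header (illustrative only)."""
    if not flags & STRING_ID:
        raise ValueError("StringID flag (x00000002) must be set")
    return {
        "document_type": "sequence" if flags & SEQUENCE else "document",
        "dense_string_ids": bool(flags & DENSE_STRING_IDS),
        "valid": bool(flags & VALID),
    }


if __name__ == "__main__":
    print(describe_flags(0x00000002))
```

A receiver might, for example, pick an array-backed StringID table when `dense_string_ids` is true and a hash table otherwise.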
The use of this information by the receiver is beyond the scope of this specification. A receiver may choose to ignore this information.

3.4 Example of a Format header

Binary XML Document Identifier: 11001010 00111011
Header Length: 00000101
XDBX major version: 00000001
Encoding flags:
• Document Type (Bit 1): XML Document
• StringID (Bit 2): On

Magic Num            Hdr Len    Version    Encoding flags
11001010 00111011    00000101   00000001   00000000 00000000 00000000 00000010

4 Format Content

The following combinations of information are used in binary XML document encoding:

• TLV - Tag-Length-Value
• TV - Tag-Value
• LV - Length-Value
• TLVid - Tag-Length-Value-StringID
• ID - StringID

Some content is denoted via a TLV, while other content uses the shorter LV. This is done for compactness, where a second tag is unnecessary and can be inferred from the previous tag. The specification also uses TV when the length is known to be one. In addition, TLVid is used when StringIDs are used, and is how a first occurrence of a string value is assigned its ID. Finally, there is an ID format if only the stringID is needed.

4.1 Conventions

All the lengths are expressed as a number of bytes.

A summary of each tag in the format and its meaning is contained in the tables that follow. The values in the Tag column are the decimal values of the tags. The values in the ASCII column are the ASCII encoding of the tag values.

The following conventions are used:

• TLV(localname) - a TLV for the localname is defined, where 'Value' is the text of the localname.
• TLV(localname)/LV(prefix)/LV(uri) - a TLV for the localname, followed by an LV for the namespace prefix, followed by an LV for the namespace URI.
• TLVid(localname) - a TLVid for the localname is defined where stringID is the ID assigned to the text for localname.
• Tid(localname)/id(prefix)/id(uri) - a Tag-StringID for the localname, followed by the stringID of the namespace prefix, followed by the stringID of the namespace URI.
The StringID references a string in the dictionary.

BNF
LengthValue ::= Length Value
Length ::= VariableInteger
Value ::= Byte* /* Number of bytes governed by preceding length */
StringID ::= VariableInteger
VariableInteger ::= (LongLeading | ShortLeading)? LastByte
LongLeading ::= [#x81-#x8F] [#x80-#xFF]? [#x80-#xFF]? [#x80-#xFF]?
ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]?
LastByte ::= [#x0-#x7F]

4.1.1 How Values and Lengths are Encoded

Encoding Attributes in the format header are always encoded as signed four-byte integers in big-endian format.

For space efficiency, all other values and lengths are encoded as a variable number of bytes, with the first byte containing the highest order bits for the integer, the next byte containing the next highest order bits, and so on. This allows the encoding to represent any arbitrary integer in as few bytes as possible. However, this specification limits the integer to a value representable in a signed 32-bit integer, which caps lengths at 2 Gbytes.

Each byte contains seven bits of the integer's value, with the highest order bit of each byte designated as a flag bit. A byte's flag bit is off if the byte is the last byte (lowest order byte) of a variable length byte sequence for a number. Because only as many bytes as necessary to represent an integer are used, integers between 0 and 127 are represented in one byte with the flag bit off. Integers between 128 and 16,383 are represented in two bytes with the flag bit set in the first byte, and so on.

Examples:

• A length of binary 00000101 means 5
• A length of binary 10000101 00100001 means 673 (binary 1010100001)

4.2 Encoding of Single Documents and Sequences

A binary stream can represent one complete well-formed XML document or a sequence of items, as defined by the XQuery specification.
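Returning briefly to the variable-length integers of section 4.1.1: the rules there (seven value bits per byte, most significant group first, flag bit off only on the last byte) can be sketched in Python as follows. The function names and the five-byte cap derived from the signed 32-bit limit are assumptions made for this illustration.

```python
MAX_VALUE = 0x7FFFFFFF  # largest value representable in a signed 32-bit integer


def encode_varint(n: int) -> bytes:
    """Encode n using the fewest bytes, highest-order bits first."""
    if not 0 <= n <= MAX_VALUE:
        raise ValueError("value out of range for this format")
    groups = [n & 0x7F]              # last byte: flag bit off
    n >>= 7
    while n:
        groups.append(0x80 | (n & 0x7F))  # earlier bytes: flag bit on
        n >>= 7
    return bytes(reversed(groups))


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position) for the integer starting at pos."""
    value = 0
    for offset in range(5):          # 31 bits never need more than 5 bytes
        if pos + offset >= len(data):
            raise ValueError("truncated variable integer")
        byte = data[pos + offset]
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:         # flag bit off marks the last byte
            if value > MAX_VALUE:
                raise ValueError("value exceeds signed 32-bit range")
            return value, pos + offset + 1
    raise ValueError("variable integer longer than five bytes")


if __name__ == "__main__":
    print(encode_varint(673).hex())        # the two-byte example from 4.1.1
    print(decode_varint(bytes([0b10000101, 0b00100001])))
```

Returning the next position alongside the value lets a reader walk a stream of LV pairs without copying slices.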
This information is encoded in the format header with the following encoding flags:

• XML Document (Value: x00000000)
• XML Sequence (Value: x00000001)

Each item in the sequence can be a complete document, a subtree, or an atomic value.

BNF
DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd /* Which branch to choose is controlled by EncodingFlags */
DocumentEnd ::= 'Z'
XMLDocument ::= (Anywhere XMLDecl)? Misc* (DocType | Misc*)? Element Misc*
XMLSequence ::= (SequenceItem (SequenceSeparator SequenceItem)*)?
SequenceItem ::= Anywhere (CompleteDoc | Comment | PI | AtomicValue | Element) Anywhere
SequenceSeparator ::= '@'
CompleteDoc ::= 'd' XMLDocument
Anywhere ::= (SI | Hint | Reserved)*
Misc ::= Comment | PI | SI | Hint
DocType ::= 'F' StringID StringID StringID

Tags
Value   ASCII   Meaning
90      Z       End of the binary stream
64      @       Separator for items in an XML sequence
100     d       Document node (assumed for XML documents, not assumed in XML sequences)
70      F       DOCTYPE in Tid(rootElementName)/id(systemID)/id(publicID)

4.3 Encoding of XML Declarations

BNF
XMLDecl ::= XMLVersion Encoding? Standalone?
XMLVersion ::= 'L' LengthValue /* The value is a valid XML version. "1.0" or "1.1" for now */
Encoding ::= 'D' LengthValue
Standalone ::= 't' BooleanValue
BooleanValue ::= False | True
False ::= #x0
True ::= #x1

Tags
Value   ASCII   Meaning
76      L       XML version in TLV(version) form.
68      D       Encoding in TLV(encoding) form.
116     t       Standalone in TV(standalone) form where the value of 'standalone' is either 0 or 1.

4.4 Encoding of Elements

BNF
Element ::= (ElementI | ElementSII | ElementIII) ElementContent EndElement
ElementI ::= 'e' StringID
ElementSII ::= 'X' LengthValue StringID StringID StringID
ElementIII ::= 'x' StringID StringID StringID
EndElement ::= 'z'
ElementContent ::= NSDecls Attributes Children
Children ::= (Misc | Element | Text)*

Tags
Value   ASCII   Meaning
101     e       Tid(localname) Used when the element is not associated with a namespace.
88      X       TLVid(localname)/id(prefix)/id(uri) Used when the stringID for the element name is not yet defined. If the element is in the default namespace, then the prefix stringID is zero. If the element is not in a namespace, then the URI stringID is zero.
120     x       Tid(localname)/id(prefix)/id(uri) Used when the stringID for the element name is already defined. If the element is in the default namespace, then the prefix stringID is zero. If the element is not in a namespace, then the URI stringID is zero.
122     z       End Element

4.5 Encoding of Attributes

BNF
Attributes ::= (Anywhere Attribute)*
Attribute ::= (AttributeI | AttributeSII | AttributeIII) AttributeValue
AttributeI ::= 'a' StringID
AttributeSII ::= 'Y' LengthValue StringID StringID StringID
AttributeIII ::= ('y' | 'b') StringID StringID StringID
AttributeValue ::= LengthValue /* If 'b' is used, then no &,',",<, >,#xD,#xA,#x9 can appear in value */

Tags
Value   ASCII   Meaning
97      a       Tid(localname)/LV(attribute-value) Used when the attribute is not associated with a namespace.
89      Y       TLVid(localname)/id(prefix)/id(uri)/LV(attribute-value) Used when the stringID for the attribute name is not yet defined. If the attribute is not in a namespace, then the prefix stringID and URI stringID are zero.
121     y       Tid(localname)/id(prefix)/id(uri)/LV(attribute-value) Used when the stringID for the attribute name is already defined. If the attribute is not in a namespace, then the prefix stringID and URI stringID are zero.
98      b       Tid(localname)/id(prefix)/id(uri)/LV(attribute-value) Similar to the 'y' tag. Characters that cannot be used in the value are:
                • '<' (#x3c)
                • '>' (#x3e)
                • '&' (#x26)
                • carriage return (#x0d)
                • single quote (#x27)
                • double quote (#x22)
                • tab (#x09)
                • linefeed (#x0a)
                Because no characters need to be escaped when this attribute node is serialized, this feature should speed up serialization.
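To make the element and attribute tag forms of sections 4.4 and 4.5 concrete, here is a minimal Python sketch of an encoder that emits the defining form ('X' or 'Y') the first time a name is seen and a compact form afterwards. The class name `NameEncoder`, the sequential ID allocation, and the choice to emit 'e' or 'a' whenever both namespace IDs are zero are assumptions of this sketch; the specification leaves such choices to the encoder, and prefix and URI StringIDs are assumed to have been defined elsewhere.

```python
def varint(n: int) -> bytes:
    """Variable-length integer as described in section 4.1.1."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))


def lv(text: str) -> bytes:
    """Length-Value pair; the length counts UTF-8 bytes."""
    raw = text.encode("utf-8")
    return varint(len(raw)) + raw


class NameEncoder:
    """Hypothetical helper that tracks which names already have a StringID."""

    def __init__(self, first_id: int = 1):
        self.ids = {}              # string -> StringID, global to the stream
        self.next_id = first_id    # StringID 0 is reserved

    def _name(self, define, compact, bare, local, prefix_id, uri_id):
        ns = varint(prefix_id) + varint(uri_id)
        if local not in self.ids:  # first occurrence: TLVid form
            self.ids[local] = self.next_id
            self.next_id += 1
            return define + lv(local) + varint(self.ids[local]) + ns
        if prefix_id == 0 and uri_id == 0:
            return bare + varint(self.ids[local])
        return compact + varint(self.ids[local]) + ns

    def element(self, local, prefix_id=0, uri_id=0) -> bytes:
        return self._name(b"X", b"x", b"e", local, prefix_id, uri_id)

    def attribute(self, local, value, prefix_id=0, uri_id=0) -> bytes:
        name = self._name(b"Y", b"y", b"a", local, prefix_id, uri_id)
        return name + lv(value)


if __name__ == "__main__":
    enc = NameEncoder()
    print(enc.element("Address"))      # defining form
    print(enc.element("Address"))      # compact form
    print(enc.attribute("mgr", "NO"))
```

Note that the lengths and StringIDs are emitted as raw bytes here, whereas the examples in this document print them as decimal digits for readability.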
4.6 Encoding of Namespace Mappings

BNF
NSDecls ::= (Anywhere NSDecl)*
NSDecl ::= NSDeclII
NSDeclII ::= 'm' StringID StringID

Tags
Value   ASCII   Meaning
109     m       Tid(prefix)/id(namespace-uri) Declares a namespace mapping of a prefix stringID to a namespace URI stringID. For default namespace declarations, the stringID for the prefix is zero.

4.7 Encoding of Text

BNF
Text ::= ('T' | 'U' | 'C' | 'W') LengthValue
AtomicValue ::= 'V' LengthValue

Tags
Value   ASCII   Meaning
84      T       Text node in TLV(text) form.
85      U       Text node in TLV(text) form. The '<' (#x3c), '>' (#x3e), '&' (#x26), and carriage return (#x0d) characters cannot be used in the value. Because no characters need to be escaped when this text node is serialized, this feature should speed up serialization.
67      C       CDATA string in TLV(text) form.
87      W       Text node containing only white space in TLV(text) form. White space consists of one or more space (#x20) characters, carriage returns (#x0d), line feeds (#x0a), tabs (#x09), Unicode line separator characters (#x2028), or NELs (#x85). Used when a text node contains only white space, unless the nearest containing element with an xml:space attribute specifies xml:space='preserve'.
86      V       Atomic Value in TLV(text) form.

4.8 Encoding of Comments

BNF
Comment ::= 'c' LengthValue

Tags
Value   ASCII   Meaning
99      c       Comment in TLV(comment) form.

4.9 Encoding of Processing Instructions

BNF
PI ::= PII
PII ::= 'P' StringID LengthValue

Tags
Value   ASCII   Meaning
80      P       Processing instruction in Tid(target)/LV(value) form.

The 'P' tag cannot declare an ID for the target of the processing instruction. Instead, an 'I' tag should be used to define the stringID for the target. Then the 'P' tag is used to define the processing instruction itself. Although this is unlike the behavior for element and attribute tags, this was done to avoid creating several tags to describe a processing instruction.
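The two-step pattern described above (an 'I' tag to define the target's StringID, then a 'P' tag that refers to it) might look like the following in Python. The helper names, the dictionary used as a StringID table, and the sequential ID allocation are assumptions made for this sketch; the byte layouts follow the productions 'I' LengthValue StringID and 'P' StringID LengthValue.

```python
def varint(n: int) -> bytes:
    """Variable-length integer as described in section 4.1.1."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))


def lv(text: str) -> bytes:
    """Length-Value pair; the length counts UTF-8 bytes."""
    raw = text.encode("utf-8")
    return varint(len(raw)) + raw


def encode_pi(target: str, value: str, ids: dict) -> bytes:
    """Encode a processing instruction, defining the target's ID if needed.

    `ids` maps strings to StringIDs and is shared across the whole stream
    (hypothetical table; StringID 0 is reserved, so allocation starts at 1).
    """
    out = b""
    if target not in ids:
        ids[target] = len(ids) + 1
        out += b"I" + lv(target) + varint(ids[target])   # SI production
    out += b"P" + varint(ids[target]) + lv(value)        # PII production
    return out


if __name__ == "__main__":
    table = {}
    print(encode_pi("xml-stylesheet", 'href="a.xsl"', table))
    print(encode_pi("xml-stylesheet", 'href="b.xsl"', table))  # no 'I' this time
```

The second call emits only the 'P' tag, because the target's StringID is already in the table.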
4.10 Encoding of Other Information

BNF
SI ::= 'I' LengthValue StringID
Hint ::= 'H' LengthValue LengthValue

Tags
Value   ASCII   Meaning
73      I       Definition of a stringID in TLVid(string) form. Used only when the StringID flag is set.
72      H       Hint in TLV/LV form.

4.11 Reserved Values for Tags

BNF
Reserved ::= [#xC9 - #xFA] Byte*

Tags
Value     ASCII   Meaning
201-250           Reserved for use by applications.

Values 201 through 250 are reserved for use by applications, and will not be used as tags in future versions of this specification. These reserved values can be used to define private extensions to the format for features not accounted for in this version of the specification. See section 5.11, Private Extensions, for more information.

5 Format details

This section provides additional details on the binary XML format.

5.1 Encoding Single Documents and Sequences

Whether an XDBX instance represents an XML document or a sequence of items is encoded in the XDBX header.

Most commonly, the binary stream represents an XML Document. In this case, the document node as defined by the XML data models is assumed. In other words, there is no need to start the document with a 'd' tag.

If the binary stream represents an XML Sequence, then the document node is not assumed, and any document node in the stream needs to be denoted with a 'd' tag. Note that XPath behaves differently depending on whether there is a document node.

It is important to note that if stringIDs are used, the encoder must ensure that all stringIDs are valid from one item to the next. In other words, the stringIDs are global to the binary XML stream. Combining multiple documents together as items in a sequence could have a size advantage, because the stringIDs would need to be defined only once.

5.2 StringIDs

Usage of stringIDs results in a smaller encoding, because the StringIDs are typically smaller than the text they represent.
In addition, the use of StringIDs can allow the data in binary XML format to be processed more efficiently. The receiver must be prepared to manage the StringIDs that appear in the document. This requires establishing and managing lookup tables to efficiently reconcile StringIDs with the text they represent. In some encodings the first occurrence of the text is written as text, then where that text appears again, it is replaced with an ID that is computed during the processing of the first occurrence. In other encodings all text, or only a portion of the text, could be represented by an ID, where the ID is a reference to a dictionary that is contained in the message. A StringID can be used only after the tag that defines it. 5.2.1 Examples of StringID Usage The following shows example encodings of namespace declarations, elements, and attributes when StringIDs are used. Namespace Declaration: The namespace declaration portion of the element tag: is encoded as I3foo1I3bar2m12, where: • 'I' assigns the StringID '1' to "foo" and '2' to "bar" • 'm' declares the namespace mapping of "foo" to '1' and "bar" to '2'. 16 Suppose that the namespace prefix is reassigned to a different uri later in the document. For example:
The encoding of the namespace declaration is: I3baz3m13, where '3' is the StringID assigned to "baz". Element with no prefix and no namespace: The first occurrence of
is encoded as: X7Address100, where: • 'X' is the tag indicating an element name is encoded with StringIDs, and that a length/value/ID tuple follows defining the localname and its associated ID, followed by the stringIDs for the namespace prefix and namespace uri. • '7' is the length of the localname string "Address" and '1' is the assigned ID for that string. • '0' is the stringID for "no namespace prefix". • '0' is the stringID for "no namespace uri". Subsequent occurrences of
are encoded more compactly as e1, where '1' is the StringID for the string "Address". Element with no prefix and the default namespace: The first occurrence of
is encoded as: X7Address104 where: • 'X' is the tag indicating an element name is encoded with StringIDs, and that a length/value/ID tuple follows defining the localname and its associated ID. • '0' is the stringID for the namespace prefix (because there is none). • '4' is the stringID of the namespace uri. Subsequent occurrences of
are encoded more compactly as x104, where • '1' is the StringID for the string "Address". • '4' is the stringID for the namespace uri. Element with prefix: The first occurrence of is encoded as X7Address154, where: • '1' is the StringID assigned to the string "Address". • '5' is the stringID that was previously assigned to "foo". • '4' is the stringID that was previously assigned to the namespace uri. Subsequent occurrences of are encoded more compactly as x154, where '1' is the StringID for the string "Address". 17 Attribute with no prefix (and thus no namespace): The first occurrence of the attribute portion of is encoded as Y3mgr9002NO where: • 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a length/value/id tuple for the attribute name. • '3' is the length of the attribute name "mgr". • '9' is the StringID assigned the string "mgr". • '0' for the stringID of the prefix. • '0' for the stringID of the URI. • '2' is the length of the attribute value: "NO". Subsequent occurrences of the attribute portion of are encoded as a92NO, where: • 'a' indicates an attribute declaration with StringIDs. • '9' is the stringID of the attribute name. • '2' is the length/value of the attribute value: "NO". Attribute with prefix: The first occurrence of the attribute portion of is encoded as: Y3mgr9542NO where: • 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a length/value/id tuple for the attribute name. • '5' for the stringID for prefix. • '4' for the stringIDs for URI. • '3' is the length of the attribute name "mgr". • '9' is the StringID assigned the string "mgr". • '5' is the stringID for the prefix. • '4' is the stringID for the URI. • '2' is the length of the attribute value "NO". Subsequent occurrences of the attribute portion of are encoded more compactly as: y9542NO, where: • 'y' is the tag indicating an attribute declaration with StringIDs. 
• '9' the stringID for the attribute name • '5' is the stringID for prefix. • '4' is the stringID for URI. • '2' the length/value of the attribute value "NO". Elements, Text, and namespaceIDs: This section ties together some of the concepts described above and assumes StringIDs are used. For example: 18 ABC The namespace declaration in the above XML is encoded as: I3foo1I3bar2m12, where: • '1' represents the StringID for "foo". • '2' is the StringID for "bar". • 'm12' is the structure to identify a mapping of foo ('1') to bar ('2'). Therefore, the first occurrence of foo:Address is encoded as follows: X7Address912T3ABCz where: • 'X' indicates an element name expressed in LVid form. • '7Address' is the LV for the localname. • '9' is the StringID for "Address". • '12' is a reference to the namespace mapping of foo to bar. • 'T3ABC' is the TLV for the text node and 'z' represents the end element tag. The subsequent occurrence of foo:Address are encoded more compactly as follows: x912C3DEFz where: • 'x' indicates an element name expressed in id form. • '9' is the StringID for "Address". • '12' is a reference to the namespace mapping of foo to bar. • 'C3DEF' is the TLV for the CDATA. • 'z' represents the end element tag. (NOTE: The encoder could choose to encode the CDATA as a text node via 'T'.) The first occurrence of foo:Address must use the more expansive form of an element name 'X', where the second occurrence can use the more compact version 'x' because the element name is already encoded with a stringID. The following table summarizes the encoding of an element in various forms with StringIDs on: No Namespace Namespace First Occurrence Subsequent Occurrences First Occurrence Subsequent Occurrences
X7Address100       e1                       X7Address902       x902
N/A                N/A                      X7Address912       x912

The following table summarizes the encoding of an attribute in various forms with StringIDs on:

No Namespace                                Namespace
First Occurrence   Subsequent Occurrences   First Occurrence   Subsequent Occurrences
Y3mgr9002NO        a92NO                    Y3mgr9022NO        y9022NO
N/A                N/A                      Y3mgr9122NO        y9122NO

5.3 StringID Notes

StringIDs are considered global. For example, if the string "Person" is given the stringID 4, this value will exist for the entire binary XML document. It is invalid for "Person" to be given a different stringID, or for 4 to be assigned another string in the same binary XML document.

The stringID value 0 (zero) is reserved and is used to mark "no namespace prefix" and "no namespace URI".

5.4 Text Notes

Multiple text and/or CDATA tags can appear one after another in order to handle arbitrarily large amounts of data. They are also used to encode mixed content.

It is up to the encoder whether to encode CDATA using the 'C' tag or a 'T' tag, because they are semantically identical. The 'C' tag exists for applications that want to preserve the CDATA syntax. Beyond the difference between CDATA and text as described in the XML specification, this binary XML specification treats them identically.

The 'U' tag is similar to the 'T' tag, except that the encoder guarantees that none of the characters in the 'U' tag need to be replaced with entity references if this text is serialized as XML. In other words, none of the following four characters are present in the text node: less-than "<" [&lt;], greater-than ">" [&gt;], ampersand "&" [&amp;], and carriage-return [&#xD;].

5.4.1 White Space

The XMLPARSE function, which may be applied to an XML document that is passed to the receiver, offers the options of STRIP WHITESPACE and PRESERVE WHITESPACE. STRIP WHITESPACE removes text nodes that contain only white space unless the nearest containing element with an xml:space attribute specifies xml:space='preserve'.
To facilitate the processing of STRIP WHITESPACE, text nodes that would be stripped by this operation must be identified by the 'W' tag. CDATA sections that contain white space that would be stripped by STRIP WHITESPACE must be identified by a 'W' tag rather than a 'C' tag. This is seen in the following examples: Serialized XML: Binary XML: X1a100T1 C3bcdT1 z Serialized XML: Binary XML: X1a100W1 W1 W1 z or X1a100W3 z If a processor determines that certain white space characters can be removed (e.g. ignorable whitespace SAX events), they should be removed instead of being encoded in a 'W' tag. 20 5.5 XML Declaration Tag Notes Typically, there is no XML declaration in binary XML. After all, the binary XML encoding is always UTF-8. However, if the XML version is not 1.0, then the XML declaration is mandatory, just like in serialized XML. If the XML declaration tags are present in the binary XML, the tags must include the version tag, however, the encoding and standalone tags are optional. Example encodings: Serialized XML: Binary XML: L31.0D5UTF-8t0 Serialized XML: Binary XML: L31.1D6UTF-16 Serialized XML: Binary XML: L31.0t1 Serialized XML: Binary XML: L31.1 The XML declaration tags are informational only and therefore optional. They provide the binary encoding with the information provided in the XML declaration of the source document. For example, all text is encoded as UTF-8 in the binary encoding, even if the source document used UTF-16. The fact that the source document used UTF-16 can be communicated using these tags. 5.6 DTD and DOCTYPE This specification defines a tag for the DOCTYPE. This tag cannot describe an internal DTD. 5.7 Namespace Notes Each namespace declaration in the source XML document needs to have a corresponding 'm' tag in the binary encoding, even if the namespace mapping is being declared again. For example: ... ... For the encoding of the Name and Person elements, both must contain an explicit namespace mapping using the 'm' tag. 
The namespace declarations appear immediately after the element tag in which they were declared.

An undeclared default namespace is encoded as m00. Elements within undeclared namespaces can be encoded with the 'e' tag, or with the 'X' or 'x' tag with 00 for the prefix and URI StringIDs. Attributes with undeclared namespaces can be encoded with the 'a' tag, or with the 'Y' or 'y' tag with 00 for the prefix and URI StringIDs.

5.8 Hint Tag Notes

The hint tag is a way to add arbitrary information to the binary encoding. This is analogous to the use of the XML schema's xsd:appinfo. It consists of a TLV followed by an LV. The 'H' tag indicates that some information is contained in its value field that defines what is contained in the following LV. If the reader sees the initial TLV and does not understand or want to process it, it can use the length of the following LV to skip it. Otherwise, the reader can consume the information.

For example, if validation was performed in a database with a schema in the database's schema repository, then the encoder may want to record exactly which schema it was validated with and could do so using this form. Therefore, the encoding could be:

H11schema-used12http://x.y.z

5.9 Empty Sequence

XQuery defines an empty sequence. This is represented in the binary stream as a header followed by a 'Z' tag.

5.10 Escaping of Characters

The 'U' and 'b' tags enable XDBX to record that none of the characters in a text node or attribute value need to be escaped via an entity reference. The goal of this feature is to speed up serialization of the XDBX binary stream. When either of these tags is used, none of the characters in the text or attribute value need to be examined to determine if they need escaping.

The 'U' tag can only be used if none of the characters in the text nodes are:

• carriage return
• ampersand
• greater than
• less than.
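The benefit of the 'U' tag shows up on the serializing side: text tagged 'U' can be copied through unchanged, while text tagged 'T' has to be scanned and escaped. The Python sketch below illustrates that difference. The function name `serialize_text` is hypothetical, and the choice of `&#xD;` for carriage returns is an assumption of this sketch rather than a requirement stated in this specification.

```python
from xml.sax.saxutils import escape


def serialize_text(tag: str, text: str) -> str:
    """Turn a decoded text node back into XML character data (sketch)."""
    if tag == "U":
        # The encoder guaranteed that no '<', '>', '&' or carriage return
        # is present, so the text is written without inspecting it.
        return text
    if tag == "T":
        # escape() handles '&', '<' and '>'; the extra mapping covers
        # carriage returns so they survive line-break normalization.
        return escape(text, {"\r": "&#xD;"})
    raise ValueError("unsupported text tag for this sketch: " + repr(tag))


if __name__ == "__main__":
    print(serialize_text("U", "plain text"))
    print(serialize_text("T", "a < b & c"))
```

An encoder that cannot cheaply prove the guarantee can always fall back to 'T', which places no restriction on the text.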
The 'b' tag can only be used if none of the characters in the attribute value are:
• carriage return
• ampersand
• greater than
• less than
• single quote
• double quote
• tab
• linefeed

Note that this only applies to serialization to Unicode. Serialization to other encodings might require numeric character references due to the lack of encodings for certain characters in certain codepages.

5.11 Private Extensions

Assuming agreement between a sender and receiver, the specification allows for the definition and use of private extensions. This allows the format to support additional features that are not currently and explicitly documented. An example is encoding typed data in elements and attributes in a specific, non-text format. This allows the encoder to encode the data in the form best suited to the receiver. For example, consider the element "weight" that is of type float: 75.4

Using one of the reserved tags, the encoder can inform the receiver of an alternative, more efficient, encoding. This is also useful for user-defined types. Assuming StringIDs are off, the preceding element could be encoded as:

2016weight002407xxxxxxxz

Where:
• '#x201' is a reserved tag defined by the encoder and receiver to define this special element encoding.
• '6' is the length of the string "weight".
• '0' is the prefix length.
• '0' is the URI length.
• '#x240' is another reserved tag used to indicate that the data is encoded as an IEEE float.
• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary encoding of the value as a float.

Similarly, to encode attribute values, another reserved tag is used. For example: Joe

Assuming StringIDs are off, the attribute portion of this element could be encoded as:

2106weight002407xxxxxxx

Where:
• '#x210' is the reserved tag defined by the encoder and receiver to define this special attribute encoding.
• '6' is the length of the string "weight".
• '0' is the prefix length.
• '0' is the URI length.
• '#x240' is another reserved tag used to indicate that the data is encoded as an IEEE float.
• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary encoding of the value as a float.

5.12 Reserved Tags

The set of reserved tags is for use by encoders that have agreement with the receivers on their meaning. These reserved tags will not be reassigned for use in future versions of this specification, thus ensuring forward and backward compatibility for implementations that choose to use them.

6 Examples

The following section documents examples of serialized XML and the corresponding binary XML format when various encoding attributes are used.

Note: The serialized XML values provided in these examples are shown with line breaks and indentation to make them more readable. These characters are not included in the byte counts shown in the example statistics.

6.1 Example 1 – Default encoding

This example shows an XML document and its binary encoding with all the default encoding flags.

Encoding Flags:
Document Type: XML Document
StringIDs (required): On

XML:
Joe Susan Bill

Binary XML (Excluding the header):
X4root100X4name200Y3mgr3002NOT3Joezx200T5Susanzx200T4BillzzZ

Element, attribute, prefix, and URI IDs are:
1==root, 2==name, 3==mgr

Statistics:
75 bytes of XML
60 bytes of binary + 8 byte header = 68 bytes of binary XML

6.2 Example 2 – Sequence

This example shows an XML sequence with multiple items, including a comment node, a document node, an element node, and an atomic value. In the binary XML, 'b' is used to denote blanks.
Encoding Flags:
Document Type: XML Sequence
StringIDs (required): On

XML:
Joe Susan Bill

Binary XML (Excluding the header):
c7comment@dX4name100Y3mgr2002NOT7bbJoebbz@V5Susan@x100T4BillzZ

Element, attribute, prefix, and URI IDs are:
1==name, 2==mgr

Statistics:
67 bytes of XML
62 bytes of binary + 8 byte header = 70 bytes of binary XML

6.3 Example 3 – StringIDs

This example shows an XML document and its binary encoding with StringIDs on.

Encoding Flags:
Document Type: XML Document
StringIDs (required): On

XML:
Bill 35 Joe 45

Binary XML (Excluding the header):
I3foo1I3bar2X4root300m12X6Person400X4name500Y3mgr6002NOT4BillzX3age712T235zze4e5a62NOT3Joezx712T245zzzZ

Element, attribute, prefix, and URI IDs are:
1==foo, 2==bar, 3==root, 4==Person, 5==name, 6==mgr, 7==age

Statistics:
162 bytes of XML
103 bytes of binary + 8 byte header = 111 bytes of binary XML

6.4 Example 4 – Namespaces with StringIDs

This example shows an XML document with multiple namespaces and its binary encoding with StringIDs.

Encoding Flags:
Document Type: XML Document
StringIDs (required): On

XML:
Bill 35 Joe 45 Susan Amy

Binary XML (Excluding the header):
X4root100I3foo2I3bar3X6Person400m23X4name500Y3mgr6002NOT4BillzX3age723T235zzI3baz8e4m28e5y6282NOT3Joezx728T245zzI4food9e4m39e5y6393YEST5Susanzze4m32e5Y4exec10323YEST3AmyzzzZ

Element, attribute, prefix, and URI IDs are:
1==root, 2==foo, 3==bar, 4==Person, 5==name, 6==mgr, 7==age, 8==baz, 9==food, 10==exec

Statistics:
322 bytes of XML
173 bytes of binary + 8 byte header = 181 bytes of binary XML

6.5 Example 5 – Mixed Content

This example shows how mixed content is encoded.
Encoding Attributes:
Document Type: XML Document
StringIDs (required): On

XML:
textmore text

Binary XML (Excluding the header):
X1a100T4textX1b200zT9more textzZ

Element, attribute, prefix, and URI IDs are:
1==a, 2==b

Statistics:
24 bytes of XML
32 bytes of binary + 8 byte header = 40 bytes of binary XML

6.6 Example 6 – White Space

This example shows a binary XML document with all of the white space characters that are shown in the corresponding serialized XML document. In the binary XML, 'b' is used to denote a blank and 'a' is used to indicate a linefeed character.

Encoding Flags:
Document Type: XML Document
StringIDs (required): On

XML:
Susan Smith
MA

Binary XML (Excluding the header):
X8employee100W4abbbX4name200I3xml3I5space4y4308preserveX2fn600T5SusanzT1bX2ln700T5SmithzzW4abbbX7address800y4307defaultW7abbbbbbX5state900T2MAzW4abbbzW1azZ

Element, attribute, prefix, and URI IDs are:
1==employee, 2==name, 3==xml, 4==space, 5==name, 6==fn, 7==ln, 8==address, 9==state

Statistics:
160 bytes of XML
155 bytes of binary + 8 byte header = 163 bytes of binary XML

Appendix A Complete XDBX BNF

XDBX ::= Header DocumentContent
Header ::= DocIdentifier HeaderLength MajorVersion EncodingFlags HeaderFill
DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */
HeaderLength ::= #x5
MajorVersion ::= #x1
EncodingFlags ::= FourBytes
HeaderFill ::= Byte*
DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd /* Which branch to choose is controlled by EncodingFlags */
DocumentEnd ::= 'Z'
XMLSequence ::= (SequenceItem ('@' SequenceItem)*)?
SequenceItem ::= Anywhere (CompleteDoc | Comment | PI | AtomicValue | Element) Anywhere
CompleteDoc ::= 'd' XMLDocument
AtomicValue ::= 'V' LengthValue
XMLDocument ::= (Anywhere XMLDecl)? Misc* (DocType | Misc*)? Element Misc*
Anywhere ::= (SI | Hint | Reserved)*
Misc ::= Comment | PI | SI | Hint
DocType ::= 'F' StringID StringID StringID
XMLDecl ::= XMLVersion Encoding? Standalone?
XMLVersion ::= 'L' LengthValue /* The value is a valid XML version. "1.0" or "1.1" for now */
Encoding ::= 'D' LengthValue
Standalone ::= 't' BooleanValue
Element ::= (ElementI | ElementSII | ElementIII) ElementContent EndElement
ElementI ::= 'e' StringID
ElementSII ::= 'X' LengthValue StringID StringID StringID
ElementIII ::= 'x' StringID StringID StringID
EndElement ::= 'z'
ElementContent ::= NSDecls Attributes Children
NSDecls ::= (Anywhere NSDecl)*
NSDecl ::= NSDeclII
NSDeclII ::= 'm' StringID StringID
Attributes ::= (Anywhere Attribute)*
Attribute ::= (AttributeI | AttributeSII | AttributeIII) AttributeValue
AttributeI ::= 'a' StringID
AttributeSII ::= 'Y' LengthValue StringID StringID StringID
AttributeIII ::= ('y' | 'b') StringID StringID StringID
AttributeValue ::= LengthValue /* If 'b' is used, then no &,',", <,>,#xD,#xA,#x9 can appear in value */
Children ::= (Misc | Element | Text)*
Text ::= ('T' | 'U' | 'C' | 'W' ) LengthValue
Comment ::= 'c' LengthValue
PI ::= PII
PII ::= 'P' StringID LengthValue
SI ::= 'I' LengthValue StringID
Hint ::= 'H' LengthValue LengthValue
Reserved ::= [#xC9 - #xFA] Byte*
LengthValue ::= Length Value
Length ::= VariableInteger
Value ::= Byte* /* Number of bytes governed by preceding length */
StringID ::= VariableInteger
TypeID ::= VariableInteger
VariableInteger ::= (LongLeading | ShortLeading)? LastByte
LongLeading ::= [#x81-#x8F] [#x80-#xFF]? [#x80-#xFF]? [#x80-#xFF]?
ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]?
LastByte ::= [#x0-#x7F]
BooleanValue ::= False | True
False ::= #x0
True ::= #x1
FourBytes ::= Byte Byte Byte Byte
Byte ::= [#x0-#xFF]