diff --git a/todo-binxml.txt b/todo-binxml.txt new file mode 100644 index 0000000..d9ddbd7 --- /dev/null +++ b/todo-binxml.txt @@ -0,0 +1,1004 @@ +Extensible Dynamic Binary XML, +Client/Server Binary XML Format +(XDBX) +Version 1.0 +(July 14, 2010) + +Permission to copy and display the Extensible Dynamic Binary XML, Client/Server +Binary XML Format (XDBX) (the "Specification"), in any medium without fee or +royalty is hereby granted by IBM (collectively, the "Authors"), provided that you include +the following on ALL copies of the Specification, or portions thereof, that you make: +1. A link or URL to the Specification at one of the Authors websites. +2. The copyright notice as shown in the Specification. +The Authors each agree to grant you a royalty-free license, under reasonable, non- +discriminatory terms and conditions to their respective patents that they deem necessary +to implement the Specification. +THE SPECIFICATION IS PROVIDED "AS IS," AND THE AUTHORS MAKE NO +REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, +BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR +A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE; THAT THE +CONTENTS OF THE SPECIFICATION ARE SUITABLE FOR ANY PURPOSE; NOR +THAT THE IMPLEMENTATION OF SUCH CONTENTS WILL NOT INFRINGE +ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER +RIGHTS. +THE AUTHORS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, +INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR +RELATING TO ANY USE OR DISTRIBUTION OF THE SPECIFICATION. +The name and trademarks of the Authors may NOT be used in any manner, including +advertising or publicity pertaining to the Specification or its contents without specific, +written prior permission. Title to copyright in the Specification will at all times remain +with the Authors. +No other rights are granted by implication, estoppel or otherwise. +© Copyright IBM Corporation 2010. 
+ +Abstract +The solution that is presented in this document allows an encoder to produce the binary +XML format using one or more of a set of attributes. The encoder can choose which +attributes to include based on knowledge of the receiver. The receiver that reads the +binary XML format can inspect the format header to determine the attributes with which +it is encoded. This can be purely informational, or allow the receiver the opportunity to +optimize its configuration to more efficiently process the attributes contained in the +format. + +Table of Contents +1 Motivation 1 +2 Encoding overview 2 +3 Format Header 3 +3.1 Layout of the format header 3 +3.2 XDBX Major Version 3 +3.3 Encoding Flags 4 +3.3.1 Document Type 4 +3.3.2 StringID Flags 4 +3.3.3 Valid Flag 4 +3.4 Example of a Format header 5 +4 Format Content 6 +4.1 Conventions 7 +4.1.1 How Values and Lengths are Encoded 7 +4.2 Encoding of Single Documents and Sequences 9 +4.3 Encoding of XML Declarations 10 +4.4 Encoding of Elements 11 +4.5 Encoding of Attributes 12 +4.6 Encoding of Namespace Mappings 13 +4.7 Encoding of Text 13 +4.8 Encoding of Comments 14 +4.9 Encoding of Processing Instructions 14 +4.10 Encoding of Other Information 14 +4.11 Reserved Values for Tags 15 +5 Format details 16 +5.1 Encoding Single Documents and Sequences 16 +5.2 StringIDs 16 +5.2.1 Examples of StringID Usage 16 +5.3 StringID Notes 20 +5.4 Text Notes 20 +5.4.1 White Space 20 +5.5 XML Declaration Tag Notes 21 +5.6 DTD and DOCTYPE 21 +i +5.7 Namespace Notes 21 +5.8 Hint Tag Notes 22 +5.9 Empty Sequence 22 +5.10 Escaping of Characters 22 +5.11 Private Extensions 23 +5.12 Reserved Tags 24 +6 Examples 25 +6.1 Example 1 – Default encoding 25 +6.2 Example 2 – Sequence 26 +6.3 Example 3 – StringIDs 27 +6.4 Example 4 – Namespaces with StringIDs 28 +6.5 Example 5 – Mixed Content 29 +6.6 Example 6 – White Space 30 +Appendix A Complete XDBX BNF 31 +ii +1 Motivation +Binary serialization of XML is desirable because it allows encoding of 
XML data in a +smaller and more efficient form than textual XML format. The binary XML format is +more efficient for various reasons. These include: +• Multiple occurrences of repeated text are condensed through the use of StringIDs. +StringIDs are integer identifiers that replace text strings. +• When a parser processes data in a pretokenized format, the parser does not need +to search for as many token delimiters in the content, or handle as many edge +cases. +• All values are prefixed with their length. When the parser has length information, +it does not need to search for the ends of element names or values. +• All entity references are expanded in binary XML format. The XML parser does +not need to expand entity references. +The binary XML format has the following disadvantages: +• Loss of XML interoperability. Data that is in a proprietary format can be used +only on systems that have the software to decode it. +• The encoder must do extra processing to: +o Perform validation +o Perform well-formedness checking +o Resolve all entity references +o Identify repeated tags for replacement with StringIDs +This binary XML format is not intended as a replacement for XML. It can provide better +performance than XML when it is used in the implementation of some APIs. +In general, the benefits of the binary XML format outweigh the disadvantages. The +additional processing time that the encoder requires is usually less than the processing +time that is used for parsing an XML document , especially when the XML document +must be parsed more than once. +1 +2 Encoding overview +This binary XML representation contains a format header followed by a number of tags. +The format header has encoding attributes which give the receiver some useful properties +of the binary XML. +The following characteristics of the binary encoding are constant, regardless of the source +document or how the binary encoding is performed: +• All text is encoded as UTF-8. 
+• All entity references in the source document are replaced by their values. +• Line breaks are normalized. +• Attributes are normalized. +• Where applicable, data is encoded in big-endian format. +The binary XML format is made up of various tokens (tags) and values. When binary +XML format is viewed with a standard text editor or as ASCII in a debugger, the tags +display as single ASCII characters. This can aid in debugging while making the binary +XML format more humanly readable. +2 +3 Format Header +3.1 Layout of the format header +The binary XML format contains a header with information about how the format was +constructed. The header information allows the parser to configure itself in order to +process the message most efficiently. +To identify the format and its attributes, the following scheme is used for the first set of +bytes of the document: +(2 bytes) – Binary XML document identifier (“magic number”) +(1 byte) – Header length (not including magic number or the length byte itself) +(1 byte) – XDBX major version +(4 byte Integer) – Encoding flags +The “magic number” will always be this value in binary: 11001010 00111011 +DocumentContent follows the Header. HeaderLength determines the length of the +Header. +BNF +XDBX ::= Header DocumentContent +Header ::= DocIdentifier HeaderLength MajorVersion +EncodingFlags HeaderFill +DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */ +HeaderLength ::= #x5 +MajorVersion ::= #x1 +EncodingFlags ::= FourBytes +HeaderFill ::= Byte* +FourBytes ::= Byte Byte Byte Byte +Byte ::= [#x0-#xFF] +3.2 XDBX Major Version +There is just one major version of XDBX, identified by the XDBX major version value of +0x01 (version 1). In this version, the HeaderLength must be at least 5. +XDBX version 1 streams contain any of the following tags: 'e', 'X', 'x', 'z', 'a', 'Y', 'y', 'b', +'m', 'T', 'U', 'C', ‘W’, 'V', 'L', 'D', 't', 'I', 'Z', '@', 'd', 'P', 'c', 'H'. 
+The set of tags that an XDBX encoder generates is implementation defined. However, an +XDBX encoder must assign a valid XDBX major version number to each generated +stream, and ensure that each stream contains only tags that are allowed for that XDBX +major version. +3 +An XDBX decoder is required to fully support the tag set assigned to an implementation- +defined XDBX major version level. It must be able to decode all valid tags from the +corresponding tag set. However, XDBX decoders can reject XDBX streams that are +identified by an XDBX major version that is higher than the version that the decoder +supports. +3.3 Encoding Flags +The format for encoding flags allows for future expansion. Encoding flags, or features, +can be added as needed. The header consists of indicators that signal to a processor how +the format is encoded. Each encoding flag is a bit in a four-byte integer field in the +header. +The following encoding flags can be used in the binary XML format. Each encoding flag +is listed along with its value in the four-byte integer header field. +3.3.1 Document Type +This attribute indicates whether the binary stream represents one complete well-formed +XML document or a sequence of items, as defined by the XQuery 1.0 specification. +• XML document (Value: x00000000) +• XML sequence (Value: x00000001) +3.3.2 StringID Flags +The flags that are associated with stringIDs are: +• StringID flag +• Dense stringIDs used +3.3.2.1 StringID Flag (required) +This encoding flag (x00000002) must be set. +3.3.2.2 Dense StringIDs Used Flag +Certain implementations might require the stringIDs that are used in the binary XML to +be small numbers so that they can be used as indexes in an array (as opposed to a hash +table). +When specified (x00000020), this encoding flag notifies the receiver that the stringIDs +are small numbers. In general, small numbers are monotonically increasing numbers. The +stringID value 0 (zero) is reserved. 
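The header layout in section 3.1 and the flag values in section 3.3 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the dictionary keys, and the choice to raise ValueError are assumptions made for the example. The flag constants are the values listed in this specification, including the valid flag (x00000080) from section 3.3.3.

```python
import struct

MAGIC = b"\xCA\x3B"                  # binary 11001010 00111011
FLAG_SEQUENCE = 0x00000001           # unset means a single XML document
FLAG_STRING_ID = 0x00000002          # required by section 3.3.2.1
FLAG_DENSE_STRING_IDS = 0x00000020
FLAG_VALID = 0x00000080              # valid flag, section 3.3.3
SUPPORTED_MAJOR_VERSION = 1


def build_header(flags, major_version=1):
    """Build a header with no HeaderFill bytes (hypothetical helper)."""
    # One version byte followed by a signed four-byte big-endian flag field.
    body = struct.pack(">Bi", major_version, flags)
    return MAGIC + bytes([len(body)]) + body


def parse_header(data):
    """Return (info, offset) where offset is the start of DocumentContent."""
    if data[:2] != MAGIC:
        raise ValueError("not an XDBX stream: bad magic number")
    if len(data) < 3:
        raise ValueError("truncated header")
    header_length = data[2]          # excludes the magic number and this byte
    if header_length < 5:
        raise ValueError("HeaderLength must be at least 5")
    end = 3 + header_length          # any HeaderFill bytes are skipped
    if len(data) < end:
        raise ValueError("truncated header")
    major_version, flags = struct.unpack_from(">Bi", data, 3)
    if major_version > SUPPORTED_MAJOR_VERSION:
        raise ValueError("unsupported XDBX major version")
    info = {
        "major_version": major_version,
        "flags": flags,
        "is_sequence": bool(flags & FLAG_SEQUENCE),
        "string_ids": bool(flags & FLAG_STRING_ID),
        "dense_string_ids": bool(flags & FLAG_DENSE_STRING_IDS),
        "valid": bool(flags & FLAG_VALID),
    }
    return info, end
```

Calling build_header(FLAG_STRING_ID) produces the eight bytes CA 3B 05 01 00 00 00 02, which is the header shown in section 3.4.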
+3.3.3 Valid Flag +When specified (x00000080), this encoding flag notifies the receiver that the XML +document or sequence of items conforms to a schema. This may have been determined by +4 +the use of a validating XML parser, or by construction from objects that are associated +with a schema. +The use of this information by the receiver is beyond the scope of this specification. A +receiver may choose to ignore this information. +3.4 Example of a Format header +Binary XML Document Identifier: 11001010 00111011 +Header Length: 00000101 +XDBX major version: 00000001 +Encoding flags: +• Document Type (Bit 1): XML Document +• StringID (Bit 2): On +Magic Num Hdr Len Version Encoding flags +11001010 00111011 00000101 00000001 00000000 00000000 00000000 00000010 +5 +4 Format Content +The following combinations of information are used in binary XML document encoding: +• TLV - Tag-Length-Value +• TV - Tag-Value +• LV - Length-Value +• TLVid - Tag-Length-Value-StringID +• ID - StringID +Some content is denoted via a TLV, while other content uses the shorter LV. This is +done for compactness, where a second tag is unnecessary and can be inferred from the +previous tag. The specification also uses TV when the length is known to be one. In +addition, TLVid is used when StringIDs are used, and is how a first occurrence of a string +value is assigned its ID. Finally, there is an ID format if only the stringID is needed. +6 +4.1 Conventions +All the lengths are expressed as a number of bytes. +A summary of each tag in the format and its meaning is contained in the tables that +follow. The values in the Tag column are the decimal values of the tags. The values in +the ASCII column are the ASCII encoding of the tag values. +The following conventions are used: +• TLV(localname) - a TLV for the localname is defined, where 'Value' is the text of +the localname. 
+• TLV(localname) /LV(prefix)/LV(uri)- a TLV for the localname, followed by an +LV for the namespace prefix, followed by an LV for the namespace URI. +• TLVid(localname) - a TLVid for the localname is defined where stringID is the +ID assigned to the text for localname. +• Tid(localname)/id(prefix)/id(uri) - a Tag-StringID for the localname, followed by +the stringID of the namespace prefix, followed by the stringID of the namespace +URI. The StringID references a string in the dictionary. +BNF +LengthValue ::= Length Value +Length ::= VariableInteger +Value ::= Byte* +/* Number of bytes governed by preceding +length */ +StringID ::= VariableInteger +VariableInteger ::= (LongLeading | ShortLeading)? LastByte +LongLeading ::= [#x81-#x8F] +[#x80-#xFF]? [#x80-#xFF]? [#x80-#xFF]? +ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]? +LastByte ::= [#x0-#x7F] +4.1.1 How Values and Lengths are Encoded +Encoding Attributes in the format header are always encoded as signed four-byte integers +in big endian format +For space efficiency, all other values and lengths are encoded as a variable number of +bytes, with the first byte containing the highest order bits for the integer, the next byte +containing the next highest order bits, and so on. This allows the encoding to represent +any arbitrary integer in as few bytes as possible. However, this specification limits the +integer to a value representable in a signed 32 bit integer, which is 2Gbytes. Each byte +contains seven bits of the integer's value, with the highest order bit of each byte +7 +designated as a flag bit. A byte's flag bit is off if the byte is the last byte (lowest order +byte) of a variable length byte sequence for a number. Because only as many bytes as +necessary to represent an integer are used, integers between 0 and 127 are represented in +one byte with the flag bit off. Integers between 128 and 16,383 are represented in two +bytes with the flag bit set in the first byte, and so on. 
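The variable-length integer scheme described above can be sketched in Python as follows. The function names are assumptions made for this illustration. The five-byte limit follows from the VariableInteger BNF, and the upper bound follows from the signed 32-bit limit stated in this section.

```python
def encode_varint(n):
    """Encode a non-negative integer as an XDBX variable-length integer."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("value must fit in a signed 32-bit integer")
    groups = [n & 0x7F]               # lowest 7 bits, flag bit off (last byte)
    n >>= 7
    while n:
        groups.append(0x80 | (n & 0x7F))  # higher groups carry the flag bit
        n >>= 7
    return bytes(reversed(groups))    # highest-order group is written first


def decode_varint(data, pos=0):
    """Return (value, next_position) for the integer starting at data[pos]."""
    value = 0
    for count in range(5):            # 31 bits need at most five 7-bit groups
        byte = data[pos + count]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:               # flag bit off: this was the last byte
            if value > 0x7FFFFFFF:
                raise ValueError("value exceeds a signed 32-bit integer")
            return value, pos + count + 1
    raise ValueError("variable-length integer is longer than five bytes")
```

With these helpers, 5 encodes as the single byte 00000101 and 673 encodes as the two bytes 10000101 00100001. The decoder returns the position after the integer so that a caller can continue reading the value that a length governs.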
+Examples: +• A length of binary 00000101 means 5 +• A length of binary 10000101 00100001 means 673 (binary 1010100001) +8 +4.2 Encoding of Single Documents and Sequences +A binary stream can represent one complete well-formed XML document or a sequence +of items, as defined by the XQuery specification. This information is encoded in the +format header with the following encoding flags: +• XML Document (Value: x00000000) +• XML Sequence (Value: x00000001) +Each item in the sequence can be a complete document, a subtree, or an atomic value. +BNF +DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd +/* Which branch to choose is controlled +by EncodingFlags */ +DocumentEnd ::= 'Z' +XMLDocument ::= (Anywhere XMLDecl)? Misc* +(DocType | Misc*)? Element Misc* +XMLSequence ::= (SequenceItem +(SequenceSeparator SequenceItem)*)? +SequenceItem ::= Anywhere +(CompleteDoc | Comment | PI +| AtomicValue | Element) +Anywhere +SequenceSeparator ::= '@' +CompleteDoc ::= 'd' XMLDocument +Anywhere ::= (SI | Hint | Reserved)* +Misc ::= Comment | PI | SI | Hint +DocType ::= 'F' StringID StringID StringID +Tags +Value ASCII Meaning +90 Z End of the binary stream +64 @ Separator for items in an XML sequence +100 d Document node (assumed for XML documents, not assumed in XML +sequences) +70 F DOCTYPE in Tid(rootElementName) /id(systemID)/id(publicID) +9 +4.3 Encoding of XML Declarations +BNF +XMLDecl ::= XMLVersion Encoding? Standalone? +XMLVersion ::= 'L' LengthValue +/* The value is a valid XML version. +"1.0" or "1.1" for now */ +Encoding ::= 'D' LengthValue +Standalone ::= 't' BooleanValue +BooleanValue ::= False | True +False ::= #x0 +True ::= #x1 +Tags +Value ASCII Meaning +76 L XML version in TLV(version) form. +68 D Encoding in TLV(encoding) form. +116 t Standalone in TV(standalone) form where the value of 'standalone' is +either 0 or 1. 
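As an illustration of the XMLDecl production, the following Python sketch writes the 'L', 'D', and 't' tags in the order the BNF requires. The function names are hypothetical, and the small integer encoder is repeated here only so that the example is self-contained.

```python
def encode_varint(n):
    """Variable-length integer: 7 bits per byte, flag bit off on last byte."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))


def length_value(text):
    """LV form: byte length of the UTF-8 text, then the text itself."""
    data = text.encode("utf-8")
    return encode_varint(len(data)) + data


def encode_xml_decl(version, encoding=None, standalone=None):
    """Encode an XML declaration as XMLVersion Encoding? Standalone?."""
    out = b"L" + length_value(version)            # TLV(version), always present
    if encoding is not None:
        out += b"D" + length_value(encoding)      # TLV(encoding), optional
    if standalone is not None:
        out += b"t" + (b"\x01" if standalone else b"\x00")  # TV, one byte
    return out
```

For version "1.0", encoding "UTF-8", and standalone false, the result is the byte 'L', a length byte of 3, the text 1.0, the byte 'D', a length byte of 5, the text UTF-8, the byte 't', and a zero byte. Note that lengths and the boolean are binary values, not ASCII digits.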
+10 +4.4 Encoding of Elements +BNF +Element ::= (ElementI | ElementSII | ElementIII) +ElementContent +EndElement +ElementI ::= 'e' StringID +ElementSII ::= 'X' LengthValue StringID StringID StringID +ElementIII ::= 'x' StringID StringID StringID +EndElement ::= 'z' +ElementContent ::= NSDecls Attributes Children +Children ::= (Misc | Element | Text)* +Tags +Value ASCII Meaning +101 e Tid(localname) +Used when the element is not associated with a namespace. +88 X TLVid(localname) / id(prefix) / id(uri) +Used when the stringID for the element name is not yet defined. If the +element is in the default namespace, then the prefix stringID is zero. If +the element is not in a namespace, then the URI stringID is zero. +120 x Tid(localname) / id(prefix) / id(uri) +Used when the stringID for the element name is already defined. If the +element is in the default namespace, then the prefix stringID will be +zero. If the element is not in a namespace, then the URI stringID is zero. +122 z End Element +11 +4.5 Encoding of Attributes +BNF +Attributes ::= (Anywhere Attribute)* +Attribute ::= (AttributeI | AttributeSII | AttributeIII) +AttributeValue +AttributeI ::= 'a' StringID +AttributeSII ::= 'Y' LengthValue StringID StringID StringID +AttributeIII ::= ('y' | 'b') StringID StringID StringID +AttributeValue ::= LengthValue +/* If 'b' is used, then no &,',",<, +>,#xD,#xA,#x9 can appear in value */ +Tags +Value ASCII Meaning +97 a Tid(localname) / LV(attribute-value) +Used when the attribute is not associated with a namespace. +89 Y TLVid(localname) / id(prefix) / id(uri) / LV(attribute-value) +Used when the stringID for the attribute name is not yet defined. If the +attribute is not in a namespace, then the prefix stringID and URI +stringID is zero. +121 y Tid(localname) / id(prefix) / id(uri) / LV(attribute-value) +Used when the stringID for the attribute name is already defined. If the +attribute is not in a namespace, then the prefix stringID and URI +stringID is zero. 
+98 b Tid(localname) / id(prefix) / id(uri) / LV(attribute-value) +Similar to the 'y' tag. Characters that cannot be used in the value are: +• '<' (#x3c) +• '>' (#x3e) +• '&' (#x26) +• carriage return (#x0d) +• single quote (#x27) +• double quote (#x22) +• tab (#x09) +• linefeed (#x0a) +Because no characters need to be escaped when this attribute node is +serialized, this feature should speed up serialization. +12 +4.6 Encoding of Namespace Mappings +BNF +NSDecls ::= (Anywhere NSDecl)* +NSDecl ::= NSDeclII +NSDeclII ::= 'm' StringID StringID +Tags +Value ASCII Meaning +109 m Tid(prefix) /id(namespace-uri) +Declares a namespace mapping of a prefix stringID to a namespace URI +stringID. For default namespace declarations, the stringID for the prefix +is zero. +4.7 Encoding of Text +BNF +Text ::= ('T' | 'U' | 'C' | 'W') LengthValue +AtomicValue ::= 'V' LengthValue +Tags +Value ASCII Meaning +84 T Text node in TLV(text) form. +85 U Text node in TLV(text) form. The '<' (#x3c), '>' (#x3e), '&' (#x26), and +carriage return (#x0d) characters cannot be used in the value. Because +no characters need to be escaped when this text node is serialized, this +feature should speed up serialization. +67 C CDATA string in TLV(text) form. +87 W Text node containing only white space in TLV(text) form. White space +consists of one or more space (#x20) characters, carriage returns (#x0d), +line feeds (#x0a), tabs (#x09), Unicode line separator characters +(#x2028), or NELs (#x85). +Used when a text node contains only white space, unless the nearest +containing element with an xml:space attribute specifies +xml:space='preserve'. +86 V Atomic Value in TLV(text) form. +13 +4.8 Encoding of Comments +BNF +Comment ::= 'c' LengthValue +Tags +Value ASCII Meaning +99 c Comment in TLV(comment) form. +4.9 Encoding of Processing Instructions +BNF +PI ::= PII +PII ::= 'P' StringID LengthValue +Tags +Value ASCII Meaning +80 P Processing instruction in Tid(target)/LV(value) form. 
+The 'P' tag cannot declare an ID for the target of the processing instruction. Instead, an 'I' +tag should be used to define the stringID for the target. Then the 'P' tag is used to define +the processing instruction itself. +Although this is unlike the behavior for element and attribute tags, this was done to avoid +creating several tags to describe a processing instruction. +4.10 Encoding of Other Information +BNF +SI ::= 'I' LengthValue StringID +Hint ::= 'H' LengthValue LengthValue +Tags +Value ASCII Meaning +73 I Definition of a stringID in TLVid(string) form. Used only when the +StringID flag is set. +72 H Hint in TLV/LV form. +14 +4.11 Reserved Values for Tags +BNF +Reserved ::= [#xC9 - #xFA] Byte* +Tags +Value ASCII Meaning +201 +-250 +Reserved for use by applications. +Values 201 through 250 are reserved for use by applications, and will not be used as tags +in future versions of this specification. These reserved values can be used to define +private extensions to the format for features not accounted for in this version of the +specification. See the Private Extensions section on page 23 for more information. +15 +5 Format details +This section provides additional details on the binary XML format. +5.1 Encoding Single Documents and Sequences +Whether an XDBX instance represents an XML document or a sequence of items is +encoded in the XDBX header. Most commonly, the binary stream represents an XML +Document. In this case, the document node as defined by the XML data models is +assumed. In other words, there is no need to start the document with a 'd' tag. If the binary +stream represents an XML Sequence, then the document node is not assumed, and any +document node in the stream needs to be denoted with a 'd' tag. Note that XPath behaves +differently whether there is a document node or not. +It is important to note that if stringIDs are used, the encoder must ensure that all stringIDs +are valid from one item to the next. 
In other words, the stringIDs are global to the binary +XML stream. Combining multiple documents together as items in a sequence could have +a size advantage, because the stringIDs would need to be defined only once. +5.2 StringIDs +Usage of stringIDs results in a smaller encoding, because the StringIDs are typically +smaller than the text they represent. In addition, the use of StringIDs can allow the data +in binary XML format to be processed more efficiently. The receiver must be prepared to +manage the StringIDs that appear in the document. This requires establishing and +managing lookup tables to efficiently reconcile StringIDs with the text they represent. +In some encodings the first occurrence of the text is written as text, then where that text +appears again, it is replaced with an ID that is computed during the processing of the first +occurrence. In other encodings all text, or only a portion of the text, could be represented +by an ID, where the ID is a reference to a dictionary that is contained in the message. +A StringID can be used only after the tag that defines it. +5.2.1 Examples of StringID Usage +The following shows example encodings of namespace declarations, elements, and +attributes when StringIDs are used. +Namespace Declaration: +The namespace declaration portion of the element tag: is +encoded as I3foo1I3bar2m12, where: +• 'I' assigns the StringID '1' to "foo" and '2' to "bar" +• 'm' declares the namespace mapping of "foo" to '1' and "bar" to '2'. +16 +Suppose that the namespace prefix is reassigned to a different uri later in the document. +For example: +
+The encoding of the namespace declaration is: +I3baz3m13, where '3' is the StringID assigned to "baz". +Element with no prefix and no namespace: +The first occurrence of
is encoded as: X7Address100, where: +• 'X' is the tag indicating an element name is encoded with StringIDs, and that a +length/value/ID tuple follows defining the localname and its associated ID, +followed by the stringIDs for the namespace prefix and namespace uri. +• '7' is the length of the localname string "Address" and '1' is the assigned ID for +that string. +• '0' is the stringID for "no namespace prefix". +• '0' is the stringID for "no namespace uri". +Subsequent occurrences of
are encoded more compactly as e1, where '1' is the +StringID for the string "Address". +Element with no prefix and the default namespace: +The first occurrence of
is encoded as: X7Address104 where: +• 'X' is the tag indicating an element name is encoded with StringIDs, and that a +length/value/ID tuple follows defining the localname and its associated ID. +• '0' is the stringID for the namespace prefix (because there is none). +• '4' is the stringID of the namespace uri. +Subsequent occurrences of
are encoded more compactly as x104, where +• '1' is the StringID for the string "Address". +• '4' is the stringID for the namespace uri. +Element with prefix: +The first occurrence of is encoded as X7Address154, where: +• '1' is the StringID assigned to the string "Address". +• '5' is the stringID that was previously assigned to "foo". +• '4' is the stringID that was previously assigned to the namespace uri. +Subsequent occurrences of are encoded more compactly as x154, where +'1' is the StringID for the string "Address". +17 +Attribute with no prefix (and thus no namespace): +The first occurrence of the attribute portion of is encoded as +Y3mgr9002NO where: +• 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a +length/value/id tuple for the attribute name. +• '3' is the length of the attribute name "mgr". +• '9' is the StringID assigned the string "mgr". +• '0' for the stringID of the prefix. +• '0' for the stringID of the URI. +• '2' is the length of the attribute value: "NO". +Subsequent occurrences of the attribute portion of are encoded as +a92NO, where: +• 'a' indicates an attribute declaration with StringIDs. +• '9' is the stringID of the attribute name. +• '2' is the length/value of the attribute value: "NO". +Attribute with prefix: +The first occurrence of the attribute portion of is encoded as: +Y3mgr9542NO where: +• 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a +length/value/id tuple for the attribute name. +• '5' for the stringID for prefix. +• '4' for the stringIDs for URI. +• '3' is the length of the attribute name "mgr". +• '9' is the StringID assigned the string "mgr". +• '5' is the stringID for the prefix. +• '4' is the stringID for the URI. +• '2' is the length of the attribute value "NO". +Subsequent occurrences of the attribute portion of are encoded +more compactly as: y9542NO, where: +• 'y' is the tag indicating an attribute declaration with StringIDs. 
+• '9' the stringID for the attribute name +• '5' is the stringID for prefix. +• '4' is the stringID for URI. +• '2' the length/value of the attribute value "NO". +Elements, Text, and namespaceIDs: +This section ties together some of the concepts described above and assumes StringIDs +are used. For example: +18 + +ABC + + +The namespace declaration in the above XML is encoded as: I3foo1I3bar2m12, where: +• '1' represents the StringID for "foo". +• '2' is the StringID for "bar". +• 'm12' is the structure to identify a mapping of foo ('1') to bar ('2'). +Therefore, the first occurrence of foo:Address is encoded as follows: +X7Address912T3ABCz where: +• 'X' indicates an element name expressed in LVid form. +• '7Address' is the LV for the localname. +• '9' is the StringID for "Address". +• '12' is a reference to the namespace mapping of foo to bar. +• 'T3ABC' is the TLV for the text node and 'z' represents the end element tag. +The subsequent occurrence of foo:Address are encoded more compactly as follows: +x912C3DEFz where: +• 'x' indicates an element name expressed in id form. +• '9' is the StringID for "Address". +• '12' is a reference to the namespace mapping of foo to bar. +• 'C3DEF' is the TLV for the CDATA. +• 'z' represents the end element tag. (NOTE: The encoder could choose to encode +the CDATA as a text node via 'T'.) +The first occurrence of foo:Address must use the more expansive form of an element +name 'X', where the second occurrence can use the more compact version 'x' because the +element name is already encoded with a stringID. +The following table summarizes the encoding of an element in various forms with +StringIDs on: +No Namespace Namespace +First Occurrence Subsequent +Occurrences +First Occurrence Subsequent +Occurrences +
 X7Address100 e1 X7Address902 x902
+ N/A N/A X7Address912 x912
+The following table summarizes the encoding of an attribute in various forms with
+StringIDs on:
+              No Namespace                  Namespace
+              First         Subsequent      First         Subsequent
+              Occurrence    Occurrences     Occurrence    Occurrences
+              Y3mgr9002NO   a92NO           Y3mgr9022NO   y9022NO
+              N/A           N/A             Y3mgr9122NO   y9122NO
+19
+5.3 StringID Notes
+StringIDs are considered global. For example, if the string "Person" is given the stringID
+4, this assignment holds for the entire binary XML document. It is invalid for "Person" to
+be given a different stringID, or for 4 to be assigned to another string in the same binary
+XML document.
+The stringID value 0 (zero) is reserved and is used to mark "no namespace prefix" and
+"no namespace URI".
+5.4 Text Notes
+Multiple text and/or CDATA tags can appear one after another in order to handle
+arbitrarily large amounts of data. They are also used to encode mixed content.
+It is up to the encoder whether to encode CDATA using the 'C' tag or a 'T' tag, because
+they are semantically identical. The 'C' tag exists for applications that want to preserve
+the CDATA syntax. Beyond the difference between CDATA and text as described in the
+XML specification, this binary XML specification treats them identically.
+The 'U' tag is similar to the 'T' tag, except that the encoder guarantees that none of the
+characters in the 'U' tag need to be replaced with entity references if this text is serialized
+as XML. In other words, none of the following four characters are present in the text
+node: less-than “<” [&lt;], greater-than “>” [&gt;], ampersand “&” [&amp;], and
+carriage-return [&#xD;].
+5.4.1 White Space
+The XMLPARSE function, which may be applied to an XML document that is passed to
+the receiver, offers the options of STRIP WHITESPACE and PRESERVE
+WHITESPACE. STRIP WHITESPACE removes text nodes that contain only white
+space unless the nearest containing element with an xml:space attribute specifies
+xml:space='preserve'.
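The white space rule can be sketched in Python as follows, using the white space characters listed for the 'W' tag in section 4.7. The function names and the preserve_space parameter are assumptions made for this illustration. A real encoder would derive preserve_space from the nearest containing element that carries an xml:space attribute.

```python
# White space characters listed for the 'W' tag: space, carriage return,
# line feed, tab, Unicode line separator (#x2028), and NEL (#x85).
XDBX_WHITE_SPACE = frozenset(" \r\n\t\u2028\u0085")


def is_white_space_only(text):
    """True if text is non-empty and made up only of white space."""
    return len(text) > 0 and all(ch in XDBX_WHITE_SPACE for ch in text)


def white_space_tag(text, preserve_space=False):
    """Pick 'W' for strippable white-space-only text, otherwise 'T'."""
    if not preserve_space and is_white_space_only(text):
        return "W"
    return "T"
```

A receiver that applies STRIP WHITESPACE can then drop every 'W' node without inspecting its value, which is the purpose of the separate tag.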
+To facilitate the processing of STRIP WHITESPACE, text nodes that would be stripped +by this operation must be identified by the 'W' tag. +CDATA sections that contain white space that would be stripped by STRIP +WHITESPACE must be identified by a 'W' tag rather than a 'C' tag. This is seen in the +following examples: +Serialized XML: +Binary XML: X1a100T1 C3bcdT1 z +Serialized XML: +Binary XML: X1a100W1 W1 W1 z +or +X1a100W3 z +If a processor determines that certain white space characters can be removed (e.g. +ignorable whitespace SAX events), they should be removed instead of being encoded in a +'W' tag. +20 +5.5 XML Declaration Tag Notes +Typically, there is no XML declaration in binary XML. After all, the binary XML +encoding is always UTF-8. However, if the XML version is not 1.0, then the XML +declaration is mandatory, just like in serialized XML. +If the XML declaration tags are present in the binary XML, the tags must include the +version tag, however, the encoding and standalone tags are optional. +Example encodings: +Serialized XML: +Binary XML: L31.0D5UTF-8t0 +Serialized XML: +Binary XML: L31.1D6UTF-16 +Serialized XML: +Binary XML: L31.0t1 +Serialized XML: +Binary XML: L31.1 +The XML declaration tags are informational only and therefore optional. They provide +the binary encoding with the information provided in the XML declaration of the source +document. For example, all text is encoded as UTF-8 in the binary encoding, even if the +source document used UTF-16. The fact that the source document used UTF-16 can be +communicated using these tags. +5.6 DTD and DOCTYPE +This specification defines a tag for the DOCTYPE. This tag cannot describe an internal +DTD. +5.7 Namespace Notes +Each namespace declaration in the source XML document needs to have a corresponding +'m' tag in the binary encoding, even if the namespace mapping is being declared again. +For example: + +... + + +... 
+
+For the encoding of the Name and Person elements, both must contain an explicit
+namespace mapping using the 'm' tag.
+The namespace declarations appear immediately after the element tag in which they were
+declared.
+21
+An undeclared default namespace is encoded as m00. Elements within undeclared
+namespaces can be encoded with the 'e' tag, or with the 'X' or 'x' tag with 00 for the
+prefix and URI StringIDs. Attributes with undeclared namespaces can be encoded with
+the 'a' tag, or with the 'Y' or 'y' tag with 00 for the prefix and URI StringIDs.
+5.8 Hint Tag Notes
+The hint tag is a way to add arbitrary information to the binary encoding. This is
+analogous to the use of the XML schema's xsd:appinfo. It consists of a TLV followed by
+an LV. The 'H' tag indicates that some information is contained in its value field that
+defines what is contained in the following LV. If the reader sees the initial TLV and does
+not understand or want to process it, it can use the length of the following LV to skip it.
+Otherwise, the reader can consume the information. For example, if validation was
+performed in a database with a schema in the database's schema repository, then the
+encoder may want to record exactly which schema it was validated with and could do so
+using this form. Therefore, the encoding could be:
+H11schema-used12http://x.y.z
+5.9 Empty Sequence
+XQuery defines an empty sequence. This is represented in the binary stream as a header
+followed by a 'Z' tag.
+5.10 Escaping of Characters
+The tags U and b enable XDBX to record that none of the characters in a text node or
+attribute value need to be escaped via an entity reference. The goal of this feature is to
+speed up serialization of the XDBX binary stream. When either of these tags is used, none
+of the characters in the text or attribute value need to be examined to determine if they
+need escaping.
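As a rough illustration of the check an encoder makes before choosing 'U' or 'b', the following Python sketch uses the character lists given for these tags in sections 4.5 and 4.7. The function names are hypothetical. The attribute helper covers only the case where the stringID for the attribute name is already defined, because 'b' is described as a variant of 'y'.

```python
# Characters that rule out the 'U' tag for a text node.
TEXT_NEEDS_ESCAPE = frozenset("\r&><")
# Characters that rule out the 'b' tag for an attribute value.
ATTR_NEEDS_ESCAPE = frozenset("\r&><'\"\t\n")


def choose_text_tag(text):
    """Return 'U' if the text can be serialized without escaping, else 'T'."""
    if any(ch in TEXT_NEEDS_ESCAPE for ch in text):
        return "T"
    return "U"


def choose_attribute_tag(value):
    """Return 'b' if the value can be serialized without escaping, else 'y'."""
    if any(ch in ATTR_NEEDS_ESCAPE for ch in value):
        return "y"
    return "b"
```

The encoder pays for one scan of each value so that a later serializer can copy 'U' and 'b' values to the output without examining them.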
+The 'U' tag can only be used if the text node contains none of the following characters: +• carriage return +• ampersand +• greater than +• less than +The 'b' tag can only be used if the attribute value contains none of the following characters: +• carriage return +• ampersand +• greater than +• less than +• single quote +• double quote +• tab +• linefeed +Note that this only applies to serialization to Unicode. Serialization to other encodings +might require numeric character references due to the lack of encodings for certain +characters in certain codepages. +5.11 Private Extensions +Assuming agreement between a sender and receiver, the specification allows for the +definition and use of private extensions. This allows the format to support additional +features that are not explicitly documented here. One example is encoding typed +data in elements and attributes in a specific, non-text format. This allows the +encoder to encode the data in the optimal form for the receiver. For example, +consider the element "weight" that is of type float: +<weight>75.4</weight> +Using one of the reserved tags, the encoder can inform the receiver of an alternative, +more efficient, encoding. This is also useful for user-defined types. Assuming StringIDs +are off, the preceding element could be encoded as: +2016weight002407xxxxxxxz +Where: +• '#x201' is a reserved tag defined by the encoder and receiver to define this special +element encoding. +• '6' is the length of the string "weight". +• '0' is the prefix length. +• '0' is the URI length. +• '#x240' is another reserved tag used to indicate that the data is encoded as an IEEE +float. +• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary +encoding of the value as a float. +Similarly, to encode attribute values, another reserved tag is used. 
For example: +Joe +Assuming StringIDs are off, the attribute portion of this element could be encoded as: +2106weight002407xxxxxxx +Where: +• '#x210' is the reserved tag defined by the encoder and receiver to define this +special attribute encoding. +• '6' is the length of the string "weight". +• '0' is the prefix length. +• '0' is the URI length. +• '#x240' is another reserved tag used to indicate that the data is encoded as an IEEE +float. +• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary +encoding of the value as a float. +5.12 Reserved Tags +The set of reserved tags is for use by encoders that have agreed with their receivers on +the meaning of those tags. These reserved tags will not be reassigned for use in future versions of +this specification, thus ensuring forward and backward compatibility for implementations +that choose to use them. +6 Examples +This section presents examples of serialized XML and the corresponding +binary XML format when various encoding attributes are used. +Note: The serialized XML values provided in these examples are shown with line breaks +and indentation to make them more readable. These characters are not included in the +byte counts shown in the example statistics. +6.1 Example 1 – Default encoding +This example shows an XML document and its binary encoding with all the default +encoding flags. +Encoding Flags: +Document Type: XML Document +StringIDs (required): On +XML: + +Joe +Susan +Bill + +Binary XML (Excluding the header): +X4root100X4name200Y3mgr3002NOT3Joezx200T5Susanzx200T4BillzzZ +Element, attribute, prefix, and URI IDs are: +1==root, 2==name, 3==mgr +Statistics: +75 bytes of XML +60 bytes of binary + 8 byte header = 68 bytes of binary XML +6.2 Example 2 – Sequence +This example shows an XML sequence with multiple items, including a comment node, a +document node, an element node, and an atomic value. In the binary XML, 'b' is used to +denote blanks. 
+Encoding Flags: +Document Type: XML Sequence +StringIDs (required): On +XML: + + Joe +Susan +Bill +Binary XML (Excluding the header): +c7comment@dX4name100Y3mgr2002NOT7bbJoebbz@V5Susan@x100T4BillzZ +Element, attribute, prefix, and URI IDs are: +1==name, 2==mgr +Statistics: +67 bytes of XML +62 bytes of binary + 8 byte header = 70 bytes of binary XML +6.3 Example 3 – StringIDs +This example shows an XML document and its binary encoding with StringIDs on. +Encoding Flags: +Document Type: XML Document +StringIDs (required): On +XML: + + +Bill +35 + + +Joe +45 + + +Binary XML (Excluding the header): +I3foo1I3bar2X4root300m12X6Person400X4name500Y3mgr6002NOT4BillzX3age712T +235zze4e5a62NOT3Joezx712T245zzzZ +Element, attribute, prefix, and URI IDs are: +1==foo, 2==bar, 3==root, 4==Person, 5==name, 6==mgr, 7==age +Statistics: +162 bytes of XML +103 bytes of binary + 8 byte header = 111 bytes of binary XML +6.4 Example 4 – Namespaces with StringIDs +This example shows an XML document with multiple namespaces and its binary +encoding with StringIDs. +Encoding Flags: +Document Type: XML Document +StringIDs (required): On +XML: + + +Bill +35 + + +Joe +45 + + +Susan + + +Amy + + +Binary XML (Excluding the header): +X4root100I3foo2I3bar3X6Person400m23X4name500Y3mgr6002NOT4BillzX3age723T +235zzI3baz8e4m28e5y6282NOT3Joezx728T245zzI4food9e4m39e5y6393YEST5Susanz +ze4m32e5Y4exec10323YEST3AmyzzzZ +Element, attribute, prefix, and URI IDs are: +1==root, 2==foo, 3==bar, 4==Person, 5==name, 6==mgr, 7==age, 8==baz, 9==food, +10==exec +Statistics: +322 bytes of XML +173 bytes of binary + 8 byte header = 181 bytes of binary XML +6.5 Example 5 – Mixed Content +This example shows how mixed content is encoded. 
+Encoding Flags: +Document Type: XML Document +StringIDs (required): On +XML: +<a>text<b/>more text</a> +Binary XML (Excluding the header): +X1a100T4textX1b200zT9more textzZ +Element, attribute, prefix, and URI IDs are: +1==a, 2==b +Statistics: +24 bytes of XML +32 bytes of binary + 8 byte header = 40 bytes of binary XML +6.6 Example 6 – White Space +This example shows a binary XML document with all of the white space characters that +are shown in the corresponding serialized XML document. In the binary XML, 'b' is used +to denote a blank and 'a' is used to indicate a linefeed character. +Encoding Flags: +Document Type: XML Document +StringIDs (required): On +XML: + +Susan Smith +
+MA +
+
+Binary XML (Excluding the header): +X8employee100W4abbbX4name200I3xml3I5space4y4308preserveX2fn600T5SusanzT1 +bX2ln700T5SmithzzW4abbbX7address800y4307defaultW7abbbbbbX5state900T2MAz +W4abbbzW1azZ +Element, attribute, prefix, and URI IDs are: +1==employee, 2==name, 3==xml, 4==space, 5==name, 6==fn, 7==ln, 8==address, +9==state +Statistics: +160 bytes of XML +155 bytes of binary + 8 byte header = 163 bytes of binary XML +Appendix A Complete XDBX BNF +XDBX ::= Header DocumentContent +Header ::= DocIdentifier HeaderLength MajorVersion +EncodingFlags HeaderFill +DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */ +HeaderLength ::= #x5 +MajorVersion ::= #x1 +EncodingFlags ::= FourBytes +HeaderFill ::= Byte* +DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd +/* Which branch to choose is controlled +by EncodingFlags */ +DocumentEnd ::= 'Z' +XMLSequence ::= (SequenceItem ('@' SequenceItem)*)? +SequenceItem ::= Anywhere +(CompleteDoc | Comment | PI | AtomicValue +| Element) +Anywhere +CompleteDoc ::= 'd' XMLDocument +AtomicValue ::= 'V' LengthValue +XMLDocument ::= (Anywhere XMLDecl)? Misc* +(DocType | Misc*)? Element Misc* +Anywhere ::= (SI | Hint | Reserved)* +Misc ::= Comment | PI | SI | Hint +DocType ::= 'F' StringID StringID StringID +XMLDecl ::= XMLVersion Encoding? Standalone? +XMLVersion ::= 'L' LengthValue +/* The value is a valid XML version. 
"1.0" +or "1.1" for now */ +Encoding ::= 'D' LengthValue +Standalone ::= 't' BooleanValue +Element ::= (ElementI | ElementSII | ElementIII) +ElementContent +EndElement +ElementI ::= 'e' StringID +ElementSII ::= 'X' LengthValue StringID StringID StringID +31 +ElementIII ::= 'x' StringID StringID StringID +EndElement ::= 'z' +ElementContent ::= NSDecls Attributes Children +NSDecls ::= (Anywhere NSDecl)* +NSDecl ::= NSDeclII +NSDeclII ::= 'm' StringID StringID +Attributes ::= (Anywhere Attribute)* +Attribute ::= (AttributeI | AttributeSII | AttributeIII) +AttributeValue +AttributeI ::= 'a' StringID +AttributeSII ::= 'Y' LengthValue StringID StringID StringID +AttributeIII ::= ('y' | 'b') StringID StringID StringID +AttributeValue ::= LengthValue +/* If 'b' is used, then no &,',", +<,>,#xD,#xA,#x9 can appear in value */ +Children ::= (Misc | Element | Text)* +Text ::= ('T' | 'U' | 'C' | 'W' ) LengthValue +Comment ::= 'c' LengthValue +PI ::= PII +PII ::= 'P' StringID LengthValue +SI ::= 'I' LengthValue StringID +Hint ::= 'H' LengthValue LengthValue +Reserved ::= [#xC9 - #xFA] Byte* +LengthValue ::= Length Value +Length ::= VariableInteger +Value ::= Byte* +/* Number of bytes governed by preceding +length */ +StringID ::= VariableInteger +TypeID ::= VariableInteger +VariableInteger ::= (LongLeading | ShortLeading)? LastByte +LongLeading ::= [#x81-#x8F] [#x80-#xFF]? [#x80-#xFF]? +[#x80-#xFF]? +ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]? +LastByte ::= [#x0-#x7F] +32 +BooleanValue ::= False | True +False ::= #x0 +True ::= #x1 +FourBytes ::= Byte Byte Byte Byte +Byte ::= [#x0-#xFF] +33 \ No newline at end of file