Extensible Dynamic Binary XML,
Client/Server Binary XML Format
(XDBX)
Version 1.0
(July 14, 2010)
Permission to copy and display the Extensible Dynamic Binary XML, Client/Server
Binary XML Format (XDBX) (the "Specification"), in any medium without fee or
royalty is hereby granted by IBM (collectively, the "Authors"), provided that you include
the following on ALL copies of the Specification, or portions thereof, that you make:
1. A link or URL to the Specification at one of the Authors websites.
2. The copyright notice as shown in the Specification.
The Authors each agree to grant you a royalty-free license, under reasonable, non-
discriminatory terms and conditions to their respective patents that they deem necessary
to implement the Specification.
THE SPECIFICATION IS PROVIDED "AS IS," AND THE AUTHORS MAKE NO
REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING,
BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR
A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE; THAT THE
CONTENTS OF THE SPECIFICATION ARE SUITABLE FOR ANY PURPOSE; NOR
THAT THE IMPLEMENTATION OF SUCH CONTENTS WILL NOT INFRINGE
ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER
RIGHTS.
THE AUTHORS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL,
INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR
RELATING TO ANY USE OR DISTRIBUTION OF THE SPECIFICATION.
The name and trademarks of the Authors may NOT be used in any manner, including
advertising or publicity pertaining to the Specification or its contents without specific,
written prior permission. Title to copyright in the Specification will at all times remain
with the Authors.
No other rights are granted by implication, estoppel or otherwise.
© Copyright IBM Corporation 2010.
Abstract
The solution that is presented in this document allows an encoder to produce the binary
XML format using one or more of a set of attributes. The encoder can choose which
attributes to include based on knowledge of the receiver. The receiver that reads the
binary XML format can inspect the format header to determine the attributes with which
it is encoded. This can be purely informational, or allow the receiver the opportunity to
optimize its configuration to more efficiently process the attributes contained in the
format.
Table of Contents
1 Motivation
2 Encoding overview
3 Format Header
3.1 Layout of the format header
3.2 XDBX Major Version
3.3 Encoding Flags
3.3.1 Document Type
3.3.2 StringID Flags
3.3.3 Valid Flag
3.4 Example of a Format header
4 Format Content
4.1 Conventions
4.1.1 How Values and Lengths are Encoded
4.2 Encoding of Single Documents and Sequences
4.3 Encoding of XML Declarations
4.4 Encoding of Elements
4.5 Encoding of Attributes
4.6 Encoding of Namespace Mappings
4.7 Encoding of Text
4.8 Encoding of Comments
4.9 Encoding of Processing Instructions
4.10 Encoding of Other Information
4.11 Reserved Values for Tags
5 Format details
5.1 Encoding Single Documents and Sequences
5.2 StringIDs
5.2.1 Examples of StringID Usage
5.3 StringID Notes
5.4 Text Notes
5.4.1 White Space
5.5 XML Declaration Tag Notes
5.6 DTD and DOCTYPE
5.7 Namespace Notes
5.8 Hint Tag Notes
5.9 Empty Sequence
5.10 Escaping of Characters
5.11 Private Extensions
5.12 Reserved Tags
6 Examples
6.1 Example 1 Default encoding
6.2 Example 2 Sequence
6.3 Example 3 StringIDs
6.4 Example 4 Namespaces with StringIDs
6.5 Example 5 Mixed Content
6.6 Example 6 White Space
Appendix A Complete XDBX BNF
1 Motivation
Binary serialization of XML is desirable because it allows encoding of XML data in a
smaller and more efficient form than textual XML format. The binary XML format is
more efficient for various reasons. These include:
• Multiple occurrences of repeated text are condensed through the use of StringIDs.
StringIDs are integer identifiers that replace text strings.
• When a parser processes data in a pretokenized format, the parser does not need
to search for as many token delimiters in the content, or handle as many edge
cases.
• All values are prefixed with their length. When the parser has length information,
it does not need to search for the ends of element names or values.
• All entity references are expanded in binary XML format. The XML parser does
not need to expand entity references.
The binary XML format has the following disadvantages:
• Loss of XML interoperability. Data that is in a proprietary format can be used
only on systems that have the software to decode it.
• The encoder must do extra processing to:
o Perform validation
o Perform well-formedness checking
o Resolve all entity references
o Identify repeated tags for replacement with StringIDs
This binary XML format is not intended as a replacement for XML. It can provide better
performance than XML when it is used in the implementation of some APIs.
In general, the benefits of the binary XML format outweigh the disadvantages. The
additional processing time that the encoder requires is usually less than the processing
time that is used for parsing an XML document, especially when the XML document
must be parsed more than once.
2 Encoding overview
This binary XML representation contains a format header followed by a number of tags.
The format header has encoding attributes which give the receiver some useful properties
of the binary XML.
The following characteristics of the binary encoding are constant, regardless of the source
document or how the binary encoding is performed:
• All text is encoded as UTF-8.
• All entity references in the source document are replaced by their values.
• Line breaks are normalized.
• Attributes are normalized.
• Where applicable, data is encoded in big-endian format.
The binary XML format is made up of various tokens (tags) and values. When binary
XML format is viewed with a standard text editor or as ASCII in a debugger, the tags
display as single ASCII characters. This aids debugging and makes the binary
XML format more human-readable.
3 Format Header
3.1 Layout of the format header
The binary XML format contains a header with information about how the format was
constructed. The header information allows the parser to configure itself in order to
process the message most efficiently.
To identify the format and its attributes, the following scheme is used for the first set of
bytes of the document:
(2 bytes) Binary XML document identifier (“magic number”)
(1 byte) Header length (not including magic number or the length byte itself)
(1 byte) XDBX major version
(4-byte integer) Encoding flags
The “magic number” will always be this value in binary: 11001010 00111011
DocumentContent follows the Header. HeaderLength determines the length of the
Header.
BNF
XDBX ::= Header DocumentContent
Header ::= DocIdentifier HeaderLength MajorVersion
EncodingFlags HeaderFill
DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */
HeaderLength ::= #x5
MajorVersion ::= #x1
EncodingFlags ::= FourBytes
HeaderFill ::= Byte*
FourBytes ::= Byte Byte Byte Byte
Byte ::= [#x0-#xFF]
3.2 XDBX Major Version
There is just one major version of XDBX, identified by the XDBX major version value of
0x01 (version 1). In this version, the HeaderLength must be at least 5.
XDBX version 1 streams contain any of the following tags: 'e', 'X', 'x', 'z', 'a', 'Y', 'y', 'b',
'm', 'T', 'U', 'C', 'W', 'V', 'L', 'D', 't', 'I', 'Z', '@', 'd', 'P', 'c', 'H'.
The set of tags that an XDBX encoder generates is implementation defined. However, an
XDBX encoder must assign a valid XDBX major version number to each generated
stream, and ensure that each stream contains only tags that are allowed for that XDBX
major version.
An XDBX decoder is required to fully support the tag set assigned to an implementation-
defined XDBX major version level. It must be able to decode all valid tags from the
corresponding tag set. However, XDBX decoders can reject XDBX streams that are
identified by an XDBX major version that is higher than the version that the decoder
supports.
3.3 Encoding Flags
The format for encoding flags allows for future expansion. Encoding flags, or features,
can be added as needed. The header consists of indicators that signal to a processor how
the format is encoded. Each encoding flag is a bit in a four-byte integer field in the
header.
The following encoding flags can be used in the binary XML format. Each encoding flag
is listed along with its value in the four-byte integer header field.
3.3.1 Document Type
This attribute indicates whether the binary stream represents one complete well-formed
XML document or a sequence of items, as defined by the XQuery 1.0 specification.
• XML document (Value: x00000000)
• XML sequence (Value: x00000001)
3.3.2 StringID Flags
The flags that are associated with stringIDs are:
• StringID flag
• Dense stringIDs used
3.3.2.1 StringID Flag (required)
This encoding flag (x00000002) must be set.
3.3.2.2 Dense StringIDs Used Flag
Certain implementations might require the stringIDs that are used in the binary XML to
be small numbers so that they can be used as indexes in an array (as opposed to a hash
table).
When specified (x00000020), this encoding flag notifies the receiver that the stringIDs
are small numbers. In general, small numbers are monotonically increasing numbers. The
stringID value 0 (zero) is reserved.
3.3.3 Valid Flag
When specified (x00000080), this encoding flag notifies the receiver that the XML
document or sequence of items conforms to a schema. This may have been determined by
the use of a validating XML parser, or by construction from objects that are associated
with a schema.
The use of this information by the receiver is beyond the scope of this specification. A
receiver may choose to ignore this information.
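The flag values listed in sections 3.3.1 through 3.3.3 can be tested with simple bit masks. The following is a minimal sketch in Python; the constant names and the describe_flags function are illustrative and are not defined by this specification.

```python
# Flag values come from sections 3.3.1 to 3.3.3; the names are hypothetical.
FLAG_SEQUENCE = 0x00000001         # set: XML sequence, clear: XML document
FLAG_STRINGID = 0x00000002         # required to be set
FLAG_DENSE_STRINGIDS = 0x00000020  # stringIDs are small numbers
FLAG_VALID = 0x00000080            # content conforms to a schema


def describe_flags(flags: int) -> dict:
    """Decode the four-byte encoding flags field into named booleans."""
    if not flags & FLAG_STRINGID:
        raise ValueError("the StringID flag (x00000002) must be set")
    return {
        "sequence": bool(flags & FLAG_SEQUENCE),
        "dense_stringids": bool(flags & FLAG_DENSE_STRINGIDS),
        "valid": bool(flags & FLAG_VALID),
    }


# A plain XML document with only the required StringID flag set.
print(describe_flags(0x00000002))
```

A receiver that ignores the Valid flag, as permitted above, would simply not read the "valid" entry.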
3.4 Example of a Format header
Binary XML Document Identifier: 11001010 00111011
Header Length: 00000101
XDBX major version: 00000001
Encoding flags:
• Document Type (Bit 1): XML Document
• StringID (Bit 2): On
Magic Num          Hdr Len   Version   Encoding flags
11001010 00111011  00000101  00000001  00000000 00000000 00000000 00000010
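The example header can be reproduced and parsed with a few lines of Python. This is a sketch only: the function names build_header and parse_header are hypothetical, and the flags are packed as an unsigned big-endian integer, which yields the same bit pattern as the signed field for the flag values defined here.

```python
import struct

MAGIC = b"\xCA\x3B"  # binary 11001010 00111011


def build_header(flags: int, major_version: int = 1, fill: bytes = b"") -> bytes:
    """Build a format header; the length byte excludes the magic number and itself."""
    header_length = 1 + 4 + len(fill)  # version byte + four flag bytes + fill
    return MAGIC + struct.pack(">BBI", header_length, major_version, flags) + fill


def parse_header(data: bytes):
    """Return (major_version, flags, offset of DocumentContent)."""
    if data[:2] != MAGIC:
        raise ValueError("not an XDBX stream")
    header_length, major_version, flags = struct.unpack_from(">BBI", data, 2)
    if header_length < 5:
        raise ValueError("header length must be at least 5")
    # Skip the magic number (2 bytes), the length byte, and the header body.
    return major_version, flags, 3 + header_length


header = build_header(0x00000002)
print(header.hex())
print(parse_header(header))
```

Using the header length to locate DocumentContent, rather than a fixed offset, lets a decoder step over any HeaderFill bytes.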
4 Format Content
The following combinations of information are used in binary XML document encoding:
• TLV - Tag-Length-Value
• TV - Tag-Value
• LV - Length-Value
• TLVid - Tag-Length-Value-StringID
• ID - StringID
Some content is denoted via a TLV, while other content uses the shorter LV. This is
done for compactness, where a second tag is unnecessary and can be inferred from the
previous tag. The specification also uses TV when the length is known to be one. In
addition, TLVid is used when StringIDs are used, and is how a first occurrence of a string
value is assigned its ID. Finally, there is an ID format if only the stringID is needed.
4.1 Conventions
All the lengths are expressed as a number of bytes.
A summary of each tag in the format and its meaning is contained in the tables that
follow. The values in the Tag column are the decimal values of the tags. The values in
the ASCII column are the ASCII encoding of the tag values.
The following conventions are used:
• TLV(localname) - a TLV for the localname is defined, where 'Value' is the text of
the localname.
• TLV(localname)/LV(prefix)/LV(uri) - a TLV for the localname, followed by an
LV for the namespace prefix, followed by an LV for the namespace URI.
• TLVid(localname) - a TLVid for the localname is defined where stringID is the
ID assigned to the text for localname.
• Tid(localname)/id(prefix)/id(uri) - a Tag-StringID for the localname, followed by
the stringID of the namespace prefix, followed by the stringID of the namespace
URI. The StringID references a string in the dictionary.
BNF
LengthValue ::= Length Value
Length ::= VariableInteger
Value ::= Byte*
/* Number of bytes governed by preceding
length */
StringID ::= VariableInteger
VariableInteger ::= (LongLeading | ShortLeading)? LastByte
LongLeading ::= [#x81-#x8F]
[#x80-#xFF]? [#x80-#xFF]? [#x80-#xFF]?
ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]?
LastByte ::= [#x0-#x7F]
4.1.1 How Values and Lengths are Encoded
Encoding Attributes in the format header are always encoded as signed four-byte integers
in big-endian format.
For space efficiency, all other values and lengths are encoded as a variable number of
bytes, with the first byte containing the highest order bits for the integer, the next byte
containing the next highest order bits, and so on. This allows the encoding to represent
any arbitrary integer in as few bytes as possible. However, this specification limits the
integer to a value representable in a signed 32-bit integer, so the largest length is
2^31 - 1 bytes (just under 2 GB). Each byte contains seven bits of the integer's value,
with the highest order bit of each byte designated as a flag bit. A byte's flag bit is off
if the byte is the last byte (lowest order byte) of a variable length byte sequence for a
number. Because only as many bytes as necessary to represent an integer are used,
integers between 0 and 127 are represented in one byte with the flag bit off. Integers
between 128 and 16,383 are represented in two bytes with the flag bit set in the first
byte, and so on.
Examples:
• A length of binary 00000101 means 5
• A length of binary 10000101 00100001 means 673 (binary 1010100001)
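The variable-length integer scheme can be sketched as follows. The function names encode_varint and decode_varint are illustrative; the logic follows the description above (seven value bits per byte, high-order bits first, flag bit off only in the last byte).

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an XDBX VariableInteger."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("value must fit in a signed 32-bit integer")
    out = [n & 0x7F]                   # last byte: flag bit off
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))  # leading bytes: flag bit on
        n >>= 7
    return bytes(reversed(out))        # highest order bits come first


def decode_varint(data: bytes, pos: int = 0):
    """Decode a VariableInteger at pos; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:                # flag bit off: this was the last byte
            return value, pos


print(encode_varint(673).hex())           # the two-byte example above
print(decode_varint(bytes([0x85, 0x21])))
```

Because the encoder stops as soon as no value bits remain, it never emits a leading byte of 0x80, which matches the LongLeading and ShortLeading ranges in the BNF.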
4.2 Encoding of Single Documents and Sequences
A binary stream can represent one complete well-formed XML document or a sequence
of items, as defined by the XQuery specification. This information is encoded in the
format header with the following encoding flags:
• XML Document (Value: x00000000)
• XML Sequence (Value: x00000001)
Each item in the sequence can be a complete document, a subtree, or an atomic value.
BNF
DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd
/* Which branch to choose is controlled
by EncodingFlags */
DocumentEnd ::= 'Z'
XMLDocument ::= (Anywhere XMLDecl)? Misc*
(DocType | Misc*)? Element Misc*
XMLSequence ::= (SequenceItem
(SequenceSeparator SequenceItem)*)?
SequenceItem ::= Anywhere
(CompleteDoc | Comment | PI
| AtomicValue | Element)
Anywhere
SequenceSeparator ::= '@'
CompleteDoc ::= 'd' XMLDocument
Anywhere ::= (SI | Hint | Reserved)*
Misc ::= Comment | PI | SI | Hint
DocType ::= 'F' StringID StringID StringID
Tags
Value  ASCII  Meaning
90     Z      End of the binary stream
64     @      Separator for items in an XML sequence
100    d      Document node (assumed for XML documents, not assumed in XML
              sequences)
70     F      DOCTYPE in Tid(rootElementName)/id(systemID)/id(publicID)
4.3 Encoding of XML Declarations
BNF
XMLDecl ::= XMLVersion Encoding? Standalone?
XMLVersion ::= 'L' LengthValue
/* The value is a valid XML version.
"1.0" or "1.1" for now */
Encoding ::= 'D' LengthValue
Standalone ::= 't' BooleanValue
BooleanValue ::= False | True
False ::= #x0
True ::= #x1
Tags
Value  ASCII  Meaning
76     L      XML version in TLV(version) form.
68     D      Encoding in TLV(encoding) form.
116    t      Standalone in TV(standalone) form where the value of 'standalone' is
              either 0 or 1.
4.4 Encoding of Elements
BNF
Element ::= (ElementI | ElementSII | ElementIII)
ElementContent
EndElement
ElementI ::= 'e' StringID
ElementSII ::= 'X' LengthValue StringID StringID StringID
ElementIII ::= 'x' StringID StringID StringID
EndElement ::= 'z'
ElementContent ::= NSDecls Attributes Children
Children ::= (Misc | Element | Text)*
Tags
Value  ASCII  Meaning
101    e      Tid(localname)
              Used when the element is not associated with a namespace.
88     X      TLVid(localname) / id(prefix) / id(uri)
              Used when the stringID for the element name is not yet defined. If the
              element is in the default namespace, then the prefix stringID is zero. If
              the element is not in a namespace, then the URI stringID is zero.
120    x      Tid(localname) / id(prefix) / id(uri)
              Used when the stringID for the element name is already defined. If the
              element is in the default namespace, then the prefix stringID is zero.
              If the element is not in a namespace, then the URI stringID is zero.
122    z      End Element
4.5 Encoding of Attributes
BNF
Attributes ::= (Anywhere Attribute)*
Attribute ::= (AttributeI | AttributeSII | AttributeIII)
AttributeValue
AttributeI ::= 'a' StringID
AttributeSII ::= 'Y' LengthValue StringID StringID StringID
AttributeIII ::= ('y' | 'b') StringID StringID StringID
AttributeValue ::= LengthValue
/* If 'b' is used, then no &,',",<,
>,#xD,#xA,#x9 can appear in value */
Tags
Value  ASCII  Meaning
97     a      Tid(localname) / LV(attribute-value)
              Used when the attribute is not associated with a namespace.
89     Y      TLVid(localname) / id(prefix) / id(uri) / LV(attribute-value)
              Used when the stringID for the attribute name is not yet defined. If the
              attribute is not in a namespace, then the prefix stringID and URI
              stringID are zero.
121    y      Tid(localname) / id(prefix) / id(uri) / LV(attribute-value)
              Used when the stringID for the attribute name is already defined. If the
              attribute is not in a namespace, then the prefix stringID and URI
              stringID are zero.
98     b      Tid(localname) / id(prefix) / id(uri) / LV(attribute-value)
              Similar to the 'y' tag. Characters that cannot be used in the value are:
              • '<' (#x3c)
              • '>' (#x3e)
              • '&' (#x26)
              • carriage return (#x0d)
              • single quote (#x27)
              • double quote (#x22)
              • tab (#x09)
              • linefeed (#x0a)
              Because no characters need to be escaped when this attribute node is
              serialized, this feature should speed up serialization.
4.6 Encoding of Namespace Mappings
BNF
NSDecls ::= (Anywhere NSDecl)*
NSDecl ::= NSDeclII
NSDeclII ::= 'm' StringID StringID
Tags
Value  ASCII  Meaning
109    m      Tid(prefix)/id(namespace-uri)
              Declares a namespace mapping of a prefix stringID to a namespace URI
              stringID. For default namespace declarations, the stringID for the prefix
              is zero.
4.7 Encoding of Text
BNF
Text ::= ('T' | 'U' | 'C' | 'W') LengthValue
AtomicValue ::= 'V' LengthValue
Tags
Value  ASCII  Meaning
84     T      Text node in TLV(text) form.
85     U      Text node in TLV(text) form. The '<' (#x3c), '>' (#x3e), '&' (#x26), and
              carriage return (#x0d) characters cannot be used in the value. Because
              no characters need to be escaped when this text node is serialized, this
              feature should speed up serialization.
67     C      CDATA string in TLV(text) form.
87     W      Text node containing only white space in TLV(text) form. White space
              consists of one or more space (#x20) characters, carriage returns (#x0d),
              line feeds (#x0a), tabs (#x09), Unicode line separator characters
              (#x2028), or NELs (#x85).
              Used when a text node contains only white space, unless the nearest
              containing element with an xml:space attribute specifies
              xml:space='preserve'.
86     V      Atomic Value in TLV(text) form.
4.8 Encoding of Comments
BNF
Comment ::= 'c' LengthValue
Tags
Value ASCII Meaning
99 c Comment in TLV(comment) form.
4.9 Encoding of Processing Instructions
BNF
PI ::= PII
PII ::= 'P' StringID LengthValue
Tags
Value ASCII Meaning
80 P Processing instruction in Tid(target)/LV(value) form.
The 'P' tag cannot declare an ID for the target of the processing instruction. Instead, an 'I'
tag should be used to define the stringID for the target. Then the 'P' tag is used to define
the processing instruction itself.
Although this is unlike the behavior for element and attribute tags, this was done to avoid
creating several tags to describe a processing instruction.
4.10 Encoding of Other Information
BNF
SI ::= 'I' LengthValue StringID
Hint ::= 'H' LengthValue LengthValue
Tags
Value  ASCII  Meaning
73     I      Definition of a stringID in TLVid(string) form. Used only when the
              StringID flag is set.
72     H      Hint in TLV/LV form.
4.11 Reserved Values for Tags
BNF
Reserved ::= [#xC9 - #xFA] Byte*
Tags
Value    ASCII  Meaning
201-250         Reserved for use by applications.
Values 201 through 250 are reserved for use by applications, and will not be used as tags
in future versions of this specification. These reserved values can be used to define
private extensions to the format for features not accounted for in this version of the
specification. See section 5.11, Private Extensions, for more information.
5 Format details
This section provides additional details on the binary XML format.
5.1 Encoding Single Documents and Sequences
Whether an XDBX instance represents an XML document or a sequence of items is
encoded in the XDBX header. Most commonly, the binary stream represents an XML
Document. In this case, the document node as defined by the XML data models is
assumed. In other words, there is no need to start the document with a 'd' tag. If the binary
stream represents an XML Sequence, then the document node is not assumed, and any
document node in the stream needs to be denoted with a 'd' tag. Note that XPath behaves
differently depending on whether a document node is present.
It is important to note that if stringIDs are used, the encoder must ensure that all stringIDs
are valid from one item to the next. In other words, the stringIDs are global to the binary
XML stream. Combining multiple documents together as items in a sequence could have
a size advantage, because the stringIDs would need to be defined only once.
5.2 StringIDs
Usage of stringIDs results in a smaller encoding, because the StringIDs are typically
smaller than the text they represent. In addition, the use of StringIDs can allow the data
in binary XML format to be processed more efficiently. The receiver must be prepared to
manage the StringIDs that appear in the document. This requires establishing and
managing lookup tables to efficiently reconcile StringIDs with the text they represent.
In some encodings the first occurrence of the text is written as text, then where that text
appears again, it is replaced with an ID that is computed during the processing of the first
occurrence. In other encodings all text, or only a portion of the text, could be represented
by an ID, where the ID is a reference to a dictionary that is contained in the message.
A StringID can be used only after the tag that defines it.
5.2.1 Examples of StringID Usage
The following shows example encodings of namespace declarations, elements, and
attributes when StringIDs are used.
Namespace Declaration:
The namespace declaration portion of the element tag: <root xmlns:foo="bar"> is
encoded as I3foo1I3bar2m12, where:
• The two 'I' tags assign the StringID '1' to "foo" and '2' to "bar".
• 'm' declares the namespace mapping of the prefix "foo" (StringID '1') to the
namespace URI "bar" (StringID '2').
Suppose that the namespace prefix is reassigned to a different uri later in the document.
For example:
<Address xmlns:foo="baz">
The encoding of the namespace declaration is:
I3baz3m13, where '3' is the StringID assigned to "baz".
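The two namespace declarations above can be reproduced with a short sketch. It assumes that the digits in these examples are a readable notation for the VariableInteger bytes of section 4.1.1, and it emits that notation as text rather than raw bytes. The StringIds class and encode_ns_decl function are hypothetical names, and allocating IDs as 1, 2, 3, ... is one possible policy, not a requirement.

```python
class StringIds:
    """Hands out StringIDs on first use, emitting an 'I' definition each time."""

    def __init__(self):
        self.ids = {}

    def define(self, text: str, out: list) -> int:
        if text not in self.ids:
            self.ids[text] = len(self.ids) + 1  # 0 is reserved
            length = len(text.encode("utf-8"))  # lengths count bytes
            out.append(f"I{length}{text}{self.ids[text]}")
        return self.ids[text]


def encode_ns_decl(prefix: str, uri: str, ids: StringIds) -> str:
    """Return the readable form of a namespace mapping ('m' tag)."""
    out = []
    prefix_id = ids.define(prefix, out) if prefix else 0  # 0: default namespace
    uri_id = ids.define(uri, out) if uri else 0
    out.append(f"m{prefix_id}{uri_id}")
    return "".join(out)


ids = StringIds()
print(encode_ns_decl("foo", "bar", ids))  # first declaration
print(encode_ns_decl("foo", "baz", ids))  # prefix reassigned to a new URI
```

The second call emits only one 'I' tag because "foo" already has a StringID; the new 'm' tag reuses it.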
Element with no prefix and no namespace:
The first occurrence of <Address> is encoded as: X7Address100, where:
• 'X' is the tag indicating an element name is encoded with StringIDs, and that a
length/value/ID tuple follows defining the localname and its associated ID,
followed by the stringIDs for the namespace prefix and namespace uri.
• '7' is the length of the localname string "Address" and '1' is the assigned ID for
that string.
• '0' is the stringID for "no namespace prefix".
• '0' is the stringID for "no namespace uri".
Subsequent occurrences of <Address> are encoded more compactly as e1, where '1' is the
StringID for the string "Address".
Element with no prefix and the default namespace:
The first occurrence of <Address> is encoded as: X7Address104 where:
• 'X' is the tag indicating an element name is encoded with StringIDs, and that a
length/value/ID tuple follows defining the localname and its associated ID.
• '0' is the stringID for the namespace prefix (because there is none).
• '4' is the stringID of the namespace uri.
Subsequent occurrences of <Address> are encoded more compactly as x104, where
• '1' is the StringID for the string "Address".
• '4' is the stringID for the namespace uri.
Element with prefix:
The first occurrence of <foo:Address> is encoded as X7Address154, where:
• '1' is the StringID assigned to the string "Address".
• '5' is the stringID that was previously assigned to "foo".
• '4' is the stringID that was previously assigned to the namespace uri.
Subsequent occurrences of <foo:Address> are encoded more compactly as x154, where
'1' is the StringID for the string "Address".
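The choice among the 'X', 'x', and 'e' tags in the element examples above can be sketched as follows. As in the examples, the output uses digits as a readable stand-in for VariableInteger bytes. The element_start function and its ID allocation policy are hypothetical; the prefix and URI StringIDs are assumed to have been defined earlier in the stream.

```python
def element_start(localname: str, ids: dict, prefix_id: int = 0, uri_id: int = 0) -> str:
    """Return the readable encoding of an element start tag.

    ids maps localnames to StringIDs that are already defined.
    """
    if localname not in ids:
        # First occurrence: define the StringID inline with the 'X' tag.
        ids[localname] = max(ids.values(), default=0) + 1
        length = len(localname.encode("utf-8"))
        return f"X{length}{localname}{ids[localname]}{prefix_id}{uri_id}"
    if prefix_id == 0 and uri_id == 0:
        # No namespace: the short 'e' form carries only the name's StringID.
        return f"e{ids[localname]}"
    return f"x{ids[localname]}{prefix_id}{uri_id}"


ids = {}
print(element_start("Address", ids))  # first occurrence, no namespace
print(element_start("Address", ids))  # subsequent occurrence

ids = {}
print(element_start("Address", ids, prefix_id=5, uri_id=4))  # <foo:Address>
print(element_start("Address", ids, prefix_id=5, uri_id=4))
```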
Attribute with no prefix (and thus no namespace):
The first occurrence of the attribute portion of <name mgr="NO"> is encoded as
Y3mgr9002NO where:
• 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a
length/value/id tuple for the attribute name.
• '3' is the length of the attribute name "mgr".
• '9' is the StringID assigned the string "mgr".
• '0' for the stringID of the prefix.
• '0' for the stringID of the URI.
• '2' is the length of the attribute value: "NO".
Subsequent occurrences of the attribute portion of <name mgr="NO"> are encoded as
a92NO, where:
• 'a' indicates an attribute declaration with StringIDs.
• '9' is the stringID of the attribute name.
• '2' is the length/value of the attribute value: "NO".
Attribute with prefix:
The first occurrence of the attribute portion of <name foo:mgr="NO"> is encoded as:
Y3mgr9542NO where:
• 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a
length/value/id tuple for the attribute name.
• '3' is the length of the attribute name "mgr".
• '9' is the StringID assigned to the string "mgr".
• '5' is the stringID for the prefix.
• '4' is the stringID for the URI.
• '2' is the length of the attribute value "NO".
Subsequent occurrences of the attribute portion of <name foo:mgr="NO"> are encoded
more compactly as: y9542NO, where:
• 'y' is the tag indicating an attribute declaration with StringIDs.
• '9' is the stringID for the attribute name.
• '5' is the stringID for the prefix.
• '4' is the stringID for the URI.
• '2' is the length of the attribute value "NO".
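The attribute examples follow the same pattern as elements: 'Y' on first use, then 'a' or 'y'. A minimal sketch, again emitting the readable notation of the examples rather than raw bytes, is shown below. The encode_attribute function and its new_id parameter are hypothetical; new_id is the StringID the encoder would assign if the name has none yet.

```python
def encode_attribute(localname: str, value: str, ids: dict, new_id: int,
                     prefix_id: int = 0, uri_id: int = 0) -> str:
    """Return the readable encoding of one attribute.

    ids maps attribute names to StringIDs that are already defined.
    """
    value_lv = f"{len(value.encode('utf-8'))}{value}"  # LV(attribute-value)
    if localname not in ids:
        ids[localname] = new_id
        length = len(localname.encode("utf-8"))
        return f"Y{length}{localname}{new_id}{prefix_id}{uri_id}{value_lv}"
    if prefix_id == 0 and uri_id == 0:
        return f"a{ids[localname]}{value_lv}"          # no namespace
    return f"y{ids[localname]}{prefix_id}{uri_id}{value_lv}"


ids = {}
print(encode_attribute("mgr", "NO", ids, new_id=9))  # first occurrence
print(encode_attribute("mgr", "NO", ids, new_id=9))  # subsequent occurrence
```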
Elements, Text, and namespaceIDs:
This section ties together some of the concepts described above and assumes StringIDs
are used. For example:
<root xmlns:foo="bar">
<foo:Address>ABC</foo:Address>
<foo:Address><![CDATA[DEF]]></foo:Address>
</root>
The namespace declaration in the above XML is encoded as: I3foo1I3bar2m12, where:
• '1' represents the StringID for "foo".
• '2' is the StringID for "bar".
• 'm12' is the structure to identify a mapping of foo ('1') to bar ('2').
Therefore, the first occurrence of foo:Address is encoded as follows:
X7Address912T3ABCz where:
• 'X' indicates an element name expressed in LVid form.
• '7Address' is the LV for the localname.
• '9' is the StringID for "Address".
• '12' is a reference to the namespace mapping of foo to bar.
• 'T3ABC' is the TLV for the text node and 'z' represents the end element tag.
The subsequent occurrence of foo:Address is encoded more compactly as follows:
x912C3DEFz where:
• 'x' indicates an element name expressed in id form.
• '9' is the StringID for "Address".
• '12' is a reference to the namespace mapping of foo to bar.
• 'C3DEF' is the TLV for the CDATA.
• 'z' represents the end element tag. (NOTE: The encoder could choose to encode
the CDATA as a text node via 'T'.)
The first occurrence of foo:Address must use the more expansive form of an element
name, 'X', whereas the second occurrence can use the more compact version, 'x', because
the element name is already encoded with a stringID.
The following table summarizes the encoding of an element in various forms with
StringIDs on:
               No Namespace                  Namespace
               First         Subsequent      First         Subsequent
               Occurrence    Occurrences     Occurrence    Occurrences
<Address>      X7Address100  e1              X7Address902  x902
<foo:Address>  N/A           N/A             X7Address912  x912
The following table summarizes the encoding of an attribute in various forms with
StringIDs on:
                No Namespace                 Namespace
                First        Subsequent      First        Subsequent
                Occurrence   Occurrences     Occurrence   Occurrences
<mgr="NO">      Y3mgr9002NO  a92NO           Y3mgr9022NO  y9022NO
<foo:mgr="NO">  N/A          N/A             Y3mgr9122NO  y9122NO
5.3 StringID Notes
StringIDs are considered global. For example, if the string "Person" is given the stringID
4, this value will exist for the entire binary XML document. It is invalid for "Person" to
be given a different stringID, or for 4 to be assigned another string in the same binary
XML document.
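A receiver could enforce this rule with a pair of lookup tables. The sketch below is one possible approach; the StringTable class is hypothetical, and it assumes that repeating an identical definition is harmless, which this specification does not state either way.

```python
class StringTable:
    """Tracks StringID definitions that are global to one binary XML stream."""

    def __init__(self):
        self._by_id = {}
        self._by_text = {}

    def define(self, string_id: int, text: str) -> None:
        if string_id == 0:
            raise ValueError("stringID 0 is reserved")
        if self._by_id.get(string_id, text) != text:
            raise ValueError(f"stringID {string_id} is already assigned another string")
        if self._by_text.get(text, string_id) != string_id:
            raise ValueError(f"{text!r} already has a different stringID")
        self._by_id[string_id] = text
        self._by_text[text] = string_id

    def lookup(self, string_id: int):
        """Return the string for an ID; 0 means 'no prefix' or 'no URI'."""
        if string_id == 0:
            return None
        return self._by_id[string_id]  # KeyError if used before its definition


table = StringTable()
table.define(4, "Person")
print(table.lookup(4))
```

When the Dense StringIDs flag is set, the dictionary keyed by ID could be replaced by a plain list indexed by the StringID.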
The stringID value 0 (zero) is reserved and is used to mark "no namespace prefix" and
"no namespace URI".
5.4 Text Notes
Multiple text and/or CDATA tags can appear one after another in order to handle
arbitrarily large amounts of data. They are also used to encode mixed content.
It is up to the encoder whether to encode CDATA using the 'C' tag or a 'T' tag, because
they are semantically identical. The 'C' tag exists for applications that want to preserve
the CDATA syntax. Beyond the difference between CDATA and text as described in the
XML specification, this binary XML specification treats them identically.
The 'U' tag is similar to the 'T' tag, except that the encoder guarantees that none of the
characters in the 'U' tag need to be replaced with entity references if this text is serialized
as XML. In other words, none of the following four characters are present in the text
node: less-than “<” [&lt;], greater-than “>” [&gt;], ampersand “&” [&amp;], and
carriage-return [&#xD;].
5.4.1 White Space
The XMLPARSE function, which may be applied to an XML document that is passed to
the receiver, offers the options of STRIP WHITESPACE and PRESERVE
WHITESPACE. STRIP WHITESPACE removes text nodes that contain only white
space unless the nearest containing element with an xml:space attribute specifies
xml:space='preserve'.
To facilitate the processing of STRIP WHITESPACE, text nodes that would be stripped
by this operation must be identified by the 'W' tag.
CDATA sections that contain white space that would be stripped by STRIP
WHITESPACE must be identified by a 'W' tag rather than a 'C' tag. This is seen in the
following examples:
Serialized XML: <a> <![CDATA[bcd]]> </a>
Binary XML: X1a100T1 C3bcdT1 z
Serialized XML: <a> <![CDATA[ ]]> </a>
Binary XML: X1a100W1 W1 W1 z
or
X1a100W3 z
If a processor determines that certain white space characters can be removed (e.g.
ignorable whitespace SAX events), they should be removed instead of being encoded in a
'W' tag.
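The choice between the 'W' tag and the 'T' tag can be sketched as a small predicate. The function name text_tag and the preserve_space parameter are hypothetical; preserve_space stands for "the nearest containing element with an xml:space attribute specifies xml:space='preserve'". The set of white space characters is the one listed for the 'W' tag in section 4.7.

```python
# Space, carriage return, line feed, tab, Unicode line separator, and NEL.
WHITE_SPACE = frozenset(" \r\n\t\u2028\u0085")


def text_tag(text: str, preserve_space: bool = False) -> str:
    """Return 'W' for a text node that STRIP WHITESPACE would remove, else 'T'."""
    only_white_space = bool(text) and all(ch in WHITE_SPACE for ch in text)
    if only_white_space and not preserve_space:
        return "W"
    return "T"


print(text_tag("   "))                       # white space only
print(text_tag(" bcd "))                     # contains other characters
print(text_tag("   ", preserve_space=True))  # xml:space='preserve' in effect
```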
5.5 XML Declaration Tag Notes
Typically, there is no XML declaration in binary XML. After all, the binary XML
encoding is always UTF-8. However, if the XML version is not 1.0, then the XML
declaration is mandatory, just like in serialized XML.
If the XML declaration tags are present in the binary XML, the tags must include the
version tag; however, the encoding and standalone tags are optional.
Example encodings:
Serialized XML: <?xml version="1.0" encoding="UTF-8"
standalone="no" ?>
Binary XML: L31.0D5UTF-8t0
Serialized XML: <?xml version="1.1" encoding="UTF-16" ?>
Binary XML: L31.1D6UTF-16
Serialized XML: <?xml version="1.0" standalone="yes" ?>
Binary XML: L31.0t1
Serialized XML: <?xml version="1.1" ?>
Binary XML: L31.1
The XML declaration tags are informational only and therefore optional. They provide
the binary encoding with the information provided in the XML declaration of the source
document. For example, all text is encoded as UTF-8 in the binary encoding, even if the
source document used UTF-16. The fact that the source document used UTF-16 can be
communicated using these tags.
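The four example encodings above can be generated by a short function. This sketch assumes that the digits in the examples are a readable notation for the length and Boolean bytes, and it returns that notation as text; encode_xml_decl is a hypothetical name.

```python
def encode_xml_decl(version: str, encoding: str = None, standalone: bool = None) -> str:
    """Return the readable form of the XML declaration tags ('L', 'D', 't')."""
    out = f"L{len(version.encode('utf-8'))}{version}"         # version is mandatory
    if encoding is not None:
        out += f"D{len(encoding.encode('utf-8'))}{encoding}"  # optional
    if standalone is not None:
        out += f"t{1 if standalone else 0}"                   # optional, TV form
    return out


print(encode_xml_decl("1.0", "UTF-8", False))
print(encode_xml_decl("1.1", "UTF-16"))
print(encode_xml_decl("1.0", standalone=True))
print(encode_xml_decl("1.1"))
```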
5.6 DTD and DOCTYPE
This specification defines a tag for the DOCTYPE. This tag cannot describe an internal
DTD.
5.7 Namespace Notes
Each namespace declaration in the source XML document needs to have a corresponding
'm' tag in the binary encoding, even if the namespace mapping is being declared again.
For example:
<Name xmlns:foo="bar">
...
</Name>
<Person xmlns:foo="bar">
...
</Person>
For the encoding of the Name and Person elements, both must contain an explicit
namespace mapping using the 'm' tag.
The namespace declarations appear immediately after the element tag in which they were
declared.
An undeclared default namespace is encoded as m00. Elements within undeclared
namespaces can be encoded with the 'e' tag, or with the 'X' or 'x' tag with 00 for the
prefix and URI StringIDs. Attributes with undeclared namespaces can be encoded with the
'a' tag, or with the 'Y' or 'y' tag with 00 for the prefix and URI StringIDs.
5.8 Hint Tag Notes
The hint tag is a way to add arbitrary information to the binary encoding. This is
analogous to the use of the XML schema's xsd:appinfo. It consists of a TLV followed by
an LV. The 'H' tag indicates that some information is contained in its value field that
defines what is contained in the following LV. If the reader sees the initial TLV and does
not understand or want to process it, it can use the length of the following LV to skip it.
Otherwise, the reader can consume the information. For example, if validation was
performed in a database with a schema in the database's schema repository, then the
encoder may want to record exactly which schema it was validated with and could do so
using this form. Therefore, the encoding could be:
H11schema-used12http://x.y.z
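The skip-or-consume behavior described above can be sketched in Python as follows. This
is a minimal sketch under one loudly stated assumption: both lengths fit in a single byte
below #x80, so multi-byte VariableInteger lengths are not handled. The name read_hint is
hypothetical.

```python
def read_hint(buf, pos=0):
    """Read one hint starting at buf[pos]; return (name, payload, next_pos).

    Assumption (not from the specification text): each length is a single
    byte below 0x80. Multi-byte lengths are rejected by this sketch.
    """
    if buf[pos:pos + 1] != b"H":
        raise ValueError("not a hint tag")
    pos += 1
    fields = []
    for _ in range(2):  # first the value of the 'H' TLV, then the following LV
        length = buf[pos]
        if length >= 0x80:
            raise NotImplementedError("multi-byte lengths are not handled here")
        pos += 1
        if pos + length > len(buf):
            raise ValueError("truncated hint")
        fields.append(buf[pos:pos + length])
        pos += length
    return fields[0], fields[1], pos


# The hint from the example above, followed by the document end tag.
stream = b"H" + bytes([11]) + b"schema-used" + bytes([12]) + b"http://x.y.z" + b"Z"
name, payload, next_pos = read_hint(stream)
```

A reader that does not recognize name can ignore payload and resume decoding at
next_pos; a reader that does recognize it can consume payload.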
5.9 Empty Sequence
XQuery defines an empty sequence. This is represented in the binary stream as a header
followed by a 'Z' tag.
5.10 Escaping of Characters
The 'U' and 'b' tags enable XDBX to record that none of the characters in a text node or
attribute value need to be escaped via an entity reference. The goal of this feature is to
speed up serialization of the XDBX binary stream. When either of these tags is used, the
characters in the text or attribute value do not need to be examined to determine whether
they need escaping.
The 'U' tag can only be used if none of the characters in the text nodes are:
• carriage return
• ampersand
• greater than
• less than.
The 'b' tag can only be used if none of the characters in the attribute values are:
• carriage return
• ampersand
• greater than
• less than
• single quote
• double quote
• tab
• linefeed
Note that this only applies to serialization to Unicode. Serialization to other encodings
might require numeric character references due to the lack of encodings for certain
characters in certain codepages.
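An encoder could decide whether the 'U' or 'b' tag is permitted with a check along the
following lines. This is only a sketch of the two character lists above; the names
can_use_U and can_use_b are hypothetical.

```python
# Characters that rule out the 'U' tag for text nodes.
TEXT_SPECIAL = frozenset("\r&><")
# Attribute values additionally rule out quotes, tab, and linefeed for 'b'.
ATTR_SPECIAL = TEXT_SPECIAL | frozenset("'\"\t\n")


def can_use_U(text):
    """True if no character in the text node requires escaping."""
    return not any(ch in TEXT_SPECIAL for ch in text)


def can_use_b(value):
    """True if no character in the attribute value requires escaping."""
    return not any(ch in ATTR_SPECIAL for ch in value)
```

For instance, the text "it's" qualifies for 'U' but the same string as an attribute
value does not qualify for 'b', because of the single quote.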
5.11 Private Extensions
Assuming agreement between a sender and receiver, the specification allows for the
definition and use of private extensions. This allows the format to support additional
features that are not currently and explicitly documented. An example of this is
encoding typed data in elements and attributes in a specific, non-text format. This allows
the encoder to encode the data in the form best suited to the receiver. For example,
consider the element "weight" that is of type float:
<weight>75.4</weight>
Using one of the reserved tags, the encoder can inform the receiver of an alternative,
more efficient, encoding. This is also useful for user-defined types. Assuming StringIDs
are off, the preceding element could be encoded as:
2016weight002407xxxxxxxz
Where:
• '#x201' is a reserved tag defined by the encoder and receiver to define this special
element encoding.
• '6' is the length of the string "weight"
• '0' is the prefix length.
• '0' is the URI length.
• '#x240' is another reserved tag used to indicate that the data is encoded as an IEEE
float.
• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary
encoding of the value as a float.
Similarly, to encode attribute values, another reserved tag is used. For example:
<Person weight = "75.4">Joe</Person>
Assuming StringIDs are off, the attribute portion of this element could be encoded as:
2106weight002407xxxxxxx
Where:
• '#x210' is the reserved tag defined by the encoder and receiver to define this
special attribute encoding.
• '6' is the length of the string "weight".
• '0' is the prefix length.
• '0' is the URI length.
• '#x240' is another reserved tag used to indicate that the data is encoded as an IEEE
float.
• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary
encoding of the value as a float.
5.12 Reserved Tags
The set of reserved tags is for use by encoders that have agreement with the receivers on
their meaning. These reserved tags will not be reassigned for use in future versions of
this specification, thus ensuring forward and backward compatibility for implementations
that choose to use them.
6 Examples
The following section documents examples of serialized XML and the corresponding
binary XML format when various encoding attributes are used.
Note: The serialized XML values provided in these examples are shown with line breaks
and indentation to make them more readable. These characters are not included in the
byte counts shown in the example statistics.
6.1 Example 1 Default encoding
This example shows an XML document and its binary encoding with all the default
encoding flags.
Encoding Flags:
Document Type: XML Document
StringIDs (required): On
XML:
<root>
<name mgr = "NO">Joe</name>
<name>Susan</name>
<name>Bill</name>
</root>
Binary XML (Excluding the header):
X4root100X4name200Y3mgr3002NOT3Joezx200T5Susanzx200T4BillzzZ
Element, attribute, prefix, and URI IDs are:
1==root, 2==name, 3==mgr
Statistics:
75 bytes of XML
60 bytes of binary + 8 byte header = 68 bytes of binary XML
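To make the mapping in this example easier to follow, the Python sketch below regenerates
the readable notation shown above from the serialized XML. It is a simplified
illustration, not the actual binary encoder: it assumes no namespaces (prefix and URI
StringIDs are always 0), no comments or processing instructions, and input without
indentation white space. It prints lengths and StringIDs as decimal digits, as the
examples do. The name encode_document is hypothetical.

```python
import xml.etree.ElementTree as ET


def encode_document(xml_text):
    """Return the readable XDBX notation (without header) for a simple document."""
    ids = {}  # name -> StringID, assigned in order of first appearance

    def lv(s):
        # Length (in UTF-8 bytes) followed by the value.
        return f"{len(s.encode('utf-8'))}{s}"

    def name_tag(first, again, name):
        # The first use defines the StringID; later uses refer to it by ID only.
        if name in ids:
            return f"{again}{ids[name]}00"
        ids[name] = len(ids) + 1
        return f"{first}{lv(name)}{ids[name]}00"

    def text_tag(text):
        return "T" + lv(text) if text else ""

    def encode(elem):
        out = name_tag("X", "x", elem.tag)
        for attr, value in elem.attrib.items():
            out += name_tag("Y", "y", attr) + lv(value)
        out += text_tag(elem.text)
        for child in elem:
            # Text after a child element belongs to the parent's content.
            out += encode(child) + text_tag(child.tail)
        return out + "z"

    return encode(ET.fromstring(xml_text)) + "Z"


doc = '<root><name mgr = "NO">Joe</name><name>Susan</name><name>Bill</name></root>'
print(encode_document(doc))
```

Under the same assumptions, the sketch also reproduces the notation shown for the mixed
content document in Example 5.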
6.2 Example 2 Sequence
This example shows an XML sequence with multiple items, including a comment node, a
document node, an element node, and an atomic value. In the binary XML, 'b' is used to
denote blanks.
Encoding Flags:
Document Type: XML Sequence
StringIDs (required): On
XML:
<!--comment-->
<name mgr = "NO"> Joe </name>
Susan
<name>Bill</name>
Binary XML (Excluding the header):
c7comment@dX4name100Y3mgr2002NOT7bbJoebbz@V5Susan@x100T4BillzZ
Element, attribute, prefix, and URI IDs are:
1==name, 2==mgr
Statistics:
67 bytes of XML
62 bytes of binary + 8 byte header = 70 bytes of binary XML
6.3 Example 3 StringIDs
This example shows an XML document and its binary encoding with stringIDs on.
Encoding Flags:
Document Type: XML Document
StringIDs (required): On
XML:
<root xmlns:foo = "bar">
<Person>
<name mgr = "NO">Bill</name>
<foo:age>35</foo:age>
</Person>
<Person>
<name mgr = "NO">Joe</name>
<foo:age>45</foo:age>
</Person>
</root>
Binary XML (Excluding the header):
I3foo1I3bar2X4root300m12X6Person400X4name500Y3mgr6002NOT4BillzX3age712T
235zze4e5a62NOT3Joezx712T245zzzZ
Element, attribute, prefix, and URI IDs are:
1==foo, 2==bar, 3==root, 4==Person, 5==name, 6==mgr, 7==age
Statistics:
162 bytes of XML
103 bytes of binary + 8 byte header = 111 bytes of binary XML
6.4 Example 4 Namespaces with StringIDs
This example shows an XML document with multiple namespaces and its binary
encoding with stringIDs.
Encoding Flags:
Document Type: XML Document
StringIDs (required): On
XML:
<root>
<Person xmlns:foo = "bar">
<name mgr = "NO">Bill</name>
<foo:age>35</foo:age>
</Person>
<Person xmlns:foo = "baz">
<name foo:mgr = "NO">Joe</name>
<foo:age>45</foo:age>
</Person>
<Person xmlns:bar = "food">
<name bar:mgr = "YES">Susan</name>
</Person>
<Person xmlns:bar = "foo">
<name bar:exec = "YES">Amy</name>
</Person>
</root>
Binary XML (Excluding the header):
X4root100I3foo2I3bar3X6Person400m23X4name500Y3mgr6002NOT4BillzX3age723T
235zzI3baz8e4m28e5y6282NOT3Joezx728T245zzI4food9e4m39e5y6393YEST5Susanz
ze4m32e5Y4exec10323YEST3AmyzzzZ
Element, attribute, prefix, and URI IDs are:
1==root, 2==foo, 3==bar, 4==Person, 5==name, 6==mgr, 7==age, 8==baz, 9==food,
10==exec
Statistics:
322 bytes of XML
173 bytes of binary + 8 byte header = 181 bytes of binary XML
6.5 Example 5 Mixed Content
This example shows how mixed content is encoded.
Encoding Attributes:
Document Type: XML Document
StringIDs (required): On
XML:
<a>text<b/>more text</a>
Binary XML (Excluding the header):
X1a100T4textX1b200zT9more textzZ
Element, attribute, prefix, and URI IDs are:
1==a, 2==b
Statistics:
24 bytes of XML
32 bytes of binary + 8 byte header = 40 bytes of binary XML
6.6 Example 6 White Space
This example shows a binary XML document with all of the white space characters that
are shown in the corresponding serialized XML document. In the binary XML, 'b' is used
to denote a blank and 'a' is used to indicate a linefeed character.
Encoding Flags:
Document Type: XML Document
StringIDs (required): On
XML:
<employee>
<name xml:space="preserve"><fn>Susan</fn> <ln>Smith</ln></name>
<address xml:space="default">
<state>MA</state>
</address>
</employee>
Binary XML (Excluding the header):
X8employee100W4abbbX4name200I3xml3I5space4y4308preserveX2fn600T5SusanzT1
bX2ln700T5SmithzzW4abbbX7address800y4307defaultW7abbbbbbX5state900T2MAz
W4abbbzW1azZ
Element, attribute, prefix, and URI IDs are:
1==employee, 2==name, 3==xml, 4==space, 5==name, 6==fn, 7==ln, 8==address,
9==state
Statistics:
160 bytes of XML
155 bytes of binary + 8 byte header = 163 bytes of binary XML
Appendix A Complete XDBX BNF
XDBX ::= Header DocumentContent
Header ::= DocIdentifier HeaderLength MajorVersion
EncodingFlags HeaderFill
DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */
HeaderLength ::= #x5
MajorVersion ::= #x1
EncodingFlags ::= FourBytes
HeaderFill ::= Byte*
DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd
/* Which branch to choose is controlled
by EncodingFlags */
DocumentEnd ::= 'Z'
XMLSequence ::= (SequenceItem ('@' SequenceItem)*)?
SequenceItem ::= Anywhere
(CompleteDoc | Comment | PI | AtomicValue
| Element)
Anywhere
CompleteDoc ::= 'd' XMLDocument
AtomicValue ::= 'V' LengthValue
XMLDocument ::= (Anywhere XMLDecl)? Misc*
(DocType | Misc*)? Element Misc*
Anywhere ::= (SI | Hint | Reserved)*
Misc ::= Comment | PI | SI | Hint
DocType ::= 'F' StringID StringID StringID
XMLDecl ::= XMLVersion Encoding? Standalone?
XMLVersion ::= 'L' LengthValue
/* The value is a valid XML version. "1.0"
or "1.1" for now */
Encoding ::= 'D' LengthValue
Standalone ::= 't' BooleanValue
Element ::= (ElementI | ElementSII | ElementIII)
ElementContent
EndElement
ElementI ::= 'e' StringID
ElementSII ::= 'X' LengthValue StringID StringID StringID
ElementIII ::= 'x' StringID StringID StringID
EndElement ::= 'z'
ElementContent ::= NSDecls Attributes Children
NSDecls ::= (Anywhere NSDecl)*
NSDecl ::= NSDeclII
NSDeclII ::= 'm' StringID StringID
Attributes ::= (Anywhere Attribute)*
Attribute ::= (AttributeI | AttributeSII | AttributeIII)
AttributeValue
AttributeI ::= 'a' StringID
AttributeSII ::= 'Y' LengthValue StringID StringID StringID
AttributeIII ::= ('y' | 'b') StringID StringID StringID
AttributeValue ::= LengthValue
/* If 'b' is used, then no &,',",
<,>,#xD,#xA,#x9 can appear in value */
Children ::= (Misc | Element | Text)*
Text ::= ('T' | 'U' | 'C' | 'W' ) LengthValue
Comment ::= 'c' LengthValue
PI ::= PII
PII ::= 'P' StringID LengthValue
SI ::= 'I' LengthValue StringID
Hint ::= 'H' LengthValue LengthValue
Reserved ::= [#xC9 - #xFA] Byte*
LengthValue ::= Length Value
Length ::= VariableInteger
Value ::= Byte*
/* Number of bytes governed by preceding
length */
StringID ::= VariableInteger
TypeID ::= VariableInteger
VariableInteger ::= (LongLeading | ShortLeading)? LastByte
LongLeading ::= [#x81-#x8F] [#x80-#xFF]? [#x80-#xFF]?
[#x80-#xFF]?
ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]?
LastByte ::= [#x0-#x7F]
BooleanValue ::= False | True
False ::= #x0
True ::= #x1
FourBytes ::= Byte Byte Byte Byte
Byte ::= [#x0-#xFF]
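The VariableInteger productions above can be read as a byte-level pattern: zero or more
leading bytes at or above #x80, terminated by a LastByte below #x80. The Python sketch
below only determines how many bytes such an integer occupies, following the
LongLeading, ShortLeading, and LastByte rules. It does not compute the integer's value,
since the value layout is not given in this appendix. The name variable_integer_length
is hypothetical.

```python
def variable_integer_length(buf, pos=0):
    """Return the number of bytes occupied by the VariableInteger at buf[pos]."""
    first = buf[pos]
    if first <= 0x7F:
        return 1  # LastByte only, no leading bytes
    if first == 0x80:
        raise ValueError("#x80 cannot start a VariableInteger")
    # LongLeading (#x81-#x8F) allows up to 3 continuation bytes,
    # ShortLeading (#x90-#xFF) allows up to 2.
    max_continuation = 3 if first <= 0x8F else 2
    end = pos + 1
    while end < len(buf) and buf[end] >= 0x80:
        end += 1
        if end - (pos + 1) > max_continuation:
            raise ValueError("too many leading bytes")
    if end >= len(buf):
        raise ValueError("missing LastByte")
    return end - pos + 1  # leading bytes plus the terminating LastByte
```

A decoder could use such a scan to validate or step over StringID and Length fields
before interpreting them.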