|
Extensible Dynamic Binary XML,
|
|||
|
Client/Server Binary XML Format
|
|||
|
(XDBX)
|
|||
|
Version 1.0
|
|||
|
(July 14, 2010)
|
|||
|
|
|||
|
Permission to copy and display the Extensible Dynamic Binary XML, Client/Server
|
|||
|
Binary XML Format (XDBX) (the "Specification"), in any medium without fee or
|
|||
|
royalty is hereby granted by IBM (collectively, the "Authors"), provided that you include
|
|||
|
the following on ALL copies of the Specification, or portions thereof, that you make:
|
|||
|
1. A link or URL to the Specification at one of the Authors websites.
|
|||
|
2. The copyright notice as shown in the Specification.
|
|||
|
The Authors each agree to grant you a royalty-free license, under reasonable, non-
|
|||
|
discriminatory terms and conditions to their respective patents that they deem necessary
|
|||
|
to implement the Specification.
|
|||
|
THE SPECIFICATION IS PROVIDED "AS IS," AND THE AUTHORS MAKE NO
|
|||
|
REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING,
|
|||
|
BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR
|
|||
|
A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE; THAT THE
|
|||
|
CONTENTS OF THE SPECIFICATION ARE SUITABLE FOR ANY PURPOSE; NOR
|
|||
|
THAT THE IMPLEMENTATION OF SUCH CONTENTS WILL NOT INFRINGE
|
|||
|
ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER
|
|||
|
RIGHTS.
|
|||
|
THE AUTHORS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL,
|
|||
|
INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR
|
|||
|
RELATING TO ANY USE OR DISTRIBUTION OF THE SPECIFICATION.
|
|||
|
The name and trademarks of the Authors may NOT be used in any manner, including
|
|||
|
advertising or publicity pertaining to the Specification or its contents without specific,
|
|||
|
written prior permission. Title to copyright in the Specification will at all times remain
|
|||
|
with the Authors.
|
|||
|
No other rights are granted by implication, estoppel or otherwise.
|
|||
|
© Copyright IBM Corporation 2010.
|
|||
|
|
|||
|
Abstract
|
|||
|
The solution that is presented in this document allows an encoder to produce the binary
|
|||
|
XML format using one or more of a set of attributes. The encoder can choose which
|
|||
|
attributes to include based on knowledge of the receiver. The receiver that reads the
|
|||
|
binary XML format can inspect the format header to determine the attributes with which
|
|||
|
it is encoded. This can be purely informational, or allow the receiver the opportunity to
|
|||
|
optimize its configuration to more efficiently process the attributes contained in the
|
|||
|
format.
|
|||
|
|
|||
|
Table of Contents
|
|||
|
1 Motivation
2 Encoding overview
3 Format Header
3.1 Layout of the format header
3.2 XDBX Major Version
3.3 Encoding Flags
3.3.1 Document Type
3.3.2 StringID Flags
3.3.3 Valid Flag
3.4 Example of a Format header
4 Format Content
4.1 Conventions
4.1.1 How Values and Lengths are Encoded
4.2 Encoding of Single Documents and Sequences
4.3 Encoding of XML Declarations
4.4 Encoding of Elements
4.5 Encoding of Attributes
4.6 Encoding of Namespace Mappings
4.7 Encoding of Text
4.8 Encoding of Comments
4.9 Encoding of Processing Instructions
4.10 Encoding of Other Information
4.11 Reserved Values for Tags
5 Format details
5.1 Encoding Single Documents and Sequences
5.2 StringIDs
5.2.1 Examples of StringID Usage
5.3 StringID Notes
5.4 Text Notes
5.4.1 White Space
5.5 XML Declaration Tag Notes
5.6 DTD and DOCTYPE
5.7 Namespace Notes
5.8 Hint Tag Notes
5.9 Empty Sequence
5.10 Escaping of Characters
5.11 Private Extensions
5.12 Reserved Tags
6 Examples
6.1 Example 1 – Default encoding
6.2 Example 2 – Sequence
6.3 Example 3 – StringIDs
6.4 Example 4 – Namespaces with StringIDs
6.5 Example 5 – Mixed Content
6.6 Example 6 – White Space
Appendix A Complete XDBX BNF
|
|||
|
1 Motivation
|
|||
|
Binary serialization of XML is desirable because it allows encoding of XML data in a
|
|||
|
smaller and more efficient form than textual XML format. The binary XML format is
|
|||
|
more efficient for various reasons. These include:
|
|||
|
• Multiple occurrences of repeated text are condensed through the use of StringIDs.
|
|||
|
StringIDs are integer identifiers that replace text strings.
|
|||
|
• When a parser processes data in a pretokenized format, the parser does not need
|
|||
|
to search for as many token delimiters in the content, or handle as many edge
|
|||
|
cases.
|
|||
|
• All values are prefixed with their length. When the parser has length information,
|
|||
|
it does not need to search for the ends of element names or values.
|
|||
|
• All entity references are expanded in binary XML format. The XML parser does
|
|||
|
not need to expand entity references.
|
|||
|
The binary XML format has the following disadvantages:
|
|||
|
• Loss of XML interoperability. Data that is in a proprietary format can be used
|
|||
|
only on systems that have the software to decode it.
|
|||
|
• The encoder must do extra processing to:
|
|||
|
o Perform validation
|
|||
|
o Perform well-formedness checking
|
|||
|
o Resolve all entity references
|
|||
|
o Identify repeated tags for replacement with StringIDs
|
|||
|
This binary XML format is not intended as a replacement for XML. It can provide better
|
|||
|
performance than XML when it is used in the implementation of some APIs.
|
|||
|
In general, the benefits of the binary XML format outweigh the disadvantages. The
|
|||
|
additional processing time that the encoder requires is usually less than the processing
|
|||
|
time that is used for parsing an XML document, especially when the XML document
|
|||
|
must be parsed more than once.
|
|||
|
2 Encoding overview
|
|||
|
This binary XML representation contains a format header followed by a number of tags.
|
|||
|
The format header has encoding attributes which give the receiver some useful properties
|
|||
|
of the binary XML.
|
|||
|
The following characteristics of the binary encoding are constant, regardless of the source
|
|||
|
document or how the binary encoding is performed:
|
|||
|
• All text is encoded as UTF-8.
|
|||
|
• All entity references in the source document are replaced by their values.
|
|||
|
• Line breaks are normalized.
|
|||
|
• Attributes are normalized.
|
|||
|
• Where applicable, data is encoded in big-endian format.
|
|||
|
The binary XML format is made up of various tokens (tags) and values. When binary
|
|||
|
XML format is viewed with a standard text editor or as ASCII in a debugger, the tags
|
|||
|
display as single ASCII characters. This can aid in debugging while making the binary
|
|||
|
XML format more humanly readable.
|
|||
|
3 Format Header
|
|||
|
3.1 Layout of the format header
|
|||
|
The binary XML format contains a header with information about how the format was
|
|||
|
constructed. The header information allows the parser to configure itself in order to
|
|||
|
process the message most efficiently.
|
|||
|
To identify the format and its attributes, the following scheme is used for the first set of
|
|||
|
bytes of the document:
|
|||
|
(2 bytes) – Binary XML document identifier (“magic number”)
|
|||
|
(1 byte) – Header length (not including magic number or the length byte itself)
|
|||
|
(1 byte) – XDBX major version
|
|||
|
(4 byte Integer) – Encoding flags
|
|||
|
The “magic number” will always be this value in binary: 11001010 00111011
|
|||
|
DocumentContent follows the Header. HeaderLength determines the length of the
|
|||
|
Header.
|
|||
|
BNF
|
|||
|
XDBX ::= Header DocumentContent
|
|||
|
Header ::= DocIdentifier HeaderLength MajorVersion
|
|||
|
EncodingFlags HeaderFill
|
|||
|
DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */
|
|||
|
HeaderLength ::= #x5
|
|||
|
MajorVersion ::= #x1
|
|||
|
EncodingFlags ::= FourBytes
|
|||
|
HeaderFill ::= Byte*
|
|||
|
FourBytes ::= Byte Byte Byte Byte
|
|||
|
Byte ::= [#x0-#xFF]
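The header layout above can be read with a few lines of code. The following Python sketch is an illustration only: the function name `parse_header` and its return shape are assumptions, not part of this specification. It reads the flags field as an unsigned value so that individual bits can be tested, and it skips any HeaderFill bytes.

```python
import struct

MAGIC = b"\xCA\x3B"  # DocIdentifier from the BNF above

def parse_header(data: bytes):
    """Return (major_version, flags, content_offset) for an XDBX stream."""
    if data[:2] != MAGIC:
        raise ValueError("not an XDBX stream: bad magic number")
    if len(data) < 3:
        raise ValueError("truncated header")
    header_length = data[2]  # excludes the magic number and this length byte
    if header_length < 5:
        raise ValueError("HeaderLength must be at least 5 in version 1")
    end = 3 + header_length
    if len(data) < end:
        raise ValueError("truncated header")
    # One version byte, then the four-byte big-endian flags field.
    major_version, flags = struct.unpack_from(">BI", data, 3)
    # Bytes between the flags field and `end` are HeaderFill and are skipped.
    return major_version, flags, end
```

DocumentContent starts at the returned offset, so a header that carries fill bytes is handled the same way as the minimal five-byte form.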
|
|||
|
3.2 XDBX Major Version
|
|||
|
There is just one major version of XDBX, identified by the XDBX major version value of
|
|||
|
0x01 (version 1). In this version, the HeaderLength must be at least 5.
|
|||
|
XDBX version 1 streams contain any of the following tags: 'e', 'X', 'x', 'z', 'a', 'Y', 'y', 'b',
|
|||
|
'm', 'T', 'U', 'C', 'W', 'V', 'L', 'D', 't', 'I', 'Z', '@', 'd', 'P', 'c', 'H'.
|
|||
|
The set of tags that an XDBX encoder generates is implementation defined. However, an
|
|||
|
XDBX encoder must assign a valid XDBX major version number to each generated
|
|||
|
stream, and ensure that each stream contains only tags that are allowed for that XDBX
|
|||
|
major version.
|
|||
|
An XDBX decoder is required to fully support the tag set assigned to an implementation-
|
|||
|
defined XDBX major version level. It must be able to decode all valid tags from the
|
|||
|
corresponding tag set. However, XDBX decoders can reject XDBX streams that are
|
|||
|
identified by an XDBX major version that is higher than the version that the decoder
|
|||
|
supports.
|
|||
|
3.3 Encoding Flags
|
|||
|
The format for encoding flags allows for future expansion. Encoding flags, or features,
|
|||
|
can be added as needed. The header consists of indicators that signal to a processor how
|
|||
|
the format is encoded. Each encoding flag is a bit in a four-byte integer field in the
|
|||
|
header.
|
|||
|
The following encoding flags can be used in the binary XML format. Each encoding flag
|
|||
|
is listed along with its value in the four-byte integer header field.
|
|||
|
3.3.1 Document Type
|
|||
|
This attribute indicates whether the binary stream represents one complete well-formed
|
|||
|
XML document or a sequence of items, as defined by the XQuery 1.0 specification.
|
|||
|
• XML document (Value: x00000000)
|
|||
|
• XML sequence (Value: x00000001)
|
|||
|
3.3.2 StringID Flags
|
|||
|
The flags that are associated with stringIDs are:
|
|||
|
• StringID flag
|
|||
|
• Dense stringIDs used
|
|||
|
3.3.2.1 StringID Flag (required)
|
|||
|
This encoding flag (x00000002) must be set.
|
|||
|
3.3.2.2 Dense StringIDs Used Flag
|
|||
|
Certain implementations might require the stringIDs that are used in the binary XML to
|
|||
|
be small numbers so that they can be used as indexes in an array (as opposed to a hash
|
|||
|
table).
|
|||
|
When specified (x00000020), this encoding flag notifies the receiver that the stringIDs
|
|||
|
are small numbers. In general, small numbers are monotonically increasing numbers. The
|
|||
|
stringID value 0 (zero) is reserved.
|
|||
|
3.3.3 Valid Flag
|
|||
|
When specified (x00000080), this encoding flag notifies the receiver that the XML
|
|||
|
document or sequence of items conforms to a schema. This may have been determined by
|
|||
|
the use of a validating XML parser, or by construction from objects that are associated
|
|||
|
with a schema.
|
|||
|
The use of this information by the receiver is beyond the scope of this specification. A
|
|||
|
receiver may choose to ignore this information.
|
|||
|
3.4 Example of a Format header
|
|||
|
Binary XML Document Identifier: 11001010 00111011
|
|||
|
Header Length: 00000101
|
|||
|
XDBX major version: 00000001
|
|||
|
Encoding flags:
|
|||
|
• Document Type (Bit 1): XML Document
|
|||
|
• StringID (Bit 2): On
|
|||
|
Magic Num           Hdr Len    Version    Encoding flags
11001010 00111011   00000101   00000001   00000000 00000000 00000000 00000010
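As a cross-check of this example, the header bytes can be produced from the flag values listed in section 3.3. The Python sketch below is illustrative: the constant and function names are assumptions, and `build_header` only emits the minimal form with HeaderLength 5 and no HeaderFill.

```python
# Flag values as listed in section 3.3.
FLAG_SEQUENCE = 0x00000001          # Document Type: set for an XML sequence
FLAG_STRING_IDS = 0x00000002        # StringID flag (required)
FLAG_DENSE_STRING_IDS = 0x00000020  # Dense stringIDs used
FLAG_VALID = 0x00000080             # Valid flag

def build_header(flags: int) -> bytes:
    """Build a minimal version 1 header (HeaderLength 5, no HeaderFill)."""
    return b"\xCA\x3B" + bytes([5, 1]) + flags.to_bytes(4, "big")

def describe_flags(flags: int) -> dict:
    """Translate the flags field into named properties."""
    return {
        "document_type": "sequence" if flags & FLAG_SEQUENCE else "document",
        "string_ids": bool(flags & FLAG_STRING_IDS),
        "dense_string_ids": bool(flags & FLAG_DENSE_STRING_IDS),
        "valid": bool(flags & FLAG_VALID),
    }

header = build_header(FLAG_STRING_IDS)
print(" ".join(f"{b:08b}" for b in header))
# prints: 11001010 00111011 00000101 00000001 00000000 00000000 00000000 00000010
```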
|
|||
|
4 Format Content
|
|||
|
The following combinations of information are used in binary XML document encoding:
|
|||
|
• TLV - Tag-Length-Value
|
|||
|
• TV - Tag-Value
|
|||
|
• LV - Length-Value
|
|||
|
• TLVid - Tag-Length-Value-StringID
|
|||
|
• ID - StringID
|
|||
|
Some content is denoted via a TLV, while other content uses the shorter LV. This is
|
|||
|
done for compactness, where a second tag is unnecessary and can be inferred from the
|
|||
|
previous tag. The specification also uses TV when the length is known to be one. In
|
|||
|
addition, TLVid is used when StringIDs are used, and is how a first occurrence of a string
|
|||
|
value is assigned its ID. Finally, there is an ID format if only the stringID is needed.
|
|||
|
4.1 Conventions
|
|||
|
All the lengths are expressed as a number of bytes.
|
|||
|
A summary of each tag in the format and its meaning is contained in the tables that
|
|||
|
follow. The values in the Tag column are the decimal values of the tags. The values in
|
|||
|
the ASCII column are the ASCII encoding of the tag values.
|
|||
|
The following conventions are used:
|
|||
|
• TLV(localname) - a TLV for the localname is defined, where 'Value' is the text of
|
|||
|
the localname.
|
|||
|
• TLV(localname) /LV(prefix)/LV(uri)- a TLV for the localname, followed by an
|
|||
|
LV for the namespace prefix, followed by an LV for the namespace URI.
|
|||
|
• TLVid(localname) - a TLVid for the localname is defined where stringID is the
|
|||
|
ID assigned to the text for localname.
|
|||
|
• Tid(localname)/id(prefix)/id(uri) - a Tag-StringID for the localname, followed by
|
|||
|
the stringID of the namespace prefix, followed by the stringID of the namespace
|
|||
|
URI. The StringID references a string in the dictionary.
|
|||
|
BNF
|
|||
|
LengthValue ::= Length Value
|
|||
|
Length ::= VariableInteger
|
|||
|
Value ::= Byte*
|
|||
|
/* Number of bytes governed by preceding
|
|||
|
length */
|
|||
|
StringID ::= VariableInteger
|
|||
|
VariableInteger ::= (LongLeading | ShortLeading)? LastByte
|
|||
|
LongLeading ::= [#x81-#x8F]
|
|||
|
[#x80-#xFF]? [#x80-#xFF]? [#x80-#xFF]?
|
|||
|
ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]?
|
|||
|
LastByte ::= [#x0-#x7F]
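The VariableInteger productions can be transcribed almost literally into a byte-level regular expression, which is a convenient way to test whether a byte sequence is a well-formed encoding. The following Python sketch is only an illustration; the names `VARIABLE_INTEGER` and `is_variable_integer` are assumptions.

```python
import re

# Direct transcription of the VariableInteger productions as a bytes pattern.
VARIABLE_INTEGER = re.compile(
    rb"(?:[\x81-\x8f][\x80-\xff]{0,3}"   # LongLeading
    rb"|[\x90-\xff][\x80-\xff]{0,2})?"   # ShortLeading
    rb"[\x00-\x7f]"                      # LastByte
)

def is_variable_integer(data: bytes) -> bool:
    """True if `data` is exactly one well-formed VariableInteger."""
    return VARIABLE_INTEGER.fullmatch(data) is not None
```

Note that the grammar rejects #x80 as a first byte, which rules out encodings padded with a leading zero group, and that a five-byte encoding is possible only through LongLeading.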
|
|||
|
4.1.1 How Values and Lengths are Encoded
|
|||
|
Encoding Attributes in the format header are always encoded as signed four-byte integers
in big-endian format.

For space efficiency, all other values and lengths are encoded as a variable number of
bytes, with the first byte containing the highest order bits of the integer, the next byte
containing the next highest order bits, and so on. This allows the encoding to represent
an integer in as few bytes as possible. However, this specification limits the integer to a
value representable in a signed 32-bit integer, so a length can describe at most 2^31 - 1
bytes (just under 2 GB). Each byte contains seven bits of the integer's value, with the
highest order bit of each byte designated as a flag bit. A byte's flag bit is off if the byte
is the last byte (lowest order byte) of a variable-length byte sequence for a number, and
on otherwise. Because only as many bytes as necessary to represent an integer are used,
integers between 0 and 127 are represented in one byte with the flag bit off. Integers
between 128 and 16,383 are represented in two bytes with the flag bit set in the first
byte, and so on.
|
|||
|
Examples:
|
|||
|
• A length of binary 00000101 means 5
|
|||
|
• A length of binary 10000101 00100001 means 673 (binary 1010100001)
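The encoding described above can be sketched in Python as follows. The function names are assumptions made for illustration. The decoder does not enforce the 32-bit limit or the grammar's restrictions on the first byte, and it raises IndexError on truncated input.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, most significant 7-bit group first."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value must fit in a signed 32-bit integer")
    groups = [value & 0x7F]                   # last byte: flag bit off
    value >>= 7
    while value:
        groups.append(0x80 | (value & 0x7F))  # earlier bytes: flag bit on
        value >>= 7
    return bytes(reversed(groups))

def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_pos) for the variable-length integer at data[pos]."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:                   # flag bit off: this was the last byte
            return value, pos
```

With these definitions, `encode_varint(673)` yields the two bytes 10000101 00100001 from the second example, and `decode_varint` applied to those bytes returns 673 together with the position of the next unread byte.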
|
|||
|
4.2 Encoding of Single Documents and Sequences
|
|||
|
A binary stream can represent one complete well-formed XML document or a sequence
|
|||
|
of items, as defined by the XQuery specification. This information is encoded in the
|
|||
|
format header with the following encoding flags:
|
|||
|
• XML Document (Value: x00000000)
|
|||
|
• XML Sequence (Value: x00000001)
|
|||
|
Each item in the sequence can be a complete document, a subtree, or an atomic value.
|
|||
|
BNF
|
|||
|
DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd
|
|||
|
/* Which branch to choose is controlled
|
|||
|
by EncodingFlags */
|
|||
|
DocumentEnd ::= 'Z'
|
|||
|
XMLDocument ::= (Anywhere XMLDecl)? Misc*
|
|||
|
(DocType | Misc*)? Element Misc*
|
|||
|
XMLSequence ::= (SequenceItem
|
|||
|
(SequenceSeparator SequenceItem)*)?
|
|||
|
SequenceItem ::= Anywhere
|
|||
|
(CompleteDoc | Comment | PI
|
|||
|
| AtomicValue | Element)
|
|||
|
Anywhere
|
|||
|
SequenceSeparator ::= '@'
|
|||
|
CompleteDoc ::= 'd' XMLDocument
|
|||
|
Anywhere ::= (SI | Hint | Reserved)*
|
|||
|
Misc ::= Comment | PI | SI | Hint
|
|||
|
DocType ::= 'F' StringID StringID StringID
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
90 Z End of the binary stream
|
|||
|
64 @ Separator for items in an XML sequence
|
|||
|
100 d Document node (assumed for XML documents, not assumed in XML
|
|||
|
sequences)
|
|||
|
70 F DOCTYPE in Tid(rootElementName) /id(systemID)/id(publicID)
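To illustrate the XMLSequence production, the Python sketch below encodes a list of strings as a sequence of atomic values ('V' items), separated by '@' and terminated by 'Z'. It is a minimal example under stated assumptions: the helper names are hypothetical, the header is the minimal five-byte form, and only the sequence flag and the required StringID flag are set.

```python
def varint(n: int) -> bytes:
    """Variable-length integer: 7 bits per byte, flag bit set on all but the last."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))

def lv(text: str) -> bytes:
    """Length-Value pair; the length counts UTF-8 bytes, not characters."""
    raw = text.encode("utf-8")
    return varint(len(raw)) + raw

def encode_atomic_sequence(values) -> bytes:
    """Encode strings as 'V' items separated by '@', ending in 'Z'."""
    flags = 0x00000001 | 0x00000002   # XML sequence + StringID flag
    header = b"\xCA\x3B\x05\x01" + flags.to_bytes(4, "big")
    items = [b"V" + lv(v) for v in values]
    return header + b"@".join(items) + b"Z"
```

An empty list produces the header followed directly by 'Z', which corresponds to the optional body of XMLSequence.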
|
|||
|
4.3 Encoding of XML Declarations
|
|||
|
BNF
|
|||
|
XMLDecl ::= XMLVersion Encoding? Standalone?
|
|||
|
XMLVersion ::= 'L' LengthValue
|
|||
|
/* The value is a valid XML version.
|
|||
|
"1.0" or "1.1" for now */
|
|||
|
Encoding ::= 'D' LengthValue
|
|||
|
Standalone ::= 't' BooleanValue
|
|||
|
BooleanValue ::= False | True
|
|||
|
False ::= #x0
|
|||
|
True ::= #x1
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
76 L XML version in TLV(version) form.
|
|||
|
68 D Encoding in TLV(encoding) form.
|
|||
|
116 t Standalone in TV(standalone) form where the value of 'standalone' is
|
|||
|
either 0 or 1.
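A short Python sketch of the XMLDecl production follows. The function names are assumptions, and for brevity the `lv` helper only handles values shorter than 128 bytes, so that the length fits in a single byte.

```python
def lv(text: str) -> bytes:
    """Length-Value pair; this sketch only handles values shorter than 128 bytes."""
    raw = text.encode("utf-8")
    if len(raw) > 127:
        raise ValueError("sketch limited to one-byte lengths")
    return bytes([len(raw)]) + raw

def encode_xml_decl(version="1.0", encoding=None, standalone=None) -> bytes:
    """Emit 'L' (version), then optional 'D' (encoding) and 't' (standalone)."""
    out = b"L" + lv(version)
    if encoding is not None:
        out += b"D" + lv(encoding)
    if standalone is not None:
        out += b"t" + (b"\x01" if standalone else b"\x00")
    return out
```

For example, `encode_xml_decl("1.0", "UTF-8", True)` produces the 'L' tag with the three-byte value "1.0", the 'D' tag with the five-byte value "UTF-8", and the 't' tag followed by the byte #x1.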
|
|||
|
4.4 Encoding of Elements
|
|||
|
BNF
|
|||
|
Element ::= (ElementI | ElementSII | ElementIII)
|
|||
|
ElementContent
|
|||
|
EndElement
|
|||
|
ElementI ::= 'e' StringID
|
|||
|
ElementSII ::= 'X' LengthValue StringID StringID StringID
|
|||
|
ElementIII ::= 'x' StringID StringID StringID
|
|||
|
EndElement ::= 'z'
|
|||
|
ElementContent ::= NSDecls Attributes Children
|
|||
|
Children ::= (Misc | Element | Text)*
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
101 e Tid(localname)
|
|||
|
Used when the element is not associated with a namespace.
|
|||
|
88 X TLVid(localname) / id(prefix) / id(uri)
|
|||
|
Used when the stringID for the element name is not yet defined. If the
|
|||
|
element is in the default namespace, then the prefix stringID is zero. If
|
|||
|
the element is not in a namespace, then the URI stringID is zero.
|
|||
|
120 x Tid(localname) / id(prefix) / id(uri)
|
|||
|
Used when the stringID for the element name is already defined. If the
|
|||
|
element is in the default namespace, then the prefix stringID will be
|
|||
|
zero. If the element is not in a namespace, then the URI stringID is zero.
|
|||
|
122 z End Element
|
|||
|
4.5 Encoding of Attributes
|
|||
|
BNF
|
|||
|
Attributes ::= (Anywhere Attribute)*
|
|||
|
Attribute ::= (AttributeI | AttributeSII | AttributeIII)
|
|||
|
AttributeValue
|
|||
|
AttributeI ::= 'a' StringID
|
|||
|
AttributeSII ::= 'Y' LengthValue StringID StringID StringID
|
|||
|
AttributeIII ::= ('y' | 'b') StringID StringID StringID
|
|||
|
AttributeValue ::= LengthValue
|
|||
|
/* If 'b' is used, then no &,',",<,
|
|||
|
>,#xD,#xA,#x9 can appear in value */
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
97 a Tid(localname) / LV(attribute-value)
|
|||
|
Used when the attribute is not associated with a namespace.
|
|||
|
89 Y TLVid(localname) / id(prefix) / id(uri) / LV(attribute-value)
|
|||
|
Used when the stringID for the attribute name is not yet defined. If the
attribute is not in a namespace, then the prefix stringID and URI
stringID are both zero.
|
|||
|
121 y Tid(localname) / id(prefix) / id(uri) / LV(attribute-value)
|
|||
|
Used when the stringID for the attribute name is already defined. If the
attribute is not in a namespace, then the prefix stringID and URI
stringID are both zero.
|
|||
|
98 b Tid(localname) / id(prefix) / id(uri) / LV(attribute-value)
|
|||
|
Similar to the 'y' tag. Characters that cannot be used in the value are:
|
|||
|
• '<' (#x3c)
|
|||
|
• '>' (#x3e)
|
|||
|
• '&' (#x26)
|
|||
|
• carriage return (#x0d)
|
|||
|
• single quote (#x27)
|
|||
|
• double quote (#x22)
|
|||
|
• tab (#x09)
|
|||
|
• linefeed (#x0a)
|
|||
|
Because no characters need to be escaped when this attribute node is
|
|||
|
serialized, this feature should speed up serialization.
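An encoder that wants to use the 'b' tag has to check the attribute value against the list above. The following Python sketch shows one way to make that choice for an attribute whose name already has a stringID; the names are hypothetical, and an encoder is free to always use 'y' instead.

```python
# Characters that must not appear in the value of a 'b' attribute.
B_FORBIDDEN = set("<>&'\"\r\n\t")

def attribute_tag_for_known_name(value: str) -> bytes:
    """Pick 'b' when the value needs no escaping on serialization, else 'y'."""
    return b"y" if B_FORBIDDEN & set(value) else b"b"
```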
|
|||
|
4.6 Encoding of Namespace Mappings
|
|||
|
BNF
|
|||
|
NSDecls ::= (Anywhere NSDecl)*
|
|||
|
NSDecl ::= NSDeclII
|
|||
|
NSDeclII ::= 'm' StringID StringID
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
109 m Tid(prefix) /id(namespace-uri)
|
|||
|
Declares a namespace mapping of a prefix stringID to a namespace URI
|
|||
|
stringID. For default namespace declarations, the stringID for the prefix
|
|||
|
is zero.
|
|||
|
4.7 Encoding of Text
|
|||
|
BNF
|
|||
|
Text ::= ('T' | 'U' | 'C' | 'W') LengthValue
|
|||
|
AtomicValue ::= 'V' LengthValue
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
84 T Text node in TLV(text) form.
|
|||
|
85 U Text node in TLV(text) form. The '<' (#x3c), '>' (#x3e), '&' (#x26), and
|
|||
|
carriage return (#x0d) characters cannot be used in the value. Because
|
|||
|
no characters need to be escaped when this text node is serialized, this
|
|||
|
feature should speed up serialization.
|
|||
|
67 C CDATA string in TLV(text) form.
|
|||
|
87 W Text node containing only white space in TLV(text) form. White space
|
|||
|
consists of one or more space (#x20) characters, carriage returns (#x0d),
|
|||
|
line feeds (#x0a), tabs (#x09), Unicode line separator characters
|
|||
|
(#x2028), or NELs (#x85).
|
|||
|
Used when a text node contains only white space, unless the nearest
|
|||
|
containing element with an xml:space attribute specifies
|
|||
|
xml:space='preserve'.
|
|||
|
86 V Atomic Value in TLV(text) form.
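The choice among the 'T', 'U', and 'W' tags depends only on the characters in the text node and on whether xml:space='preserve' is in effect. The Python sketch below shows one possible policy; the function name and the `preserve_space` parameter are assumptions, and the sketch never emits 'C', because whether to preserve CDATA syntax is an application decision.

```python
# White space characters as listed for the 'W' tag.
WHITE_SPACE = set(" \r\n\t\u2028\u0085")
# Characters that must not appear in the value of a 'U' text node.
U_FORBIDDEN = set("<>&\r")

def text_tag(text: str, preserve_space: bool = False) -> bytes:
    """Choose a tag for a text node (one possible encoder policy)."""
    chars = set(text)
    if text and chars <= WHITE_SPACE and not preserve_space:
        return b"W"   # white-space-only node
    if not chars & U_FORBIDDEN:
        return b"U"   # nothing needs escaping on serialization
    return b"T"       # general text node
```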
|
|||
|
4.8 Encoding of Comments
|
|||
|
BNF
|
|||
|
Comment ::= 'c' LengthValue
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
99 c Comment in TLV(comment) form.
|
|||
|
4.9 Encoding of Processing Instructions
|
|||
|
BNF
|
|||
|
PI ::= PII
|
|||
|
PII ::= 'P' StringID LengthValue
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
80 P Processing instruction in Tid(target)/LV(value) form.
|
|||
|
The 'P' tag cannot declare an ID for the target of the processing instruction. Instead, an 'I'
|
|||
|
tag should be used to define the stringID for the target. Then the 'P' tag is used to define
|
|||
|
the processing instruction itself.
|
|||
|
Although this is unlike the behavior for element and attribute tags, this was done to avoid
|
|||
|
creating several tags to describe a processing instruction.
|
|||
|
4.10 Encoding of Other Information
|
|||
|
BNF
|
|||
|
SI ::= 'I' LengthValue StringID
|
|||
|
Hint ::= 'H' LengthValue LengthValue
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
73 I Definition of a stringID in TLVid(string) form. Used only when the
|
|||
|
StringID flag is set.
|
|||
|
72 H Hint in TLV/LV form.
|
|||
|
4.11 Reserved Values for Tags
|
|||
|
BNF
|
|||
|
Reserved ::= [#xC9 - #xFA] Byte*
|
|||
|
Tags
|
|||
|
Value ASCII Meaning
|
|||
|
201-250 Reserved for use by applications.
|
|||
|
Values 201 through 250 are reserved for use by applications, and will not be used as tags
|
|||
|
in future versions of this specification. These reserved values can be used to define
|
|||
|
private extensions to the format for features not accounted for in this version of the
|
|||
|
specification. See section 5.11, Private Extensions, for more information.
|
|||
|
5 Format details
|
|||
|
This section provides additional details on the binary XML format.
|
|||
|
5.1 Encoding Single Documents and Sequences
|
|||
|
Whether an XDBX instance represents an XML document or a sequence of items is
|
|||
|
encoded in the XDBX header. Most commonly, the binary stream represents an XML
|
|||
|
Document. In this case, the document node as defined by the XML data models is
|
|||
|
assumed. In other words, there is no need to start the document with a 'd' tag. If the binary
|
|||
|
stream represents an XML Sequence, then the document node is not assumed, and any
|
|||
|
document node in the stream needs to be denoted with a 'd' tag. Note that XPath behaves
differently depending on whether a document node is present.
|
|||
|
It is important to note that if stringIDs are used, the encoder must ensure that all stringIDs
|
|||
|
are valid from one item to the next. In other words, the stringIDs are global to the binary
|
|||
|
XML stream. Combining multiple documents together as items in a sequence could have
|
|||
|
a size advantage, because the stringIDs would need to be defined only once.
|
|||
|
5.2 StringIDs
|
|||
|
Usage of stringIDs results in a smaller encoding, because the StringIDs are typically
|
|||
|
smaller than the text they represent. In addition, the use of StringIDs can allow the data
|
|||
|
in binary XML format to be processed more efficiently. The receiver must be prepared to
|
|||
|
manage the StringIDs that appear in the document. This requires establishing and
|
|||
|
managing lookup tables to efficiently reconcile StringIDs with the text they represent.
|
|||
|
In some encodings the first occurrence of the text is written as text, then where that text
|
|||
|
appears again, it is replaced with an ID that is computed during the processing of the first
|
|||
|
occurrence. In other encodings all text, or only a portion of the text, could be represented
|
|||
|
by an ID, where the ID is a reference to a dictionary that is contained in the message.
|
|||
|
A StringID can be used only after the tag that defines it.
|
|||
|
5.2.1 Examples of StringID Usage
|
|||
|
The following shows example encodings of namespace declarations, elements, and
|
|||
|
attributes when StringIDs are used.
|
|||
|
Namespace Declaration:
|
|||
|
The namespace declaration portion of the element tag: <root xmlns:foo="bar"> is
|
|||
|
encoded as I3foo1I3bar2m12, where:
|
|||
|
• 'I' assigns the StringID '1' to "foo" and '2' to "bar".
• 'm' declares the namespace mapping of the prefix "foo" (StringID '1') to the uri "bar"
(StringID '2').
|
|||
|
Suppose that the namespace prefix is reassigned to a different uri later in the document.
|
|||
|
For example:
|
|||
|
<Address xmlns:foo= "baz">
|
|||
|
The encoding of the namespace declaration is:
|
|||
|
I3baz3m13, where '3' is the StringID assigned to "baz".
|
|||
|
Element with no prefix and no namespace:
|
|||
|
The first occurrence of <Address> is encoded as: X7Address100, where:
|
|||
|
• 'X' is the tag indicating an element name is encoded with StringIDs, and that a
|
|||
|
length/value/ID tuple follows defining the localname and its associated ID,
|
|||
|
followed by the stringIDs for the namespace prefix and namespace uri.
|
|||
|
• '7' is the length of the localname string "Address" and '1' is the assigned ID for
|
|||
|
that string.
|
|||
|
• '0' is the stringID for "no namespace prefix".
|
|||
|
• '0' is the stringID for "no namespace uri".
|
|||
|
Subsequent occurrences of <Address> are encoded more compactly as e1, where '1' is the
|
|||
|
StringID for the string "Address".
|
|||
|
Element with no prefix and the default namespace:
|
|||
|
The first occurrence of <Address> is encoded as: X7Address104 where:
|
|||
|
• 'X' is the tag indicating an element name is encoded with StringIDs, and that a
|
|||
|
length/value/ID tuple follows defining the localname and its associated ID.
|
|||
|
• '0' is the stringID for the namespace prefix (because there is none).
|
|||
|
• '4' is the stringID of the namespace uri.
|
|||
|
Subsequent occurrences of <Address> are encoded more compactly as x104, where
|
|||
|
• '1' is the StringID for the string "Address".
|
|||
|
• '4' is the stringID for the namespace uri.
|
|||
|
Element with prefix:
|
|||
|
The first occurrence of <foo:Address> is encoded as X7Address154, where:
|
|||
|
• '1' is the StringID assigned to the string "Address".
|
|||
|
• '5' is the stringID that was previously assigned to "foo".
|
|||
|
• '4' is the stringID that was previously assigned to the namespace uri.
|
|||
|
Subsequent occurrences of <foo:Address> are encoded more compactly as x154, where
|
|||
|
'1' is the StringID for the string "Address".
|
|||
|
Attribute with no prefix (and thus no namespace):
|
|||
|
The first occurrence of the attribute portion of <name mgr="NO"> is encoded as
|
|||
|
Y3mgr9002NO where:
|
|||
|
• 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a
|
|||
|
length/value/id tuple for the attribute name.
|
|||
|
• '3' is the length of the attribute name "mgr".
|
|||
|
• '9' is the StringID assigned to the string "mgr".
|
|||
|
• '0' for the stringID of the prefix.
|
|||
|
• '0' for the stringID of the URI.
|
|||
|
• '2' is the length of the attribute value: "NO".
|
|||
|
Subsequent occurrences of the attribute portion of <name mgr="NO"> are encoded as
|
|||
|
a92NO, where:
|
|||
|
• 'a' indicates an attribute declaration with StringIDs.
|
|||
|
• '9' is the stringID of the attribute name.
|
|||
|
• '2' is the length of the attribute value "NO".
|
|||
|
Attribute with prefix:
|
|||
|
The first occurrence of the attribute portion of <name foo:mgr="NO"> is encoded as:
|
|||
|
Y3mgr9542NO where:
|
|||
|
• 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a
length/value/id tuple for the attribute name.
• '3' is the length of the attribute name "mgr".
• '9' is the StringID assigned to the string "mgr".
• '5' is the stringID for the prefix.
• '4' is the stringID for the URI.
• '2' is the length of the attribute value "NO".
|
|||
|
Subsequent occurrences of the attribute portion of <name foo:mgr= "NO"> are encoded
|
|||
|
more compactly as: y9542NO, where:
|
|||
|
• 'y' is the tag indicating an attribute declaration with StringIDs.
• '9' is the stringID for the attribute name.
• '5' is the stringID for the prefix.
• '4' is the stringID for the URI.
• '2' is the length of the attribute value "NO".
|
|||
|
Elements, Text, and namespaceIDs:
|
|||
|
This section ties together some of the concepts described above and assumes StringIDs
|
|||
|
are used. For example:

<root xmlns:foo="bar">
<foo:Address>ABC</foo:Address>
<foo:Address><![CDATA[DEF]]></foo:Address>
</root>
|
|||
|
The namespace declaration in the above XML is encoded as: I3foo1I3bar2m12, where:
|
|||
|
• '1' represents the StringID for "foo".
|
|||
|
• '2' is the StringID for "bar".
|
|||
|
• 'm12' is the structure to identify a mapping of foo ('1') to bar ('2').
|
|||
|
Therefore, the first occurrence of foo:Address is encoded as follows:
|
|||
|
X7Address912T3ABCz where:
|
|||
|
• 'X' indicates an element name expressed in LVid form.
|
|||
|
• '7Address' is the LV for the localname.
|
|||
|
• '9' is the StringID for "Address".
|
|||
|
• '12' is a reference to the namespace mapping of foo to bar.
|
|||
|
• 'T3ABC' is the TLV for the text node and 'z' represents the end element tag.
|
|||
|
The subsequent occurrence of foo:Address is encoded more compactly as follows:
|
|||
|
x912C3DEFz where:
|
|||
|
• 'x' indicates an element name expressed in id form.
|
|||
|
• '9' is the StringID for "Address".
|
|||
|
• '12' is a reference to the namespace mapping of foo to bar.
|
|||
|
• 'C3DEF' is the TLV for the CDATA.
|
|||
|
• 'z' represents the end element tag. (NOTE: The encoder could choose to encode
|
|||
|
the CDATA as a text node via 'T'.)
|
|||
|
The first occurrence of foo:Address must use the more expansive form of an element
name, 'X', whereas the second occurrence can use the more compact version, 'x', because
the element name is already encoded with a stringID.
|
|||
|
The following table summarizes the encoding of an element in various forms with
|
|||
|
StringIDs on:
|
|||
|
                  No Namespace                    Namespace
                  First           Subsequent      First           Subsequent
                  Occurrence      Occurrences     Occurrence      Occurrences
<Address>         X7Address100    e1              X7Address902    x902
<foo:Address>     N/A             N/A             X7Address912    x912
|
|||
|
The following table summarizes the encoding of an attribute in various forms with
|
|||
|
StringIDs on:
|
|||
|
                  No Namespace                    Namespace
                  First           Subsequent      First           Subsequent
                  Occurrence      Occurrences     Occurrence      Occurrences
<mgr="NO">        Y3mgr9002NO     a92NO           Y3mgr9022NO     y9022NO
<foo:mgr="NO">    N/A             N/A             Y3mgr9122NO     y9122NO
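The choice between the three attribute forms in the table can be sketched as follows. This is an illustration in the readable notation of the examples, assuming the encoder tracks which attribute-name StringIDs it has already written; `encode_attribute` and `seen` are hypothetical names.

```python
def encode_attribute(name, string_id, prefix_id, uri_id, value, seen):
    """Pick the 'Y', 'a', or 'y' form for one attribute (readable notation)."""
    lv_value = f"{len(value.encode('utf-8'))}{value}"
    if string_id not in seen:
        # First occurrence: the name travels as an LV together with its StringID.
        seen.add(string_id)
        lv_name = f"{len(name.encode('utf-8'))}{name}"
        return f"Y{lv_name}{string_id}{prefix_id}{uri_id}{lv_value}"
    if prefix_id == 0 and uri_id == 0:
        # No namespace: the single-StringID form is enough.
        return f"a{string_id}{lv_value}"
    return f"y{string_id}{prefix_id}{uri_id}{lv_value}"


seen = set()
print(encode_attribute("mgr", 9, 0, 0, "NO", seen))  # Y3mgr9002NO
print(encode_attribute("mgr", 9, 0, 0, "NO", seen))  # a92NO
print(encode_attribute("mgr", 9, 1, 2, "NO", seen))  # y9122NO
```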
|
|||
|
5.3 StringID Notes
|
|||
|
StringIDs are considered global. For example, if the string "Person" is given the stringID
|
|||
|
4, this value will exist for the entire binary XML document. It is invalid for "Person" to
|
|||
|
be given a different stringID, or for 4 to be assigned another string in the same binary
|
|||
|
XML document.
|
|||
|
The stringID value 0 (zero) is reserved and is used to mark "no namespace prefix" and
|
|||
|
"no namespace URI".
|
|||
|
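A decoder can enforce the rules of section 5.3 with a small table. The following is a minimal sketch, not a prescribed data structure; the class name `StringIdTable` is hypothetical, and accepting a repeated identical assignment is an assumption of this sketch that the text above does not address.

```python
class StringIdTable:
    """Document-wide StringID table; ID 0 is reserved for 'no prefix' / 'no URI'."""

    def __init__(self):
        self._by_id = {}
        self._by_string = {}

    def assign(self, text, string_id):
        if string_id == 0:
            raise ValueError("StringID 0 is reserved")
        if self._by_string.get(text, string_id) != string_id:
            raise ValueError(f"{text!r} already has a different StringID")
        if self._by_id.get(string_id, text) != text:
            raise ValueError(f"StringID {string_id} is already assigned")
        self._by_string[text] = string_id
        self._by_id[string_id] = text

    def lookup(self, string_id):
        """Return the string for an ID, or None for the reserved ID 0."""
        if string_id == 0:
            return None
        return self._by_id[string_id]


table = StringIdTable()
table.assign("Person", 4)
print(table.lookup(4))  # Person
print(table.lookup(0))  # None
```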
5.4 Text Notes
|
|||
|
Multiple text and/or CDATA tags can appear one after another in order to handle
|
|||
|
arbitrarily large amounts of data. They are also used to encode mixed content.
|
|||
|
It is up to the encoder whether to encode CDATA using the 'C' tag or a 'T' tag, because
|
|||
|
they are semantically identical. The 'C' tag exists for applications that want to preserve
|
|||
|
the CDATA syntax. Beyond the difference between CDATA and text as described in the
|
|||
|
XML specification, this binary XML specification treats them identically.
|
|||
|
The 'U' tag is similar to the 'T' tag, except that the encoder guarantees that none of the
|
|||
|
characters in the 'U' tag need to be replaced with entity references if this text is serialized
|
|||
|
as XML. In other words, none of the following four characters are present in the text
|
|||
|
node: less-than “<” [&lt;], greater-than “>” [&gt;], ampersand “&” [&amp;], and
carriage-return [&#xD;].
|
|||
|
5.4.1 White Space
|
|||
|
The XMLPARSE function, which may be applied to an XML document that is passed to
|
|||
|
the receiver, offers the options of STRIP WHITESPACE and PRESERVE
|
|||
|
WHITESPACE. STRIP WHITESPACE removes text nodes that contain only white
|
|||
|
space unless the nearest containing element with an xml:space attribute specifies
|
|||
|
xml:space='preserve'.
|
|||
|
To facilitate the processing of STRIP WHITESPACE, text nodes that would be stripped
|
|||
|
by this operation must be identified by the 'W' tag.
|
|||
|
CDATA sections that contain white space that would be stripped by STRIP
|
|||
|
WHITESPACE must be identified by a 'W' tag rather than a 'C' tag. This is seen in the
|
|||
|
following examples:
|
|||
|
Serialized XML: <a> <![CDATA[bcd]]> </a>
|
|||
|
Binary XML: X1a100T1 C3bcdT1 z
|
|||
|
Serialized XML: <a> <![CDATA[ ]]> </a>
|
|||
|
Binary XML: X1a100W1 W1 W1 z
|
|||
|
or
|
|||
|
X1a100W3 z
|
|||
|
If a processor determines that certain white space characters can be removed (e.g.
|
|||
|
ignorable whitespace SAX events), they should be removed instead of being encoded in a
|
|||
|
'W' tag.
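The tag selection shown in the two examples of this section can be sketched as follows. The sketch assumes, as the examples suggest, that adjacent text and CDATA segments are judged together as one text node, and that the caller knows whether xml:space='preserve' is in effect. The name `tag_text_run` is hypothetical, and the output is the readable notation rather than the byte stream.

```python
XML_WHITESPACE = " \t\r\n"


def tag_text_run(segments, space_preserved=False):
    """Encode one contiguous run of (value, is_cdata) segments.

    The run gets 'W' tags only if the merged text is pure white space and
    STRIP WHITESPACE would therefore remove it.
    """
    merged = "".join(value for value, _ in segments)
    strippable = (
        merged != ""
        and merged.strip(XML_WHITESPACE) == ""
        and not space_preserved
    )
    out = []
    for value, is_cdata in segments:
        tag = "W" if strippable else ("C" if is_cdata else "T")
        out.append(f"{tag}{len(value.encode('utf-8'))}{value}")
    return "".join(out)


# <a> <![CDATA[bcd]]> </a>
print(repr(tag_text_run([(" ", False), ("bcd", True), (" ", False)])))
# <a> <![CDATA[ ]]> </a>
print(repr(tag_text_run([(" ", False), (" ", True), (" ", False)])))
```

The first call yields `'T1 C3bcdT1 '` and the second `'W1 W1 W1 '`, matching the content between the element start and the 'z' tag in the examples above.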
|
|||
|
5.5 XML Declaration Tag Notes
|
|||
|
Typically, there is no XML declaration in binary XML. After all, the binary XML
|
|||
|
encoding is always UTF-8. However, if the XML version is not 1.0, then the XML
|
|||
|
declaration is mandatory, just like in serialized XML.
|
|||
|
If the XML declaration tags are present in the binary XML, the tags must include the
|
|||
|
version tag; however, the encoding and standalone tags are optional.
|
|||
|
Example encodings:
|
|||
|
Serialized XML: <?xml version="1.0" encoding="UTF-8"
|
|||
|
standalone="no" ?>
|
|||
|
Binary XML: L31.0D5UTF-8t0
|
|||
|
Serialized XML: <?xml version="1.1" encoding="UTF-16" ?>
|
|||
|
Binary XML: L31.1D6UTF-16
|
|||
|
Serialized XML: <?xml version="1.0" standalone="yes" ?>
|
|||
|
Binary XML: L31.0t1
|
|||
|
Serialized XML: <?xml version="1.1" ?>
|
|||
|
Binary XML: L31.1
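The four encodings above follow one pattern, sketched here in the readable notation. The function name `encode_xml_decl` is hypothetical. In the real byte stream the standalone flag is a single byte (#x0 or #x1), which the notation prints as '0' or '1'.

```python
def encode_xml_decl(version, encoding=None, standalone=None):
    """Render the XML declaration tags: 'L' is required, 'D' and 't' are optional."""
    parts = [f"L{len(version)}{version}"]
    if encoding is not None:
        parts.append(f"D{len(encoding)}{encoding}")
    if standalone is not None:
        parts.append("t1" if standalone else "t0")
    return "".join(parts)


print(encode_xml_decl("1.0", "UTF-8", False))   # L31.0D5UTF-8t0
print(encode_xml_decl("1.1", "UTF-16"))         # L31.1D6UTF-16
print(encode_xml_decl("1.0", standalone=True))  # L31.0t1
print(encode_xml_decl("1.1"))                   # L31.1
```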
|
|||
|
The XML declaration tags are informational only and therefore optional. They provide
|
|||
|
the binary encoding with the information provided in the XML declaration of the source
|
|||
|
document. For example, all text is encoded as UTF-8 in the binary encoding, even if the
|
|||
|
source document used UTF-16. The fact that the source document used UTF-16 can be
|
|||
|
communicated using these tags.
|
|||
|
5.6 DTD and DOCTYPE
|
|||
|
This specification defines a tag for the DOCTYPE. This tag cannot describe an internal
|
|||
|
DTD.
|
|||
|
5.7 Namespace Notes
|
|||
|
Each namespace declaration in the source XML document needs to have a corresponding
|
|||
|
'm' tag in the binary encoding, even if the namespace mapping is being declared again.
|
|||
|
For example:
|
|||
|
<Name xmlns:foo="bar">
|
|||
|
...
|
|||
|
</Name>
|
|||
|
<Person xmlns:foo="bar">
|
|||
|
...
|
|||
|
</Person>
|
|||
|
For the encoding of the Name and Person elements, both must contain an explicit
|
|||
|
namespace mapping using the 'm' tag.
|
|||
|
The namespace declarations appear immediately after the element tag in which they were
|
|||
|
declared.
|
|||
|
An undeclared default namespace is encoded as m00. Elements within undeclared
|
|||
|
namespaces can be encoded with the 'e' tag, or with the 'X' or 'x' tag with 00 for prefix and URI
|
|||
|
StringIDs. Attributes with undeclared namespaces can be encoded with the 'a' tag, or the 'Y'
|
|||
|
tag or 'y' tag with 00 for prefix and URI StringIDs.
|
|||
|
5.8 Hint Tag Notes
|
|||
|
The hint tag is a way to add arbitrary information to the binary encoding. This is
|
|||
|
analogous to the use of the XML schema's xsd:appinfo. It consists of a TLV followed by
|
|||
|
an LV. The 'H' tag indicates that some information is contained in its value field that
|
|||
|
defines what is contained in the following LV. If the reader sees the initial TLV and does
|
|||
|
not understand or want to process it, it can use the length of the following LV to skip it.
|
|||
|
Otherwise, the reader can consume the information. For example, if validation was
|
|||
|
performed in a database with a schema in the database's schema repository, then the
|
|||
|
encoder may want to record exactly which schema it was validated with and could do so
|
|||
|
using this form. Therefore, the encoding could be:
|
|||
|
H11schema-used12http://x.y.z
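A reader that consumes or skips a hint might look like the following sketch. It works on bytes and, for brevity, assumes both lengths are below 128 so that each fits in a single length byte. The function name `read_hint` and the `known` set are hypothetical.

```python
def read_hint(buf, pos, known):
    """Read the hint at buf[pos].

    Returns (name, payload, next_pos); payload is None when the hint name is
    not in `known`, in which case the value is skipped using its length.
    """
    if buf[pos:pos + 1] != b"H":
        raise ValueError("not a hint tag")
    pos += 1
    name_len = buf[pos]
    if name_len >= 0x80:
        raise NotImplementedError("multi-byte lengths are outside this sketch")
    pos += 1
    name = buf[pos:pos + name_len]
    pos += name_len
    value_len = buf[pos]
    if value_len >= 0x80:
        raise NotImplementedError("multi-byte lengths are outside this sketch")
    pos += 1
    end = pos + value_len
    payload = buf[pos:end] if name in known else None
    return name, payload, end


data = b"H" + bytes([11]) + b"schema-used" + bytes([12]) + b"http://x.y.z" + b"Z"
print(read_hint(data, 0, {b"schema-used"}))  # hint understood: payload returned
print(read_hint(data, 0, set()))             # hint unknown: payload skipped
```

In both calls the returned position points at the byte after the hint, so the reader continues with the next tag either way.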
|
|||
|
5.9 Empty Sequence
|
|||
|
XQuery defines an empty sequence. This is represented in the binary stream as a header
|
|||
|
followed by a 'Z' tag.
|
|||
|
5.10 Escaping of Characters
|
|||
|
The tags U and b enable XDBX to record that none of the characters in a text node or
|
|||
|
attribute value need to be escaped via an entity reference. The goal of this feature is to
|
|||
|
speed up serialization of the XDBX binary stream. When any of these tags are used, none
|
|||
|
of the characters in the text or attribute value need to be examined to determine if they
|
|||
|
need escaping.
|
|||
|
The 'U' tag can only be used if none of the characters in the text nodes are:
|
|||
|
• carriage return
|
|||
|
• ampersand
|
|||
|
• greater than
|
|||
|
• less than.
|
|||
|
The 'b' tag can only be used if none of the characters in the attribute values are:
|
|||
|
• carriage return
|
|||
|
• ampersand
|
|||
|
• greater than
|
|||
|
• less than
|
|||
|
• single quote
|
|||
|
• double quote
|
|||
|
• tab
|
|||
|
• linefeed
|
|||
|
Note that this only applies to serialization to Unicode. Serialization to other encodings
|
|||
|
might require numeric character references due to the lack of encodings for certain
|
|||
|
characters in certain codepages.
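An encoder can decide whether the 'U' or 'b' tag is permitted with a simple membership test over the two character lists above. This is a minimal sketch with hypothetical function names; it covers serialization to Unicode only, as noted.

```python
# Characters that rule out the 'U' tag for a text node.
TEXT_FORBIDDEN = set("\r&><")
# The 'b' tag for attribute values additionally excludes quotes, tab, and linefeed.
ATTR_FORBIDDEN = TEXT_FORBIDDEN | set("'\"\t\n")


def can_use_u_tag(text):
    """True if no character of the text node needs an entity reference."""
    return TEXT_FORBIDDEN.isdisjoint(text)


def can_use_b_tag(value):
    """True if no character of the attribute value needs an entity reference."""
    return ATTR_FORBIDDEN.isdisjoint(value)


print(can_use_u_tag("Joe"))      # True
print(can_use_u_tag("a < b"))    # False
print(can_use_u_tag("it's ok"))  # True
print(can_use_b_tag("it's ok"))  # False
```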
|
|||
|
5.11 Private Extensions
|
|||
|
Assuming agreement between a sender and receiver, the specification allows for the
|
|||
|
definition and use of private extensions. This allows the format to support additional
|
|||
|
features that are not currently and explicitly documented. An example is encoding typed
data in elements and attributes in a specific, non-text format. This allows the
encoder to encode the data in the form best suited to the receiver. For example,
|
|||
|
consider the element "weight" that is of type float:
|
|||
|
<weight>75.4</weight>
|
|||
|
Using one of the reserved tags, the encoder can inform the receiver of an alternative,
|
|||
|
more efficient, encoding. This is also useful for user-defined types. Assuming StringIDs
|
|||
|
are off, the preceding element could be encoded as:
|
|||
|
2016weight002407xxxxxxxz
|
|||
|
Where:
|
|||
|
• '201' (decimal; byte #xC9) is a reserved tag defined by the encoder and receiver to define this special
|
|||
|
element encoding.
|
|||
|
• '6' is the length of the string "weight"
|
|||
|
• '0' is the prefix length.
|
|||
|
• '0' is the URI length.
|
|||
|
• '240' (decimal; byte #xF0) is another reserved tag used to indicate that the data is encoded as an IEEE
|
|||
|
float.
|
|||
|
• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary
|
|||
|
encoding of the value as a float.
|
|||
|
Similarly, to encode attribute values, another reserved tag is used. For example:
|
|||
|
<Person weight = "75.4">Joe</Person>
|
|||
|
Assuming StringIDs are off, the attribute portion of this element could be encoded as:
|
|||
|
2106weight002407xxxxxxx
|
|||
|
Where:
|
|||
|
• '210' (decimal; byte #xD2) is the reserved tag defined by the encoder and receiver to define this
|
|||
|
special attribute encoding.
|
|||
|
• '6' is the length of the string "weight".
|
|||
|
• '0' is the prefix length.
|
|||
|
• '0' is the URI length.
|
|||
|
• '240' (decimal; byte #xF0) is another reserved tag used to indicate that the data is encoded as an IEEE
|
|||
|
float.
|
|||
|
• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary
|
|||
|
encoding of the value as a float.
|
|||
|
5.12 Reserved Tags
|
|||
|
The set of reserved tags is for use by encoders that have agreement with the receivers on
|
|||
|
their meaning. These reserved tags will not be reassigned for use in future versions of
|
|||
|
this specification, thus ensuring forward and backward compatibility for implementations
|
|||
|
that choose to use them.
|
|||
|
6 Examples
|
|||
|
The following section documents examples of serialized XML and the corresponding
|
|||
|
binary XML format when various encoding attributes are used.
|
|||
|
Note: The serialized XML values provided in these examples are shown with line breaks
|
|||
|
and indentation to make them more readable. These characters are not included in the
|
|||
|
byte counts shown in the example statistics.
|
|||
|
6.1 Example 1 – Default encoding
|
|||
|
This example shows an XML document and its binary encoding with all the default
|
|||
|
encoding flags.
|
|||
|
Encoding Flags:
|
|||
|
Document Type: XML Document
|
|||
|
StringIDs (required): On
|
|||
|
XML:
|
|||
|
<root>
|
|||
|
<name mgr = "NO">Joe</name>
|
|||
|
<name>Susan</name>
|
|||
|
<name>Bill</name>
|
|||
|
</root>
|
|||
|
Binary XML (Excluding the header):
|
|||
|
X4root100X4name200Y3mgr3002NOT3Joezx200T5Susanzx200T4BillzzZ
|
|||
|
Element, attribute, prefix, and URI IDs are:
|
|||
|
1==root, 2==name, 3==mgr
|
|||
|
Statistics:
|
|||
|
75 bytes of XML
|
|||
|
60 bytes of binary + 8 byte header = 68 bytes of binary XML
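The binary string of Example 1 can be split back into its tags with a short tokenizer. This sketch handles only the tags that occur in this example and assumes every length and StringID is a single decimal digit, which holds here but not in general. The function name `tokenize` is hypothetical.

```python
def tokenize(notation):
    """Split the readable notation into tag tuples (single-digit lengths and IDs)."""
    tokens, i = [], 0

    def take_lv():
        nonlocal i
        n = int(notation[i])
        value = notation[i + 1:i + 1 + n]
        i += 1 + n
        return value

    def take_ids():
        nonlocal i
        ids = tuple(int(c) for c in notation[i:i + 3])
        i += 3
        return ids

    while i < len(notation):
        tag = notation[i]
        i += 1
        if tag in "zZ":        # end of element / end of document
            tokens.append((tag,))
        elif tag == "X":       # element name as LV, then name, prefix, URI IDs
            name = take_lv()
            tokens.append((tag, name, take_ids()))
        elif tag == "x":       # element by StringIDs only
            tokens.append((tag, take_ids()))
        elif tag == "Y":       # attribute name as LV, three IDs, value as LV
            name = take_lv()
            ids = take_ids()
            tokens.append((tag, name, ids, take_lv()))
        elif tag == "T":       # text node
            tokens.append((tag, take_lv()))
        else:
            raise ValueError(f"tag {tag!r} is not handled by this sketch")
    return tokens


EXAMPLE_1 = "X4root100X4name200Y3mgr3002NOT3Joezx200T5Susanzx200T4BillzzZ"
tokens = tokenize(EXAMPLE_1)
string_ids = {t[2][0]: t[1] for t in tokens if t[0] in "XY"}
print(len(EXAMPLE_1))  # 60
print(string_ids)      # {1: 'root', 2: 'name', 3: 'mgr'}
```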
|
|||
|
6.2 Example 2 – Sequence
|
|||
|
This example shows an XML sequence with multiple items, including a comment node, a
|
|||
|
document node, an element node, and an atomic value. In the binary XML, 'b' is used to
|
|||
|
denote blanks.
|
|||
|
Encoding Flags:
|
|||
|
Document Type: XML Sequence
|
|||
|
StringIDs (required): On
|
|||
|
XML:
|
|||
|
<!--comment-->
|
|||
|
<name mgr = "NO">  Joe  </name>
|
|||
|
Susan
|
|||
|
<name>Bill</name>
|
|||
|
Binary XML (Excluding the header):
|
|||
|
c7comment@dX4name100Y3mgr2002NOT7bbJoebbz@V5Susan@x100T4BillzZ
|
|||
|
Element, attribute, prefix, and URI IDs are:
|
|||
|
1==name, 2==mgr
|
|||
|
Statistics:
|
|||
|
67 bytes of XML
|
|||
|
62 bytes of binary + 8 byte header = 70 bytes of binary XML
|
|||
|
6.3 Example 3 – StringIDs
|
|||
|
This example shows an XML document and its binary encoding with stringIDs on.
|
|||
|
Encoding Flags:
|
|||
|
Document Type: XML Document
|
|||
|
StringIDs (required): On
|
|||
|
XML:
|
|||
|
<root xmlns:foo = "bar">
|
|||
|
<Person>
|
|||
|
<name mgr = "NO">Bill</name>
|
|||
|
<foo:age>35</foo:age>
|
|||
|
</Person>
|
|||
|
<Person>
|
|||
|
<name mgr = "NO">Joe</name>
|
|||
|
<foo:age>45</foo:age>
|
|||
|
</Person>
|
|||
|
</root>
|
|||
|
Binary XML (Excluding the header):
|
|||
|
I3foo1I3bar2X4root300m12X6Person400X4name500Y3mgr6002NOT4BillzX3age712T
|
|||
|
235zze4e5a62NOT3Joezx712T245zzzZ
|
|||
|
Element, attribute, prefix, and URI IDs are:
|
|||
|
1==foo, 2==bar, 3==root, 4==Person, 5==name, 6==mgr, 7==age
|
|||
|
Statistics:
|
|||
|
162 bytes of XML
|
|||
|
103 bytes of binary + 8 byte header = 111 bytes of binary XML
|
|||
|
6.4 Example 4 – Namespaces with StringIDs
|
|||
|
This example shows an XML document with multiple namespaces and its binary
|
|||
|
encoding with stringIDs.
|
|||
|
Encoding Flags:
|
|||
|
Document Type: XML Document
|
|||
|
StringIDs (required): On
|
|||
|
XML:
|
|||
|
<root>
|
|||
|
<Person xmlns:foo = "bar">
|
|||
|
<name mgr = "NO">Bill</name>
|
|||
|
<foo:age>35</foo:age>
|
|||
|
</Person>
|
|||
|
<Person xmlns:foo = "baz">
|
|||
|
<name foo:mgr = "NO">Joe</name>
|
|||
|
<foo:age>45</foo:age>
|
|||
|
</Person>
|
|||
|
<Person xmlns:bar = "food">
|
|||
|
<name bar:mgr = "YES">Susan</name>
|
|||
|
</Person>
|
|||
|
<Person xmlns:bar = "foo">
|
|||
|
<name bar:exec = "YES">Amy</name>
|
|||
|
</Person>
|
|||
|
</root>
|
|||
|
Binary XML (Excluding the header):
|
|||
|
X4root100I3foo2I3bar3X6Person400m23X4name500Y3mgr6002NOT4BillzX3age723T
|
|||
|
235zzI3baz8e4m28e5y6282NOT3Joezx728T245zzI4food9e4m39e5y6393YEST5Susanz
|
|||
|
ze4m32e5Y4exec10323YEST3AmyzzzZ
|
|||
|
Element, attribute, prefix, and URI IDs are:
|
|||
|
1==root, 2==foo, 3==bar, 4==Person, 5==name, 6==mgr, 7==age, 8==baz, 9==food,
|
|||
|
10==exec
|
|||
|
Statistics:
|
|||
|
322 bytes of XML
|
|||
|
173 bytes of binary + 8 byte header = 181 bytes of binary XML
|
|||
|
6.5 Example 5 – Mixed Content
|
|||
|
This example shows how mixed content is encoded.
|
|||
|
Encoding Flags:
|
|||
|
Document Type: XML Document
|
|||
|
StringIDs (required): On
|
|||
|
XML:
|
|||
|
<a>text<b/>more text</a>
|
|||
|
Binary XML (Excluding the header):
|
|||
|
X1a100T4textX1b200zT9more textzZ
|
|||
|
Element, attribute, prefix, and URI IDs are:
|
|||
|
1==a, 2==b
|
|||
|
Statistics:
|
|||
|
24 bytes of XML
|
|||
|
32 bytes of binary + 8 byte header = 40 bytes of binary XML
|
|||
|
6.6 Example 6 – White Space
|
|||
|
This example shows a binary XML document with all of the white space characters that
|
|||
|
are shown in the corresponding serialized XML document. In the binary XML, 'b' is used
|
|||
|
to denote a blank and 'a' is used to indicate a linefeed character.
|
|||
|
Encoding Flags:
|
|||
|
Document Type: XML Document
|
|||
|
StringIDs (required): On
|
|||
|
XML:
|
|||
|
<employee>
   <name xml:space="preserve"><fn>Susan</fn> <ln>Smith</ln></name>
   <address xml:space="default">
      <state>MA</state>
   </address>
</employee>
|
|||
|
Binary XML (Excluding the header):
|
|||
|
X8employee100W4abbbX4name200I3xml3I5space4y4308preserveX2fn600T5SusanzT1
|
|||
|
bX2ln700T5SmithzzW4abbbX7address800y4307defaultW7abbbbbbX5state900T2MAz
|
|||
|
W4abbbzW1azZ
|
|||
|
Element, attribute, prefix, and URI IDs are:
|
|||
|
1==employee, 2==name, 3==xml, 4==space, 6==fn, 7==ln, 8==address,
|
|||
|
9==state
|
|||
|
Statistics:
|
|||
|
160 bytes of XML
|
|||
|
155 bytes of binary + 8 byte header = 163 bytes of binary XML
|
|||
|
Appendix A Complete XDBX BNF
|
|||
|
XDBX            ::= Header DocumentContent
Header          ::= DocIdentifier HeaderLength MajorVersion
                    EncodingFlags HeaderFill
DocIdentifier   ::= #xCA #x3B   /* In binary: 11001010 00111011 */
HeaderLength    ::= #x5
MajorVersion    ::= #x1
EncodingFlags   ::= FourBytes
HeaderFill      ::= Byte*
DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd
                    /* Which branch to choose is controlled
                       by EncodingFlags */
DocumentEnd     ::= 'Z'
XMLSequence     ::= (SequenceItem ('@' SequenceItem)*)?
SequenceItem    ::= Anywhere
                    (CompleteDoc | Comment | PI | AtomicValue
                     | Element)
                    Anywhere
CompleteDoc     ::= 'd' XMLDocument
AtomicValue     ::= 'V' LengthValue
XMLDocument     ::= (Anywhere XMLDecl)? Misc*
                    (DocType | Misc*)? Element Misc*
Anywhere        ::= (SI | Hint | Reserved)*
Misc            ::= Comment | PI | SI | Hint
DocType         ::= 'F' StringID StringID StringID
XMLDecl         ::= XMLVersion Encoding? Standalone?
XMLVersion      ::= 'L' LengthValue
                    /* The value is a valid XML version. "1.0"
                       or "1.1" for now */
Encoding        ::= 'D' LengthValue
Standalone      ::= 't' BooleanValue
Element         ::= (ElementI | ElementSII | ElementIII)
                    ElementContent
                    EndElement
ElementI        ::= 'e' StringID
ElementSII      ::= 'X' LengthValue StringID StringID StringID
ElementIII      ::= 'x' StringID StringID StringID
EndElement      ::= 'z'
ElementContent  ::= NSDecls Attributes Children
NSDecls         ::= (Anywhere NSDecl)*
NSDecl          ::= NSDeclII
NSDeclII        ::= 'm' StringID StringID
Attributes      ::= (Anywhere Attribute)*
Attribute       ::= (AttributeI | AttributeSII | AttributeIII)
                    AttributeValue
AttributeI      ::= 'a' StringID
AttributeSII    ::= 'Y' LengthValue StringID StringID StringID
AttributeIII    ::= ('y' | 'b') StringID StringID StringID
AttributeValue  ::= LengthValue
                    /* If 'b' is used, then no &,',",
                       <,>,#xD,#xA,#x9 can appear in value */
Children        ::= (Misc | Element | Text)*
Text            ::= ('T' | 'U' | 'C' | 'W' ) LengthValue
Comment         ::= 'c' LengthValue
PI              ::= PII
PII             ::= 'P' StringID LengthValue
SI              ::= 'I' LengthValue StringID
Hint            ::= 'H' LengthValue LengthValue
Reserved        ::= [#xC9 - #xFA] Byte*
LengthValue     ::= Length Value
Length          ::= VariableInteger
Value           ::= Byte*
                    /* Number of bytes governed by preceding
                       length */
StringID        ::= VariableInteger
TypeID          ::= VariableInteger
VariableInteger ::= (LongLeading | ShortLeading)? LastByte
LongLeading     ::= [#x81-#x8F] [#x80-#xFF]? [#x80-#xFF]?
                    [#x80-#xFF]?
ShortLeading    ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]?
LastByte        ::= [#x0-#x7F]
BooleanValue    ::= False | True
False           ::= #x0
True            ::= #x1
FourBytes       ::= Byte Byte Byte Byte
Byte            ::= [#x0-#xFF]
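The VariableInteger productions can be exercised with the following sketch. It relies on two assumptions that the grammar alone does not state: each byte carries 7 payload bits with the most significant group first, and the high bit marks every byte except the last. Under that reading, the restriction of a five-byte value to a first byte in #x81-#x8F limits values to 32 bits. The function names are hypothetical.

```python
def encode_varint(n):
    """Encode an unsigned 32-bit value as 1 to 5 bytes (assumed big-endian groups)."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("value out of range for this sketch")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()
    # Every byte except the last has its high bit set.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_varint(buf, pos=0):
    """Decode one VariableInteger at buf[pos]; returns (value, next_pos)."""
    value = 0
    for count in range(1, 6):
        byte = buf[pos]
        pos += 1
        if count == 1 and byte == 0x80:
            raise ValueError("#x80 cannot start a VariableInteger")
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:  # LastByte
            if value > 0xFFFFFFFF:
                raise ValueError("five-byte form must start with #x81-#x8F")
            return value, pos
    raise ValueError("VariableInteger longer than five bytes")


print(encode_varint(128).hex())        # 8100
print(decode_varint(b"\x82\x2cZ"))     # (300, 2)
```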
|
|||
|