diff --git a/nx01-x4o-driver/todo-binxml.txt b/nx01-x4o-driver/todo-binxml.txt deleted file mode 100644 index d9ddbd7..0000000 --- a/nx01-x4o-driver/todo-binxml.txt +++ /dev/null @@ -1,1004 +0,0 @@ -Extensible Dynamic Binary XML, -Client/Server Binary XML Format -(XDBX) -Version 1.0 -(July 14, 2010) - -Permission to copy and display the Extensible Dynamic Binary XML, Client/Server -Binary XML Format (XDBX) (the "Specification"), in any medium without fee or -royalty is hereby granted by IBM (collectively, the "Authors"), provided that you include -the following on ALL copies of the Specification, or portions thereof, that you make: -1. A link or URL to the Specification at one of the Authors websites. -2. The copyright notice as shown in the Specification. -The Authors each agree to grant you a royalty-free license, under reasonable, non- -discriminatory terms and conditions to their respective patents that they deem necessary -to implement the Specification. -THE SPECIFICATION IS PROVIDED "AS IS," AND THE AUTHORS MAKE NO -REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, -BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR -A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE; THAT THE -CONTENTS OF THE SPECIFICATION ARE SUITABLE FOR ANY PURPOSE; NOR -THAT THE IMPLEMENTATION OF SUCH CONTENTS WILL NOT INFRINGE -ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER -RIGHTS. -THE AUTHORS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, -INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR -RELATING TO ANY USE OR DISTRIBUTION OF THE SPECIFICATION. -The name and trademarks of the Authors may NOT be used in any manner, including -advertising or publicity pertaining to the Specification or its contents without specific, -written prior permission. Title to copyright in the Specification will at all times remain -with the Authors. -No other rights are granted by implication, estoppel or otherwise. -© Copyright IBM Corporation 2010. 
- -Abstract -The solution that is presented in this document allows an encoder to produce the binary -XML format using one or more of a set of attributes. The encoder can choose which -attributes to include based on knowledge of the receiver. The receiver that reads the -binary XML format can inspect the format header to determine the attributes with which -it is encoded. This can be purely informational, or allow the receiver the opportunity to -optimize its configuration to more efficiently process the attributes contained in the -format. - -Table of Contents -1 Motivation 1 -2 Encoding overview 2 -3 Format Header 3 -3.1 Layout of the format header 3 -3.2 XDBX Major Version 3 -3.3 Encoding Flags 4 -3.3.1 Document Type 4 -3.3.2 StringID Flags 4 -3.3.3 Valid Flag 4 -3.4 Example of a Format header 5 -4 Format Content 6 -4.1 Conventions 7 -4.1.1 How Values and Lengths are Encoded 7 -4.2 Encoding of Single Documents and Sequences 9 -4.3 Encoding of XML Declarations 10 -4.4 Encoding of Elements 11 -4.5 Encoding of Attributes 12 -4.6 Encoding of Namespace Mappings 13 -4.7 Encoding of Text 13 -4.8 Encoding of Comments 14 -4.9 Encoding of Processing Instructions 14 -4.10 Encoding of Other Information 14 -4.11 Reserved Values for Tags 15 -5 Format details 16 -5.1 Encoding Single Documents and Sequences 16 -5.2 StringIDs 16 -5.2.1 Examples of StringID Usage 16 -5.3 StringID Notes 20 -5.4 Text Notes 20 -5.4.1 White Space 20 -5.5 XML Declaration Tag Notes 21 -5.6 DTD and DOCTYPE 21 -i -5.7 Namespace Notes 21 -5.8 Hint Tag Notes 22 -5.9 Empty Sequence 22 -5.10 Escaping of Characters 22 -5.11 Private Extensions 23 -5.12 Reserved Tags 24 -6 Examples 25 -6.1 Example 1 – Default encoding 25 -6.2 Example 2 – Sequence 26 -6.3 Example 3 – StringIDs 27 -6.4 Example 4 – Namespaces with StringIDs 28 -6.5 Example 5 – Mixed Content 29 -6.6 Example 6 – White Space 30 -Appendix A Complete XDBX BNF 31 -ii -1 Motivation -Binary serialization of XML is desirable because it allows encoding of 
XML data in a -smaller and more efficient form than textual XML format. The binary XML format is -more efficient for various reasons. These include: -• Multiple occurrences of repeated text are condensed through the use of StringIDs. -StringIDs are integer identifiers that replace text strings. -• When a parser processes data in a pretokenized format, the parser does not need -to search for as many token delimiters in the content, or handle as many edge -cases. -• All values are prefixed with their length. When the parser has length information, -it does not need to search for the ends of element names or values. -• All entity references are expanded in binary XML format. The XML parser does -not need to expand entity references. -The binary XML format has the following disadvantages: -• Loss of XML interoperability. Data that is in a proprietary format can be used -only on systems that have the software to decode it. -• The encoder must do extra processing to: -o Perform validation -o Perform well-formedness checking -o Resolve all entity references -o Identify repeated tags for replacement with StringIDs -This binary XML format is not intended as a replacement for XML. It can provide better -performance than XML when it is used in the implementation of some APIs. -In general, the benefits of the binary XML format outweigh the disadvantages. The -additional processing time that the encoder requires is usually less than the processing -time that is used for parsing an XML document, especially when the XML document -must be parsed more than once. -1 -2 Encoding overview -This binary XML representation contains a format header followed by a number of tags. -The format header has encoding attributes which give the receiver some useful properties -of the binary XML. -The following characteristics of the binary encoding are constant, regardless of the source -document or how the binary encoding is performed: -• All text is encoded as UTF-8.
-• All entity references in the source document are replaced by their values. -• Line breaks are normalized. -• Attributes are normalized. -• Where applicable, data is encoded in big-endian format. -The binary XML format is made up of various tokens (tags) and values. When binary -XML format is viewed with a standard text editor or as ASCII in a debugger, the tags -display as single ASCII characters. This can aid in debugging while making the binary -XML format more humanly readable. -2 -3 Format Header -3.1 Layout of the format header -The binary XML format contains a header with information about how the format was -constructed. The header information allows the parser to configure itself in order to -process the message most efficiently. -To identify the format and its attributes, the following scheme is used for the first set of -bytes of the document: -(2 bytes) – Binary XML document identifier (“magic number”) -(1 byte) – Header length (not including magic number or the length byte itself) -(1 byte) – XDBX major version -(4 byte Integer) – Encoding flags -The “magic number” will always be this value in binary: 11001010 00111011 -DocumentContent follows the Header. HeaderLength determines the length of the -Header. -BNF -XDBX ::= Header DocumentContent -Header ::= DocIdentifier HeaderLength MajorVersion -EncodingFlags HeaderFill -DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */ -HeaderLength ::= #x5 -MajorVersion ::= #x1 -EncodingFlags ::= FourBytes -HeaderFill ::= Byte* -FourBytes ::= Byte Byte Byte Byte -Byte ::= [#x0-#xFF] -3.2 XDBX Major Version -There is just one major version of XDBX, identified by the XDBX major version value of -0x01 (version 1). In this version, the HeaderLength must be at least 5. -XDBX version 1 streams contain any of the following tags: 'e', 'X', 'x', 'z', 'a', 'Y', 'y', 'b', -'m', 'T', 'U', 'C', ‘W’, 'V', 'L', 'D', 't', 'I', 'Z', '@', 'd', 'P', 'c', 'H'. 
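As an illustration of the header layout in section 3.1, the following Python sketch splits a byte string into its header fields. The function name parse_xdbx_header and the returned dictionary keys are hypothetical and are not defined by this specification. The sketch assumes that HeaderLength counts the version byte, the four flag bytes, and any HeaderFill bytes.

```python
import struct

MAGIC = b"\xCA\x3B"  # DocIdentifier: binary 11001010 00111011


def parse_xdbx_header(data: bytes) -> dict:
    """Hypothetical helper: read the XDBX format header from a byte string."""
    if len(data) < 3 or data[:2] != MAGIC:
        raise ValueError("missing XDBX magic number 0xCA 0x3B")
    header_length = data[2]  # excludes the magic number and the length byte
    if header_length < 5:
        raise ValueError("HeaderLength must be at least 5")
    content_offset = 3 + header_length
    if len(data) < content_offset:
        raise ValueError("stream ends inside the header")
    major_version = data[3]
    # Encoding flags: signed four-byte integer, big-endian.
    (flags,) = struct.unpack(">i", data[4:8])
    return {
        "header_length": header_length,
        "major_version": major_version,
        "flags": flags,
        "header_fill": data[8:content_offset],  # HeaderFill, usually empty
        "content_offset": content_offset,       # where DocumentContent starts
    }


sample = bytes([0xCA, 0x3B, 0x05, 0x01, 0x00, 0x00, 0x00, 0x02]) + b"Z"
header = parse_xdbx_header(sample)
```

A decoder built this way can skip unknown HeaderFill bytes by jumping straight to content_offset.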
-The set of tags that an XDBX encoder generates is implementation defined. However, an -XDBX encoder must assign a valid XDBX major version number to each generated -stream, and ensure that each stream contains only tags that are allowed for that XDBX -major version. -3 -An XDBX decoder is required to fully support the tag set assigned to an implementation- -defined XDBX major version level. It must be able to decode all valid tags from the -corresponding tag set. However, XDBX decoders can reject XDBX streams that are -identified by an XDBX major version that is higher than the version that the decoder -supports. -3.3 Encoding Flags -The format for encoding flags allows for future expansion. Encoding flags, or features, -can be added as needed. The header consists of indicators that signal to a processor how -the format is encoded. Each encoding flag is a bit in a four-byte integer field in the -header. -The following encoding flags can be used in the binary XML format. Each encoding flag -is listed along with its value in the four-byte integer header field. -3.3.1 Document Type -This attribute indicates whether the binary stream represents one complete well-formed -XML document or a sequence of items, as defined by the XQuery 1.0 specification. -• XML document (Value: x00000000) -• XML sequence (Value: x00000001) -3.3.2 StringID Flags -The flags that are associated with stringIDs are: -• StringID flag -• Dense stringIDs used -3.3.2.1 StringID Flag (required) -This encoding flag (x00000002) must be set. -3.3.2.2 Dense StringIDs Used Flag -Certain implementations might require the stringIDs that are used in the binary XML to -be small numbers so that they can be used as indexes in an array (as opposed to a hash -table). -When specified (x00000020), this encoding flag notifies the receiver that the stringIDs -are small numbers. In general, small numbers are monotonically increasing numbers. The -stringID value 0 (zero) is reserved. 
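The flag values listed so far can be tested with simple bit masks. The following Python sketch is illustrative only: the constant names and the describe_flags function are assumptions, and only the Document Type, StringID, and Dense StringIDs flags are covered.

```python
FLAG_SEQUENCE = 0x00000001          # Document Type: set for an XML sequence
FLAG_STRING_IDS = 0x00000002        # StringID flag (required)
FLAG_DENSE_STRING_IDS = 0x00000020  # Dense StringIDs used


def describe_flags(flags: int) -> dict:
    """Hypothetical helper: interpret the four-byte encoding flags field."""
    if not flags & FLAG_STRING_IDS:
        raise ValueError("the StringID flag (x00000002) must be set")
    return {
        "document_type": "sequence" if flags & FLAG_SEQUENCE else "document",
        "dense_string_ids": bool(flags & FLAG_DENSE_STRING_IDS),
    }


plain = describe_flags(0x00000002)
dense_sequence = describe_flags(0x00000023)
```

A receiver that sees dense_string_ids set could, for example, store the stringID dictionary in an array instead of a hash table.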
-3.3.3 Valid Flag -When specified (x00000080), this encoding flag notifies the receiver that the XML -document or sequence of items conforms to a schema. This may have been determined by -4 -the use of a validating XML parser, or by construction from objects that are associated -with a schema. -The use of this information by the receiver is beyond the scope of this specification. A -receiver may choose to ignore this information. -3.4 Example of a Format header -Binary XML Document Identifier: 11001010 00111011 -Header Length: 00000101 -XDBX major version: 00000001 -Encoding flags: -• Document Type (Bit 1): XML Document -• StringID (Bit 2): On -Magic Num Hdr Len Version Encoding flags -11001010 00111011 00000101 00000001 00000000 00000000 00000000 00000010 -5 -4 Format Content -The following combinations of information are used in binary XML document encoding: -• TLV - Tag-Length-Value -• TV - Tag-Value -• LV - Length-Value -• TLVid - Tag-Length-Value-StringID -• ID - StringID -Some content is denoted via a TLV, while other content uses the shorter LV. This is -done for compactness, where a second tag is unnecessary and can be inferred from the -previous tag. The specification also uses TV when the length is known to be one. In -addition, TLVid is used when StringIDs are used, and is how a first occurrence of a string -value is assigned its ID. Finally, there is an ID format if only the stringID is needed. -6 -4.1 Conventions -All the lengths are expressed as a number of bytes. -A summary of each tag in the format and its meaning is contained in the tables that -follow. The values in the Tag column are the decimal values of the tags. The values in -the ASCII column are the ASCII encoding of the tag values. -The following conventions are used: -• TLV(localname) - a TLV for the localname is defined, where 'Value' is the text of -the localname. 
-• TLV(localname)/LV(prefix)/LV(uri) - a TLV for the localname, followed by an -LV for the namespace prefix, followed by an LV for the namespace URI. -• TLVid(localname) - a TLVid for the localname is defined where stringID is the -ID assigned to the text for localname. -• Tid(localname)/id(prefix)/id(uri) - a Tag-StringID for the localname, followed by -the stringID of the namespace prefix, followed by the stringID of the namespace -URI. The StringID references a string in the dictionary. -BNF -LengthValue ::= Length Value -Length ::= VariableInteger -Value ::= Byte* -/* Number of bytes governed by preceding -length */ -StringID ::= VariableInteger -VariableInteger ::= (LongLeading | ShortLeading)? LastByte -LongLeading ::= [#x81-#x8F] -[#x80-#xFF]? [#x80-#xFF]? [#x80-#xFF]? -ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]? -LastByte ::= [#x0-#x7F] -4.1.1 How Values and Lengths are Encoded -Encoding Attributes in the format header are always encoded as signed four-byte integers -in big-endian format. -For space efficiency, all other values and lengths are encoded as a variable number of -bytes, with the first byte containing the highest order bits for the integer, the next byte -containing the next highest order bits, and so on. This allows the encoding to represent -any arbitrary integer in as few bytes as possible. However, this specification limits the -integer to a value representable in a signed 32-bit integer, so a length is always less than 2 GB. Each byte -contains seven bits of the integer's value, with the highest order bit of each byte -7 -designated as a flag bit. A byte's flag bit is off if the byte is the last byte (lowest order -byte) of a variable length byte sequence for a number, and on in every preceding byte. Because only as many bytes as -necessary to represent an integer are used, integers between 0 and 127 are represented in -one byte with the flag bit off. Integers between 128 and 16,383 are represented in two -bytes with the flag bit set in the first byte, and so on.
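The variable-length integer rules described above can be sketched in Python as follows. The function names encode_varint and decode_varint are hypothetical. The sketch assumes non-negative values only and enforces the signed 32-bit limit and the five-byte maximum implied by the VariableInteger BNF.

```python
from typing import Tuple

MAX_VALUE = 0x7FFFFFFF  # largest value representable in a signed 32-bit integer


def encode_varint(n: int) -> bytes:
    """Hypothetical helper: encode n with seven value bits per byte, high bits first."""
    if not 0 <= n <= MAX_VALUE:
        raise ValueError("value out of range")
    groups = [n & 0x7F]            # last byte: flag bit off
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)  # leading bytes: flag bit on
        n >>= 7
    return bytes(reversed(groups))


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Hypothetical helper: return (value, position after the last byte read)."""
    value = 0
    for count in range(5):         # at most five bytes per the BNF
        if pos + count >= len(data):
            raise ValueError("truncated variable-length integer")
        byte = data[pos + count]
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:            # flag bit off: this was the last byte
            if value > MAX_VALUE:
                raise ValueError("value out of range")
            return value, pos + count + 1
    raise ValueError("variable-length integer longer than five bytes")


two_bytes = encode_varint(673)
decoded = decode_varint(two_bytes)
```

Here 673 splits into the seven-bit groups 0000101 and 0100001, so it is written as the two bytes 10000101 00100001.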
-Examples: -• A length of binary 00000101 means 5 -• A length of binary 10000101 00100001 means 673 (binary 1010100001) -8 -4.2 Encoding of Single Documents and Sequences -A binary stream can represent one complete well-formed XML document or a sequence -of items, as defined by the XQuery specification. This information is encoded in the -format header with the following encoding flags: -• XML Document (Value: x00000000) -• XML Sequence (Value: x00000001) -Each item in the sequence can be a complete document, a subtree, or an atomic value. -BNF -DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd -/* Which branch to choose is controlled -by EncodingFlags */ -DocumentEnd ::= 'Z' -XMLDocument ::= (Anywhere XMLDecl)? Misc* -(DocType | Misc*)? Element Misc* -XMLSequence ::= (SequenceItem -(SequenceSeparator SequenceItem)*)? -SequenceItem ::= Anywhere -(CompleteDoc | Comment | PI -| AtomicValue | Element) -Anywhere -SequenceSeparator ::= '@' -CompleteDoc ::= 'd' XMLDocument -Anywhere ::= (SI | Hint | Reserved)* -Misc ::= Comment | PI | SI | Hint -DocType ::= 'F' StringID StringID StringID -Tags -Value ASCII Meaning -90 Z End of the binary stream -64 @ Separator for items in an XML sequence -100 d Document node (assumed for XML documents, not assumed in XML -sequences) -70 F DOCTYPE in Tid(rootElementName) /id(systemID)/id(publicID) -9 -4.3 Encoding of XML Declarations -BNF -XMLDecl ::= XMLVersion Encoding? Standalone? -XMLVersion ::= 'L' LengthValue -/* The value is a valid XML version. -"1.0" or "1.1" for now */ -Encoding ::= 'D' LengthValue -Standalone ::= 't' BooleanValue -BooleanValue ::= False | True -False ::= #x0 -True ::= #x1 -Tags -Value ASCII Meaning -76 L XML version in TLV(version) form. -68 D Encoding in TLV(encoding) form. -116 t Standalone in TV(standalone) form where the value of 'standalone' is -either 0 or 1. 
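As a sketch of the XMLDecl production in section 4.3, the following Python code emits the 'L', 'D', and 't' tags. The function names are hypothetical, and the small _varint helper assumes the seven-bits-per-byte length encoding of section 4.1.1. Lengths are written as raw bytes here, not as printable digits.

```python
from typing import Optional


def _varint(n: int) -> bytes:
    """Assumed length encoding: seven bits per byte, flag bit on all but the last byte."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))


def _lv(text: str) -> bytes:
    """Length-Value pair for a UTF-8 string; the length counts bytes."""
    raw = text.encode("utf-8")
    return _varint(len(raw)) + raw


def encode_xml_decl(version: str,
                    encoding: Optional[str] = None,
                    standalone: Optional[bool] = None) -> bytes:
    """Hypothetical helper: XMLVersion Encoding? Standalone?"""
    out = b"L" + _lv(version)                  # version tag is mandatory
    if encoding is not None:
        out += b"D" + _lv(encoding)            # optional encoding tag
    if standalone is not None:
        out += b"t" + (b"\x01" if standalone else b"\x00")  # TV form
    return out


full_decl = encode_xml_decl("1.0", "UTF-8", False)
short_decl = encode_xml_decl("1.1")
```

The standalone value uses the TV form because its length is always one byte.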
-10 -4.4 Encoding of Elements -BNF -Element ::= (ElementI | ElementSII | ElementIII) -ElementContent -EndElement -ElementI ::= 'e' StringID -ElementSII ::= 'X' LengthValue StringID StringID StringID -ElementIII ::= 'x' StringID StringID StringID -EndElement ::= 'z' -ElementContent ::= NSDecls Attributes Children -Children ::= (Misc | Element | Text)* -Tags -Value ASCII Meaning -101 e Tid(localname) -Used when the element is not associated with a namespace. -88 X TLVid(localname) / id(prefix) / id(uri) -Used when the stringID for the element name is not yet defined. If the -element is in the default namespace, then the prefix stringID is zero. If -the element is not in a namespace, then the URI stringID is zero. -120 x Tid(localname) / id(prefix) / id(uri) -Used when the stringID for the element name is already defined. If the -element is in the default namespace, then the prefix stringID will be -zero. If the element is not in a namespace, then the URI stringID is zero. -122 z End Element -11 -4.5 Encoding of Attributes -BNF -Attributes ::= (Anywhere Attribute)* -Attribute ::= (AttributeI | AttributeSII | AttributeIII) -AttributeValue -AttributeI ::= 'a' StringID -AttributeSII ::= 'Y' LengthValue StringID StringID StringID -AttributeIII ::= ('y' | 'b') StringID StringID StringID -AttributeValue ::= LengthValue -/* If 'b' is used, then no &,',",<, ->,#xD,#xA,#x9 can appear in value */ -Tags -Value ASCII Meaning -97 a Tid(localname) / LV(attribute-value) -Used when the attribute is not associated with a namespace. -89 Y TLVid(localname) / id(prefix) / id(uri) / LV(attribute-value) -Used when the stringID for the attribute name is not yet defined. If the -attribute is not in a namespace, then the prefix stringID and URI -stringID are zero. -121 y Tid(localname) / id(prefix) / id(uri) / LV(attribute-value) -Used when the stringID for the attribute name is already defined. If the -attribute is not in a namespace, then the prefix stringID and URI -stringID are zero.
-98 b Tid(localname) / id(prefix) / id(uri) / LV(attribute-value) -Similar to the 'y' tag. Characters that cannot be used in the value are: -• '<' (#x3c) -• '>' (#x3e) -• '&' (#x26) -• carriage return (#x0d) -• single quote (#x27) -• double quote (#x22) -• tab (#x09) -• linefeed (#x0a) -Because no characters need to be escaped when this attribute node is -serialized, this feature should speed up serialization. -12 -4.6 Encoding of Namespace Mappings -BNF -NSDecls ::= (Anywhere NSDecl)* -NSDecl ::= NSDeclII -NSDeclII ::= 'm' StringID StringID -Tags -Value ASCII Meaning -109 m Tid(prefix) /id(namespace-uri) -Declares a namespace mapping of a prefix stringID to a namespace URI -stringID. For default namespace declarations, the stringID for the prefix -is zero. -4.7 Encoding of Text -BNF -Text ::= ('T' | 'U' | 'C' | 'W') LengthValue -AtomicValue ::= 'V' LengthValue -Tags -Value ASCII Meaning -84 T Text node in TLV(text) form. -85 U Text node in TLV(text) form. The '<' (#x3c), '>' (#x3e), '&' (#x26), and -carriage return (#x0d) characters cannot be used in the value. Because -no characters need to be escaped when this text node is serialized, this -feature should speed up serialization. -67 C CDATA string in TLV(text) form. -87 W Text node containing only white space in TLV(text) form. White space -consists of one or more space (#x20) characters, carriage returns (#x0d), -line feeds (#x0a), tabs (#x09), Unicode line separator characters -(#x2028), or NELs (#x85). -Used when a text node contains only white space, unless the nearest -containing element with an xml:space attribute specifies -xml:space='preserve'. -86 V Atomic Value in TLV(text) form. -13 -4.8 Encoding of Comments -BNF -Comment ::= 'c' LengthValue -Tags -Value ASCII Meaning -99 c Comment in TLV(comment) form. -4.9 Encoding of Processing Instructions -BNF -PI ::= PII -PII ::= 'P' StringID LengthValue -Tags -Value ASCII Meaning -80 P Processing instruction in Tid(target)/LV(value) form. 
-The 'P' tag cannot declare an ID for the target of the processing instruction. Instead, an 'I' -tag should be used to define the stringID for the target. Then the 'P' tag is used to define -the processing instruction itself. -Although this is unlike the behavior for element and attribute tags, this was done to avoid -creating several tags to describe a processing instruction. -4.10 Encoding of Other Information -BNF -SI ::= 'I' LengthValue StringID -Hint ::= 'H' LengthValue LengthValue -Tags -Value ASCII Meaning -73 I Definition of a stringID in TLVid(string) form. Used only when the -StringID flag is set. -72 H Hint in TLV/LV form. -14 -4.11 Reserved Values for Tags -BNF -Reserved ::= [#xC9 - #xFA] Byte* -Tags -Value ASCII Meaning -201 --250 -Reserved for use by applications. -Values 201 through 250 are reserved for use by applications, and will not be used as tags -in future versions of this specification. These reserved values can be used to define -private extensions to the format for features not accounted for in this version of the -specification. See the Private Extensions section on page 23 for more information. -15 -5 Format details -This section provides additional details on the binary XML format. -5.1 Encoding Single Documents and Sequences -Whether an XDBX instance represents an XML document or a sequence of items is -encoded in the XDBX header. Most commonly, the binary stream represents an XML -Document. In this case, the document node as defined by the XML data models is -assumed. In other words, there is no need to start the document with a 'd' tag. If the binary -stream represents an XML Sequence, then the document node is not assumed, and any -document node in the stream needs to be denoted with a 'd' tag. Note that XPath behaves -differently whether there is a document node or not. -It is important to note that if stringIDs are used, the encoder must ensure that all stringIDs -are valid from one item to the next. 
In other words, the stringIDs are global to the binary -XML stream. Combining multiple documents together as items in a sequence could have -a size advantage, because the stringIDs would need to be defined only once. -5.2 StringIDs -Usage of stringIDs results in a smaller encoding, because the StringIDs are typically -smaller than the text they represent. In addition, the use of StringIDs can allow the data -in binary XML format to be processed more efficiently. The receiver must be prepared to -manage the StringIDs that appear in the document. This requires establishing and -managing lookup tables to efficiently reconcile StringIDs with the text they represent. -In some encodings the first occurrence of the text is written as text, then where that text -appears again, it is replaced with an ID that is computed during the processing of the first -occurrence. In other encodings all text, or only a portion of the text, could be represented -by an ID, where the ID is a reference to a dictionary that is contained in the message. -A StringID can be used only after the tag that defines it. -5.2.1 Examples of StringID Usage -The following shows example encodings of namespace declarations, elements, and -attributes when StringIDs are used. -Namespace Declaration: -The namespace declaration portion of the element tag: is -encoded as I3foo1I3bar2m12, where: -• 'I' assigns the StringID '1' to "foo" and '2' to "bar" -• 'm' declares the namespace mapping of "foo" to '1' and "bar" to '2'. -16 -Suppose that the namespace prefix is reassigned to a different uri later in the document. -For example: -
-The encoding of the namespace declaration is: -I3baz3m13, where '3' is the StringID assigned to "baz". -Element with no prefix and no namespace: -The first occurrence of
is encoded as: X7Address100, where: -• 'X' is the tag indicating an element name is encoded with StringIDs, and that a -length/value/ID tuple follows defining the localname and its associated ID, -followed by the stringIDs for the namespace prefix and namespace uri. -• '7' is the length of the localname string "Address" and '1' is the assigned ID for -that string. -• '0' is the stringID for "no namespace prefix". -• '0' is the stringID for "no namespace uri". -Subsequent occurrences of
are encoded more compactly as e1, where '1' is the -StringID for the string "Address". -Element with no prefix and the default namespace: -The first occurrence of
is encoded as: X7Address104 where: -• 'X' is the tag indicating an element name is encoded with StringIDs, and that a -length/value/ID tuple follows defining the localname and its associated ID. -• '0' is the stringID for the namespace prefix (because there is none). -• '4' is the stringID of the namespace uri. -Subsequent occurrences of
are encoded more compactly as x104, where -• '1' is the StringID for the string "Address". -• '4' is the stringID for the namespace uri. -Element with prefix: -The first occurrence of is encoded as X7Address154, where: -• '1' is the StringID assigned to the string "Address". -• '5' is the stringID that was previously assigned to "foo". -• '4' is the stringID that was previously assigned to the namespace uri. -Subsequent occurrences of are encoded more compactly as x154, where -'1' is the StringID for the string "Address". -17 -Attribute with no prefix (and thus no namespace): -The first occurrence of the attribute portion of is encoded as -Y3mgr9002NO where: -• 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a -length/value/id tuple for the attribute name. -• '3' is the length of the attribute name "mgr". -• '9' is the StringID assigned the string "mgr". -• '0' for the stringID of the prefix. -• '0' for the stringID of the URI. -• '2' is the length of the attribute value: "NO". -Subsequent occurrences of the attribute portion of are encoded as -a92NO, where: -• 'a' indicates an attribute declaration with StringIDs. -• '9' is the stringID of the attribute name. -• '2' is the length/value of the attribute value: "NO". -Attribute with prefix: -The first occurrence of the attribute portion of is encoded as: -Y3mgr9542NO where: -• 'Y' is the tag indicating an attribute name is encoded with StringIDs followed by a -length/value/id tuple for the attribute name. -• '5' for the stringID for prefix. -• '4' for the stringIDs for URI. -• '3' is the length of the attribute name "mgr". -• '9' is the StringID assigned the string "mgr". -• '5' is the stringID for the prefix. -• '4' is the stringID for the URI. -• '2' is the length of the attribute value "NO". -Subsequent occurrences of the attribute portion of are encoded -more compactly as: y9542NO, where: -• 'y' is the tag indicating an attribute declaration with StringIDs. 
-• '9' the stringID for the attribute name -• '5' is the stringID for prefix. -• '4' is the stringID for URI. -• '2' the length/value of the attribute value "NO". -Elements, Text, and namespaceIDs: -This section ties together some of the concepts described above and assumes StringIDs -are used. For example: -18 - -ABC - - -The namespace declaration in the above XML is encoded as: I3foo1I3bar2m12, where: -• '1' represents the StringID for "foo". -• '2' is the StringID for "bar". -• 'm12' is the structure to identify a mapping of foo ('1') to bar ('2'). -Therefore, the first occurrence of foo:Address is encoded as follows: -X7Address912T3ABCz where: -• 'X' indicates an element name expressed in LVid form. -• '7Address' is the LV for the localname. -• '9' is the StringID for "Address". -• '12' is a reference to the namespace mapping of foo to bar. -• 'T3ABC' is the TLV for the text node and 'z' represents the end element tag. -The subsequent occurrence of foo:Address are encoded more compactly as follows: -x912C3DEFz where: -• 'x' indicates an element name expressed in id form. -• '9' is the StringID for "Address". -• '12' is a reference to the namespace mapping of foo to bar. -• 'C3DEF' is the TLV for the CDATA. -• 'z' represents the end element tag. (NOTE: The encoder could choose to encode -the CDATA as a text node via 'T'.) -The first occurrence of foo:Address must use the more expansive form of an element -name 'X', where the second occurrence can use the more compact version 'x' because the -element name is already encoded with a stringID. -The following table summarizes the encoding of an element in various forms with -StringIDs on: -No Namespace Namespace -First Occurrence Subsequent -Occurrences -First Occurrence Subsequent -Occurrences -
X7Address100 e1 X7Address902 x902 - N/A N/A X7Address912 x912 -The following table summarizes the encoding of an attribute in various forms with -StringIDs on: -No Namespace Namespace -First -Occurrence -Subsequent -Occurrences -First -Occurrence -Subsequent -Occurrences - Y3mgr9002NO a92NO Y3mgr9022NO y9022NO - N/A N/A Y3mgr9122NO y9122NO -19 -5.3 StringID Notes -StringIDs are considered global. For example, if the string "Person" is given the stringID -4, this value will exist for the entire binary XML document. It is invalid for "Person" to -be given a different stringID, or for 4 to be assigned another string in the same binary -XML document. -The stringID value 0 (zero) is reserved and is used to mark "no namespace prefix" and -"no namespace URI". -5.4 Text Notes -Multiple text and/or CDATA tags can appear one after another in order to handle -arbitrarily large amounts of data. They are also used to encode mixed content. -It is up to the encoder whether to encode CDATA using the 'C' tag or a 'T' tag, because -they are semantically identical. The 'C' tag exists for applications that want to preserve -the CDATA syntax. Beyond the difference between CDATA and text as described in the -XML specification, this binary XML specification treats them identically. -The 'U' tag is similar to the 'T' tag, except that the encoder guarantees that none of the -characters in the 'U' tag need to be replaced with entity references if this text is serialized -as XML. In other words, none of the following four characters are present in the text -node: less-than “<” [&lt;], greater-than “>” [&gt;], ampersand “&” [&amp;], and -carriage-return [&#xD;]. -5.4.1 White Space -The XMLPARSE function, which may be applied to an XML document that is passed to -the receiver, offers the options of STRIP WHITESPACE and PRESERVE -WHITESPACE. STRIP WHITESPACE removes text nodes that contain only white -space unless the nearest containing element with an xml:space attribute specifies -xml:space='preserve'.
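The choice among the 'W', 'U', and 'T' text tags defined in section 4.7 can be sketched in Python as shown here. This is an illustration under assumptions: the function name choose_text_tag is hypothetical, and the preserve_space parameter stands in for the case where the nearest containing element with an xml:space attribute specifies xml:space='preserve'. CDATA handling is not covered.

```python
# White space characters listed for the 'W' tag: space, carriage return,
# line feed, tab, Unicode line separator (#x2028), and NEL (#x85).
WHITE_SPACE = frozenset(" \r\n\t\u2028\u0085")

# Characters that rule out the 'U' tag: '<', '>', '&', and carriage return.
NEEDS_ESCAPING = frozenset("<>&\r")


def choose_text_tag(text: str, preserve_space: bool = False) -> str:
    """Hypothetical helper: pick the tag for a text node."""
    if text and not preserve_space and all(ch in WHITE_SPACE for ch in text):
        return "W"   # white-space-only node that STRIP WHITESPACE would remove
    if not any(ch in NEEDS_ESCAPING for ch in text):
        return "U"   # nothing to escape when serialized as XML
    return "T"       # general text node


tags = [choose_text_tag(t) for t in (" \n", "Joe", "a<b")]
```

An encoder may always fall back to 'T' for nodes that are not white-space-only, because the 'U' tag is only an optimization for serialization.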
-To facilitate the processing of STRIP WHITESPACE, text nodes that would be stripped -by this operation must be identified by the 'W' tag. -CDATA sections that contain white space that would be stripped by STRIP -WHITESPACE must be identified by a 'W' tag rather than a 'C' tag. This is seen in the -following examples: -Serialized XML: -Binary XML: X1a100T1 C3bcdT1 z -Serialized XML: -Binary XML: X1a100W1 W1 W1 z -or -X1a100W3 z -If a processor determines that certain white space characters can be removed (e.g. -ignorable whitespace SAX events), they should be removed instead of being encoded in a -'W' tag. -20 -5.5 XML Declaration Tag Notes -Typically, there is no XML declaration in binary XML. After all, the binary XML -encoding is always UTF-8. However, if the XML version is not 1.0, then the XML -declaration is mandatory, just like in serialized XML. -If the XML declaration tags are present in the binary XML, the tags must include the -version tag, however, the encoding and standalone tags are optional. -Example encodings: -Serialized XML: -Binary XML: L31.0D5UTF-8t0 -Serialized XML: -Binary XML: L31.1D6UTF-16 -Serialized XML: -Binary XML: L31.0t1 -Serialized XML: -Binary XML: L31.1 -The XML declaration tags are informational only and therefore optional. They provide -the binary encoding with the information provided in the XML declaration of the source -document. For example, all text is encoded as UTF-8 in the binary encoding, even if the -source document used UTF-16. The fact that the source document used UTF-16 can be -communicated using these tags. -5.6 DTD and DOCTYPE -This specification defines a tag for the DOCTYPE. This tag cannot describe an internal -DTD. -5.7 Namespace Notes -Each namespace declaration in the source XML document needs to have a corresponding -'m' tag in the binary encoding, even if the namespace mapping is being declared again. -For example: - -... - - -... 
- -For the encoding of the Name and Person elements, both must contain an explicit -namespace mapping using the 'm' tag. -The namespace declarations appear immediately after the element tag in which they were -declared. -21 -An undeclared default namespace is encoded as m00. Elements within undeclared -namespaces can be encoded with 'e' tag, 'X' tag, or 'x' tags with 00 for prefix and URI -StringIDs. Attributes with undeclared namespaces can be encoded with a tag, or the 'Y' -tag or 'y' tag with 00 for prefix and URI StringIDs. -5.8 Hint Tag Notes -The hint tag is a way to add arbitrary information to the binary encoding. This is -analogous to the use of the XML schema's xsd:appinfo. It consists of a TLV followed by -an LV. The 'H' tag indicates that some information is contained in its value field that -defines what is contained in the following LV. If the reader sees the initial TLV and does -not understand or want to process it, it can use the length of the following LV to skip it. -Otherwise, the reader can consume the information. For example, if validation was -performed in a database with a schema in the database's schema repository, then the -encoder may want to record exactly which schema it was validated with and could do so -using this form. Therefore, the encoding could be: -H11schema-used12http://x.y.z -5.9 Empty Sequence -XQuery defines an empty sequence. This is represented in the binary stream as a header -followed by a 'Z' tag. -5.10 Escaping of Characters -The tags U and b enable XDBX to record that none of the characters in a text node or -attribute value need to be escaped via an entity reference. The goal of this feature is to -speed up serialization of the XDBX binary stream. When any of these tags are used, none -of the characters in the text or attribute value need to be examined to determine if they -need escaping. 
-The 'U' tag can only be used if none of the characters in the text nodes are: -• carriage return -• ampersand -• greater than -• less than. -The 'b' tag can only be used if none of the characters in the attribute values are: -• carriage return -• ampersand -• greater than -• less than -• single quote -• double quote -• tab -• linefeed -22 -Note that this only applies to serialization to Unicode. Serialization to other encodings -might require numeric character references due to the lack of encodings for certain -characters in certain codepages. -5.11 Private Extensions -Assuming agreement between a sender and receiver, the specification allows for the -definition and use of private extensions. This allows the format to support additional -features that are not currently and explicitly documented. An example of this is for type -encoding data in elements and attributes in a specific, non-text format. This allows the -encoder to encode the data in the most optimal form for the receiver. For example, -consider the element "weight" that is of type float: -75.4 -Using one of the reserved tags, the encoder can inform the receiver of an alternative, -more efficient, encoding. This is also useful for user-defined types. Assuming StringIDs -are off, the preceding element could be encoded as: -2016weight002407xxxxxxxz -Where: -• '#x201' is a reserved tag defined by the encoder and receiver to define this special -element encoding. -• '6' is the length of the string "weight" -• '0' is the prefix length. -• '0' is the URI length. -• '#x240' is another reserved tag used to indicate that the data is encoded as an IEEE -float. -• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary -encoding of the value as a float. -Similarly, to encode attribute values, another reserved tag is used. 
For example: -Joe -Assuming StringIDs are off, the attribute portion of this element could be encoded as: -2106weight002407xxxxxxx -Where: -• '#x210' is the reserved tag defined by the encoder and receiver to define this -special attribute encoding. -• '6' is the length of the string "weight". -• '0' is the prefix length. -• '0' is the URI length. -• '#x240' is another reserved tag used to indicate that the data is encoded as an IEEE -float. -• '7' is the length of the encoded data, and 'xxxxxxx' is used to represent the binary -encoding of the value as a float. -23 -5.12 Reserved Tags -The set of reserved tags is for use by encoders that have agreement with the receivers on -their meaning. These reserved tags will not be reassigned for use in future versions of -this specification, thus ensuring forward and backward compatibility for implementations -that choose to use them. -24 -6 Examples -The following section documents examples of serialized XML and the corresponding -binary XML format when various encoding attributes are used. -Note: The serialized XML values provided in these examples are shown with line breaks -and indentation to make them more readable. These characters are not included in the -byte counts shown in the example statistics. -6.1 Example 1 – Default encoding -This example shows an XML document and its binary encoding with all the default -encoding flags. -Encoding Flags: -Document Type: XML Document -StringIDs (required): On -XML: - -Joe -Susan -Bill - -Binary XML (Excluding the header): -X4root100X4name200Y3mgr3002NOT3Joezx200T5Susanzx200T4BillzzZ -Element, attribute, prefix, and URI IDs are: -1==root, 2==name, 3==mgr -Statistics: -75 bytes of XML -60 bytes of binary + 8 byte header = 68 bytes of binary XML -25 -6.2 Example 2 – Sequence -This example shows an XML sequence with multiple items, including a comment node, a -document node, an element node, and an atomic value. In the binary XML, 'b' is used to -denote blanks. 
-Encoding Flags: -Document Type: XML Sequence -StringIDs (required): On -XML: - - Joe -Susan -Bill -Binary XML (Excluding the header): -c7comment@dX4name100Y3mgr2002NOT7bbJoebbz@V5Susan@x100T4BillzZ -Element, attribute, prefix, and URI IDs are: -1==name, 2==mgr -Statistics: -67 bytes of XML -62 bytes of binary + 8 byte header = 70 bytes of binary XML -26 -6.3 Example 3 – StringIDs -This example shows an XML document and its binary encoding with StringIDs on. -Encoding Flags: -Document Type: XML Document -StringIDs (required): On -XML: - - -Bill -35 - - -Joe -45 - - -Binary XML (Excluding the header): -I3foo1I3bar2X4root300m12X6Person400X4name500Y3mgr6002NOT4BillzX3age712T -235zze4e5a62NOT3Joezx712T245zzzZ -Element, attribute, prefix, and URI IDs are: -1==foo, 2==bar, 3==root, 4==Person, 5==name, 6==mgr, 7==age -Statistics: -162 bytes of XML -103 bytes of binary + 8 byte header = 111 bytes of binary XML -27 -6.4 Example 4 – Namespaces with StringIDs -This example shows an XML document with multiple namespaces and its binary -encoding with StringIDs. -Encoding Flags: -Document Type: XML Document -StringIDs (required): On -XML: - - -Bill -35 - - -Joe -45 - - -Susan - - -Amy - - -Binary XML (Excluding the header): -X4root100I3foo2I3bar3X6Person400m23X4name500Y3mgr6002NOT4BillzX3age723T -235zzI3baz8e4m28e5y6282NOT3Joezx728T245zzI4food9e4m39e5y6393YEST5Susanz -ze4m32e5Y4exec10323YEST3AmyzzzZ -Element, attribute, prefix, and URI IDs are: -1==root, 2==foo, 3==bar, 4==Person, 5==name, 6==mgr, 7==age, 8==baz, 9==food, -10==exec -Statistics: -322 bytes of XML -173 bytes of binary + 8 byte header = 181 bytes of binary XML -28 -6.5 Example 5 – Mixed Content -This example shows how mixed content is encoded. 
-Encoding Attributes: -Document Type: XML Document -StringIDs (required): On -XML: -<a>text<b/>more text</a> -Binary XML (Excluding the header): -X1a100T4textX1b200zT9more textzZ -Element, attribute, prefix, and URI IDs are: -1==a, 2==b -Statistics -24 bytes of XML -32 bytes of binary + 8 byte header = 40 bytes of binary XML -29 -6.6 Example 6 – White Space -This example shows a binary XML document with all of the white space characters that -are shown in the corresponding serialized XML document. In the binary XML, 'b' is used -to denote a blank and 'a' is used to indicate a linefeed character. -Encoding Flags: -Document Type: XML Document -StringIDs (required): On -XML: -<employee> -   <name xml:space="preserve"><fn>Susan</fn> <ln>Smith</ln></name> -   <address xml:space="default"> -      <state>MA</state> -   </address> -</employee>
-Binary XML (Excluding the header): -X8employee100W4abbbX4name200I3xml3I5space4y4308preserveX2fn600T5SusanzT1 -bX2ln700T5SmithzzW4abbbX7address800y4307defaultW7abbbbbbX5state900T2MAz -W4abbbzW1azZ -Element, attribute, prefix, and URI IDs are: -1==employee, 2==name, 3==xml, 4==space, 5==name, 6==fn, 7==ln, 8==address, -9==state -Statistics: -160 bytes of XML -155 bytes of binary + 8 byte header = 163 bytes of binary XML -30 -Appendix A Complete XDBX BNF -XDBX ::= Header DocumentContent -Header ::= DocIdentifier HeaderLength MajorVersion -EncodingFlags HeaderFill -DocIdentifier ::= #xCA #x3B /* In binary: 11001010 00111011 */ -HeaderLength ::= #x5 -MajorVersion ::= #x1 -EncodingFlags ::= FourBytes -HeaderFill ::= Byte* -DocumentContent ::= (XMLDocument | XMLSequence) DocumentEnd -/* Which branch to choose is controlled -by EncodingFlags */ -DocumentEnd ::= 'Z' -XMLSequence ::= (SequenceItem ('@' SequenceItem)*)? -SequenceItem ::= Anywhere -(CompleteDoc | Comment | PI | AtomicValue -| Element) -Anywhere -CompleteDoc ::= 'd' XMLDocument -AtomicValue ::= 'V' LengthValue -XMLDocument ::= (Anywhere XMLDecl)? Misc* -(DocType | Misc*)? Element Misc* -Anywhere ::= (SI | Hint | Reserved)* -Misc ::= Comment | PI | SI | Hint -DocType ::= 'F' StringID StringID StringID -XMLDecl ::= XMLVersion Encoding? Standalone? -XMLVersion ::= 'L' LengthValue -/* The value is a valid XML version. 
"1.0" -or "1.1" for now */ -Encoding ::= 'D' LengthValue -Standalone ::= 't' BooleanValue -Element ::= (ElementI | ElementSII | ElementIII) -ElementContent -EndElement -ElementI ::= 'e' StringID -ElementSII ::= 'X' LengthValue StringID StringID StringID -31 -ElementIII ::= 'x' StringID StringID StringID -EndElement ::= 'z' -ElementContent ::= NSDecls Attributes Children -NSDecls ::= (Anywhere NSDecl)* -NSDecl ::= NSDeclII -NSDeclII ::= 'm' StringID StringID -Attributes ::= (Anywhere Attribute)* -Attribute ::= (AttributeI | AttributeSII | AttributeIII) -AttributeValue -AttributeI ::= 'a' StringID -AttributeSII ::= 'Y' LengthValue StringID StringID StringID -AttributeIII ::= ('y' | 'b') StringID StringID StringID -AttributeValue ::= LengthValue -/* If 'b' is used, then no &,',", -<,>,#xD,#xA,#x9 can appear in value */ -Children ::= (Misc | Element | Text)* -Text ::= ('T' | 'U' | 'C' | 'W' ) LengthValue -Comment ::= 'c' LengthValue -PI ::= PII -PII ::= 'P' StringID LengthValue -SI ::= 'I' LengthValue StringID -Hint ::= 'H' LengthValue LengthValue -Reserved ::= [#xC9 - #xFA] Byte* -LengthValue ::= Length Value -Length ::= VariableInteger -Value ::= Byte* -/* Number of bytes governed by preceding -length */ -StringID ::= VariableInteger -TypeID ::= VariableInteger -VariableInteger ::= (LongLeading | ShortLeading)? LastByte -LongLeading ::= [#x81-#x8F] [#x80-#xFF]? [#x80-#xFF]? -[#x80-#xFF]? -ShortLeading ::= [#x90-#xFF] [#x80-#xFF]? [#x80-#xFF]? -LastByte ::= [#x0-#x7F] -32 -BooleanValue ::= False | True -False ::= #x0 -True ::= #x1 -FourBytes ::= Byte Byte Byte Byte -Byte ::= [#x0-#xFF] -33 \ No newline at end of file diff --git a/nx01-x4o-driver/versions.txt b/nx01-x4o-driver/versions.txt deleted file mode 100644 index 340e535..0000000 --- a/nx01-x4o-driver/versions.txt +++ /dev/null @@ -1,34 +0,0 @@ - -=== X4O versions === - -Version 0.8.7: -- Created jdk7(CSS) javadoc compatible documentation. -- Create language task api and converted the current tasks. 
-- Renamed X4OLanguageContext to X4OLanguageSession. -- Renamed ElementNamespaceContext to ElementNamespace. -- Removed binding handler from element interface. -- Refactored all property keys with PropertyConfig bean. -- Change global attr to namespace attributes. -- Updated ant/maven plugins to new task and properties. -- Added options to xml writer like OUTPUT_LINE_BREAK_WIDTH. - - -Version 0.8.6: -- Changed to X4ODriver interface. -- Added (simple) write support -- Added ant and maven plugins - -Version 0.8.5: -- Made module loading system. -- Added eld to schema generator. -- Added eld to html generator. -- Cleaned xml uri naming. -- refactored ELD tag names. -- Made elddoc ~working. -- Changed phase enum to text phases. - -Version 0.8.0: -- Changed packages to org.x4o -- Made converter two way -- Added debug writer - diff --git a/nx01-x4o-driver/todo.txt b/src/site/wigiti/README-x4o.md similarity index 54% rename from nx01-x4o-driver/todo.txt rename to src/site/wigiti/README-x4o.md index 89b4172..3cbbbb0 100644 --- a/nx01-x4o-driver/todo.txt +++ b/src/site/wigiti/README-x4o.md @@ -1,5 +1,20 @@ +# X4O --- x4o TODO list -- +X4O is not an XML parser but a recursive self configuring XML dialect language library. + +X4O is very old code from pre 1.5 non-generics nice object java. + +## 2025 TODO + +- Add 18 bit SAX4 XML read and write support +- RM 8 bit String DEP, replace javax.el by simple obj map +- Upgrade X4O element language to support 18 bit XML +- Remove some features to ease "write" and SAX4 support +- Move all XML uri's to oasis style thus replacing all internal http namespace locators +- Add jaxb annotation support to define a x4o language and have XSD and documentation tools +- Cleanup old todo/ideas from below + +## OLD todo -- Fix debug output -- RM function methods from Element interface. @@ -32,7 +47,7 @@ - Add w3c html namespace in eld for description tag - move Boolean default from code to xml + conf. 
--- IDEAS -- +### OLD IDEAS - add mini xslt parse on top of streaming api. - add support javax.xml.xpath for xpath support @@ -42,21 +57,19 @@ - make element tree jdom api compatible - Test if possible to use threadpool for executing phases -?? v2; -x4o-driver -x4o-s4j-jaxp (dom,sax,stax,xslt) (jsr; 5,63,173) -x4o-s4j-sax -x4o-s4j-stax -x4o-s4j-jaxb (jsr; 222) +test v2; +- x4o-driver +- x4o-s4j-jaxp (dom,sax,stax,xslt) (jsr; 5,63,173) +- x4o-s4j-sax +- x4o-s4j-stax +- x4o-s4j-jaxb (jsr; 222) +### OLD v1 todo NON-CODE --- TODO for version 1.0 -- - -## NON-CODE - Add tutorial - doc eld and x4o lang files -##CODE +### OLD v1 todo CODE - Add (super) tag for extending tags of other namespace - XMLOverrideEvent - inbound sax parser !! - @@ -76,3 +89,38 @@ x4o-s4j-jaxb (jsr; 222) - SAX events as input source - (70%) XML debug output + + +## OLD versions + +Version 0.8.7: +- Created jdk7(CSS) javadoc compatible documentation. +- Create language task api and converted the current tasks. +- Renamed X4OLanguageContext to X4OLanguageSession. +- Renamed ElementNamespaceContext to ElementNamespace. +- Removed binding handler from element interface. +- Refactored all property keys with PropertyConfig bean. +- Change global attr to namespace attributes. +- Updated ant/maven plugins to new task and properties. +- Added options to xml writer like OUTPUT_LINE_BREAK_WIDTH. + + +Version 0.8.6: +- Changed to X4ODriver interface. +- Added (simple) write support +- Added ant and maven plugins + +Version 0.8.5: +- Made module loading system. +- Added eld to schema generator. +- Added eld to html generator. +- Cleaned xml uri naming. +- refactored ELD tag names. +- Made elddoc ~working. +- Changed phase enum to text phases. + +Version 0.8.0: +- Changed packages to org.x4o +- Made converter two way +- Added debug writer +