1. Parsing XML documents
This section and its subsection define the XML parser.
This specification defines the parsing rules for XML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The handling for parse errors is well-defined: user agents must either act as described below when encountering such problems, or must terminate processing at the first error that they encounter for which they do not wish to apply the rules described below.
1.1. Overview
The input to the XML parsing process consists of a stream of octets which is converted to a stream of code points, which in turn are tokenized, and finally those tokens are used to construct a tree.
1.2. Parse Errors
This specification defines the parsing rules for XML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (it consists of the processing rules described throughout this specification), but user agents, while parsing an XML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.
Code | Description |
---|---|
abrupt-closing-of-empty-comment | This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (>) code point (i.e., <!--> or <!--->). The parser behaves as if the comment is closed correctly. |
abrupt-closing-xml-declaration | This error occurs if the parser encounters an unclosed quote in an XML declaration (e.g., <?xml version="1?>). |
colon-before-attr | This error occurs if the parser encounters a U+003A COLON (:) in a tag after the tag name but before an attribute name (e.g., <tag :attr). Attributes can have namespaces, but namespaces can’t be empty. |
eof-in-cdata | This error occurs if the parser encounters the end of the input stream in a CDATA section. The parser treats such CDATA sections as if they are closed immediately before the end of the input stream. |
eof-in-comment | This error occurs if the parser encounters the end of the input stream in a comment. The parser treats such comments as if they are closed immediately before the end of the input stream. |
eof-in-doctype | This error occurs if the parser encounters the end of the input stream in a DOCTYPE section. |
eof-in-tag | This error occurs if the parser encounters the end of the input stream in a start tag or an end tag (e.g., <div id=). Such a tag is ignored. |
eof-in-xml-declaration | This error occurs if the parser encounters the end of the input stream in an XML declaration (e.g., <?xml). |
incorrectly-opened-comment | This error occurs if the parser encounters the <! code point sequence that is not immediately followed by two U+002D (-) code points and that is not the start of a DOCTYPE or a CDATA section. |
invalid-xml-declaration | This error occurs if the parser encounters any code point sequence other than "PUBLIC " and "SYSTEM " keywords after a DOCTYPE name. In such a case, the parser ignores any following public or system identifiers. |
missing-whitespace-before-doctype-name | This error occurs if the parser encounters a DOCTYPE whose keyword and name are not separated by ASCII whitespace (e.g., <!DOCTYPE). In this case the parser behaves as if ASCII whitespace is present. |
missing-doctype-name | This error occurs if the parser encounters a DOCTYPE that is missing a name (e.g., <!DOCTYPE>). |
1.3. Input stream
The stream of Unicode characters that constitutes the input to the tokenization stage will be initially seen by the user agent as a stream of octets (typically coming over the network or from the local file system). The octets encode Unicode code points according to a particular encoding, which the user agent must use to decode the octets into code points.
Define how to find the encoding
Decide how to deal with null values
1.4. Tokenization
Implementations must act as if they used the following state machine to tokenize XML. The state machine must start in the data state. Most states consume a single character, which may have various side effects, and then either switch the state machine to a new state to reconsume the current input character, switch it to a new state to consume the next character, or stay in the same state to consume the next character. Some states have more complicated behavior and can consume several characters before switching to another state. In some cases, the tokenizer state is also changed by the tree construction stage.
When a state says to reconsume a matched character in a specified state, that means to switch to that state, but when it attempts to consume the next input character, provide it with the current input character instead.
The next input character is the first character in the input stream that has not yet been consumed or explicitly ignored by the requirements in this section. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed.
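The relationship between the next input character, the current input character and reconsuming can be illustrated with a small helper. This is only a sketch: the class name `InputStream`, its methods, and the use of `None` for EOF are assumptions made for illustration and are not defined by this specification.

```python
from typing import Optional

EOF = None  # hypothetical sentinel standing in for the end of the input stream


class InputStream:
    """Hypothetical helper tracking the next and current input characters."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._pos = 0  # index of the next input character
        self.current: Optional[str] = None  # last character consumed
        self._reconsume = False

    def consume(self) -> Optional[str]:
        """Return the next input character, or EOF once the input is exhausted."""
        if self._reconsume:
            # A state asked to reconsume: hand out the current character again.
            self._reconsume = False
            return self.current
        if self._pos < len(self._text):
            self.current = self._text[self._pos]
            self._pos += 1
        else:
            self.current = EOF
        return self.current

    def reconsume(self) -> None:
        """Make the next call to consume() return the current input character."""
        self._reconsume = True
```

A state that says "reconsume the current input character in the data state" would, in this sketch, call `reconsume()` and then switch its state variable to the data state.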
Decide how to deal with namespaces
1.4.1. Data state
Consume the next input character:
- U+0026 AMPERSAND (&)
- Switch to character reference in data state.
- U+003C LESS-THAN SIGN (<)
- Switch to the tag open state.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the current input character as character. Stay in this state.
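The four branches of the data state can be written as a single dispatch function. The sketch below is illustrative only: the function name, the `(kind, data)` token shape, the state-name strings and the use of `None` for EOF are assumptions, not part of this specification.

```python
from typing import Callable, Optional, Tuple

Token = Tuple[str, Optional[str]]  # hypothetical token shape: (kind, data)


def data_state(ch: Optional[str], emit: Callable[[Token], None]) -> Optional[str]:
    """Handle one consumed character in the data state.

    `ch` is the consumed character, or None for EOF. Returns the name of the
    next state, or None once the end-of-file token has been emitted.
    """
    if ch == "&":
        return "character reference in data"
    if ch == "<":
        return "tag open"
    if ch is None:
        emit(("end-of-file", None))
        return None
    # Anything else: emit the character and stay in the data state.
    emit(("character", ch))
    return "data"
```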
1.4.2. Character reference in data state
- Switch to the data state.
Attempt to consume a character reference.
If nothing is returned, emit a U+0026 AMPERSAND (&) character token. Otherwise, emit the character tokens that were returned.
1.4.3. Tag open state
- Consume the next input character:
- U+002F SOLIDUS (/)
- Switch to the end tag open state.
- U+003F QUESTION MARK (?)
- Switch to the pi state.
- U+0021 EXCLAMATION MARK (!)
- Switch to the markup declaration state.
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- U+003A COLON (:)
- U+003C LESS-THAN SIGN (<)
- U+003E GREATER-THAN SIGN (>)
- EOF
- Parse error. Emit a U+003C LESS-THAN SIGN (<) character. Reconsume the current input character in the data state.
- Anything else
- Create a new tag token, then reconsume the current input character in the tag name state.
1.4.4. End tag open state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Emit a short end tag token and then switch to the data state.
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- U+003C LESS-THAN SIGN (<)
- U+003A COLON (:)
- EOF
- Parse error. Emit a U+003C LESS-THAN SIGN (<) character token and a U+002F SOLIDUS (/) character token. Reconsume the current input character in the data state.
- Anything else
- Create an end tag token, then reconsume the current input character in the end tag name state.
1.4.5. End tag name state
Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Switch to the end tag name after state.
- U+002F SOLIDUS (/)
- Parse error. Switch to the end tag name after state.
- EOF
- Parse error. Emit the end tag token and then reprocess the current input character in the data state.
- U+003E GREATER-THAN SIGN (>)
- Emit the end tag token and then switch to the data state.
- Anything else
- Append the current input character to the tag name and stay in the current state.
1.4.6. End tag name after state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Emit the end tag token and then switch to the data state.
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Stay in the current state.
- EOF
- Parse error. Emit the current token and then reprocess the current input character in the data state.
- Anything else
- Parse error. Stay in the current state.
1.4.7. Tag name state
- Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Switch to the tag attribute name before state.
- U+003E GREATER-THAN SIGN (>)
- Emit the start tag token and then switch to the data state.
- EOF
- This is an eof-in-tag parse error. Emit the current token and then reprocess the current input character in the data state.
- U+002F SOLIDUS (/)
- Set current tag to empty tag. Switch to the empty tag state.
- Anything else
- Append the current input character to the tag name and stay in the current state.
1.4.8. Empty tag state
- Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Emit the current tag token as empty tag token and then switch to the data state.
- Anything else
- Reconsume in tag attribute value before state.
1.4.9. Tag attribute name before state
Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Stay in the current state.
- U+003E GREATER-THAN SIGN (>)
- Emit the current token and then switch to the data state.
- U+002F SOLIDUS (/)
- Set current tag to empty tag. Switch to the empty tag state.
- U+003A COLON (:)
- This is a colon-before-attr parse error. Stay in the current state.
- EOF
- This is an eof-in-tag parse error. Emit the current token and then reprocess the current input character in the data state.
- Anything else
- Start a new attribute in the current tag token. Set that attribute’s name to the current input character and its value to the empty string and then switch to the tag attribute name state.
1.4.10. Tag attribute name state
Consume the next input character:
- U+003D EQUALS SIGN (=)
- Switch to the tag attribute value before state.
- U+003E GREATER-THAN SIGN (>)
- Emit the current token as start tag token. Switch to the data state.
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Switch to the tag attribute name after state.
- U+002F SOLIDUS (/)
- Set current tag to empty tag. Switch to the empty tag state.
- EOF
- This is an eof-in-tag parse error. Emit the current token as start tag token and then reprocess the current input character in the data state.
- Anything else
- Append the current input character to the current attribute’s name. Stay in the current state.
When the user agent leaves this state (and before emitting the tag token, if appropriate), the complete attribute’s name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be dropped, along with the value that gets associated with it (if any).
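The duplicate-attribute rule can be sketched as follows. The `TagToken` class and its `add_attribute` method are hypothetical names used only for illustration. For brevity the sketch performs the check once the value is known rather than on leaving the attribute name state; under that assumption the resulting token is the same.

```python
from typing import Dict


class TagToken:
    """Hypothetical tag token that keeps attributes in source order."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attributes: Dict[str, str] = {}

    def add_attribute(self, name: str, value: str) -> bool:
        """Store the attribute unless an attribute with the same name exists.

        Returns False when the new attribute (and its value) was dropped,
        which the caller should report as a parse error.
        """
        if name in self.attributes:
            return False
        self.attributes[name] = value
        return True
```

With this sketch, tokenizing `<a x="1" x="2">` would leave a single attribute `x` with the value `1` and report one parse error.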
1.4.11. Tag attribute name after state
Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Stay in the current state.
- U+003D EQUALS SIGN (=)
- Switch to the tag attribute value before state.
- U+003E GREATER-THAN SIGN (>)
- Emit the current token and then switch to the data state.
- U+002F SOLIDUS (/)
- Set current tag to empty tag. Switch to the empty tag state.
- EOF
- This is an eof-in-tag parse error. Emit the current token and then reprocess the current input character in the data state.
- Anything else
- Start a new attribute in the current tag token. Set that attribute’s name to the current input character and its value to the empty string and then switch to the tag attribute name state.
1.4.12. Tag attribute value before state
- Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Stay in the current state.
- U+0022 QUOTATION MARK (")
- Switch to the tag attribute value double quoted state.
- U+0027 APOSTROPHE (')
- Switch to the tag attribute value single quoted state.
- U+0026 AMPERSAND (&)
- Reprocess the input character in the tag attribute value unquoted state.
- U+003E GREATER-THAN SIGN (>)
- Emit the current token and then switch to the data state.
- EOF
- This is an eof-in-tag parse error. Emit the current token and then reprocess the current input character in the data state.
- Anything else
- Append the current input character to the current attribute’s value and then switch to the tag attribute value unquoted state.
1.4.13. Tag attribute value double quoted state
- Consume the next input character:
- U+0022 QUOTATION MARK (")
- Switch to the tag attribute name before state.
- U+0026 AMPERSAND (&)
- Switch to character reference in attribute value state, with the additional allowed character being U+0022 QUOTATION MARK (").
- EOF
- This is an eof-in-tag parse error. Emit the current token and then reprocess the current input character in the data state.
- Anything else
- Append the input character to the current attribute’s value. Stay in the current state.
1.4.14. Tag attribute value single quoted state
- Consume the next input character:
- U+0027 APOSTROPHE (')
- Switch to the tag attribute name before state.
- U+0026 AMPERSAND (&)
- Switch to character reference in attribute value state, with the additional allowed character being U+0027 APOSTROPHE (').
- EOF
- This is an eof-in-tag parse error. Emit the current token and then reprocess the current input character in the data state.
- Anything else
- Append the input character to the current attribute’s value. Stay in the current state.
1.4.15. Tag attribute value unquoted state
- Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Switch to the tag attribute name before state.
- U+0026 AMPERSAND (&)
- Switch to character reference in attribute value state, with the additional allowed character being U+003E GREATER-THAN SIGN (>).
- U+003E GREATER-THAN SIGN (>)
- Emit the current token as start tag token and then switch to the data state.
- EOF
- This is an eof-in-tag parse error. Emit the current token as start tag token and then reprocess the current input character in the data state.
- Anything else
- Append the input character to the current attribute’s value. Stay in the current state.
1.4.16. Pi state
- If the next few characters are:
- Exact match for word "xml".
- Consume those characters and switch to the XML declaration state.
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- EOF
- Parse error. Reconsume the current input characters in the bogus comment state.
- Anything else
- Create a new processing instruction token. Reconsume current characters in pi target state.
1.4.17. XML declaration state
- Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Stay in the current state.
- U+0076 LATIN SMALL LETTER V (v)
- U+0065 LATIN SMALL LETTER E (e)
- U+0073 LATIN SMALL LETTER S (s)
- Reconsume the current character in the XML declaration attribute name state.
- U+003F QUESTION MARK (?)
- Switch to the XML declaration after state.
- EOF
- This is an eof-in-xml-declaration parse error. Append string "xml" to the processing instruction target, emit the current processing instruction token and emit an end-of-file token.
- Anything else
- This is an invalid-xml-declaration parse error. Append string "xml" to the processing instruction target, then reconsume current character in pi data state
1.4.18. XML declaration attribute name state
- If the next few characters are:
- Exact match for word "version".
- Set current xml declaration attribute name to version. Switch to XML declaration attribute name after.
- Exact match for word "encoding".
- Set current xml declaration attribute name to encoding. Switch to XML declaration attribute name after.
- Exact match for word "standalone".
- Set current xml declaration attribute name to standalone. Switch to XML declaration attribute name after.
- Anything else
- This is an invalid-xml-declaration parse error. Switch to pi target state
1.4.19. XML declaration attribute name after state
- Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Stay in current state.
- U+003D EQUALS SIGN (=)
- Switch to XML declaration attribute before value state.
- EOF
- This is an eof-in-xml-declaration parse error. Push "xml" to the processing instruction target, then push "version=" to the processing instruction data. Emit processing instruction token.
- Anything else
- This is an invalid-xml-declaration parse error. Push "xml" to the processing instruction target, then push "version=" to the processing instruction data. Reconsume in pi target state.
1.4.20. XML declaration attribute before value state
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Stay in current state.
- U+0027 APOSTROPHE (')
- Switch to XML declaration attribute value (single-quoted) state.
- U+0022 QUOTATION MARK (")
- Switch to XML declaration attribute value (double-quoted) state.
- EOF
- This is an eof-in-xml-declaration parse error. Push "xml" to the processing instruction target, then push "version=" to the processing instruction data. Emit processing instruction token.
- Anything else
- This is an invalid-xml-declaration parse error. Push "xml" to the processing instruction target, then push "version=" to the processing instruction data. Reconsume in pi target state.
1.4.21. XML declaration attribute value (single-quoted) state
- U+0027 APOSTROPHE (')
- Switch to XML declaration state.
- U+003F QUESTION MARK (?)
- This is an abrupt-closing-xml-declaration parse error. Switch to XML Declaration after state.
- EOF
- This is an eof-in-xml-declaration parse error. Emit current xml declaration. Emit end-of-file token.
- Anything else
- This is an invalid-xml-declaration parse error. Switch to pi target state
1.4.22. XML declaration attribute value (double-quoted) state
- U+0022 QUOTATION MARK (")
- Switch to XML declaration state.
- U+003F QUESTION MARK (?)
- This is an abrupt-closing-xml-declaration parse error. Switch to XML Declaration after state.
- EOF
- This is an eof-in-xml-declaration parse error. Emit current xml declaration. Emit end-of-file token.
- Anything else
- This is an invalid-xml-declaration parse error. Switch to pi target state
1.4.23. XML declaration after state
- U+003E GREATER-THAN SIGN (>)
- Emit the xml declaration token and then switch to the data state.
- U+003F QUESTION MARK (?)
- Append the current input character to the PI’s data and stay in the current state.
- Anything else
- Reprocess the current input character in the pi data state.
1.4.24. Pi target state
Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Switch to the pi target after state.
- EOF
- Parse error. Emit the current processing instruction token and then reprocess the current input character in the data state.
- U+003F QUESTION MARK (?)
- Switch to the pi after state.
- Anything else
- Append the current input character to the processing instruction target and stay in the current state.
1.4.25. Pi target after state
Consume the next input character:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Stay in the current state.
- Anything else
- Reprocess the current input character in the pi data state.
1.4.26. Pi data state
Consume the next input character:
- U+003F QUESTION MARK (?)
- Switch to the pi after state.
- EOF
- This is an eof-in-cdata parse error. Emit the current processing instruction token and then reprocess the current input character in the data state.
- Anything else
- Append the current input character to the pi’s data and stay in the current state.
1.4.27. Pi after state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Emit the current token and then switch to the data state.
- U+003F QUESTION MARK (?)
- Append the current input character to the PI’s data and stay in the current state.
- Anything else
- Reprocess the current input character in the pi data state.
1.4.28. Markup declaration state
If the next few characters are:
- Two U+002D HYPHEN-MINUS characters (-)
- Consume those two characters, create a comment token whose data is the empty string and switch to comment start state.
- Exact match for word "DOCTYPE"
- Consume those characters and switch to DOCTYPE state
- Exact match for word "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after)
- Consume those characters and switch to CDATA state
- Anything else
- Emit an incorrectly-opened-comment parse error. Create a comment token whose data is an empty string. Switch to bogus comment state (don’t consume any characters)
1.4.29. Comment start state
- U+002D HYPHEN-MINUS (-)
- Switch to comment start dash state
- U+003E GREATER-THAN SIGN (>)
- This is an abrupt-closing-of-empty-comment parse error. Switch to data state. Emit the current comment token.
- Anything else
- Reconsume in the comment state
1.4.30. Comment start dash state
- U+002D HYPHEN-MINUS (-)
- Switch to comment end state
- U+003E GREATER-THAN SIGN (>)
- This is an abrupt-closing-of-empty-comment parse error. Switch to data state. Emit the current comment token.
- EOF
- This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token.
- Anything else
- Append a U+002D HYPHEN-MINUS character (-) to the comment token’s data. Reconsume in the comment state.
1.4.31. Comment state
Consume the next input character:
- U+003C LESS-THAN SIGN (<)
- Append the current input character to the comment token’s data. Switch to the comment less-than sign state.
- U+002D HYPHEN-MINUS (-)
- Switch to the comment end dash state.
- EOF
- This is an eof-in-comment parse error. Emit the current comment token. Emit an end-of-file token.
- Anything else
- Append the current input character to the comment token’s data.
1.4.32. Comment less-than sign state
Consume the next input character:
- U+0021 EXCLAMATION MARK (!)
- Append the current input character to the comment token’s data. Switch to the comment less-than sign bang state.
- U+003C LESS-THAN SIGN (<)
- Append the current input character to the comment token’s data.
- Anything else
- Reconsume in the comment state.
1.4.33. Comment less-than sign bang state
- U+002D HYPHEN-MINUS (-)
- Switch to the comment less-than sign bang dash state.
- Anything else
- Reconsume in the comment state.
1.4.34. Comment less-than sign bang dash state
- U+002D HYPHEN-MINUS (-)
- Switch to the comment less-than sign bang dash dash state.
- Anything else
- Reconsume in the comment end dash state.
1.4.35. Comment less-than sign bang dash dash state
- U+003E GREATER-THAN SIGN (>)
- EOF
- Reconsume in the comment end state.
- Anything else
- Parse error. Reconsume in the comment end state.
1.4.36. Comment end dash state
Consume the next input character:
- U+002D HYPHEN-MINUS (-)
- Switch to the comment end state.
- EOF
- Parse error. Emit the comment token. Emit an end-of-file token.
- Anything else
- Append a U+002D HYPHEN-MINUS (-) to the comment token’s data. Reconsume in the comment state.
1.4.37. Comment end state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the comment token.
- U+0021 EXCLAMATION MARK (!)
- Switch to the comment end bang state.
- U+002D HYPHEN-MINUS (-)
- Append a U+002D HYPHEN-MINUS character (-) to the comment token’s data.
- EOF
- Parse error. Emit the comment token. Emit an end-of-file token.
- Anything else
- Append two U+002D (-) characters and the current input character to the comment token’s data. Reconsume in the comment state.
1.4.38. Comment end bang state
- U+002D HYPHEN-MINUS (-)
- Append a U+002D HYPHEN-MINUS character (-) and U+0021 EXCLAMATION MARK character (!) to the comment token’s data. Switch to the comment end dash state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Switch to the data state. Emit the comment token.
- EOF
- Parse error. Emit the comment token. Emit an end-of-file token.
- Anything else
- Append two U+002D (-) characters and U+0021 EXCLAMATION MARK character (!) to the comment token’s data. Reconsume in the comment state.
1.4.39. CDATA state
Consume the next input character:
- U+005D RIGHT SQUARE BRACKET (])
- Switch to the CDATA bracket state.
- EOF
- Parse error. Reprocess the current input character in the data state.
- Anything else
- Emit the current input character as character token. Stay in the current state.
1.4.40. CDATA bracket state
Consume the next input character:
- U+005D RIGHT SQUARE BRACKET (])
- Switch to the CDATA end state.
- EOF
- Parse error. Reprocess the current input character in the data state.
- Anything else
- Emit a U+005D RIGHT SQUARE BRACKET (]) character as character token and also emit the current input character as character token. Switch to the CDATA state.
1.4.41. CDATA end state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state.
- U+005D RIGHT SQUARE BRACKET (])
- Emit the current input character as character token. Stay in the current state.
- EOF
- Parse error. Reconsume the current input character in the data state.
- Anything else
- Emit two U+005D RIGHT SQUARE BRACKET (]) characters as character tokens and also emit the current input character as character token. Switch to the CDATA state.
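The three CDATA states (CDATA, CDATA bracket and CDATA end) can be sketched together as one loop. The function name `consume_cdata` and its return shape are hypothetical. The sketch assumes that the "Anything else" branch of the CDATA bracket state returns to the CDATA state, and it follows the EOF branches literally, so brackets that were pending at EOF are not emitted.

```python
from typing import List, Tuple


def consume_cdata(text: str, pos: int) -> Tuple[List[str], int, bool]:
    """Run the three CDATA states starting at text[pos].

    `pos` is expected to point just past "<![CDATA[". Returns the emitted
    character tokens, the position at which the data state resumes, and
    whether a parse error (EOF inside the section) occurred.
    """
    chars: List[str] = []
    state = "cdata"
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if state == "cdata":
            if ch == "]":
                state = "bracket"
            else:
                chars.append(ch)
        elif state == "bracket":
            if ch == "]":
                state = "end"
            else:
                # A lone "]" was ordinary data after all.
                chars.extend(["]", ch])
                state = "cdata"
        else:  # state == "end", "]]" has been seen
            if ch == ">":
                return chars, pos, False
            if ch == "]":
                # "]]]": the oldest bracket is data, two remain pending.
                chars.append("]")
            else:
                chars.extend(["]", "]", ch])
                state = "cdata"
    # EOF in any of the three states is a parse error.
    return chars, pos, True
```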
1.4.42. Character reference in attribute value state
Attempt to consume a character reference.
If nothing is returned, append a U+0026 AMPERSAND (&) character to current attribute’s value.
Otherwise, append returned character tokens to current attribute’s value.
Finally, switch back to attribute value state that switched to this state.
1.4.43. Bogus comment state
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current comment token.
- EOF
- Emit the comment. Emit an end-of-file token
- Anything else
- Append the current input character to the comment token’s data.
1.4.44. Tokenizing character references
This section defines how to consume a character reference, optionally with an additional allowed character. If an additional allowed character is specified where the algorithm is invoked, it is added to the list of characters that prevent the input from being treated as a character reference.
This definition is used when parsing character references in text and in attributes.
The behavior depends on identity of next character (the one immediately after the U+0026 AMPERSAND character), as follows:
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- U+003C LESS-THAN SIGN (<)
- U+0025 PERCENT SIGN (%)
- U+0026 AMPERSAND (&)
- EOF
- The additional allowed character if there is one
- Not a character reference. No characters are consumed and nothing is returned (This is not an error, either).
- U+0023 NUMBER SIGN (#)
- Consume the U+0023 NUMBER SIGN.
The behaviour further depends on the character after the U+0023 NUMBER SIGN.
- U+0078 LATIN SMALL LETTER X
- U+0058 LATIN CAPITAL LETTER X
- Consume the X.
Follow the steps below, but using ASCII hex digits.
When it comes to interpreting the number, interpret it as a hexadecimal number.
- Anything else
- Follow the steps below, but using ASCII digits.
When it comes to interpreting the number, interpret it as a decimal number.
Consume as many characters as match the range of characters given above (ASCII hex digits or ASCII digits).
If no characters match the range, then don’t consume any characters. This is a parse error; return the U+0023 NUMBER SIGN character and if appropriate X character as string of text.
Otherwise, if the next character is a U+003B SEMICOLON, consume that too. If it isn’t, there is a parse error.
If one or more characters match the range, then take them all and interpret the string of characters as a number (either hexadecimal or decimal as appropriate).
Should we do HTML like replacement? At least for null?
Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER character token.
Should we refuse Unicode from ranges listed (0x0001 to 0x0008, 0x000D to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF)?
I’ve noted that Javascript implementation of XML5 is having to go around some characters in its version.
- Anything else
- Consume characters until you reach a U+003B SEMICOLON character (;).
What happens if there is no semicolon? Does it read rest of the file? Maybe better solution is to read all characters that are part of name char according to XML 1.1. spec.
Otherwise, a character reference is parsed. If the last character matched is not a U+003B SEMICOLON character (;), there is a parse error.
If there was a parse error the consumed characters are interpreted as part of a string and are returned.
If there wasn’t a parse error return a reference with name equal to consumed characters, omitting the U+003B SEMICOLON character (;).
If the markup contains the following attribute value, This is a &ref;, the character tokenizer should return this as a reference named ref. However, if the attribute value is defined as This is &notref, then the tokenizer will interpret this as the text This is &notref, while emitting a parse error.
1.4.45. DOCTYPE state
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Switch to the before DOCTYPE name state.
- EOF
- This is an eof-in-doctype parse error. Switch to data state. Create new DOCTYPE token. Emit DOCTYPE token. Emit an end-of-file token.
- Anything else
- This is a missing-whitespace-before-doctype-name parse error. Reconsume the character in the before DOCTYPE name state.
1.4.46. Before DOCTYPE name state
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Ignore the character.
- Uppercase ASCII letter
- Create a new DOCTYPE token. Set the token’s name to the lowercase version of the current input character. Switch to the DOCTYPE name state.
- U+003E GREATER-THAN SIGN (>)
- This is a missing-doctype-name parse error. Create a new DOCTYPE token. Emit DOCTYPE token. Switch to data state.
- EOF
- This is an eof-in-doctype parse error. Switch to data state. Create new DOCTYPE token. Emit DOCTYPE token. Emit an end-of-file token.
- Anything else
- Create new DOCTYPE token. Set the token’s name to current input character. Switch to DOCTYPE name state.
1.4.47. DOCTYPE name state
- U+0009 CHARACTER TABULATION (Tab)
- U+000A LINE FEED (LF)
- U+0020 SPACE (Space)
- Set doctype depth to 0. Switch to the after DOCTYPE name state.
- Uppercase ASCII letter
- Append the lowercase version of the current input character to the current DOCTYPE token’s name.
- U+003E GREATER-THAN SIGN (>)
- Create a new DOCTYPE token. Emit token. Switch to data state.
- EOF
- This is an eof-in-doctype parse error. Emit the current DOCTYPE token. Emit an end-of-file token.
- Anything else
- Append the current input character to the current DOCTYPE token’s name. Reconsume the EOF character.
1.4.48. After DOCTYPE name state
- U+005B LEFT SQUARE BRACKET (
[
) - Increase doctype depth by 1. Remain in current state.
- U+005D RIGHT SQUARE BRACKET (
]
) - If current doctype depth is 0 switch to Bogus doctype state, otherwise decrease doctype depth by 1. Remain in current state.
- U+003E GREATER-THAN SIGN (>)
- If the current doctype depth is 0, emit the current DOCTYPE token and switch to the data state. Otherwise, remain in the current state.
- EOF
- This is an eof-in-doctype parse error. Switch to the data state. Emit the DOCTYPE token. Emit an end-of-file token.
- Anything else
- Remain in the current state.
1.4.49. Bogus DOCTYPE state
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the DOCTYPE token.
- EOF
- Emit the DOCTYPE token. Emit an end-of-file token.
- Anything else
- Ignore the character.
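The bracket counting in the after DOCTYPE name state and the recovery in the bogus DOCTYPE state can be sketched as follows. This is a rough illustration, not a normative algorithm: the function name `skip_doctype_tail` and its return shape are hypothetical, and the sketch assumes that a `>` seen while the doctype depth is greater than 0 is simply skipped.

```python
def skip_doctype_tail(text, pos=0):
    """Skip the remainder of a DOCTYPE after its name.

    Returns (next_pos, next_state, errors); next_state is "data" when the
    DOCTYPE token would be emitted on '>' and "eof" when input ran out.
    """
    depth = 0      # doctype depth: number of unclosed '[' seen so far
    bogus = False  # True once the bogus DOCTYPE state has been entered
    errors = []
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if bogus:
            # Bogus DOCTYPE state: everything up to '>' is ignored.
            if ch == ">":
                return pos, "data", errors
            continue
        if ch == "[":
            depth += 1
        elif ch == "]":
            if depth == 0:
                bogus = True  # unbalanced ']' switches to the bogus state
            else:
                depth -= 1
        elif ch == ">" and depth == 0:
            return pos, "data", errors
        # Anything else (and, by assumption, '>' at depth > 0): stay put.
    if not bogus:
        errors.append("eof-in-doctype")
    return pos, "eof", errors
```

With this sketch, the `>` that closes `<!ELEMENT a ANY>` inside an internal subset does not end the DOCTYPE, because the depth is 1 at that point.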
1.5. Tree construction
The input to the tree construction stage is a sequence of tokens from the tokenization stage. The output of this stage is a tree model represented by a Document object.
The tree construction stage passes through several phases. The initial phase is the start phase.
The stack of open elements contains all elements whose closing tag has not yet been encountered. Once the first start tag token is encountered in the start phase, the stack contains one open element. The remaining elements are added during the main phase.
The current element is the bottommost node in this stack.
The stack of open elements is said to have an element in scope if the target element is in the stack of open elements.
When the steps below require the user agent to append a character to a node, the user agent must collect it and all subsequent consecutive characters that would be appended to that node and insert one Text node whose data is the concatenation of all those characters.
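One way to get the same result incrementally is to extend a trailing Text node instead of buffering characters first. The sketch below assumes a minimal tree model; the classes `Text` and `Element` and the function `append_character` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    data: str


@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)


def append_character(node, ch):
    """Append ch to node, merging it into a trailing Text node if one exists."""
    if node.children and isinstance(node.children[-1], Text):
        node.children[-1].data += ch
    else:
        node.children.append(Text(ch))
```

Appending `h` and then `i` to an empty element leaves it with a single Text node whose data is `hi`. A character appended after an intervening child element starts a new Text node.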
Need to define create an element for the token...
When the steps below require the user agent to insert an element for a token, the user agent must create an element for the token, append it to the current element, and push it into the stack of open elements so that it becomes the new current element.
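A minimal sketch of inserting an element for a token, assuming the stack of open elements is a list whose last entry is the current element. Because "create an element for the token" is not yet defined here, `create_element` is only a placeholder, and the `StartTag` and `Element` classes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StartTag:
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Element:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def create_element(token):
    # Placeholder: copies the name and attributes; namespaces are not handled.
    return Element(token.name, dict(token.attributes))


def insert_element(open_elements, token):
    """Create an element, append it to the current element, and push it."""
    element = create_element(token)
    open_elements[-1].children.append(element)  # current element is the last entry
    open_elements.append(element)               # element becomes the new current element
    return element
```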
- Start phase
-
Each token emitted from the tokenization stage must be processed as follows until the algorithm below switches to a different phase:
- A start tag token
-
Create an element for the token and then append it to the Document node and push it into the stack of open elements. This element is the root element and the first current element. Then switch to the main phase.
- An empty tag token
-
Create an element for the token and append it to the Document node. Then switch to the end phase.
- A comment token
-
Append a Comment node to the Document node with the data attribute set to the data given in the token.
- A processing instruction token
-
Append a ProcessingInstruction node to the Document node with the target and data attributes set to the target and data given in the token.
- An end-of-file token
-
Parse error. Reprocess the token in the end phase.
- Anything else
- Parse error. Ignore the token.
- Main phase
-
Once a start tag token has been encountered (as detailed in the previous phase), each token must be processed using the following steps until further notice:
- A character token
-
Append a character to the current element.
- A start tag token
-
Insert an element for the token.
- An empty tag token
-
Create an element for the token and append it to the current element.
- An end tag token
-
If the tag name of the current element does not match the tag name of the end tag token, this is a parse error.
If there is an element in scope with the same tag name as that of the token, pop nodes from the stack of open elements until the first such element has been popped from the stack.
If there are no more elements on the stack of open elements at this point, switch to the end phase.
- A short end tag token
-
Pop an element from the stack of open elements. If there are no more elements on the stack of open elements switch to the end phase.
- A comment token
-
Append a Comment node to the current element with the data attribute set to the data given in the token.
- A processing instruction token
- Append a ProcessingInstruction node to the current element with the target and data attributes set to the target and data given in the token.
- An end-of-file token
- Parse error. Reprocess the token in the end phase.
- End phase
-
Tokens in the end phase must be handled as follows:
- A comment token
- Append a Comment node to the Document node with the data attribute set to the data given in the token.
- A processing instruction token
-
Append a ProcessingInstruction node to the Document node with the target and data attributes set to the target and data given in the token.
- An end-of-file token
- Anything else
-
Parse error. Ignore the token.
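The three phases above can be sketched as a single loop over tokens. This is an illustration under several assumptions: tokens are modeled as plain tuples such as `("start", name)`, `("pi", target, data)` and `("eof",)`. The `Node` class and `build_tree` function are hypothetical, attributes are omitted, parse errors are only counted, and an end-of-file token is assumed to stop processing in every phase.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str       # "document", "element", "text", "comment" or "pi"
    name: str = ""  # element name or processing instruction target
    data: str = ""
    children: list = field(default_factory=list)


def build_tree(tokens):
    """Return (document, parse_error_count) for a sequence of token tuples."""
    document, stack, errors, phase = Node("document"), [], 0, "start"
    for token in tokens:
        kind = token[0]
        # Comments and PIs go to the current element in the main phase,
        # and to the Document node in the start and end phases.
        parent = stack[-1] if phase == "main" else document
        if kind == "comment":
            parent.children.append(Node("comment", data=token[1]))
        elif kind == "pi":
            parent.children.append(Node("pi", name=token[1], data=token[2]))
        elif kind == "eof":
            if phase != "end":
                errors += 1  # EOF before the end phase is a parse error
            break            # assumption: EOF stops processing
        elif phase in ("start", "main") and kind in ("start", "empty"):
            element = Node("element", name=token[1])
            parent.children.append(element)
            if kind == "start":
                stack.append(element)
                phase = "main"
            elif phase == "start":
                phase = "end"  # an empty root element ends the document body
        elif phase == "main" and kind == "char":
            if parent.children and parent.children[-1].kind == "text":
                parent.children[-1].data += token[1]
            else:
                parent.children.append(Node("text", data=token[1]))
        elif phase == "main" and kind == "end":
            if stack[-1].name != token[1]:
                errors += 1
            if any(e.name == token[1] for e in stack):
                while stack.pop().name != token[1]:
                    pass  # pop until the matching element has been popped
            if not stack:
                phase = "end"
        elif phase == "main" and kind == "short_end":
            stack.pop()
            if not stack:
                phase = "end"
        else:
            errors += 1  # anything else: parse error, token ignored
    return document, errors
```

In this sketch an end tag that names an ancestor rather than the current element counts as one parse error and still closes every element up to and including that ancestor.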
Once the user agent stops parsing the document, it must follow these steps:
TODO
2. Writing XML documents
3. Common parser idioms
The ASCII digits are the characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9).
The ASCII hex digits are the characters in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, and U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F.
The lowercase ASCII letters are the characters in the range U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z.
The uppercase ASCII letters are the characters in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z.
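These character classes map directly onto range checks. The helper names below are hypothetical. Explicit ranges are used because Python's built-in `str.isdigit` and `str.isupper` also accept non-ASCII characters.

```python
def is_ascii_digit(ch):
    return len(ch) == 1 and "0" <= ch <= "9"


def is_ascii_hex_digit(ch):
    return len(ch) == 1 and (
        "0" <= ch <= "9" or "A" <= ch <= "F" or "a" <= ch <= "f"
    )


def is_lowercase_ascii_letter(ch):
    return len(ch) == 1 and "a" <= ch <= "z"


def is_uppercase_ascii_letter(ch):
    return len(ch) == 1 and "A" <= ch <= "Z"
```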
Comparing two strings in an ASCII case-insensitive manner means comparing them exactly, code point for code point, except that the characters in the range U+0041 to U+005A (i.e. LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and the corresponding characters in the range U+0061 to U+007A (i.e. LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are considered to also match.
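A small sketch of such a comparison, using hypothetical helper names. It folds only U+0041 to U+005A onto U+0061 to U+007A rather than calling `str.lower`, which also maps some non-ASCII characters onto ASCII letters.

```python
def ascii_lowercase(s):
    """Lowercase only the uppercase ASCII letters of s."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def ascii_case_insensitive_equal(a, b):
    return ascii_lowercase(a) == ascii_lowercase(b)
```

For instance, `DOCTYPE` and `doctype` match, while `K` and U+212A KELVIN SIGN do not, even though `str.lower` maps U+212A to `k`.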