XML5

Living Standard — Last Updated

Issue Tracking:
GitHub
Editors:
(Mozilla)
(Unaffiliated)

Abstract

XML with well-defined error handling.

1. Parsing XML documents

This section and its subsections define the XML parser.

This specification defines the parsing rules for XML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The handling for parse errors is well-defined: user agents must either act as described below when encountering such problems, or must terminate processing at the first error that they encounter for which they do not wish to apply the rules described below.

1.1. Overview

The input to the XML parsing process consists of a stream of octets which is converted to a stream of code points, which in turn are tokenized, and finally those tokens are used to construct a tree.

1.2. Input stream

The stream of Unicode characters that constitutes the input to the tokenization stage will be initially seen by the user agent as a stream of octets (typically coming over the network or from the local file system). The octets encode Unicode code points according to a particular encoding, which the user agent must use to decode the octets into code points.

Define how to find the encoding

Decide how to deal with null values

1.3. Tokenization

Implementations must act as if they used the following state machine to tokenize XML. The state machine must start in the data state. Most states consume a single character, which may have various side effects, and either switch the state machine to a new state to reconsume the current input character, switch it to a new state to consume the next character, or stay in the same state to consume the next character. Some states have more complicated behavior and can consume several characters before switching to another state. In some cases, the tokenizer state is also changed by the tree construction stage.

When a state says to reconsume a matched character in a specified state, that means to switch to that state, but when it attempts to consume the next input character, provide it with the current input character instead.

The next input character is the first character in the input stream that has not yet been consumed or explicitly ignored by the requirements in this section. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed.
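The bookkeeping described above can be sketched in Python as follows. This is only an illustration: the `InputStream` class, its method names, and the use of `None` as an EOF sentinel are assumptions of this sketch and are not defined by this specification.

```python
EOF = None  # hypothetical sentinel standing in for the end of the input


class InputStream:
    """Tracks the next input character and the current input character."""

    def __init__(self, text):
        self.text = text
        self.pos = 0             # index of the next input character
        self.current = None      # the current input character (last consumed)
        self._reconsume = False  # set when a state asked to reconsume

    def consume(self):
        """Consume and return the next input character, or EOF."""
        if self._reconsume:
            # Hand back the current input character instead of advancing.
            self._reconsume = False
            return self.current
        if self.pos < len(self.text):
            self.current = self.text[self.pos]
            self.pos += 1
        else:
            self.current = EOF
        return self.current

    def reconsume(self):
        """Make the next consume() return the current input character."""
        self._reconsume = True
```

A state that reconsumes a character would call `reconsume()` before switching state, so that the new state sees the same character on its first call to `consume()`.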

Decide how to deal with namespaces

1.3.1. Data state

Consume the next input character:

U+0026 AMPERSAND (&)
Switch to the character reference in data state.
U+003C LESS-THAN SIGN (<)
Switch to the tag state.
EOF
Emit an end-of-file token.
Anything else
Emit the current input character as a character token. Stay in this state.
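As an illustration, the data state can be written as a function from one consumed character to the tokens it emits and the state to switch to. The tuple shapes for tokens, the state-name strings, and the `None` EOF sentinel are assumptions of this sketch.

```python
EOF = None  # hypothetical end-of-input sentinel


def data_state(ch):
    """Return (tokens_to_emit, next_state) for one consumed character."""
    if ch == "&":
        return [], "character reference in data"
    if ch == "<":
        return [], "tag"
    if ch is EOF:
        # Emit an end-of-file token; there is no further state to run.
        return [("eof",)], None
    # Anything else: emit the character and stay in the data state.
    return [("character", ch)], "data"
```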

1.3.2. Character reference in data state

Switch to the data state.

Attempt to consume a character reference.

If nothing is returned, emit a U+0026 AMPERSAND (&) character token.

Otherwise, emit the character tokens that were returned.

1.3.3. Tag state

Consume the next input character:

U+002F SOLIDUS (/)
Switch to the end tag state.
U+003F QUESTION MARK (?)
Switch to the pi state.
U+0021 EXCLAMATION MARK (!)
Switch to the markup declaration state.
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
U+003A COLON (:)
U+003C LESS-THAN SIGN (<)
U+003E GREATER-THAN SIGN (>)
EOF
Parse error. Emit a U+003C LESS-THAN SIGN (<) character token. Reconsume the current input character in the data state.
Anything else
Create a new tag token and set its name to the input character, then switch to the tag name state.

1.3.4. End tag state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Emit a short end tag token and then switch to the data state.
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
U+003C LESS-THAN SIGN (<)
U+003A COLON (:)
EOF
Parse error. Emit a U+003C LESS-THAN SIGN (<) character token and a U+002F SOLIDUS (/) character token. Reconsume the current input character in the data state.
Anything else
Create an end tag token and set its name to the input character, then switch to the end tag name state.

1.3.5. End tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Switch to the end tag name after state.
U+002F SOLIDUS (/)
Parse error. Switch to the end tag name after state.
EOF
Parse error. Emit the current token and then reprocess the current input character in the data state.
U+003E GREATER-THAN SIGN (>)
Emit the current token and then switch to the data state.
Anything else
Append the current input character to the tag name and stay in the current state.

1.3.6. End tag name after state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Emit the current token and then switch to the data state.
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Stay in the current state.
EOF
Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else
Parse error. Stay in the current state.
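The three end tag states above can be sketched together as one small function. It takes the characters that follow "</" and returns the token plus a count of parse errors. The function name, the tuple shapes, and the convention of returning `None` when the caller must instead emit "<" and "/" as character tokens are assumptions of this sketch.

```python
WHITESPACE = "\t\n "


def tokenize_end_tag(rest):
    """Tokenize the characters following "</"; return (token, parse_errors)."""
    ch = rest[0] if rest else None  # None stands in for EOF
    if ch == ">":
        return ("short end tag",), 0
    if ch is None or ch in WHITESPACE or ch in "<:":
        # Parse error: the caller emits "<" and "/" as character tokens.
        return None, 1
    name, errors, state = ch, 0, "end tag name"
    i = 1
    while True:
        ch = rest[i] if i < len(rest) else None
        i += 1
        if ch is None:
            return ("end tag", name), errors + 1  # EOF is a parse error
        if ch == ">":
            return ("end tag", name), errors
        if state == "end tag name":
            if ch in WHITESPACE:
                state = "end tag name after"
            elif ch == "/":
                errors += 1
                state = "end tag name after"
            else:
                name += ch
        elif ch not in WHITESPACE:
            errors += 1  # stray character after the name; stay in this state
```

For example, `tokenize_end_tag("a b>")` yields an end tag named `a` with one parse error, because `b` is encountered in the end tag name after state.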

1.3.7. Pi state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
EOF
Parse error. Reprocess the current input character in the bogus comment state.
Anything else
Create a new processing instruction token. Set target to the current input character and data to the empty string. Then switch to the pi target state.

1.3.8. Pi target state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Switch to the pi target after state.
EOF
Parse error. Emit the current processing instruction token and then reprocess the current input character in the data state.
U+003F QUESTION MARK (?)
Switch to the pi after state.
Anything else
Append the current input character to the processing instruction target and stay in the current state.

1.3.9. Pi target after state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Stay in the current state.
Anything else
Reprocess the current input character in the pi data state.

1.3.10. Pi data state

Consume the next input character:

U+003F QUESTION MARK (?)
Switch to the pi after state.
EOF
Parse error. Emit the current processing instruction token and then reprocess the current input character in the data state.
Anything else
Append the current input character to the processing instruction’s data and stay in the current state.

1.3.11. Pi after state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Emit the current token and then switch to the data state.
U+003F QUESTION MARK (?)
Append the current input character to the processing instruction’s data and stay in the current state.
Anything else
Reprocess the current input character in the pi data state.
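The five processing instruction states can be sketched as a single loop over the characters that follow "<?". The function name, the `("pi", target, data)` tuple, and returning `None` when the caller should fall back to the bogus comment state are assumptions of this sketch. It follows the states as written above.

```python
WHITESPACE = "\t\n "


def tokenize_pi(rest):
    """Tokenize the characters following "<?"; return (token, parse_errors)."""
    if not rest or rest[0] in WHITESPACE:
        return None, 1  # parse error: caller uses the bogus comment state
    target, data = rest[0], ""
    state = "pi target"
    i = 1
    while True:
        ch = rest[i] if i < len(rest) else None  # None stands in for EOF
        if state == "pi target":
            if ch is None:
                return ("pi", target, data), 1
            if ch in WHITESPACE:
                state = "pi target after"
            elif ch == "?":
                state = "pi after"
            else:
                target += ch
        elif state == "pi target after":
            if ch is None or ch not in WHITESPACE:
                state = "pi data"
                continue  # reprocess the same character in the pi data state
        elif state == "pi data":
            if ch is None:
                return ("pi", target, data), 1
            if ch == "?":
                state = "pi after"
            else:
                data += ch
        else:  # pi after
            if ch == ">":
                return ("pi", target, data), 0
            if ch == "?":
                data += ch
            else:
                state = "pi data"
                continue  # reprocess the same character in the pi data state
        i += 1
```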

1.3.12. Markup declaration state

If the next two characters are both U+002D (-) characters, consume those two characters, create a comment token whose data is the empty string and then switch to the comment state.

Otherwise, if the next seven characters are an exact match for "[CDATA[", then consume those characters and switch to the CDATA state.

Otherwise, if the next seven characters are an exact match for "DOCTYPE", then consume those characters and switch to the DOCTYPE state.

Otherwise, this is a parse error. Switch to the bogus comment state.
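The lookahead in this state amounts to three prefix checks. A minimal sketch, assuming the input is held in a string `text` and `pos` is the index of the first character after "<!" (both names are hypothetical):

```python
def markup_declaration_state(text, pos):
    """Return (next_state, new_pos) for the input that follows "<!"."""
    if text.startswith("--", pos):
        return "comment", pos + 2
    if text.startswith("[CDATA[", pos):
        return "CDATA", pos + 7
    if text.startswith("DOCTYPE", pos):
        return "DOCTYPE", pos + 7
    # Parse error: nothing is consumed before the bogus comment state runs.
    return "bogus comment", pos
```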

1.3.13. Comment state

Consume the next input character:

U+003C LESS-THAN SIGN (<)
Append the current input character to the comment token’s data. Switch to the comment less-than sign state.
U+002D HYPHEN-MINUS (-)
Switch to the comment end dash state.
EOF
Parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append the current input character to the comment token’s data.

1.3.14. Comment less-than sign state

Consume the next input character:

U+0021 EXCLAMATION MARK (!)
Append the current input character to the comment token’s data. Switch to the comment less-than sign bang state.
U+003C LESS-THAN SIGN (<)
Append the current input character to the comment token’s data.
Anything else
Reconsume in the comment state.

1.3.15. Comment less-than sign bang state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment less-than sign bang dash state.
Anything else
Reconsume in the comment state.

1.3.16. Comment less-than sign bang dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment less-than sign bang dash dash state.
Anything else
Reconsume in the comment end dash state.

1.3.17. Comment less-than sign bang dash dash state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
EOF
Reconsume in the comment end state.
Anything else
Parse error. Reconsume in the comment end state.

1.3.18. Comment end dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Switch to the comment end state.
EOF
Parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append a U+002D HYPHEN-MINUS (-) character to the comment token’s data. Reconsume in the comment state.

1.3.19. Comment end state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Switch to the data state. Emit the comment token.
U+0021 EXCLAMATION MARK (!)
Switch to the comment end bang state.
U+002D HYPHEN-MINUS (-)
Append a U+002D HYPHEN-MINUS character (-) to the comment token’s data.
EOF
Parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append two U+002D HYPHEN-MINUS (-) characters to the comment token’s data. Reconsume in the comment state.

1.3.20. Comment end bang state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
Append a U+002D HYPHEN-MINUS character (-) and a U+0021 EXCLAMATION MARK character (!) to the comment token’s data. Switch to the comment end dash state.
U+003E GREATER-THAN SIGN (>)
Parse error. Switch to the data state. Emit the comment token.
EOF
Parse error. Emit the comment token. Emit an end-of-file token.
Anything else
Append two U+002D HYPHEN-MINUS characters (-) and a U+0021 EXCLAMATION MARK character (!) to the comment token’s data. Reconsume in the comment state.

1.3.21. CDATA state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (])
Switch to the CDATA bracket state.
EOF
Parse error. Reprocess the current input character in the data state.
Anything else
Emit the current input character as a character token. Stay in the current state.

1.3.22. CDATA bracket state

Consume the next input character:

U+005D RIGHT SQUARE BRACKET (])
Switch to the CDATA end state.
EOF
Parse error. Reprocess the current input character in the data state.
Anything else
Emit a U+005D RIGHT SQUARE BRACKET (]) character as a character token and also emit the current input character as a character token. Switch to the CDATA state.

1.3.23. CDATA end state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Switch to the data state.
U+005D RIGHT SQUARE BRACKET (])
Emit the current input character as a character token. Stay in the current state.
EOF
Parse error. Reconsume the current input character in the data state.
Anything else
Emit two U+005D RIGHT SQUARE BRACKET (]) characters as character tokens and also emit the current input character as a character token. Switch to the CDATA state.
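The three CDATA states can be sketched as one loop over the characters that follow "<![CDATA[". The function name and return shape are assumptions of this sketch. It also assumes that a single "]" followed by some other character returns to the CDATA state, so that only an unbroken "]]>" ends the section.

```python
def tokenize_cdata(rest):
    """Return (text, parse_errors, consumed) for input following "<![CDATA["."""
    out = []
    state = "CDATA"
    for i, ch in enumerate(rest):
        if state == "CDATA":
            if ch == "]":
                state = "CDATA bracket"
            else:
                out.append(ch)
        elif state == "CDATA bracket":
            if ch == "]":
                state = "CDATA end"
            else:
                out.append("]" + ch)  # the held-back "]" was ordinary text
                state = "CDATA"
        else:  # CDATA end
            if ch == ">":
                return "".join(out), 0, i + 1
            if ch == "]":
                out.append("]")       # one more "]"; two are still pending
            else:
                out.append("]]" + ch)
                state = "CDATA"
    # EOF is a parse error; pending brackets are not emitted in this sketch,
    # matching the EOF entries above, which emit nothing.
    return "".join(out), 1, len(rest)
```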

1.3.24. Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Switch to the tag attribute name before state.
U+003E GREATER-THAN SIGN (>)
Emit the current token and then switch to the data state.
EOF
Parse error. Emit the current token and then reprocess the current input character in the data state.
U+002F SOLIDUS (/)
Set current tag to empty tag. Switch to the empty tag state.
Anything else
Append the current input character to the tag name and stay in the current state.

1.3.25. Empty tag state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
Emit the current tag token as empty tag token and then switch to the data state.
Anything else
Parse error. Reprocess the current input character in the tag attribute name before state.

1.3.26. Tag attribute name before state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Stay in the current state.
U+003E GREATER-THAN SIGN (>)
Emit the current token and then switch to the data state.
U+002F SOLIDUS (/)
Set current tag to empty tag. Switch to the empty tag state.
U+003A COLON (:)
Parse error. Stay in the current state.
EOF
Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else
Start a new attribute in the current tag token. Set that attribute’s name to the current input character and its value to the empty string and then switch to the tag attribute name state.

1.3.27. Tag attribute name state

Consume the next input character:

U+003D EQUALS SIGN (=)
Switch to the tag attribute value before state.
U+003E GREATER-THAN SIGN (>)
Emit the current token as start tag token. Switch to the data state.
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Switch to the tag attribute name after state.
U+002F SOLIDUS (/)
Set current tag to empty tag. Switch to the empty tag state.
EOF
Parse error. Emit the current token as start tag token and then reprocess the current input character in the data state.
Anything else
Append the current input character to the current attribute’s name. Stay in the current state.

When the user agent leaves this state (and before emitting the tag token, if appropriate), the complete attribute’s name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be dropped, along with the value that gets associated with it (if any).
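The duplicate check can be sketched with a dictionary keyed by attribute name. The function name and the boolean return value (signalling a parse error) are assumptions of this sketch.

```python
def add_attribute(attributes, name, value):
    """Add an attribute to a tag token's attribute map.

    Returns True when the name was already present, which is a parse error;
    in that case the new attribute and its value are dropped.
    """
    if name in attributes:
        return True
    attributes[name] = value
    return False
```

Because the first occurrence wins, `<a id="x" id="y"/>` would end up with `id` set to `x` under this sketch.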

1.3.28. Tag attribute name after state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Stay in the current state.
U+003D EQUALS SIGN (=)
Switch to the tag attribute value before state.
U+003E GREATER-THAN SIGN (>)
Emit the current token and then switch to the data state.
U+002F SOLIDUS (/)
Set current tag to empty tag. Switch to the empty tag state.
EOF
Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else
Start a new attribute in the current tag token. Set that attribute’s name to the current input character and its value to the empty string and then switch to the tag attribute name state.

1.3.29. Tag attribute value before state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Stay in the current state.
U+0022 QUOTATION MARK (")
Switch to the tag attribute value double quoted state.
U+0027 APOSTROPHE (')
Switch to the tag attribute value single quoted state.
U+0026 AMPERSAND (&)
Reprocess the input character in the tag attribute value unquoted state.
U+003E GREATER-THAN SIGN (>)
Emit the current token and then switch to the data state.
EOF
Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else
Append the current input character to the current attribute’s value and then switch to the tag attribute value unquoted state.

1.3.30. Tag attribute value double quoted state

Consume the next input character:

U+0022 QUOTATION MARK (")
Switch to the tag attribute name before state.
U+0026 AMPERSAND (&)
Switch to the character reference in attribute value state, with the additional allowed character being U+0022 QUOTATION MARK (").
EOF
Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else
Append the input character to the current attribute’s value. Stay in the current state.

1.3.31. Tag attribute value single quoted state

Consume the next input character:

U+0027 APOSTROPHE (')
Switch to the tag attribute name before state.
U+0026 AMPERSAND (&)
Switch to the character reference in attribute value state, with the additional allowed character being U+0027 APOSTROPHE (').
EOF
Parse error. Emit the current token and then reprocess the current input character in the data state.
Anything else
Append the input character to the current attribute’s value. Stay in the current state.

1.3.32. Tag attribute value unquoted state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+0020 SPACE (Space)
Switch to the tag attribute name before state.
U+0026 AMPERSAND (&)
Switch to the character reference in attribute value state, with the additional allowed character being U+003E GREATER-THAN SIGN (>).
U+003E GREATER-THAN SIGN (>)
Emit the current token as start tag token and then switch to the data state.
EOF
Parse error. Emit the current token as start tag token and then reprocess the current input character in the data state.
Anything else
Append the input character to the current attribute’s value. Stay in the current state.

1.3.33. Character reference in attribute value state

Attempt to consume a character reference.

If nothing is returned, append a U+0026 AMPERSAND (&) character to the current attribute’s value.

Otherwise, append the returned character tokens to the current attribute’s value.

Finally, switch back to the attribute value state that switched to this state.

1.3.34. Bogus comment state

Consume every character up to the first U+003E GREATER-THAN SIGN (>) or EOF, whichever comes first. Emit a comment token whose data is the concatenation of all those consumed characters. Then consume the next input character and switch to the data state, reprocessing the EOF character if that was the character consumed.
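Since this state only scans for the next ">", it can be sketched with a single string search. The function name, the `("comment", data)` tuple, and the index-based interface are assumptions of this sketch.

```python
def bogus_comment(text, pos):
    """Return (comment_token, new_pos) for a bogus comment starting at pos."""
    end = text.find(">", pos)
    if end == -1:
        # EOF came first: everything left is comment data, and the data
        # state will see EOF next.
        return ("comment", text[pos:]), len(text)
    # Skip past the ">" so the data state resumes after it.
    return ("comment", text[pos:end]), end + 1
```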

1.3.35. Tokenizing character references

This section defines how to consume a character reference, optionally with an additional allowed character, which, if specified where the algorithm is invoked, adds a character to the list of characters that cause there to not be a character reference.

This definition is used when parsing character references in text and in attributes.

The behavior depends on the identity of the next character (the one immediately after the U+0026 AMPERSAND character), as follows:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
U+003C LESS-THAN SIGN (<)
U+0025 PERCENT SIGN (%)
U+0026 AMPERSAND (&)
EOF
The additional allowed character if there is one
Not a character reference. No characters are consumed and nothing is returned. (This is not an error, either.)
U+0023 NUMBER SIGN (#)

Consume the U+0023 NUMBER SIGN.

The behavior further depends on the character after the U+0023 NUMBER SIGN.

U+0078 LATIN SMALL LETTER X
U+0058 LATIN CAPITAL LETTER X

Consume the X.

Follow the steps below, but using ASCII hex digits.

When it comes to interpreting the number, interpret it as a hexadecimal number.

Anything else
Follow the steps below, but using ASCII digits.

When it comes to interpreting the number, interpret it as a decimal number.

Consume as many characters as match the range of characters given above (ASCII hex digits or ASCII digits).

If no characters match the range, then don’t consume any characters. This is a parse error; return the U+0023 NUMBER SIGN character and, if appropriate, the X character as a string of text.

Otherwise, if the next character is a U+003B SEMICOLON, consume that too. If it isn’t, there is a parse error.

If one or more characters match the range, then take them all and interpret the string of characters as a number (either hexadecimal or decimal as appropriate).

Should we do HTML like replacement? At least for null?

Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER character token.

Should we refuse Unicode from ranges listed (0x0001 to 0x0008, 0x000D to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF)?

I’ve noted that the JavaScript implementation of XML5 has to work around some characters in its version.

Anything else

Consume characters until you reach a U+003B SEMICOLON character (;).

What happens if there is no semicolon? Does it read rest of the file? Maybe better solution is to read all characters that are part of name char according to XML 1.1. spec.

Otherwise, a character reference is parsed. If the last character matched is not a U+003B SEMICOLON character (;), there is a parse error.

If there was a parse error, the consumed characters are interpreted as part of a string and are returned.

If there wasn’t a parse error, return a reference with a name equal to the consumed characters, omitting the U+003B SEMICOLON character (;).

If the markup contains the attribute value "This is a &ref;", the tokenizer should return this as a reference named ref. However, if the attribute value is "This is &notref", the tokenizer will interpret it as the text "This is &notref" while emitting a parse error.
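The numeric branch of this algorithm can be sketched as follows. The function name, the index-based interface, and the `(chars, new_pos, parse_errors)` return shape are assumptions of this sketch. It deliberately leaves the open questions above (null and other restricted code points) unhandled.

```python
import string


def consume_numeric_reference(text, pos):
    """text[pos] is the first character after "&#".

    Returns (chars, new_pos, parse_errors).
    """
    prefix = "#"
    digits, base = string.digits, 10
    if pos < len(text) and text[pos] in "xX":
        prefix += text[pos]  # remember which X was seen
        pos += 1
        digits, base = string.hexdigits, 16
    start = pos
    while pos < len(text) and text[pos] in digits:
        pos += 1
    if pos == start:
        # No digits: parse error; give back "#" (and the X) as plain text.
        return prefix, pos, 1
    errors = 0
    number = int(text[start:pos], base)
    if pos < len(text) and text[pos] == ";":
        pos += 1
    else:
        errors += 1  # missing semicolon
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return "\uFFFD", pos, errors + 1
    return chr(number), pos, errors
```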

1.3.36. DOCTYPE state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Switch to the before DOCTYPE name state.
EOF
Parse error. Switch to the data state. Create a new DOCTYPE token. Emit the DOCTYPE token. Reconsume the EOF character.
Anything else
Parse error. Switch to the before DOCTYPE name state. Reconsume the character.

1.3.37. Before DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Ignore the character.
Uppercase ASCII letter
Create a new DOCTYPE token. Set the token’s name to the lowercase version of the current input character. Switch to the DOCTYPE name state.
U+003E GREATER-THAN SIGN(>)
Parse error. Create a new DOCTYPE token. Emit the token. Switch to the data state.
EOF
Parse error. Switch to the data state. Create a new DOCTYPE token. Emit the DOCTYPE token. Reconsume the EOF character.
Anything else
Create a new DOCTYPE token. Set the token’s name to the current input character. Switch to the DOCTYPE name state.

1.3.38. DOCTYPE name state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Switch to the after DOCTYPE name state.
Uppercase ASCII letter
Append the lowercase version of the current input character to the current DOCTYPE token’s name.
U+003E GREATER-THAN SIGN(>)
Emit the current DOCTYPE token. Switch to the data state.
EOF
Parse error. Switch to the data state. Emit the current DOCTYPE token. Reconsume the EOF character.
Anything else
Append the current input character to the current DOCTYPE token’s name.

1.3.39. After DOCTYPE name state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Ignore the character.
U+003E GREATER-THAN SIGN(>)
Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
If the six characters starting from the current input character are an ASCII case-insensitive match for word "PUBLIC", then consume those characters and switch to the after DOCTYPE public keyword state.

Otherwise, if the six characters starting from the current input character are an ASCII case-insensitive match for the word "SYSTEM", then consume those characters and switch to the after DOCTYPE system keyword state.

Otherwise, this is a parse error. Switch to bogus DOCTYPE state.
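The keyword check in the "Anything else" branch can be sketched like this. The function names and the index-based interface are assumptions of this sketch. Lowercasing is restricted to A-Z by hand so that the comparison stays ASCII case-insensitive; Python's `str.lower()` and `str.upper()` also fold some non-ASCII letters.

```python
def ascii_lower(s):
    """Lowercase only the ASCII letters A-Z, leaving everything else alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def after_doctype_name_keyword(text, pos):
    """text[pos] is the current input character.

    Returns (next_state, index of the next character to consume).
    """
    word = ascii_lower(text[pos:pos + 6])
    if word == "public":
        return "after DOCTYPE public keyword", pos + 6
    if word == "system":
        return "after DOCTYPE system keyword", pos + 6
    # Parse error: only the current input character has been consumed.
    return "bogus DOCTYPE", pos + 1
```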

1.3.40. After DOCTYPE public keyword state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Switch to the before DOCTYPE public identifier state.
U+0022 QUOTATION MARK(")
Parse error. Set the current DOCTYPE token’s public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (double-quoted) state.
U+0027 APOSTROPHE(')
Parse error. Set the current DOCTYPE token’s public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (single-quoted) state.
U+003E GREATER-THAN SIGN(>)
Parse error. Switch to the data state. Emit the current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit that DOCTYPE token. Reconsume the EOF character.
Anything else
Parse error. Switch to the bogus DOCTYPE state.

1.3.41. After DOCTYPE system keyword state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Switch to the before DOCTYPE system identifier state.
U+0022 QUOTATION MARK(")
Parse error. Set the current DOCTYPE token’s system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE(')
Parse error. Set the current DOCTYPE token’s system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
U+003E GREATER-THAN SIGN(>)
Parse error. Switch to the data state. Emit the current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit that DOCTYPE token. Reconsume the EOF character.
Anything else
Parse error. Switch to the bogus DOCTYPE state.

1.3.42. Before DOCTYPE system identifier state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Ignore the character.
U+0022 QUOTATION MARK(")
Set the current DOCTYPE token’s system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
U+0027 APOSTROPHE(')
Parse error. Set the current DOCTYPE token’s system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
U+003E GREATER-THAN SIGN(>)
Parse error. Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Parse error. Switch to the bogus DOCTYPE state.

1.3.43. Before DOCTYPE public identifier state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Ignore the character.
U+0022 QUOTATION MARK(")
Parse error. Set the current DOCTYPE token’s public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (double-quoted) state.
U+0027 APOSTROPHE(')
Parse error. Set the current DOCTYPE token’s public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (single-quoted) state.
U+003E GREATER-THAN SIGN(>)
Parse error. Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Parse error. Switch to the bogus DOCTYPE state.

1.3.44. DOCTYPE public identifier (single-quoted) state

Consume the next input character:
U+0027 APOSTROPHE(')
Switch to the after DOCTYPE public identifier state.
U+003E GREATER-THAN SIGN(>)
Parse error. Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Append the current input character to the current DOCTYPE token’s public identifier.

1.3.45. DOCTYPE public identifier (double-quoted) state

Consume the next input character:
U+0022 QUOTATION MARK(")
Switch to the after DOCTYPE public identifier state.
U+003E GREATER-THAN SIGN(>)
Parse error. Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Append the current input character to the current DOCTYPE token’s public identifier.

1.3.46. After DOCTYPE public identifier state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Switch to the between DOCTYPE public and system identifiers state.
U+0027 APOSTROPHE(')
Parse error. Set the DOCTYPE token’s system identifier to the empty string (not missing) then switch to the DOCTYPE system identifier (single-quoted) state.
U+0022 QUOTATION MARK(")
Parse error. Set the DOCTYPE token’s system identifier to the empty string (not missing) then switch to the DOCTYPE system identifier (double-quoted) state.
U+003E GREATER-THAN SIGN(>)
Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Parse error. Switch to bogus DOCTYPE state.

1.3.47. Between DOCTYPE public and system identifiers state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Ignore the character.
U+003E GREATER-THAN SIGN(>)
Switch to data state. Emit current DOCTYPE token.
U+0027 APOSTROPHE(')
Set the DOCTYPE token’s system identifier to the empty string (not missing) then switch to the DOCTYPE system identifier (single-quoted) state.
U+0022 QUOTATION MARK(")
Set the DOCTYPE token’s system identifier to the empty string (not missing) then switch to the DOCTYPE system identifier (double-quoted) state.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Parse error. Switch to the bogus DOCTYPE state.

1.3.48. DOCTYPE system identifier (single-quoted) state

Consume the next input character:
U+0027 APOSTROPHE(')
Switch to the after DOCTYPE system identifiers state.
U+003E GREATER-THAN SIGN(>)
Parse error. Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Append the current input character to the current DOCTYPE token’s system identifier.

1.3.49. DOCTYPE system identifier (double-quoted) state

Consume the next input character:
U+0022 QUOTATION MARK(")
Switch to the after DOCTYPE system identifiers state.
U+003E GREATER-THAN SIGN(>)
Parse error. Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Append the current input character to the current DOCTYPE token’s system identifier.

1.3.50. After DOCTYPE system identifiers state

Consume the next input character:
U+0009 CHARACTER TABULATION (Tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE (Space)
Ignore the character.
U+003E GREATER-THAN SIGN(>)
Switch to data state. Emit current DOCTYPE token.
EOF
Parse error. Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Parse error. Switch to the bogus DOCTYPE state.

1.3.51. Bogus DOCTYPE state

Consume the next input character:
U+003E GREATER-THAN SIGN(>)
Switch to data state. Emit DOCTYPE token.
EOF
Switch to the data state. Emit DOCTYPE token. Reconsume the EOF character.
Anything else
Ignore the character.

1.4. Tree construction

The input to the tree construction stage is a sequence of tokens from the tokenization stage. The output of this stage is a tree model represented by a Document object.

The tree construction stage passes through several phases. The initial phase is the start phase.

The stack of open elements contains all elements whose end tag has not yet been encountered. Once the first start tag token is encountered in the start phase, the stack contains one open element. The rest of the elements are added during the main phase.

The current element is the bottommost node in this stack.

The stack of open elements is said to have an element in scope if the target element is in the stack of open elements.

When the steps below require the user agent to append a character to a node, the user agent must collect it and all subsequent consecutive characters that would be appended to that node and insert one Text node whose data is the concatenation of all those characters.
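
One way to obtain the same result without buffering is to extend a trailing Text node when one exists and to create a new Text node otherwise. The Python below is a non-normative sketch of that shortcut; the `Text` and `Element` classes and `append_character` are hypothetical stand-ins for the corresponding DOM objects, not definitions from this specification.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    data: str


@dataclass
class Element:
    name: str
    children: List[Union["Element", Text]] = field(default_factory=list)


def append_character(node: Element, ch: str) -> None:
    """Append `ch` to `node`, merging it into a trailing Text node if any."""
    if node.children and isinstance(node.children[-1], Text):
        # Consecutive characters end up in the same Text node.
        node.children[-1].data += ch
    else:
        node.children.append(Text(ch))
```

Characters separated by another node, such as a child element, land in separate Text nodes, because the run of consecutive characters is interrupted.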

Need to define create an element for the token...

When the steps below require the user agent to insert an element for a token, the user agent must create an element for the token, append it to the current element, and push it into the stack of open elements so that it becomes the new current element.
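
The stack operations described above can be sketched as follows. This Python is non-normative and rests on two assumptions: that the stack grows downwards, so the bottommost node is the most recently pushed one, and that creating an element needs only a tag name (the real "create an element for the token" step is not yet defined). `Element` and `OpenElements` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    name: str
    children: List["Element"] = field(default_factory=list)


class OpenElements:
    """A stack of open elements; the last list item is the current element."""

    def __init__(self):
        self.stack: List[Element] = []

    @property
    def current(self) -> Element:
        return self.stack[-1]

    def in_scope(self, name: str) -> bool:
        # An element is in scope if it is anywhere in the stack.
        return any(el.name == name for el in self.stack)

    def insert_element(self, name: str) -> Element:
        el = Element(name)  # placeholder for "create an element for the token"
        self.current.children.append(el)
        self.stack.append(el)  # `el` becomes the new current element
        return el
```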

Start phase

Each token emitted from the tokenization stage must be processed as follows until the algorithm below switches to a different phase:

A start tag token

Create an element for the token and then append it to the Document node and push it into the stack of open elements.

This element is the root element and the first current element. Then switch to the main phase.

An empty tag token

Create an element for the token and append it to the Document node. Then switch to the end phase.

A comment token

Append a Comment node to the Document node with the data attribute set to the data given in the token.

A processing instruction token

Append a ProcessingInstruction node to the Document node with the target and data attributes set to the target and data given in the token.

An end-of-file token

Parse error. Reprocess the token in the end phase.

Anything else

Parse error. Ignore the token.

Main phase

Once a start tag token has been encountered (as detailed in the previous phase), each token must be processed using the following steps until the algorithm switches to a different phase:

A character token

Append a character to the current element.

A start tag token

Insert an element for the token.

An empty tag token

Create an element for the token and append it to the current element.

An end tag token

If the tag name of the current element does not match the tag name of the end tag token, this is a parse error.

If there is an element in scope with the same tag name as that of the token, pop nodes from the stack of open elements until the first such element has been popped from the stack.

If there are no more elements on the stack of open elements at this point, switch to the end phase.

A short end tag token

Pop an element from the stack of open elements. If there are no more elements on the stack of open elements switch to the end phase.

A comment token

Append a Comment node to the current element with the data attribute set to the data given in the token.

A processing instruction token

Append a ProcessingInstruction node to the current element with the target and data attributes set to the target and data given in the token.

An end-of-file token

Parse error. Reprocess the token in the end phase.

End phase

Tokens in the end phase must be handled as follows:

A comment token

Append a Comment node to the Document node with the data attribute set to the data given in the token.

A processing instruction token

Append a ProcessingInstruction node to the Document node with the target and data attributes set to the target and data given in the token.

An end-of-file token

Stop parsing.

Anything else

Parse error. Ignore the token.
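
The start, main, and end phases can be sketched together as one small tree builder. The Python below is non-normative and heavily simplified: tokens are plain tuples such as `("start", name)` or `("eof",)`, attributes are omitted, and `Node` and `build_tree` are hypothetical names. It counts parse errors instead of reporting them and follows the rules above literally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str   # "document", "element", "text", "comment" or "pi"
    name: str = ""   # element name or processing instruction target
    data: str = ""   # text, comment or processing instruction data
    children: List["Node"] = field(default_factory=list)


def build_tree(tokens):
    """Return (document, parse_error_count) for an iterable of token tuples."""
    doc, stack, phase, errors = Node("document"), [], "start", 0
    for tok in tokens:
        kind = tok[0]
        while True:  # loop so that EOF can be reprocessed in the end phase
            parent = stack[-1] if phase == "main" else doc
            if kind == "comment":
                parent.children.append(Node("comment", data=tok[1]))
            elif kind == "pi":
                parent.children.append(Node("pi", tok[1], tok[2]))
            elif kind == "eof":
                if phase != "end":
                    errors += 1
                    phase = "end"
                    continue  # reprocess the token in the end phase
                return doc, errors  # stop parsing
            elif phase == "end":
                errors += 1  # anything else: parse error, ignore
            elif kind in ("start", "empty"):
                el = Node("element", tok[1])
                parent.children.append(el)
                if kind == "start":
                    stack.append(el)
                    phase = "main"
                elif phase == "start":
                    phase = "end"  # an empty tag was the whole root element
            elif phase == "start":
                errors += 1  # anything else: parse error, ignore
            elif kind == "char":
                if parent.children and parent.children[-1].kind == "text":
                    parent.children[-1].data += tok[1]
                else:
                    parent.children.append(Node("text", data=tok[1]))
            elif kind == "end":
                if parent.name != tok[1]:
                    errors += 1
                if any(el.name == tok[1] for el in stack):
                    while stack.pop().name != tok[1]:
                        pass  # pop until the matching element is popped
                if not stack:
                    phase = "end"
            elif kind == "short_end":
                stack.pop()
                if not stack:
                    phase = "end"
            break
    return doc, errors
```

For example, the token sequence for `<a>hi<b><c/></a>` yields one parse error, because the end tag `a` does not match the current element `b`, and both `b` and `a` are popped from the stack.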

Once the user agent stops parsing the document, it must follow these steps:

TODO

2. Writing XML documents

3. Common parser idioms

The ASCII digits are the characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9).

The ASCII hex digits are the characters in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, and U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F.

The lowercase ASCII letters are the characters in the range U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z.

The uppercase ASCII letters are the characters in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z.

Comparing two strings in an ASCII case-insensitive manner means comparing them exactly, code point for code point, except that the characters in the range U+0041 to U+005A (i.e. LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and the corresponding characters in the range U+0061 to U+007A (i.e. LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are considered to also match.
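
These definitions map directly onto small helpers. The Python below is a non-normative sketch; the constant and function names are assumptions chosen for illustration. The comparison deliberately folds only U+0041 to U+005A, using a translation table instead of `str.lower()`, which would also fold non-ASCII characters.

```python
import string

ASCII_DIGITS = frozenset(string.digits)                    # 0-9
ASCII_HEX_DIGITS = frozenset(string.hexdigits)             # 0-9, a-f, A-F
LOWERCASE_ASCII_LETTERS = frozenset(string.ascii_lowercase)
UPPERCASE_ASCII_LETTERS = frozenset(string.ascii_uppercase)

# Maps A-Z to a-z and leaves every other code point untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare code point for code point, treating A-Z and a-z as equal."""
    return a.translate(_ASCII_LOWER) == b.translate(_ASCII_LOWER)
```

With these helpers, `"DOCTYPE"` matches `"doctype"`, while U+212A KELVIN SIGN does not match `"k"`, even though Unicode lowercasing would map one to the other.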

Conformance

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this specification are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

Index

Terms defined by this specification

References

Normative References

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119