A sax-style parser for XML and HTML.

Designed with [node](http:
the browser or other CommonJS implementations.

## What This Is

* A very simple tool to parse through an XML string.
* A stepping stone to a streaming HTML parser.
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML.

## What This Is (probably) Not

* An HTML Parser - That's a fine goal, but this isn't it. It's just
  XML.
* A DOM Builder - You can use it to build an object model out of XML,
  but it doesn't do that out of the box.
* XSLT - No DOM = no querying.
* 100% Compliant with (some other SAX implementation) - Most SAX
  implementations are in Java and do a lot more than this does.
* An XML Validator - It does a little validation when in strict mode, but
  not much.
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic
* A DTD-aware Thing - Fetching DTDs is a much bigger job.

## Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything
with that. If you want to listen to the `ondoctype` event, and then fetch
the doctypes, and read the entities and add them to `parser.ENTITIES`, then
be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass
through unchanged.

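
As a starting point for the "read the entities" step, the sketch below pulls simple internal entity declarations out of a doctype string and copies them onto `parser.ENTITIES`. Both helpers (`extractEntities` and `registerDoctypeEntities`) are hypothetical names, not part of this library, and the sketch assumes the string handed to `ondoctype` includes the internal DTD subset. The regular expression only covers declarations of the form `<!ENTITY name "value">`; parameter entities and external entities are skipped.

```javascript
// Hypothetical helper (not part of sax): collect simple internal entity
// declarations of the form <!ENTITY name "value"> from a doctype string.
function extractEntities(doctype) {
  var entities = {};
  var pattern = /<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g;
  var match;
  while ((match = pattern.exec(doctype)) !== null) {
    // match[2] holds a double-quoted value, match[3] a single-quoted one.
    entities[match[1]] = match[2] !== undefined ? match[2] : match[3];
  }
  return entities;
}

// Hypothetical helper: copy declared entities onto the parser's ENTITIES
// map whenever a doctype is seen.
function registerDoctypeEntities(parser) {
  parser.ondoctype = function (doctype) {
    var found = extractEntities(doctype);
    Object.keys(found).forEach(function (name) {
      parser.ENTITIES[name] = found[name];
    });
  };
  return parser;
}
```

With a real parser this might be wired up as `registerDoctypeEntities(sax.parser(true))` before the first `write()`.
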
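## Usage

```javascript
var sax = require("./lib/sax"),
  strict = true, // set to false for loose mode
  parser = sax.parser(strict);

parser.onerror = function (e) {
  // an error happened.
};
parser.ontext = function (t) {
  // got some text. t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag. node has "name" and "attributes"
};
parser.onattribute = function (attr) {
  // an attribute. attr has "name" and "value"
};
parser.onend = function () {
  // parser stream is done, and ready to have more stuff written to it.
};

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

// stream usage: createStream takes the same arguments as the parser
var fs = require("fs"),
  options = {}, // placeholder: any of the parser settings
  saxStream = require("sax").createStream(strict, options)
saxStream.on("error", function (e) {
  // unhandled errors will throw, since this is a proper node event emitter
  console.error("error!", e)
  // clear the error and carry on
  this._parser.error = null
  this._parser.resume()
})
saxStream.on("opentag", function (node) {
  // same object as above
})
// the stream is readable and writable, so it can sit in a pipe chain
fs.createReadStream("file.xml")
  .pipe(saxStream)
  .pipe(fs.createWriteStream("file-copy.xml"))
```
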
## Arguments

Pass the following arguments to the parser function. All are optional.

`strict` - Boolean. Whether or not to be a jerk. Default: `false`.

`opt` - Object bag of settings regarding string formatting. All default to
`false`.

* `trim` - Boolean. Whether or not to trim text and comment nodes.
* `normalize` - Boolean. If true, then turn any whitespace into a single
  space.
* `lowercase` - Boolean. If true, then lowercase tag names and attribute names
  in loose mode, rather than uppercasing them.
* `xmlns` - Boolean. If true, then namespaces are supported.
* `position` - Boolean. If false, then don't track line/col/position.
* `strictEntities` - Boolean. If true, only parse [predefined XML
  entities](http://www.w3.org/TR/REC-xml/#sec-predefined-ent)
  (`&amp;`, `&apos;`, `&gt;`, `&lt;`, and `&quot;`).

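
To make the two text-formatting settings concrete, the sketch below applies what `trim` and `normalize` are described as doing to a plain string. `applyTextOptions` is a hypothetical stand-in written for illustration; it is not the parser's own code.

```javascript
// Hypothetical illustration of the `trim` and `normalize` settings as
// described above; this is not the parser's internal implementation.
function applyTextOptions(opt, text) {
  if (opt.trim) {
    text = text.trim(); // drop leading and trailing whitespace
  }
  if (opt.normalize) {
    text = text.replace(/\s+/g, " "); // collapse each whitespace run
  }
  return text;
}
```

With the real parser, the same settings would be passed as the second argument, for example `sax.parser(false, { trim: true, normalize: true })`.
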
## Methods

`write` - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.

`close` - Close the stream. Once closed, no more data may be written until
it is done processing the buffer, which is signaled by the `end` event.

`resume` - To gracefully handle errors, assign a listener to the `error`
event. Then, when the error is taken care of, you can call `resume` to
continue parsing. Otherwise, the parser will not continue while in an error
state.

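
Because `write` can be called repeatedly, a document can be fed to the parser piece by piece. The following is a minimal sketch, assuming `parser` is any object with the `write` and `close` methods described above; `writeInChunks` is a hypothetical helper name.

```javascript
// Hypothetical helper: feed a document to a parser a few characters at a
// time, then close it. `parser` only needs write() and close().
function writeInChunks(parser, xml, chunkSize) {
  if (!(chunkSize > 0)) {
    throw new RangeError("chunkSize must be a positive number");
  }
  for (var i = 0; i < xml.length; i += chunkSize) {
    parser.write(xml.slice(i, i + chunkSize));
  }
  parser.close();
}
```

In practice the chunks would more likely arrive from a socket or file stream; fixed-size slicing is used here only to keep the sketch short.
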
## Members

At all times, the parser object will have the following members:

`line`, `column`, `position` - Indications of the position in the XML
document where the parser currently is looking.

`startTagPosition` - Indicates the position where the current tag starts.

`closed` - Boolean indicating whether or not the parser can be written to.
If it's `true`, then wait for the `ready` event to write again.

`strict` - Boolean indicating whether or not the parser is a jerk.

`opt` - Any options passed into the constructor.

`tag` - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.

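
The position members are handy when reporting where something went wrong. A small sketch follows; `describePosition` is a hypothetical helper, and it prints the values exactly as the parser reports them, without assuming whether they are zero- or one-based.

```javascript
// Hypothetical helper: format the parser's current position for a log
// message, using the `line`, `column`, and `position` members as-is.
function describePosition(parser) {
  return "line " + parser.line +
    ", column " + parser.column +
    " (offset " + parser.position + ")";
}
```

Inside a parser's `onerror` handler, where `this` is the parser object, this could be called as `describePosition(this)`.
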
## Events

All events emit with a single argument. To listen to an event, assign a
function to `on<eventname>`. Functions get executed in the this-context of
the parser object. The module also exports the list of supported events.

When using the stream interface, assign handlers using the EventEmitter
`on` function in the normal fashion.

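
The `on<eventname>` convention makes it easy to attach several handlers at once. The sketch below does that from a plain object keyed by event name; `attachHandlers` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: assign each handler to the matching on<eventname>
// property of the parser. Handlers are later called with the parser as `this`.
function attachHandlers(parser, handlers) {
  Object.keys(handlers).forEach(function (eventName) {
    parser["on" + eventName] = handlers[eventName];
  });
  return parser;
}
```

For the stream interface, the equivalent would be calling `saxStream.on(eventName, handler)` for each entry.
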
`error` - Indication that something bad happened. The error will be hanging
out on `parser.error`, and must be deleted before parsing can continue. By
listening to this event, you can keep an eye on that kind of stuff. Note:
this happens *much* more in strict mode. Argument: instance of `Error`.

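
A handler that follows this rule might look like the sketch below: it records the error, clears `parser.error`, and calls `resume` so that parsing continues. `recoverFromErrors` is a hypothetical helper name, and collecting messages into an array is just one possible policy.

```javascript
// Hypothetical helper: record each error message, clear the stored error,
// and resume parsing.
function recoverFromErrors(parser, messages) {
  parser.onerror = function (err) {
    messages.push(err.message);
    parser.error = null; // the stored error must be cleared first
    parser.resume();
  };
  return parser;
}
```

Whether it is wise to continue after an error depends on the input; in strict mode it may be better to stop at the first one.
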
`text` - Text node. Argument: string of text.

`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.

`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument:
object with `name` and `body` members. Attributes are not parsed, as
processing instructions have implementation-dependent semantics.

`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>`
would trigger this kind of event. This is a weird thing to support, so it
might go away at some point. SAX isn't intended to be used to parse SGML.

`opentagstart` - Emitted immediately when the tag name is available,
but before any attributes are encountered. Argument: object with a
`name` field and an empty `attributes` set. Note that this is the
same object that will later be emitted in the `opentag` event.

`opentag` - An opening tag. Argument: object with `name` and `attributes`.
In non-strict mode, tag names are uppercased, unless the `lowercase`
option is set. If the `xmlns` option is set, then it will contain
namespace binding information on the `ns` member, and will have a
`local`, `prefix`, and `uri` member.

`closetag` - A closing tag. In loose mode, tags are auto-closed if their
parent closes. In strict mode, well-formedness is enforced. Note that
self-closing tags will have `closetag` emitted immediately after `opentag`.

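
Since the parser does not build an object model out of the box, the tag events can be combined to build one by hand. The sketch below keeps a stack of open elements and nests children as tags open and close. `buildTree` and the node shape (`name`, `attributes`, `children`) are assumptions made for this example, not something the parser produces.

```javascript
// Hypothetical helper: build a plain object tree from opentag, text, and
// closetag events. Returns a synthetic root whose children are filled in
// as the parser emits events.
function buildTree(parser) {
  var root = { name: null, attributes: {}, children: [] };
  var stack = [root];
  parser.onopentag = function (tag) {
    var node = { name: tag.name, attributes: tag.attributes, children: [] };
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  };
  parser.ontext = function (text) {
    stack[stack.length - 1].children.push(text);
  };
  parser.onclosetag = function () {
    if (stack.length > 1) {
      stack.pop(); // never pop the synthetic root
    }
  };
  return root;
}
```

Because a self-closing tag gets its close event right after its open event, it simply ends up as a node with no children.
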
`attribute` - An attribute node. Argument: object with `name` and `value`.
In non-strict mode, attribute names are uppercased, unless the `lowercase`
option is set. If the `xmlns` option is set, it will also contain namespace
information.

`comment` - A comment node. Argument: the string of the comment.

`opencdata` - The opening tag of a `<![CDATA[` block.

`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get
quite large, this event may fire multiple times for a single block, if it
is broken up into multiple `write()`s. Argument: the string of random
character data.

`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.

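
Because `cdata` can fire more than once per block, code that wants the whole block has to join the pieces itself. One way to do that, sketched under the assumption that `opencdata` and `closecdata` always bracket the `cdata` events of a block, is shown below; `collectCdata` is a hypothetical helper name.

```javascript
// Hypothetical helper: join the pieces of each CDATA block and report the
// complete block once it closes.
function collectCdata(parser, onBlock) {
  var pieces = null; // null while outside a CDATA block
  parser.onopencdata = function () {
    pieces = [];
  };
  parser.oncdata = function (chunk) {
    if (pieces) {
      pieces.push(chunk);
    }
  };
  parser.onclosecdata = function () {
    if (pieces) {
      onBlock(pieces.join(""));
    }
    pieces = null;
  };
  return parser;
}
```

For very large blocks it may be better to process each chunk as it arrives instead of buffering the whole block in memory.
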
`opennamespace` - If the `xmlns` option is set, then this event will
signal the start of a new namespace binding.

`closenamespace` - If the `xmlns` option is set, then this event will
signal the end of a namespace binding.

`end` - Indication that the closed stream has ended.

`ready` - Indication that the stream has reset, and is ready to be written
to.

`noscript` - In non-strict mode, `<script>` tags trigger a `"script"`
event, and their contents are not checked for special XML characters.
If you pass `noscript: true`, then this behavior is suppressed.

## Reporting Problems

It's best to write a failing test if you find an issue. I will always
accept pull requests with failing tests if they demonstrate intended
behavior, but it is very hard to figure out what issue you're describing
without a test. Writing a test is also the best way for you yourself
to figure out if you really understand the issue you think you have.