# sax js

A sax-style parser for XML and HTML.

Designed with node in mind, but should work fine in
the browser or other CommonJS implementations.

## What This Is

* A very simple tool to parse through an XML string.
* A stepping stone to a streaming HTML parser.
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML
  docs.

## What This Is (probably) Not

* An HTML Parser - That's a fine goal, but this isn't it. It's just
  XML.
* A DOM Builder - You can use it to build an object model out of XML,
  but it doesn't do that out of the box.
* XSLT - No DOM = no querying.
* 100% Compliant with (some other SAX implementation) - Most SAX
  implementations are in Java and do a lot more than this does.
* An XML Validator - It does a little validation when in strict mode, but
  not much.
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic
  masochism.
* A DTD-aware Thing - Fetching DTDs is a much bigger job.

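The "DOM Builder" point above can be made concrete with a small sketch. It relies only on the handler names documented in this README (`onopentag`, `ontext`, `onclosetag`). The function name `buildTree` and the node shape `{ name, attributes, children }` are hypothetical choices for this example, not something sax-js provides.

```javascript
// Hypothetical helper: collects parser events into a plain object tree.
// `parser` is any object that calls onopentag/ontext/onclosetag the way
// a sax parser does. The node shape is an assumption made for this sketch.
function buildTree (parser) {
  var root = { name: null, attributes: {}, children: [] }
  var stack = [root]

  parser.onopentag = function (node) {
    var el = { name: node.name, attributes: node.attributes, children: [] }
    stack[stack.length - 1].children.push(el)
    stack.push(el)
  }
  parser.ontext = function (text) {
    // Text becomes a string child of whichever element is currently open.
    stack[stack.length - 1].children.push(text)
  }
  parser.onclosetag = function () {
    // Never pop the synthetic root, even if the input is unbalanced.
    if (stack.length > 1) stack.pop()
  }
  return root
}
```

With the real module this would probably be wired up by creating a parser, passing it to `buildTree`, and then calling `write(...)` and `close()`; the returned root should hold the whole tree once `end` has fired.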
## Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything
with that. If you want to listen to the `ondoctype` event, and then fetch
the doctypes, and read the entities and add them to `parser.ENTITIES`, then
be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass
through unmolested.

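As a rough illustration of the "be my guest" route, the sketch below reads entity declarations from the doctype string and copies them onto `parser.ENTITIES`. The helper names `readInternalEntities` and `registerDoctypeEntities` are hypothetical, and the regular expression is a simplification: it only understands quoted, internal, general entities declared inline, and it does not fetch external DTDs or expand nested references.

```javascript
// Hypothetical helper: pulls <!ENTITY name "value"> declarations out of a
// doctype string. Only internal, quoted, general entities are handled;
// parameter entities (% name) and external (SYSTEM/PUBLIC) ones are skipped.
function readInternalEntities (doctype) {
  var pattern = /<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g
  var found = {}
  var match
  while ((match = pattern.exec(doctype)) !== null) {
    found[match[1]] = match[2] !== undefined ? match[2] : match[3]
  }
  return found
}

// Copies any declared entities onto parser.ENTITIES when the doctype arrives.
function registerDoctypeEntities (parser) {
  parser.ondoctype = function (doctype) {
    var found = readInternalEntities(doctype)
    Object.keys(found).forEach(function (name) {
      parser.ENTITIES[name] = found[name]
    })
  }
}
```

Because the doctype normally precedes the document body, entities registered this way should be known by the time text nodes that use them are parsed.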
## Usage

```javascript
var sax = require("./lib/sax"),
  strict = true, // false selects loose mode
  parser = sax.parser(strict);

parser.onerror = function (e) {
  // an error happened.
};
parser.ontext = function (t) {
  // got some text. t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag. node has "name" and "attributes"
};
parser.onattribute = function (attr) {
  // an attribute. attr has "name" and "value"
};
parser.onend = function () {
  // the closed stream has ended.
};

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

// stream usage
// takes the same arguments as the parser
var fs = require("fs")
var options = {} // see "Arguments" below
var saxStream = require("sax").createStream(strict, options)
saxStream.on("error", function (e) {
  // without an "error" listener, a node event emitter throws instead
  console.error("error!", e)
  // clear the error so that parsing can continue
  this._parser.error = null
  this._parser.resume()
})
saxStream.on("opentag", function (node) {
  // same object as in parser.onopentag above
})

// the stream is both writable and readable, so it can sit mid-pipe
fs.createReadStream("file.xml")
  .pipe(saxStream)
  .pipe(fs.createWriteStream("file-copy.xml"))
```


## Arguments

Pass the following arguments to the parser function. All are optional.

`strict` - Boolean. Whether or not to be a jerk. Default: `false`.

`opt` - Object bag of settings regarding string formatting. All default to `false`.

Settings supported:

* `trim` - Boolean. Whether or not to trim text and comment nodes.
* `normalize` - Boolean. If true, then turn any whitespace into a single
  space.
* `lowercase` - Boolean. If true, then lowercase tag names and attribute names
  in loose mode, rather than uppercasing them.
* `xmlns` - Boolean. If true, then namespaces are supported.
* `position` - Boolean. If false, then don't track line/col/position.
* `strictEntities` - Boolean. If true, only parse [predefined XML
  entities](http://www.w3.org/TR/REC-xml/#sec-predefined-ent)
  (`&amp;`, `&apos;`, `&gt;`, `&lt;`, and `&quot;`)

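To show how the two arguments fit together, here is a minimal sketch of a factory for a forgiving, feed-oriented parser. `createFeedParser` is a hypothetical name, and the particular combination of settings is just one plausible choice. The sax module is passed in as a parameter so the snippet does not assume how it was loaded.

```javascript
// Sketch: build a loose-mode parser tuned for feed-like documents.
// `saxModule` is whatever require("sax") returned in your program.
function createFeedParser (saxModule) {
  var strict = false // loose mode: tolerate mostly-ok-but-kinda-broken XML
  var opt = {
    trim: true,      // trim text and comment nodes
    normalize: true, // turn any whitespace into a single space
    lowercase: true, // lowercase names instead of uppercasing them
    position: false  // don't track line/col/position
  }
  return saxModule.parser(strict, opt)
}
```

Settings left out of `opt` (here `xmlns` and `strictEntities`) are simply not enabled.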
## Methods

`write` - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.

`close` - Close the stream. Once closed, no more data may be written until
it is done processing the buffer, which is signaled by the `end` event.

`resume` - To gracefully handle errors, assign a listener to the `error`
event. Then, when the error is taken care of, you can call `resume` to
continue parsing. Otherwise, the parser will not continue while in an error
state.

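The three methods can be combined as in the sketch below, which writes a document in several pieces and keeps going past errors. `parseChunks` is a hypothetical helper, and the code assumes that handlers run while `write` is executing, so that every error has been collected by the time `close` returns.

```javascript
// Hypothetical helper: feeds chunks to a parser one at a time, collecting
// errors instead of stopping at the first one. Relies only on write, close,
// resume and the error member described in this README.
function parseChunks (parser, chunks) {
  var errors = []
  parser.onerror = function (err) {
    errors.push(err)
    parser.error = null // clear the stored error...
    parser.resume()     // ...so the parser leaves its error state
  }
  chunks.forEach(function (chunk) {
    parser.write(chunk)
  })
  parser.close()
  return errors
}
```

Collecting errors rather than throwing suits loose mode and broken feeds; in strict mode it may be more useful to stop at the first error.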
## Members

At all times, the parser object will have the following members:

`line`, `column`, `position` - Indications of the position in the XML
document where the parser currently is looking.

`startTagPosition` - Indicates the position where the current tag starts.

`closed` - Boolean indicating whether or not the parser can be written to.
If it's `true`, then wait for the `ready` event to write again.

`strict` - Boolean indicating whether or not the parser is a jerk.

`opt` - Any options passed into the constructor.

`tag` - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.

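The position members are mostly useful for diagnostics. A minimal sketch, with `describePosition` as a hypothetical helper name and "offset" as this example's own label for `position`:

```javascript
// Hypothetical helper: describes where the parser currently is, using the
// line, column and position members listed above. Values are reported as
// stored; no assumption is made about whether they are zero- or one-based.
function describePosition (parser) {
  return "line " + parser.line + ", column " + parser.column +
    " (offset " + parser.position + ")"
}
```

Inside an event handler the parser is available as `this`, so an error handler could log `describePosition(this)` next to the error message. If the `position` option is set to `false`, these members are not tracked and the output will not be meaningful.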
## Events

All events emit with a single argument. To listen to an event, assign a
function to `on<eventname>`. Functions get executed with the parser object
as `this`. The list of supported events is also in the exported
`EVENTS` array.

When using the stream interface, assign handlers using the EventEmitter
`on` function in the normal fashion.

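Since handlers are plain `on<eventname>` properties, a list of event names is enough to instrument a parser. The sketch below records every event it is told about. `recordEvents` is a hypothetical helper, and `eventNames` would typically be the exported `EVENTS` array. Note that it also replaces any `onerror` handler, so errors are recorded but not cleared.

```javascript
// Hypothetical helper: attaches one handler per event name and records
// every event it sees, in order, together with its single argument.
function recordEvents (parser, eventNames) {
  var seen = []
  eventNames.forEach(function (name) {
    parser["on" + name] = function (arg) {
      seen.push({ event: name, arg: arg })
    }
  })
  return seen
}
```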
`error` - Indication that something bad happened. The error will be hanging
out on `parser.error`, and must be deleted before parsing can continue. By
listening to this event, you can keep an eye on that kind of stuff. Note:
this happens *much* more in strict mode. Argument: instance of `Error`.

`text` - Text node. Argument: string of text.

`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.

`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument:
object with `name` and `body` members. Attributes are not parsed, as
processing instructions have implementation dependent semantics.

`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>`
would trigger this kind of event. This is a weird thing to support, so it
might go away at some point. SAX isn't intended to be used to parse SGML,
after all.

`opentagstart` - Emitted immediately when the tag name is available,
but before any attributes are encountered. Argument: object with a
`name` field and an empty `attributes` set. Note that this is the
same object that will later be emitted in the `opentag` event.

`opentag` - An opening tag. Argument: object with `name` and `attributes`.
In non-strict mode, tag names are uppercased, unless the `lowercase`
option is set. If the `xmlns` option is set, then it will contain
namespace binding information on the `ns` member, and will have a
`local`, `prefix`, and `uri` member.

`closetag` - A closing tag. In loose mode, tags are auto-closed if their
parent closes. In strict mode, well-formedness is enforced. Note that
self-closing tags will have `closetag` emitted immediately after `opentag`.
Argument: tag name.

`attribute` - An attribute node. Argument: object with `name` and `value`.
In non-strict mode, attribute names are uppercased, unless the `lowercase`
option is set. If the `xmlns` option is set, it will also contain namespace
information.

`comment` - A comment node. Argument: the string of the comment.

`opencdata` - The opening tag of a `<![CDATA[` block.

`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get
quite large, this event may fire multiple times for a single block, if it
is broken up into multiple `write()`s. Argument: the string of random
character data.

`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.

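Because `cdata` may fire several times per block, code that needs the whole block has to join the pieces itself. A minimal sketch, with `collectCdata` and its `onblock` callback as hypothetical names:

```javascript
// Hypothetical helper: joins the pieces of each CDATA block and hands the
// complete string to `onblock` when the block closes.
function collectCdata (parser, onblock) {
  var pieces = null
  parser.onopencdata = function () {
    pieces = [] // start a fresh buffer for this block
  }
  parser.oncdata = function (chunk) {
    if (pieces) pieces.push(chunk)
  }
  parser.onclosecdata = function () {
    if (pieces) onblock(pieces.join(""))
    pieces = null
  }
}
```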
`opennamespace` - If the `xmlns` option is set, then this event will
signal the start of a new namespace binding.

`closenamespace` - If the `xmlns` option is set, then this event will
signal the end of a namespace binding.

`end` - Indication that the closed stream has ended.

`ready` - Indication that the stream has reset, and is ready to be written
to.

`noscript` - In non-strict mode, `<script>` tags trigger a `"script"`
event, and their contents are not checked for special xml characters.
If you pass `noscript: true` in the options, then this behavior is suppressed.

## Reporting Problems

It's best to write a failing test if you find an issue. I will always
accept pull requests with failing tests if they demonstrate intended
behavior, but it is very hard to figure out what issue you're describing
without a test. Writing a test is also the best way for you yourself
to figure out if you really understand the issue you think you have with
sax-js.