
PiXTools by rjb

This is work in progress.

BUG WARNING: according to the XML specification, <foo bar=">"/> is a well-formed XML document, while <foo bar="<"/> is not. The current version of PiXPull.Parser does not handle either case correctly. This will be fixed soon.
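To see the rule from the warning above in action with a conforming parser, here is a minimal sketch in Python. It uses Python's standard-library ElementTree (backed by expat), not PiXPull; the helper name is_well_formed is just an illustrative choice.

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc: str) -> bool:
    """Return True if a conforming parser accepts doc, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# A literal '>' is allowed inside an attribute value...
print(is_well_formed('<foo bar=">"/>'))   # True
# ...but a literal '<' is not; it would have to be written as &lt;
print(is_well_formed('<foo bar="<"/>'))   # False
```

These two inputs make a handy regression test for any parser that claims to check well-formedness.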

I am making available a couple of Pike modules for dealing with XML. Currently they are neither complete in any sense, nor well-tested; but hopefully already somewhat useful. Any bugfixes, suggestions for improvement, even flames about my coding style, are most welcome.

The modules contain lots of AutoDoc comments, but I haven't yet figured out how to generate pretty HTML doc pages for standalone Pike modules. Any help on this will be appreciated.

In case you want to use this code in your own apps, consider it free software, provided under the terms of the GNU LGPL.

PiXPull.pmod
demos for PiXPull.pmod
PiXTree.pmod

These modules and scripts were coded against Pike 7.4. Please let me know should they fail when used with 7.5.
They currently do not work with Pike 7.2; let me know if there's demand for fixing this.
Being added: sample scripts demonstrating usage of these modules.

PiXPull.pmod

PiXPull.pmod aims to be a Pike implementation of an API not unlike the one described and advocated at http://xmlpull.org/ (which was designed for Java). In brief, it is a streaming XML parser based on the "pull" parsing model rather than the more usual event-callback scheme (SAX, etc.). Currently it neither validates XML nor parses DOCTYPE declarations (it merely skips them), and it is not fully compliant with the XML spec in detecting violations of well-formedness (many checks that are somewhat expensive to do in Pike are missing). XML Namespaces are not implemented (support is optional in the XMLPull API), but they shouldn't be too hard to add and are on the TODO list.
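For readers new to the "pull" model, the following sketch shows the general idea in Python. It uses the standard library's ElementTree.iterparse as a stand-in, so none of the names below belong to PiXPull; the sample document and the function pull_events are made up for illustration. The point is that the calling code asks for each event and may stop at any time, whereas in a callback scheme the parser drives the application.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input, not taken from the PiXTools distribution.
DOC = b"<play><title>Much Ado</title><act n='1'/></play>"

def pull_events(data: bytes):
    """Collect (event, tag) pairs, pulling one event at a time from the parser."""
    events = ET.iterparse(io.BytesIO(data), events=("start", "end"))
    seen = []
    for event, elem in events:      # each iteration requests the next event
        seen.append((event, elem.tag))
        if event == "start" and elem.tag == "act":
            break                   # the caller decides when to stop
    return seen

for event, tag in pull_events(DOC):
    print(event, tag)
```

Because the loop lives in the application, state such as "am I inside a title element?" can be kept in ordinary local variables instead of being spread across handler objects.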

Other bugs/omissions:

input encoding detection isn't quite there (but at least utf-8, utf-16 and 8-bit encodings supported by Pike should work)
attribute value normalization is buggy (I think)
treatment of entity references is seriously incomplete
error reporting needs improvement.

It does however seem to work OK for about all XML input files I tried ;-)

Being written entirely in Pike, PiXPull is not lightning-fast... but it does not seem to be significantly slower than pure Java implementations (and is of course no match for compiled native code...)

In addition to the main Parser class, the module includes a simple XmlSerializer for XML output streaming. Both classes are mostly compliant with (a sizeable subset of) the XmlPull API (version 1), to the extent that the API ports from Java to Pike.

PiXPull.pmod can be downloaded here.

demo scripts of PiXPull usage

The sample scripts expect the PiXTools modules in the same directory as the script; if you have read this far, you will obviously know to edit the import statement(s) according to your liking.

testXPull: a trivial script that simply tallies the events observed by a PiXPull.Parser and prints some summary output.
Usage: if given the option -n, the script uses the next() API; otherwise parsing proceeds by calls to nextToken(). A single non-option argument may be given, naming the XML file to parse; if none is given, standard input is read.
Sample output:

$ ./testXPull much_ado.xml
Input encoding used: utf-8
Parsing time: 0.3267s
Characters read: 195263
PARSE_ERROR : 0
END_DOCUMENT : 1
START_DOCUMENT : 1
START_TAG : 4727
END_TAG : 4727
TEXT : 9421
ENTITY_REF : 3
CDSECT : 0
COMMENT : 0
DOCDECL : 1
IGNORABLE_WHITESPACE : 3
PROCESSING_INSTRUCTION : 0

slashnews: get your hourly fix of slashdot.org headlines on your terminal or text console -- no web browser required ;-) Note that it works best with a huge terminal window (such as I use: a full-screen KDE Konsole or xterm).
Caveat: this script fetches slashdot.xml every time it is run, and Slashdot says don't fetch this too often or your IP will be blocked (but AFAIK they never actually do this, well maybe if you are really obnoxious...)

PiXTree.pmod

PiXTree.pmod is a rough sketch of an interface for generating XML output. If PiXPull is pre-alpha, this is just an experimental prototype. Your feedback (if any ;-) will tell me whether this is at all a good idea.

The general idea is to provide a simple API for building an XML tree in memory, with liberal use of operator overloading to streamline the syntax of most common operations.
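To give a flavour of what tree building with operator overloading can look like, here is a small hypothetical sketch in Python. It is not the PiXTree API: the class Node, the use of << to append children, and the use of [] to set attributes are all invented for this illustration.

```python
from xml.sax.saxutils import escape, quoteattr

class Node:
    """Hypothetical in-memory XML element; not PiXTree's actual class."""

    def __init__(self, tag, **attrs):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []

    def __lshift__(self, child):
        # node << child appends an element or a text string, then
        # returns node so that several appends can be chained.
        self.children.append(child)
        return self

    def __setitem__(self, name, value):
        # node["name"] = value sets an attribute.
        self.attrs[name] = value

    def __str__(self):
        attrs = "".join(f" {k}={quoteattr(str(v))}" for k, v in self.attrs.items())
        if not self.children:
            return f"<{self.tag}{attrs}/>"
        body = "".join(
            str(c) if isinstance(c, Node) else escape(str(c))
            for c in self.children
        )
        return f"<{self.tag}{attrs}>{body}</{self.tag}>"

doc = Node("news") << (Node("story", id="1") << "Pike & XML") << Node("story", id="2")
doc["source"] = "slashdot"
print(doc)
# <news source="slashdot"><story id="1">Pike &amp; XML</story><story id="2"/></news>
```

Returning the node from << is what makes chaining work; escaping happens only at serialization time, so text and attribute values can be stored as plain strings.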

Download PiXTree.pmod.

Enjoy (or otherwise ;-)...

Comments?


Robert J. Budzynski
