xmlsq - XML Simple Query

xmlsq is a simple lightweight utility to query XML documents using XPath 1.0. New release 18 July 2021.

Did you ever want a simple utility that just went and got you the text value out of an XML file, without the overhead of a huge XML library? Our xmlsq utility will do that for you. xmlsq is provided both as a stand-alone Windows command-line executable and as a separate API you can call from various programming languages including C#, VB.NET, C, C++, VBA and VB6. And it's free.

Introduction

xmlsq uses XPath 1.0 and operates on either an XML file or a string that represents a valid XML document.

The default mode of xmlsq ("get-text") simply returns the text value for the first occurrence of the node that matches the XPath query: that is, the text content of an element or the value of an attribute. There is also a "full-query" mode, which returns the result of a full XPath query as a string, and a "count" mode, which returns the number of nodes that match the query.

In "get-text" and "count" mode the XPath query must evaluate to a node or node set in the XML document. Queries that return a number or boolean type will fail. Queries in "full-query" mode will return the result as a string. For example, the number 1.5 will be returned as the string "1.500000".

We assume you are familiar with XPath 1.0. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document. For more information on XPath 1.0 see the References below.

For the command-line syntax, type xmlsq --help

Usage: xmlsq {[-g]|-f|-c} [OPTION]... QUERY [INFILE]
Perform XPath 1.0 queries on an XML document.
  -g, --get-text      get text for first matching node [default]
  -f, --full-query    execute full XPath 1.0 query
  -c, --count         return the integer value of `count(query)`
OPTIONS:
  -@, --stdin         read input from stdin [default=INFILE]
  -i, --input=INFILE  optional way to specify INFILE
  -a, --asciify       output non-ASCII chars as XML character references
  -r, --raw           output nodeset in raw format [default=prettify]
  -t, --trim          trim leading/trailing whitespace
                       (and collapse whitespace for an attribute value)
  -d, --delim=DELIM   enclose output in delimiter(s), eg ' or []
  -v, --version       print program version and exit
  -h, --help          print this help and exit
  -E, --examples      print examples and exit
INFILE must be specified unless `--stdin` option is used.
Exit status is 0 on success or 1 if error.

For example, if your input is the XML document hello.xml

<a>
  <b foo='baz'>hello</b>
  <b>world</b>
</a>

then the simple "get-text" query for "//b" will return just the character data inside the first element b which is "hello"

> xmlsq //b hello.xml
hello

Similarly, the query "//b/@foo" will return the value of the attribute foo for the first occurrence of element b, which is "baz".

> xmlsq //b/@foo hello.xml
baz

To get the text content of the second element b, use the element predicate [2]:

> xmlsq //b[2] hello.xml
world

By contrast, using the "full-query" mode (the full XPath 1.0 query) returns the set of all matching nodes:

> xmlsq --full-query //b hello.xml
<b foo="baz">hello</b>
<b>world</b>
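
If you want to drive the command-line program from a script, the following Python sketch shows one way to do it. It uses only the options listed in the help text above. The function names build_argv and run_xmlsq are our own inventions for illustration, and we assume that xmlsq is on the PATH and that the console encoding suits your data.

```python
import shutil
import subprocess

# Long option for each documented mode.
MODES = {"get-text": "--get-text", "full-query": "--full-query", "count": "--count"}

def build_argv(query, infile, mode="get-text", raw=False, trim=False):
    """Build an argument list following: xmlsq {[-g]|-f|-c} [OPTION]... QUERY [INFILE]."""
    argv = ["xmlsq", MODES[mode]]
    if raw:
        argv.append("--raw")
    if trim:
        argv.append("--trim")
    argv += [query, infile]
    return argv

def run_xmlsq(query, infile, **options):
    """Run xmlsq and return its output; raise if it is missing or exits with status 1."""
    if shutil.which("xmlsq") is None:
        raise FileNotFoundError("xmlsq executable not found on PATH")
    proc = subprocess.run(build_argv(query, infile, **options),
                          capture_output=True, text=True)
    if proc.returncode != 0:
        # We do not know which stream carries the error text, so check both.
        raise RuntimeError(proc.stderr.strip() or proc.stdout.strip())
    return proc.stdout.rstrip("\r\n")

print(build_argv("//b", "hello.xml"))
print(build_argv("/a", "hello.xml", mode="full-query", raw=True))
```

Passing the query as a separate list item avoids any shell quoting problems with characters like quotes and brackets in the XPath expression.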

In our work with XML documents (mostly security related, XML-DSIG and XMLENC) the simple "get-text" mode is exactly the behaviour we want.

More details below in Get-text mode and Full-query mode.

Get-text mode

The default "get-text" mode is a simplified form of XPath designed to return the text contents of the first element that matches the XPath query.

If the element is a text-only node with only character data inside, then this text is returned. If the query is for an attribute, then the attribute's text value is returned. If the element contains child elements or mixed child elements and character data, then the contents will be returned as a "prettified" string.

> xmlsq /a hello.xml
<b foo="baz">hello</b>
<b>world</b>

The XPath expression in get-text mode must evaluate to a node or node set (else **ERROR: BAD_XPATH:Expression does not evaluate to node set**).
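
To make the get-text rules concrete, here is a rough imitation in Python using the standard library's ElementTree. This is not how xmlsq is implemented: ElementTree supports only a subset of XPath (note the ".//b" path instead of "//b", and the attribute passed separately), and the function get_text is a hypothetical helper written for this illustration. Mixed content is simplified to the child elements only.

```python
import xml.etree.ElementTree as ET

HELLO = "<a>\n  <b foo='baz'>hello</b>\n  <b>world</b>\n</a>"

def get_text(xml, path, attr=None):
    """Imitate get-text: use the first match only, return text or attribute value."""
    root = ET.fromstring(xml)
    node = root if path == "." else root.find(path)  # find() returns the first match
    if node is None:
        return ""                      # no match gives an empty result
    if attr is not None:
        return node.get(attr, "")      # attribute query: the attribute's value
    if len(node) == 0:
        return node.text or ""         # text-only element: its character data
    # Element with children: serialise each child, dropping trailing whitespace.
    return "\n".join(ET.tostring(child, encoding="unicode").strip() for child in node)

print(get_text(HELLO, ".//b"))              # hello
print(get_text(HELLO, ".//b", attr="foo"))  # baz
print(get_text(HELLO, "."))                 # the two <b> elements, one per line
```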

Full-query mode

Using the --full-query or -f mode returns the full XPath result, not the simplified text-only results returned for the "get-text" mode. Any valid XPath expression may be used.

Query	Default	Full-query
//b	hello	<b foo="baz">hello</b> <b>world</b>
//b[2]	world	<b>world</b>
//b/@foo	baz	foo="baz"
/a	<b foo="baz">hello</b> <b>world</b>	<a> <b foo="baz">hello</b> <b>world</b> </a>

By default, a nodeset is "prettified" with tabs and newlines. Use the --raw option if you don't want this.

> xmlsq -f /a hello.xml
<a>
        <b foo="baz">hello</b>
        <b>world</b>
</a>

> xmlsq --raw -f /a hello.xml
<a><b foo="baz">hello</b><b>world</b></a>

You can use any valid XPath 1.0 expression in full-query mode. Use a dummy XML document, e.g. <a/>, for expressions that do not operate on an XML document. For example:

> xmlsq -f "3 + 5 div 2" "<a/>"
5.500000
> xmlsq -f "string-length('abc')" "<a/>"
3.000000
> xmlsq -f "substring('abcdefghij',3,4)" "<a/>"
cdef
> xmlsq -f "normalize-space('  my   node  ')" "<a/>"
my node
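
The numeric results above are printed with six decimal places, which looks like C-style "%f" formatting. That is our inference from the examples shown here, not a documented guarantee. Under that assumption, this Python sketch reproduces the strings and shows how a caller might turn them back into numbers.

```python
def number_to_string(value):
    """Format a number the way the examples above appear (assumed to be '%f')."""
    return "%f" % value

# XPath 'div' is floating-point division, so 3 + 5 div 2 is 5.5.
print(number_to_string(3 + 5 / 2))   # 5.500000
print(number_to_string(len("abc")))  # 3.000000

# Going the other way: parse the returned string before using it as a number.
length = int(float("3.000000"))
print(length)                        # 3
```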

Count mode

The "count" mode (--count or -c) returns the integer value of count(query), the number of nodes that match the query.

> xmlsq --count "//b" hello.xml
2

Use the count mode to:

Get a count to iterate over all matching nodes: e.g. to evaluate (query)[i] for i = 1 to count. See "Use the count to query each matching element".

Check for the existence of an element or attribute value that may either be missing (count = 0) or merely empty (count > 0).

> xmlsq -f "/" "<a><e /></a>"
<a>
        <e />
</a>
> xmlsq --count "//e" "<a><e /></a>"
1
> xmlsq --count "//notthere" "<a><e /></a>"
0

The XPath expression in count mode must evaluate to a node or node set (else **ERROR: BAD_XPATH:Expression does not evaluate to node set** or **ERROR -3).

> xmlsq --count "1+2" "<a/>"
**ERROR -3
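
Both uses of the count can be sketched in Python. The first function builds the per-node queries (query)[i] described above, ready to be passed to xmlsq one at a time. The second uses the standard library's ElementTree (not xmlsq) to show the difference between a missing element and an empty one. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def indexed_queries(query, count):
    """Return the queries (query)[1] ... (query)[count], one per matching node."""
    return ["(%s)[%d]" % (query, i) for i in range(1, count + 1)]

def classify(xml, tag):
    """Report whether the first <tag> element is missing, empty or has content."""
    matches = ET.fromstring(xml).findall(".//" + tag)
    if len(matches) == 0:
        return "missing"               # corresponds to count = 0
    first = matches[0]
    if first.text or len(first) > 0:
        return "has content"
    return "empty"                     # corresponds to count > 0 with no text

print(indexed_queries("//b", 2))               # ['(//b)[1]', '(//b)[2]']
print(classify("<a><e /></a>", "e"))           # empty
print(classify("<a><e /></a>", "notthere"))    # missing
```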

Usage

Examples

Download

Download the latest version of xmlsq for Windows from one of the links below. This is free for personal and commercial use subject to the license conditions^†.

Most recent production version 1.0.0 compiled 18 July 2021. Use either

XmlsqInst.exe (413 kB) or
XmlsqInst.zip (388 kB)

Either unzip the zip file and run the install.exe program inside it, or download the exe program directly and run it. These installation programs should be signed by verified publisher "d.i. management services pty limited". The minimum required operating system is Windows XP SP2 (that is, XP/Vista/W7/W8/W10 are supported) or Windows Server 2003.

Please note it is a breach of copyright to put a copy of these installation files on another server or to distribute them in any manner except by providing a link to this page.

Trouble installing: If Microsoft Defender SmartScreen gives you a warning, see Unrecognized app error. (TL;DR Click "More info" then "Run anyway"). Check that you see "Publisher: D.I. MANAGEMENT SERVICES PTY LIMITED".

After installing, test by opening a command line window and typing xmlsq --help. See Command-line syntax for more details.

^† If your organisation has a problem with the license conditions as stated, please contact us and we'll work something out.

Python interface

The Python interface to xmlsq is available separately from the Python Package Index (PyPI). For details see Python programming.

C++ (STL) interface

Added 18 July 2021. The C++ interface is an alternative interface for C++ programmers who want to avoid the memory allocation hassles of using the "raw" C interface. All strings are std::string and all byte arrays are stored in a std::vector. It raises exceptions if the input parameters are wrong (invalid XML data or XPath expression) or a file is missing. We are not that keen on exceptions as a rule, but we'll go with the flow here. The documentation is here and example code is here.

The code should compile under the C++11 standard or later. We've tested it using MSVC++ Version 12.0 (2013) and g++ version 8.3.0 (x86_64-w64-mingw32/8.3.0). Could the code be more concise? Goodness, it's C++, not Haskell! If you think the C++ code could be improved, please let us know.

Conformance with W3C

xmlsq conforms to the W3C XPath 1.0 specification except for the following incompatibilities (Ref: pugixml v1.10: 8.6. Conformance to W3C specification).

Consecutive text nodes sharing the same parent are not merged, i.e. in <node>text1 <![CDATA[data]]> text2</node> node should have one text node child, but instead has three.
Since the document type declaration is not used for parsing, id() function always returns an empty node set.
Namespace nodes are not supported (affects namespace:: axis).
Name tests are performed on QNames in XML document instead of expanded names; for <foo xmlns:ns1='uri' xmlns:ns2='uri'><ns1:child/><ns2:child/></foo>, query foo/ns1:* will return only the first child, not both of them. Compliant XPath implementations can return both nodes if the user provides appropriate namespace declarations.
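
For comparison, the sketch below shows the behaviour the specification expects for the first and last items, using Python's standard ElementTree rather than xmlsq. ElementTree is a different parser with its own data model, so treat this only as an illustration of "merged text" and "expanded name" matching.

```python
import xml.etree.ElementTree as ET

# The text and CDATA section are presented as one run of character data.
node = ET.fromstring("<node>text1 <![CDATA[data]]> text2</node>")
merged_text = node.text
print(repr(merged_text))       # 'text1 data text2'

# Both prefixes are bound to the same namespace name 'uri', so a match on the
# expanded name {uri}child finds both children, whatever prefix each one uses.
foo = ET.fromstring(
    "<foo xmlns:ns1='uri' xmlns:ns2='uri'><ns1:child/><ns2:child/></foo>")
expanded_matches = foo.findall("{uri}child")
print(len(expanded_matches))   # 2
```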

Namespaces

Note that xmlsq does not support namespace nodes. It is "namespace ignorant". We consider this a bonus! Just do a web search on "XML namespaces are keeping me from selecting nodes" or "XML namespaces are breaking my XPath searches" to see how much confusion namespaces cause with XPath. You either need to invoke a "Namespace Manager" (which we deliberately don't offer) or use incredibly complicated XPath expressions.

In xmlsq name tests are performed on qualified names (QNames) instead of expanded names. That means a search on "//ds:Signature" will work as expected provided there are not two different namespaces declared for the prefix "ds" (and, honestly, if you have documents like that you deserve all the heartache you get).

If you really need some sort of namespace manager for your XPath queries, then this lightweight tool is not for you.

TL;DR Just search on the actual tag name used in the document.

Example: xmlenc document using default xmlns for xmldsig element <KeyInfo>:

<EncryptedData xmlns="http://www.w3.org/2001/04/xmlenc#" MimeType="text/plain">
  <EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#aes128-cbc" />
  <KeyInfo xmlns="http://www.w3.org/2000/09/xmldsig#">
    <KeyName>job</KeyName>
  </KeyInfo>
  <!-- ... -->
</EncryptedData>

The query //KeyName works on the above document:

> xmlsq //KeyName encrypt-data-aes128-cbc.xml
job

Equivalent document using prefix "ds:"

<EncryptedData xmlns="http://www.w3.org/2001/04/xmlenc#" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" 
 MimeType="text/plain">
  <EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#aes128-cbc" />
  <ds:KeyInfo>
    <ds:KeyName>job</ds:KeyName>
  </ds:KeyInfo>
  <!-- ... -->
</EncryptedData>

In this case, we must use the prefix "ds:" in the query "//ds:KeyName" to get the value we want:

> xmlsq --delim=' //KeyName encrypt-data-aes128-cbc-ds.xml
''
> xmlsq --delim=' //ds:KeyName encrypt-data-aes128-cbc-ds.xml
'job'

Alternatively, use the XPath local-name() function:

> xmlsq "//*[local-name()='KeyName']" encrypt-data-aes128-cbc-ds.xml
job
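
The idea behind the local-name() query, matching on the tag name with any prefix or namespace stripped off, can be sketched in Python with the standard library's ElementTree. This is only an illustration of the matching rule on a shortened copy of the document above. It is not xmlsq code, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<EncryptedData xmlns="http://www.w3.org/2001/04/xmlenc#"
 xmlns:ds="http://www.w3.org/2000/09/xmldsig#" MimeType="text/plain">
  <ds:KeyInfo>
    <ds:KeyName>job</ds:KeyName>
  </ds:KeyInfo>
</EncryptedData>"""

def local_name(tag):
    """ElementTree stores tags as '{namespace}name'; keep only the name part."""
    return tag.rsplit("}", 1)[-1]

def first_text_by_local_name(xml, name):
    """Return the text of the first element whose local name is `name`."""
    for element in ET.fromstring(xml).iter():
        if local_name(element.tag) == name:
            return element.text or ""
    return ""

print(first_text_by_local_name(DOC, "KeyName"))  # job
```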

Support

For support, please contact us and provide at least the following

The programming language you are using (C#, VB6, command-line, etc)
The exact error message you are getting.
A snippet of code where the error occurs (including the input arguments).

Before calling support, please open a command-line console and check that the following works

xmlsq --help

If this fails, re-install the program.

If you are getting error messages like "Cannot find DLL" or similar, please read Using on a 64-bit platform.

Acknowledgements

This software is based on the pugixml library (https://pugixml.org). Pugixml is Copyright (C) 2006-2019 Arseny Kapoulkine. Pugixml is a light-weight, simple and fast XML parser for C++ with XPath support and excellent documentation. We highly recommend it.

We use code from Bjoern Hoehrmann's Flexible and Economical UTF-8 Decoder Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>. "This page presents [a decoder] that is very easy to use correctly, short, small, fast, and free."

Some tests use reference files from W3C XML Encryption Implementation and Interoperability Report > interop samples > merlin-xmlenc-five.tar.gz.

References

XML Path Language (XPath) Version 1.0.
XPath Tutorial by ZVON.org (follow the examples starting from Example 1).
XPath 1.0 Functions by edankert.
Extensible Markup Language (XML) 1.0.

Revision History

2021-12-30: Recompiled the .NET interface library diXmlsqNet.dll using .NET 4.0 (instead of 4.5) for better backwards compatibility.
2021-07-18: Released version 1.0.0 with various minor fixes and a C++ STL interface.
2020-09-04: Released updated version 0.9.0a with fixes in xmlsq CLI for UTF-8 encoded characters in Windows console. See Dealing with non-ASCII characters.
2020-06-02: First release version 0.9.0 of xmlsq.

Contact us

To contact us or comment on this page, please send us a message.

This page first published 2 June 2020. Last updated 30 December 2021.
