XPathle

How to play

Select one of the examples. The corresponding XML file will be rendered. Each XML file is associated with a secret XPath expression that selects some nodes (elements, text nodes, comments, processing instructions, and/or attributes). You need to enter an XPath expression such as /* in order to make a first guess.

For selecting namespaced elements with a prefix, such as xsl:for-each in the XSLT example, you can enter //xsl:for-each, //Q{http://www.w3.org/1999/XSL/Transform}for-each, or //*:for-each. Elements without a prefix can be selected without namespace if they are in the same namespace as the top-level element. If a prefix is declared on any element, you can also use this prefix in the expression. An example is the Atom/CAP feed in the 2022-06-24 daily challenge where you can use cap:*. The prefix xs with a binding for http://www.w3.org/2001/XMLSchema is always available, no matter whether declared in the document or not.

You cannot select //namespace-node() (well, you can, but they won’t be displayed). You cannot select processing instructions or comments that are outside the top-level element, either.

As a result of evaluating your guess, the number of selected items will be displayed so you can see whether at least the item count matches. But you are not finished until you have selected exactly the same items as the secret expression, or until you run out of attempts.

To show how well the items selected by your guess match, they will be highlighted in the rendered XML (you might need to click into the rendering area and scroll down or sideways). The match quality is signalled by color code and by tool tip. The tool tip contains both the hierarchical XPath to the item and the distance to the closest target item.

The number of items highlighted is limited to min((max((2 * count($secret-items), 4)), 20)) (that is, normally double the secret count, with a minimum of 4 items and a maximum of 20). Otherwise you’d be able to select every node and quickly come up with an XPath expression that selects only the green ones. Only when your guess selects the same items as the secret expression is the number of highlighted items no longer limited.
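As a small worked example of this limit, the following XPath 3.1 sketch evaluates the formula for a few secret counts. The counts are made up for illustration, and $secret-count is a hypothetical stand-in for count($secret-items).

```xpath
(: Highlight limit for some hypothetical secret counts :)
for $secret-count in (1, 2, 5, 10, 33)
return $secret-count || ' secret item(s): at most '
       || min((max((2 * $secret-count, 4)), 20)) || ' highlighted'
```

For 1 or 2 secret items the limit is 4, for 5 it is 10, and for 10 or 33 it is 20.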

If you check the “Format & indent” box, the input document will be serialized with indent=true and parsed again. This might lead to a code rendition that is easier to grasp. You can try this feature on the Saxonica Blog Atom feed.

Be aware that checking/unchecking the format & indent box after changing the XPath guess will increase your guess count. If you don’t change the guess, the guess count won’t change when you check/uncheck the box (or when you hit Submit).

Performance

The distance calculation scales with the product of the number of secretly selected items and the number of items your guess selects. Therefore it can take a couple of seconds if these counts are around 30 or higher, depending on your computer. All calculations are carried out by XSLT 3.0 in the browser (see the links at the bottom of the page).

Distance metric

The distance “metric” (which is not a metric, see below) works as follows for a given pair of items:

Find the closest common ancestor element of both items.
If one item is an ancestor of the other, the element distance is the number of steps you need to climb up on the ancestor axis.
If an item is an attribute, text node, comment, or PI, calculate the element distance for its parent first and add 1 to the calculated distance later.
The distance is the number of steps up to the common ancestor plus the number of steps down to the other element.
When the nodes below the common ancestor are siblings, count the number of siblings that you need to visit until you reach the other node, skipping intermediate whitespace-only text nodes.
Then don’t take the two steps to the common ancestor and down to the other node, but take the sibling path instead.
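The steps above can be sketched in XPath 3.1 for two elements. This is only an approximation under stated assumptions, not the code the game runs: $a and $b are hypothetical sample items, both are assumed to be elements in the same document (the extra step for attributes, text nodes, comments, and PIs is left out), and the sibling steps are counted over all sibling nodes except whitespace-only text nodes.

```xpath
let $a := (//li)[1],        (: hypothetical first item :)
    $b := (//li)[last()],   (: hypothetical second item :)
    (: deepest element that lies on both ancestor-or-self axes :)
    $common := ($a/ancestor-or-self::* intersect $b/ancestor-or-self::*)[last()],
    (: steps up from $a to $common and down from $common to $b :)
    $up := count($a/ancestor-or-self::*) - count($common/ancestor-or-self::*),
    $down := count($b/ancestor-or-self::*) - count($common/ancestor-or-self::*),
    (: the children of $common that lead to $a and $b, in document order;
       fewer than two if one item is an ancestor of the other :)
    $branches := ($a | $b)/ancestor-or-self::*[.. is $common],
    (: sibling steps from the first branch to the second,
       skipping whitespace-only text nodes :)
    $sideways := count(
      $branches[1]/following-sibling::node()
        [. << $branches[2] or . is $branches[2]]
        [not(self::text()[not(normalize-space())])]
    )
return
  if (count($branches) lt 2)
  then $up + $down                  (: plain climb along the ancestor axis :)
  else $up + $down - 2 + $sideways  (: replace "up and down" by the sibling path :)
```

In a document whose only li elements are five siblings in one list, this sketch yields 4 for the first and the last item: no step up, four steps sideways, no step down.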

The following diagram shows several paths between nodes and indicates the corresponding distances using the same color.

The last requirement makes this a non-metric because it violates the triangle inequality. Suppose you have a list with 5 items. If it weren’t for the last requirement, the distance between the first and the last item would be 2: Go up one step to the list element, go down one step to the fifth item. Applying the last requirement, you won’t go up because the first item is already the penultimate element on its ancestor-or-self axis. Instead, you go sideways to the fifth item, and the distance is then 4. The distance from each list item to the list element is 1, so the detour via the list element adds up to 2. This violates the triangle inequality, which says that the direct distance between two points is at most the sum of their distances to any third point: here the direct distance of 4 exceeds 1 + 1 = 2.

This non-metric was chosen because it seemed counter-intuitive that, for example, the first and the last paragraph of a novel have a distance of 4: Go up to the common ancestor, which might be body, passing a chapter element, then go down through another chapter to the last paragraph of the novel. On the other hand, using the number of intermediate nodes as a metric would make guessing a secretly selected item much easier: You take your first guess and go that many nodes back or forth, like in let $guess := //p[10] return (//node()[$guess << .])[24] where your initial guess was //p[10], the reported distance was 24 and you move 24 nodes forward in order to select an item of the secret set.

The “metric” applied here seeks to strike a balance between being intuitive and being ambiguous enough to avoid easy wins.

Tips

Query the document

Querying the document without selecting any nodes yields a warning, but doesn’t count as an attempt.

Try this expression in order to get a list of element names and their frequencies, name and frequency separated by ~, sorted by frequency:
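One possible way to write such an expression in XPath 3.1 is sketched here. It is an assumption, not necessarily the expression that produced the sample output: it groups elements by local name, orders the names by descending count with fn:sort, and first binds all elements to the hypothetical variable $elements because an inline function has no context item. Since the result is a string rather than a node, it triggers the warning instead of counting as an attempt.

```xpath
let $elements := //*   (: every element in the document :)
return string-join(
  for $name in sort(
      distinct-values($elements ! local-name()),
      (),
      (: negative count as sort key, so the most frequent names come first :)
      function($n) { -count($elements[local-name() = $n]) }
    )
  return $name || '~' || count($elements[local-name() = $name]),
  ' '
)
```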

might yield (when applied to this HTML document):

Warning (code XPathle03): path~31 text~29 tspan~29 g~25 p~19 rect~17 code~15 li~10 span~7 a~7 div~6 h3~5 ellipse~4 ul~3 script~2 html~1 head~1 title~1 meta~1 link~1 body~1 h1~1 details~1 summary~1 svg~1 defs~1 input~1 is not a node from the document.

Scatter your first shots evenly

As mentioned above, at most twice as many items as were secretly selected will be highlighted (at least 4, at most 20). Suppose that, as in the “HTML landing page for the transpect documentation” case, 33 items have been secretly selected. Then you can get hints for at most 20 guessed items. You can select 20 evenly spaced (in terms of position()) elements below the document node:
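A sketch of such a scatter expression, assuming XPath 3.1. The variable names $start, $elements, $count, and $dist are chosen for illustration, and the max(...) guard against a zero divisor in small documents is an addition of this sketch. Because idiv rounds down, the expression can select more than 20 elements when $count is not a multiple of 20.

```xpath
let $start := /,                        (: node below which to scatter :)
    $elements := $start//*,             (: all elements below $start, in document order :)
    $count := count($elements),
    $dist := max(($count idiv 20, 1))   (: spacing between picks, at least 1 :)
(: pick the element in the middle of each group of $dist elements :)
return $elements[position() mod $dist = $dist idiv 2]
```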

In that expression, replace 20 with twice the number of secret items if there are fewer than 10 of them.

Once you identify a specific element as the common ancestor of the candidates, you can set $start to that element. In the landing page example, this common ancestor might be //body but this wouldn’t be a significant improvement over /. In other documents, the situation may be different.

The scatter method doesn’t work well in large documents with few secret items. If there’s just one secret item, you can pick an element about a third of the way through the document and another at about two thirds. Even so, this might be as good a first guess as anything else.

Draw distance samples among siblings

If the document contains many siblings on the same level, for example xs:element, xs:simpleType, and xs:complexType in the XSD that is the 2022-06-23 daily challenge, you can select as many of them as can be highlighted and see whether some of them are closer to target items than others.

A modified scatter expression might look like this:
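The following is a sketch under assumptions, written in XPath 3.1: it samples only the children of the top-level element, uses illustrative variable names, and adds a max(...) guard so that $dist never becomes 0. Because idiv rounds down, it can select more than 4 elements when $count is not a multiple of 4.

```xpath
let $start := /*,                      (: the top-level element, e.g. xs:schema :)
    $elements := $start/*,             (: its child elements, i.e. the siblings to sample :)
    $count := count($elements),
    $dist := max(($count idiv 4, 1))   (: spacing for 4 picks, at least 1 :)
(: pick the last element of each group of $dist siblings :)
return $elements[position() mod $dist = 0]
```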

In this expression, we want to skip initial xs:annotation and xs:import elements. Therefore we don’t use mod $dist = $dist idiv 2, which would select an item in the middle of each evenly spaced group. Specifying mod $dist = 0 instead selects the last item of each group, so the elements before position $dist are skipped.

Also note that $dist := $count idiv 4 although only one item has been selected by the secret expression. We didn’t use two times this count, which would amount to just two items, because the minimum number of highlighted items is set to 4, see above.

Once you have a candidate item that has the shortest distance of all equally scattered candidates, you can check whether one of its preceding or following siblings is even closer, like so:
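A sketch of such a neighborhood check, assuming XPath 3.1. The candidate /*/*[12] is purely hypothetical and stands for whichever sample turned out closest. Two siblings on either side give at most four items, which fits the minimum highlight limit of 4.

```xpath
let $candidate := /*/*[12]   (: hypothetical: the closest sample found so far :)
return $candidate/(
  preceding-sibling::*[position() le 2] |   (: the two nearest siblings before it :)
  following-sibling::*[position() le 2]     (: the two nearest siblings after it :)
)
```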

Then you can either reassign $start to that candidate and apply the same procedure to its children or grandchildren, or you can enter a new expression that addresses the candidate more directly, such as:

Then you are only 2 items away from the solution (in that example).

Known Issues

Multiline attribute values, as they may appear in XPath expressions in XSLT @select attributes, might not be reproduced correctly. This is because XML parsers are required to turn newlines in attributes into plain spaces, which is a pity. As a heuristic, if there are at least 5 spaces in a row in an attribute value, it is assumed that the first space used to be a newline prior to parsing, and it will be converted to a newline in the rendered XML. Of course this heuristic may fail if indentation was done using tabs, if there were spaces before the newlines, or if there was no newline at all.

To Do

Obfuscate the configuration and the current attempt number better. Or leave it as weakly secure as it is.
Let users resume the game by storing session data in the browser.