http://qs1969.pair.com?node_id=213203


in reply to XML::Twig::Handlers - promoting laziness through magic

You are indeed a very lazy programmer Mr PodMaster... which can be construed as a compliment around here I guess ;--)

This is a neat trick though, more or less the equivalent of the subs style for XML::Parser.

Note though that using this subclass limits you to handlers on element names, while there are _many_ other options. From the docs:

twig_handlers
This argument replaces the corresponding XML::Parser argument. It consists of a hash { expression => \&handler } where expression is a generic_attribute_condition, a string_condition, an attribute_condition, a full_path, a partial_path, a gi, _default_ or _all_.

The idea is to support a useful but efficient (thus limited) subset of XPATH. A fuller expression set will be supported in the future, as users ask for more and as I manage to implement it efficiently. This will never encompass all of XPATH due to the streaming nature of parsing (no lookahead after the element end tag).

A generic_attribute_condition is a condition on an attribute, in the form *[@att='val'] or *[@att]. Single quotes can be used instead of double quotes, and the leading '*' is actually optional. No matter what the gi of the element is, the handler will be triggered either if the attribute has the specified value or if it just exists.

A string_condition is a condition on the content of an element, in the form gi[string()='foo']. Single quotes can be used instead of double quotes. At the moment you cannot escape the quotes (this will be added as soon as I dig out my copy of Mastering Regular Expressions from its storage box). The text returned is, as per what I (and Matt Sergeant!) understood from the XPATH spec, the concatenation of all the text in the element, excluding all markup. Thus to call a handler on the element <p>text <b>bold</b></p> the appropriate condition is p[string()='text bold']. Note that this is not exactly conformant to the XPATH spec; it just tries to mimic it while still being quite concise.

An extension of that notation is gi[string(child_gi)='foo'] where the handler will be called if a child of a gi element has a text value of foo. At the moment only direct children of the gi element are checked. If you need to test on descendants of the element let me know. The fix is trivial but would slow down the checks, so I'd like to keep it the way it is.

A regexp_condition is a condition on the content of an element, in the form gi[string()=~ /foo/]. This is the same as a string_condition except that the text of the element is matched against the regexp. The i, m, s and o modifiers can be used on the regexp.

The gi[string(child_gi)=~ /foo/] extension is also supported.
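As a rough illustration of the string_condition described above, here is a minimal sketch. The sample document and the @matched array are made up for the example, and I'm assuming the installed XML::Twig accepts this trigger syntax as quoted in the docs. The handler returns a true value explicitly, on the assumption that a false return could suppress other handlers.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, not from the original post
my $xml = '<doc><p>text <b>bold</b></p><p>other</p></doc>';

my @matched;
my $twig = XML::Twig->new(
    twig_handlers => {
        # string() is the concatenated text of the element, markup excluded
        'p[string()="text bold"]' => sub {
            my ($t, $elt) = @_;
            push @matched, $elt->text;
            return 1;    # true, so any other applicable handler still runs
        },
    },
);
$twig->parse($xml);

print "matched: $_\n" for @matched;
```

Only the first p should trigger the handler, since the second one has different text.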

An attribute_condition is a simple condition on an attribute of the current element, in the form gi[@att='val'] (single quotes can be used instead of double quotes; you cannot escape quotes here either). If several attribute_conditions are true for the same element, all the handlers can be called in turn (in the order in which they were first defined). If the ='val' part is omitted (the condition is then gi[@att]) then the handler is triggered if the attribute actually exists for the element, no matter what its value is.
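To make the two attribute forms concrete, here is a small sketch with one handler on a specific attribute value and one generic handler on the mere presence of an attribute. The document, the attribute names and the two arrays are invented for the example, and the two conditions deliberately match different elements so the result does not depend on the calling order.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, not from the original post
my $xml = '<doc><p class="intro">Hi</p><p class="body">Yo</p>'
        . '<note id="n1">N</note></doc>';

my (@by_value, @by_presence);
my $twig = XML::Twig->new(
    twig_handlers => {
        # attribute_condition: p elements whose class is exactly "intro"
        'p[@class="intro"]' => sub {
            my ($t, $elt) = @_;
            push @by_value, $elt->text;
            return 1;
        },
        # generic_attribute_condition: any element that has an id attribute
        '*[@id]' => sub {
            my ($t, $elt) = @_;
            push @by_presence, $elt->gi;
            return 1;
        },
    },
);
$twig->parse($xml);

print "by value: @by_value\n";
print "by presence: @by_presence\n";
```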

A full_path looks like '/doc/section/chapter/title', it starts with a / then gives all the gi's to the element. The handler will be called if the path to the current element (in the input document) is exactly as defined by the full_path.

A partial_path is like a full_path except it does not start with a /: 'chapter/title' for example. The handler will be called if the path to the element (in the input document) ends as defined in the partial_path.
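The difference between the two path forms is easier to see side by side. In this sketch (document and variable names are made up), the full_path only matches the title directly under /doc/section, while the partial_path matches any title whose parent is a chapter.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, not from the original post
my $xml = '<doc><section><title>Section title</title>'
        . '<chapter><title>Chapter title</title></chapter>'
        . '</section></doc>';

my (@full, @partial);
my $twig = XML::Twig->new(
    twig_handlers => {
        # full_path: the whole path from the root must match
        '/doc/section/title' => sub {
            my ($t, $elt) = @_;
            push @full, $elt->text;
            return 1;
        },
        # partial_path: only the end of the path must match
        'chapter/title' => sub {
            my ($t, $elt) = @_;
            push @partial, $elt->text;
            return 1;
        },
    },
);
$twig->parse($xml);

print "full_path matched: @full\n";
print "partial_path matched: @partial\n";
```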

WARNING: (hopefully temporary) at the moment string_condition, regexp_condition and attribute_condition are only supported on a simple gi, not on a path.

A gi (generic identifier) is just a tag name.

#CDATA can be used to call a handler for a CDATA section.

A special gi _all_ is used to call a function for each element. The special gi _default_ is used to call a handler for each element that does NOT have a specific handler.
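A quick way to see how _all_ and _default_ differ is to record which elements each one fires on. This is only a sketch: the document and the three arrays are hypothetical, and every handler returns a true value on the assumption that a false return could stop other handlers from being called for the same element.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, not from the original post
my $xml = '<doc><a>1</a><b>2</b></doc>';

my (@specific, @default, @all);
my $twig = XML::Twig->new(
    twig_handlers => {
        # specific handler on the gi "a"
        a         => sub { push @specific, $_[1]->gi; return 1; },
        # elements with no specific handler of their own
        _default_ => sub { push @default,  $_[1]->gi; return 1; },
        # every element
        _all_     => sub { push @all,      $_[1]->gi; return 1; },
    },
);
$twig->parse($xml);

print "specific: @specific\n";
print "default:  @default\n";
print "all:      @all\n";
```

With this input I'd expect a to show up under the specific handler, b and doc under _default_, and all three under _all_.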

The order of precedence to trigger a handler is: generic_attribute_condition, string_condition, regexp_condition, attribute_condition, full_path, longer partial_path, shorter partial_path, gi, _default_ .