Re: Parsing HTML/XML with Regular Expressions (XML::Twig)

Hello haukex

I normally use XML::Twig on the sad occasions when I need to deal with XML. With small XML data I use __DATA__ and $twig->parse(<DATA>), but with your sample I got "no element found at line 2, column 0, byte 39 at.." even though the W3C validator accepts the file as correct. Using a real file I had no errors. I dunno why, and I rarely inspect XML with my eyes; the doctor said it's no good ;=)

I have not managed to strip the nbsp out of Sunday, but now it's too late to deal with entities and the biiig XML::Twig manpage. See you Sundaynbsp at the Pubnbsp ;=)

use strict;
use warnings;
use XML::Twig;

my @days;
my $twig= XML::Twig->new(
   twig_handlers=>{
                    'div[@class="data"]'=>sub{
                                (my $txt =  $_[1]->text)=~s/\W//g;
                                 push @days, $_[1]->att('id')."=$txt";
                    }
   }
);

$twig->parsefile ('example.html');
print join ', ', @days;

# output
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sundaynbsp
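The stray `nbsp` in `Seven=Sundaynbsp` looks like what remains when `s/\W//g` removes the `&` and `;` of an unexpanded `&nbsp;` entity. A minimal sketch of one way around it, assuming the handler's text really contains the literal characters `&nbsp;` (or a U+00A0 no-break space, if the entity was expanded); `clean_text` is a hypothetical helper, not part of XML::Twig:

```perl
use strict;
use warnings;

# Hypothetical helper: remove no-break spaces first, then strip the
# remaining non-word characters as the handler above does.
sub clean_text {
    my ($txt) = @_;
    $txt =~ s/&nbsp;//g;     # unexpanded entity left as literal text
    $txt =~ s/\x{A0}//g;     # no-break space character, if the entity was expanded
    $txt =~ s/\W//g;         # same clean-up as in the handler
    return $txt;
}

print clean_text('Sunday&nbsp;'), "\n";    # prints "Sunday"
```

Inside the handler this would replace the `s/\W//g` line with something like `my $txt = clean_text($_[1]->text);`.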

PS: I bet tybalt89 will come up with some working regex solution! ;=)

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.


Replies are listed 'Best First'.
Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig) by haukex (Archbishop) on Oct 17, 2017 at 11:25 UTC
Thanks very much for the contribution! Regarding the `DATA` and `&nbsp;` issues, see my reply here - although I assume you meant `$twig->parse(*DATA)` instead of `$twig->parse(<DATA>)`? With the updated example in the root node, your code works! And yes, I assumed someone might take up the challenge of actually using a regex - but of course then I'd have to try to break it with more test cases ;-)
Re^3: Parsing HTML/XML with Regular Expressions (XML::Twig) by Discipulus (Canon) on Oct 17, 2017 at 19:45 UTC
You presumed ~right about the `DATA` filehandle. The xmltwig.org docs specify `parse $string or \*OPEN_FILEHANDLE` among twig's methods. So you are right: I had to pass a handle, not an iterator (?) like `<DATA>`. I dunno when I picked up this bad habit, but if you look at this and this other one and this other too, and probably many others of mine, `$twig->parse(<DATA>)` works!! So `$twig->parse(<DATA>)` does not work with your example, but I can confirm that passing the filehandle as `$twig->parse(\*DATA)` or even `$twig->parse(*DATA)` works as expected. Could it be that the wrong form works (at least sometimes) because of XML::Twig's ability to parse streams of XML?
L*
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by haukex (Archbishop) on Oct 18, 2017 at 19:21 UTC
In those three examples you linked to, right before you say `<DATA>` you do `$/='';`, which enables "paragraph mode": it's as if the input record separator `$/` was `/\n\n+/`.
> So you are right: I had to pass a handle, not an iterator (?) like `<DATA>`
`<DATA>` is the equivalent of `readline(DATA)`, and since `readline` is being called in list context, it'll read all the records from the handle and return a list of them. So as long as your `__DATA__` section doesn't contain any empty lines, it's essentially the same as a slurp - this is probably why the "wrong form" still works.
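The effect of `$/` on a list-context read can be seen without XML::Twig at all. A minimal sketch, using an in-memory filehandle in place of `DATA` and a stand-in sub in place of `$twig->parse`; the names `records` and `first_arg_of` and the sample document are made up for illustration:

```perl
use strict;
use warnings;

# Stand-in for $twig->parse(...): like any Perl sub it receives a flat
# argument list, and here only the first argument counts as the source.
sub first_arg_of { return $_[0] }

my $xml = "<root>\n<div>Monday</div>\n</root>\n";

# Read every record from an in-memory handle in list context,
# using the given input record separator.
sub records {
    my ($sep) = @_;
    local $/ = $sep;
    open my $fh, '<', \$xml or die "Cannot open in-memory handle: $!";
    my @recs = <$fh>;
    close $fh;
    return @recs;
}

my @lines      = records("\n");   # default: one record per line
my @paragraphs = records('');     # paragraph mode: split on blank lines only

printf "line mode:      %d record(s), first is %s\n",
    scalar @lines,
    first_arg_of(@lines) eq $xml ? 'the whole document' : 'only a fragment';
printf "paragraph mode: %d record(s), first is %s\n",
    scalar @paragraphs,
    first_arg_of(@paragraphs) eq $xml ? 'the whole document' : 'only a fragment';
```

In line mode the sample yields three records and the stand-in sees only the first line; in paragraph mode it yields a single record holding the whole document, which matches the "essentially a slurp" explanation above.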
Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig) by holli (Abbot) on Oct 17, 2017 at 09:58 UTC
> some working regex solution
That's certainly possible. It was possible to produce a regex that parses all of Perl, why not one for HTML?
holli
You can lead your users to water, but alas, you cannot drown them.
Re^3: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 00:05 UTC
> It was possible to produce a regex that parses all of Perl, why not one for HTML?
There is a regex to parse XML (and therefore XHTML): XML Shallow Parsing. That regex produces a list of strings that will need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. You will need to keep track of `<div>` nesting to find the end of the contained text.

# Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
# (Warning: Messy hack. Read at your own risk.)
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
for (@elements) {
    if (/^<div/) {
        $nest++ if ($nest > 0); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
        next;
    }
    $nest--, next if (/^<\/div/);
    next if (/^[<]/); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
say "$out\n";

Update: Changed title to indicate (regex)
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by haukex (Archbishop) on Oct 19, 2017 at 16:27 UTC
Interesting post, thank you! I tested it and except that I had to strip non-word characters out of the values, it mostly works - it doesn't pick up the `id` of the ~~`Sunday`~~ `Saturday` entry, and it also picks up the values "`bbbdddeeeggg`", but overall it's a very interesting start. Regexes are a fine tool for lexing, and by adding some logic around them keeping track of the nested tags etc., it's basically like building a simple parser.
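The "regex as lexer plus a bit of nesting logic" idea can be sketched in a self-contained way. This is only an illustration under loud assumptions: `$TOKEN` is a deliberately simplified tokenizer (not the REX shallow-parsing expression) that ignores comments, CDATA and `>` inside attribute values, and `extract_data_divs` and the sample input are hypothetical:

```perl
use strict;
use warnings;

# Simplified tokenizer (an assumption, NOT the REX expression):
# a token is either one tag or one run of text.
my $TOKEN = qr/<[^>]*>|[^<]+/;

# Hypothetical helper: returns "id=text" strings for <div class="data">.
sub extract_data_divs {
    my ($html) = @_;
    my ($depth, @out) = (0);
    for my $tok ($html =~ /$TOKEN/g) {
        if ($tok =~ /^<div\b/i) {
            if ($depth > 0) {                  # nested <div> inside a data div
                $depth++;
            }
            elsif ($tok =~ /\bclass\s*=\s*["']data["']/
                && $tok =~ /\bid\s*=\s*["'](\w+)["']/) {
                push @out, "$1=";              # start collecting for this id
                $depth = 1;
            }
        }
        elsif ($tok =~ m{^</div\b}i) {
            $depth-- if $depth > 0;            # never count below zero
        }
        elsif ($tok !~ /^</ && $depth > 0) {
            (my $txt = $tok) =~ s/\W//g;       # keep word characters only
            $out[-1] .= $txt;
        }
    }
    return @out;
}

my $sample = '<div><div class="data" id="One"> Monday </div>'
           . '<div class="data" id="Two"><b>Tues</b><div>day</div></div>x</div>';
print join(', ', extract_data_divs($sample)), "\n";   # prints "One=Monday, Two=Tuesday"
```

The depth counter is only advanced while inside a matching `<div>`, and it is never decremented below zero, so closing tags of uninteresting `<div>` elements cannot throw the count off.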
Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 23:50 UTC
Re^6: Parsing HTML/XML with Regular Expressions (regex) by haukex (Archbishop) on Oct 20, 2017 at 09:03 UTC
Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 22:13 UTC
Re^3: Parsing HTML/XML with Regular Expressions (XML::Twig) by soonix (Chancellor) on Oct 17, 2017 at 11:45 UTC
I am not sure whether such a regex would fit even into the 18 exabyte limit of most modern file systems … :-)
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by holli (Abbot) on Oct 17, 2017 at 17:27 UTC
Perl is a bit more complex to parse than HTML, don't you think?
holli
Re^5: Parsing HTML/XML with Regular Expressions (XML::Twig) by soonix (Chancellor) on Oct 18, 2017 at 06:22 UTC