PerlMonks  

Re: Parsing HTML/XML with Regular Expressions (XML::Twig)

by Discipulus (Canon)
on Oct 16, 2017 at 22:00 UTC ( [id://1201473]=note )


in reply to Parsing HTML/XML with Regular Expressions

Hello haukex

I normally use XML::Twig on the sad occasions when I need to deal with XML. With small XML data I use __DATA__ and $twig->parse(<DATA>), but with your sample I got "no element found at line 2, column 0, byte 39 at.." even though the W3C validator accepts the file as correct. Using a real file I had no errors. I dunno why, and I rarely inspect XML with my eyes; the doctor said it's no good ;=)

I have not managed to strip out the nbsp from Sunday, but now it's too late to deal with entities and the biiig XML::Twig manpage. See you Sundaynbsp at the Pubnbsp ;=)

use strict;
use warnings;
use XML::Twig;

my @days;
my $twig = XML::Twig->new(
    twig_handlers => {
        'div[@class="data"]' => sub {
            (my $txt = $_[1]->text) =~ s/\W//g;
            push @days, $_[1]->att('id') . "=$txt";
        },
    },
);
$twig->parsefile('example.html');
print join ', ', @days;

# output
# Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sundaynbsp

PS I bet tybalt89 will come up with some working regex solution! ;=)

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Replies are listed 'Best First'.
Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig)
by haukex (Archbishop) on Oct 17, 2017 at 11:25 UTC

    Thanks very much for the contribution! Regarding the DATA and &nbsp; issues, see my reply here - although I assume you meant $twig->parse(*DATA) instead of $twig->parse(<DATA>)? With the updated example in the root node, your code works!

    And yes, I assumed someone might take up the challenge of actually using a regex - but of course then I'd have to try to break it with more test cases ;-)

      You presumed right about the DATA filehandle.

      Both xmltwig.org and the docs list parse ($string or \*OPEN_FILEHANDLE) among the twig's methods.

      So you are right: I had to pass a handle, not an iterator (?) like <DATA>

      I dunno when I picked up this bad habit, but if you look at this, and this other one, and this other too, and probably many others of mine, $twig->parse(<DATA>) works!!

      So $twig->parse(<DATA>) does not work with your example, but I can confirm that passing the filehandle as $twig->parse(\*DATA) or even $twig->parse(*DATA) works as expected.

      Could it be that the wrong form works (at least sometimes) because of XML::Twig's ability to parse streams of XML?
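A minimal sketch of what the callee actually receives in each of the three call forms. The sub describe_args is a hypothetical stand-in for a parse-like method (it does no parsing and is not part of XML::Twig), and the three-line document and the in-memory handles are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parse-like method: it does no parsing,
# it only reports what arrived in @_.
sub describe_args {
    my @args  = @_;
    my $first = $args[0];
    my $kind  = ref($first) eq 'GLOB'  ? 'reference to a glob'
              : ref(\$first) eq 'GLOB' ? 'bare glob'
              :                          'plain string';
    return sprintf '%d argument(s), first is a %s', scalar @args, $kind;
}

my $xml = "<root>\n<a/>\n</root>\n";        # made-up three-line document

open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";
my $from_readline = describe_args(<$fh>);   # list context: one argument per line
close $fh;

open FH, '<', \$xml or die "cannot open in-memory handle: $!";  # bareword handle, like DATA
my $from_ref  = describe_args(\*FH);        # one argument: a glob reference
my $from_glob = describe_args(*FH);         # one argument: the glob itself
close FH;

print "$_\n" for $from_readline, $from_ref, $from_glob;
```

With <$fh> the document arrives split into several string arguments, while both glob forms arrive as a single argument that the callee can read from.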

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        In those three examples you linked to, right before you say <DATA> you do $/='';, which enables "paragraph mode": it's as if the input record separator $/ were /\n\n+/.

        So you are right: I had to pass a handle, not an iterator (?) like <DATA>

        <DATA> is the equivalent of readline(DATA), and since readline is being called in list context, it'll read all the records from the handle and return a list of them. So as long as your __DATA__ section doesn't contain any empty lines, it's essentially the same as a slurp - this is probably why the "wrong form" still works.
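The difference between the two record separators can be sketched with core Perl only. The sample document and the helper read_records are assumptions made for illustration; an in-memory filehandle stands in for DATA.

```perl
use strict;
use warnings;

# Sample document (an assumption for illustration); note: no blank lines.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<root>
<div class="data" id="One">Monday</div>
</root>
XML

# Read every record from an in-memory filehandle in list context.
sub read_records {
    my ($text, $separator) = @_;
    local $/ = $separator;      # input record separator, for this call only
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @records = <$fh>;        # list context: one element per record
    close $fh;
    return @records;
}

my @lines      = read_records($xml, "\n");  # default separator: one record per line
my @paragraphs = read_records($xml, '');    # paragraph mode

printf "line mode: %d records\n", scalar @lines;
printf "paragraph mode: %d record(s)\n", scalar @paragraphs;
```

In line mode the first list element is only the XML declaration, so a callee that treats its first argument as the whole document sees a truncated one. In paragraph mode, with no blank lines in the data, the first (and only) element is the entire document. If the sample began with a declaration like the one assumed here, its first line would be 39 bytes long, which may be where the "byte 39" in the error message comes from.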

Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig)
by holli (Abbot) on Oct 17, 2017 at 09:58 UTC
    some working regex solution
    That's certainly possible. It was possible to produce a regex that parses all of Perl, why not one for HTML?


    holli

    You can lead your users to water, but alas, you cannot drown them.
      It was possible to produce a regex that parses all of Perl, why not one for HTML?

      There is a regex to parse XML (so, therefore, XHTML): XML Shallow Parsing

      That regex produces a list of strings that will need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. You will need to keep track of <div> nesting to find the end of the contained text.

      # Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
      # (Warning: Messy hack. Read at your own risk.)
      my $nest = 0;
      my $out = '';
      my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
      for (@elements) {
          if (/^<div/) {
              $nest++ if ($nest > 0); # only increment if inside an interesting <div>
              next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
              next unless (/id\h*=\h*['"](\w+)['"]/);
              $out .= ", $1=";
              $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
              next;
          }
          $nest--, next if (/^<\/div/);
          next if (/^[<]/); # skip other mark-up
          $out .= $_ if ($nest > 0);
      }
      $out =~ s/^, //;
      say "$out\n";

      Update: Changed title to indicate (regex)

        Interesting post, thank you! I tested it and, except that I had to strip non-word characters out of the values, it mostly works: it doesn't pick up the id of the Saturday entry, and it also picks up the values "bbbdddeeeggg", but overall it's a very interesting start. Regexes are a fine tool for lexing, and by adding some logic around them keeping track of the nested tags etc., it's basically like building a simple parser.

      I am not sure whether such a regex would fit even into the 18-exabyte limit of most modern file systems …
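The "regex as lexer plus a little state" idea can be sketched in a self-contained way. Everything here is an assumption for illustration: $TOKEN is a deliberately simplified tokenizer and not the REX $XML_SPE pattern, extract_data_divs is a made-up name, and the sample input is invented. It does not handle comments, CDATA, self-closing <div/> tags, or attribute values containing '>'.

```perl
use strict;
use warnings;

# Simplified tokenizer (an assumption, NOT the REX $XML_SPE pattern):
# a token is either one tag or one run of text.
my $TOKEN = qr/<[^>]*>|[^<]+/;

sub extract_data_divs {
    my ($xml) = @_;
    my @pairs;                  # [id, collected text] per interesting div
    my $depth = 0;              # 0 = outside any class="data" div

    for my $tok ($xml =~ /$TOKEN/g) {
        if ($tok =~ /^<div\b/) {
            if ($depth > 0) { $depth++; next }      # nested div inside a data div
            next unless $tok =~ /\bclass\s*=\s*(["'])data\1/;
            next unless $tok =~ /\bid\s*=\s*(["'])(\w+)\1/;
            push @pairs, [$2, ''];
            $depth = 1;                             # entered an interesting div
        }
        elsif ($tok =~ m{^</div\b}) {
            $depth-- if $depth > 0;                 # never go below zero
        }
        elsif ($tok !~ /^</ and $depth > 0) {
            $pairs[-1][1] .= $tok;                  # text inside a data div
        }
        # any other markup is skipped
    }

    # Strip non-word characters from the values, as in the thread's examples.
    return join ', ', map { (my $t = $_->[1]) =~ s/\W//g; "$_->[0]=$t" } @pairs;
}

my $sample = '<div class="data" id="One">Mon<b>d</b>ay</div>'
           . '<div class="other" id="X">skip</div>'
           . '<div class="data" id="Two"><div>Tues</div>day </div>';

print extract_data_divs($sample), "\n";   # prints: One=Monday, Two=Tuesday
```

Collecting [id, text] pairs rather than appending to one output string keeps each id attached to its own text, and guarding the decrement stops a closing </div> outside any data div from pushing the depth below zero.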
      :-)
        Perl is a bit more complex to parse than HTML, don't you think?


        holli

        You can lead your users to water, but alas, you cannot drown them.
