Split/Match Question

esmadmin has asked for the wisdom of the Perl Monks concerning the following question:

Re: Split/Match Question by toolic (Bishop) on May 14, 2010 at 21:15 UTC
You should use an HTML parser, such as HTML::TokeParser, rather than a regular expression.
`use strict;
use warnings;
use HTML::TokeParser;
my $document = <<EOF;
<div><label>Emp ID:</label> AASDFG <br><label>Mobile Num:</label> 9999999999 <br><label>location:</label> India <br><label>Inservice:</label>Yes </div>
EOF
my $p = HTML::TokeParser->new( \$document );
while ($p->get_tag('label')) {
    my $text = $p->get_text('br');
    my $t2 = (split /:/, $text)[1];
    print "$t2\n";
}
__END__
AASDFG
9999999999
India
Yes`
Re: Split/Match Question by JavaFan (Canon) on May 14, 2010 at 21:13 UTC
There's a difference between 'want to get the details from this between the HTML tags' and 'The value I need is between "> " and " <"'. The latter is trivial: `/>([^<]*)</g` will do that. But consider:
`a > 3 && b < 5
<img src = "..." alt = "a > 3 && b < 5">
<!--> comments are fun! <!-->`
Three examples of valid HTML. Three examples of cases where you do not want the things between > and <.
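To make the point above concrete, here is a minimal sketch that runs the naive pattern over two inputs. The `$record` string is a shortened stand-in modeled on the markup in the question, and the variable names are mine, not from the thread.

```perl
use strict;
use warnings;

# The naive pattern from the reply: capture whatever sits between '>' and '<'.
my $naive = qr/>([^<]*)</;

# Shortened sample modeled on the record in the question (an assumption).
my $record = '<div><label>Emp ID:</label> AASDFG <br><label>location:</label> India <br></div>';

# Keep only the captures that contain something besides whitespace.
my @from_record = grep { /\S/ } $record =~ /$naive/g;
print "record: [$_]\n" for @from_record;

# One of the valid-HTML counterexamples: '>' and '<' inside an attribute value.
my $tricky = '<img src = "..." alt = "a > 3 && b < 5">';
my @from_tricky = $tricky =~ /$naive/g;
print "tricky: [$_]\n" for @from_tricky;
```

On the record, the pattern hands back the label text along with the values, so the caller still has to sort out which is which. On the `img` tag it returns a fragment of an attribute value, which is not page text at all.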
Re: Split/Match Question by Anonymous Monk on May 14, 2010 at 21:01 UTC
1) use a html parser 2) use a html parser :D
Re^2: Split/Match Question by ikegami (Patriarch) on May 14, 2010 at 23:07 UTC
`use strict;
use warnings;
use XML::LibXML qw( XML_TEXT_NODE );
my $html = '<div><label>Emp ID:</label> AASDFG <br><label>Mobile Num:</label> 9999999999 <br><label>location:</label> India <br><label>Inservice:</label>Yes </div>';
my $doc = XML::LibXML->new->parse_html_string($html);
my %pairs;
for my $label_node ( $doc->findnodes('/html/body/div/label') ) {
    my $label = $label_node->textContent();
    $label =~ s/:\z//;
    $pairs{$label} = '';
    my $node = $label_node;
    # The assignment needs its own parentheses: && binds tighter than =.
    while (($node = $node->nextSibling()) && $node->nodeType() == XML_TEXT_NODE) {
        $pairs{$label} .= $node->getValue();
    }
    # Trim leading and trailing whitespace from the collected value.
    s/^\s+//, s/\s+\z// for $pairs{$label};
}
# Now do whatever with the data
for my $k (keys(%pairs)) {
    printf("%-12s %s\n", $k, $pairs{$k});
}`
`Inservice    Yes
location     India
Emp ID       AASDFG
Mobile Num   9999999999`
Re: Split/Match Question by Marshall (Canon) on May 16, 2010 at 09:59 UTC
The original question was, to my understanding: "can I parse this HTML in a single regex". And the answer is yes! One solution is shown below. The code is a bit tedious, but it is straightforward and can be understood with some methodical thinking. However, there are a lot of pitfalls with this approach, not the least of which is that the layout of these HTML pages can change from one day to the next. Some of the HTML parser modules are more robust in terms of being able to handle something that "didn't quite look like it did before", and there are a zillion ways that can happen. "One-off" things like the code below tend to be very single-purpose rather than general-purpose. So there are some trade-offs that involve things we haven't even begun to discuss here. Anyway, I think you have a number of excellent approaches in this thread, and one of them or a derivative of it will work fine for you.
`#!/usr/bin/perl -w
use strict;
my $doc =<<FORM;
<div><label>Emp ID:</label> AASDFG <br><label>Mobile Num:</label> 9999999999 <br><label>location:</label> India <br><label>Inservice:</label>Yes </div>
FORM
my @pairs = ($doc =~ m~<label>\s*(.*?)\s*</label>\s*(.*?)\s*<~g);
while (@pairs) {
    my ($field, $value) = splice(@pairs,0,2);
    printf "%-15s %s\n", $field, $value;
}
__END__
Emp ID:         AASDFG
Mobile Num:     9999999999
location:       India
Inservice:      Yes`
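For anyone who wants the fields by name rather than printed, here is a rough sketch of the same single-regex idea loading a hash. It assumes the pattern reads `<label>\s*(.*?)\s*</label>\s*(.*?)\s*<` with the `*` quantifiers written out; the `$changed` sample with a `class` attribute is invented to illustrate the layout-change pitfall and is not from the original page.

```perl
use strict;
use warnings;

my $doc = '<div><label>Emp ID:</label> AASDFG <br><label>Mobile Num:</label> 9999999999 <br>'
        . '<label>location:</label> India <br><label>Inservice:</label>Yes </div>';

# Label text, then the text up to the next tag, both trimmed by the \s* guards.
my $pair = qr~<label>\s*(.*?)\s*</label>\s*(.*?)\s*<~;

# In list context the /g match returns field, value, field, value, ...
my %record = $doc =~ /$pair/g;

# Drop the trailing colon from each field name.
%record = map { (my $k = $_) =~ s/:\z//; $k => $record{$_} } keys %record;

printf "%-12s %s\n", $_, $record{$_} for sort keys %record;

# Hypothetical layout change: the labels gain an attribute.
my $changed = '<div><label class="f">Emp ID:</label> AASDFG <br></div>';
my @after_change = $changed =~ /$pair/g;
print "pairs found after layout change: ", scalar(@after_change), "\n";
```

The second half is the pitfall in miniature: one added attribute and the pattern silently finds nothing, with no error to tell you so.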
Re^2: Split/Match Question by afoken (Chancellor) on May 16, 2010 at 20:51 UTC
"can I parse this HTML in a single regex". And the answer is yes! ... with a BIG emphasis on this HTML. HTML is a beast to parse correctly, due to its inheritance from SGML, and due to the error correction / guessing algorithms used in most browsers. Simple regular expressions may work as long as the HTML has a well-known format and does not use too many SGML or encoding tricks. Just yesterday, I stubled over this nice piece of valid(!) HTML, hand-crafted to defeat most simple-minded string parsers and regular expressions: <h1>My-IP-Service</h1> <h1 class="myip"><!--- > A comment about the abuse of their service they want to prevent ... <a href="/netze/tools/whois-abfrage/?rm=whois_formular">nicht ermittel +bar</a> <a href="/netze/tools/whois-abfrage/?rm=whois_formular">127.0.0.1</a> <a href="/netze/tools/whois-abfrage/?rm=whois_formular">198.18.0.15</a +> < --><a href="/netze/tools/whois-abfrage/?rm=whois_formular">9<!-- + >226.180.195.155 < -->2<!-- > 253.159.244.9 < -->.<!-- > 253.239.61.182< -->2<!-- >230.121.254.208 < -->2<!-- > 251.168.157.152 < -->4<!-- > 254.121.189.15< -->.<!-- > 237.24.153.213< -->8<!-- >246.217.119.248 < -->.<!-- > 245.167.107.28 < -->1<!-- >226.204.198.25 < -->1<!-- > 233.167.179.189 < -->7<!-- > 228.193.179.191< --></a></h1> [download] (From: http://www.heise.de/netze/tools/ip) Returning the correct IP address (92.224.8.117 in this case) from this piece of HTML is not impossible, and with enough effort, someone may be able to write a regexp that does the job for this special obfuscation. But with HTML::Parser, it is essentially a no-brainer requiring about 10 lines of code (Sorry, Heise ...). And unless the author finds a way to confuse HTML::Parser without breaking browsers, it will not fail when the author modifies the obfuscation. (Well, changing the H1 tag or its class attribute would break this special implementation.) 
`#!/usr/bin/perl -w use strict; use LWP::Simple; use HTML::Parser; my $ip=''; my $wanted=0; HTML::Parser->new( api_version => 3, start_h => [sub { $wanted=1 if ($_[0] eq 'h1') && ($_[1]->{'class' +} eq 'myip') }, 'tagname,attr'], end_h => [sub { $wanted=0 if $_[0] eq 'h1' }, 'tagname'], text_h => [sub { $ip.=$_[0] if $wanted }, 'dtext'], )->parse(get('http://www.heise.de/netze/tools/ip')); print "$ip\n";` [download] So, DON'T use regular expressions to parse HTML or XML. Except perhaps in very special cases where you control how the HTML/XML is generated. Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)	[reply] [d/l] [select]
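For trying the handler approach without fetching the live page, here is a sketch that feeds HTML::Parser a local string. The `$html` sample is a shortened, made-up imitation of the obfuscation style (decoy addresses inside comments, real digits between them), and the address in it is a documentation-range placeholder. The `// ''` guard and the explicit `eof` call are my additions, not part of the post above.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample: decoy addresses sit inside comments, the real
# characters sit between the comments.
my $html = '<h1>My-IP-Service</h1>'
         . '<h1 class="myip"><!-- > 198.51.100.7 < --><a href="#">1<!-- >203.0.113.9 < -->92.'
         . '<!-- > 198.51.100.8< -->0.2.<!-- >203.0.113.10 < -->1</a></h1>';

my ($ip, $wanted) = ('', 0);

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start collecting at <h1 class="myip">; the // '' guard avoids an
    # "uninitialized" warning on tags that have no class attribute.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        $wanted = 1 if $tag eq 'h1' && ($attr->{class} // '') eq 'myip';
    }, 'tagname,attr' ],
    # Stop collecting at the closing </h1>.
    end_h   => [ sub { $wanted = 0 if $_[0] eq 'h1' }, 'tagname' ],
    # Comments never reach the text handler, so the decoys drop out.
    text_h  => [ sub { $ip .= $_[0] if $wanted }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;    # signal end of document so any buffered text is flushed

print "$ip\n";
```

The handlers never look at the comments at all, which is why reshuffling the decoys should not affect the result.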
Re^3: Split/Match Question by JavaFan (Canon) on May 16, 2010 at 22:18 UTC
"Returning the correct IP address (92.224.8.117 in this case) from this piece of HTML is not impossible, and with enough effort, someone may be able to write a regexp that does the job for this special obfuscation. But with HTML::Parser, it is essentially a no-brainer requiring about 10 lines of code."
Sounds like a challenge.... I wrote this on my first try, and it seems to work:
`s{(?:<!(?:--[^-]*(?:-[^-]+)*--\s*)*>)|(?:</?\w[^"'>]*(?:(?:(?:"[^"]*")|(?:'[^']*'))[^"'>]*)*>)}{}g;
s{&#([0-9]+);}{chr $1}eg;`
Only two lines, and still a no-brainer. ;-) The code above should remove all tags and comments, keep any `<` and `>` characters that aren't part of a tag, and translate any numeric entities. Things it won't do correctly: declared sections, and short tags. But most browsers won't deal with them correctly either. Oh, and the `\w` is a shortcut, and not quite correct.
Re^4: Split/Match Question by afoken (Chancellor) on May 16, 2010 at 22:40 UTC
Re^3: Split/Match Question by Marshall (Canon) on May 16, 2010 at 22:18 UTC
I put "However" underlined for a reason. Simple regexes work, but one must understand the limitations, of which there are many. ;-)