in reply to Split/Match Question

The original question, as I understand it, was: "can I parse this HTML in a single regex?" And the answer is yes! One solution is shown below. The code is a bit tedious, but it is straightforward and can be understood with some methodical thinking.

However, there are a lot of pitfalls with this approach, not the least of which is that the layout of these HTML pages can change from one day to the next. Some of the HTML parser modules are more robust at handling something that "didn't quite look like it did before", and there are a zillion ways that can happen. "One-off" solutions like the one below tend to be single-purpose rather than general-purpose. So there are trade-offs involving things that we haven't even begun to discuss here.

Anyway, I think you have a number of excellent approaches in this thread, and one of them or a derivative of it will work fine for you.

#!/usr/bin/perl -w
use strict;

my $doc =<<FORM;
<div><label>Emp ID:</label> AASDFG
<br><label>Mobile Num:</label> 9999999999
<br><label>location:</label> India
<br><label>Inservice:</label>Yes
</div>
FORM

my @pairs = ($doc =~ m~<label>\s*(.*?)\s*</label>\s*(.*?)\s*<~g);

while (@pairs)
{
   my ($field, $value) = splice(@pairs,0,2);
   printf "%-15s %s\n", $field, $value;
}
__END__
Emp ID:         AASDFG
Mobile Num:     9999999999
location:       India
Inservice:      Yes
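
To make the "layout can change" pitfall concrete, here is a minimal sketch that runs the same pattern against two inputs. The second layout is hypothetical (an attribute on the label, the value wrapped in a bold tag); it is not taken from the original poster's pages, and the sub name extract_pairs is just an illustrative helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as in the post above: label text, then the value up to the next '<'.
sub extract_pairs {
    my ($html) = @_;
    my @pairs = ($html =~ m~<label>\s*(.*?)\s*</label>\s*(.*?)\s*<~g);
    return @pairs;
}

# The layout the pattern was written for.
my $old = '<div><label>Emp ID:</label> AASDFG <br>'
        . '<label>location:</label> India <br></div>';

# Hypothetical redesign (an assumption for illustration only).
my $new = '<div><label for="id">Emp ID:</label> <b>AASDFG</b> <br></div>';

my @old_pairs = extract_pairs($old);
my @new_pairs = extract_pairs($new);

printf "old layout: %d pairs\n", @old_pairs / 2;
printf "new layout: %d pairs\n", @new_pairs / 2;
```

The literal `<label>` in the pattern no longer matches once the tag carries an attribute, so the second call silently returns nothing. That is the kind of failure a real HTML parser is more likely to survive.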

Re^2: Split/Match Question
by afoken (Chancellor) on May 16, 2010 at 20:51 UTC
    "can I parse this HTML in a single regex". And the answer is yes!

    ... with a BIG emphasis on this HTML. HTML is a beast to parse correctly, due to its inheritance from SGML, and due to the error correction / guessing algorithms used in most browsers. Simple regular expressions may work as long as the HTML has a well-known format and does not use too many SGML or encoding tricks.

    Just yesterday, I stumbled over this nice piece of valid(!) HTML, hand-crafted to defeat most simple-minded string parsers and regular expressions:

    <h1>My-IP-Service</h1>
    <h1 class="myip"><!--- >
    A comment about the abuse of their service they want to prevent ...
    <a href="/netze/tools/whois-abfrage/?rm=whois_formular">nicht ermittelbar</a>
    <a href="/netze/tools/whois-abfrage/?rm=whois_formular">127.0.0.1</a>
    <a href="/netze/tools/whois-abfrage/?rm=whois_formular">198.18.0.15</a>
    < --><a href="/netze/tools/whois-abfrage/?rm=whois_formular">&#57;<!-- >226.180.195.155
    < -->&#50;<!-- > 253.159.244.9
    < -->&#46;<!-- > 253.239.61.182< -->&#50;<!-- >230.121.254.208
    < -->&#50;<!-- > 251.168.157.152
    < -->&#52;<!-- > 254.121.189.15< -->&#46;<!-- > 237.24.153.213< -->&#56;<!-- >246.217.119.248
    < -->&#46;<!-- > 245.167.107.28
    < -->&#49;<!-- >226.204.198.25
    < -->&#49;<!-- > 233.167.179.189
    < -->&#55;<!-- > 228.193.179.191< --></a></h1>

    (From: http://www.heise.de/netze/tools/ip)

    Returning the correct IP address (92.224.8.117 in this case) from this piece of HTML is not impossible, and with enough effort, someone may be able to write a regexp that does the job for this special obfuscation. But with HTML::Parser, it is essentially a no-brainer requiring about 10 lines of code (Sorry, Heise ...). And unless the author finds a way to confuse HTML::Parser without breaking browsers, it will not fail when the author modifies the obfuscation. (Well, changing the H1 tag or its class attribute would break this special implementation.)

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::Parser;

    my $ip='';
    my $wanted=0;
    HTML::Parser->new(
        api_version => 3,
        start_h => [sub { $wanted=1 if ($_[0] eq 'h1') && ($_[1]->{'class'} eq 'myip') }, 'tagname,attr'],
        end_h => [sub { $wanted=0 if $_[0] eq 'h1' }, 'tagname'],
        text_h => [sub { $ip.=$_[0] if $wanted }, 'dtext'],
    )->parse(get('http://www.heise.de/netze/tools/ip'));
    print "$ip\n";

    So, DON'T use regular expressions to parse HTML or XML. Except perhaps in very special cases where you control how the HTML/XML is generated.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      Returning the correct IP address (92.224.8.117 in this case) from this piece of HTML is not impossible, and with enough effort, someone may be able to write a regexp that does the job for this special obfuscation. But with HTML::Parser, it is essentially a no-brainer requiring about 10 lines of code.
      Sounds like a challenge....

      I wrote this on my first try, and it seems to work:

      s{(?:<!(?:--[^-]*(?:-[^-]+)*--\s*)*>)|(?:</?\w[^"'>]*(?:(?:(?:"[^"]*")|(?:'[^']*'))[^"'>]*)*>)}{}g;
      s{&#([0-9]+);}{chr $1}eg;
      Only two lines, and still a no-brainer. ;-)
      The code above should remove all tags and comments, keep any < and > characters that aren't part of a tag, and translate any numeric entities. Things it won't do correctly: declared sections, and short tags. But most browsers won't deal with them correctly either. Oh, and the \w is a short cut, and not quite correct.
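
      For anyone who wants to try the two substitutions without fetching the page, here is a small sketch. The fragment is a shortened, made-up imitation of the obfuscation quoted above (fewer decoy comments, a dummy href), and strip_markup is only an illustrative wrapper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wraps the two substitutions from the post: drop comments and tags,
# then turn decimal numeric entities into characters.
sub strip_markup {
    my ($html) = @_;
    $html =~ s{(?:<!(?:--[^-]*(?:-[^-]+)*--\s*)*>)|(?:</?\w[^"'>]*(?:(?:(?:"[^"]*")|(?:'[^']*'))[^"'>]*)*>)}{}g;
    $html =~ s{&#([0-9]+);}{chr $1}eg;
    return $html;
}

# Shortened, hypothetical fragment in the style of the obfuscated page.
my $fragment = q{<h1 class="myip"><!--- > decoy <a href="/x">127.0.0.1</a> < -->}
             . q{<a href="/x">&#57;<!-- >226.180.195.155 < -->&#50;<!-- > 253.159.244.9 < -->}
             . q{&#46;<!-- > 253.239.61.182< -->&#50;&#50;&#52;&#46;&#56;&#46;&#49;&#49;&#55;</a></h1>};

my $result = strip_markup($fragment);
print "$result\n";
```

      Note that this only isolates the address because the input is already limited to the one h1 element; run against a whole page, the same substitutions keep every piece of text on it.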
        Sounds like a challenge....

        ... for perl golf? A little bit too easy, I think.

        I wrote this on my first try

        Nice, but it doesn't work when applied to the entire page (not just the fragment I posted). I see a lot of page fragments in the result. The IP is there, but buried in a lot of junk.

        Only two lines, and still a no-brainer. ;-)

        It seems you can write REs with just your muscle memory ... ;-) My brain is already in sleep mode, so I can see only line noise. I will look again tomorrow ...

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      I underlined "However" for a reason. A simple regex works, but one must understand its limitations, of which there are many. ;-)