in reply to Re^2: Split/Match Question
in thread Split/Match Question

Returning the correct IP address (92.224.8.117 in this case) from this piece of HTML is not impossible, and with enough effort, someone may be able to write a regexp that does the job for this special obfuscation. But with HTML::Parser, it is essentially a no-brainer requiring about 10 lines of code.
Sounds like a challenge....

I wrote this on my first try, and it seems to work:

s{(?:<!(?:--[^-]*(?:-[^-]+)*--\s*)*>)|(?:</?\w[^"'>]*(?:(?:(?:"[^"]*")|(?:'[^']*'))[^"'>]*)*>)}{}g;
s{&#([0-9]+);}{chr $1}eg;
Only two lines, and still a no-brainer. ;-)
The code above should remove all tags and comments, keep any < and > characters that aren't part of a tag, and translate any numeric entities. Things it won't do correctly: declared sections and short tags. But most browsers won't deal with them correctly either. Oh, and the \w is a shortcut, and not quite correct.
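
For anyone who wants to try it, here is a minimal sketch that wraps the two substitutions in a sub and runs them on a small sample. The sub name strip_html and the sample string are made up for illustration (this is not the HTML from the thread), and the pattern is written without the line-wrap marker.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the two substitutions from the post in order.
sub strip_html {
    my ($html) = @_;

    # 1. Remove comments and tags. Quoted attribute values may contain '>'.
    $html =~ s{(?:<!(?:--[^-]*(?:-[^-]+)*--\s*)*>)|(?:</?\w[^"'>]*(?:(?:(?:"[^"]*")|(?:'[^']*'))[^"'>]*)*>)}{}g;

    # 2. Translate decimal numeric entities such as &#57; into characters.
    $html =~ s{&#([0-9]+);}{chr $1}eg;

    return $html;
}

# Made-up sample: an address split across tags, a comment, an entity,
# and a bare '<' that is not part of any tag.
my $sample = q{<span title="a>b">&#57;2<!-- junk -->.22</span>4.8<b class='x'>.117</b> (1 < 2)};

print strip_html($sample), "\n";    # prints: 92.224.8.117 (1 < 2)
```

Note that the bare < in "(1 < 2)" survives, because the tag alternative requires a word character (or a slash and a word character) directly after the <.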

Re^4: Split/Match Question
by afoken (Chancellor) on May 16, 2010 at 22:40 UTC
    Sounds like a challenge....

    ... for perl golf? A little bit too easy, I think.

    I wrote this on my first try

    Nice, but it doesn't work when applied to the entire page (not just the fragment I posted). I see a lot of page fragments in the result. The IP is there, but buried in a lot of junk.

    Only two lines, and still a no-brainer. ;-)

    It seems you can write REs with just your muscle memory ... ;-) My brain is already in sleep mode, so I can see only line noise. I will look again tomorrow ...

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)