Parsing HTML tags with regex

jithoosin has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a problem. I want to get every HTML tag in a file, that is, everything between < and >, using a regex (without using HTML::TokeParser). So I used m/<([^>]+)>/.

But the problem occurs in cases like this: <select name="url>adee" value="wq<ew">. Here the ">" inside "url>adee" stops the regex too early. Is there a good solution using a regex?
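
For reference, a minimal sketch of the failure described above. The pattern and the input are the ones from the question; the sub name `naive_tags` is made up for illustration.

```perl
use strict;
use warnings;

# The tag from the question, with '>' and '<' inside quoted attribute values.
my $html = '<select name="url>adee" value="wq<ew">';

# Naive pattern: everything between '<' and the first '>' that follows it.
sub naive_tags {
    my ($text) = @_;
    return $text =~ m/<([^>]+)>/g;   # list context: every capture
}

my @tags = naive_tags($html);
print "captured: $_\n" for @tags;
# captured: select name="url
# captured: ew"
```

The single tag is reported as two fragments, because the pattern knows nothing about quotes.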

Having read the first two replies, I will explain my situation. I have a BET with my friends that it is possible to do this with a regex. Please help me. There is a solution for everything. I want to win the bet.

2005-11-11 Retitled by g0n, as per Monastery guidelines
Original title: 'simple regExpr'


Replies are listed 'Best First'.
Re: Parsing HTML tags with regex by tphyahoo (Vicar) on Nov 11, 2005 at 10:46 UTC
Not so fast, pal. Did you really win the bet? Can your regex process HTML comments with brackets in them, such as <!-- Html comment with a bracket... > -->? No? Use one of the HTML::* modules and go crawl back to your friend and admit you were wrong.
Re^2: Parsing HTML tags with regex by jithoosin (Scribe) on Nov 11, 2005 at 11:55 UTC
Thanks for the notification.
Re: Parsing HTML tags with regex by gopalr (Priest) on Nov 11, 2005 at 08:50 UTC
Hi jithoosin, here is a regex that matches a tag together with its attribute values: `m#<([^">]+(?:"[^"]+")*[^>]+)>#` Thanks, Gopal.R
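
A short usage sketch, applying the pattern above to the tag from the question. The helper name `first_tag` is an assumption made for this example; the pattern itself is copied unchanged.

```perl
use strict;
use warnings;

# gopalr's pattern, stored with qr so it can be reused.
my $tag_re = qr#<([^">]+(?:"[^"]+")*[^>]+)>#;

# Hypothetical helper: text captured for the first match, or undef.
sub first_tag {
    my ($text) = @_;
    my ($inner) = $text =~ $tag_re;
    return $inner;
}

my $html = '<select name="url>adee" value="wq<ew">';
print first_tag($html), "\n";
# prints: select name="url>adee" value="wq<ew"
```

Here the first quoted value is consumed by the `"[^"]+"` group and the second one by the trailing `[^>]+`, which is why the whole tag comes back in one piece.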
Re^2: Parsing HTML tags with regex by Perl Mouse (Chaplain) on Nov 14, 2005 at 11:22 UTC
But that would match on `a < b implies b > a`, which does not contain an HTML tag. Oh, and it won't handle all HTML tags correctly either. Consider for instance: `<tag attr1="one" attr2="two"> <tag attr='"'> <tag attr1='"'>` The first one is not handled as intended because your regex requires that if there are double-quoted values inside a tag, they must follow each other directly. And the second is not handled as intended because your regex doesn't consider single-quoted values. `Perl --((8:>*`
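
A small sketch that makes these objections visible. The first input is the one quoted above; the two tags with a `>` inside an attribute value are illustrative inputs of my own, chosen so that the shortfall shows up in the captured text. The helper name `first_capture` is likewise an assumption.

```perl
use strict;
use warnings;

# gopalr's pattern from the earlier reply, unchanged.
my $re = qr#<([^">]+(?:"[^"]+")*[^>]+)>#;

# Hypothetical helper: text captured for the first match, or undef.
sub first_capture {
    my ($text) = @_;
    my ($inner) = $text =~ $re;
    return $inner;
}

my @cases = (
    'a < b implies b > a',              # plain text, no tag at all
    q{<tag attr1="one" attr2="a>b">},   # second double-quoted value, after a space
    q{<tag attr='a>b'>},                # single-quoted value
);

for my $text (@cases) {
    my $got = first_capture($text);
    printf "%-32s => %s\n", $text, defined $got ? "[$got]" : 'no match';
}
```

The plain-text line yields `[ b implies b ]`, and both tags are cut off at the `>` inside the quoted value (`[tag attr1="one" attr2="a]` and `[tag attr='a]`).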
Re^2: Parsing HTML tags with regex by Anonymous Monk on Jan 19, 2012 at 09:11 UTC
Thanks gopal, the above regex was useful.
Re^2: Parsing HTML tags with regex by jithoosin (Scribe) on Nov 11, 2005 at 09:23 UTC
Hi gopal, THANK YOU VERY MUCH. I won the bet. But now I am in a bit of trouble: I do not know how to explain to my friends how it works. So could you PLEASE explain how the regular expression works? Once again, THANK YOU VERY MUCH GOPAL.
Re^3: Parsing HTML tags with regex by gopalr (Priest) on Nov 11, 2005 at 10:09 UTC
`m{
    <                 # start with <
    (                 # group start
      [^">]+          # text that is neither " nor >
      (?:"[^"]+")*    # if a " is found, match up to the closing quote; optional
      [^>]+           # remaining text that is not >
    )                 # group end
    >                 # end with >
}x`
Re^4: Parsing HTML tags with regex by jithoosin (Scribe) on Nov 11, 2005 at 12:21 UTC
Re: Parsing HTML tags with regex by pg (Canon) on Nov 11, 2005 at 08:21 UTC
"without using HTML::TokeParser" Why? This is simply not the right decision. In this case it is more important to do it right, with the right tool, an HTML parser (for example the one murugu mentioned), rather than struggling to find the "right regexp".
Re: Parsing HTML tags with regex by Skeeve (Parson) on Nov 11, 2005 at 09:47 UTC
Being picky again and, correct me anyone knowing better, but `<select name="url>adee" value="wq<ew">` is not legal HTML. It has to be encoded as `<select name="url&gt;adee" value="wq&lt;ew">` `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re^2: Parsing HTML tags with regex by Perl Mouse (Chaplain) on Nov 14, 2005 at 11:17 UTC
"Being picky again and, correct me anyone knowing better, but `<select name="url>adee" value="wq<ew">` is not legal HTML."
I know better. You are wrong. It is legal HTML. Don't let the fact that some browsers can't parse it fool you. `Perl --((8:>*`
Re^2: Parsing HTML tags with regex by jithoosin (Scribe) on Nov 11, 2005 at 11:59 UTC
Hi skeeve, the actual tag was `<select name="url" style="width:125px" size="1" onchange="if (this.selectedIndex>0) parent.location.href=this.options[this.selectedIndex].value;">`. I just replaced it with a simpler one.
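
For a tag like this one, a quote-aware pattern is one possible direction. The following is only a sketch under my own assumptions, not something posted in the thread: it treats double-quoted and single-quoted runs as units and requires a letter (optionally preceded by `/`) right after `<`. It ignores comments, doctype declarations and unbalanced quotes. The names `$quote_aware` and `tags_in` are hypothetical.

```perl
use strict;
use warnings;

# '<', optional '/', a letter, then any mix of:
#   a character that is not '>', '"' or "'", a "double-quoted" run,
#   or a 'single-quoted' run; finally the closing '>'.
my $quote_aware = qr{<(/?[A-Za-z](?:[^>"']|"[^"]*"|'[^']*')*)>};

# Hypothetical helper: the inside of every tag-like match.
sub tags_in {
    my ($text) = @_;
    return $text =~ /$quote_aware/g;
}

my $real = q{<select name="url" style="width:125px" size="1" onchange="if (this.selectedIndex>0) parent.location.href=this.options[this.selectedIndex].value;">};

print "$_\n" for tags_in($real);                # the whole tag, '>' in onchange included
print "$_\n" for tags_in(q{<tag attr='a>b'>});  # tag attr='a>b'

my @none = tags_in('a < b implies b > a');
print scalar(@none), " matches in plain text\n";  # 0 matches in plain text
```

Text such as `a <b and c> d` would still be reported as a tag, so this narrows the problem rather than solving it.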
Re^3: Parsing HTML tags with regex by fizbin (Chaplain) on Nov 11, 2005 at 12:40 UTC
That's not legal HTML either. Oh, sure, people put crap like that on their HTML pages, but it's not legal HTML: throw it at any HTML validator. The legal version of that is: `<select name="url" style="width:125px" size="1" onchange="if (this.selectedIndex&gt;0) parent.location.href=this.options[this.selectedIndex].value;">` -- `@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/`
Re: Parsing HTML tags with regex by murugu (Curate) on Nov 11, 2005 at 08:20 UTC
Try HTML::Parser. Regards, Murugesan Kandasamy use perl for(;;);
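
As a rough illustration of the parser route, here is a sketch using HTML::TokeParser, which ships with the HTML-Parser distribution on CPAN (it is not a core module, so it has to be installed). The sample document and the sub name `start_tags` are assumptions made for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # part of the HTML-Parser distribution (CPAN)

# Sample input: the tag from the question plus a comment containing '>'.
my $html = '<p><select name="url>adee" value="wq<ew"></select>'
         . '<!-- a bracket > in a comment --></p>';

# Hypothetical helper: name and attributes of every start tag.
sub start_tags {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)   # a scalar ref means "parse this string"
        or die "cannot create parser";
    my @found;
    while (my $token = $parser->get_token) {
        next unless $token->[0] eq 'S';         # 'S' marks a start tag
        my (undef, $name, $attr) = @$token;     # tag name, attribute hash ref
        push @found, { name => $name, attr => $attr };
    }
    return @found;
}

for my $tag (start_tags($html)) {
    my $attr = $tag->{attr};
    print $tag->{name}, ': ',
          join(', ', map { "$_=$attr->{$_}" } sort keys %$attr), "\n";
}
```

The parser keeps the quoted `>` and `<` inside the attribute values and reports the comment as a comment token rather than as a tag.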
Re: Parsing HTML tags with regex by BUU (Prior) on Nov 11, 2005 at 08:32 UTC
It's not really possible with a real regex. HTML is an arbitrarily nested grammar, which doesn't work very well with a "regular" expression. However, given that Perl's regexen are of the scary, non-regular kind, you could probably manage to do it. Like so: `/(.*)(?{HTML::TokeParser->new( $1 )})/`
