simple regex help

nmerriweather has asked for the wisdom of the Perl Monks concerning the following question:

Re: simple regex help by wfsp (Abbot) on Apr 18, 2007 at 16:57 UTC
Hi nmerriweather,

<quote>i'd like to not use an html tree module to handle this -- and keep it all in regex. is this possible?</quote>

In a word, yes. But, imo, very tricky. You didn't say what exactly you're looking for. Perhaps some examples, including those nested `<li>`s?

It's very easy to "get at" all the HTML elements. I'd wager a solution could be found using something like the following.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = do{local $/; <DATA>};
    my $p = HTML::TokeParser::Simple->new(\$html)
      or die "can't parse string: $!\n";

    while (my $t = $p->get_token){
      printf "%s\n", $t->as_is;
    }
    __DATA__
    <li><span class="title">Title</span> MATCH HERE </li>

output:

    <li>
    <span class="title">
    Title
    </span>
     MATCH HERE 
    </li>
Re^2: simple regex help by nmerriweather (Friar) on Apr 18, 2007 at 17:08 UTC
<quote>You didn't say what exactly you're looking for. Perhaps some examples, including those nested `<li>`s?</quote>

Well, anything in the 'match here' -- the content changes. The only given I know is that the outermost match is this:

    <li><span class="title">Title</span> MATCH HERE </li>

MATCH HERE could be a single letter, or it could be an HTML structure that itself potentially matches the regex.

I really need to keep this in regex if possible -- using the tree objects is a last resort.
Re^3: simple regex help by wfsp (Abbot) on Apr 18, 2007 at 17:19 UTC
I've used a stack to keep track of opening/closing li tags.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = do{local $/; <DATA>};
    my $p = HTML::TokeParser::Simple->new(\$html)
      or die "can't parse string: $!\n";

    # skip everything up to and including the title span
    while (my $t = $p->get_token){
      last if $t->is_end_tag('span');
    }

    my ($match, @li_stack);
    while (my $t = $p->get_token){
      if ($t->is_start_tag('li')){
        push @li_stack, 'li';
      }
      if ($t->is_end_tag('li')){
        if (@li_stack){
          pop @li_stack;
        }
        else{
          last;    # this </li> closes the outermost li
        }
      }
      $match .= $t->as_is;
    }
    print "$match\n";
    __DATA__
    <li><span class="title">Title</span><ul><li>one</li><li>two</li></ul> MATCH HERE </li>

output:

    <ul><li>one</li><li>two</li> MATCH HERE

Update: Added output.

Update 2: See ikegami's reply below.
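For what it's worth, the same depth-tracking idea can be sketched with core Perl only, which may sit closer to the "keep it in regex" constraint. This is a minimal sketch, not wfsp's code: `li_remainder` is a hypothetical helper name, a counter stands in for the stack, and it assumes every nested `<li>` has an explicit `</li>` and that no `<` or `>` appears inside attribute values or comments.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns everything between the
# first </span> and the </li> that closes the outermost LI, or undef.
sub li_remainder {
    my ($html) = @_;

    # Skip past the first closing </span>, as the first loop above does.
    return unless $html =~ m{</span\s*>}gi;
    my $start = pos($html);

    my $depth = 0;    # number of nested <li> elements currently open
    while ($html =~ m{<(/?)li\b[^>]*>}gi) {
        if ($1 eq '') {
            $depth++;             # a nested <li> was opened
        }
        elsif ($depth > 0) {
            $depth--;             # a nested </li> was closed
        }
        else {
            # This </li> closes the outermost LI: return what came before it.
            return substr($html, $start, $-[0] - $start);
        }
    }
    return;    # no closing </li> found
}

my $sample = '<li><span class="title">Title</span>'
           . '<ul><li>one</li><li>two</li></ul> MATCH HERE </li>';
print li_remainder($sample), "\n";
```

The counter only needs to know how deep we are, not which tag was opened, so a plain integer is enough here. It would miscount on lists that omit `</li>`, which is legal HTML.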
Re^4: simple regex help by ikegami (Patriarch) on Apr 18, 2007 at 17:52 UTC
Re^5: simple regex help by Fletch (Bishop) on Apr 18, 2007 at 17:59 UTC
Some notes below your chosen depth have not been shown here
Re: simple regex help by ikegami (Patriarch) on Apr 18, 2007 at 17:14 UTC
Complication: The closing LI tag is optional.

Simplification: An LI element can only be closed by a few tags, they will always close an LI element, and one of them must be present. They are: `</LI>`, `<LI>`, `</OL>` and `</UL>`.

Simplification: OL, UL and LI elements cannot be placed inside of a SPAN element.

Complication: OL, UL and (indirectly) LI elements can be placed inside of an LI element.

    our $ul_or_ol;
    local $ul_or_ol = qr{
       (?: <ul \b
           (?: (??{ $ul_or_ol })     # a nested list (recursion)
           |   (?! </ul \b ).        # or any char that doesn't start </ul
           )*
           </ul \b [^>]* >
       |   <ol \b
           (?: (??{ $ul_or_ol })
           |   (?! </ol \b ).
           )*
           </ol \b [^>]* >
       )
    }xi;

    my $re = qr{
       <li \b [^>]* >
       <span \s+ class="[^"]+">      # \s+ because /x ignores a literal space
       [^<]+
       </span>
       (
          (?: $ul_or_ol
          |   (?! < (?:li|/li|/ol|/ul) \b ).
          )*
       )
       (?: </li \b [^>]* >               # explicit end tag
       |   (?= < (?:li|/ol|/ul) \b )     # or an implied end
       )
    }xi;

Bug/Assumption: The above can fail if there's a `<` or a `>` inside an attribute value, a comment, a script or a style.

Bug/Assumption: The above can fail if some valid SGML constructs are used. Fortunately, no one uses them.

Bug: Unmaintainable.

Note: It can be simplified if more assumptions are made. It can be optimized as well.

Untested.

Update: Fixed to handle nested lists in the "MATCH HERE" portion.
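To see the "four tags end an LI" rule in isolation, here is a cut-down sketch. It is not ikegami's pattern: it drops the SPAN requirement and the nested-list recursion, adds `/s` so `.` also matches newlines, and `li_contents` is a hypothetical helper name. It only shows that an item ends either at an explicit `</li>` or just before `<li`, `</ol` or `</ul`.

```perl
use strict;
use warnings;

# Simplified, hypothetical variant: flat lists only, no SPAN requirement.
my $li_re = qr{
    <li \b [^>]* >                                  # LI start tag
    ( (?: (?! < (?: li | /li | /ol | /ul ) \b ) . )* )   # content, one char at a time
    (?: </li \b [^>]* >                             # explicit end tag, consumed
      | (?= < (?: li | /ol | /ul ) \b )             # implied end, not consumed
    )
}xis;

# Returns the content of each LI found in $html, in document order.
sub li_contents {
    my ($html) = @_;
    my @items;
    while ($html =~ /$li_re/g) {
        push @items, $1;
    }
    return @items;
}

# Mixed input: the first and third items have no closing </li>.
print join('|', li_contents('<ul><li>one<li>two</li><li>three</ul>')), "\n";
```

Because the implied end is a lookahead, the next `<li` is left in place for the following match, which is why items with and without `</li>` can be mixed freely.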
Re: simple regex help by ikegami (Patriarch) on Apr 18, 2007 at 19:43 UTC
It's super easy to find the desired LI using XPath.

    use HTML::TreeBuilder        qw( );
    use HTML::TreeBuilder::XPath qw( );

    my $html = do{local $/; <DATA>};
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @results = $tree->findnodes(
       '//li/*[1][name()="span"]/@class/parent::*/parent::*'
    );

    foreach my $li (@results) {
       # $li holds the HTML::Element object for the LI.
       ...
    }

    $tree->delete();

It's a little bit tricky to serialize the LI without the LI start tag, the SPAN element and the LI end tag.

    use HTML::Entities qw( encode_entities );

    sub node_as_html {
       my ($node) = @_;
       if (ref($node)) {
          # element node
          my $html = $node->as_HTML(undef, undef, {});
          chomp($html);
          return $html;
       } else {
          # text node
          return encode_entities($node);
       }
    }

    ...
       my @children = $li->content_list();
       shift(@children); # Skip SPAN
       print(node_as_html($_)) foreach @children;
       print("\n");
    ...
Re: simple regex help by Moron (Curate) on Apr 18, 2007 at 17:19 UTC
You should expect to process tags recursively. You could use XML::Parser or one of its XML wrappers to do that for you. If you roll your own, the area of logic you are stumbling on should first check for a closing tag of the current tag. Otherwise, any other '<' should invoke a recursive call to get a nested tag. Reaching '</current-tag-name>' should be the last thing that happens before returning from the recursive sub that gets tags.

Free your mind!
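A rough roll-your-own sketch of that recursion, in core Perl, might look like the following. The names `parse_nodes` and `to_string` and the `{ tag, kids }` node layout are made up for illustration. It assumes XML-like markup where every start tag has a matching end tag, and it drops attributes when printing.

```perl
use strict;
use warnings;

# Hypothetical sketch. $src is a reference to the markup string; its pos()
# acts as the read cursor. $current is the tag we are inside (undef at top).
sub parse_nodes {
    my ($src, $current) = @_;
    my @nodes;
    while (1) {
        # First, check for the closing tag of the current tag.
        if (defined $current && $$src =~ m{\G</\Q$current\E\s*>}gci) {
            last;
        }
        # Any other start tag opens a nested element: recurse into it.
        elsif ($$src =~ m{\G<(\w+)[^>]*>}gc) {
            my $name = lc $1;
            push @nodes, { tag => $name, kids => parse_nodes($src, $name) };
        }
        # Plain text up to the next '<'.
        elsif ($$src =~ m{\G([^<]+)}gc) {
            push @nodes, $1;
        }
        else {
            last;    # end of input, or markup this sketch does not handle
        }
    }
    return \@nodes;
}

# Turn a parsed node list back into a string (attributes are not kept).
sub to_string {
    my ($nodes) = @_;
    my $out = '';
    for my $n (@$nodes) {
        $out .= ref($n)
            ? "<$n->{tag}>" . to_string($n->{kids}) . "</$n->{tag}>"
            : $n;
    }
    return $out;
}

my $markup = '<li><span>Title</span><ul><li>one</li></ul> MATCH HERE </li>';
my $tree   = parse_nodes(\$markup);
my @kids   = @{ $tree->[0]{kids} };
shift @kids;                          # drop the leading SPAN element
print to_string(\@kids), "\n";
```

Each call returns as soon as it has consumed its own end tag, so the nested `</li>` never gets mistaken for the outer one. Real HTML (optional end tags, void elements like `<br>`) would need more cases than this.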