Hi, it sounds like the files you are parsing are HTML. There is a great little module called HTML::TokeParser that parses the tokens (tags and text) in an HTML file. Say you want to find "foo". Here are some examples:
<p>Here is a foo
<p>Here is another foo.
<p>Here <b>is a foo in bold</b>
<p><a href="http://foo.com"> foo.com </a>
and here is how that renders in a browser - this is that exact HTML:
Here is a foo
Here is another foo.
Here is a foo in bold
foo.com
Some of the problems: the opening tag may not be on the same line as the text you are looking at, because in HTML a newline is treated as ordinary whitespace when the content is rendered. There may or may not be a closing tag. Also note that technically everything is 'within' some sort of tag, so you need to specify what you want more exactly. Whether you mean 'within' as in the href example or something else, TokeParser is your friend.
To show you how useful it is, here is a little TokeParser example that finds all the heading tags (h1 h2 h3 h4) in an HTML doc, gets the trimmed text between the opening and closing tag (minus other tags), color codes it, and then prints out the color coded headings, producing a quick and dirty index. If you only want to look for stuff in the text, this makes it easy! You can rebuild the line from the tokens.
So have a look at the docs for TokeParser. It breaks everything down into little bits. Once you have done this, testing whether something is in a tag (whatever you mean by that) is easy, as TokeParser has done all the work for you. A regex solution will almost always be a kludge and broken in some cases. Reliability == TokeParser
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

my $dir  = "c:/windows/desktop/book/";
my $file = $dir."work_index.htm";
my $p = HTML::TokeParser->new($file) or die "Can't open $file: $!";

# one color per heading level
my %font = ( h1 => '#0000ff',
             h2 => '#0000a0',
             h3 => '#000060',
             h4 => '#000000');

# get_tag skips ahead to the next h1..h4 start tag
while (my $token = $p->get_tag(qw(h1 h2 h3 h4))) {
    my $open  = $token->[0];     # tag name, e.g. 'h2'
    my $close = '/'.$open;       # matching end tag, e.g. '/h2'
    # text up to the end tag, with whitespace trimmed and inner tags dropped
    my $text  = $p->get_trimmed_text($close);
    print "<$open><font color='$font{$open}'>$text</font><$close>\n";
}
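For the original question of spotting "foo", something along these lines might do as a starting point. It is only a sketch: find_word and $last_tag are names made up for illustration and are not part of HTML::TokeParser. It reports the most recent opening tag rather than a true enclosing element, since closing tags are often missing. It uses get_token, which returns every token in turn: start tags arrive as ['S', $tag, \%attr, ...] and text as ['T', $text, ...].

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample HTML: the four "foo" examples from earlier in this post.
my $html = <<'HTML';
<p>Here is a foo
<p>Here is another foo.
<p>Here <b>is a foo in bold</b>
<p><a href="http://foo.com"> foo.com </a>
HTML

# find_word() is a hypothetical helper, not a TokeParser method.
# It returns one string per hit, saying whether the word turned up
# in an attribute value or in plain text.
sub find_word {
    my ($doc, $word) = @_;

    # A reference to a scalar is parsed as the literal document.
    my $p = HTML::TokeParser->new(\$doc) or die "Can't parse document";

    my @hits;
    my $last_tag = '';    # most recent start tag seen (an approximation)

    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            my ($tag, $attr) = @{$token}[1, 2];
            $last_tag = $tag;
            for my $name (sort keys %$attr) {
                my $value = $attr->{$name};
                push @hits, "attribute: <$tag $name>"
                    if defined $value && $value =~ /\Q$word\E/;
            }
        }
        elsif ($type eq 'T') {
            push @hits, "text after <$last_tag>"
                if $token->[1] =~ /\Q$word\E/;
        }
    }
    return @hits;
}

print "$_\n" for find_word($html, 'foo');
```

The point is that the regex only ever sees one token at a time, so it never has to cope with tags split across lines or with missing closing tags. If you need real nesting rather than "the last tag opened", you would have to keep a stack of open tags yourself.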
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
In reply to Re: Regular Expression Help
by tachyon
in thread Regular Expression Help
by Anonymous Monk