Hi, it sounds like the files you are parsing are HTML. There is a great little widget called HTML::TokeParser that breaks an HTML file into tokens (tags and the text between them). Say you want to find "foo". Here are some examples:

<p>Here is a foo
<p>Here is another foo.
<p>Here <b>is a foo in bold</b>
<p><a href="http://foo.com"> foo.com </a>

and here is how that renders in a browser - this is that exact HTML:

Here is a foo

Here is another foo.

Here is a foo in bold

foo.com

Some of the problems: the opening tag may not be on the same line as the text you are matching, because newlines in HTML are just whitespace and make no difference to how the content renders. There may or may not be a closing tag (none of the <p> tags above has one). Also note that technically everything is 'within' some sort of tag, so you need to specify what you want more exactly. Whether you mean 'within' as in the href example or something else, TokeParser is your friend.

To show you how useful it is, the little TokeParser example below finds all the heading tags (h1 h2 h3 h4) in an HTML doc, gets the trimmed text between the opening and closing tag (minus any other tags), color codes it and prints out the color coded headings, producing a quick and dirty index. If you only want to look for stuff in the text, this makes it easy: you can rebuild the line from the tokens.

So have a look at the docs for TokeParser. It breaks everything down into little bits. Once it has done that, testing whether something is in a tag (whatever you mean by that) is easy, as TokeParser has done all the work for you. A regex solution will almost always be a kludge and broken in some cases. Reliability == TokeParser

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

my $dir  = "c:/windows/desktop/book/";
my $file = $dir."work_index.htm";
my $p = HTML::TokeParser->new($file) || die "Can't open $file: $!";

# color for each heading level
my %font = ( h1 => '#0000ff', h2 => '#0000a0', h3 => '#000060', h4 => '#000000');

# get_tag skips ahead to the next h1..h4 start tag
while (my $token = $p->get_tag(qw(h1 h2 h3 h4))) {
    my $open  = $token->[0];
    my $close = '/'.$open;
    # text up to the matching end tag, with any nested tags dropped
    my $text  = $p->get_trimmed_text($close);
    print "<$open><font color='$font{$open}'>$text</font><$close>\n";
}
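
For the "is foo within a tag" question itself, a sketch along the following lines might be a starting point. It walks the token stream with get_token, keeps a rough stack of the tags that are currently open, and notes whether each "foo" turns up in text or in an attribute value. The find_word and close_tag helpers, the %void list and the rule that a new <p> ends the previous one are all assumptions made for illustration, not part of HTML::TokeParser, and the tag tracking is deliberately simplistic.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Tags that never take a closing tag, so they are not tracked as open.
# (Hypothetical short list, extend as needed.)
my %void = map { $_ => 1 } qw(br hr img input link meta);

# find_word() is a hypothetical helper, not part of HTML::TokeParser.
# It returns one string per place where $word occurs.
sub find_word {
    my ($html, $word) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse HTML";
    my (@open, @hits);

    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {                      # start tag
            my ($tag, $attr) = @{$token}[1, 2];
            # Simplification: a new <p> ends any <p> that is still open.
            close_tag(\@open, 'p') if $tag eq 'p';
            for my $name (sort keys %$attr) {
                push @hits, "attribute <$tag $name>"
                    if $attr->{$name} =~ /\Q$word\E/i;
            }
            push @open, $tag unless $void{$tag};
        }
        elsif ($type eq 'E') {                   # end tag
            close_tag(\@open, $token->[1]);
        }
        elsif ($type eq 'T') {                   # text between tags
            push @hits, 'text in ' . (join(' > ', @open) || '(no tag)')
                if $token->[1] =~ /\Q$word\E/i;
        }
    }
    return @hits;
}

# Drop $tag and everything opened after it; do nothing if it is not open.
sub close_tag {
    my ($open, $tag) = @_;
    for my $i (reverse 0 .. $#$open) {
        if ($open->[$i] eq $tag) {
            splice @$open, $i;
            return;
        }
    }
}

my $html = <<'HTML';
<p>Here is a foo
<p>Here is another foo.
<p>Here <b>is a foo in bold</b>
<p><a href="http://foo.com"> foo.com </a>
HTML

print "$_\n" for find_word($html, 'foo');
```

Each hit comes back as a line like "text in p > b" or "attribute <a href>", so you can decide afterwards which kind of 'within' you care about.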

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print


In reply to Re: Regular Expression Help by tachyon
in thread Regular Expression Help by Anonymous Monk
