
I'm trying to see a way to solve it without using HTML::Parser

My solution will probably make you rethink that...

I'd use a regex backreference to catch and strip paired HTML tags such as bold (<b>blah</b>) or italics (<i>blah</i>). Note, however, that my snippet breaks on the final line of DATA, because the backreference in the regex assumes that the closing HTML tag will be exactly </$1>, which is never true for a tag that carries attributes.

#!/usr/bin/perl

use strict;
use warnings;

while(<DATA>)
{
  chomp;

  while ( m{ <([^>]*?)> [^<]*? </\1> }gx )
  {
    my $token = $1;

    # \Q...\E stops characters in the tag from acting as regex syntax
    s{<\Q$token\E>}{}g;
    s{</\Q$token\E>}{}g;
  }

  
  # Of course, this will be hard to do if you
  # "don't know how many dashes, if any, will
  # be there."
  print "\t$_\n" for ( split (/-- /,$_) );
  print "\n";

}

__DATA__
This is a -- string of -- words
<b>This is a -- string of -- words</b>
This <b>is a -- string</b> of -- words
This <i>is</i> a -- <b>string</b> of -- words
This <i>is a -- <b>nested set</b> of</i> -- tokens
This is -- a nifty -- <A HREF="http://google.com">search engine</A>
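
To illustrate the attribute problem: the backreference fails on the last DATA line because $1 holds the whole opening tag, attributes included. A minimal sketch of one way around it, assuming it is acceptable to capture only the tag name; strip_paired_tags is a hypothetical helper, not part of the script above:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper (not from the original script): strip innermost
# tag pairs over and over until no complete pair is left.
sub strip_paired_tags {
    my ($line) = @_;

    # (\w+) captures only the tag name, so attributes such as HREF="..."
    # stay out of the backreference \1. The \b stops "b" from matching
    # the start of a longer name such as "br".
    1 while $line =~ s{ <(\w+)\b[^>]*> ([^<]*) </\1> }{$2}xi;

    return $line;
}

print strip_paired_tags($_), "\n" for (
    'This <i>is a -- <b>nested set</b> of</i> -- tokens',
    'This is -- a nifty -- <A HREF="http://google.com">search engine</A>',
);
```

Because each pass only removes a pair with no other tag between its halves, nested pairs come off from the inside out, and the loop stops as soon as a pass finds nothing to remove.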

Update: I tweaked the script above to handle tags whose opening and closing forms differ, such as
<A HREF="http://google.com">Google</A>

This version can't handle paired and unpaired tags in the same line. See the last line of DATA, which causes the script to hang; hence the leading # and the skip condition.

#!/usr/bin/perl

use strict;
use warnings;

while(<DATA>) {
  /^#/ and next;
  chomp;

  if ( m{ <([^>]*?)> [^<]*? </\1> }x )
  {
    while ( m{ <([^>]*?)> [^<]*? </\1> }gx )
    {
      my $token = $1;

      # Some verbose info. Note first line doesn't
      # get printed because it doesn't match regex
      print $_, "\n";
      print "Found <$token> and </$token>, removing...\n";

      # \Q...\E stops characters in the tag from acting as regex syntax
      s{<\Q$token\E>}{}g;
      s{</\Q$token\E>}{}g;
    }
  }

  else
  {
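    # e.g. <A HREF="http://google.com">search engine</A>
    # Caution: nothing changes $_ when the inner match fails, so this
    # loop never exits on mis-nested tags (the cause of the hang).
    while ( m{ </([^>]*?)> }x )
    {
      my $close = $1;
      # \Q...\E quotes the URL punctuation and keeps $close[ from being
      # read as an array element
      if ( m{ <(\Q$close\E[^>]*?)> [^<]*? </\Q$close\E> }x )
      {
        my $open  = $1;
        print $_,"\n";
        print "Found <$open> and </$close>, removing\n";
        s{<\Q$open\E>}{}g;
        s{</\Q$close\E>}{}g;
        print $_,"\n";
      }
    }
  }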

  # Of course, this will be hard to do if you
  # "don't know how many dashes, if any, will
  # be there."
  print "\t$_\n" for ( split (/-- /,$_) );
  print "\n";

}

__DATA__
This is a -- string of -- words
<b>This is a -- string of -- words</b>
This <b>is a -- string</b> of -- words
This <i>is</i> a -- <b>string</b> of -- words
This <i>is a -- <b>nested set</b> of</i> -- tokens
This is -- an awesome -- <A HREF="http://google.com">search engine</A>
Truly an -- ugly -- <A HREF="http://perl.com"><FONT COLOR="RED">nested</FONT> string</A>
#This string -- <b>causes -- <A HREF="http://perl.com"><FONT COLOR="RED">my box</b></FONT> to hang</A>
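
The hang on the commented-out line comes from the closing-tag loop: it keeps finding </b>, the matching opener is never found because other tags sit in between, and nothing ever changes the line. A rough sketch of the same closing-tag-first idea with a progress flag, so the loop gives up once a full pass removes nothing; strip_by_closing_tag is a made-up name, and leaving mis-nested tags in place is my assumption about the least surprising fallback:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: look at each closing tag, try to remove it along
# with its opener, and stop when a whole pass makes no progress.
sub strip_by_closing_tag {
    my ($line) = @_;
    my $changed = 1;

    while ($changed) {
        $changed = 0;

        # In list context m//g returns every captured closing tag name.
        for my $close ( $line =~ m{ </([^>\s]+)> }xg ) {

            # Remove one innermost <name ...>text</name> pair, if any.
            if ( $line =~ s{ <\Q$close\E\b[^>]*> ([^<]*) </\Q$close\E> }{$1}xi ) {
                $changed = 1;
            }
        }
    }

    return $line;
}

print strip_by_closing_tag($_), "\n" for (
    'Truly an -- ugly -- <A HREF="http://perl.com"><FONT COLOR="RED">nested</FONT> string</A>',
    'This string -- <b>causes -- <A HREF="http://perl.com"><FONT COLOR="RED">my box</b></FONT> to hang</A>',
);
```

With this guard the mis-nested line comes back untouched instead of spinning forever, which at least lets the split on dashes run.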

Urgh. HTML::Parser really is your friend here. By the way, I wanted to comment my regexes but found myself unable to describe them adequately.
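
For comparison, here is a small sketch of the HTML::Parser route, assuming the module is installed from CPAN and that only the text between tags matters. strip_tags is a hypothetical helper name; the constructor options are the version 3 handler interface as I understand it:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text of $html with all markup dropped.
sub strip_tags {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Only text events get a handler, so tags are simply ignored.
        # 'dtext' passes the handler text with entities decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;    # flush any text the parser is still holding

    return $text;
}

my $line = 'This string -- <b>causes -- <A HREF="http://perl.com">'
         . '<FONT COLOR="RED">my box</b></FONT> to hang</A>';

print "\t$_\n" for split /-- /, strip_tags($line);
```

The parser does not care whether tags are paired, nested, or mis-nested, so the line that hangs the regex version goes through like any other.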

blyman
setenv EXINIT 'set noai ts=2'

In reply to Re: A nice text processing question by belden
in thread A nice text processing question by moseley
