Re: Re: A nice text processing question

Well, I'm starting to believe that HTML::Parser is the way to go. I was trying to avoid it for size reasons (running under mod_perl), and since I'll have many of these to do I'd like the fastest way possible.

Looking at your code I'm not sure you understood.

I'm not trying to remove the tags. Rather imagine a long string of text that may or may not have some words (or group of words) bolded or marked up in some way.

Now, what I'm then doing is splitting it up into chunks, which may end up splitting a tagged words. So one chunk may have the opening tag, where another tag may have the closing tag. Or it might get split in the middle of two tags, so that a given chunk might have the *closing* tag from the previous chunk, and the *opening* tag that's not closed until the next chunk.

In other words:

Starting text:

<tag>This is a -- bunch</tag> of words <tag>where maybe -- some have</tag> tags.

Splitting on the double dash:

<tag>This is a
bunch</tag> of words <tag>where maybe
some have</tag> tags.

Which should then be corrected to:

<tag>This is a</tag>
<tag>bunch</tag> of words <tag>where maybe</tag>
<tag>some have</tag> tags.

I might check on the lwp list, too, since I'll probably move to HTML::Parser.

Comment on Re: Re: A nice text processing question

Replies are listed 'Best First'.
Re: Re: Re: A nice text processing question by belden (Friar) on Jan 06, 2002 at 00:25 UTC
Bah. Here's a version that kind of does what you want. You're still much better off with HTML::Parser - hopefully you'll go far with that. #!/usr/bin/perl use strict; use warnings; while(<DATA>) { /^#/ and next; chomp; $_ = join ("\n", split (/-- /, $_ ) ); $_ = &tagparity($_); print $_,"\n\n"; # my $no_tags_stripped; # do { $no_tags_stripped = &tagstrip($_) } unless $no_tags_stripped; # print $_,"\n"; } exit; #<one>Here's -- lotsa</one> tagged words <two>that-- need</two>parity. # This is a pretty stupid sub that will break very easily - lines # 0, 3, 4, 5, 6, and 8 in DATA are mangled beyond belief by this sub sub tagparity() { my $string = shift; my @tags; my @temp; my %pairs; # Find pairs of tags and mate 'em up while ( $string =~ / <([^>]?)> /gx ) { push(@temp, $1); @temp %2 == 0 or next; $pairs{$temp[0]} = $temp[1]; $pairs{$temp[1]} = $temp[0]; push(@tags, @temp); @temp = (); }; # Try to insert mates my $tag_regex = join('\|',@tags); for my $stringbit( split(/\n/, $string ) ) { while ( $stringbit =~ /($tag_regex)/g ) { my $match = $1; my $mate = $pairs{$match}; # If we've matched a closing tag, insert an opening # tag, and vice versa. This is the part that mangles # lines 0, 3, 4, 5, 6, and 8 in DATA if($match =~ m{^/}) { $string =~ s/$stringbit/<$mate>$stringbit/ } else { $string =~ s/$stringbit/$stringbit<$mate>/ } } } return $string; } sub tagstrip() { my $changed = undef; if ( m{ <([^>]?)> [^<]? </\1> }x ) { while ( m{ <([^>]?)> [^<]? </\1> }gx ) { my $token = $1; s{<$token>}{}g; s{</$token>}{}g; } $changed = 1; } else { while ( m{ </([^>]?)> }x ) { my $close = $1; if ( m{ <($close[^>]?)> [^<]? </$close> }x ) { my $open = $1; s{<$open>}{}g; s{</$close>}{}g; } } $changed = 1; } return $_, $changed; } __DATA__ #0This is a -- string of -- words <b>1This is a -- string of -- words</b> 2This <b>is a -- string</b> of -- words #3This <i>is</i> a -- <b>nested tagged string </b> of -- words #4This <i>is a -- <b>nested set</b> of</i> -- tokens #5This is -- an awesome -- <A HREF="http://google.com">search engine</ +A> #6Truly an -- ugly -- <A HREF="http://perl.com"><FONT COLOR="RED">nest +ed</FONT> string</A> 7<one>Here's -- lotsa</one> tagged words <two>that-- need</two> tag pa +rity. #8This string -- <b>causes -- <A HREF="http://perl.com"><FONT COLOR="R +ED">my box</b></FONT> to hang</A> [download] blyman setenv EXINIT 'set noai ts=2'	[reply] [d/l]

Replies are listed 'Best First'.

Re: Re: Re: A nice text processing question
by belden (Friar) on Jan 06, 2002 at 00:25 UTC


#!/usr/bin/perl

use strict;
use warnings;

while(<DATA>)
{
  /^#/ and next;

  chomp;

  $_ = join ("\n", split (/-- /, $_ ) );
  $_ = &tagparity($_);
  print $_,"\n\n";

#  my $no_tags_stripped;
#  do { $no_tags_stripped = &tagstrip($_) } unless $no_tags_stripped;
#  print $_,"\n";

}


exit;


#<one>Here's -- lotsa</one> tagged words <two>that-- need</two>parity.
# This is a pretty stupid sub that will break very easily - lines
# 0, 3, 4, 5, 6, and 8 in DATA are mangled beyond belief by this sub
sub tagparity()
{

  my $string = shift;
  my @tags;
  my @temp;
  my %pairs;

  # Find pairs of tags and mate 'em up
  while ( $string =~ / <([^>]*?)> /gx ) 
  {
    push(@temp, $1);
    @temp %2 == 0 or next;
    $pairs{$temp[0]} = $temp[1];
    $pairs{$temp[1]} = $temp[0];

    push(@tags, @temp);
    @temp = ();
  };

  # Try to insert mates
  my $tag_regex = join('|',@tags);
  for my $stringbit( split(/\n/, $string ) )
  {
    while ( $stringbit =~ /($tag_regex)/g )
    {
      my $match = $1;
      my $mate = $pairs{$match};

      # If we've matched a closing tag, insert an opening
      # tag, and vice versa. This is the part that mangles
      # lines 0, 3, 4, 5, 6, and 8 in DATA
      if($match =~ m{^/}) {
        $string =~ s/$stringbit/<$mate>$stringbit/
      }
      else {
        $string =~ s/$stringbit/$stringbit<$mate>/
      }

    }

  }
  return $string;

}



sub tagstrip()
{
  my $changed = undef;

  if ( m{ <([^>]*?)> [^<]*? </\1> }x )
  {
    while ( m{ <([^>]*?)> [^<]*? </\1> }gx )
    {
      my $token = $1;
      s{<$token>}{}g;
      s{</$token>}{}g;
    }
    $changed = 1;
  }

  else
  {
    while ( m{ </([^>]*?)> }x )
    {
      my $close = $1;
      if ( m{ <($close[^>]*?)> [^<]*? </$close> }x )
      {
        my $open = $1;
        s{<$open>}{}g;
        s{</$close>}{}g;
      }
    }
    $changed = 1;
  }

  return $_, $changed;
}


__DATA__
#0This is a -- string of -- words
<b>1This is a -- string of -- words</b>
2This <b>is a -- string</b> of -- words
#3This <i>is</i> a -- <b>nested tagged string </b> of -- words
#4This <i>is a -- <b>nested set</b> of</i> -- tokens
#5This is -- an awesome -- <A HREF="http://google.com">search engine</
+A>
#6Truly an -- ugly -- <A HREF="http://perl.com"><FONT COLOR="RED">nest
+ed</FONT> string</A>
7<one>Here's -- lotsa</one> tagged words <two>that-- need</two> tag pa
+rity.
#8This string -- <b>causes -- <A HREF="http://perl.com"><FONT COLOR="R
+ED">my box</b></FONT> to hang</A>
[download]

blyman

setenv EXINIT 'set noai ts=2'

[reply]
[d/l]