moseley has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Ok, this is your standard type of problem that perl should be good at. I'm trying to see a way to solve it without using HTML::Parser. I've got a string that may or may not include some HTML formatting, and the string is broken up into sections by double-dashed lines (think of breadcrumb links)

This is a -- string of -- words
<b>This is a -- string of -- words</b>
This <b>is a -- string</b> of -- words

So, the first one is plain text, the second is wrapped completely in a tag, and the third has a few words wrapped in the bold tag which spans the first and second sections.

And do note that the bold is just an example, it might be some other type of formatting such as a style tag or a font change or whatever, I do know it has a start tag and an end tag, and that's all.

Also, I don't know how many dashes, if any will be there.

Now, I need to break that into three links (split by the double dashed line), in a kind of "breadcrumb" type of link. So, looking at the third example, and just the first section, by splitting on the double-dash I'm also splitting up the start and end tags, so I'll need to supply the closing tag after the word "a":

This <b>is a -- string</b> of -- words
Would result in (for the first segment):
This <b>is a</b>
I can do this by tokenizing and keeping track of what words are in the tag and what words are not, but I'm wondering how others would attack this problem. Thanks,

Replies are listed 'Best First'.
Re: A nice text processing question
by belden (Friar) on Jan 05, 2002 at 15:07 UTC
    I'm trying to see a way to solve it without using HTML::Parser

    My solution will probably make you rethink that...

    I'd use a regex lookahead to catch and strip paired HTML tags such as bolding (<b>blah</b>) or italics(<i>blah</i>). However, note that my snippet breaks on the final line of DATA, because the lookahead in the regex assumes that the closing HTML tag will just be </$1>

    #!/usr/bin/perl use strict; use warnings; while(<DATA>) { chomp; while ( m{ <([^>]*?)> [^<]*? </\1> }gx ) { my $token = $1; s{<$token>}{}g; s{</$token>}{}g; } # Of course, this will be hard to do if you # "don't know how many dashes, if any, will # be there." print "\t$_\n" for ( split (/-- /,$_) ); print "\n"; } __DATA__ This is a -- string of -- words <b>This is a -- string of -- words</b> This <b>is a -- string</b> of -- words This <i>is</i> a -- <b>string</b> of -- words This <i>is a -- <b>nested set</b> of</i> -- tokens This is -- a nifty -- <A HREF="http://google.com">search engine</A>
    Update: Tweaks to make above script handle unpaired open/close tags, such as
    <A HREF="http://google.com">Google</A>

    This can't handle paired and unpaired tags in the same line (see last line of data, which causes script to hang, hence the # and the skip condition)

    #!/usr/bin/perl use strict; use warnings; while(<DATA>) { /^#/ and next; chomp; if ( m{ <([^>]*?)> [^<]*? </\1> }x ) { while ( m{ <([^>]*?)> [^<]*? </\1> }gx ) { my $token = $1; # Some verbose info. Note first line doesn't # get printed because it doesn't match regex print $_, "\n"; print "Found <$token> and </$token>, removing...\n"; s{<$token>}{}g; s{</$token>}{}g; } } else { # <A HREF="http://google.com">search engine</A> while ( m{ </([^>]*?)> }x ) { my $close = $1; if ( m{ <($close[^>]*?)> [^<]*? </$close> }x ) { my $open = $1; print $_,"\n"; print "Found <$open> and </$close>, removing\n"; s{<$open>}{}g; s{</$close>}{}g; print $_,"\n"; } } } # Of course, this will be hard to do if you # "don't know how many dashes, if any, will # be there." print "\t$_\n" for ( split (/-- /,$_) ); print "\n"; } __DATA__ This is a -- string of -- words <b>This is a -- string of -- words</b> This <b>is a -- string</b> of -- words This <i>is</i> a -- <b>string</b> of -- words This <i>is a -- <b>nested set</b> of</i> -- tokens This is -- an awesome -- <A HREF="http://google.com">search engine</A> Truly an -- ugly -- <A HREF="http://perl.com"><FONT COLOR="RED">nested +</FONT> st ring</A> #This string -- <b>causes -- <A HREF="http://perl.com"><FONT COLOR="RE +D">my box</ b></FONT> to hang</A>
    Urgh. HTML::Parser really is your friend here. btw I wanted to comment my regexes but found myself unable to adequately describe them.

    blyman
    setenv EXINIT 'set noai ts=2'

      Well, I'm starting to believe that HTML::Parser is the way to go. I was trying to avoid it for size reasons (running under mod_perl), and since I'll have many of these to do I'd like the fastest way possible.

      Looking at your code I'm not sure you understood.

      I'm not trying to remove the tags. Rather imagine a long string of text that may or may not have some words (or group of words) bolded or marked up in some way.

      Now, what I'm then doing is splitting it up into chunks, which may end up splitting a tagged words. So one chunk may have the opening tag, where another tag may have the closing tag. Or it might get split in the middle of two tags, so that a given chunk might have the *closing* tag from the previous chunk, and the *opening* tag that's not closed until the next chunk.

      In other words:

      Starting text:

      <tag>This is a -- bunch</tag> of words <tag>where maybe -- some have</tag> tags.

      Splitting on the double dash:

      <tag>This is a
      bunch</tag> of words <tag>where maybe
      some have</tag> tags.

      Which should then be corrected to:

      <tag>This is a</tag>
      <tag>bunch</tag> of words <tag>where maybe</tag>
      <tag>some have</tag> tags.

      I might check on the lwp list, too, since I'll probably move to HTML::Parser.

        Bah. Here's a version that kind of does what you want. You're still much
        better off with HTML::Parser - hopefully you'll go far with that.

        #!/usr/bin/perl use strict; use warnings; while(<DATA>) { /^#/ and next; chomp; $_ = join ("\n", split (/-- /, $_ ) ); $_ = &tagparity($_); print $_,"\n\n"; # my $no_tags_stripped; # do { $no_tags_stripped = &tagstrip($_) } unless $no_tags_stripped; # print $_,"\n"; } exit; #<one>Here's -- lotsa</one> tagged words <two>that-- need</two>parity. # This is a pretty stupid sub that will break very easily - lines # 0, 3, 4, 5, 6, and 8 in DATA are mangled beyond belief by this sub sub tagparity() { my $string = shift; my @tags; my @temp; my %pairs; # Find pairs of tags and mate 'em up while ( $string =~ / <([^>]*?)> /gx ) { push(@temp, $1); @temp %2 == 0 or next; $pairs{$temp[0]} = $temp[1]; $pairs{$temp[1]} = $temp[0]; push(@tags, @temp); @temp = (); }; # Try to insert mates my $tag_regex = join('|',@tags); for my $stringbit( split(/\n/, $string ) ) { while ( $stringbit =~ /($tag_regex)/g ) { my $match = $1; my $mate = $pairs{$match}; # If we've matched a closing tag, insert an opening # tag, and vice versa. This is the part that mangles # lines 0, 3, 4, 5, 6, and 8 in DATA if($match =~ m{^/}) { $string =~ s/$stringbit/<$mate>$stringbit/ } else { $string =~ s/$stringbit/$stringbit<$mate>/ } } } return $string; } sub tagstrip() { my $changed = undef; if ( m{ <([^>]*?)> [^<]*? </\1> }x ) { while ( m{ <([^>]*?)> [^<]*? </\1> }gx ) { my $token = $1; s{<$token>}{}g; s{</$token>}{}g; } $changed = 1; } else { while ( m{ </([^>]*?)> }x ) { my $close = $1; if ( m{ <($close[^>]*?)> [^<]*? </$close> }x ) { my $open = $1; s{<$open>}{}g; s{</$close>}{}g; } } $changed = 1; } return $_, $changed; } __DATA__ #0This is a -- string of -- words <b>1This is a -- string of -- words</b> 2This <b>is a -- string</b> of -- words #3This <i>is</i> a -- <b>nested tagged string </b> of -- words #4This <i>is a -- <b>nested set</b> of</i> -- tokens #5This is -- an awesome -- <A HREF="http://google.com">search engine</ +A> #6Truly an -- ugly -- <A HREF="http://perl.com"><FONT COLOR="RED">nest +ed</FONT> string</A> 7<one>Here's -- lotsa</one> tagged words <two>that-- need</two> tag pa +rity. #8This string -- <b>causes -- <A HREF="http://perl.com"><FONT COLOR="R +ED">my box</b></FONT> to hang</A>
        blyman
        setenv EXINIT 'set noai ts=2'
      Very interesting solution. However, you don't solve the problem as stated. You are able to strip out all the tags, but that wasn't the needed solution. The solution was to break up the breadcrumbs, then put together any tags that were supposed to be pairs, but now were broken up.

      I haven't really used HTML::Parser very much, but I don't think it's as much a friend as you might think. The problem should've been stated as such:

      1. I need to split on some number of dashes
      2. foreach string created this way, I need to fill in the tags that might have been orphaned
      while (<DATA>) { my @tokens = split /-+/, $_; foreach my $token (@tokens) { $token = do_html_balancing($token); do_something_else_with_token($token); } }

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.