GrandFather has asked for the wisdom of the Perl Monks concerning the following question:

Oh venerable Monks, I'm writing an HTML-to-TWiki converter to ease the porting of some documentation written in Word to a TWiki-based wiki. I'm about 90% there, but my brain is starting to seize. My current stumbling block is:

I have lines containing anchors like <a name="_Toc00123998"> that need to be moved to the start of the line for later processing. There may be many of these anchor tags per line. They may, or may not, be adjacent to each other.

Can anyone come up with a regex to do this trick? Before and after would look like:

Before:
This is an anchor <a name="_Toc00123998">and these are another<a name="_Toc00123999"><a name="_Toc00124000"> couple.

After:
<a name="_Toc00123998"><a name="_Toc00123999"><a name="_Toc00124000">This is an anchor and these are another couple.

Update 1: Arggh. Fix the missing quotes!

Re: Moving sub-strings to the start of a line
by tlm (Prior) on Jun 01, 2005 at 01:02 UTC

    I would do something like

    my @anchors;
    # Remove each anchor in turn, collecting them in order of appearance.
    push @anchors, $1 while $html =~ s/(<a .*?>)//s;
    $html = join '', @anchors, $html;

    the lowliest monk

      Oh yes. I like that!

      My first question here, and I get wonderful answers almost before I finished asking.
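
The substitute-and-collect loop above can be wrapped in a small subroutine. This is only a sketch: the name `hoist_anchors` is hypothetical, the sample string is made up, and the pattern is narrowed to `<a name=...>` on the assumption that no other `<a ...>` tags should move.

```perl
use strict;
use warnings;

# Hypothetical helper: move every <a name=...> tag to the front of the
# string, keeping the tags in their original order.
sub hoist_anchors {
    my ($line) = @_;
    my @anchors;
    # Each successful s/// deletes the leftmost remaining anchor and
    # leaves it in $1; the loop ends when no anchor is left.
    push @anchors, $1 while $line =~ s/(<a name=[^>]*>)//;
    return join '', @anchors, $line;
}

# Made-up sample input, shorter than the one in the question.
print hoist_anchors('x <a name="a">y<a name="b"><a name="c"> z'), "\n";
```

Any whitespace that surrounded a removed anchor stays in the text, so a line with a space on both sides of an anchor will end up with a double space.
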
Re: Moving sub-strings to the start of a line
by gube (Parson) on Jun 01, 2005 at 01:03 UTC

    Hi, try this:

    $a = 'This is an anchor <a name=_Toc00123998">and these are another<a name=_Toc00123999"><a name=_Toc00124000"> couple.';
    (@a) = $a =~ m#(<.*?>)#gsi;   # collect every tag, in order
    local $" = "";                # interpolate arrays with no separator
    $a =~ s#(<.*?>)##gsi;         # strip the tags from the text
    print "@a" . $a;

    Output:
    <a name=_Toc00123998"><a name=_Toc00123999"><a name=_Toc00124000">This is an anchor and these are another couple.

Re: Moving sub-strings to the start of a line
by Zaxo (Archbishop) on Jun 01, 2005 at 00:57 UTC

    Both those samples look broken to me. What happened to the </a> tags?

    After Compline,
    Zaxo

      The lines have been preprocessed so that only "interesting" stuff remains. It may look like HTML, but it ain't.
Re: Moving sub-strings to the start of a line
by marnanel (Beadle) on Jun 01, 2005 at 01:13 UTC
    How about:

    my $p = 'This is an anchor <a name="_Toc00123998">and these are another<a name=_Toc00123999"><a name=_Toc00124000"> couple.';
    my $prefixes = '';
    # Strip one anchor per iteration, appending it to the prefix string.
    while ($p =~ s/(<a name=[^>]*>)//) {
        $prefixes .= $1;
    }
    $p = "$prefixes$p";
Re: Moving sub-strings to the start of a line
by thundergnat (Deacon) on Jun 01, 2005 at 01:18 UTC

    First of all, those anchors are broken. They are missing an open quote and need either to be self-closed or explicitly closed.

    Second, it is a bad idea in general to try to parse HTML with regexes. HTML is not necessarily regular markup and cannot be reliably parsed with a regular expression.

    However, within those constraints, something like the following should do what you ask with either self-closed or explicitly closed anchors.

    use warnings;
    use strict;

    while (my $line = <DATA>) {
        my @anchors = $line =~ m#<a name="[^>]+?(?:/>|></a>)#g;
        $line =~ s#<a name="[^>]+?(?:/>|></a>)##g;
        print @anchors, $line;
    }
    __DATA__
    Self: This is an anchor <a name="_Toc00123998" />and these are another<a name="_Toc00123999" /><a name="_Toc00124000" /> couple.
    Explicit: This is an anchor <a name="_Toc00123998">no cap</a>and these are another<a name="_Toc00123999"></a><a name="_Toc00124000" /></a>couple.

    Update: modified to not capture anchors with enclosed text.

    Update2: Arggh. Try that one more time
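
One concrete way a regex can trip over markup: `[^>]` stops at the first `>`, even one that sits inside a quoted attribute value. The input below is contrived for illustration and does not come from the original question; it is a sketch of the failure mode, not a claim about the poster's data.

```perl
use strict;
use warnings;

# Contrived input (an assumption, not from the thread): the anchor name
# contains a literal '>'.
my $line = 'text <a name="a>b" />more';

# Same pattern as in the script above.
my @anchors = $line =~ m#<a name="[^>]+?(?:/>|></a>)#g;

# [^>]+? cannot cross the '>' inside the quotes, and what follows that
# '>' is neither '/>' nor '></a>', so the anchor is never matched.
print scalar(@anchors), " anchor(s) found\n";
```
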

Re: Moving sub-strings to the start of a line
by TedPride (Priest) on Jun 01, 2005 at 05:58 UTC
    I'm assuming your anchor format is fairly rigid. If it's not, you will need a more complicated regex.
    use strict;
    use warnings;

    while (<DATA>) {
        @_ = ();
        push @_, $1 while s/(<a name=".*?">)//i;
        $_ = join '', @_, $_;
        print;
    }
    __DATA__
    This is an anchor <a name="_Toc00123998">and these are another <a name="_Toc00123999"><a name="_Toc00124000">couple.
    This more anchors <a name="_Toc00123998">and more anchors <a name="_Toc00124000">.
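
If the anchor format turns out to be looser, the pattern could be relaxed along these lines. This is a sketch under assumed variations (mixed case, whitespace around `=`, single or double quotes, an optional self-closing slash), none of which are confirmed by the original question; `move_anchors` is a hypothetical name.

```perl
use strict;
use warnings;

# Assumed variations: any letter case, optional whitespace around '=',
# single- or double-quoted values, and an optional self-closing slash.
my $anchor = qr{<a\s+name\s*=\s*(?:"[^"]*"|'[^']*')\s*/?>}i;

sub move_anchors {    # hypothetical name
    my ($line) = @_;
    my @found;
    # Delete the leftmost anchor on each pass and remember it in order.
    push @found, $1 while $line =~ s/($anchor)//;
    return join '', @found, $line;
}

# Made-up sample input exercising two of the assumed variations.
print move_anchors(q{one <A NAME = 'x'>two <a name="y" />three}), "\n";
```

Matching each quoted value as a unit, rather than using `.*?` or `[^>]*`, keeps the match from running past the closing quote of the value.
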