Kage has asked for the wisdom of the Perl Monks concerning the following question:

I have news data that is like this:

<a name="36561357542"></a><!--start--> Data... <!--end-->

Every news post is pushed together, so that, if the <a name......</a> crap were removed, the <!--end--> and <!--start--> would be end-to-end, and I could easily split with that.
However, that is not the case. I need to be able to split on <!--end--><a name="(.+)"></a><!--start--> while retaining the data in the (.+) and put each (.+) back into an <a name..</a> at the beginning of each array value.

How?
“A script is what you give the actors. A program is what you give the audience.” ~ Larry Wall

Replies are listed 'Best First'.
Re: Split with data keep
by fruiture (Curate) on Nov 30, 2002 at 11:39 UTC

    I'm not sure I've understood your problem. Could it be solved by using capturing parentheses in the split// regexp?

    @stuff = split m{ <!--end--> <a \s name=" ( \d+ ) "></a> <!--start--> }x => $data;

    See `perldoc -f split` for what happens to the text captured by the parentheses: it is returned in the list, in between the fields.
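
    To make that concrete, here is a minimal sketch of what split hands back and how the pieces might be glued together again so that every array value starts with its own anchor. The sample $data and the names @parts and @posts are made up for illustration, and it assumes the name values are all digits, as in the example.

```perl
use strict;
use warnings;

# Made-up sample data: three posts pushed together, as in the question.
my $data = '<a name="111"></a><!--start-->first<!--end-->'
         . '<a name="222"></a><!--start-->second<!--end-->'
         . '<a name="333"></a><!--start-->third<!--end-->';

# Capturing parentheses in the pattern make split return the captured
# text as extra list elements between the fields:
#   (first record, "222", second body, "333", third body)
my @parts = split m{ <!--end--> <a \s name=" ( \d+ ) "></a> <!--start--> }x, $data;

# The first element still carries its own anchor and <!--start-->;
# give it back the <!--end--> that split consumed.
my @posts = ( shift(@parts) . '<!--end-->' );

# The rest arrive as (name, body) pairs. The last body still has its
# original trailing <!--end-->, so only add one where it is missing.
while ( my ( $name, $body ) = splice @parts, 0, 2 ) {
    $body .= '<!--end-->' unless $body =~ /<!--end-->\z/;
    push @posts, qq{<a name="$name"></a><!--start-->$body};
}

print "$_\n" for @posts;
```

    Each element of @posts is then a complete record again, anchor first.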

    Another option is to use a parsing while(REGEXP):

    while ( $data =~ m{ \G <a \s+ name = " (\d+) "></a> <!--start--> ( .+? ) <!--end--> }xgs ) {
        my $number = $1;
        my $body   = $2;
        #...
    }
    --
    http://fruiture.de
Re: Split with data keep
by dws (Chancellor) on Nov 30, 2002 at 18:22 UTC
    split() is powerful, but it isn't the only tool in the bag. If what you're after is either "Data..." or the named anchor, a better way to approach the problem might be to first isolate the text within the start and end tags, and then decide what to do with it. Assuming the text is in $text and can span several lines, something like the following should do the trick:
    while ( $text =~ m/<!--start-->(.*?)<!--end-->/gs ) {
        my $chunk = $1;
        if ( $chunk =~ m{<a name="(.+?)"></a>} ) {
            # do something with $1
        }
        else {
            # do something with $chunk
        }
    }
Re: Split with data keep
by rir (Vicar) on Dec 01, 2002 at 06:12 UTC
    It is not completely clear what you wish to extract. This will extract the variable parts. It assumes a name value may not contain a double-quote. If that is not correct, match on the quote and the following tag, like the second half of the regex.
    #!/usr/bin/perl
    use strict;
    use warnings;

    $_ = q|<a name="a name"></a><!--start-->some stuff<!--end-->|
       . q|<a name="Mae B Arthur"></a><!--start-->various text<!--end-->|
       . q|<a name="36561357542"></a><!--start-->What kind of #'s that<!--end-->|
       . q|<a name="aafq0w4tyu89[ "></a><!--start-->aeo;utrq[134[ a<!--end-->|
       ;

    while ( m|<a name="([^"]+?)"></a>(?:<!--start-->(.*?)(<!--end-->)+?)|sgc ) {
        print "name: |$1|\n" . "art: |$2|\n\n";
    }
    __DATA__
    name: |a name|
    art: |some stuff|

    name: |Mae B Arthur|
    art: |various text|

    name: |36561357542|
    art: |What kind of #'s that|

    name: |aafq0w4tyu89[ |
    art: |aeo;utrq[134[ a|
Re: Split with data keep
by dbp (Pilgrim) on Dec 01, 2002 at 00:53 UTC
    Given that you've read your file into a scalar:
    my (@articles) = ($text =~ /(<a name.*?<!--start-->.*?<!--end-->)/gs);

    Of course, this does two stingy (non-greedy) matches in one pattern, which may well be godawful slow on a large file.

    Update: Or use a hash instead

    my (%hash) = ($text =~ /(<a name.*?)(<!--start-->.*?<!--end-->)/gs);
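
    A small sketch of how such a hash might be used, under a couple of assumptions: the key captured here is only the name value rather than the whole anchor tag, and the sample $text and the name %post_for are invented for illustration. A hash does not remember insertion order, so the keys are sorted before printing; if the original order of the posts matters, the array version is probably the safer choice.

```perl
use strict;
use warnings;

# Invented sample text in the format from the question.
my $text = '<a name="111"></a><!--start-->first<!--end-->'
         . '<a name="222"></a><!--start-->second<!--end-->';

# Two captures per match under /g give a flat (key, value, key, value, ...)
# list, which is what a hash assignment expects.
my %post_for = $text =~ m{<a name="([^"]+)"></a><!--start-->(.*?)<!--end-->}gs;

# Rebuild each record with its anchor at the front.
for my $name ( sort keys %post_for ) {
    print qq{<a name="$name"></a><!--start-->$post_for{$name}<!--end-->\n};
}
```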