PerlMonks  

Regex for simple parsing job

by toadi (Chaplain)
on Jul 27, 2004 at 09:06 UTC ( [id://377687]=perlquestion )

toadi has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have been programming Perl for a while now and still suck at regexes.

My file looks like:

    STARTP
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    ENDP
    STARTP
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    ENDP

I read the following link: "How do I extract all text between two keywords like start and end?". But it didn't make me any wiser.

I want to get everything from STARTP until ENDP, and then cut up the stuff between TITLE and ENDTITLE. But if I do it like the suggested link, I get everything from the first STARTP until the last ENDP. I want to match from the first STARTP until the first ENDP in the file, and then from the next STARTP until the next ENDP. And the same for TITLE and ENDTITLE.

And no there is no recursion in these tags.

thanx



--
My opinions may have changed,
but not the fact that I am right

janitored by ybiC: Retitle from one-word "regex" nodetitle to avoid hindering site searching.   Also converted node link from <a href...> to Monastery style [id://nnnn] to avoid logging out monks with cookie set from different PM domain (perlmonks.(org|net), sans leading "www"...)

Replies are listed 'Best First'.
Re: Regex for simple parsing job
by pbeckingham (Parson) on Jul 27, 2004 at 12:26 UTC

    Wow, a perfect use of the scalar range operator.

    #! /usr/bin/perl -w
    use strict;

    my @titles;
    while (<DATA>)
    {
      if (my $num = /TITLE/ .. /ENDTITLE/)
      {
        push @titles, $_ unless $num == 1 || $num =~ /E/;
      }
    }
    print for @titles;
    __DATA__
    STARTP
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    ENDP
    STARTP
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    ENDP

      Although I like a spicy regex as much as the next guy, I think that the use of the flip-flop operator, as pointed out by pbeckingham and deibyz, is much more elegant and (probably) more efficient, since it allows you to avoid slurping the file that is being parsed.

      The 'scalar range' / 'flip-flop' operator is one of the sweetest pieces of syntactical sugar that Perl offers, if you ask me.
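To show how the flip-flop could handle both levels of tags without slurping, here is a rough sketch that nests one flip-flop inside another. It assumes every tag sits alone on its own line, as in the sample data; the sample titles, the in-memory filehandle and the variable names (@blocks, $p, $t) are made up for illustration and are not from anyone's post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input in the layout from the question (assumption: one tag per line).
my $input = <<'TEXT';
STARTP
TITLE
first title
ENDTITLE
TITLE
second title
ENDTITLE
ENDP
STARTP
TITLE
third title
ENDTITLE
ENDP
TEXT

# Read from an in-memory filehandle so the sketch is self-contained;
# a real script would open the (possibly very large) file instead.
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my @blocks;    # one array ref of titles per STARTP..ENDP block
while (my $line = <$fh>) {
    chomp $line;

    # Outer flip-flop: true from a STARTP line through the next ENDP line.
    if (my $p = $line =~ /^STARTP$/ .. $line =~ /^ENDP$/) {
        push @blocks, [] if $p == 1;    # sequence number 1 = first line of a block

        # Inner flip-flop: true from a TITLE line through the next ENDTITLE line.
        if (my $t = $line =~ /^TITLE$/ .. $line =~ /^ENDTITLE$/) {
            if ($t == 1) {
                push @{ $blocks[-1] }, '';       # TITLE line: start a new title
            }
            elsif ($t !~ /E0$/) {
                $blocks[-1][-1] .= "$line\n";    # body line: append to current title
            }
            # The last line of a range has "E0" appended to its sequence
            # number; that is the ENDTITLE line, which is skipped.
        }
    }
}
close $fh;

for my $i (0 .. $#blocks) {
    print "Block $i: ", scalar @{ $blocks[$i] }, " title(s)\n";
}
```

Only one line is held in memory at a time, which is the point of preferring this over a slurp-and-regex approach for big files.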

      Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
        That works for the first part, from STARTP to ENDP. But as I stated previously, not for the TITLE part, because there is no end tag. Still, I'm going to use the flip-flop operator because the file can be very large.

        Thanx for pointing me to this!!!



        --
        My opinions may have changed,
        but not the fact that I am right

Re: Regex for simple parsing job
by davorg (Chancellor) on Jul 27, 2004 at 09:23 UTC

    Would have been nice to see your code so we could show you where you are going wrong. But this seems to do what you want.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $data = do { local $/; <DATA> };

    my @data = $data =~ /STARTP(.*?)ENDP/sg;

    foreach (@data) {
      my @titles = /TITLE(.*?)ENDTITLE/sg;
      $_ = \@titles;
    }

    foreach my $i (0 .. $#data) {
      print "Block $i\n";
      foreach my $j (0 .. $#{$data[$i]}) {
        print "Title $j:\n$data[$i][$j]\n";
      }
      print "\n";
    }

    __DATA__
    STARTP
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    ENDP
    STARTP
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    ENDP
    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      actually I was still looking for the /sg switch in the regex. That's why my code didn't match anything.

      But thanx for helping.
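For anyone else tripping over the same thing, a small sketch of what the modifiers and the lazy quantifier each contribute may help. The two-block sample string and the variable names are invented for the illustration.

```perl
use strict;
use warnings;

# Made-up sample: two STARTP..ENDP blocks spread over several lines.
my $text = "STARTP\none\nENDP\nSTARTP\ntwo\nENDP\n";

# Without /s, '.' never matches "\n", so a multi-line block cannot match at all.
my @no_s = $text =~ /STARTP(.*?)ENDP/g;

# With /s and a lazy quantifier, each block is matched separately;
# /g returns every match instead of only the first one.
my @lazy = $text =~ /STARTP(.*?)ENDP/sg;

# With /s and a greedy quantifier, a single match runs from the
# first STARTP to the last ENDP.
my @greedy = $text =~ /STARTP(.*)ENDP/sg;

print scalar(@no_s), " ", scalar(@lazy), " ", scalar(@greedy), "\n";   # prints "0 2 1"
```

The greedy variant is the "first STARTP until last ENDP" behaviour described in the question; the `?` after `.*` is what stops each match at the nearest ENDP.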



      --
      My opinions may have changed,
      but not the fact that I am right

      In the file I don't have ENDTITLE as an end tag, just TITLE until the next TITLE until the next TITLE. How do I match that? Update: Seems split does the job :)


      --
      My opinions may have changed,
      but not the fact that I am right

        Won't "split" give you an extra empty title?

        I fixed the second regex like this:

        my @titles = /TITLE(.*?)(?=TITLE|$)/sg;

        Update: regex re-fixed
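A quick comparison of the two approaches, as a sketch only: the $block string below is a made-up stand-in for the contents of one STARTP..ENDP block in which TITLE has no end tag, and the array names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical contents of one block where TITLE has no end tag.
my $block = "TITLE\nfirst\nTITLE\nsecond\nTITLE\nthird\n";

# split keeps an empty leading field because the string starts with the separator.
my @by_split = split /TITLE/, $block;

# Filtering out empty fields is one way around that.
my @cleaned = grep { length } @by_split;

# The lookahead version: capture lazily up to the next TITLE or the end of the string.
my @by_regex = $block =~ /TITLE(.*?)(?=TITLE|$)/sg;

printf "split: %d fields\n", scalar @by_split;    # 4, the first one is empty
printf "grep:  %d fields\n", scalar @cleaned;     # 3
printf "regex: %d fields\n", scalar @by_regex;    # 3
```

One detail to be aware of: without /m, `$` also matches just before a final newline, so the last capture from the regex comes back without its trailing "\n". Using `\z` in place of `$` would keep it.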

        --
        <http://www.dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg

Re: Regex for simple parsing job
by ccn (Vicar) on Jul 27, 2004 at 09:17 UTC

    my $str = <<TEXT;
    STARTP
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    ENDP
    STARTP
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    TITLE
    some gibberish
    some more gibberish
    ENDTITLE
    ENDP
    TEXT

    my @ary;
    foreach my $p ($str =~ /^STARTP\n(.*?)ENDP/msg) {
        my @p;
        foreach my $t ($p =~ /^TITLE\n(.*?)ENDTITLE/msg) {
            push @p, $t;
        }
        push @ary, \@p;
    }

    use Data::Dumper;
    print Dumper(\@ary);

    $VAR1 = [
              [
                'some gibberish
    some more gibberish
    ',
                'some gibberish
    some more gibberish
    ',
                'some gibberish
    some more gibberish
    '
              ],
              [
                'some gibberish
    some more gibberish
    ',
                'some gibberish
    some more gibberish
    ',
                'some gibberish
    some more gibberish
    '
              ]
            ];

    update: a little bug fixed, 'm' modifier added to second regexp, thanks to guha

Re: Regex for simple parsing job
by deibyz (Hermit) on Jul 27, 2004 at 11:26 UTC
Re: Regex for simple parsing job
by Anonymous Monk on Jul 27, 2004 at 09:21 UTC
    $_ = "STARTP ... ENDP";
    @data = map {[/^TITLE\n([^E]*(?:E(?!NDTITLE)[^E]*)*)ENDTITLE/gm]}
            /^STARTP\n([^E]*(?:E(?!NDP)[^E]*)*)ENDP/gm;
