bxjoh has asked for the wisdom of the Perl Monks concerning the following question:

Ok there has to be an easier/faster way to extract this data. I have 29 fields and am interested in grabbing the 15th and 16th fields from the line. This way works but is sooo slow.

($junk1,$clliA,$clliZ) = /^(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)$/;

$clliA and $clliZ are what I am interested in. Any ideas???? Thx

Replies are listed 'Best First'.
Re: There has to be an easier way...
by btrott (Parson) on Aug 16, 2000 at 01:00 UTC
    Did you try split?
    my($clliA, $clliZ) = (split /\|/)[14,15];
    You may need to adjust those indices a bit, because I didn't know whether you meant columns 15 and 16 (starting from 1), or what.
      I know that the split would work but I split the line later. At this point in the program I am just trying to pull Those 2 fields from the line so that I can pass those fields and the line into my sub-routine. I am trying to do it without having to split the line a bunch of times. Thx though
        Not only does split not damage your data, it is much more efficient than a regex. And if you tell it exactly which elements you are looking for, it gets even more efficient.

        Trust the monks, re-read the man page for split and use it.

        oops well your way is quite a bit faster..... Thx man
Re: There has to be an easier way...
by Shendal (Hermit) on Aug 16, 2000 at 01:01 UTC
    How about something like this:
    #!/usr/bin/perl -w
    use strict;
    my($str) = 'one|two|three|four|five|six|seven|eight|nine|ten';
    my($this,$that,$other) = (split /\|/,$str)[2,7,8];
    print "this:$this\nthat:$that\nother:$other\n";
    Hope that helps,
    Shendal
Re: There has to be an easier way...
by BlaisePascal (Monk) on Aug 16, 2000 at 01:14 UTC
    Your data is delimited by "|"? How about:
    ($clliA,$clliB) = (split /\|/,$source,17)[14,15];
    This splits $source on | into at most 17 pieces and takes the 15th and 16th elements of the resulting list. The pattern has to be /\|/ because a bare | is regex alternation, and the limit is 17 rather than 16 so that the 16th field does not end up holding the rest of the line.
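    A small sketch of how the limit argument to split behaves, in case it helps. The 20-field record and its f1..f20 field names are made up for illustration; only split itself comes from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample record with 20 pipe-separated fields: f1|f2|...|f20
my $source = join '|', map { "f$_" } 1 .. 20;

# Limit 16: the 16th piece keeps everything left over, pipes included.
my @limit16 = split /\|/, $source, 16;

# Limit 17: pieces 1..16 are clean fields; the remainder lands in piece 17.
my @limit17 = split /\|/, $source, 17;

print "limit 16, last piece: $limit16[15]\n";
print "limit 17, 15th and 16th: $limit17[14] $limit17[15]\n";
```

    With a limit, split stops scanning once it has produced that many pieces, which is where the saving on a 29-field line would come from.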

    Update I guess everyone thought of split...and are faster typers!

    Your use of .* is particularly inefficient. Going with a simpler 3-field case: /^.*A.*A.*$/, it would match "shAzAm" by:

    match "shAzAm" with .*, can't find A, backtrack.
    match "shAzA" with .*, can't find A, backtrack.
    match "shAz" with .*
    match "shAzA" with .*A
    match "shAzAm" with .*A.*, can't find A, backtrack
    match "shAzA" with .*A.*, can't find A, backtrack
    match "shA" with .*, can't find A, backtrack
    match "sh" with .*
    match "shA" with .*A
    match "shAzAm" with .*A.*, can't find A, backtrack,
    and so forth. Imagine that with 16 .*A combinations, like you had.
Re: There has to be an easier way...
by turnstep (Parson) on Aug 16, 2000 at 01:21 UTC

    Also, assuming that the fields in between the | characters do not themselves have |'s in them, you could keep it as a regex by writing:

    if (m#([^\|]+)\|([^\|]+)$#) {
        $a=$1; $b=$2;
    } else {
        chomp; die "Bad line: $_\n";
    }
    Note that this is better than just saying
    ($a,$b,$c) = m/(your)(regex)(here)/;
    because with the plain assignment, a line that does not fit your idea of what should be there (in other words, if the regex fails) silently puts nothing into the variables on the left.

    I'd use the split myself, but this shows another way to do it, and it checks the data a little, too. Although it does *not* grab the 14th and 15th column, but merely the last two. This could be a bug or a feature: your data, your call. :)
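    One possible way to combine the list assignment with the check is sketched below. The helper name last_two and the two sample lines are invented for the example, and $x/$y are used instead of $a/$b because those two are special to sort.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last two pipe-separated fields,
# or an empty list when the line does not have that shape.
sub last_two {
    my ($line) = @_;
    chomp $line;
    return $line =~ /([^|]+)\|([^|]+)$/ ? ($1, $2) : ();
}

for my $line ("a|b|c|d\n", "no pipes here\n") {
    # A list assignment in boolean context yields the number of values
    # on the right-hand side, so a failed match makes the test false.
    if (my ($x, $y) = last_two($line)) {
        print "got $x and $y\n";
    }
    else {
        print "Bad line: $line";
    }
}
```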

Re: There has to be an easier way...
by ar0n (Priest) on Aug 16, 2000 at 01:02 UTC
    my @fields = split /\|/, $_;
    my($clliA,$clliB) = @fields[14,15];

    update: sorry, didn't mean to be redundant. my fellow monks are so fast! please ignore me.

    -- ar0n || Just Another Perl Joe

Re: There has to be an easier way...
by ferrency (Deacon) on Aug 16, 2000 at 01:06 UTC
    You probably want to be using split. This takes a string and splits it into an array of strings, breaking it up on a specified delimiter. For example:

    my $data = "f1|f2|f3|f4|f5";
    my @data = split /\|/, $data;
    print $data[2], "\n"; # prints f3, of course- arrays start on 0.
    Or, using array slices:

    my $data = "f1|f2|f3|f4|f5";
    my ($clliA, $clliZ) = (split /\|/, $data)[1,3];
    # $clliA = "f2"
    # $clliZ = "f4"
    Fun fun fun...

    Alan

    update: Yikes, everyone beat me to it... oops.

RE (tilly) 1: There has to be an easier way...
by tilly (Archbishop) on Aug 17, 2000 at 03:47 UTC
    Everyone else has said what the preferred solution is.

    But nobody has explained why what you did was so slow.

    Full details are in Mastering Regular Expressions. However the basic theory is that Perl does a recursive search for ways to try to match your pattern to the string. The match goes from left to right in the pattern and the string. So it first tries to match the first (.*) to the end of the string. Well then it fails to get the pipe. So it backs off and tries again. And it turns out that you are doing a scenario where there are a lot of wrong partial matches you have to try first.

    If you change all of the (.*)s to (.*?)s then the RE would be faster. It would be safer still to change them to ([^\|]*)s. Split is even faster, but as you learn REs keep in mind the principle that ambiguity in the RE can result in unexpected slowdowns...
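    A minimal sketch of the negated-character-class version, assuming a made-up 29-field record (f1..f29) and that no field contains a | itself. The {14} repeat count is just a compact way to skip the first 14 fields.

```perl
use strict;
use warnings;

# Hypothetical 29-field record: f1|f2|...|f29
my $line = join '|', map { "f$_" } 1 .. 29;

# Skip 14 complete fields, then capture fields 15 and 16.
# [^|]* can never run past a pipe, so the engine has far fewer
# alternatives to try than with (.*).
my $re = qr/^(?:[^|]*\|){14}([^|]*)\|([^|]*)/;

if (my ($clliA, $clliZ) = $line =~ $re) {
    print "clliA=$clliA clliZ=$clliZ\n";
}
else {
    print "line has fewer than 16 fields\n";
}
```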

    Cheers,
    Ben

    PS Style point. Split your data into data structures early and then access the data structures directly rather than using formatted strings. In the long run I have found that to be faster, safer, and simpler.
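    As a rough illustration of that style point: split once, then hand the subroutine the fields it needs plus a reference to the whole array instead of the raw line. The subroutine process_record and the f1..f29 record are hypothetical stand-ins for the original poster's own code and data.

```perl
use strict;
use warnings;

# Hypothetical subroutine standing in for the poster's own: it receives
# the two CLLI fields plus a reference to the already-split record.
sub process_record {
    my ($clliA, $clliZ, $fields) = @_;
    return "$clliA -> $clliZ (" . scalar(@$fields) . " fields)";
}

# Hypothetical 29-field record: f1|f2|...|f29
my $line = join '|', map { "f$_" } 1 .. 29;

# Split once; a limit of -1 keeps trailing empty fields, so a record
# that ends in empty columns still yields all 29 entries.
my @fields = split /\|/, $line, -1;

my $summary = process_record(@fields[14, 15], \@fields);
print "$summary\n";
```

    Passing \@fields means the line is split a single time, however many subroutines need to look at it afterwards.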