tgolf4fun has asked for the wisdom of the Perl Monks concerning the following question:

Hello oh knowledge masters. You guys have helped me along the way to learning about PERL, so I find myself stuck again. I have a large file that I read into an array, and I need to strip out information that is not needed. Now the first 16 characters of the data I want to keep (for every record), and then after that I want to keep 30 characters before and 30 characters after another set string.

I wrote the regex like this:
foreach $i(@n){
    $i =~s/\s//gm;
}
foreach $i(@n){
    unless ($i =~ /^\d{16}/)
    else ($i =~ /.{30}Some search string.{30}/)
    $i =~ s/.//gm;}
}
The first part is for clean up, the second part is where I am having the problems. After my matches are found then everything else should just be deleted. Am I at least in the right neighborhood?

Please help.

janitored by ybiC: Minor format cleanup, including balanced <code> tags around snippet

Replies are listed 'Best First'.
Re: regex man they are tough
by Transient (Hermit) on Apr 28, 2005 at 16:01 UTC
    Is this search string a regexp or a normal string?

    If it is a normal string:
    foreach $line ( @data ) {
        $line =~ s/ //gm;
        $first_part = substr($line, 0, 16); # get first 16
        $idx = index( $line, "Search String", 16 );
        $second_start = $idx > 30 ? $idx - 30 : 0;
        if ( $idx != -1 ) {
            $second_part = substr( $line, $second_start, 30 );
            $third_part = substr( $line, $idx+length("Search String"), 30 );
        } else {
            $second_part = $third_part = "";
        }
        $line = "";
    }
    (untested)

    Update: Added some bounds checking to avoid errors
      The search string is a normal string, sorry about that, but this is giving me ideas of which way to go. Trying out
      I prefer to use regexps instead of your example due to brevity of handling the failure cases. Even so, you can tweak yours to be a bit more resilient.

      This bit doesn't take into account not having 30 characters before the search string:

      $idx = index( $line, "Search String", 16 );
      Since we know there must be 30 bytes of data before the Search String, do this:
      $idx = index($line, "Search String", 16+30);
      The same idea is true for the "third_part" bit in your code and isn't hard to handle.
      where does the scalar $index come from in your code above?
        The scalar $idx contains the return value from the builtin function index.
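    A minimal sketch of the substr/index approach with both bounds checks applied (untested against real data). The helper name keep_parts and the sample record are made up for illustration, and it assumes the search string itself is dropped from the result, since the substr calls above leave it out; adjust if you want to keep it.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: keep the first 16 characters plus the 30 characters
    # on each side of a literal search string. Returns nothing (undef in scalar
    # context) when the line does not hold enough data around the search string.
    sub keep_parts {
        my ($line, $search) = @_;
        return if length($line) < 16;
        my $head = substr($line, 0, 16);

        # Start looking at 16+30 so the 30 preceding characters cannot
        # overlap the 16-character header.
        my $idx = index($line, $search, 16 + 30);
        return if $idx == -1;

        # Same idea on the other side: require 30 characters after the match.
        my $after_start = $idx + length $search;
        return if length($line) < $after_start + 30;

        my $before = substr($line, $idx - 30, 30);
        my $after  = substr($line, $after_start, 30);
        return $head . $before . $after;
    }

    # Made-up sample record: 16 digits, filler, 30 a's, the string, 30 b's, a tail.
    my $line = '1234567890123456' . ('x' x 5) . ('a' x 30)
             . 'Search String' . ('b' x 30) . 'tail';
    my $kept = keep_parts($line, 'Search String');
    print "$kept\n";
    ```

    Because index works on literal strings, nothing in the search string needs escaping here.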
Re: regex man they are tough
by cowboy (Friar) on Apr 28, 2005 at 16:08 UTC

    You should insert some code tags to make it easier to read. Other than that, your logic seems messed up here.

    You might try capturing what you want to keep in a s/// substitution, like below (untested):

    foreach my $i (@n) {
        # strip any white space
        $i =~ s/\s//gm;    # is the /m multi-line needed?
        # this checks for beginning with 16 digits, plus a matching string.
        # this will not match unless both the digits, and the string (plus 30 chars on each end)
        # are in your string.
        $i =~ s/^(\d{16}).*?(.{30}Some search string.{30})/$1$2/gm;    # again, is the /m needed?
    }
    I would highly recommend reading a tutorial, such as
    perldoc perlrequick
    perldoc perlretut

    Update: fixed typo in my regex

      Might want to make the last group optional, as it should still match for the first 16.
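      One way that could look, as a rough sketch rather than a tested fix: wrap the second part in an optional non-capturing group and rebuild the record from the captures. The sample records are invented, and \Q...\E is used on the assumption that the search string is a literal rather than a pattern.

      ```perl
      use strict;
      use warnings;

      my $search = 'Some search string';   # placeholder literal
      my @n = (
          # made-up record that contains the string with 30 chars on each side
          '1234567890123456' . 'junk' . ('a' x 30) . $search . ('b' x 30) . 'more junk',
          # made-up record that only has the 16 leading digits
          '6543210987654321' . 'no match in this record',
      );

      for my $i (@n) {
          # $1 = leading 16 digits; $2 = 30 chars + search string + 30 chars,
          # or undef when that part is missing.
          if ( $i =~ /^(\d{16})(?:.*?(.{30}\Q$search\E.{30}))?/ ) {
              $i = $1 . ( defined $2 ? $2 : '' );
          }
      }
      print "$_\n" for @n;
      ```

      Assigning the captures back, instead of substituting, also throws away whatever follows the last 30 characters.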
Re: regex man they are tough
by tlm (Prior) on Apr 28, 2005 at 17:11 UTC

    Is this what you want?

    foreach my $i (@n) {
        my ( $first_sixteen ) = $i =~ /^\s*(\d{16})/;
        my ( $pre, $post ) = $i =~ /(.{30})Some search string(.{30})/;
        # parens around warn's argument, so that next is not swallowed
        # into warn's argument list and run before the warning prints
        warn( "something not right with $i\n" ), next
            if grep !defined $_, $first_sixteen, $pre, $post;
        # do something with $first_sixteen, $pre, $post
    }

    the lowliest monk

Re: regex man they are tough
by blazar (Canon) on Apr 28, 2005 at 16:29 UTC
    Hello oh knowledge masters. You guys have helped me along the way to learning about PERL, so I find myself stuck again. I have a large file that I read into an array, and I need to strip out information that is not needed. Now the first 16 characters of the data I want to keep (for every record), and then after that I want to keep 30 characters before and 30 characters after another set string.
    Regexen are not only tough, they're cool too, which is probably the reason why you want to use them in the first place. That said, and considering that I've not read in detail the description of your problem nor your attempt, just take into account that:
    • It's {sometimes,often} convenient to use two or more regexen rather than a single one (although trying to do it with the latter can be tempting or even appealing),
    • For fixed length substring extraction it's convenient to take a look at substr and unpack as well.
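
    For instance, the fixed 16-character prefix can be split off without any regex. This is only a sketch with an invented record; the 'a16 a*' template is my choice here (lowercase a keeps the data exactly as is, with no stripping of trailing spaces).

    ```perl
    use strict;
    use warnings;

    my $record = '1234567890123456REST OF THE RECORD';   # made-up sample

    # unpack: first 16 characters, then everything that is left
    my ( $head, $rest ) = unpack 'a16 a*', $record;

    # substr gives the same prefix
    my $same_head = substr( $record, 0, 16 );

    print "$head\n$rest\n";
    ```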

    Update: Incidentally, there's no such thing as PERL. See: perldoc -q 'difference between "perl" and "Perl"'.

Re: regex man they are tough
by tlm (Prior) on Apr 28, 2005 at 17:34 UTC

    It occurred to me that the snippet I posted, even if it does what you want, probably doesn't tell you much about why yours is not working, so here are just a couple of comments on your regexps. The /m modifier is useful only if you are matching a string that contains multiple lines; it tells perl to match ^ and $ at the beginnings and ends of lines. Study this example and you will see what I mean:

    use strict;
    use warnings;

    my $string = "foo\nbar\nbaz\n";
    print "1st match ", $string =~ /^bar/  ? "succeeded\n" : "failed\n";
    print "2nd match ", $string =~ /^bar/m ? "succeeded\n" : "failed\n";
    print "3rd match ", $string =~ /bar$/  ? "succeeded\n" : "failed\n";
    print "4th match ", $string =~ /bar$/m ? "succeeded\n" : "failed\n";
    __END__
    1st match failed
    2nd match succeeded
    3rd match failed
    4th match succeeded

    Next, the /g modifier makes sense only if you want to match the same regexp multiple times in the same string. For example, using the same string as for the example above:

    while ( $string =~ /(a\w+)/g ) {
        print "$1\n";
    }
    __END__
    ar
    az
    Lastly, the expression $i =~ s/.//gm simply sets $i to the empty string (in this case the /m modifier does nothing; you'd get the same result without it). I don't think this gets you anything, but if that's what you wanted to do, it is simpler to just assign the empty string: $i = ''.
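
    One small caveat, shown as a sketch with a made-up string: "." does not match a newline unless /s is given, so s/.//g empties the string only when it holds no newlines (which is the case here, since the earlier s/\s//g already removed them).

    ```perl
    use strict;
    use warnings;

    my $s = "foo\nbar\n";
    ( my $without_s = $s ) =~ s/.//g;     # "." skips newlines, so "\n\n" is left
    ( my $with_s    = $s ) =~ s/.//gs;    # /s lets "." match newlines, nothing is left

    printf "without /s: %d chars left, with /s: %d chars left\n",
        length($without_s), length($with_s);
    ```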

    Update: Corrected sloppy wording. Thanks to Roy Johnson.

    the lowliest monk

      the /g modifier makes sense only if you are matching the same string multiple times
      Not entirely true. In scalar context (such as when it's the conditional of a while), it's about repeated matching. In list context, it will do the global match and return all the captures at once.
      $_ = 'pig dog goggles';
      my @hits = /g/g;
      print "@hits\n";
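
      The same goes for patterns with captures. A short sketch, reusing the "foo\nbar\nbaz\n" string from the post above; the variable names are mine:

      ```perl
      use strict;
      use warnings;

      my $string = "foo\nbar\nbaz\n";

      # list context: every capture comes back in one go, no while loop needed
      my @hits = $string =~ /(a\w+)/g;

      # assigning through an empty list counts the matches
      my $count = () = $string =~ /a\w+/g;

      print "@hits ($count matches)\n";
      ```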

      Caution: Contents may have been coded under pressure.