jlope043 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, I am having a little trouble with my find: I am unable to get it to match the characters I am looking for at a certain column, column 32, in each row. If someone can point out my mistake, I would greatly appreciate it.

use strict;

my $find = qr/^(?:H0|HT)/;

open (NEW, ">", "OUTPUT.txt" ) or die "could not open:$!";
open (FILE, "<", "INPUT.txt") or die "could not open:$!";

while (<FILE>) {
    print NEW if (/^\s{32}\S\s$find/);
}

close (FILE);
close (NEW);

input.txt example

NAME PT # AT
DOE HT00000000 I
DOE HT00000000 S
DOE HT00000000 I
SMITH HT00000000 M
DOE HT00000000 I
DOE HT00000000 I
DOE H000000000 I
DOE H000000000 O
SMITH H000000000 I

expected output.txt

HT00000000 I
HT00000000 S
HT00000000 I
HT00000000 M
HT00000000 I
HT00000000 I
H000000000 I
H000000000 O
H000000000 I

Re: Find Not Working
by davido (Cardinal) on Jun 02, 2016 at 23:49 UTC

    This regular expression:

    m/^\s{32}\S\s$find/

    ...can match only at the beginning of a string because of the ^ metacharacter -- an anchor that matches only at the start of a string, or a line if the /m modifier is in use.

    This subpattern:

    qr/^(?:H0|HT)/

    ...can match only at the beginning of a string because it starts with the ^ metacharacter. But you are embedding $find at a position within the consuming pattern that cannot be at the beginning of the string. Consequently there is no string that could match.

    At minimum, you probably should remove the ^ metacharacter from the embedded subpattern.

    Also, this: /(?:H0|HT)/ might be more clearly written as /H[0T]/.
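    The effect of the embedded anchor can be checked with a small sketch. The sample line here is hypothetical (32 spaces, one non-space character, one space, then the field), built only so that the outer pattern has something to match; it is not the OP's data, and `matches` is a helper invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line: 32 spaces, one non-space, one space, then the field.
my $line = (' ' x 32) . 'X HT00000000 I';

my $anchored   = qr/^(?:H0|HT)/;  # the original subpattern, with its own ^
my $unanchored = qr/H[0T]/;       # anchor removed, alternation as a character class

# Embed a subpattern in the outer pattern and report 1 (match) or 0 (no match).
sub matches {
    my ($sub) = @_;
    return $line =~ /^\s{32}\S\s$sub/ ? 1 : 0;
}

print "anchored:   ", matches($anchored),   "\n";  # prints 0
print "unanchored: ", matches($unanchored), "\n";  # prints 1
```

    The inner ^ is reached at offset 34, which is not the start of the string, so the anchored version can never succeed regardless of the input.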


    Dave

Re: Find Not Working
by stevieb (Canon) on Jun 02, 2016 at 23:42 UTC

    Right off the bat, your regex will never match, as this: ^\s{32} says "match exactly 32 whitespace characters at the very beginning of the string", but each line starts with a word character (\w). That's not the only issue, but I digress. Try this:

    use warnings;
    use strict;

    my $find = qr/
        ^           # start of string
        \w+         # one or more word chars (last name)
        \s+         # one or more whitespace
        (           # begin capture (goes into $1)
        (?:H0|HT)   # H0 or HT
        .*          # everything to end of string
        )           # end capture
    /x;

    open my $fh, '<', 'in.txt' or die $!;

    while (<$fh>){
        if (/$find/){
            my $string = $1; # $1 contains what we captured in the rex
            print "$string\n";
        }
    }

    Output:

    HT00000000 I
    HT00000000 S
    HT00000000 I
    HT00000000 M
    HT00000000 I
    HT00000000 I
    H000000000 I
    H000000000 O
    H000000000 I

    Here's the regex without breaking it up for explanation: /^\w+\s+((?:H0|HT).*)/

    Have a read of perlretut and perlre.

      I liked your solution, and posted a version using split at Re: Find Not Working.

      After some reflection, I think that something like this is probably better than either:

      while (<DATA>) {
          if (my ($name_column_deleted) = m/((?:H0|HT)\d{8,}.*)/) {
              print "$name_column_deleted\n";
          }
      }

      The OP doesn't show what exactly can go in the "NAME" field, but I suspect that it could contain spaces: "John Smith, Jr." or whatever. In that case, both of our solutions fail the general case, because there could be multiple space-separated tokens in the name.

      My suggestion now is to go with the regex approach, but do not anchor it to the beginning of the line. Instead, use a regex that qualifies H0 (or HT) with a minimum number of digits (it could be 4, 5, 6, or more; above I used 8). That way, this field will not be confused with a name. HO could be a last name.
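      To make the difference concrete, here is a small comparison of the three approaches discussed in this thread. The record is hypothetical: a name containing a space, which does not appear in the OP's sample data.

```perl
use strict;
use warnings;

# Hypothetical record whose NAME field contains a space.
my $record = "VAN DYKE HT00000000 I";

# Anchored regex: ^\w+\s+ must be followed directly by H0 or HT.
my ($anchored) = $record =~ /^\w+\s+((?:H0|HT).*)/;

# split with a limit of 2: everything after the first whitespace-separated token.
my $by_split = (split ' ', $record, 2)[1];

# Unanchored regex, qualified by at least 8 digits after H0 or HT.
my ($by_digits) = $record =~ /((?:H0|HT)\d{8,}.*)/;

print "anchored:  ", (defined $anchored ? $anchored : '(no match)'), "\n";
print "split:     $by_split\n";
print "by digits: $by_digits\n";
```

      With this input, the anchored regex finds nothing and split keeps part of the name, while the digit-qualified regex returns only the field.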

      There was a suggestion to use a fixed field solution like unpack or substr. That can work well if there is one producer of the file. However, I often work with files that say "field X is 32 columns", but some guys put 30,31,32,33 columns in the output! As a defense, I write files like that exactly as spec'd, but allow more flexibility when reading files generated by others when I can.

      As a PS: I prefer to assign directly to a variable rather than using the intermediate $1. I think the code "reads" better, but of course, your call on that.

        ++... very nice Marshall. I know my method wasn't overly efficient for the data supplied, so I just wanted to give an example of what a full string regex would look like. I was going to give a substr example, but didn't for the reason above.

        I assign direct to a variable instead of the special numbered vars (mostly), but since it didn't seem like OP knew much about regexes, I wanted to be explicit in my example.

Re: Find Not Working
by Marshall (Canon) on Jun 03, 2016 at 02:24 UTC
    As another possible idea, you could just use a split.
    #!/usr/bin/perl
    use warnings;
    use strict;

    while (<DATA>)
    {
        next if /^NAME/ or /^\s*$/;  #update changed \s+ to \s*
        print ''.(split (' ',$_,2))[1];
    }

    =prints
    HT00000000 I
    HT00000000 S
    HT00000000 I
    HT00000000 M
    HT00000000 I
    HT00000000 I
    H000000000 I
    H000000000 O
    H000000000 I
    =cut

    __DATA__
    NAME PT # AT
    DOE HT00000000 I
    DOE HT00000000 S
    DOE HT00000000 I
    SMITH HT00000000 M
    DOE HT00000000 I
    DOE HT00000000 I
    DOE H000000000 I
    DOE H000000000 O
    SMITH H000000000 I
    update: See above reply to stevieb.
Re: Find Not Working
by Laurent_R (Canon) on Jun 03, 2016 at 06:15 UTC
    Given your data format, you might also consider substr or unpack instead of a regex.
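    A minimal sketch of both, assuming the field really does start at column 32 (offset 31) on every data line. The padded sample lines are built here for illustration, since the exact spacing of the OP's file is not shown, and the two helper names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical layout: name padded to 31 characters, field starting at column 32.
my @lines = map { sprintf "%-31s%s\n", @$_ }
    [ 'NAME',  'PT #       AT' ],
    [ 'DOE',   'HT00000000 I' ],
    [ 'SMITH', 'H000000000 O' ];

# substr: offsets are 0-based, so column 32 is offset 31.
sub col32_substr {
    my ($line) = @_;
    chomp $line;
    return length($line) >= 32 ? substr($line, 31) : undef;
}

# unpack: skip 31 bytes, then take the rest with trailing whitespace stripped.
# The length guard matters because 'x31' dies on a string that is too short.
sub col32_unpack {
    my ($line) = @_;
    chomp $line;
    return undef if length($line) < 32;
    my ($field) = unpack 'x31 A*', $line;
    return $field;
}

for my $line (@lines) {
    my $field = col32_substr($line);
    # Skip the header and anything else that does not start with H0 or HT.
    print "$field\n" if defined $field and $field =~ /^H[0T]/;
}
```

    Both helpers depend on the column position being exact, so they suit files from a single, consistent producer.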
Re: Find Not Working
by Anonymous Monk on Jun 03, 2016 at 00:02 UTC
    If you are only interested in the 32nd column and don't care about what is at the start, then:

    my $find = qr/^.{31}((?:H0|HT).*)/;

    while ( <FILE> ) {
        print NEW "$1\n" if /$find/;
    }
Re: Find Not Working
by Anonymous Monk on Jun 02, 2016 at 23:03 UTC
    Please explain what each part of each regex pattern means, and the answer will become clear:

    ^(?:H0|HT)
    ^\s{32}\S\s