Modifying a regex

seni has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, I am pretty new to Perl, but have been doing some independent learning. I am working on a project in which I am trying to parse some data, and I have gotten stuck on what seems like an easy problem to fix, except I don't know how! :) I am trying to use the following code, but my user IDs now have NA at the front (e.g., NA12324). How can I either add a step before this to remove the NA, or get this code to accept this sort of ID? I hope this makes sense... thanks in advance!

#!/usr/bin/perl

use strict;

my $inFile = 'fanca.txt';

open (IN, $inFile) or die "open $inFile: $!";

my %user;

while (my $line = <IN>) {
    next unless $line =~ m{^(\S+) (\d+) (.*)};
    my ($site, $userID, $data, $data2) = ($1, $2, $3, $4);
    
    $user{$userID}{$site} = $data, $data2;
}

close(IN) or die "close $inFile: $!";

my $outfile = "parsingoutput_for_fanca.txt";
open(REPORT, ">$outfile") or die "open >$outfile: $!";

foreach my $userID (sort {$a <=> $b} keys %user) {
    my %sites = %{$user{$userID}};

    my $line1 =  'SITES';
    my $line2 = "$userID";

    while (my ($site, $data, $data2) = each %sites) {
        $line1 .= ' ' x (length($line2)-length($line1));
        $line2 .= ' ' x (length($line1)-length($line2));

        #add on next site
        $line1 .= ' '. ' ' . $site;
        $line2 .= ' '. ' '. $data . ' ' . ' '. $data2;
    }

    print REPORT $line1 . "\n";
    print REPORT $line2 . "\n";
    print REPORT "\n";
}

close (REPORT) or die "close $outfile: $!";

2006-10-28 Retitled by GrandFather, as per Monastery guidelines
Original title: 'Should be easy....'



Re: Modifying a regex
by grep (Monsignor) on Oct 27, 2006 at 17:48 UTC

while (my $line = <IN>) {
    #next unless $line =~ m{^(\S+) (\d+) (.*)};
    next unless $line =~ m{^(\S+) NA(\d+) (.*)};
    # Now 'NA' is not captured.
    # The regex has only three capture groups, so there is no $4 to assign.
    my ($site, $userID, $data) = ($1, $2, $3);

    $user{$userID}{$site} = $data;
}
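
To see what this change captures, here is a minimal sketch run against a single made-up line (the sample data is an assumption, not taken from the real fanca.txt). Because NA sits outside the parentheses, $2 holds only the digits, so a numeric sort on the IDs should still behave as before.

```perl
use strict;
use warnings;

# Hypothetical sample line; the real file is assumed to look like
# "site NA<digits> data", with single spaces between fields.
my $line = 'Foo9 NA12324 some data here';

my ($site, $userID, $data);
if ( $line =~ m{^(\S+) NA(\d+) (.*)} ) {
    # Three capture groups, so only $1 through $3 are set.
    ($site, $userID, $data) = ($1, $2, $3);
    print "site=$site id=$userID data=$data\n";
}
```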

grep

One dead unjugged rabbit fish later


Re: Modifying a regex
by bobf (Monsignor) on Oct 27, 2006 at 17:53 UTC

If you simply want to strip the NA from the beginning of a string, you can use s/PATTERN/REPLACEMENT/ (see perlop). This will not affect IDs that do not start with NA. For example,

use strict;
use warnings;

for ( 'NA12345', 67890 )
{
    my $id = $_;
    print "$id -> ";
    $id =~ s/^NA//;
    print "$id\n";
}

__END__
NA12345 -> 12345
67890 -> 67890

Update: I think I misread the question. If you want to allow an optional NA in the line that reads

next unless $line =~ m{^(\S+) (\d+) (.*)};

you can make the prefix optional with (?:NA)? (see perlre). For example,

use strict;
use warnings;

for( 'string1 NA12345 other stuff',
      'string2 67890 more stuff' )
{
    if( $_ =~ m/^(\S+) ((?:NA)?\d+) (.*)/ )
    {
        print "matched: $2\n";
    }
}

__END__
matched: NA12345
matched: 67890
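
Note that the pattern above keeps the NA inside the captured ID. If the ID is later compared numerically (the original script sorts with <=>), it may be more convenient to leave the prefix out of the capture. A small variation, offered as a sketch on the same two sample strings:

```perl
use strict;
use warnings;

my @ids;
for my $line ( 'string1 NA12345 other stuff',
               'string2 67890 more stuff' )
{
    # (?:NA)? matches an optional prefix without capturing it,
    # so $2 is always the bare number.
    if ( $line =~ m/^(\S+) (?:NA)?(\d+) (.*)/ ) {
        push @ids, $2;
        print "matched: $2\n";
    }
}
```

With this form both lines yield purely numeric IDs (12345 and 67890).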

Update 2: It looks like your regex is simply capturing 3 fields separated by a single space. If that is the case, split might be more appropriate.

use strict;
use warnings;

for( 'string1 NA12345 other stuff',
      'string2 67890 more stuff' )
{
    my @elements = split( /\s/, $_, 3 );
    print( '[', join( '][', @elements ), "]\n" );
}

__END__
[string1][NA12345][other stuff]
[string2][67890][more stuff]
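
Putting the two ideas together, the loop in the original script could be sketched roughly as follows. The sample lines are made up, and the decision to skip IDs that are not purely numeric after stripping NA is an assumption, not something the question specifies.

```perl
use strict;
use warnings;

my %user;
for my $line ( 'Foo9 NA2345 blah blah',
               'Bar8 NA1234 more blah' )
{
    # A limit of 3 keeps everything after the second field together.
    my ($site, $userID, $data) = split /\s/, $line, 3;
    next unless defined $data;          # fewer than three fields

    $userID =~ s/^NA//;                 # drop the prefix if present
    next unless $userID =~ /^\d+$/;     # assumed: keep only numeric IDs

    $user{$userID}{$site} = $data;
}

# Numeric sort works because the keys no longer carry the NA prefix.
for my $userID ( sort { $a <=> $b } keys %user ) {
    for my $site ( sort keys %{ $user{$userID} } ) {
        print "$userID $site $user{$userID}{$site}\n";
    }
}
```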

HTH


Re^2: Modifying a regex

by seni (Initiate) on Oct 27, 2006 at 18:03 UTC

Thank you to grep and bobf!!


Re^2: Modifying a regex

by seni (Initiate) on Oct 27, 2006 at 18:31 UTC

Hi bobf, the thing is, for this second data set I only have IDs with the NA prefix, so I don't have to worry about data without it. I have tried both your and grep's initial suggestions, and your last update, but the output file comes up blank... what is going on?


Re^3: Modifying a regex

by grep (Monsignor) on Oct 27, 2006 at 19:34 UTC

something like:

my @lines = ('Foo9 NA1234 blah blah blah',
          'Bar8 NA2345 blah blah blah',
          'Baz7 NA3456 blah blah blah');
foreach my $line (@lines) {
    next unless $line =~ m{^(\S+) NA(\d+) (.*)};
    my ($site, $userID, $data) = ($1, $2, $3);
    print "SITE: $site   USER: $userID   DATA: $data\n";
}
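
If the output file is still blank, one way to investigate is to report the lines the regex rejects instead of silently skipping them. The sketch below assumes, purely for illustration, that one line uses a tab rather than a single space; the actual cause in the real file is not known from the thread.

```perl
use strict;
use warnings;

# Hypothetical input: the second line separates the first two fields
# with a tab, which the single literal space in the regex will not match.
my @lines = ( "Foo9 NA1234 blah blah\n",
              "Bar8\tNA2345 blah blah\n" );

my $matched = 0;
my @rejected;
for my $line (@lines) {
    chomp $line;
    if ( $line =~ m{^(\S+) NA(\d+) (.*)} ) {
        $matched++;
    }
    else {
        push @rejected, $line;
        # Make tabs visible so they stand out in the warning.
        ( my $visible = $line ) =~ s/\t/\\t/g;
        warn "no match: [$visible]\n";
    }
}
print "matched $matched, rejected ", scalar(@rejected), "\n";
```

If every line turns up in the rejected list, the field separators or the ID format in the file probably differ from what the pattern expects.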

grep

One dead unjugged rabbit fish later


Re^4: Modifying a regex

by Anonymous Monk on Oct 27, 2006 at 20:15 UTC

Re^5: Modifying a regex

by grep (Monsignor) on Oct 27, 2006 at 21:00 UTC

Re^4: Modifying a regex

by seni (Initiate) on Oct 27, 2006 at 20:21 UTC