seni has asked for the wisdom of the Perl Monks concerning the following question:

Hello Everyone, I am pretty new to Perl, but have been doing some independent learning. I am working on a project in which I am trying to parse through some data. But I have gotten stuck on what seems to be an easy problem to fix, except I don't know how! :) I am trying to use the following code, but now my user IDs actually have NA in front of them (e.g. NA12324). How can I either do a step before this to remove the NA, or get this code to accept this sort of ID? I hope this makes sense...thanks in advance!
#!/usr/bin/perl
use strict;

my $inFile = 'fanca.txt';
open (IN, $inFile) or die "open $inFile: $!";

my %user;
while (my $line = <IN>) {
    next unless $line =~ m{^(\S+) (\d+) (.*)};
    my ($site, $userID, $data, $data2) = ($1, $2, $3, $4);
    $user{$userID}{$site} = $data, $data2;
}
close(IN) or die "close $inFile: $!";

my $outfile = "parsingoutput_for_fanca.txt";
open(REPORT, ">$outfile") or die "open >$outfile: $!";

foreach my $userID (sort {$a <=> $b} keys %user) {
    my %sites = %{$user{$userID}};
    my $line1 = 'SITES';
    my $line2 = "$userID";
    while (my ($site, $data, $data2) = each %sites) {
        $line1 .= ' ' x (length($line2)-length($line1));
        $line2 .= ' ' x (length($line1)-length($line2));
        #add on next site
        $line1 .= ' '. ' ' . $site;
        $line2 .= ' '. ' '. $data . ' ' . ' '. $data2;
    }
    print REPORT $line1 . "\n";
    print REPORT $line2 . "\n";
    print REPORT "\n";
}
close (REPORT) or die "close $outfile: $!";

2006-10-28 Retitled by GrandFather, as per Monastery guidelines
Original title: 'Should be easy....'

Replies are listed 'Best First'.
Re: Modifying a regex
by grep (Monsignor) on Oct 27, 2006 at 17:48 UTC
    You can do it right here:
    while (my $line = <IN>) {
        #next unless $line =~ m{^(\S+) (\d+) (.*)};
        next unless $line =~ m{^(\S+) NA(\d+) (.*)};  # Now 'NA' is not captured.
        my ($site, $userID, $data, $data2) = ($1, $2, $3, $4);
        $user{$userID}{$site} = $data, $data2;
    }
    Coupl'a notes:
    $4 is never populated - you only have 3 sets of capturing parens, so $data2 is empty and meaningless.
    The original regex should've failed altogether on the new IDs, because 'NA1234' doesn't match (\d+).
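    Both notes can be checked with a short script. This is only a minimal sketch: the sample line and the helper name parse_line are made up for illustration and are not part of the original script. It also shows a related point: assignment binds tighter than the comma operator, so a statement of the form `$x = $data, $data2;` stores only $data.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original script: returns the three
# captures in list context, or an empty list when the line does not match.
sub parse_line {
    my ($line) = @_;
    return $line =~ m{^(\S+) NA(\d+) (.*)};
}

my @fields = parse_line('Foo9 NA1234 blah blah');   # made-up sample line
print scalar(@fields), " fields: @fields\n";        # only three captures, no $4

# The old pattern rejects the same line, because \d+ cannot match 'NA'.
print "old pattern fails\n"
    unless 'Foo9 NA1234 blah blah' =~ m{^(\S+) (\d+) (.*)};

# Assignment binds tighter than the comma operator, so only the first
# value is stored; the second is evaluated and thrown away.
my %user;
{
    no warnings 'void';    # silence the "useless use" warning for this demo
    $user{1234}{Foo9} = $fields[2], 'ignored';
}
print "stored: $user{1234}{Foo9}\n";
```

    If both data fields are really needed, the regex would need a fourth set of capturing parens and the hash value would need to hold both, for example as an array reference.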


    grep
    One dead unjugged rabbit fish later
Re: Modifying a regex
by bobf (Monsignor) on Oct 27, 2006 at 17:53 UTC

    If you simply want to strip the NA from the beginning of a string, you can use s/PATTERN/REPLACEMENT/ (see perlop). This will not affect IDs that do not start with NA. For example,

    use strict;
    use warnings;

    for ( 'NA12345', 67890 ) {
        my $id = $_;
        print "$id -> ";
        $id =~ s/^NA//;
        print "$id\n";
    }
    __END__
    NA12345 -> 12345
    67890 -> 67890

    Update: I think I misread the question. If you want to allow an optional NA in the line that reads

    next unless $line =~ m{^(\S+) (\d+) (.*)};
    then you can change it by using a noncapturing set of parens (see perlre). For example:
    use strict;
    use warnings;

    for( 'string1 NA12345 other stuff', 'string2 67890 more stuff' ) {
        if( $_ =~ m/^(\S+) ((?:NA)?\d+) (.*)/ ) {
            print "matched: $2\n";
        }
    }
    __END__
    matched: NA12345
    matched: 67890

    Update 2: It looks like your regex is simply capturing 3 fields separated by a single space. If that is the case, split might be more appropriate.

    use strict;
    use warnings;

    for( 'string1 NA12345 other stuff', 'string2 67890 more stuff' ) {
        my @elements = split( /\s/, $_, 3 );
        print( '[', join( '][', @elements ), "]\n" );
    }
    __END__
    [string1][NA12345][other stuff]
    [string2][67890][more stuff]

    HTH

      Thank you to grep and bobf!!
      Hi bobf, The thing is, for this second data set I only have IDs with the NA prefix, so I don't have to be concerned with data that lack it. I have now tried both your and grep's initial suggestions and your last updated one, but the output file comes up blank...what is going on?
        Take a couple of lines from your data, then write a small test script to parse them. If you can't get that to work, post it here (munging any sensitive data).

        Something like:

        my @lines = ('Foo9 NA1234 blah blah blah',
                     'Bar8 NA2345 blah blah blah',
                     'Baz7 NA3456 blah blah blah');

        foreach my $line (@lines) {
            next unless $line =~ m{^(\S+) NA(\d+) (.*)};
            my ($site, $userID, $data) = ($1, $2, $3);
            print "SITE: $site USER: $userID DATA: $data\n";
        }
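        If a test like that works but the real file still gives blank output, one possible cause (a guess, since the data was not posted) is that the fields are separated by tabs or several spaces rather than exactly one space, so every line is silently skipped. A rough sketch that tolerates any whitespace and reports the lines it skips follows; parse_record and the sample lines are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into site, numeric ID, and the rest,
# tolerating tabs or runs of spaces. Returns an empty list on failure.
sub parse_record {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # drop a Unix or Windows line ending
    return $line =~ m{^(\S+)\s+NA(\d+)\s+(.*)};
}

# Made-up sample lines: the second is tab-separated, the third lacks 'NA'.
my @lines = (
    "Foo9 NA1234 blah blah\n",
    "Bar8\tNA2345\tblah blah\r\n",
    "Baz7 3456 blah blah\n",
);

my $line_no = 0;
for my $line (@lines) {
    $line_no++;
    my ($site, $userID, $data) = parse_record($line);
    if (defined $site) {
        print "SITE: $site USER: $userID DATA: $data\n";
    }
    else {
        # Report the skipped line instead of dropping it silently.
        warn "line $line_no did not match: $line";
    }
}
```

        Swapping the `next unless` for a warning like this makes it obvious whether the regex is rejecting every line or the problem lies elsewhere.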


        grep
        One dead unjugged rabbit fish later