Newbie, need help extracting strings from two files and comparing

soulbleach has asked for the wisdom of the Perl Monks concerning the following question:

I have two files, file1.txt and file2.txt. Each file contains lines with certain strings that I need to extract and pass into arrays (at least that's how I think it should be done). I then want to compare the contents of the two arrays and remove all entries that have the same string in both, so I end up with a list that contains only the unique entries/lines from file2.txt.

So far I have two separate but fairly similar scripts that each perform part of the job. I'm completely new to Perl and have very limited knowledge of scripting in general (a small amount of Lua and PHP, and a bit more bash), so please be gentle.

I would like to know how to join these two together so they do both parts of the job in one script. I've tried a number of options where I nest loops within each other, but I find that it will do one part, then the next, and then loop back to the beginning and miss out the subsequent steps; e.g. I'll get RC0000 - RC9999 printed out, but it won't then go and pull the lines from file 2.

This most likely sounds nonsensical, so if there is anything I can do to make it clearer, please ask.

Thanks very much in advance.

script1 which acts on file1 and outputs an array containing entries in the format RC0000:

use strict;
use warnings;

my $filename = "file1.txt";
my $logger   = "log.txt";
my %seen;
my @unique;

# 'or' rather than '||': '||' binds to the filename, so die would never fire on a failed open
open my $handle, '<', $filename or die "$0: Can't open $filename for reading: $!";
open my $FH, '>', $logger or die "$0: Can't open $logger for writing: $!";

while (my $line = <$handle>) {
    my @strings = $line =~ m/(RC[0-9][0-9][0-9][0-9])/;

    foreach my $s ( @strings ) {
        next if $seen{ $s }++;
        push @unique, $s;
        print $FH "@strings\n";  # push output to file log.txt for testing
    }
}

close $FH or die $!;

script2 which works on file2 and pulls out all lines that contain an entry in the format RC=0000:

use strict;
use warnings;

my $filename2 = "file2.txt";
my %seen2;
my @unique2;

# 'or' rather than '||' so that die fires when open fails
open my $handle2, '<', $filename2 or die "$0: Can't open $filename2 for reading: $!";

while (my $line2 = <$handle2>) {
    my @strings2 = $line2 =~ m/(RC=[0-9][0-9][0-9][0-9])/;

    foreach my $s2 ( @strings2 ) {
        next if $seen2{ $s2 }++;
        push @unique2, $s2;
        print $line2;  # $line2 still ends with its own newline
    }
}


Replies are listed 'Best First'.

Re: Newbie, need help extracting strings from two files and comparing
by Athanasius (Archbishop) on Dec 03, 2014 at 17:10 UTC

Hello soulbleach, and welcome to the Monastery!

Note that the regex in this line:

my @strings = $line =~ m/(RC[0-9][0-9][0-9][0-9])/;

contains only a single capture, so if the match succeeds @strings will contain only one element. Did you intend more than one capture per $line? If so, you need to add a /g modifier to the regex:

my @strings = $line =~ m/(RC[0-9]{4})/g;

Update: See “Global matching” in perlretut. (Also shortened regex: see “Quantifiers” in Regular Expressions.)
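A quick way to see the difference is to run both forms against the same line. This is only a sketch: the input text is made up for illustration and is not taken from the original files.

```perl
use strict;
use warnings;

# Hypothetical input line carrying two codes.
my $line = "first RC0001 then RC0002";

# Without /g, a list-context match returns only the first capture.
my @first_only = $line =~ m/(RC[0-9]{4})/;

# With /g, it returns every match found on the line.
my @all = $line =~ m/(RC[0-9]{4})/g;

print "without /g: @first_only\n";   # without /g: RC0001
print "with /g:    @all\n";          # with /g:    RC0001 RC0002
```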

But without sample input data and desired output, it’s hard to know what you’re trying to do.

Anyway, hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: Newbie, need help extracting strings from two files and comparing
by GotToBTru (Prior) on Dec 03, 2014 at 16:56 UTC

The similarity in the two programs should suggest 'subroutine' to you. Call it once for each file, have it pass back the array of unique lines you are already creating. Use a similar method to your %seen to populate your final hash with the values from the file2 array, removing entries that appear in the file1 array.

It would be nice to see an example of what the input file looks like.

UPDATE - suggested subroutine

my @array1 = unique($filename1);

sub unique {
  my ($file) = shift;
  my (%seen);
  open my $fh, '<', $file or die "Could not open $file: $!\n";
  while(<$fh>) {
    chomp;
    next unless m/(RC=\d{4})/;  # skip lines without a match
    $seen{$1} = 1;              # key on the captured string, not the whole line
  }
  close $fh;
  return keys %seen;
}
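The last step described above, dropping the file2 entries that also appear in the file1 array, could be sketched with a lookup hash and grep. The two arrays here are hypothetical stand-ins for what the subroutine would return, and the sketch assumes both have already been normalized to the same RC#### form.

```perl
use strict;
use warnings;

# Hypothetical results for each file, already in the same RC#### form.
my @array1 = qw( RC0001 RC0002 RC0003 );
my @array2 = qw( RC0001 RC0003 RC0004 );

# Build a lookup of everything seen in file1 ...
my %in_file1 = map { $_ => 1 } @array1;

# ... and keep only the file2 entries that are not in it.
my @only_in_file2 = grep { !$in_file1{$_} } @array2;

print "@only_in_file2\n";   # RC0004
```

The hash lookup avoids nesting one loop over the other, which is where the original attempts ran into trouble.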

1 Peter 4:10


Re: Newbie, need help extracting strings from two files and comparing
by stevieb (Canon) on Dec 03, 2014 at 17:34 UTC

Don't know exactly if this is what you were after, but I'm sitting at a hotel before I cross the border and thought I'd slap something together quickly. There are many ways to do this; this was just a quick idea.

It reads in two files (see below), each with unique RC#### and some overlap. It keeps all the individual lines (that have unique RC numbers), and if file2 has one that overlaps, it overwrites the line entry from file1.

Hopefully this is kind of what you were looking for. If not, please clarify a little better. Also note that in file2, it uses the '=' character in the RC number, but I removed that before any comparisons.

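#!/usr/bin/perl

use 5.12.0;
use warnings;

my ( $file1, $file2 ) = qw( 1.txt 2.txt );

my %seen;

for my $file ( $file1, $file2 ){
    open my $fh, '<', $file
      or die "Can't open file $file: $!";

    while ( my $line = <$fh> ){
        chomp $line;                      # a bare chomp would act on $_, not $line
        next if $line !~ /RC=?\d{4}\s+/;
        $line =~ s/RC=/RC/;               # normalize file2's RC=#### to RC####
        # capture the code directly; 'my $x = ... if ...' has unreliable behavior
        my ($string) = $line =~ /(RC\d{4})\s+/ or next;
        $seen{ $string } = $line;         # a later file overwrites an earlier entry
    }

    close $fh;
}

for my $key ( keys %seen ){
    say $seen{ $key };
}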

Output:

This is line RC0003 in file 2.
This is line RC0001 in file 2.
This is line RC0002 in file 1
This is line RC0004 in file 2.

file1:

This is line RC0001 in file 1
This is line RC0002 in file 1
This is line RC0003 in file 1

file2:

This is line RC=0001 in file 2.
This is line RC=0003 in file 2.
This is line RC=0004 in file 2.
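For comparison, the original question asked for only those file2 lines whose code does not occur in file1. A minimal sketch of that variant, under the assumption that the codes match once the '=' is ignored, might look like the following. It reads the sample data above from strings through in-memory filehandles so it runs without any files on disk; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Sample data copied from the files above, held in strings.
my $file1_text = <<'END';
This is line RC0001 in file 1
This is line RC0002 in file 1
This is line RC0003 in file 1
END

my $file2_text = <<'END';
This is line RC=0001 in file 2.
This is line RC=0003 in file 2.
This is line RC=0004 in file 2.
END

# Pass 1: remember the digits of every RC code that occurs in file1.
my %in_file1;
open my $fh1, '<', \$file1_text or die "Can't open in-memory file1: $!";
while ( my $line = <$fh1> ) {
    $in_file1{$_} = 1 for $line =~ /RC(\d{4})/g;
}
close $fh1;

# Pass 2: keep file2 lines whose RC=#### digits were not seen in file1.
my @only_in_file2;
open my $fh2, '<', \$file2_text or die "Can't open in-memory file2: $!";
while ( my $line = <$fh2> ) {
    chomp $line;
    my ($digits) = $line =~ /RC=(\d{4})/ or next;
    push @only_in_file2, $line unless $in_file1{$digits};
}
close $fh2;

print "$_\n" for @only_in_file2;   # This is line RC=0004 in file 2.
```

Running the two passes one after the other, rather than nesting them, means each file is read exactly once.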

-stevieb


Re^2: Newbie, need help extracting strings from two files and comparing

by soulbleach (Initiate) on Dec 09, 2014 at 12:23 UTC

Thanks very much, this is exactly what I needed. You guys all rock, I'm a Perl convert.


Re: Newbie, need help extracting strings from two files and comparing
by karlgoethebier (Abbot) on Dec 03, 2014 at 18:36 UTC

I don't know if you are permitted to use modules (I fear not).

But if you are, and if I were in your shoes, I would consider using IO::All and then applying a set operation with Set::Scalar to check for uniqueness.

N.B.: Assuming that I guessed your specs right.

As some of my predecessors already pointed out, it would be helpful if you provide some more details to avoid unnecessary reverse engineering ;-)

Update: tried to fix bad English.

Regards, Karl

«The Crux of the Biscuit is the Apostrophe»
