Re: Matching and combining two text files

by koolgirl (Hermit)
on Jan 23, 2012 at 16:29 UTC ( #949448=note )

in reply to Matching and combining two text files

OK, so thanks to GrandFather, I was able to take his code, tinker with it a bit (as I didn't quite understand a few things), and get a good routine going. It's working well, with one exception: it's repeating itself, printing each line of output about 4 times.

use strict;
use warnings;

my %docs;
my $currParcel;
my $file;
my $line;
my $doc;
my $key;

opendir(DIR, 'F:\project_files') || die $!;

foreach $file (readdir(DIR)) {
    if ($file =~ /docs(.*)/) {
        $currParcel = $1;
        open(IN2, $file) || die $!;
        while(<IN2>) {
            chomp;
            $line = $_;
            $docs{$line} = $currParcel;
        } # end while
    } # end if
} # end foreach

close IN2;

open(IN, 'carson_county_abstract.txt') || die $!;
open(OUT, '>cars_abstract.txt') || die $!;

while (<IN>) {
    $line = $_;
    if ($line =~ /Document #: ([0-9]*)(.*)/) {
        print $docs{$1} . "\n";
        print OUT "parcel# $docs{$1} doc num $1 $2\n";
    } # end if
    #print OUT "parcel# $docs{$1} doc num $1 $2\n";
} # end while
close IN;

As you can see from the commented-out print statement, I tried moving the print statement out of the if statement, but I got the same problem there. Obviously the whole thing wouldn't work if it was moved out of the while loop. Any thoughts on stopping the repeat?

At this point, I'm willing to put up with it, but I'm dealing with such a large data set, I really need to stop the repeating if I can..oO

...still no sleep...seeing little pink elephants holding syntax cue cards....

Replies are listed 'Best First'.
Re^2: Matching and combining two text files
by choroba (Archbishop) on Jan 23, 2012 at 18:14 UTC
    Without seeing your actual input, I can only guess: maybe the documents are repeated in carson_county_abstract.txt as well? You might need Data::Dumper to print Dumper \%docs to see what's really inside the hash (check for whitespace in keys and similar). Also, change line 33 to
    print "<$1:$docs{$1}>\n";
    to see whether you are matching the right thing.
Re^2: Matching and combining two text files
by GrandFather (Saint) on Jan 23, 2012 at 20:27 UTC

    Bad koolgirl. I gave you nice clean code and you nastised it. Here are a few guidelines to help make it nice again:

    1. define variables where they are first required
    2. use the three argument version of open
    3. use lexical file handles (ones declared with my)
    4. avoid nesting if
    5. don't slurp files and such (for loops implicitly slurp)
    6. show the name for failed opens along with the OS's error message
    7. don't comment block ends. If you avoid nesting, keep blocks short, and indent nicely, there is no need for them, and leaving them out removes clutter

    and the niceised code:

    use strict;
    use warnings;

    my $dir      = 'F:\project_files';
    my $fInName  = 'carson_county_abstract.txt';
    my $fOutName = 'cars_abstract.txt';
    my %docs;

    opendir my $dirScan, $dir or die "Failed to open $dir: $!\n";

    while (defined(my $entry = readdir $dirScan)) {
        next if $entry !~ /docs(.*)/;

        my $currParcel = $1;
        my $filePath   = "$dir\\$entry";

        open my $inFile, '<', $filePath or die "Can't open $filePath: $!\n";

        while (defined (my $line = <$inFile>)) {
            chomp $line;
            $docs{$line} = $currParcel;
        }

        close $inFile;
    }

    open my $fIn, '<', $fInName or die "Failed to open $fInName: $!";
    # '>' opens the output file for writing
    open my $fOut, '>', $fOutName or die "Failed to create $fOutName: $!";

    while (defined (my $line = <$fIn>)) {
        next if $line !~ /Document #: ([0-9]*)(.*)/;

        print $docs{$1} . "\n";
        print $fOut "parcel# $docs{$1} doc num $1 $2\n";
    }

    close $fOut or die "Error closing $fOutName: $!\n";

    To fix your duplication problem you might want to use another hash to check for duplicate document/parcel pairs.
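
A minimal sketch of that second-hash idea, assuming the repeats come from the same document/parcel pair turning up more than once in the input. The %seen hash, the sample lines and the parcel values are made up for illustration, and an in-memory file stands in for the real abstract file:

```perl
use strict;
use warnings;

# Hypothetical lookup table standing in for %docs (document number => parcel).
my %docs = (1001 => 'A12', 1002 => 'B34');

# Made-up input; the first document line appears twice on purpose.
my $sample = "Document #: 1001 deed\nDocument #: 1002 lien\nDocument #: 1001 deed\n";

open my $fIn, '<', \$sample or die "Can't open in-memory sample: $!\n";

my %seen;      # remembers each document/parcel pair already emitted
my @output;

while (defined(my $line = <$fIn>)) {
    next if $line !~ /Document #: ([0-9]+)(.*)/;

    my ($docNum, $rest) = ($1, $2);
    my $parcel = $docs{$docNum};

    next if !defined $parcel;                # no parcel known for this document
    next if $seen{"$docNum\t$parcel"}++;     # pair already emitted: skip it

    push @output, "parcel# $parcel doc num $docNum$rest\n";
}

close $fIn;
print @output;
```

The post-increment in the %seen test is false the first time a pair shows up and true on every later occurrence, so only the first copy gets through.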

    Ok, and if you don't understand stuff find a friendly web site dedicated to the area and ask for clarification ;).

    True laziness is hard work

      hahahaha, <--- that's real laughter, not nervous, I look like a slump and need to play it off laughter. ;-) OK, so, I hardly ever use any of the types of syntax you do. I'm finding this a lot. I guess it's because I learned from the first edition books, and never have learned much about new ways, for instance,

      open(IN, $file) || die $!

      is the only way I know how to open a file, and have never bothered using a different method, just as I have never used

      next if

      and I've always been taught that nice clean code, always has all variables declared beforehand, never in the middle of the program. So, basically, my "nastiness" of the code you gave me :p, was to figure out what you were doing, where I didn't understand. I have sort of been thrown into the big leagues whether I belong there or not, so I guess I better pick up a bat...

      So, if I may, please help me understand why you're using the next if, instead of regular if's nested as I did, why nesting if statements is a bad idea, and what do you mean about slurping up files? I know I'm probably going to lose a million XP and get slammed for these questions, but oh well, I tried to avoid it and GrandFather threw me out in the open, so I guess I might as well ask now. ;-)

        Asking about things you don't understand is smart and good. Not asking is the dumbest thing you can do because it wastes everyone's time and you learn nothing and keep repeating the same mistakes.

        I've always been taught that nice clean code, always has all variables declared beforehand

        I'd get a new teacher if I were you! Code with all the declarations lumped together may look pretty to some eyes, but you gain no advantage whatsoever from declaring variables like that. Always declare variables in the smallest sensible scope and initialise them at the same time. That helps avoid a whole slew of bugs including reusing a variable name and ending up with subtle heisenbugs as a result.
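
A small sketch of declaring in the smallest sensible scope, with made-up parcel values. The names @parcels, @report and $label are only for illustration:

```perl
use strict;
use warnings;

my @parcels = ('A12', 'B34');    # made-up sample data
my @report;

for my $parcel (@parcels) {
    # Declared and initialised at the point of first use;
    # each pass through the loop gets a fresh $label.
    my $label = "parcel# $parcel";
    push @report, $label;
}

# $parcel and $label are out of scope here. Under strict, mentioning
# either one past this point is a compile-time error rather than a
# silent reuse of a stale value.
print "$_\n" for @report;
```

Had $label been declared once at the top of the file, a later loop could pick up whatever value the previous loop left in it without any complaint from Perl.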

        First, to answer a couple of questions you didn't ask. Use the three parameter form of open (open handle, mode, target) because providing an explicit mode ('>' or '<' for example) is both clearer and safer. Use lexical file handles because they are clearer and safer - safer because their scope is limited to the current block and using strict you are more likely to catch typos.
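
A hedged sketch of both points, writing to a scratch directory from the core File::Temp module so that no real files are touched. The file name docs_demo.txt and its contents are made up. With the two-argument form, characters such as '>' or '|' at the start of the file name are read as part of the mode, which is the safety problem the explicit mode avoids:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory, removed automatically when the program exits.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/docs_demo.txt";    # hypothetical file name

# '>' says plainly that we are writing. $out is a lexical handle.
open my $out, '>', $path or die "Failed to create $path: $!\n";
print $out "1001\n1002\n";
close $out or die "Error closing $path: $!\n";

# '<' says plainly that we are reading.
open my $in, '<', $path or die "Failed to open $path: $!\n";

my $count = 0;
while (defined(my $line = <$in>)) {
    $count++;
}

close $in;

print "read $count lines from $path\n";
```

A mistyped handle name such as $inn is caught at compile time under strict, whereas a mistyped bareword handle is just treated as another, unopened handle.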

        The point about avoiding nesting is that the deeper the nesting goes the harder it is to figure out what the code does. If you have a simple test and can bail (as in the next if line) you don't have to worry about that case any more, it's all done and dusted.
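
As a sketch of the difference, here is a loop over made-up directory entries written with early exits. The entry names and the docs_ pattern are assumptions for illustration only:

```perl
use strict;
use warnings;

# Made-up directory listing.
my @entries = ('.', '..', 'docs_A12.txt', 'notes.txt', 'docs_.txt', 'docs_B34.txt');
my @parcels;

for my $entry (@entries) {
    # Each test bails out as soon as the entry is known to be uninteresting,
    # so the code below never has to think about that case again.
    next if $entry !~ /^docs_(.*)\.txt$/;

    my $parcel = $1;
    next if $parcel eq '';    # nothing between "docs_" and ".txt"

    push @parcels, $parcel;
}

print "@parcels\n";
```

Written with nested if statements, the push would sit two levels deep, and every further condition would push it one level deeper.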

        Slurping is where you read everything into an array. If that is followed by looping over the array using a for loop then very likely you can remove the array and use a while loop instead. That has two advantages: 1/ you can see from the code that you loop while there is stuff in the file (which gets sorta obscured by the array), and 2/ you only read one line of the file into memory at a time. Most often the second point isn't all that important, but if the file is huge it can be a killer. Neither reason is absolutely compelling, but slurping seldom has an advantage over iteration using a while loop so you might as well go with clearer and use the while loop (iteration) form.
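
The two forms side by side, as a sketch. An in-memory file with made-up contents stands in for a real one, and the variable names are for illustration only:

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";    # made-up file contents

# Slurping: every line is read into @all before the loop starts.
open my $slurpIn, '<', \$data or die "Can't open in-memory file: $!\n";
my @all = <$slurpIn>;
close $slurpIn;

my $slurpCount = 0;
for my $line (@all) {
    $slurpCount++;
}

# Iterating: only the current line is held in memory.
open my $iterIn, '<', \$data or die "Can't open in-memory file: $!\n";

my $iterCount = 0;
while (defined(my $line = <$iterIn>)) {
    $iterCount++;
}

close $iterIn;

print "slurped $slurpCount lines, iterated over $iterCount lines\n";
```

Both loops see the same lines. The difference is that @all grows with the size of the file, while the while loop's memory use stays at about one line.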

        True laziness is hard work
