PerlMonks  

Matching and combining two text files

by koolgirl (Hermit)
on Jan 23, 2012 at 03:47 UTC (#949303, perlquestion)

koolgirl has asked for the wisdom of the Perl Monks concerning the following question:

Dear Fellow Monks/Monkettes,

I have been working on a county site scraping project for about three months, and I'm exasperated in every way possible. I'm finally at the end, and almost there, but am seriously stuck in the middle of pulling the data into the .csv files. I've been "in the basement" the whole time on this one, I don't think anyone's seen me in about 4 months oO, and to say I have code blindness would be the understatement of the year.

So, here's my problem: I'm collecting information about thousands of properties. Each property has a parcel number, which is the unique identifier I use throughout most of my code. Each parcel number, in turn, has a document history (title abstract, deed history) that I have to collect. Each document in that history has its own unique document number. However, as I began collecting the information for each individual document number, I saw that there have been some data entry mistakes, and the document records themselves contain no record of which parcel number they belong to. So now I have a file with all the info I need, one line per document, and no way to tell which parcel number each line relates to.

I do have a file listing each parcel number and all the document numbers that belong to it, so the only way I can think of to resolve the issue is to compare the files, matching each doc number in one file to its corresponding info in the other, then putting both sets of info together in a new file, this one including the parcel number. I cannot figure out how to do this. I read all through the Llama, the Camel and the Cookbook (what is that, a mountain goat? oO), and the only things similar to my situation were the technique of using Tie::File, which basically lets a file be operated on as an array (that doesn't really cut it), and the comparing-two-files routine, but I need to match the files against each other and operate from there, not just compare.

I haven't slept in about 3 or 4 days, so please don't kill me as I suspect the answer is very simple and basic *ducks to avoid flying keyboards*. I'm too tired to think, or apparently write code that makes sense, but my head will be chopped off, seriously, if I can't finish this by morning.

Does anyone have any ideas for solving this problem? Not looking for someone to do it for me, really, just a push in the right direction, I'm spinning my wheels here. I've tried attempting to tie the parcel and doc numbers together in the initial collection, which would be cumbersome anyhow, because it isn't even stripped at that point, but comparing the two files, then combining the data into a new output file seems to be the only way to go.

Bottom Line Coming Up In 3, 2, 1....

Basically, I need to keep track of the parcel number and each document number tied to it, and the two sets of data are in two separate files. I've made an example of what type of data each file holds, where it is related, and the output file I want to create by combining each set of data with its parcel number. I cannot seem to even think of how to do that at this point.

FILE1
parcel# 12345
doc num 123
doc num 456
doc num 789
parcel# 67890
doc num 342
doc num 657
doc num 876

FILE2
doc num 342 data data data data data data data data
doc num 657 data data data data data data data data
doc num 876 data data data data data data data data
doc num 123 data data data data data data data data
doc num 456 data data data data data data data data
doc num 789 data data data data data data data data

So that's an example of the structure of the data that each file holds and how it matches, and this is what I'm trying to output as a combination from matching each set up:

FILE3
parcel# 12345 doc num 123 data data data data data data data data
parcel# 12345 doc num 456 data data data data data data data data
parcel# 12345 doc num 789 data data data data data data data data
parcel# 67890 doc num 342 data data data data data data data data
parcel# 67890 doc num 657 data data data data data data data data
parcel# 67890 doc num 876 data data data data data data data data

Sorry, I am a total brain dead zombie, so I hope I made sense, if anyone has any ideas, please help me, I am officially not able to compute. oO

Replies are listed 'Best First'.
Re: Matching and combining two text files
by GrandFather (Saint) on Jan 23, 2012 at 04:27 UTC

    The standard technique for looking stuff up is to use a hash:

    use strict;
    use warnings;

    my $file1 = <<FILE1;
    parcel# 12345
    doc num 123
    doc num 456
    doc num 789
    parcel# 67890
    doc num 342
    doc num 657
    doc num 876
    FILE1

    my $file2 = <<FILE2;
    doc num 342 data data data data data data data data
    doc num 657 data data data data data data data data
    doc num 876 data data data data data data data data
    doc num 123 data data data data data data data data
    doc num 456 data data data data data data data data
    doc num 789 data data data data data data data data
    FILE2

    my %docs;
    my $currParcel;

    open my $f1In, '<', \$file1;
    while (<$f1In>) {
        chomp;
        next if ! $_;

        if (/parcel#\s+(\d+)/) {
            $currParcel = $1;
            next;
        }

        next if ! defined $currParcel || ! /^doc num (\d+)/;
        $docs{$1} = $currParcel;
    }
    close $f1In;

    open my $f2In, '<', \$file2;
    while (<$f2In>) {
        chomp;
        next if ! /doc num\s+(\d+)\s+(.*)/;

        if (! exists $docs{$1}) {
            warn "Parcel not known for $1\n";
            next;
        }

        print "parcel# $docs{$1} doc num $1 $2\n";
    }
    close $f2In;

    Prints:

    parcel# 67890 doc num 342 data data data data data data data data
    parcel# 67890 doc num 657 data data data data data data data data
    parcel# 67890 doc num 876 data data data data data data data data
    parcel# 12345 doc num 123 data data data data data data data data
    parcel# 12345 doc num 456 data data data data data data data data
    parcel# 12345 doc num 789 data data data data data data data data

    However this task looks like it should really be using a database. If there are more than a few hundred entries in the files and the data is likely to be referenced more than a small number of times a database will make your life much happier (eventually).

    True laziness is hard work

      Thanks, GrandFather, I suspected as much, about the hash, but my experience is a bit limited with them, as such, I had a hard time envisioning how to match up the keys/values, although now it seems obvious. Yes, the company I'm working for is using a db, I'm actually writing the code to put it there (create a .csv out of all collected data), unfortunately in doing so, I have to deal with about a half a million records, even a small chunk of that to work on and test, is mind boggling.

      Half of the time, since I began working as a Perl programmer *sniff....koolgirl's growing up...*, I feel brilliant, the other half of the time, I feel like a complete dumb a$#. I guess it evens out eventually?

Re: Matching and combining two text files
by roboticus (Chancellor) on Jan 23, 2012 at 11:15 UTC

    koolgirl:

    The first thing I'd advise is get a good night's sleep. If you're trying to get a project done, you'll be a lot more efficient and think more clearly when you're rested. If you're too stressed to sleep, then go jogging to get your energy level up and clear your head. (I find that when I'm jogging I can solve a lot of programming problems after a time or two around the track.)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Matching and combining two text files
by koolgirl (Hermit) on Jan 23, 2012 at 16:29 UTC

    OK, so thanks to GrandFather, I was able to take his code, tinker with it a bit (as I didn't quite understand a few things), and get a good routine going. It works well, with one exception: it's repeating itself, printing each line of output about 4 times.

    use strict;
    use warnings;

    my %docs;
    my $currParcel;
    my $file;
    my $line;
    my $doc;
    my $key;

    opendir(DIR, 'F:\project_files') || die $!;

    foreach $file (readdir(DIR)) {
        if ($file =~ /docs(.*)/) {
            $currParcel = $1;
            open(IN2, $file) || die $!;

            while(<IN2>) {
                chomp;
                $line = $_;
                $docs{$line} = $currParcel;
            } # end while
        } # end if
    } # end foreach
    close IN2;

    open(IN, 'carson_county_abstract.txt') || die $!;
    open(OUT, '>cars_abstract.txt') || die $!;

    while (<IN>) {
        $line = $_;
        if ($line =~ /Document #: ([0-9]*)(.*)/) {
            print $docs{$1} . "\n";
            print OUT "parcel# $docs{$1} doc num $1 $2\n";
        } # end if
        #print OUT "parcel# $docs{$1} doc num $1 $2\n";
    } # end while
    close IN;

    As you can see from the commented out print statement, I tried moving the print statement out of the if statement, but same problem there. Obviously the whole thing wouldn't work if it was moved out of the while loop, any thoughts on stopping the repeat?

    At this point, I'm willing to put up with it, but I'm dealing with such a large data set, I really need to stop the repeating if I can..oO

    ...still no sleep...seeing little pink elephants holding syntax cue cards....

      Without seeing your actual input, I can only guess: maybe the documents are repeated in carson_county_abstract.txt as well? You could use Data::Dumper (print Dumper \%docs) to see what's really inside the hash (check for whitespace in keys and similar). Also, change line 33 to
      print "<$1:$docs{$1}>\n";
      to see whether you are matching the right thing.
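
A minimal sketch of that kind of inspection, under some assumptions: the sample keys below are made up to show what stray whitespace or a Windows carriage return would look like, and `clean_key` is a hypothetical helper, not something from the code above. Setting `$Data::Dumper::Useqq` makes invisible characters show up as escapes in the dump.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample: keys as they might look if the doc files carried
# trailing spaces or Windows line endings that chomp left behind.
my %docs = (
    "123"   => '12345',
    "456 "  => '12345',    # trailing space
    "789\r" => '12345',    # stray carriage return
);

$Data::Dumper::Useqq    = 1;    # print \r, \t and friends as visible escapes
$Data::Dumper::Sortkeys = 1;    # stable key order, easier to scan
print Dumper(\%docs);

# Hypothetical helper: strip leading and trailing whitespace from a key.
sub clean_key {
    my ($key) = @_;
    $key =~ s/^\s+|\s+$//g;
    return $key;
}

# Rebuild the lookup with normalised keys.
my %clean;
$clean{ clean_key($_) } = $docs{$_} for keys %docs;

print exists $clean{789} ? "789 found\n" : "789 missing\n";
```

If the dump shows keys like "789\r", the lookups by bare document number will miss, and cleaning the keys as they are stored is one way around it.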

      Bad koolgirl. I gave you nice clean code and you nastised it. Here are a few guidelines to help make it nice again:

      1. define variables where they are first required
      2. use the three argument version of open
      3. use lexical file handles (ones declared with my)
      4. avoid nesting if
      5. don't slurp files and such (for loops implicitly slurp)
      6. show the name for failed opens along with the OS's error message
      7. don't comment block ends. If you avoid nesting, keep blocks short and indent nicely there is no need and it removes clutter

      and the niceised code:

      use strict;
      use warnings;

      my $dir      = 'F:\project_files';
      my $fInName  = 'carson_county_abstract.txt';
      my $fOutName = 'cars_abstract.txt';
      my %docs;

      # Build the document number => parcel number lookup from the docs* files
      opendir my $dirScan, $dir or die "Failed to open $dir: $!\n";
      while (defined(my $entry = readdir $dirScan)) {
          next if $entry !~ /docs(.*)/;

          my $currParcel = $1;
          my $filePath   = "$dir\\$entry";

          open my $inFile, '<', $filePath or die "Can't open $filePath: $!\n";
          while (defined (my $line = <$inFile>)) {
              chomp $line;
              $docs{$line} = $currParcel;
          }
          close $inFile;
      }
      closedir $dirScan;

      open my $fIn, '<', $fInName or die "Failed to open $fInName: $!";
      # '>' opens the output file for writing
      open my $fOut, '>', $fOutName or die "Failed to create $fOutName: $!";

      while (defined (my $line = <$fIn>)) {
          next if $line !~ /Document #: ([0-9]*)(.*)/;
          print $docs{$1} . "\n";
          print $fOut "parcel# $docs{$1} doc num $1 $2\n";
      }

      close $fIn;
      close $fOut or die "Error closing $fOutName: $!\n";

      To fix your duplication problem you might want to use another hash to check for duplicate document/parcel pairs.
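
One way the second-hash idea could look, as a sketch only: the sample lines and the `%docs` contents are invented, and `%seen` and `@output` are illustrative names. Each parcel/document pair is recorded the first time it is written and skipped on every later sighting.

```perl
use strict;
use warnings;

# Hypothetical input in the shape the abstract file seems to have,
# with one record repeated.
my @lines = (
    "Document #: 123 data A\n",
    "Document #: 456 data B\n",
    "Document #: 123 data A\n",
);

# Hypothetical lookup: document number => parcel number.
my %docs = (123 => '12345', 456 => '12345');

my %seen;      # "parcel|doc" pairs that have already been written
my @output;

for my $line (@lines) {
    next if $line !~ /Document #: ([0-9]+)(.*)/;
    my ($doc, $rest) = ($1, $2);

    next if !exists $docs{$doc};            # no parcel known for this doc
    next if $seen{"$docs{$doc}|$doc"}++;    # true from the second sighting on

    push @output, "parcel# $docs{$doc} doc num $doc$rest\n";
}

print @output;
```

The post-increment returns the old count, so the first occurrence sees 0 (false) and gets through, while repeats see 1 or more and are skipped.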

      Ok, and if you don't understand stuff find a friendly web site dedicated to the area and ask for clarification ;).

      True laziness is hard work

        hahahaha, <--- that's real laughter, not nervous, I look like a slump and need to play it off laughter. ;-) OK, so, I hardly ever use any of the types of syntax you do. I'm finding this a lot. I guess it's because I learned from the first edition books, and never have learned much about new ways, for instance,

        open(IN, $file) || die $!

        is the only way I know how to open a file, and have never bothered using a different method, just as I have never used

        next if

        and I've always been taught that nice clean code always has all variables declared beforehand, never in the middle of the program. So, basically, my "nastiness" of the code you gave me :p, was to figure out what you were doing, where I didn't understand. I have sort of been thrown into the big leagues whether I belong there or not, so I guess I better pick up a bat...

        So, if I may, please help me understand why you're using the next if, instead of regular if's nested as I did, why nesting if statements is a bad idea, and what do you mean about slurping up files? I know I'm probably going to lose a million XP and get slammed for these questions, but oh well, I tried to avoid it and GrandFather threw me out in the open, so I guess I might as well ask now. ;-)
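
A rough sketch of the difference those guidelines point at, offered as an illustration rather than as GrandFather's own answer. The sample text and the sub names `nested` and `guarded` are made up. Both subs return the same document numbers. The first uses nested ifs and a foreach over the filehandle, which reads the whole file into a list before the loop starts. The second reads one line at a time and uses `next if` to skip uninteresting lines early.

```perl
use strict;
use warnings;

# Hypothetical sample data, read through an in-memory filehandle.
my $text = "parcel# 1\n\ndoc num 10\ndoc num 11\n";

# Nested style: each condition adds a level of indentation, and
# foreach evaluates <$fh> in list context, so every line is read
# into memory before the first iteration.
sub nested {
    my ($data) = @_;
    my @found;
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";
    foreach my $line (<$fh>) {
        chomp $line;
        if ($line ne '') {
            if ($line =~ /^doc num (\d+)/) {
                push @found, $1;
            }
        }
    }
    close $fh;
    return @found;
}

# Guard-clause style: while reads a single line per iteration, and
# each "next if" rejects a line early, so the real work stays at one
# indentation level.
sub guarded {
    my ($data) = @_;
    my @found;
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";
    while (defined(my $line = <$fh>)) {
        chomp $line;
        next if $line eq '';
        next if $line !~ /^doc num (\d+)/;
        push @found, $1;
    }
    close $fh;
    return @found;
}

print join(',', nested($text)),  "\n";
print join(',', guarded($text)), "\n";
```

With half a million records the memory difference between the two loops can matter. The same applies to foreach over readdir, which returns every directory entry at once.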
