Best way to compare my data?

Lavezzi has asked for the wisdom of the Perl Monks concerning the following question:

Re: Best way to compare my data? by Corion (Patriarch) on Mar 28, 2010 at 15:20 UTC
Maybe you want to split the file into several files, then compare those files with each other? Along the way, you might find a way to skip the "split into several files" step and compare the sections directly from the one file. I'm not sure where you are having problems, so if you show some code we can help you further.

One possible approach would be to use "paragraph mode" to read in the sections of your file, then split those sections into lines, then find the differences between the sections. Maybe you could use Algorithm::Diff to give you a human-readable overview of what changed.
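
A minimal sketch of the "paragraph mode" idea, assuming the sections are separated by blank lines and each line looks like "hop,ip". The helper name read_sections and the in-memory sample text are made up for illustration and are not from the thread:

```perl
use strict;
use warnings;

# Hypothetical sample: two traceroute sections separated by a blank line.
my $text = "13,4.69.137.70\n14,4.69.134.70\n\n13,4.69.137.74\n14,4.69.134.70\n";

# read_sections is an illustrative helper name, not part of any module.
sub read_sections {
    my ($fh) = @_;
    local $/ = '';              # paragraph mode: a record ends at one or more blank lines
    my @sections;
    while (my $block = <$fh>) {
        chomp $block;           # in paragraph mode chomp strips all trailing newlines
        push @sections, [ split /\n/, $block ];
    }
    return @sections;
}

# Read from an in-memory filehandle so the sketch needs no input file.
open my $fh, '<', \$text or die "Can't open in-memory file: $!";
my @sections = read_sections($fh);
close $fh;

printf "%d sections, %d lines in the first\n",
    scalar @sections, scalar @{ $sections[0] };
```

Each element of @sections is then an array reference holding the lines of one section, which is the shape a later comparison step would want.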
Re^2: Best way to compare my data? by Lavezzi (Initiate) on Mar 28, 2010 at 15:37 UTC
The only reason I would prefer not to split it up into different files is that there would then be 196 files for each of the 6 different logs that I have. I have uploaded my code to my scratchpad, but obviously it doesn't work, that's why I'm here. I'm a complete beginner with Perl (and coding in general, I only started in the last week), so please don't laugh at my effort, haha. Also I hope the formatting is OK since I'm new to that too!
Re^3: Best way to compare my data? by Corion (Patriarch) on Mar 28, 2010 at 15:52 UTC
Why don't you post your code here instead?

Instead of manually splitting up your source file, you could write a program to split it up and then compare the split-up parts. You could also skip the step of writing the parts out to separate files, and compare them in memory instead of writing them out and reading them back in.

So far, I get the impression that you have not really put much thought into possible approaches. Maybe you shouldn't attack the problem as a whole, but simplify it first: If you have two files, how do you determine their differences? If you have one file with two sets of routes, how can you read that file into two memory structures? If you have one file with more than two sets of routes, how will you determine the differences?
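
One way the simplest of those questions (two routes, what differs?) could be sketched, assuming each route is a list of "hop,ip" lines. The helper names route_to_hash and route_differences are hypothetical, chosen only for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: turn "hop,ip" lines into a hop => ip hash reference.
sub route_to_hash {
    my (@lines) = @_;
    my %ip_for;
    for my $line (@lines) {
        my ($hop, $ip) = split /,/, $line, 2;
        $ip_for{$hop} = $ip;
    }
    return \%ip_for;
}

# Hypothetical helper: describe every hop whose IP differs between the two
# routes, including hops that exist in only one of them.
sub route_differences {
    my ($old, $new) = @_;
    my %hops = map { ($_ => 1) } keys %$old, keys %$new;
    my @changes;
    for my $hop (sort { $a <=> $b } keys %hops) {
        my $was = defined $old->{$hop} ? $old->{$hop} : '(none)';
        my $now = defined $new->{$hop} ? $new->{$hop} : '(none)';
        push @changes, "hop $hop: $was -> $now" if $was ne $now;
    }
    return @changes;
}

my $first  = route_to_hash('13,4.69.137.70', '14,4.69.134.70', '15,4.69.134.113');
my $second = route_to_hash('13,4.69.137.74', '14,4.69.134.70', '15,4.69.134.113');

print "$_\n" for route_differences($first, $second);
```

Keying on the hop number keeps the comparison simple, at the cost of assuming that the same hop number means the same position in both routes.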
Re^3: Best way to compare my data? by planetscape (Chancellor) on Mar 29, 2010 at 04:41 UTC
"I have uploaded my code to my scratchpad, but obviously it doesn't work, that's why I'm here."

You mean this code?:

    #!/usr/bin/perl -w
    use strict;

    my $infile  = 'JPStream.csv';
    my $outfile = 'new1.csv';

    open IN,  "< $infile"  or die "Can't open $infile : $!";
    open OUT, "> $outfile" or die "Can't open $outfile : $!";

    my %seen;
    my %seen2;

    while (<IN>) {
        next if /^$/;
        chomp;
        if ( ! $seen2{$_} ) {
            print OUT "$_ Not in the last Traceroute ^\n";
        }
        last if /^$/;
        $seen{$_}++;
        %seen2 = ();
    }

    while (<IN>) {
        next if /^$/;
        chomp;
        if ( ! $seen{$_} ) {
            print OUT "$_ Not in the last Traceroute^^\n";
        }
        last if /^$/;
        $seen2{$_}++;
        %seen = ();
    }
    }

Why not post your code in the same node as your question? It certainly isn't due to length. Please make it easier for us to help you help yourself.

HTH,
planetscape
Re: Best way to compare my data? by Perlbotics (Archbishop) on Mar 28, 2010 at 17:35 UTC
I understand that you need a hop- and section-wise comparison, so the following might get you started. However, the naive, text-based approach below will fail when the order of hops/IPs changes, e.g. through topology changes or the selection of an alternative route with more or fewer intermediate hops. If that is also of concern to you, a real network model (nodes and edges, i.e. graphs) would be better suited than a plain text comparison. HTH

    IP#1     IP#2/2b  IP#3     IP#4        Section Diff
    ==============================================================
    OK (alternate route):
    HOP1 --- HOP2  ----------- HOP3  (#1)
    HOP1 --- HOP2b ----------- HOP3  (#2)  --> @hop2: IP#2-->IP#2b

    EEK! (non-equidistant):
    HOP1 --- HOP2  ----------- HOP3  (#1)
    a) shorter route
    HOP1 --------------------- HOP2  (#2a) --> @hop2: IP#2-->IP#4 (err?)
    b) longer route
    HOP1 --- HOP2 --- HOP3 --- HOP4  (#2b) --> @hop3: IP#4-->IP#3 (err?)

The code:

    use strict;
    use warnings;

    sub prettyip {
        my $ip = shift;
        $ip =~ s/ (\d+) / sprintf("%03d",$1) /smgex;
        return $ip;
    }

    my %last_seen_ip_from_hop;  # last IP seen for key=HOP
    my $section      = 1;       # section within the file
    my $previous_hop = 0;       # previous HOP / new section event

    while (my $line = <DATA>) {
        if ($line =~ /^(\d+),(\S+)/) {  # extract HOP and IP
            my ($hop, $ip) = ($1, $2);
            my $last_ip_seen = $last_seen_ip_from_hop{$hop};
            my ($changemark_pre, $changemark_pos) = ("", "");

            # detect a new section (A/B/C)
            $section++, print "\n" if $previous_hop > $hop;  # new file/section
            $previous_hop = $hop;

            # notify if a change occurred for a given hop since last seen
            if (defined $last_ip_seen and $ip ne $last_ip_seen) {
                $changemark_pre = 'changed to';
                $changemark_pos = '(was: ' . prettyip($last_ip_seen) . ')';
            }
            $last_seen_ip_from_hop{$hop} = $ip;  # init or update current HOP/IP

            printf "sect.%2d / hop %2d: %15s %15s %s\n",
                $section, $hop, $changemark_pre, prettyip($ip), $changemark_pos;
        }
    }

    __DATA__
    13,4.69.137.70
    14,4.69.134.70
    15,4.69.134.113
    16,4.69.135.185
    17,4.69.134.246
    18,4.68.18.75
    19,4.59.0.10
    20,124.211.34.129
    21,203.181.100.61
    22,118.155.197.140
    23,124.211.10.66
    24,163.139.130.138
    25,163.139.124.57
    26,202.215.179.1
    27,202.215.179.11

    13,4.69.137.74
    14,4.69.134.70
    15,4.69.134.113
    16,4.69.135.185
    17,4.69.134.246
    18,4.68.18.11
    19,4.59.0.10
    20,124.211.34.121
    21,203.181.100.61
    22,118.155.197.140
    23,124.211.10.66
    24,163.139.130.138
    25,163.139.124.57
    26,202.215.179.1
    27,202.215.179.11

    13,4.69.137.70
    14,4.69.134.78
    15,4.69.134.125
    16,4.69.135.185
    17,4.69.134.250
    18,4.68.18.139
    19,4.59.0.10
    20,124.211.34.121
    21,203.181.100.189
    22,118.155.197.140
    23,124.211.10.66
    24,163.139.130.138
    25,163.139.124.57
    26,202.215.179.1
    27,202.215.179.11

Output:

    sect. 1 / hop 13:                 004.069.137.070
    sect. 1 / hop 14:                 004.069.134.070
    sect. 1 / hop 15:                 004.069.134.113
    sect. 1 / hop 16:                 004.069.135.185
    sect. 1 / hop 17:                 004.069.134.246
    sect. 1 / hop 18:                 004.068.018.075
    sect. 1 / hop 19:                 004.059.000.010
    sect. 1 / hop 20:                 124.211.034.129
    sect. 1 / hop 21:                 203.181.100.061
    sect. 1 / hop 22:                 118.155.197.140
    sect. 1 / hop 23:                 124.211.010.066
    sect. 1 / hop 24:                 163.139.130.138
    sect. 1 / hop 25:                 163.139.124.057
    sect. 1 / hop 26:                 202.215.179.001
    sect. 1 / hop 27:                 202.215.179.011

    sect. 2 / hop 13:      changed to 004.069.137.074 (was: 004.069.137.070)
    sect. 2 / hop 14:                 004.069.134.070
    sect. 2 / hop 15:                 004.069.134.113
    sect. 2 / hop 16:                 004.069.135.185
    sect. 2 / hop 17:                 004.069.134.246
    sect. 2 / hop 18:      changed to 004.068.018.011 (was: 004.068.018.075)
    sect. 2 / hop 19:                 004.059.000.010
    sect. 2 / hop 20:      changed to 124.211.034.121 (was: 124.211.034.129)
    sect. 2 / hop 21:                 203.181.100.061
    sect. 2 / hop 22:                 118.155.197.140
    sect. 2 / hop 23:                 124.211.010.066
    sect. 2 / hop 24:                 163.139.130.138
    sect. 2 / hop 25:                 163.139.124.057
    sect. 2 / hop 26:                 202.215.179.001
    sect. 2 / hop 27:                 202.215.179.011

    sect. 3 / hop 13:      changed to 004.069.137.070 (was: 004.069.137.074)
    sect. 3 / hop 14:      changed to 004.069.134.078 (was: 004.069.134.070)
    sect. 3 / hop 15:      changed to 004.069.134.125 (was: 004.069.134.113)
    sect. 3 / hop 16:                 004.069.135.185
    sect. 3 / hop 17:      changed to 004.069.134.250 (was: 004.069.134.246)
    sect. 3 / hop 18:      changed to 004.068.018.139 (was: 004.068.018.011)
    sect. 3 / hop 19:                 004.059.000.010
    sect. 3 / hop 20:                 124.211.034.121
    sect. 3 / hop 21:      changed to 203.181.100.189 (was: 203.181.100.061)
    sect. 3 / hop 22:                 118.155.197.140
    sect. 3 / hop 23:                 124.211.010.066
    sect. 3 / hop 24:                 163.139.130.138
    sect. 3 / hop 25:                 163.139.124.057
    sect. 3 / hop 26:                 202.215.179.001
    sect. 3 / hop 27:                 202.215.179.011
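
The "real network model" mentioned at the top of this reply could be sketched roughly as follows. This is only one possible reading of it, assuming a route is reduced to the ordered list of its IPs and that an edge is a pair of consecutive IPs. The helper names edges_of and edges_only_in are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: a route is a list of IPs in hop order; every pair of
# consecutive IPs becomes one directed edge, stored as the key "from>to".
sub edges_of {
    my (@ips) = @_;
    my %edge;
    $edge{"$ips[$_]>$ips[$_ + 1]"} = 1 for 0 .. $#ips - 1;
    return \%edge;
}

# Hypothetical helper: edges present in $these but missing from $those.
sub edges_only_in {
    my ($these, $those) = @_;
    my @only = grep { !$those->{$_} } keys %$these;
    @only = sort @only;
    return @only;
}

# Three consecutive hops from two of the sample sections; only the middle IP differs.
my $before = edges_of(qw(4.69.134.246 4.68.18.75 4.59.0.10));
my $after  = edges_of(qw(4.69.134.246 4.68.18.11 4.59.0.10));

print "removed edge: $_\n" for edges_only_in($before, $after);
print "added edge:   $_\n" for edges_only_in($after,  $before);
```

Because edges are keyed on the IPs themselves rather than on hop numbers, a route that gains or loses an intermediate hop only reports the edges around the change, instead of flagging every later hop as different.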
Re: Best way to compare my data? by GrandFather (Saint) on Mar 28, 2010 at 19:48 UTC
Parse the file into individual route blocks so that you end up with an array (routes) of arrays (nodes in a route). Then use Algorithm::Diff to compare pairs of routes to pull out the difference information you require. If you had shown us your code, I'd have shown you mine.

True laziness is hard work
Re^2: Best way to compare my data? by Lavezzi (Initiate) on Mar 29, 2010 at 16:26 UTC
Sorry, I read in the FAQ that it was best to post code in the scratchpad so it was easier for other users to see! Most of these suggestions are going way over my head; like I said, I'm not that well versed in Perl at all!
Re^3: Best way to compare my data? by planetscape (Chancellor) on Apr 01, 2010 at 10:43 UTC
Scratchpads are transitory. Code is best placed in its relevant thread for the benefit of those who come days, weeks, months after the scratchpad has been altered or cleared. I think if you carefully re-read the FAQ, it actually suggests posting code on your scratchpad if you have asked (or are about to ask) a longish question in the Chatterbox.

HTH,
planetscape
Re: Best way to compare my data? by GrandFather (Saint) on Mar 29, 2010 at 20:42 UTC
The following uses Algorithm::Diff to do the heavy lifting:

    use strict;
    use warnings;
    use Algorithm::Diff;

    my @routes;

    local $/ = "\n\n";
    push @routes, $_ while $_ = <DATA>;
    chomp @routes;
    @routes = map {[split "\n"]} @routes;

    my @reference  = @{shift @routes};
    my $lenChanges = 0;
    my $hopChanges = 0;

    for my $route (@routes) {
        my @diffs = Algorithm::Diff::diff(\@reference, \@$route);

        next if !@diffs;
        @reference != @$route ? ++$lenChanges : ++$hopChanges;
    }

    print "Length changes: $lenChanges\n";
    print "Hop changes: $hopChanges\n";

    __DATA__
    13,4.69.137.70
    14,4.69.134.70
    15,4.69.134.113
    16,4.69.135.185
    17,4.69.134.246
    18,4.68.18.75
    19,4.59.0.10
    20,124.211.34.129
    21,203.181.100.61
    22,118.155.197.140
    23,124.211.10.66
    24,163.139.130.138
    25,163.139.124.57
    26,202.215.179.1
    27,202.215.179.11

    13,4.69.137.74
    14,4.69.134.70
    15,4.69.134.113
    16,4.69.135.185
    17,4.69.134.246
    18,4.68.18.11
    19,4.59.0.10
    20,124.211.34.121
    21,203.181.100.61
    22,118.155.197.140
    23,124.211.10.66
    24,163.139.130.138
    25,163.139.124.57
    26,202.215.179.1
    27,202.215.179.11

    13,4.69.137.70
    14,4.69.134.78
    15,4.69.134.125
    16,4.69.135.185
    17,4.69.134.250
    18,4.68.18.139
    19,4.59.0.10
    20,124.211.34.121
    21,203.181.100.189
    22,118.155.197.140
    23,124.211.10.66
    24,163.139.130.138
    25,163.139.124.57
    26,202.215.179.1
    27,202.215.179.11

    13,4.69.137.74
    14,4.69.134.70
    15,4.69.134.113
    16,4.69.135.185
    17,4.69.134.246
    18,4.68.18.11
    19,4.59.0.10
    20,124.211.10.120
    20,124.211.26.120
    21,203.181.100.61
    22,118.155.197.140
    23,124.211.10.66
    24,163.139.130.138
    25,163.139.124.57
    26,202.215.179.1
    27,202.215.179.11

Prints:

    Length changes: 1
    Hop changes: 2

Note too the use of the $/ record separator special variable to ease the parsing of the file into records.

True laziness is hard work
Re^2: Best way to compare my data? by Lavezzi (Initiate) on Apr 01, 2010 at 22:39 UTC
Just wanted to say that I very much appreciate all of the help that has been given to me in this thread. The reason I haven't replied is that I don't have the time to look at this project at the moment, as I have more urgent work to look at in the meantime! So thanks again!