Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am hoping for your help with what should be quite a simple problem, but I am having issues. I have a large dataset showing connections between individuals in a community. What I am trying to do is print out only the data on unreciprocated connections. So, in the example below, I would not want the first two lines because personA and personB have a reciprocal connection. But I would want the final three lines because the connection only goes one way.

I used the following script and thought that the second "if clause" would mean that instances such as the personA and personB pairing (i.e. the first two lines) were not printed, but instead it prints all of my data lines.

use strict; use warnings; while(<DATA>) { my $comment_author1; my $comment_author2; my $interactions; if( $_ =~ /(.*?),(.*?),(.*?)$/gm) { $comment_author1 = $1; $comment_author2 = $2; $interactions = $3; if( $_ =~ "$comment_author2,$comment_author1") { print "no match\n"; } else { print "$comment_author1,$comment_author2,$interactions\n"; } } } __DATA__ personA,personB,10 personB,personA,190 personA,personC,23 personA,personD,43 personE,personF,10
I imagine that the solution will be very simple, but I keep going round in circles without getting anywhere, and a second pair of eyes would be very helpful.

Thank you!

Replies are listed 'Best First'.
Re: print lines which are not reverse duplicates
by haukex (Archbishop) on Nov 25, 2018 at 14:24 UTC

    What you've got sounds like a directed graph, and CPAN has the module Graph. I'm also using Text::CSV to make parsing and output more robust (also install Text::CSV_XS for speed).

    input.csv:

    personA,personB,10 personB,personA,190 personA,personC,23 personA,personD,43 personE,personF,10

    Code:

    #!/usr/bin/env perl use warnings; use strict; use Text::CSV; use Graph; my $filename = 'input.csv'; my $g = Graph->new(directed=>1); open my $fh, '<', $filename or die "$filename: $!"; my $csv = Text::CSV->new({ binary=>1, auto_diag=>2, eol=>$/ }); while ( my $row = $csv->getline($fh) ) { my ($author1, $author2, $interactions) = @$row; $g->set_edge_attribute( $author1, $author2, # auto-creates edge 'interactions', $interactions ); } $csv->eof or $csv->error_diag; close $fh; for my $e ($g->edges) { my ($author1, $author2) = @$e; next if $g->has_edge($author2, $author1); my $interactions = $g->get_edge_attribute( $author1, $author2, 'interactions' ); $csv->print(select, [ $author1, $author2, $interactions ]); }

    Output:

    personE,personF,10 personA,personD,43 personA,personC,23

    Update: The above does not handle the case of duplicates in the input. Is that a concern for you, and if yes, what happens with the "interactions"? Are they supposed to be summed up?

      Brilliant, thank you very much for this! Duplicates are not an issue for me this time, an earlier code summed them up and the values formed the interaction columns. Thanks again.
Re: print lines which are not reverse duplicates
by tybalt89 (Monsignor) on Nov 25, 2018 at 16:40 UTC
    #!/usr/bin/perl # https://perlmonks.org/?node_id=1226286 use strict; use warnings; my %keep; while( <DATA> ) { /(\w+),(\w+),/ or next; $keep{$2, $1} and delete $keep{$2, $1} or $keep{$1, $2} = $_; } print sort values %keep; __DATA__ personA,personB,10 personA,personC,23 personA,personD,43 personB,personA,190 personE,personF,10

    Or, of course, the regex version :)

    #!/usr/bin/perl # https://perlmonks.org/?node_id=1226286 use strict; use warnings; local $_ = do { local $/, <DATA> }; 1 while s/^(\w+),(\w+),.*\n((?:.*\n)*?)^\2,\1,.*\n/$3/gm; print; __DATA__ personA,personB,10 personA,personC,23 personA,personD,43 personB,personA,190 personE,personF,10
      1 while s/^(\w+),(\w+),.*\n((?:.*\n)*?)^\2,\1,.*\n/$3/gm;

      Awesome, as usual. Thanks for all your regex examples which I for one invariably marvel at.

      Thank you for this!
Re: print lines which are not reverse duplicates
by hippo (Archbishop) on Nov 25, 2018 at 11:47 UTC

    It should become pretty clear why the match fails if you add one diagnostic print to display both the search pattern and the string to be searched in.

    use strict; use warnings; while(<DATA>) { my $comment_author1; my $comment_author2; my $interactions; if( $_ =~ /(.*?),(.*?),(.*?)$/gm) { $comment_author1 = $1; $comment_author2 = $2; $interactions = $3; print "DEBUG: Looking for '$comment_author2,$comment_author1' in $ +_\n"; if( $_ =~ "$comment_author2,$comment_author1") { print "no match\n"; } else { print "$comment_author1,$comment_author2,$interactions\n"; } } } __DATA__ personA,personB,10 personB,personA,190 personA,personC,23 personA,personD,43 personE,personF,10

    Do you see now where your logic has gone wrong? I think if you had used consistent indentation it might have been clear from the outset:

    use strict; use warnings; while (<DATA>) { my $comment_author1; my $comment_author2; my $interactions; if ($_ =~ /(.*?),(.*?),(.*?)$/gm) { $comment_author1 = $1; $comment_author2 = $2; $interactions = $3; print "DEBUG: Looking for '$comment_author2,$comment_author1' +in $_\n"; if ($_ =~ "$comment_author2,$comment_author1") { print "no match\n"; } else { print "$comment_author1,$comment_author2,$interactions\n"; } } } __DATA__ personA,personB,10 personB,personA,190 personA,personC,23 personA,personD,43 personE,personF,10
      It seems like we interpreted the question differently. :)

      While you replied to the "code", I replied to the "data"

      When looking at the OP's code ...

      $_ =~ "$comment_author2,$comment_author1"

      it seems like the OP wants to exclude lines of the form personA,personA,10 , i.e. 1. and 2. person are equal

      But in the description he asked to exclude the first two lines:

      personA,personB,10 personB,personA,190

      where the relation was reversed.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      <puzzled>That still prints all the lines. How should $_ also contain the reverse of what it contains?</puzzled>


      The way forward always starts with a minimal test.

        The code in my post is not intended to be a solution, 1nickt, as it is functionally no different from that in the OP. It's an example of debugging/diagnosing the problem and is intended to set Maire on the path. Your question here is precisely what I was hoping that Maire would ask.

      Ah, I knew that I was doing something really silly, but I couldn't wrap my head around what was going wrong. Thank you for the clarification!
Re: print lines which are not reverse duplicates
by LanX (Saint) on Nov 25, 2018 at 10:23 UTC
    Maybe try putting something like this into your loop
    unless ($seen{$comment_author2}){ $seen{$comment_author1} = $comment_author2; } else { delete $seen{$comment_author2}; }

    at the end the hash should only contain unique pairs.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

    Update

    You can put any data into the hash, something like =[$comment_author1,$interactions] would keep more informations if needed.

      Thanks!
Re: print lines which are not reverse duplicates
by 1nickt (Canon) on Nov 25, 2018 at 11:59 UTC

    Hi, here's one way*:

    use strict; use warnings; use feature 'say'; my %keep; for my $line ( <DATA> ) { chomp $line; my ( $P1, $P2, $I ) = split /,/, $line; $keep{"$P1$P2"} = $line; if ( $keep{"$P2$P1"} ) { delete $keep{"$P2$P1"}; delete $keep{"$P1$P2"}; } } say for values %keep; __DATA__ personG,personH,42 personA,personB,10 personB,personA,190 personA,personC,23 personA,personD,43 personE,personF,10
    Output:
    $ perl 1226286.pl personG,personH,42 personE,personF,10 personA,personD,43 personA,personC,23
    ... add sorting etc. as you wish ...

    Hope this helps!

    (*Disclaimer: pre-coffee!)


    The way forward always starts with a minimal test.
      Thank you very much!