jjohhn has asked for the wisdom of the Perl Monks concerning the following question:

I have a tab-delimited file; the first field is the primary key of the table ($descid), the second is a different ID ($conid), and the third is a string with (possibly) non-ascii characters.

I am trying to capture my matches of non-ascii characters from the string field into a hash to count each of those matches. The list of all character matches from a single run through the inner loop (a single row of the table) is to be the value of another hash keyed on $descid. Finally, $conid is to be the key to a hash whose value is the list of $descid in the table as a whole.

In other words, I want an inner hash counting character matches, a list of hashes, a hash of lists keyed on $descid, and a hash of hashes keyed on $conid.

I can capture the characters in the code shown. How do I capture the list of characters matched from a single row of the table and put them into an anonymous array to become the value of the outer hash? I have looked through the camel, the lol tutorial, and other places, but I don't see the answer. I tried making the inner loop:

while(@rowmatches = /[^\x{1}-\x{7f}]/go){ ...
thinking that capturing the matches in list context would do it, but it didn't.
my %chars;
# my $descid;
# my $conid;
while (<>) {
    while (/[^\x{1}-\x{7f}]/go) {
        ++$chars{$&};
    }
}
foreach my $char (keys %chars) {
    print "$char found $chars{$char} times\n";
}
print "found " . keys(%chars) . " distinct non-ascii chars\n";

Replies are listed 'Best First'.
Re: capturing the numerous hits from a global match into nested data
by BrowserUk (Patriarch) on Mar 04, 2003 at 01:22 UTC

    Congratulations! You win the prize for the Most-Confusing-Question Award:)

    If I understand you correctly, this may get you started.

    #! perl -slw
    use strict;
    use Data::Dumper;

    sub rndStr{local $"=''; "@_[map{rand @_} 0 .. shift]"; } #!"

    my @lines;
    push @lines, "desc$_\tconid@{[int rand 10]}\t". rndStr(30, map chr, 32 ..255) for 1..100;

    my (%lineChars, %totalChars, %conIDs);
    for (@lines) {
        my ($descID, $conID, $string) = split /\t/;
        print "$descID, $conID, '$string'";
        push @{$conIDs{$conID}}, $descID;
        my %nonASCII;
        $nonASCII{$1}++ while $string =~ m[([^\x01-\x7f])]cog;
        $lineChars{$descID} = [ keys %nonASCII ];
        $totalChars{$_} += $nonASCII{$_} for keys %nonASCII;
    }

    print Dumper \%conIDs, \%totalChars, \%lineChars;

    I've just generated some random data into an array to test it with.

    The output of the for (which would probably be a while(<>) loop in your case), is three hashes.

    • %conIDs which relates each conID to an array of the descIDs it was associated with.
    • %totalChars which gives the total counts for each non-ASCII char found in the strings
    • %lineChars giving an array of the chars found in each string indexed by descID.
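For readers who want to see the shape of those three hashes without random data, here is a minimal sketch on three fixed rows. The row contents and the IDs (d1, c1, and so on) are made up for illustration and are not from the original table; the character class is the same one used in the reply above.

```perl
use strict;
use warnings;

# Hypothetical sample rows: descID, conID, string (tab-separated).
my @rows = (
    "d1\tc1\tna\xefve caf\xe9",
    "d2\tc1\tplain ascii",
    "d3\tc2\t\xe9t\xe9",
);

my (%conIDs, %totalChars, %lineChars);
for my $row (@rows) {
    my ($descID, $conID, $string) = split /\t/, $row, 3;

    # conID => [ descIDs seen with it ]
    push @{ $conIDs{$conID} }, $descID;

    # Count the non-ASCII characters in this row only.
    my %seen;
    $seen{$1}++ while $string =~ /([^\x01-\x7f])/g;

    # descID => [ distinct non-ASCII chars in its string ]
    $lineChars{$descID} = [ sort keys %seen ];

    # Running totals across the whole table.
    $totalChars{$_} += $seen{$_} for keys %seen;
}

printf "%s => %s\n", $_, join(',', @{ $conIDs{$_} }) for sort keys %conIDs;
printf "%s has %d distinct non-ASCII chars\n", $_, scalar @{ $lineChars{$_} }
    for sort keys %lineChars;
printf "0x%02X seen %d times\n", ord($_), $totalChars{$_} for sort keys %totalChars;
```

A row with no non-ASCII characters still gets an entry in %lineChars (an empty array), which may or may not be what you want.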

    Of course it's quite probable that I've completely misread you and this is nothing like what you're after, but it might suggest some ideas to you.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
      Bingo and thank you! What I precisely needed was
      push @{$conIDs{$conid}}, $descid;
      after pulling those two fields from the table. I can print everything I want to know, so the rest of the nesting must be working.

      In less than 50 times the number of minutes it took you to figure out what I was asking, dummy up a random array, and write the code, I was able to solve this :)

Re: capturing the numerous hits from a global match into nested data
by zengargoyle (Deacon) on Mar 04, 2003 at 02:58 UTC

    you're close!

    my $string = "abcd\x66\x69\x8f\xfe";
    print "string is ",length($string), " characters long.", $/;
    my @matches = $string =~ /([^\x01-\x7f])/go;
    print "string has ",scalar(@matches), " characters in \\x80-\\xff.", $/;
    print "the string$/";
    hex_print($string);
    my $count;
    printf("match %02d has value %02X$/", $count++, ord $_) for @matches;

    sub hex_print {
        my $string = shift;
        for my $a ( 0, 1 ) {
            print(
                map( { substr $_, $a, 1 }
                     map { sprintf "%02X", ord $_ }
                     split //, $string ),
                $/
            );
        }
    }
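The key point in the code above seems to be that a //g match in list context returns every match in one go, so a single assignment is enough and no while loop is needed. As far as I can tell, putting that assignment in a while condition restarts the match from the beginning of the string on every pass, so it never ends once a row contains a match. A minimal sketch of the one-shot form, with a made-up row and a hypothetical key 'd1' standing in for $descid:

```perl
use strict;
use warnings;

# Made-up row content for illustration.
my $row = "caf\xe9 na\xefve \xe9";

# List context: one assignment collects every match from this row.
my @rowmatches = $row =~ /[^\x01-\x7f]/g;

# Count how often each character occurred.
my %chars;
$chars{$_}++ for @rowmatches;

# Store a copy of the row's matches as an anonymous array.
# 'd1' is a hypothetical stand-in for $descid.
my %by_descid;
$by_descid{'d1'} = [ @rowmatches ];

print scalar(@rowmatches), " matches, ", scalar(keys %chars), " distinct\n";
```

With no capturing parentheses in the pattern, the list holds the whole matched text of each hit; with parentheses it holds the captured groups instead.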
Re: capturing the numerous hits from a global match into nested data
by Jaap (Curate) on Mar 04, 2003 at 00:02 UTC
    Your question is too complex. Try to break it down into several smaller questions like:
    1. How can I match any character but a tab?
    2. How can I find unique strings in a hash?
    Or something like that.
    Short, to-the-point questions will be answered faster and better on perlmonks.
      Thank you for your patience. I have a large table relating "synonym_ids", "concept_ids", and "synonym_strings". The table is ordered by synonym_id. One concept can have many synonyms spread throughout the table.

      I am interested in 400 of the 1 million synonym_strings. I loop through the table and find one of the strings I want:

      while (<>) {
          if (/pattern_I_want/) {
              ($syn_id, $con_id) = /^(\d+)\t\d\t(\d+)/;
              # now that I'm here, pull out pieces of the strings
              while (/(pattern_I_want)/g) {
                  ++$chars{$1};
                  push @line, $1;
                  ...
      I want to make a hash of concept_id -> synonym_id. The concept_ids are scattered throughout the table. That hash is the smallest piece of the problem I can describe right now. Eventually all of these elements I have described will be in a nested structure; the top level of that nested structure is the hash of distinct con_ids to syn_ids.
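One possible shape for that nested structure is a hash of hashes of arrays: concept_id at the top, synonym_id below it, and the pieces matched on that row at the bottom. The sketch below is only an illustration under assumptions: the three-column layout, the sample rows, and the use of the non-ASCII character class as a stand-in for pattern_I_want are all made up.

```perl
use strict;
use warnings;

# Hypothetical rows: synonym_id, concept_id, synonym_string (tab-separated).
my @table = (
    "101\t7\tcaf\xe9",
    "102\t9\tplain",
    "103\t7\tna\xefve \xe9lan",
);

my %concepts;   # concept_id => { synonym_id => [ matched pieces ] }
for my $line (@table) {
    my ($syn_id, $con_id, $string) = split /\t/, $line, 3;

    # Stand-in for pattern_I_want: any character outside \x01-\x7f.
    my @pieces = $string =~ /([^\x01-\x7f])/g;
    next unless @pieces;    # skip rows that are not of interest

    # Autovivification creates the inner hash on first use.
    $concepts{$con_id}{$syn_id} = \@pieces;
}

for my $con_id (sort { $a <=> $b } keys %concepts) {
    my @syn_ids = sort { $a <=> $b } keys %{ $concepts{$con_id} };
    print "concept $con_id: synonyms @syn_ids\n";
}
```

Because the concept_id is the top-level key, it does not matter that the rows for one concept are scattered through the table; each row simply adds another synonym_id under the same concept.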