wow!

This was trivial now that it's done:

#require 5.6;
use strict;
use warnings;
use utf8;

my %chars;
my %descids;
while (<>) {
    while ( /[^\x{1}-\x{7f}]/g ) {
        ++$chars{$&};
    }
}
foreach my $char (keys %chars) {
    print "$char found $chars{$char} times\n";
}
print "found " . keys(%chars) . " distinct non-ascii chars\n";

I had to comment out the perl version requirement: I have 5.6.1, and the compiler complained about that line (that should be easy to fix; a bare 5.6 is read as version 5.600, so require 5.006; is probably what is needed). I next checked whether any characters lie outside ISO-8859-1 by changing the upper end of the regex range to \x{ff}, and got zero. That is the practical result of this whole exercise: this 58k tab-delimited text file will be easier to import into various DBMSs if the user knows that all of its characters lie within the range of Latin-1.
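For anyone repeating the Latin-1 check on a newer perl, here is a rough sketch of one way to do it. It assumes perl 5.8 or later (where the Encode module is available, which my 5.6.1 does not have) and assumes the file is UTF-8 encoded; the sub name count_beyond_latin1 and the sample strings are made up for illustration. Decoding first matters because a raw byte can never exceed \x{ff}, so an undecoded check would always report zero.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: counts characters above U+00FF in raw UTF-8 byte strings.
sub count_beyond_latin1 {
    my (@raw_lines) = @_;
    my $count = 0;
    for my $raw (@raw_lines) {
        my $text = decode('UTF-8', $raw);    # bytes -> characters (assumes UTF-8 input)
        # scalar-context /g: one iteration per character outside Latin-1
        $count++ while $text =~ /[^\x{0}-\x{ff}]/g;
    }
    return $count;
}

# Made-up samples: e-acute (U+00E9) is inside Latin-1, the euro sign (U+20AC) is not.
print count_beyond_latin1("caf\xc3\xa9\n"), "\n";
print count_beyond_latin1("\xe2\x82\xac 5\n"), "\n";
```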

On my perl (5.6.1), the use utf8 directive proved absolutely essential; without it, the \x{...} Unicode hex notation was not accepted in the regex.

The inner while loop (vs. an if statement) around the regex is still a little unclear to me; I guess the match with the /g modifier returns a list, and an if statement would only check the scalar return. Would something like this capture all the matches on a single line into a list?

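my (%chars, %hash_of_lists);
while (<>) {
    # In list context /g returns every match on the line at once,
    # so this is an "if"; a "while" here would never terminate.
    if ( my @matches = /[^\x{1}-\x{7f}]/g ) {
        # placeholder pattern; the capture group supplies the column value
        my ($conid) = /(pattern-to-find-this-column)/;
        $hash_of_lists{$conid} = [@matches];
        # linking this with inner hash of found characters is fuzzy but near..
        ++$chars{$_} for @matches;
    }
}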
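A small sketch of how the two contexts behave, using a made-up sample string (the variable names are only for illustration): in scalar context, such as a while condition, /g steps through one match per evaluation, while in list context a single evaluation returns every match.

```perl
use strict;
use warnings;

my $line = "na\x{ef}ve caf\x{e9}";    # made-up sample line with two non-ascii chars

# Scalar context: each evaluation finds the next match,
# so the while loop visits the matches one at a time.
my @one_by_one;
while ( $line =~ /([^\x{1}-\x{7f}])/g ) {
    push @one_by_one, $1;
}

# List context: one evaluation returns all matches together.
my @all_at_once = $line =~ /[^\x{1}-\x{7f}]/g;

printf "%d matches via while, %d via list assignment\n",
    scalar @one_by_one, scalar @all_at_once;
```

An if in place of the while would test only whether there is a first match, which is why the counting loop needs the while.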

My next task is to build some data structures. At the top level are concept_ids (one of the fields in this table). Each concept_id is associated with numerous description_ids (the primary key of this table). Each row of this table (each description_id) could have numerous non-ascii characters, each associated with a frequency.

I intend to collect this all into a hash of lists of hashes of hashes.

The inner hash holds the non-ascii characters and their frequencies. Each hash in the list is one row of the table with its non-ascii characters; a row could have a number of distinct non-ascii characters in it. And the hash of lists maps each unique concept_id to its numerous description_ids. After I have that, I'll want the individual words containing those characters collected and reported somehow, but that will come last.
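A rough sketch of one possible shape for that structure, not a finished design. It assumes each row has already been split into a hash ref; the sub name collect_non_ascii and the key names description_id, chars and text are invented for illustration, and the sample rows are made up rather than taken from the real table.

```perl
use strict;
use warnings;

# Hypothetical helper: takes rows as hash refs with concept_id,
# description_id and text keys (the key names are assumptions).
sub collect_non_ascii {
    my (@rows) = @_;
    my %by_concept;    # concept_id => [ { description_id => ..., chars => {...} }, ... ]
    for my $row (@rows) {
        my %freq;      # non-ascii character => count within this row
        ++$freq{$_} for $row->{text} =~ /[^\x{1}-\x{7f}]/g;
        next unless %freq;    # skip rows that are pure ASCII
        push @{ $by_concept{ $row->{concept_id} } },
            { description_id => $row->{description_id}, chars => \%freq };
    }
    return \%by_concept;
}

# Made-up sample rows.
my $result = collect_non_ascii(
    { concept_id => 'c1', description_id => 'd1', text => "caf\x{e9} cr\x{e8}me br\x{fb}l\x{e9}e" },
    { concept_id => 'c1', description_id => 'd2', text => "plain ascii" },
    { concept_id => 'c2', description_id => 'd3', text => "na\x{ef}ve" },
);
printf "%s has %d row(s) with non-ascii characters\n", $_, scalar @{ $result->{$_} }
    for sort keys %$result;
```

The frequency of a character in a given row is then reachable as $result->{$concept_id}[$i]{chars}{$char}.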

This will take some thinking; I'm taking a company trip tomorrow and can work this out in the hotel. I might not be able to post for about a week, depending on internet access.

Your help is much appreciated.

John

In reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by jjohhn
in thread regex for utf-8 by jjohhn
