wow!

This was trivial now that it's done:

    #require 5.6;
    use strict;
    use warnings;
    use utf8;

    my %chars;
    my %descids;

    while (<>) {
        while ( /[^\x{1}-\x{7f}]/g ) {
            ++$chars{$&};
        }
    }

    foreach my $char (keys %chars) {
        print "$char found $chars{$char} times\n";
    }
    print "found " . keys(%chars) . " distinct non-ascii chars\n";

I had to comment out the Perl version requirement: I have 5.6.1, and the compiler complained about it (that should be easy to fix). Next I checked whether any characters lie outside ISO-8859-1 by extending the regex range up to \x{ff}, and got zero. That is the practical result of this whole exercise: this 58k tab-delimited text file will be easier to import into various database systems if the user knows that all of its characters lie within Latin-1.
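
The Latin-1 check can be sketched as a small script. This is only a sketch under stated assumptions: the helper name chars_beyond and the sample string are hypothetical, it compares code points with ord instead of changing the regex range, and it assumes the text is already a character string rather than raw bytes. It also writes the version as 5.006, because Perl reads the numeric literal 5.6 as version 5.600, which is probably why a 5.6.1 interpreter rejects "require 5.6".

```perl
require 5.006;    # numeric form of 5.6.0; a bare 5.6 would mean 5.600
use strict;
use warnings;

# Hypothetical helper: count each character in $text whose code point
# is above $limit (0x7f for non-ASCII, 0xff for beyond Latin-1).
sub chars_beyond {
    my ($text, $limit) = @_;
    my %count;
    for my $ch (split //, $text) {
        ++$count{$ch} if ord($ch) > $limit;
    }
    return \%count;
}

# Sample text written with escapes so this source file stays pure ASCII.
my $sample = "caf\x{e9} na\x{ef}ve caf\x{e9}";

my $non_ascii     = chars_beyond($sample, 0x7f);
my $beyond_latin1 = chars_beyond($sample, 0xff);

printf "%d distinct non-ASCII chars, %d beyond Latin-1\n",
    scalar(keys %$non_ascii), scalar(keys %$beyond_latin1);
```

An empty result for the 0xff limit corresponds to the "got zero" outcome: every non-ASCII character fits in Latin-1.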

The use utf8 directive is absolutely essential on my perl; without it, the Unicode \x{...} hex notation is not allowed in the regex.
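
For anyone trying this on Perl 5.8 or later (an assumption; the script above was run on 5.6.1), what matters for the match is whether the input has been decoded into characters. The sketch below uses the core Encode module; the helper name non_ascii and the sample bytes are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return every non-ASCII item matched in $string.
sub non_ascii {
    my ($string) = @_;
    return $string =~ /[^\x{00}-\x{7f}]/g;
}

my $bytes = "caf\xc3\xa9";             # UTF-8 encoded bytes for "cafe" with e-acute
my $chars = decode('UTF-8', $bytes);   # decoded: one element per character

my @raw     = non_ascii($bytes);       # matches each of the two raw bytes
my @decoded = non_ascii($chars);       # matches the single character U+00E9

printf "undecoded: %d matches, decoded: %d match(es)\n",
    scalar(@raw), scalar(@decoded);
```

Counting on undecoded input reports bytes, not characters, so the per-character frequencies would come out wrong.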

The inner while loop (vs. an if statement) around the regex is still a little unclear to me; I guess the match with the /g modifier returns a list, and the if statement would only check the scalar return. Would something like this capture all the matches in a single line into a list?

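    while (<>) {
        # In list context, a /g match returns every match on the line at once,
        # so an if is enough here; a while would see the same list every time.
        if (my @matches = /[^\x{1}-\x{7f}]/g) {
            # Parentheses capture the column; list assignment keeps the text
            # instead of a true/false result.
            my ($conid) = /(pattern-to-find-this-column)/;
            $hash_of_lists{$conid} = [@matches];
            # linking this with the inner hash of found characters is fuzzy but near...
            ++$chars{$_} for @matches;
        }
    }
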
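
The difference between the two contexts can be seen in a short, self-contained sketch. The sample line is hypothetical, and a capture group with $1 is used in place of $&, which can slow down every regex in older perls.

```perl
use strict;
use warnings;

# Hypothetical sample line, written with escapes to keep the source ASCII.
my $line = "na\x{ef}ve caf\x{e9}, caf\x{e9}";

# Scalar context: each pass of the loop finds the next match,
# resuming where the previous one ended.
my %chars;
while ($line =~ /([^\x{00}-\x{7f}])/g) {
    ++$chars{$1};
}

# List context: a single evaluation returns every match at once,
# so it belongs in an assignment (or an if), not a while condition.
my @matches = $line =~ /[^\x{00}-\x{7f}]/g;

printf "%d matches, %d distinct\n", scalar(@matches), scalar(keys %chars);
```

So the scalar-context while loop and the list-context assignment find the same matches; they differ only in whether the matches arrive one at a time or all together.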
My next task is to build some data structures. At the top level are concept_ids (one of the fields in this table). Each concept_id is associated with numerous description_ids (the primary key of this table). Each row of this table (each description_id) could have numerous non-ASCII characters, each associated with a frequency.

I intend to collect this all into a hash of lists of hashes of hashes.

The inner hash is the non-ascii characters and their frequency. The list of hashes is the row of the table with its non-ascii characters; each row could have a number of distinct non-ascii characters in it. And the hash of lists is the unique concept_id associated with numerous description_ids. After I have that, I'll want the individual words with the characters also collected and reported somehow, but that will come last.
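
One possible shape for that structure is sketched below. Everything about the layout is an assumption made for illustration: the sample rows, the column order (description_id first, concept_id second), and the key names description_id and chars are all hypothetical and would need adjusting to the real table.

```perl
use strict;
use warnings;

# Hypothetical tab-separated rows: description_id, concept_id, then text.
my @rows = (
    "101\t7\tcaf\x{e9} au lait",
    "102\t7\tna\x{ef}ve r\x{e9}sum\x{e9}",
    "103\t8\tplain ascii",
);

# concept_id => [ { description_id => ..., chars => { char => count } }, ... ]
my %by_concept;

for my $row (@rows) {
    my ($description_id, $concept_id, @rest) = split /\t/, $row;
    my $text = join "\t", @rest;

    # Inner hash: each non-ASCII character in this row and its frequency.
    my %freq;
    ++$freq{$_} for $text =~ /[^\x{00}-\x{7f}]/g;
    next unless %freq;    # skip rows with no non-ASCII characters

    push @{ $by_concept{$concept_id} },
        { description_id => $description_id, chars => \%freq };
}

# Report the total number of non-ASCII characters per row, grouped by concept.
for my $concept_id (sort keys %by_concept) {
    for my $entry (@{ $by_concept{$concept_id} }) {
        my $total = 0;
        $total += $_ for values %{ $entry->{chars} };
        print "concept $concept_id, description $entry->{description_id}: ",
              "$total non-ASCII chars\n";
    }
}
```

Keeping the description_id inside each list element means the list can stay in file order; a hash keyed by description_id in place of the list would be an alternative if lookups by row are needed.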

This will take some thinking; I'm taking a company trip tomorrow and can work this out in the hotel. I might not be able to post for about a week, depending on internet access.

Your help is much appreciated.

John


In reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by jjohhn
in thread regex for utf-8 by jjohhn
