in reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
in thread regex for utf-8
This was trivial now that it's done:

    #require 5.6;
    use strict;
    use warnings;
    use utf8;
    my %chars;
    my %descids;
    while (<>) {
        # scalar-context /g: each pass picks up the next non-ascii character
        while ( /[^\x{1}-\x{7f}]/g) {
            ++$chars{$&};
        }
    }
    foreach my $char (keys %chars){
        print "$char found $chars{$char} times\n";
    }
    print "found ". keys(%chars) . " distinct non-ascii chars\n";

I had to comment out the line requiring the Perl version: I have 5.6.1, and the compiler complained about it (that should be easy to fix). Next I checked whether any characters lie outside ISO-8859-1 by extending the regex range up to \x{ff}, and got zero. That is the practical result of this whole exercise: this 58k tabbed text file will be easier to import into various DBMSs if the user knows that all the characters lie within the range of Latin-1.
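The Latin-1 check described above can be sketched roughly as follows. This is only an illustration on made-up sample strings (the `@lines` data and the `$beyond` counter are hypothetical, not taken from the real file), and it assumes the text has already been read in as characters rather than raw bytes.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines of the real file.
my @lines = ( "caf\x{e9}", "smile \x{263a}", "plain" );

my $beyond = 0;    # characters that fall outside Latin-1
for my $line (@lines) {
    # List-context /g returns every character above \x{ff} on this line.
    my @hits = $line =~ /[^\x{0}-\x{ff}]/g;
    $beyond += @hits;
}

print "found $beyond characters outside Latin-1\n";
```

A count of zero over the whole file would correspond to the result reported above; the sample data here deliberately contains one character (\x{263a}) outside the range.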
On 5.6.1 at least, the use utf8 directive is absolutely essential; the Unicode hex notation is not allowed in the regex without it.
The inner while loop (vs an if statement) around the regex is still a little unclear to me; I guess the match with the /g modifier returns a list, and the if statement would only check the scalar return. Would something like this capture all the matches in a single line into a list?

    while(<>){
        while (my @matches = /[^\x{1}-\x{7f}]/g){
            $conid = /pattern-to-find-this-column/;
            $hash_of_lists{$conid} =[@matches];
            # linking this with inner hash of found characters is fuzzy but near...
            ++$chars{$&};
        }
    }

My next task is to make some data structures; at the top level are concept_ids (one of the fields in this table). Each concept_id is associated with numerous description_ids (the primary key of this table). Each row of this table (each description_id) could have numerous non-ascii characters, each associated with a frequency.
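On the earlier question about the inner while loop versus a list assignment, here is a minimal sketch of the two contexts, using a made-up sample string (`$line` is hypothetical). In list context a /g match returns all matches at once; in scalar context each evaluation finds the next match, which is why the original loop needs a while rather than an if. As far as I can tell, putting the list assignment inside a while condition would not terminate on a line that has a match, because each evaluation starts over and returns the same non-empty list.

```perl
use strict;
use warnings;

# Hypothetical sample line with four non-ascii characters (two distinct).
my $line = "caf\x{e9} na\x{ef}ve r\x{e9}sum\x{e9}";

# List context: a single match operation returns every match.
my @all = $line =~ /[^\x{1}-\x{7f}]/g;

# Scalar context: each evaluation resumes where the last match ended,
# so the loop visits the matches one at a time.
my %chars;
while ( $line =~ /([^\x{1}-\x{7f}])/g ) {
    ++$chars{$1};
}

printf "list context found %d matches\n", scalar @all;
printf "scalar loop saw %d distinct characters\n", scalar keys %chars;
```

So a plain `my @matches = /.../g;` per line, with no surrounding while, is the form that collects everything into a list.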
I intend to collect this all into a hash of lists of hashes of hashes.
The inner hash is the non-ascii characters and their frequency. The list of hashes is the row of the table with its non-ascii characters; each row could have a number of distinct non-ascii characters in it. And the hash of lists is the unique concept_id associated with numerous description_ids. After I have that, I'll want the individual words with the characters also collected and reported somehow, but that will come last.
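One possible shape for that structure, as a rough sketch only. The column layout (concept_id first, description_id second, text third), the sample rows, and the names `%by_concept`, `description_id` and `chars` are all assumptions made for illustration; the real table will differ.

```perl
use strict;
use warnings;

# Hypothetical tab-separated rows: concept_id, description_id, term text.
my @rows = (
    "C1\tD1\tcaf\x{e9} cr\x{e8}me",
    "C1\tD2\tplain ascii",
    "C2\tD3\tna\x{ef}ve caf\x{e9} \x{e9}t\x{e9}",
);

# hash (concept_id) of lists (rows) of hashes (row record) of hashes (char => count)
my %by_concept;
for my $row (@rows) {
    my ( $concept_id, $description_id, $term ) = split /\t/, $row;

    # Inner hash: each non-ascii character in this row and its frequency.
    my %freq;
    ++$freq{$_} for $term =~ /[^\x{1}-\x{7f}]/g;
    next unless %freq;    # skip rows that are pure ascii

    push @{ $by_concept{$concept_id} },
        { description_id => $description_id, chars => \%freq };
}

for my $concept_id ( sort keys %by_concept ) {
    for my $rec ( @{ $by_concept{$concept_id} } ) {
        printf "%s %s: %d distinct non-ascii chars\n",
            $concept_id, $rec->{description_id},
            scalar keys %{ $rec->{chars} };
    }
}
```

Keeping the description_id inside each row record, rather than as another hash level, keeps the rows in file order; a hash keyed on description_id would work equally well if order does not matter.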
This will take some thinking; I'm taking a company trip tomorrow and can work this out in the hotel. I might not be able to post for about a week, depending on internet access.
Your help is much appreciated.
John
Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
by John M. Dlugosz (Monsignor) on Mar 03, 2003 at 23:30 UTC
by jjohhn (Scribe) on Mar 04, 2003 at 06:39 UTC