>> I had to comment out the version of perl: I have 5.6.1, and the compiler complained at that (that should be easy to fix).

That's odd. I put it in to make it clear that you didn't need 5.8, and that older than 5.6 wouldn't work. It's a good idea in posts for that reason, even if you don't use it in your own scripts that you don't share with anyone or put on multiple machines.

Look up the syntax of require and use in the perlfunc manpage, and see what the problem is. Use perl -v on the command line to see what version is actually running.
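As a rough illustration of the version check being discussed (the exact line from the earlier post is not shown here, so this is an assumed form, not the original), the numeric style below is accepted by 5.6.x and later. The variable names are made up for the example.

```perl
use strict;
use warnings;

# Numeric form of the version requirement: needs Perl 5.6.0 or later.
# This is checked at compile time, before the rest of the script runs.
use 5.006;

# $] holds the version of the running interpreter as a number,
# for example 5.006001 for Perl 5.6.1.
my $running = $];
my $new_enough = $running >= 5.006 ? 1 : 0;

print "Running Perl $running\n";
```

Comparing the printed value with what perl -v reports is a quick way to confirm which interpreter actually ran the script.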

>> The use utf8 directive is absolutely essential; the unicode hex notation is not allowed in the regex without it.

Correct. That's one reason I used \x{7f} instead of \x7f, which would mean the same thing: to make sure the regex was compiled with Unicode support turned on. That way, taking out the use as I suggested produces a meaningful error, which shows you what else use utf8 affects.
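Here is a minimal sketch of the kind of match under discussion. The original script is not reproduced in this post, so the sample string, the %chars hash contents, and the character class are assumptions for illustration only.

```perl
use strict;
use warnings;
use utf8;   # kept because the thread relies on it under 5.6

# Sample text containing three distinct characters above \x{7f}.
my $text = "na\x{ef}ve caf\x{e9} \x{263A} caf\x{e9}";

# Collect every character outside the ASCII range, one match per pass.
my %chars;
while ($text =~ /([^\x{00}-\x{7f}])/g) {
    $chars{$1}++;
}

my $distinct = keys %chars;
print "found $distinct distinct non-ascii chars\n";
```

Only the count is printed, so the example does not depend on how the terminal handles wide characters.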

The /g modifier doesn't make the regex return a list of all matches. The context controls that, and it's being called in scalar context.

Rather, with /g in scalar context each call returns one match, and each time through the while loop gets the next one. See, Perl remembers where the previous match ended (the position is kept with the string being matched; look up pos in perlfunc) and picks up from there. This is important to understand, and you'll see this kind of remembered state in a few other places in Perl, too.
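A small sketch of that one-at-a-time behaviour, using the built-in pos function to show where each match left off. The sample string and variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "a1b22c333";
my @found;

# Scalar context: each evaluation of the match finds the next run of digits.
while ($text =~ /(\d+)/g) {
    # pos() gives the offset just past the end of the last successful match.
    push @found, "$1 ends at " . pos($text);
}

print "$_\n" for @found;
```

When the match finally fails, the remembered position is reset, so a later loop over the same string starts from the beginning again.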

The idea of calling context is important in Perl. I see you wrote print "found ". keys(%chars) . " distinct non-ascii chars\n";. What would happen if you used commas instead of the two dots? Why does it change its meaning?
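One way to explore that question without relying on print is to capture both forms in variables and inspect them. The hash contents below are made up for the example.

```perl
use strict;
use warnings;

my %chars = (a => 1, b => 2, c => 3);

# The dot operator puts keys() in scalar context, which yields a count.
my $with_dots = "found " . keys(%chars) . " distinct chars";

# Commas build a list, so keys() is in list context and yields the keys.
my @with_commas = ("found ", keys(%chars), " distinct chars");

print "$with_dots\n";
print "the comma version has ", scalar(@with_commas), " elements\n";
```

The same function call produces a single number in one case and three separate strings in the other, purely because of where it appears.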

Your variation using my @matches= will indeed get all the matches at once. So why use a while loop? The rest of it doesn't make sense. Use it this way:

    # plain $_ here: a lexical "my $_" does not compile on 5.6
    $_ = "sdfdsf,65dfdf**3#ooijoi4asdfdsf.";
    my @matches = /[a-z]/g;    # list context: all matches at once
    print "I found: @matches\n";

When you get back, try a new top-level thread, and be detailed since it will be picked up by new readers.

Parsing each line in the file into fields? What format is it in? If it's comma-delimited text or something like that, there is a module which does that. If it's a unique character that doesn't appear within the field itself, just use the split function.
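For the simple case, a sketch with split might look like this. It assumes a tab as the delimiter and a made-up sample line, since the actual file format has not been described yet.

```perl
use strict;
use warnings;

# Hypothetical input line; in practice this would come from reading the file.
my $line = "alpha\tbeta\tgamma\n";
chomp $line;

# The -1 limit keeps trailing empty fields instead of silently dropping them.
my @fields = split /\t/, $line, -1;

print scalar(@fields), " fields, second is '$fields[1]'\n";
```

This only works when the delimiter can never occur inside a field. Quoted comma-delimited data needs a real parser, which is what the module mentioned above is for.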

—John


In reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by John M. Dlugosz
in thread regex for utf-8 by jjohhn
