in reply to Re^5: Begginer's question: If loops one after the other. Is that code correct?
in thread Begginer's question: If loops one after the other. Is that code correct?

You can simplify the code by removing the special cases for the two-character combinations and just use a regex. Just make sure you try to match the longer "characters" first, so their parts aren't matched instead.

Also, I used XML::LibXML to parse the structure.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use utf8;

    use XML::LibXML;

    my $file = shift;

    my %to_cyrilic = (
        # Insert the hash definition here, see below.
    );

    # Longest keys first, so that e.g. "nj" is tried before "n".
    my $regex = join '|', sort { length $b <=> length $a } keys %to_cyrilic;

    my $dom = 'XML::LibXML'->load_html( location => $file );

    # Transliterate every text node; markup is left untouched.
    for my $text ($dom->findnodes('//text()')) {
        my $etext = $text;
        $text->setData($etext) if $etext =~ s/($regex)/$to_cyrilic{$1}/g;
    }
    print $dom;

Note that PRE tags preserve non-latin1 characters.

    F => 'Ф',
    H => 'Х',
    N => 'Н',
    Nj => 'Њ',
    b => 'б',
    c => 'ц',
    d => 'д',
    e => 'е',  # Cyrillic е (U+0435), not Latin e
    h => 'х',
    i => 'и',
    l => 'л',
    m => 'м',
    n => 'н',
    nj => 'њ',
    p => 'п',
    r => 'р',
    s => 'с',
    t => 'т',
    u => 'у',
    v => 'в',
    z => 'з',
    ć => 'ћ',
    Č => 'Ч',
    č => 'ч',
    Đ => 'Ђ',
    đ => 'ђ',
    Š => 'Ш',
    š => 'ш',
    Ž => 'Ж',
    ž => 'ж',
    # etc., this is enough to run the example.
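
To see why the sort by length matters, here is a minimal sketch with a tiny subset of the table. The `j` entry and the `convert` helper are my additions for illustration only; they are not part of the script above.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Tiny illustrative subset of the table ("j" is an assumed extra entry).
my %to_cyrillic = (
    n  => "\x{43D}",    # н
    j  => "\x{458}",    # ј
    nj => "\x{45A}",    # њ
);

# Apply the table, trying the keys in exactly the order given.
sub convert {
    my ($text, @keys) = @_;
    my $regex = join '|', map { quotemeta } @keys;
    $text =~ s/($regex)/$to_cyrillic{$1}/g;
    return $text;
}

my @shortest_first = sort { length $a <=> length $b } keys %to_cyrillic;
my @longest_first  = sort { length $b <=> length $a } keys %to_cyrillic;

# Shortest first: "nj" is converted as two separate letters.
print convert('nj, n, j', @shortest_first), "\n";

# Longest first: "nj" is converted as the single letter U+045A.
print convert('nj, n, j', @longest_first), "\n";
```

The alternation is tried left to right at each position, so whichever key comes first in the joined pattern wins; that is all the sort is there for.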
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Re^7: Begginer's question: If loops one after the other. Is that code correct?
by predrag (Scribe) on Jan 11, 2017 at 19:36 UTC

    choroba, thank you very much for the fast and good help. Yes, I see now that I could use a regex for the two-character combinations. I have some experience with regexes, not too much, but I can do that myself.

    I will try your code soon, I just have to install that module. The code is really short, it can't be shorter. I don't understand all of it yet, but I hope it will become clear while working with the code.

    Sorry, I am not sure that I understand well what you wrote:

    "Note that PRE tags preserve non-latin1 characters"

    Does it mean that I have to put all the Cyrillic letters into the hash, including those that look the same as in Latin (a, o, e, k…)?

      The comment about PRE tags is about this site: as you've seen, you can't include some characters in CODE tags, as the site changes them into entities, but you can include them in PRE tags.

      But your question is a good one - yes, the Cyrillic alphabet exists as a whole in Unicode (and therefore in UTF-8), even the letters that look the same as the Latin ones. Cf.:

       ~ $ perl -CS -lwe 'print chr for 65, 1040'
      A
      А
       ~ $ perl -CS -we 'print chr for 65, 1040' | xxd
      0000000: 41d0 90                                  A..
      

      See Cyrillic capital letter A.
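
The same check can be put into a script, which might be easier to run than the one-liner. This is only a sketch: the list of pairs and the `describe` helper are made up for illustration, and the Cyrillic letters are written as `\x{...}` escapes so they can't be confused with the Latin ones.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Latin letters paired with the Cyrillic letters that look the same.
my @pairs = (
    [ 'A', "\x{410}" ],
    [ 'e', "\x{435}" ],
    [ 'o', "\x{43E}" ],
);

# Show both characters, their code points, and whether eq treats them as equal.
sub describe {
    my ($latin, $cyrillic) = @_;
    return sprintf '%s U+%04X  %s U+%04X  %s',
        $latin,    ord($latin),
        $cyrillic, ord($cyrillic),
        $latin eq $cyrillic ? 'same' : 'different';
}

print describe(@$_), "\n" for @pairs;
```

Every pair is reported as different, which is why the look-alike letters need their own entries in the hash.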


        choroba, maybe you don't realize how great your help is :) Luckily, I asked that question.

        When I started working on this conversion project, my first idea was to try it with Unicode. I got some results that way, but I felt the approach was too primitive. Soon after, I got the better idea of using a hash, which resulted in the code I sent. During the work, of course, I was looking at a Unicode table on the web, but obviously it was not the whole table, so I didn't notice that, for example, Cyrillic "A" has a different Unicode code point than its Latin equivalent A. So your web link is a huge help.

        Although I drew a wrong conclusion in the beginning, my intuition told me that it might cause a mess in the output. In fact, that is one of the main reasons why I sent my question to perlmonks. I lack the bigger and wider picture and more structured knowledge, so comments from that point of view are far more important to me than being given code. I feel I have to learn how to build a good foundation for future learning and future work, after years and decades of being a pretty common computer user or slightly above that.

        When I run your example one-liner, I get "Use of uninitialized value $_ in print at -e line 1." Maybe it is because of an older Perl version (5.10.1) or an older bash shell? Never mind; when I put that command into a script, it works. At the beginning of my work I did similar tests, but I didn't know about the Latin/Cyrillic difference I wrote about above.

        I understand now what you explained about your comment about PRE tag.