Re^5: Begginer's question: If loops one after the other. Is that code correct?

Here is the code of the converter. It is for my non-commercial beekeeping website, which is written in the Serbian Latin alphabet. I am working on its new design and would like to have it converted into Cyrillic too. It is not a small site, maybe a few hundred pages. I know there is software for that, but somehow I don't like it. Until recently I never expected I could even have the site in Cyrillic, or could try to do it myself, but with Perl I think it is possible, even for a beginner like me.

It works for simple HTML pages. If I have external CSS files, maybe it will work with those too; I haven't tried yet. So I am asking the monks just for comments on my approach.

It reads an HTML file, converts the text into Cyrillic, leaves the code untouched, and creates a new HTML file in Cyrillic. The next steps are to read a whole directory or the whole website, and there are a lot of other things to be done, but that is not part of my question now.

I read the input file/string part by part, where one part is either a string of code between < and > or a string of text between > and <. To determine where the code is and where the text is, I have a parameter $k that receives the value 1 after "<" and the value 2 after ">".

The subroutine converts strings. A hash contains a dictionary of one-to-one equivalents. Letters that look the same (for example "a", "e" etc.) are omitted, and I wonder if that is OK. For example, are the Latin and Cyrillic letter "a" the same in an HTML file and its encoding?

The script prints the output file on standard output too.

#!/usr/bin/perl 
use strict;
use warnings;

use utf8;
binmode(STDOUT, ":utf8");

use open ':encoding(utf8)'; # input/output default encoding will be
                                # UTF-8

my $infile;            # reads input file into string $infile

open INPUT, "<index_latin.html";
undef $/;
$infile =<INPUT>;
close INPUT;


my $k;                    # parameter =1 between < > , =2 between > <
my $string;                #  "<code between>" 
my $txtstring = '';        #  >"text between"<
my $outcode = '';       # output: code and converted text together
my $for_conv;          # string to be converted by sub
my $char;              # character from input file
my $convert;          # converted string by sub


            # splits input file into characters 
foreach $char (split//, $infile) {

    if ($char eq "<") {
        $k = 1;
    }

    if ($k ==2)  {
        $txtstring= $txtstring . $char;
    }

    else {
        $string = $string .$char;
    }


    if ($char eq ">") {

        if (substr($txtstring, 0, 1) eq "&" ){   
                # &nbsp will not be converted
            $string =$txtstring.$string;  #goes to string code
            $txtstring = '';  ## 
        }

    $for_conv = $txtstring;

    $convert = konverter($for_conv);

    $outcode = $outcode .$convert.$string;

    $k = 2;    
    $string = '';
    $txtstring = '';
    
    }          # of if char eq ">"

}       # of foreach

                # writing to  file
    my $filename = "index_cyrilic.htm";    
    open(FH, '>', $filename) or die $!;
    print FH $outcode ;
    close(FH);
<readmore>

print "\n";
print "code on the output:\n";
print "\n";
print "$outcode\n";

# converting string into Cyrillic

sub konverter  {
              # dictionary

my %dict = (
    "b" => "&#1073;", "B" => "&#1041;", "c" => "&#1094;", "C" => "&#1062;",
    "&#269;" => "&#1095;", "&#268;" => "&#1063;", "&#263;" => "&#1115;", "&#262;" => "&#1035;",
    "d" => "&#1076;", "D" => "&#1044;", "&#273;" => "&#1106;", "&#272;" => "&#1026;",
    "f" => "&#1092;", "F" => "&#1060;", "g" => "&#1075;", "G" => "&#1043;",
    "h" => "&#1093;", "H" => "&#1061;", "i" => "&#1080;", "I" => "&#1048;",
    "l" => "&#1083;", "L" => "&#1051;", "m" => "&#1084;", "n" => "&#1085;", "N" => "&#1053;",
    "p" => "&#1087;", "P" => "&#1055;", "r" => "&#1088;", "R" => "&#1056;",
    "s" => "&#1089;", "S" => "&#1057;", "š" => "&#1096;", "Š" => "&#1064;",
    "t" => "&#1090;", "u" => "&#1091;", "U" => "&#1059;", "v" => "&#1074;", "V" => "&#1042;",
    "z" => "&#1079;", "Z" => "&#1047;", "ž" => "&#1078;", "Ž" => "&#1046;",
);

my @conv_arr = split (//, $for_conv);   # splits input string for conversion

my $ind = 0;    # index of array element
my $out = "";   # output, converted string
my $str_char;   # string character
my $next;       # next string character
my $nj;  # Latin two character letters to be replaced with one Cyrillic
my $Nj;
my $lj;
my $Lj;
my $dz;
my $Dz;

while ($ind <= $#conv_arr){
    $str_char = $conv_arr[$ind];   # current character 

    if ($ind ==$#conv_arr) {
        $next ="";  # there are no more characters
    }
    else {
        $next =$conv_arr[$ind+1];    # next character
    }
    
    if (exists ($dict{$str_char})) {  

                            # combination nj gives $nj = "&#1114;"
        if (($str_char eq "n") && ($next eq "j")){
            $nj = "&#1114;";
            $out = $out.$nj;
            $ind = $ind+1;
        }

        elsif (($str_char eq "N") && ($next eq "j")){
            $Nj = "&#1034;";
            $out = $out.$Nj;
            $ind = $ind+1;
        }

        elsif (($str_char eq "l") && ($next eq "j")){
            $lj = "&#1113;";
            $out = $out.$lj;
            $ind = $ind+1;
        }

        elsif (($str_char eq "L") && ($next eq "j")){
            $Lj = "&#1033;";
            $out = $out.$Lj;
            $ind = $ind+1;
        }

        elsif (($str_char eq "d") && ($next eq "ž")){
            $dz = "&#1119;";
            $out = $out.$dz;
            $ind = $ind+1;
        }

        elsif (($str_char eq "D") && ($next eq "ž")){
            $Dz = "&#1039;";
            $out = $out.$Dz;
            $ind = $ind+1;
        }
        else {   # one character letters
            $out = $out.$dict{$str_char};
        } 

    $ind++;
    } # of if exists

    else {      
        $out = $out.$str_char;
        $ind++;
    } 
}          # of while

return $out;
}         # of sub
</readmore>

Here is the HTML code of the input file index_latin.html, for testing:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<body>
<p>primer <font color="#003300"><strong><font color="#006600">sajta</font></strong></font></p>
<table width="124" border="1" cellspacing="2" cellpadding="2">
  <tr>
    <td width="41">Fhšž</td>
    <td width="63">Hj&#263;</td>
  </tr>
  <tr>
    <td>abcd</td>
    <td>145</td>
  </tr>
</table>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p><em>konverzija</em> iz latinice u &#263;irilicu šŠ &#269;&#268; &#273;&#272; žŽ nj Nj</p>
<p>poslednji red</p>
</body>
</html>

This is the output code. I hope that I was successful with readmore.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>&#1090;e&#1089;&#1090;</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<body>
<p>&#1087;&#1088;&#1080;&#1084;e&#1088; <font color="#003300"><strong><font color="#006600">&#1089;aj&#1090;a</font></strong></font></p>
<table width="124" border="1" cellspacing="2" cellpadding="2">
  <tr>
    <td width="41">&#1060;&#1093;&#1096;&#1078;</td>
    <td width="63">&#1061;j&#1115;</td>
  </tr>
  <tr>
    <td>a&#1073;&#1094;&#1076;</td>
    <td>145</td>
  </tr>
</table>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p><em>ko&#1085;&#1074;e&#1088;&#1079;&#1080;ja</em> &#1080;&#1079; &#1083;a&#1090;&#1080;&#1085;&#1080;&#1094;e &#1091; &#1115;&#1080;&#1088;&#1080;&#1083;&#1080;&#1094;&#1091; &#1096;&#1064; &#1095;&#1063; &#1106;&#1026; &#1078;&#1046; &#1114; &#1034;</p>
<p>&#1087;o&#1089;&#1083;e&#1076;&#1114;&#1080; &#1088;e&#1076;</p>
</body>
</html>

Re^6: Begginer's question: If loops one after the other. Is that code correct? by choroba (Cardinal) on Jan 10, 2017 at 23:21 UTC
You can simplify the code by removing the special cases for the two-character combinations and just using a regex. Just make sure you try to match the longer "characters" first, so their parts aren't matched instead. Also, I used XML::LibXML to parse the structure.

#!/usr/bin/perl
use warnings;
use strict;
use utf8;

use XML::LibXML;

my $file = shift;

my %to_cyrilic = (
    # Insert the hash definition here, see below.
);

my $regex = join '|', sort { length $b <=> length $a } keys %to_cyrilic;

my $dom = 'XML::LibXML'->load_html( location => $file );
for my $text ($dom->findnodes('//text()')) {
    my $etext = $text;
    $text->setData($etext) if $etext =~ s/($regex)/$to_cyrilic{$1}/g;
}
print $dom;

Note that PRE tags preserve non-latin1 characters.

F  => 'Ф',
H  => 'Х',
N  => 'Н',
Nj => 'Њ',
b  => 'б',
c  => 'ц',
d  => 'д',
e  => 'e',
h  => 'х',
i  => 'и',
l  => 'л',
m  => 'м',
n  => 'н',
nj => 'њ',
p  => 'п',
r  => 'р',
s  => 'с',
t  => 'т',
u  => 'у',
v  => 'в',
z  => 'з',
ć  => 'ћ',
Č  => 'Ч',
č  => 'ч',
Đ  => 'Ђ',
đ  => 'ђ',
Š  => 'Ш',
š  => 'ш',
Ž  => 'Ж',
ž  => 'ж',
# etc., this is enough to run the example.
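The longest-first alternation described above can be tried on plain strings, without any HTML parsing. This is only a minimal sketch and not choroba's exact code: the table `%to_cyrillic` is a deliberately incomplete assumption, and `translit` is a hypothetical helper name. Code points are written as `\x{...}` escapes so that the source stays ASCII.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use open qw/:std :utf8/;

# Partial Latin -> Cyrillic table (an assumption for illustration only;
# a real table needs every letter, including look-alikes such as a/e/o).
my %to_cyrillic = (
    "nj" => "\x{045A}",         # CYRILLIC SMALL LETTER NJE
    "Nj" => "\x{040A}",         # CYRILLIC CAPITAL LETTER NJE
    "lj" => "\x{0459}",         # CYRILLIC SMALL LETTER LJE
    "n"  => "\x{043D}",         # CYRILLIC SMALL LETTER EN
    "l"  => "\x{043B}",         # CYRILLIC SMALL LETTER EL
    "j"  => "\x{0458}",         # CYRILLIC SMALL LETTER JE
    "e"  => "\x{0435}",         # CYRILLIC SMALL LETTER IE
    "g"  => "\x{0433}",         # CYRILLIC SMALL LETTER GHE
    "o"  => "\x{043E}",         # CYRILLIC SMALL LETTER O
    "\x{0161}" => "\x{0448}",   # s with caron -> CYRILLIC SMALL LETTER SHA
);

# Longest keys first, so that "nj" is tried before "n".
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a } keys %to_cyrillic;
my $regex = qr/($alternation)/;

# Hypothetical helper: transliterate one plain-text string.
sub translit {
    my ($text) = @_;
    $text =~ s/$regex/$to_cyrillic{$1}/g;
    return $text;
}

print translit("Njego\x{0161}"), "\n";
```

Without the length sort, the alternation could try "n" before "nj", and the digraph would come out as two separate Cyrillic letters.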
Re^7: Begginer's question: If loops one after the other. Is that code correct? by predrag (Scribe) on Jan 11, 2017 at 19:36 UTC
choroba, thank you very much for the fast and good help. Yes, I see now that I could use a regex for the two-character combinations. I have some experience with regexes, not too much, but I can do that myself. I will try your code soon; I just have to install that module. The code is really short, it can't be shorter. I don't understand all of it yet, but I hope it will become clear while working with the code.

Sorry, I am not sure that I understood well what you wrote: "Note that PRE tags preserve non-latin1 characters". Does it mean that I have to put all Cyrillic letters in the hash, including those that look the same as in Latin (a, o, e, k…)?
Re^8: Begginer's question: If loops one after the other. Is that code correct? by choroba (Cardinal) on Jan 11, 2017 at 20:47 UTC
The comment about PRE tags is about this site: as you've seen, you can't include some characters in CODE tags, as the site changes them into entities, but you can include them in PRE tags.

But your question is a good one. Yes, the Cyrillic alphabet exists as a whole in Unicode (and therefore in UTF-8), even the letters that look the same as the Latin ones. Cf.:

~ $ perl -CS -lwe 'print chr for 65, 1040'
A
А
~ $ perl -CS -we 'print chr for 65, 1040' | xxd
0000000: 41d0 90  A..

See Cyrillic capital letter A.
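The same comparison can be made from inside a script. A minimal sketch using the core Encode module; the helper name `utf8_hex` is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 encoding of a character.
sub utf8_hex {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $latin_a    = chr(65);      # LATIN CAPITAL LETTER A
my $cyrillic_a = chr(1040);    # CYRILLIC CAPITAL LETTER A

printf "U+%04X -> %s\n", ord($latin_a),    utf8_hex($latin_a);
printf "U+%04X -> %s\n", ord($cyrillic_a), utf8_hex($cyrillic_a);
print  "same character? ", ($latin_a eq $cyrillic_a ? "yes" : "no"), "\n";
```

The Latin letter encodes to the single byte 41 and the Cyrillic one to the two bytes d0 90, so the two are different characters even though they look alike on screen.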
Re^9: Begginer's question: If loops one after the other. Is that code correct? by predrag (Scribe) on Jan 12, 2017 at 15:05 UTC
Re^10: Begginer's question: If loops one after the other. Is that code correct? by choroba (Cardinal) on Jan 12, 2017 at 15:10 UTC
Some notes below your chosen depth have not been shown here
Re^6: Begginer's question: If loops one after the other. Is that code correct? by haukex (Archbishop) on Jan 13, 2017 at 19:13 UTC
Hi predrag,

choroba has already shown you a more "perlish" approach to the problem, using a hash table and a regular expression. However, note that just because you're writing in Perl doesn't mean you have to do it that way, since in Perl, TIMTOWTDI - There Is More Than One Way To Do It. (My comments that it's better to parse HTML with a module still apply, though.)

I had a look at your code, and even though I haven't tested it myself since you said that it works, it does look like the logic is fairly sound. I'm not entirely clear yet on the order of operations in the `foreach $char` loop, but as I said before, the best way to go about checking it is with enough sample input that exercises all the logic branches.

The one thing that I'm a little confused about is the placement of the `if (substr($txtstring, 0, 1) eq "&" ){` statement. It seems to me like this is only handling & characters at certain points in the input string, instead of anywhere in the input string. This might be a place where either index or a regular expression might be more appropriate (or, of course, a full HTML parser `:-)`).

Anyway, I thought I might give some general comments about your code:

Instead of `binmode(STDOUT, ":utf8"); use open ':encoding(utf8)';`, I believe you can just say `use open qw/:std :utf8/;` (this also affects STDIN and STDERR).

`open INPUT, "<index_latin.html";` - I'd recommend the three-argument form of open, with error handling, as well as lexical filehandles (`my $infh` instead of `INPUT`): `open my $infh, '<', 'index_latin.html' or die "open: $!";`

`undef $/;` - the effect of a change to the $/ variable will be global, throughout the whole program. A common way to do this is to use local inside of a do block; the effect of local will then cause the variable to be reset to its original value when the block is exited. You'll see this often in Perl to read an entire file at once ("slurp"): `my $infile = do { local $/; <INPUT> };`

You have quite a few variable declarations (`my ...`) before the code starts. Note that it's usually better to wait with declaring variables until the scope where they're needed, as otherwise there might be conflicts if you accidentally re-use a variable or forget which scope you're working in. For example, instead of `my $char; foreach $char ...` it's usually better to say `foreach my $char ...` (unless of course you specifically need `$char` after the loop ends).

`my $k;` - I'd recommend using textual representations instead of magic numbers here. For example, you can use constant: `use constant { INSIDE_TAG=>1, OUTSIDE_TAG=>2 };` and then use the two values `INSIDE_TAG` and `OUTSIDE_TAG` instead of the numbers.

`my $Nj; ... $Nj = "Њ"; $out = $out.$Nj;` can also be written much shorter as `$out = $out."Њ";` (since each of those variables like `$Nj` is used only once).

As for your question here about `&nbsp;`, you're right, my code didn't handle that. The solution is to change the `'text'` to `'dtext'` (decoded text) in `$p->handler(text => sub { ... }, 'text');`. Also, I didn't have full UTF-8 handling in that code. I should have said `open my $out, '>:utf8', $outfile or die ...` to open the output file, and for parsing the input file I should have done this: `open my $infh, '<:utf8', $infile or die "open $infile: $!"; $p->parse_file($infh);` (this is mentioned in the HTML::Parser documentation).

As you've noticed, PerlMonks isn't perfect in regards to Unicode. Even though Perl itself handles it fine, I just wanted to point out that there are other ways to represent Unicode characters in Perl where the source file can be left in ASCII (and that won't cause trouble when posting to PerlMonks). For example, instead of "č"=>"ч", you can write `"\x{010D}"=>"\x{0447}"` or `"\N{LATIN SMALL LETTER C WITH CARON}"=>"\N{CYRILLIC SMALL LETTER CHE}"` (depending on the Perl version, for the latter you may have to add `use charnames ':full';` at the top of your code). These forms certainly don't look as nice, so you don't have to use them if Unicode works for you, but it's also noteworthy that this will make the difference between "A" and "А" more obvious (one of them is actually `"\N{CYRILLIC CAPITAL LETTER A}"`).

Hope this helps,
-- Hauke D
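Several of the suggestions above (lexical filehandle, three-argument open, `local $/`, named constants, variables declared where they are needed) can be combined in one small sketch. It is not a rewrite of predrag's script: the helper names `slurp` and `chunks` are hypothetical, an in-memory filehandle stands in for index_latin.html, and the tokenizer deliberately ignores harder cases such as a ">" inside an attribute value, comments, or scripts, which is why a real HTML parser remains the safer choice.

```perl
use strict;
use warnings;

use constant { INSIDE_TAG => 1, OUTSIDE_TAG => 2 };

# Hypothetical helper: read a whole file at once. The change to $/ is
# undone automatically when the do block exits.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "open: $!";
    return do { local $/; <$fh> };
}

# Hypothetical helper: split markup into [type, string] chunks using the
# same character-by-character idea as the original loop.
sub chunks {
    my ($html) = @_;
    my $state  = OUTSIDE_TAG;
    my $buffer = '';
    my @chunks;
    foreach my $char (split //, $html) {
        if ($char eq '<' && $state == OUTSIDE_TAG) {
            push @chunks, [ 'text', $buffer ] if length $buffer;
            ($state, $buffer) = (INSIDE_TAG, $char);
        }
        elsif ($char eq '>' && $state == INSIDE_TAG) {
            push @chunks, [ 'tag', $buffer . $char ];
            ($state, $buffer) = (OUTSIDE_TAG, '');
        }
        else {
            $buffer .= $char;
        }
    }
    # Whatever is left over (including an unterminated tag) is kept as text.
    push @chunks, [ 'text', $buffer ] if length $buffer;
    return @chunks;
}

# A reference to a scalar opens an in-memory file, so no real file is needed.
my $sample = '<p>primer <em>sajta</em></p>';
my $html   = slurp(\$sample);
printf "%-4s %s\n", @$_ for chunks($html);
```

Only the chunks marked `text` would be passed to the converter; the `tag` chunks are copied to the output unchanged.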
Re^7: Begginer's question: If loops one after the other. Is that code correct? by predrag (Scribe) on Jan 13, 2017 at 22:33 UTC
Hauke D, thanks so much for such a detailed answer, comments and suggestions.

I accepted your suggestion regarding the order of operations in the `foreach $char` loop, but haven't had time to rewrite it yet. I will try the way you've suggested, which I also see is the most logical. Also, I understand your suggestions about using modules for parsing HTML instead of doing it the way I did. I've already installed the XML::LibXML module and tried choroba's code, but I didn't get a good result yet. Never mind, I tried another, simpler example with that module, without any conversion, just to get some experience. It worked well and I noticed it works for `&nbsp;`. The example I've sent doesn't have that test, but I've tried it on another example.

You are absolutely right in your third paragraph (handling & characters) when you noticed that my block of code is too limited a solution. I knew that, and I wrote it for `&nbsp;` just as a test, and was very happy to see that this way I could maybe even handle some more complicated and mixed HTML pages. That doesn't mean that I will use that way, it was just a phase in practicing.

But I think that I should wait for the new design of my site and see what HTML code it will have, and then it will be much easier to finish a complete converter. That is because I could have some CSS, or something else that has to be handled additionally. But anyway, even if I have some pages with something very special, it will not be such a big problem if I do those conversions "manually". I've already successfully tried some examples with creating PDF files (an option for printing some articles), working with directories etc., and that way I am preparing myself for the final work on my website design. I often prefer to learn in phases, "in circles", first just to touch, then deeper etc., instead of going straight to the essence in one step.

So, I have to convert all Latin letters, not just the "different ones", but other characters (such as punctuation etc.) will stay untouched? Fortunately, it is not a programmers' site where I would have all possible characters in the text area. :)

All your comments, general and other, are really very useful for me, and I am learning from them better than by just passively reading somewhere.

Regarding my code `binmode(STDOUT, ":utf8"); use open ':encoding(utf8)';` I will try your suggestion too. I had to put in that code (found on the web) because I was practicing printing some output on STDOUT, and without it the Cyrillic letters were not visible.

Somehow, I love TIMTOWTDI, maybe because I love this principle in life in general. Regarding this project of mine, the most important thing for me is that the code works perfectly, so the output will be good too. I do not need code that is very fast or too fancy, but of course I understand that the code must be correct and clean.

One separate task for my site in Cyrillic will be IDN encoding. It is something completely new for me, and I've recently learned just a little about it from our national domain service. That is because I will have a URL that is in Cyrillic too (national domain), and maybe other pages will have URLs in Cyrillic too (a more complicated option).

I am maybe a bit slow in work and learning, but I don't always have free time and, as I wrote in a previous post, I try to build a good foundation for my future learning, not just strive for fast solutions. Also, I have many other interests, but it is really amazing for me to see that now that I've found Perl, I don't even need to go any further in programming, and that Perl could be useful for me in some other fields too.
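The entity handling discussed in this subthread can be sketched with two small regular expressions. This is only an illustration under stated assumptions: both helper names are hypothetical, `uc` stands in for the real Latin-to-Cyrillic conversion, and a parser-based route (the `'dtext'` handler of HTML::Parser, or the HTML::Entities module) is likely to be more robust on real pages.

```perl
use strict;
use warnings;
use open qw/:std :utf8/;

# Hypothetical helper: turn numeric character references such as &#263;
# or &#x107; into real characters. Named entities (&nbsp;, &amp;, ...)
# are deliberately left untouched.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;
    $text =~ s/&#([0-9]+);/chr($1)/ge;
    return $text;
}

# Hypothetical helper: split text into entity and non-entity pieces, and
# hand only the non-entity pieces to the converter callback.
sub convert_outside_entities {
    my ($text, $convert) = @_;
    my $out = '';
    for my $piece (split /(&#?\w+;)/, $text) {
        $out .= ($piece =~ /\A&#?\w+;\z/) ? $piece : $convert->($piece);
    }
    return $out;
}

print decode_numeric_entities('Hj&#263; &nbsp;'), "\n";

# uc() is only a stand-in for the real conversion.
print convert_outside_entities('a&nbsp;b &amp; c', sub { uc shift }), "\n";
```

Unlike a check on the first character of the text only, the split finds entities anywhere in the string, so `&nbsp;` in the middle of a sentence is also protected from conversion.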
Re^8: Begginer's question: If loops one after the other. Is that code correct? by haukex (Archbishop) on Jan 14, 2017 at 13:58 UTC
Hi predrag,

"One separate task for my site in Cyrillic will be IDN encoding"

I just wanted to point out the power of CPAN. Perl and CPAN have been around for a long time, and two of the several areas where Perl excels are text processing and web development. I've already linked you to several HTML and XML processing modules, and a quick search on CPAN for "translit" is what gave me, among other things, Lingua::Translit, and a quick search for "IDN" shows me Net::IDN::Encode, again just one module among several.

use warnings;
use strict;
use open qw/:std :utf8/;

use Lingua::Translit;
my $tr = new Lingua::Translit("ISO/R 9");
my $txt = "\x{0441}\x{0440}\x{043F}\x{0441}\x{043A}\x{0438}";
my $latin = $tr->translit($txt);
my $cyrillic = $tr->translit_reverse($latin);
die "text mismatch" unless $txt eq $cyrillic;
print "$latin <-> $cyrillic\n";

use Net::IDN::Encode qw/domain_to_ascii domain_to_unicode/;
my $idn = "\x{0442}\x{0435}\x{0441}\x{0442}.\x{0441}\x{0440}\x{0431}";
my $asc = domain_to_ascii($idn);
my $dom = domain_to_unicode("xn--e1aybc.xn--90a3ac");
die "domain mismatch" unless $idn eq $dom;
print "$asc <-> $dom\n";

Output:

srpski <-> српски
xn--e1aybc.xn--90a3ac <-> тест.срб

Note: I did not verify that the "ISO/R 9" transliteration table is identical to the Serbian / Cyrillic transliteration table you're using, but at least Wikipedia says it's suitable.

Regards,
-- Hauke D
Re^9: Begginer's question: If loops one after the other. Is that code correct? by predrag (Scribe) on Jan 14, 2017 at 17:01 UTC
Re^10: Begginer's question: If loops one after the other. Is that code correct? by haukex (Archbishop) on Jan 17, 2017 at 12:23 UTC
Some notes below your chosen depth have not been shown here
Re^10: Begginer's question: If loops one after the other. Is that code correct? by predrag (Scribe) on Jan 14, 2017 at 17:31 UTC
Some notes below your chosen depth have not been shown here