Here is the code of the converter. It is for my non-commercial beekeeping website, which is written in the Serbian Latin alphabet. I am working on a new design and would like to have the site available in Cyrillic too. It is not a small site, maybe a few hundred pages. I know there is software for that, but somehow I don't like it. Until recently I never expected that I could have the site in Cyrillic at all, let alone try to do it myself, but with Perl I think it is possible, even for a beginner like me.

It works for simple HTML pages. If I have external CSS files, maybe it will work with those too; I haven't tried yet. So I am asking the monks just for comments on my approach.

It reads an HTML file, converts the text into Cyrillic, leaves the code untouched, and creates a new HTML file in Cyrillic. The next steps are to read a whole directory or the whole website, and there are a lot of other things to be done, but that is not part of my question now.

I read the input file (slurped into one string) part by part, where a part is either code between < and > or text between > and <. To tell code from text, I use a parameter $k that is set to 1 when a "<" is read and to 2 after a ">".
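For comparison, the same separation into code parts and text parts can be sketched with a single split and a capturing group. This is only an illustration of an alternative, not the approach used in the script; the sample string and the variable names are made up, and the pattern assumes that no ">" appears inside attribute values, comments or script blocks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, not the real page.
my $html = '<p>primer <strong>sajta</strong></p>';

# Because the pattern has a capturing group, split returns the tags
# themselves as well as the text between them. grep drops empty pieces.
my @parts = grep { length } split /(<[^>]*>)/, $html;

for my $part (@parts) {
    my $kind = $part =~ /^</ ? 'code' : 'text';
    print "$kind: [$part]\n";
}
```

Only the parts labelled text would then be handed to the conversion subroutine; the code parts would be copied to the output unchanged.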

A subroutine converts the strings. A hash contains a dictionary of one-to-one equivalents. Letters that look the same (for example "a", "e", etc.) are omitted, and I wonder whether that is OK: are the Latin and the Cyrillic letter "a" the same in an HTML file and in its encoding?
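One way to check the question about "a" is to print the code points. In Unicode the Latin and the Cyrillic "a" are separate characters that merely look alike, so a letter left out of the dictionary stays Latin in the output file. The snippet below is a small check, not part of the converter; the variable names are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $latin_a    = "a";          # LATIN SMALL LETTER A
my $cyrillic_a = "\x{430}";    # CYRILLIC SMALL LETTER A, written as an escape

printf "Latin a:    U+%04X\n", ord $latin_a;
printf "Cyrillic a: U+%04X\n", ord $cyrillic_a;
print  "same character? ", ($latin_a eq $cyrillic_a ? "yes" : "no"), "\n";
```

If true Cyrillic letters are wanted everywhere, for example so that a search for a Cyrillic word finds the page, the look-alike letters would presumably need dictionary entries too, such as "a" => "&#1072;".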

The script also prints the output file to standard output.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    binmode(STDOUT, ":utf8");
    use open ':encoding(utf8)';   # input/output default encoding will be UTF-8

    my $infile;   # reads input file into string $infile
    open(INPUT, "<index_latin.html") or die $!;
    undef $/;
    $infile = <INPUT>;
    close INPUT;

    my $k = 0;            # parameter =1 between < > , =2 between > <
    my $string = '';      # "<code between>"
    my $txtstring = '';   # >"text between"<
    my $outcode = '';     # output: code and converted text together
    my $for_conv;         # string to be converted by sub
    my $char;             # character from input file
    my $convert;          # converted string by sub

    # splits input file into characters
    foreach $char (split //, $infile) {
        if ($char eq "<") {
            $k = 1;
        }
        if ($k == 2) {
            $txtstring = $txtstring . $char;
        }
        else {
            $string = $string . $char;
        }
        if ($char eq ">") {
            if (substr($txtstring, 0, 1) eq "&") {   # &nbsp will not be converted
                $string = $txtstring . $string;      # goes to string code
                $txtstring = '';
            }
            $for_conv = $txtstring;
            $convert = konverter($for_conv);
            $outcode = $outcode . $convert . $string;
            $k = 2;
            $string = '';
            $txtstring = '';
        } # of if char eq ">"
    } # of foreach

    # writing to file
    my $filename = "index_cyrilic.htm";
    open(FH, '>', $filename) or die $!;
    print FH $outcode;
    close(FH);
    <readmore>
    print "\n";
    print "code on the output:\n";
    print "\n";
    print "$outcode\n";

    # converting string into Cyrillic
    sub konverter {
        my ($for_conv) = @_;   # string passed in by the caller

        # dictionary
        my %dict = (
            "b" => "&#1073;", "B" => "&#1041;", "c" => "&#1094;", "C" => "&#1062;",
            "č" => "&#1095;", "Č" => "&#1063;", "ć" => "&#1115;", "Ć" => "&#1035;",
            "d" => "&#1076;", "D" => "&#1044;", "đ" => "&#1106;", "Đ" => "&#1026;",
            "f" => "&#1092;", "F" => "&#1060;", "g" => "&#1075;", "G" => "&#1043;",
            "h" => "&#1093;", "H" => "&#1061;", "i" => "&#1080;", "I" => "&#1048;",
            "l" => "&#1083;", "L" => "&#1051;", "m" => "&#1084;",
            "n" => "&#1085;", "N" => "&#1053;", "p" => "&#1087;", "P" => "&#1055;",
            "r" => "&#1088;", "R" => "&#1056;", "s" => "&#1089;", "S" => "&#1057;",
            "š" => "&#1096;", "Š" => "&#1064;", "t" => "&#1090;",
            "u" => "&#1091;", "U" => "&#1059;", "v" => "&#1074;", "V" => "&#1042;",
            "z" => "&#1079;", "Z" => "&#1047;", "ž" => "&#1078;", "Ž" => "&#1046;",
        );

        my @conv_arr = split(//, $for_conv);   # splits input string for conversion
        my $ind = 0;     # index of array element
        my $out = "";    # output, converted string
        my $str_char;    # string character
        my $next;        # next string character
        my $nj;          # Latin two character letters to be replaced with one Cyrillic
        my $Nj;
        my $lj;
        my $Lj;
        my $dz;
        my $Dz;

        while ($ind <= $#conv_arr) {
            $str_char = $conv_arr[$ind];   # current character
            if ($ind == $#conv_arr) {
                $next = "";   # there are no more characters
            }
            else {
                $next = $conv_arr[$ind+1];   # next character
            }
            if (exists($dict{$str_char})) {
                # combination nj gives $nj = "&#1114;"
                if (($str_char eq "n") && ($next eq "j")) {
                    $nj = "&#1114;";
                    $out = $out . $nj;
                    $ind = $ind + 1;
                }
                elsif (($str_char eq "N") && ($next eq "j")) {
                    $Nj = "&#1034;";
                    $out = $out . $Nj;
                    $ind = $ind + 1;
                }
                elsif (($str_char eq "l") && ($next eq "j")) {
                    $lj = "&#1113;";
                    $out = $out . $lj;
                    $ind = $ind + 1;
                }
                elsif (($str_char eq "L") && ($next eq "j")) {
                    $Lj = "&#1033;";
                    $out = $out . $Lj;
                    $ind = $ind + 1;
                }
                elsif (($str_char eq "d") && ($next eq "ž")) {
                    $dz = "&#1119;";
                    $out = $out . $dz;
                    $ind = $ind + 1;
                }
                elsif (($str_char eq "D") && ($next eq "ž")) {
                    $Dz = "&#1039;";
                    $out = $out . $Dz;
                    $ind = $ind + 1;
                }
                else {   # one character letters
                    $out = $out . $dict{$str_char};
                }
                $ind++;
            } # of if exists
            else {
                $out = $out . $str_char;
                $ind++;
            }
        } # of while
        return $out;
    } # of sub
    </readmore>

Here is the HTML code of the input file index_latin.html, for testing:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <title>test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
    <p>primer <font color="#003300"><strong><font color="#006600">sajta</font></strong></font></p>
    <table width="124" border="1" cellspacing="2" cellpadding="2">
      <tr>
        <td width="41">Fhšž</td>
        <td width="63">Hjć</td>
      </tr>
      <tr>
        <td>abcd</td>
        <td>145</td>
      </tr>
    </table>
    <p>&nbsp;</p>
    <p>&nbsp;</p>
    <p><em>konverzija</em> iz latinice u ćirilicu šŠ čČ đĐ žŽ nj Nj</p>
    <p>poslednji red</p>
    </body>
    </html>

This is the output code. I hope that I was successful with readmore.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <title>&#1090;e&#1089;&#1090;</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
    <p>&#1087;&#1088;&#1080;&#1084;e&#1088; <font color="#003300"><strong><font color="#006600">&#1089;aj&#1090;a</font></strong></font></p>
    <table width="124" border="1" cellspacing="2" cellpadding="2">
      <tr>
        <td width="41">&#1060;&#1093;&#1096;&#1078;</td>
        <td width="63">&#1061;j&#1115;</td>
      </tr>
      <tr>
        <td>a&#1073;&#1094;&#1076;</td>
        <td>145</td>
      </tr>
    </table>
    <p>&nbsp;</p>
    <p>&nbsp;</p>
    <p><em>ko&#1085;&#1074;e&#1088;&#1079;&#1080;ja</em> &#1080;&#1079; &#1083;a&#1090;&#1080;&#1085;&#1080;&#1094;e &#1091; &#1115;&#1080;&#1088;&#1080;&#1083;&#1080;&#1094;&#1091; &#1096;&#1064; &#1095;&#1063; &#1106;&#1026; &#1078;&#1046; &#1114; &#1034;</p>
    <p>&#1087;o&#1089;&#1083;e&#1076;&#1114;&#1080; &#1088;e&#1076;</p>
    </body>
    </html>

In reply to Re^5: Begginer's question: If loops one after the other. Is that code correct? by predrag
in thread Begginer's question: If loops one after the other. Is that code correct? by predrag
