Here is the code of the converter. It is for my non-commercial beekeeping website, which is written in the Serbian Latin alphabet. I am working on a new design for it and would like to have it in Cyrillic too. It is not a small site, maybe a few hundred pages. I know there is software for that, but somehow I don't like it. Until recently I never expected that I could even have the site in Cyrillic, or that I could try to do it myself, but with Perl I think it is possible, even for a beginner like me.

It works for simple HTML pages. If I have external CSS files, maybe it will work with them too; I haven't tried yet. So I am asking the monks just for comments on my approach.

It reads an HTML file, converts the text into Cyrillic, leaves the code untouched, and creates a new HTML file in Cyrillic. The next steps are to read a whole directory or the whole website, and there are a lot of other things to be done, but that is not part of my question now.

I read the input file/string part by part, where one part is either a string of code between "<" and ">" or a string of text between ">" and "<". To determine what is code and what is text, I have a parameter $k that receives the value 1 after "<" and the value 2 after ">".
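The same separation of code and text can also be sketched with a single split instead of a character-by-character state variable. This is only a minimal sketch under some assumptions: the helper names split_html and is_code are hypothetical and not part of the script, no attribute value contains ">", and entities such as &nbsp; are treated as code wherever they appear in the text, not only at its start.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the script: split an HTML string into
# pieces. Because the pattern is in capturing parentheses, split keeps the
# tags and entities in the returned list instead of throwing them away.
sub split_html {
    my ($html) = @_;
    return grep { $_ ne '' } split /(<[^>]*>|&#?\w+;)/, $html;
}

# A piece counts as "code" if it is a whole tag or a whole entity;
# everything else is text that may be converted.
sub is_code {
    my ($piece) = @_;
    return $piece =~ /\A(?:<[^>]*>|&#?\w+;)\z/ ? 1 : 0;
}

# Show how one line is classified.
for my $piece (split_html('<p>primer&nbsp;<b>sajta</b></p>')) {
    printf "%-5s %s\n", (is_code($piece) ? 'code' : 'text'), $piece;
}
```

With this split, only the pieces for which is_code returns 0 would be passed to the conversion subroutine, and all pieces would be joined back in their original order.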

A subroutine converts the strings. A hash contains a dictionary of one-to-one equivalents. Letters that look the same (for example "a", "e", etc.) are omitted, and I wonder whether that is OK: for example, are the Latin and Cyrillic letters "a" the same in an HTML file and in its encoding?
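One quick way to check the question about look-alike letters is to compare code points. The sketch below writes both letters as explicit code points so they cannot be confused in an editor. In Unicode they are two different characters, so text that keeps the Latin "a" stays mixed-alphabet for searching and sorting even though it looks right on screen. The %lookalike hash is a hypothetical addition, not part of the script, written in the same entity style as the dictionary.

```perl
use strict;
use warnings;

# Latin small "a" and Cyrillic small "a" as explicit code points.
my $latin_a    = "\x{0061}";
my $cyrillic_a = "\x{0430}";

printf "Latin a:    U+%04X (decimal %d)\n", ord $latin_a,    ord $latin_a;
printf "Cyrillic a: U+%04X (decimal %d)\n", ord $cyrillic_a, ord $cyrillic_a;
print  "same character? ", ($latin_a eq $cyrillic_a ? "yes" : "no"), "\n";

# Hypothetical extra dictionary entries for the letters that look alike
# in both alphabets but have different code points.
my %lookalike = (
    "a" => "&#1072;", "A" => "&#1040;",
    "e" => "&#1077;", "E" => "&#1045;",
    "j" => "&#1112;", "J" => "&#1032;",
    "k" => "&#1082;", "K" => "&#1050;",
    "o" => "&#1086;", "O" => "&#1054;",
    "M" => "&#1052;", "T" => "&#1058;",
);
```

If these entries were merged into the dictionary, the converted pages would contain only Cyrillic letters in the text.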

The script also prints the output file to standard output.

#!/usr/bin/perl 
use strict;
use warnings;

use utf8;
binmode(STDOUT, ":utf8");

use open ':encoding(utf8)'; # input/output default encoding will be
                                # UTF-8

my $infile;            # whole input file as one string

open(my $in, '<', 'index_latin.html')
    or die "Cannot open index_latin.html: $!";
{
    local $/;          # slurp mode, only inside this block
    $infile = <$in>;
}
close($in);


my $k = 0;                 # state: =1 between < > , =2 between > <
my $string = '';           #  "<code between>"
my $txtstring = '';        #  >"text between"<
my $outcode = '';          # output: code and converted text together
my $for_conv;              # string to be converted by sub
my $char;                  # character from input file
my $convert;               # string converted by sub


            # splits input file into characters
foreach $char (split //, $infile) {

    if ($char eq "<") {
        $k = 1;
    }

    if ($k == 2) {
        $txtstring .= $char;
    }
    else {
        $string .= $char;
    }

    if ($char eq ">") {

        if (substr($txtstring, 0, 1) eq "&") {
                # text starting with an entity (&nbsp;) will not be converted
            $string = $txtstring . $string;   # goes to string code
            $txtstring = '';
        }

        $for_conv = $txtstring;
        $convert  = konverter($for_conv);
        $outcode .= $convert . $string;

        $k = 2;
        $string = '';
        $txtstring = '';

    }          # of if char eq ">"

}       # of foreach

            # whatever is left after the last ">" (for example a final newline)
$for_conv = $txtstring;
$outcode .= konverter($for_conv) . $string;

                # writing to  file
    my $filename = "index_cyrilic.htm";    
    open(FH, '>', $filename) or die $!;
    print FH $outcode ;
    close(FH);
<readmore>

print "\n";
print "code on the output:\n";
print "\n";
print "$outcode\n";

# converting string into Cyrillic

sub konverter {
    my ($text) = @_;       # string to be converted

              # dictionary of one-to-one equivalents
    my %dict = (
        "b" => "&#1073;", "B" => "&#1041;", "c" => "&#1094;", "C" => "&#1062;",
        "č" => "&#1095;", "Č" => "&#1063;", "ć" => "&#1115;", "Ć" => "&#1035;",
        "d" => "&#1076;", "D" => "&#1044;", "đ" => "&#1106;", "Đ" => "&#1026;",
        "f" => "&#1092;", "F" => "&#1060;", "g" => "&#1075;", "G" => "&#1043;",
        "h" => "&#1093;", "H" => "&#1061;", "i" => "&#1080;", "I" => "&#1048;",
        "l" => "&#1083;", "L" => "&#1051;", "m" => "&#1084;", "n" => "&#1085;",
        "N" => "&#1053;", "p" => "&#1087;", "P" => "&#1055;", "r" => "&#1088;",
        "R" => "&#1056;", "s" => "&#1089;", "S" => "&#1057;", "š" => "&#1096;",
        "Š" => "&#1064;", "t" => "&#1090;", "u" => "&#1091;", "U" => "&#1059;",
        "v" => "&#1074;", "V" => "&#1042;", "z" => "&#1079;", "Z" => "&#1047;",
        "ž" => "&#1078;", "Ž" => "&#1046;",
    );

              # Latin two-character letters replaced with one Cyrillic letter
    my %digraph = (
        "nj" => "&#1114;", "Nj" => "&#1034;",
        "lj" => "&#1113;", "Lj" => "&#1033;",
        "dž" => "&#1119;", "Dž" => "&#1039;",
    );

    my @conv_arr = split(//, $text);   # splits input string for conversion
    my $ind = 0;                       # index of array element
    my $out = "";                      # output, converted string

    while ($ind <= $#conv_arr) {
        my $str_char = $conv_arr[$ind];                            # current character
        my $next = $ind == $#conv_arr ? "" : $conv_arr[$ind + 1];  # next character, if any

        if (exists $digraph{$str_char . $next}) {   # nj, Nj, lj, Lj, dž, Dž
            $out .= $digraph{$str_char . $next};
            $ind += 2;
        }
        elsif (exists $dict{$str_char}) {           # one-character letters
            $out .= $dict{$str_char};
            $ind++;
        }
        else {                                      # everything else is copied unchanged
            $out .= $str_char;
            $ind++;
        }
    }          # of while

    return $out;
}         # of sub
</readmore>
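The character-by-character loop in konverter could also be written as one global substitution. The following is a minimal sketch, not the script's actual method: the name konverter_regex is hypothetical, the dictionary is shortened for illustration, and the two-character letters are simply added as ordinary keys. It assumes the source file is saved as UTF-8, as the script already requires with use utf8.

```perl
use strict;
use warnings;
use utf8;

# Shortened dictionary for illustration only; the full table is in the script.
my %dict = (
    "nj" => "&#1114;", "Nj" => "&#1034;",
    "lj" => "&#1113;", "Lj" => "&#1033;",
    "dž" => "&#1119;", "Dž" => "&#1039;",
    "n"  => "&#1085;", "l"  => "&#1083;", "d" => "&#1076;",
    "ž"  => "&#1078;", "r"  => "&#1088;", "p" => "&#1087;",
    "s"  => "&#1089;",
);

# Build one alternation with the longest keys first, so that "nj" is
# tried before "n" at every position.
my $alternatives = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %dict;
my $pattern = qr/($alternatives)/;

# Hypothetical replacement for konverter(): the substitution scans the
# original text once, so the inserted entities are never converted again.
sub konverter_regex {
    my ($text) = @_;
    $text =~ s/$pattern/$dict{$1}/g;
    return $text;
}

print konverter_regex("poslednji red"), "\n";
```

Characters that have no key in the hash, such as "o", "e" and the space, pass through unchanged, which matches what the loop version does.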

Here is the HTML code of the input file index_latin.html, for testing:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<body>
<p>primer <font color="#003300"><strong><font color="#006600">sajta</font></strong></font></p>
<table width="124" border="1" cellspacing="2" cellpadding="2">
  <tr>
    <td width="41">Fhšž</td>
    <td width="63">Hjć</td>
  </tr>
  <tr>
    <td>abcd</td>
    <td>145</td>
  </tr>
</table>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p><em>konverzija</em> iz latinice u ćirilicu šŠ čČ đĐ žŽ nj Nj</p>
<p>poslednji red</p>
</body>
</html>

This is the output code. I hope I used readmore correctly.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>&#1090;e&#1089;&#1090;</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<body>
<p>&#1087;&#1088;&#1080;&#1084;e&#1088; <font color="#003300"><strong><font color="#006600">&#1089;aj&#1090;a</font></strong></font></p>
<table width="124" border="1" cellspacing="2" cellpadding="2">
  <tr>
    <td width="41">&#1060;&#1093;&#1096;&#1078;</td>
    <td width="63">&#1061;j&#1115;</td>
  </tr>
  <tr>
    <td>a&#1073;&#1094;&#1076;</td>
    <td>145</td>
  </tr>
</table>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p><em>ko&#1085;&#1074;e&#1088;&#1079;&#1080;ja</em> &#1080;&#1079; &#1083;a&#1090;&#1080;&#1085;&#1080;&#1094;e &#1091; &#1115;&#1080;&#1088;&#1080;&#1083;&#1080;&#1094;&#1091; &#1096;&#1064; &#1095;&#1063; &#1106;&#1026; &#1078;&#1046; &#1114; &#1034;</p>
<p>&#1087;o&#1089;&#1083;e&#1076;&#1114;&#1080; &#1088;e&#1076;</p>
</body>
</html>

In reply to Re^5: Begginer's question: If loops one after the other. Is that code correct? by predrag
in thread Begginer's question: If loops one after the other. Is that code correct? by predrag
