PerlMonks  

Re^3: Begginer's question: If loops one after the other. Is that code correct?

by haukex (Archbishop)
on Jan 10, 2017 at 16:30 UTC ( [id://1179336]=note )


in reply to Re^2: Begginer's question: If loops one after the other. Is that code correct? (updated)
in thread Begginer's question: If loops one after the other. Is that code correct?

Hi predrag,

It sounds like the trickiest part of your current solution is probably figuring out whether you're in some part of the HTML code or whether you're in the text, since obviously tags shouldn't be converted to Cyrillic. Unfortunately, parsing HTML is a pretty difficult task (a humorous post about the topic). So I'd like to encourage you to look at one of the parser modules again.

Two classic modules are HTML::Parser and HTML::TreeBuilder, but there are several others, such as Mojo::DOM. If the input is always XHTML, there's XML::Twig and many more XML-based modules. These modules generally break the HTML down into its structure, including elements (<tags>) with their attributes, comments, and text. Some of the modules then represent the HTML as a Document Object Model (DOM), which is also worth reading a little about. It sounds like you only want to operate on text, and maybe on some elements' attributes (such as title="..." attributes).

Operating only on text is relatively easy: for example, in a HTML::Parser solution, you could register a handler on the text event, which does the appropriate conversions, and register a default handler which just outputs everything else unchanged:

    use warnings;
    use strict;
    use HTML::Parser;

    my $p = HTML::Parser->new( api_version => 3, unbroken_text => 1 );
    $p->handler(text => sub {
            my ($text) = @_;
            # ### Your filter here ###
            $text=~s/foo/bar/g;
            print $text;
        }, 'text');
    $p->handler(default => sub { print shift; }, 'text');

    my $infile  = '/tmp/in.html';
    my $outfile = '/tmp/out.html';
    open my $out, '>', $outfile or die "open $outfile: $!";
    # "select" redirects the "print"s
    my $previous = select $out;
    $p->parse_file($infile);
    close $out;
    select $previous;
    print "$infile -> $outfile\n";

Operating on attributes will require you to handle opening elements (tags) as well. Note also that the same basic principle I described above applies to the other modules: they all break the HTML down into its components, so that you can operate on only the textual parts, leaving the others unchanged.

BTW, have you seen Lingua::Translit?

Hope this helps,
-- Hauke D

Replies are listed 'Best First'.
Re^4: Begginer's question: If loops one after the other. Is that code correct?
by predrag (Scribe) on Jan 10, 2017 at 17:34 UTC

    Thanks a lot Hauke D. Yes, that was the problem I had to solve: knowing the position. No, I didn't know about Lingua::Translit.

    When converting into Cyrillic, it is important to know that not all Cyrillic letters correspond to a single Latin letter; a few correspond to two Latin letters.

    Oh, you already posted code for converting with Parser, thank you so much. I will save that code and try it later. I would still like to show my work though :) if nothing else, it may be fun for real programmers to look at. Maybe I am too hard on my own work. I also solved the problem with the HTML &nbsp;, so my script treats it as code and doesn't convert it.

    I think I have installed HTML::TreeBuilder too. For now I run Perl in a virtual machine with CentOS 6.7 and an old Ubuntu, and I was pretty lucky installing modules on CentOS; I was a bit scared before. I have installed Perl for Windows too.

      Here is the code of the converter. It is for my non-commercial beekeeping website, which is in the Serbian Latin alphabet. I am working on its new design and would like to have it converted into Cyrillic too. It is not a small site, maybe a few hundred pages. I know there is software for that, but somehow I don't like it. Until recently I never expected I could even have the site in Cyrillic, or could even try to do it myself, but with Perl I think it is possible, even for a beginner like me.

      It works for simple HTML pages. If I have external CSS files, maybe it will work with those pages too; I haven't tried yet. So I am asking the monks just for comments on my approach.

      It reads an HTML file, converts the text into Cyrillic, leaves the code untouched, and creates a new HTML file in Cyrillic. The next steps are to read a whole directory or the whole website, and there are a lot of other things to be done, but that is not part of my question now.

      I read the input file/string part by part, where one part is either a string of code between < and > or a string of text between > and <. To determine where the code is and where the text is, I have a parameter k that gets the value 1 after "<" and the value 2 after ">".

      A subroutine converts the strings. A hash contains a dictionary of one-to-one equivalents. Letters that are the same (for example "a", "e" etc.) are omitted, and I wonder whether that is OK: for example, are the Latin and Cyrillic letters "a" the same in an HTML file and its encoding?

      The script prints the output file to standard output too.

          #!/usr/bin/perl
          use strict;
          use warnings;
          use utf8;
          binmode(STDOUT, ":utf8");
          use open ':encoding(utf8)';   # input/output default encoding will be
                                        # UTF-8

          my $infile;
          # reads input file into string $infile
          open INPUT, "<index_latin.html";
          undef $/;
          $infile =<INPUT>;
          close INPUT;

          my $k;                # parameter =1 between < > , =2 between > <
          my $string;           # "<code between>"
          my $txtstring = '';   # >"text between"<
          my $outcode = '';     # output: code and converted text together
          my $for_conv;         # string to be converted by sub
          my $char;             # character from input file
          my $convert;          # converted string by sub

          # splits input file into characters
          foreach $char (split//, $infile) {
              if ($char eq "<") {
                  $k = 1;
              }
              if ($k ==2) {
                  $txtstring= $txtstring . $char;
              }
              else {
                  $string = $string .$char;
              }
              if ($char eq ">") {
                  if (substr($txtstring, 0, 1) eq "&" ){   # &nbsp will not be converted
                      $string =$txtstring.$string;         #goes to string code
                      $txtstring = '';                     ##
                  }
                  $for_conv = $txtstring;
                  $convert = konverter($for_conv);
                  $outcode = $outcode .$convert.$string;
                  $k = 2;
                  $string = '';
                  $txtstring = '';
              } # of if char eq ">"
          } # of foreach

          # writing to file
          my $filename = "index_cyrilic.htm";
          open(FH, '>', $filename) or die $!;
          print FH $outcode ;
          close(FH);

          print "\n";
          print "code on the output:\n";
          print "\n";
          print "$outcode\n";

          # converting string into Cyrillic
          sub konverter {
              # dictionary
              my %dict = (
                  "b"=> "&#1073;","B"=> "&#1041;","c"=> "&#1094;","C"=> "&#1062;",
                  "&#269;"=> "&#1095;","&#268;"=> "&#1063;","&#263;"=> "&#1115;","&#262;"=> "&#1035;",
                  "d"=> "&#1076;","D"=> "&#1044;","&#273;"=> "&#1106;","&#272;"=> "&#1026;",
                  "f"=> "&#1092;","F"=> "&#1060;","g"=> "&#1075;","G"=> "&#1043;",
                  "h"=> "&#1093;","H"=> "&#1061;","i"=> "&#1080;","I"=> "&#1048;",
                  "l"=> "&#1083;","L"=> "&#1051;","m"=> "&#1084;","n"=> "&#1085;",
                  "N"=> "&#1053;","p"=> "&#1087;","P" => "&#1055;","r" => "&#1088;",
                  "R" => "&#1056;","s"=> "&#1089;","S"=> "&#1057;","š"=> "&#1096;",
                  "Š"=> "&#1064;","t"=> "&#1090;","u"=> "&#1091;","U"=> "&#1059;",
                  "v"=> "&#1074;","V" => "&#1042;","z"=> "&#1079;", "Z" => "&#1047;",
                  "ž"=> "&#1078;","Ž"=> "&#1046;");

              my @conv_arr = split (//, $for_conv);   # splits input string for conversion
              my $ind = 0;     # index of array element
              my $out = "";    # output, converted string
              my $str_char;    # string character
              my $next;        # next string character
              my $nj;          # Latin two character letters to be replaced with one Cyrillic
              my $Nj;
              my $lj;
              my $Lj;
              my $dz;
              my $Dz;

              while ($ind <= $#conv_arr){
                  $str_char = $conv_arr[$ind];    # current character
                  if ($ind ==$#conv_arr) {
                      $next ="";                  # there are no more characters
                  }
                  else {
                      $next =$conv_arr[$ind+1];   # next character
                  }
                  if (exists ($dict{$str_char})) {
                      # combination nj gives $nj = "&#1114;"
                      if (($str_char eq "n") && ($next eq "j")){
                          $nj = "&#1114;";
                          $out = $out.$nj;
                          $ind = $ind+1;
                      }
                      elsif (($str_char eq "N") && ($next eq "j")){
                          $Nj = "&#1034;";
                          $out = $out.$Nj;
                          $ind = $ind+1;
                      }
                      elsif (($str_char eq "l") && ($next eq "j")){
                          $lj = "&#1113;";
                          $out = $out.$lj;
                          $ind = $ind+1;
                      }
                      elsif (($str_char eq "L") && ($next eq "j")){
                          $Lj = "&#1033;";
                          $out = $out.$Lj;
                          $ind = $ind+1;
                      }
                      elsif (($str_char eq "d") && ($next eq "ž")){
                          $dz = "&#1119;";
                          $out = $out.$dz;
                          $ind = $ind+1;
                      }
                      elsif (($str_char eq "D") && ($next eq "ž")){
                          $Dz = "&#1039;";
                          $out = $out.$Dz;
                          $ind = $ind+1;
                      }
                      else {   # one character letters
                          $out = $out.$dict{$str_char};
                      }
                      $ind++;
                  } # of if exists
                  else {
                      $out = $out.$str_char;
                      $ind++;
                  }
              } # of while
              return $out;
          } # of sub

      Here is the html code of input file index_latin.html for testing

      it is the output code, I hope that I was successful with readmore

        You can simplify the code by removing the special cases for the two-character combinations and just use a regex. Just make sure you try to match the longer "characters" first, so their parts aren't matched instead.

        Also, I used XML::LibXML to parse the structure.

            #!/usr/bin/perl
            use warnings;
            use strict;
            use utf8;

            use XML::LibXML;

            my $file = shift;

            my %to_cyrilic = (
                # Insert the hash definition here, see below.
            );

            my $regex = join '|', sort { length $b <=> length $a } keys %to_cyrilic;

            my $dom = 'XML::LibXML'->load_html( location => $file );
            for my $text ($dom->findnodes('//text()')) {
                my $etext = $text;
                $text->setData($etext) if $etext =~ s/($regex)/$to_cyrilic{$1}/g;
            }
            print $dom;

        Note that PRE tags preserve non-latin1 characters.

            F => 'Ф',
            H => 'Х',
            N => 'Н',
            Nj => 'Њ',
            b => 'б',
            c => 'ц',
            d => 'д',
            e => 'e',
            h => 'х',
            i => 'и',
            l => 'л',
            m => 'м',
            n => 'н',
            nj => 'њ',
            p => 'п',
            r => 'р',
            s => 'с',
            t => 'т',
            u => 'у',
            v => 'в',
            z => 'з',
            ć => 'ћ',
            Č => 'Ч',
            č => 'ч',
            Đ => 'Ђ',
            đ => 'ђ',
            Š => 'Ш',
            š => 'ш',
            Ž => 'Ж',
            ž => 'ж',
            # etc., this is enough to run the example.
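
The "longer characters first" advice can be tried in isolation, without any HTML parsing. The following is only a minimal sketch: the seven-entry %to_cyrillic table and the helper names build_regex and transliterate are made up for this illustration and are not part of the code above. It builds the same kind of alternation twice, once sorted longest first and once alphabetically, so the difference is visible.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal Cyrillic letters

# Hypothetical, deliberately tiny mapping (not the full table).
my %to_cyrillic = (
    l  => 'л',
    lj => 'љ',
    n  => 'н',
    nj => 'њ',
    u  => 'у',
    d  => 'д',
    i  => 'и',
);

# Build one capturing alternation from the keys, in the order given.
sub build_regex {
    my @keys = @_;
    my $alternation = join '|', map { quotemeta($_) } @keys;
    return qr/($alternation)/;
}

# Replace every match of $regex with its entry in %to_cyrillic.
sub transliterate {
    my ($text, $regex) = @_;
    $text =~ s/$regex/$to_cyrillic{$1}/g;
    return $text;
}

# Longest keys first, so "lj" is tried before "l".
my $longest_first = build_regex(
    sort { length $b <=> length $a or $a cmp $b } keys %to_cyrillic
);
# Alphabetical order puts "l" before "lj", which is the failure case.
my $alphabetical = build_regex(sort keys %to_cyrillic);

binmode STDOUT, ':encoding(UTF-8)';
print transliterate('ljudi', $longest_first), "\n";   # "lj" becomes one letter
print transliterate('ljudi', $alphabetical),  "\n";   # "l" is converted, "j" is left over
```

Perl tries the alternatives of a regex from left to right at each position, which is why the order of the keys matters at all.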
        
        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        Hi predrag,

        choroba has already shown you a more "perlish" approach to the problem, using a hash table and a regular expression. However, note that just because you're writing in Perl doesn't mean you have to do it that way, since in Perl, TIMTOWTDI - There Is More Than One Way To Do It. (My comments that it's better to parse HTML with a module still apply, though.)

        I had a look at your code, and even though I haven't tested it myself since you said that it works, it does look like the logic is fairly sound. I'm not entirely clear yet on the order of operations in the foreach $char loop, but as I said before the best way to go about checking it is with enough sample input that exercises all the logic branches.

        The one thing that I'm a little confused about is the placement of the if (substr($txtstring, 0, 1) eq "&" ){ statement. It seems to me like this is only handling & characters at certain points in the input string, instead of anywhere in the input string. This might be a place where either index or a regular expression might be more appropriate (or, of course, a full HTML parser :-)).
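
As a rough illustration of the regular expression idea, here is a sketch that protects character references wherever they occur in a piece of text, not only at its start. The names convert_text and convert_outside_entities are hypothetical, and convert_text merely uppercases so that the effect is visible in plain ASCII; a real converter would go in its place.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real converter: it only uppercases.
sub convert_text {
    my ($text) = @_;
    return uc $text;
}

# Convert everything except character references such as
# &nbsp;, &#269; or &#x10D;.
sub convert_outside_entities {
    my ($text) = @_;
    # Because of the capturing group, split also returns the separators,
    # so plain text sits at even indexes and entities at odd indexes.
    my @parts = split /(&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)/, $text;
    my $out = '';
    for my $i (0 .. $#parts) {
        $out .= $i % 2 ? $parts[$i] : convert_text($parts[$i]);
    }
    return $out;
}

print convert_outside_entities('med&nbsp;i &amp; vosak'), "\n";
```

This still treats the text as a flat string, so it is no substitute for a parser; it only addresses the question of where in the string the & may appear.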

        Anyway, I thought I might give some general comments about your code:

        • Instead of binmode(STDOUT, ":utf8"); use open ':encoding(utf8)';, I believe you can just say use open qw/:std :utf8/; (this also affects STDIN and STDERR).
        • open INPUT, "<index_latin.html"; - I'd recommend the three-argument form of open, with error handling, as well as lexical filehandles (my $infh instead of INPUT): open my $infh, '<', 'index_latin.html' or die "open: $!";
        • undef $/; - the effect of a change to the $/ variable will be global, throughout the whole program. A common way to do this is to use local inside of a do block; the effect of local will then cause the variable to be reset to its original value when the block is exited. You'll see this often in Perl to read an entire file at once ("slurp"): my $infile = do { local $/; <INPUT> };
        • You have quite a few variable declarations (my ...) before the code starts. Note that it's usually better to wait with declaring variables until the scope where they're needed, as otherwise there might be conflicts if you accidentally re-use a variable or forget which scope you're working in. For example, instead of my $char; foreach $char ... it's usually better to say foreach my $char ... (unless of course you specifically need $char after the loop ends).
        • my $k; - I'd recommend to use textual representations instead of magic numbers here. For example, you can use constant: use constant { INSIDE_TAG=>1, OUTSIDE_TAG=>2 }; and then use the two values INSIDE_TAG and OUTSIDE_TAG instead of the numbers.
        • my $Nj; ... $Nj = "&#1034;"; $out = $out.$Nj; can also be written much shorter as $out = $out."&#1034;"; (since each of those variables like $Nj is used only once).
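
Put together, several of the points above might look roughly like the following sketch. The helper names slurp and chunks are invented for this illustration, and the tag splitting is the same naive character-by-character idea as in the original script (it knows nothing about comments, scripts, or ">" inside attribute values), so it is not a replacement for an HTML parser.

```perl
use strict;
use warnings;
use open qw/:std :utf8/;   # UTF-8 on STDIN/STDOUT/STDERR and on files opened here
use constant { INSIDE_TAG => 1, OUTSIDE_TAG => 2 };

# Read a whole file into one string ("slurp").
sub slurp {
    my ($filename) = @_;
    open my $infh, '<', $filename or die "open $filename: $!";
    my $content = do { local $/; <$infh> };   # $/ is restored when the block exits
    close $infh;
    return $content;
}

# Naively split a string into [state, chunk] pairs,
# where state is INSIDE_TAG or OUTSIDE_TAG.
sub chunks {
    my ($html) = @_;
    my @chunks;
    my $state = OUTSIDE_TAG;
    my $buf   = '';
    for my $char (split //, $html) {   # $char exists only inside the loop
        if ($char eq '<') {
            push @chunks, [$state, $buf] if length $buf;
            $buf   = '';
            $state = INSIDE_TAG;
        }
        $buf .= $char;
        if ($char eq '>') {
            push @chunks, [$state, $buf];
            $buf   = '';
            $state = OUTSIDE_TAG;
        }
    }
    push @chunks, [$state, $buf] if length $buf;
    return @chunks;
}

for my $chunk (chunks('<p>med</p>')) {
    my ($state, $text) = @$chunk;
    print(($state == INSIDE_TAG ? 'tag:  ' : 'text: '), $text, "\n");
}
```

With named states, a condition such as $state == OUTSIDE_TAG explains itself, whereas $k == 2 requires looking up what 2 meant.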

        As for your question here about &nbsp;, you're right, my code didn't handle that. The solution is to change the 'text' to 'dtext' (decoded text) in $p->handler(text => sub { ... }, 'text');. Also, I didn't have full UTF-8 handling in that code, I should have said open my $out, '>:utf8', $outfile or die ... to open the output file, and for parsing the input file I should have done this: open my $infh, '<:utf8', $infile or die "open $infile: $!"; $p->parse_file($infh); (this is mentioned in the HTML::Parser documentation).

        As you've noticed, PerlMonks isn't perfect in regards to Unicode. Even though Perl itself handles it fine, I just wanted to point out that there are other ways to represent Unicode characters in Perl where the source file can be left in ASCII (and that won't cause trouble when posting to PerlMonks). For example, instead of "č"=>"ч", you can write "\x{010D}"=>"\x{0447}" or "\N{LATIN SMALL LETTER C WITH CARON}"=>"\N{CYRILLIC SMALL LETTER CHE}" (depending on the Perl version, for the latter you may have to add use charnames ':full'; at the top of your code). These forms certainly don't look as nice, so you don't have to use them if Unicode works for you, but it's also noteworthy that this will make the difference between "A" and "А" more obvious (one of them is actually "\N{CYRILLIC CAPITAL LETTER A}").
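
A small sketch of these notations, kept entirely in ASCII; the variable names are only for illustration. It checks that the \x{...} and \N{...} forms denote the same character, and shows that the two look-alike letters "A" have different code points.

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} on older Perls, harmless on newer ones

# Two ways to write the same character without leaving ASCII.
my $by_code = "\x{010D}";
my $by_name = "\N{LATIN SMALL LETTER C WITH CARON}";
print "same character\n" if $by_code eq $by_name;

# The same pair as "č" => "ч", written with escapes.
my %dict = ( "\x{010D}" => "\N{CYRILLIC SMALL LETTER CHE}" );
printf "U+%04X => U+%04X\n", ord $by_code, ord $dict{$by_code};

# Look-alike letters are different characters.
my $latin_a    = "A";
my $cyrillic_a = "\N{CYRILLIC CAPITAL LETTER A}";
printf "U+%04X vs U+%04X\n", ord $latin_a, ord $cyrillic_a;
```

This also bears on the earlier question about omitting "a" and "e" from the dictionary: the Latin and Cyrillic letters look alike but are stored as different characters.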

        Hope this helps,
        -- Hauke D

Re^4: Begginer's question: If loops one after the other. Is that code correct?
by predrag (Scribe) on Jan 11, 2017 at 12:54 UTC

    Hauke D, I've tried your code with Parser module, works very well, thanks again. I understand the use of  s/foo/bar/g; now. It is a next important step for me, I am really encouraged to try different uses of that module on my site, as well as other modules you suggested. Maybe even for some simple search machines etc.

    It seems Parser doesn't handle a non-breaking space

     <p>&nbsp;</p>

    but I think I can resolve that
