
Hi predrag,

choroba has already shown you a more "perlish" approach to the problem, using a hash table and a regular expression. However, note that just because you're writing in Perl doesn't mean you have to do it that way, since in Perl, TIMTOWTDI - There Is More Than One Way To Do It. (My comments that it's better to parse HTML with a module still apply, though.)

I had a look at your code, and although I haven't tested it myself (you said that it works), the logic looks fairly sound. I'm not entirely clear yet on the order of operations in the foreach $char loop, but as I said before, the best way to check it is with enough sample input to exercise all the logic branches.

The one thing that I'm a little confused about is the placement of the if (substr($txtstring, 0, 1) eq "&" ){ statement. It seems to me like this is only handling & characters at certain points in the input string, instead of anywhere in the input string. This might be a place where either index or a regular expression might be more appropriate (or, of course, a full HTML parser :-)).
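To illustrate what I mean by handling & anywhere in the string, here is a minimal sketch of both alternatives. The sample string, the small %entity table, and the variable names other than $txtstring are made up for illustration; they are not taken from your code, and the table is far from complete:

```perl
use strict;
use warnings;

# Hypothetical sample input; the entity table is illustrative only.
my $txtstring = 'Tom &amp; Jerry &lt;3 cheese &amp; crackers';
my %entity    = ( amp => '&', lt => '<', gt => '>' );

# Regex approach: /g finds every entity wherever it occurs in the string,
# not only at position 0. Unknown entities are left untouched.
( my $decoded = $txtstring ) =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;

# index approach: walk through the string and record each position of "&".
my @positions;
my $pos = 0;
while ( ( my $found = index( $txtstring, '&', $pos ) ) >= 0 ) {
    push @positions, $found;
    $pos = $found + 1;
}

print "$decoded\n";
print "ampersands at: @positions\n";
```

For real HTML I would still reach for a module rather than a hand-written table like this one.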

Anyway, I thought I might give some general comments about your code:

Instead of binmode(STDOUT, ":utf8"); use open ':encoding(utf8)';, I believe you can just say use open qw/:std :utf8/; (this also affects STDIN and STDERR).
open INPUT, "<index_latin.html"; - I'd recommend the three-argument form of open, with error handling, as well as lexical filehandles (my $infh instead of INPUT): open my $infh, '<', 'index_latin.html' or die "open: $!";
undef $/; - the effect of a change to the $/ variable will be global, throughout the whole program. A common way to do this is to use local inside of a do block; the effect of local will then cause the variable to be reset to its original value when the block is exited. You'll see this often in Perl to read an entire file at once ("slurp"): my $infile = do { local $/; <INPUT> };
You have quite a few variable declarations (my ...) before the code starts. Note that it's usually better to wait with declaring variables until the scope where they're needed, as otherwise there might be conflicts if you accidentally re-use a variable or forget which scope you're working in. For example, instead of my $char; foreach $char ... it's usually better to say foreach my $char ... (unless of course you specifically need $char after the loop ends).
my $k; - I'd recommend using named constants instead of magic numbers here. For example, you can use constant: use constant { INSIDE_TAG=>1, OUTSIDE_TAG=>2 }; and then use the two names INSIDE_TAG and OUTSIDE_TAG instead of the numbers.
my $Nj; ... $Nj = "Њ"; $out = $out.$Nj; can also be written much more briefly as $out = $out."Њ"; or even $out .= "Њ"; (since each of those variables like $Nj is used only once).
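Several of the points above can be combined into one small sketch. Please note that this is not your program: the temporary sample file, the variable names $state and $filename, and the very simplified tag-skipping loop are assumptions made only to show the idioms together.

```perl
use strict;
use warnings;
use open qw/:std :utf8/;    # UTF-8 layers for STDIN/STDOUT/STDERR and files opened in this file
use File::Temp qw/tempfile/;

# Named states instead of magic numbers.
use constant { INSIDE_TAG => 1, OUTSIDE_TAG => 2 };

# Create a small sample file so that the sketch is self-contained.
my ( $tmpfh, $filename ) = tempfile( UNLINK => 1 );
print {$tmpfh} "<p>abc</p>";
close $tmpfh or die "close: $!";

# Three-argument open with a lexical filehandle and error handling.
open my $infh, '<', $filename or die "open $filename: $!";

# Slurp the whole file; local restores $/ when the do block is exited.
my $infile = do { local $/; <$infh> };
close $infh;

my $state = OUTSIDE_TAG;
my $out   = '';
foreach my $char ( split //, $infile ) {    # $char exists only inside the loop
    if    ( $char eq '<' )          { $state = INSIDE_TAG }
    elsif ( $char eq '>' )          { $state = OUTSIDE_TAG }
    elsif ( $state == OUTSIDE_TAG ) { $out .= $char }
}
print "$out\n";
```

Because $/ is only changed inside the do block, any later line-by-line reads in the same program still behave normally.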

As for your question here about &nbsp;, you're right, my code didn't handle that. The solution is to change the 'text' to 'dtext' (decoded text) in $p->handler(text => sub { ... }, 'text');. Also, I didn't have full UTF-8 handling in that code: I should have said open my $out, '>:utf8', $outfile or die ... to open the output file, and for parsing the input file I should have done this: open my $infh, '<:utf8', $infile or die "open $infile: $!"; $p->parse_file($infh); (this is mentioned in the HTML::Parser documentation).
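Here is a minimal sketch of the difference that 'dtext' makes. It needs HTML::Parser from CPAN (it is not a core module); the sample HTML string and the $collected variable are made up for illustration, and I parse from a string here instead of a file only to keep the example self-contained:

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Hypothetical sample input containing entities.
my $html = '<p>fish&nbsp;&amp;&nbsp;chips</p>';

my $collected = '';
my $p = HTML::Parser->new( api_version => 3, unbroken_text => 1 );

# 'dtext' passes the text with entities already decoded;
# 'text' would pass "&nbsp;" and "&amp;" through unchanged.
$p->handler( text => sub { my ($dtext) = @_; $collected .= $dtext }, 'dtext' );

$p->parse($html);
$p->eof;

# Print the code points so that the non-breaking space (U+00A0) is visible.
printf "U+%04X ", ord($_) for split //, $collected;
print "\n";
```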

As you've noticed, PerlMonks isn't perfect with regard to Unicode. Even though Perl itself handles it fine, I just wanted to point out that there are other ways to represent Unicode characters in Perl where the source file can be left in ASCII (and that won't cause trouble when posting to PerlMonks). For example, instead of "č"=>"ч", you can write "\x{010D}"=>"\x{0447}" or "\N{LATIN SMALL LETTER C WITH CARON}"=>"\N{CYRILLIC SMALL LETTER CHE}" (depending on the Perl version, for the latter you may have to add use charnames ':full'; at the top of your code). These forms certainly don't look as nice, so you don't have to use them if Unicode works for you, but it's also worth noting that they make the difference between "A" and "А" more obvious (one of them is actually "\N{CYRILLIC CAPITAL LETTER A}").
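A short sketch of these ASCII-only notations, with variable names that are only illustrative:

```perl
use strict;
use warnings;
use charnames ':full';    # only needed on older Perls for \N{...}

# Two ASCII-only ways to write the same character.
my $by_hex  = "\x{010D}";
my $by_name = "\N{LATIN SMALL LETTER C WITH CARON}";
print "same character\n" if $by_hex eq $by_name;

# The two letters that look alike have different code points.
my $latin_a    = "A";
my $cyrillic_a = "\N{CYRILLIC CAPITAL LETTER A}";
printf "Latin A is U+%04X, Cyrillic A is U+%04X\n", ord($latin_a), ord($cyrillic_a);

# A one-entry mapping table written without any non-ASCII source characters.
my %map = ( "\x{010D}" => "\x{0447}" );
( my $word = "\x{010D}a" ) =~ s/(.)/exists $map{$1} ? $map{$1} : $1/ge;
printf "U+%04X ", ord($_) for split //, $word;
print "\n";
```

Printing the code points instead of the characters themselves keeps the output readable even on a terminal that isn't set up for UTF-8.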

Hope this helps,
-- Hauke D

In reply to Re^6: Begginer's question: If loops one after the other. Is that code correct? by haukex
in thread Begginer's question: If loops one after the other. Is that code correct? by predrag
