"I am parsing an HTML file for a word in a foreign language. If the word is found, it will add it to a file."

Your code expects you have a file called "test.txt" that contains one or more (Urdu?) words. BTW, is the text in that file UTF-8 encoded? Have you made sure that your script is reading it correctly?
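One way to check the "reading it correctly" part is to dump the code points of each word after reading it. This is only a sketch: the sub names `read_utf8_lines` and `code_points` are made up for illustration, and it assumes the file really is UTF-8.

```perl
use strict;
use warnings;

# Return each line of a UTF-8 text file as decoded characters, chomped.
sub read_utf8_lines {
    my ($path) = @_;
    open( my $fh, '<:encoding(UTF-8)', $path ) or die "$path: $!\n";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    return @lines;
}

# Render a string as a list of code points, e.g. "U+0628 U+0627".
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf( 'U+%04X', ord($_) ) } split( //, $string );
}

# Hypothetical usage: pass the word list as the first argument.
if (@ARGV) {
    print code_points($_), "\n" for read_utf8_lines( $ARGV[0] );
}
```

If the words are being decoded properly, Urdu letters should show up in the Arabic block (U+0600 to U+06FF). If you open the file without an encoding layer instead, each letter arrives as two separate bytes (U+00D8 U+00A8 for the letter "b", for example), and it will not match decoded text from the other file.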
Then your code expects you have a file called "platts_wkg.html", which is assumed to contain one or more matches to the words in test.txt. Is the html file encoded the same way as test.txt? Have you checked whether some of the words that "ought" to match might be using numeric character entities (e.g. &#1576; or &#x0628; for the Urdu letter "b")?
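If it turns out the html does use numeric character entities, you could decode them before matching. Here is a minimal sketch using only core Perl; `decode_numeric_entities` is a name I made up, and the sample line is invented. (The CPAN module HTML::Entities has a `decode_entities` function that also handles named entities, if you have it installed.)

```perl
use strict;
use warnings;

# Replace decimal (&#1576;) and hexadecimal (&#x628;) numeric character
# references with the characters they stand for. Named entities such as
# &amp; are deliberately left alone in this sketch.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr( hex($1) )/ge;
    $text =~ s/&#([0-9]+);/chr($1)/ge;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
my $line = '<p>&#1576;&#x627;&#1576; gate</p>';    # made-up sample line
print decode_numeric_entities($line), "\n";
```

After decoding, the html text and the (properly decoded) words from test.txt are both plain character strings, so a regex match between them has a chance of working.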
But there are some problems with the OP logic:
In other words, your "if" statement is not failing - it's doing what the logic says it should do. The problem is that the logic is wrong.
I think you should start by reading all the contents of "test.txt" before you open the html file. Combine all the target words into a single regex, and then do just one pass over the html data - like this:

```perl
use strict;
use warnings;

my $targets_file = '/Users/me/test.txt';    # (I'd rather get this from @ARGV)
open( my $urdu_words, '<', $targets_file )  # (2nd arg might need ':utf8' too)
    or die "$targets_file: $!\n";
my @target_strings = <$urdu_words>;
close $urdu_words;
chomp @target_strings;

# Skip blank lines and escape regex metacharacters, so that each target
# is matched as a literal string:
my $target_regex = join( '|', map { quotemeta($_) } grep { length($_) } @target_strings );

# Now open and read from the html file.
# Use a hash to keep track of matches, so you can sort them later:

my $html_file = '/Users/me/platts_wkg.html';    # could get that from @ARGV too
open( my $platts, '<', $html_file ) or die "$html_file: $!\n";
my %matches;
while (<$platts>) {
    if ( /^.*?<p>.*? ($target_regex) / ) {    # note: spaces are now OUTSIDE parens
        $matches{$1} .= " $_";
    }
}
close $platts;

# At this point, it would be easy to dump all the matches to a file, and
# then edit that file manually, if you want:

my $matched_file = '/Users/me/matches_found.txt';
open( my $output, '>', $matched_file ) or die "$matched_file: $!\n";
for my $match ( sort keys %matches ) {
    print $output "Matches found for target: $match\n $matches{$match}\n";
}
close $output;
```

If there's something you want to do with lines that don't match for any of the target words, you can put an "else" clause in the while loop that reads from the html file. But in that case, I would again recommend that you avoid doing anything that involves manual input to the script for each html line - put stuff into an array or hash, print it to a separate file, and deal with it in some way that's likely to be easier and less error-prone.
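For what it's worth, the "else" clause idea might look something like the sketch below. It works on in-memory data so it can be run as is; the sub name `partition_lines`, the target words, and the sample lines are all made up, and the `quotemeta` call is my assumption that the targets should be matched as literal strings.

```perl
use strict;
use warnings;

# Split lines into those matching any target word and those that do not.
# Returns two references: a hash of matched lines keyed by target word,
# and an array of the unmatched lines.
sub partition_lines {
    my ( $targets, $lines ) = @_;
    my $alternation  = join '|', map { quotemeta($_) } @$targets;
    my $target_regex = qr/^.*?<p>.*? ($alternation) /;
    my ( %matches, @unmatched );
    for my $line (@$lines) {
        if ( $line =~ $target_regex ) {
            $matches{$1} .= " $line";
        }
        else {
            push @unmatched, $line;    # set aside; no prompting per line
        }
    }
    return ( \%matches, \@unmatched );
}

# Made-up sample data, transliterated to keep the example ASCII-only.
my @targets = ( 'kitab', 'qalam' );
my @lines   = (
    "<p>H kitab : a book</p>\n",
    "<p>H darya : a river</p>\n",
);
my ( $matches, $unmatched ) = partition_lines( \@targets, \@lines );
print "matched targets: ", join( ', ', sort keys %$matches ), "\n";
print "unmatched lines: ", scalar @$unmatched, "\n";
```

From there, the unmatched array can be printed to its own file in one go and looked over afterwards, instead of answering a prompt for every line.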
In reply to Re: My "if" is failing!
by graff
in thread My "if" is failing!
by Sumtingwong