complex string matching

freebsdboy has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a small Perl script that reads an input file containing one string value per line, opens a second file (the one to be updated), adds any entries from the input file that don't already exist in it, and sends the output to a third file.

The example use is adding entries to Firefox user.js files, though I would like to make this generic enough to reuse for other types of files that need updating. We want to define a few entries if they don't already exist, for example adding user_pref("app.update.auto", false); because we have a central software update system for FF.

The input file update.txt would contain:

user_pref("app.update.auto", false); 
user_pref("app.update.enabled", false);
user_pref("autoupdate.enabled", false);

The user.js file (the file to be updated) looks like:

user_pref("browser.bookmarks.file", "h:\\Netscape\\bookmark.htm");
..

The complex string comparison issue comes into play when trying to compare the input string against what is already there. It never matches: even an exact match that exists in both files is never found. So I suspect the special characters found in both the input file and the file to be updated. Should I replace the special characters with non-special but unique characters and then compare? What's the best way to do the comparison while keeping the utility general enough that I can use it against other types of files? Here is the main code:

use warnings;
my $fh;
my $myfile = 'update.txt';
unless (open($fh,"<",$myfile)) {
  die "Can't open $myfile: $!\n";
}
my %update_words = ();

## Read in update file strings
##################################
while (<$fh>) {
  chomp;
  $update_words{$_}++;
  print "$_\n";
}
close($fh);
#################################

# Find files that need updating
#################################
use File::Path;
@files = <"c:/documents and settings/*">;
foreach $file (@files) {
  print $file . "\n";
  clean_extension($file);
}

sub clean_extension {
#  For each of the mozilla profiles in this users profile
@exfiles = <"@_/Application Data/Mozilla/Firefox/Profiles/*">;
        foreach $exfiles (@exfiles) {
          if ( -e "$exfiles/user.js" ) {
          print "$exfiles/user.js\n";
            open(USERPREF, "$exfiles/user.js") or die "Profile open failed.";
            # Lines look like:   user_pref("app.update.auto", false);
            while(<USERPREF>) {
                chomp;
                    
                    if (exists($update_words{$_})) {
                       print "found $_ \n";
                       # Write 
                    }
                    else {
                       print "not found $_ \n";
                    # write $update_words{$_} to file Outfile
                    }
            #}
        }
       }
    }
                
          }


Replies are listed 'Best First'.
Re: complex string matching by jethro (Monsignor) on Nov 04, 2010 at 12:21 UTC
1) You can escape regex special characters with quotemeta, or inside a regex with `\Q$myliteralstring\E`. But that is not necessary here, because you are matching the string via a hash lookup.

2) A hash lookup is only useful if you can make exact matches; otherwise you need regexes. Make sure you have exactly the same line, spaces and all.

3) You are trying to find out whether a config option is NOT in the file and then append it. Obviously you can't know that it isn't in the file until you have read and compared ALL lines of the file. Consequently you can't print "not found" and add the line inside the loop that reads the file. THAT observation is only possible after the loop. A correct while loop would look like this:

    my $found=0;
    while(<USERPREF>) {
        chomp;
        if (exists($update_words{$_})) {
            print "found $_ \n";
            $found=1;
        }
    }
    if (not $found) {
        # $_ is no longer defined here, so don't print it
        print "not found\n";
        #write
    }

4) What happens if the option line is already in the file, but with a different value, i.e. `user_pref("app.update.auto", true);` instead of false? If the later user_pref line trumps the earlier one, you are fine. If it is the other way round, you would need to insert the lines at the beginning of the output file. If having the same option twice is unwanted or causes error or warning messages, you can't just compare with a hash; you would need to use regexes to first find the option line (without the option value), then replace the line with the correct option value.

Or, if you don't want to use regexes, change your hash to have `user_pref("app.update.auto"` as key and the complete line as value. In the loop, cut away the option value from every line (i.e. `my ($comparevalue) = split(/,/,$_);` might do the trick if no ',' is ever in an option name; the parentheses give list context, so the first field is assigned rather than the number of fields) and compare that with the hash:

    while (<$fh>) {
        chomp;
        my ($comparevalue) = split(/,/,$_);
        $update_words{$comparevalue}= $_;
        print "$_\n";
    }
    ...
    while(<USERPREF>) {
        chomp;
        my ($comparevalue) = split(/,/,$_);
        if (exists($update_words{$comparevalue})) {
            print "found $_ \n";
            $found=1;
            # the lines were chomped, so put the newline back
            print OUTFILE "$update_words{$comparevalue}\n";
        }
        else {
            print OUTFILE "$_\n";
        }
    }
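Points 3 and 4 can be combined in one pass. The following is only a sketch under stated assumptions: the helper names `option_key` and `merge_prefs` are made up for illustration, the lines are held in memory rather than read from the real files, and option names are assumed never to contain a comma. It replaces lines whose key matches, and only after the loop appends the update lines that were never seen.

```perl
use strict;
use warnings;

# Hypothetical helper: the part of a line before the first comma,
# e.g. 'user_pref("app.update.auto"'. Assumes no comma in option names.
sub option_key {
    my ($line) = @_;
    my ($key) = split /,/, $line, 2;   # list context: first field, not a count
    return defined $key ? $key : '';
}

# Hypothetical helper: merge update lines into the existing lines.
sub merge_prefs {
    my ($existing, $updates) = @_;     # two array references of chomped lines
    my %update_for = map { option_key($_) => $_ } @$updates;
    my (%seen, @out);
    for my $line (@$existing) {
        my $key = option_key($line);
        if (exists $update_for{$key}) {
            push @out, $update_for{$key};   # replace with the wanted value
            $seen{$key} = 1;
        }
        else {
            push @out, $line;               # keep unrelated lines untouched
        }
    }
    # Only now, after every line has been read, is it known what is missing.
    for my $line (@$updates) {
        push @out, $line unless $seen{ option_key($line) };
    }
    return @out;
}

my @merged = merge_prefs(
    [ 'user_pref("browser.bookmarks.file", "h:\\\\Netscape\\\\bookmark.htm");',
      'user_pref("app.update.auto", true);' ],
    [ 'user_pref("app.update.auto", false);',
      'user_pref("app.update.enabled", false);' ],
);
print "$_\n" for @merged;
```

Tracking seen keys in a hash, rather than a single `$found` flag, is what lets this handle several update lines at once.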
Re^2: complex string matching by freebsdboy (Novice) on Nov 04, 2010 at 12:42 UTC
Thanks - it seems it would be best to split the strings into key/value pairs. That makes it easier to deal with non-exact matches, such as keys that have a different value and would otherwise have been seen as a different key altogether. Yes, I realize the found/not-found issue; I was trying to get the foundational compare to work properly first. I had tried the m/$string/ type of comparison earlier but that didn't work, then tried the hash compare but didn't realize its limits. Good to see the escape regexp option too.
Re: complex string matching by raybies (Chaplain) on Nov 04, 2010 at 12:18 UTC
If you're convinced there are special characters (though I'm always a little skeptical when people tell me it's "special characters"), then you should definitely get them out of your strings, but first you need to identify them. Do you have a binary dump program? Something that will show you the chars? (On Linux I use od and look at a hexdump.) Once you know the octal values, use `s/\0(insert octal number here)//g;` to strip them from the string. It may be as simple as nonuniform whitespace or tabs. In that case, collapse all extra whitespace with `s/\s+/ /g;` so that you have single spaces. Even that might be problematic with leading and trailing spaces in various places. You might consider narrowing your search space by extracting only the user_pref lines from each file and storing as keys only the app.update.auto part (or whatever field it is), using a regex and split. Good luck.
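If no hexdump tool is at hand, the same inspection can be done from Perl itself. This is a minimal sketch, not the poster's code: `show_hidden` and `normalize` are hypothetical helper names, and the two sample strings (one with a trailing space and a Windows line ending, one with a tab) are invented to show the kind of difference that defeats an exact hash lookup.

```perl
use strict;
use warnings;

# Hypothetical helper: render whitespace and non-printable characters
# visibly, so "invisible" differences between two strings show up.
sub show_hidden {
    my ($s) = @_;
    $s =~ s/([^\x21-\x7e])/sprintf('\\x{%02x}', ord $1)/ge;
    return $s;
}

# Hypothetical helper: strip the line ending and collapse whitespace runs.
sub normalize {
    my ($s) = @_;
    $s =~ s/\r?\n\z//;     # handles both Unix and Windows line endings
    $s =~ s/\s+/ /g;       # nonuniform whitespace becomes single spaces
    $s =~ s/^ | $//g;      # drop a leading or trailing space left behind
    return $s;
}

# Invented sample data: same setting, different invisible characters.
my $from_update = qq{user_pref("app.update.auto", false); \r\n};
my $from_userjs = qq{user_pref("app.update.auto",\tfalse);\n};

print show_hidden($from_update), "\n";
print show_hidden($from_userjs), "\n";
print normalize($from_update) eq normalize($from_userjs)
    ? "match after normalizing\n"
    : "still different\n";
```

Note that a plain `chomp` removes only the trailing "\n" by default, so a file with Windows line endings read on a Unix system keeps a stray "\r" on every line.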
Re: complex string matching by thargas (Deacon) on Nov 04, 2010 at 13:07 UTC
The way I see it, you don't have a complex string-matching problem. The configuration file is effectively (key,value) pairs with syntactic sugar wrapped around them to make them into valid javascript. What you have is a hash-lookup problem. The way I would approach it is to read in your new configuration (key,value) pairs from wherever and put them in a hash. Then you read in your user config file, a line at a time, pulling out the key and value from each line. Update the hash following whatever rules you want for collisions. Finish by writing your hash as the new user config with its javascript wrapping. SMOP. I leave the coding as an exercise for the reader. :-)
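One possible shape for that exercise is sketched below. It is an illustration under assumptions, not a finished tool: `parse_pref` and `format_pref` are hypothetical names, the sample lines are held in arrays instead of being read from files, the collision rule chosen is "the update value wins", and any line that is not a user_pref call (comments, blank lines) is simply dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: pull (key, value) out of one user_pref line.
# Returns an empty list for lines that are not user_pref calls.
sub parse_pref {
    my ($line) = @_;
    return $line =~ /^\s*user_pref\(\s*"([^"]+)"\s*,\s*(.*?)\s*\)\s*;\s*$/
        ? ($1, $2)
        : ();
}

# Hypothetical helper: wrap a (key, value) pair back into JavaScript.
sub format_pref {
    my ($key, $value) = @_;
    return qq{user_pref("$key", $value);};
}

# Sample data standing in for user.js and update.txt.
my @user_lines = (
    'user_pref("browser.bookmarks.file", "h:\\\\Netscape\\\\bookmark.htm");',
    'user_pref("app.update.auto", true);',
);
my @update_lines = (
    'user_pref("app.update.auto", false);',
    'user_pref("app.update.enabled", false);',
);

# Existing settings go in first, updates second, so updates win collisions.
my %prefs;
for my $line (@user_lines, @update_lines) {
    my ($key, $value) = parse_pref($line) or next;
    $prefs{$key} = $value;
}

my @output = map { format_pref($_, $prefs{$_}) } sort keys %prefs;
print "$_\n" for @output;
```

Because only the key and value are compared, differences in spacing around the comma or at the end of the line no longer matter. The output is sorted by key, so the original line order is not preserved.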
Re^2: complex string matching by thargas (Deacon) on Nov 04, 2010 at 13:14 UTC
Oh yes. About making this work for other types of files: I wouldn't bother until I had a second type to compare. However, it would still be straightforward. Your munging class would have `parse()` and `write()` methods which would be overridden for each type. It'd be a bit more difficult if the files could contain more than one kind of line, but the solution I'd use would simply go from a (key,value) pair to a (type, key, value) triple and add another layer to the hash.
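A rough sketch of that class layout follows. Everything named here is an assumption made for illustration: the package names `ConfigMunger` and `ConfigMunger::FirefoxPrefs`, the shared `load` method, and the choice to treat the function name on each line (such as user_pref) as the "type" in the (type, key, value) triple.

```perl
use strict;
use warnings;

# Hypothetical base class: each file type overrides parse() and write().
package ConfigMunger;

sub new   { my ($class) = @_; return bless { data => {} }, $class; }
sub parse { die "parse() must be overridden\n" }
sub write { die "write() must be overridden\n" }

# Shared logic: run lines through parse() and store type => key => value.
sub load {
    my ($self, @lines) = @_;
    for my $line (@lines) {
        my ($type, $key, $value) = $self->parse($line) or next;
        $self->{data}{$type}{$key} = $value;   # later lines win
    }
    return $self;
}

# Hypothetical subclass for Firefox-style lines: name("key", value);
package ConfigMunger::FirefoxPrefs;
our @ISA = ('ConfigMunger');

sub parse {
    my ($self, $line) = @_;
    return $line =~ /^\s*(\w+)\(\s*"([^"]+)"\s*,\s*(.*?)\s*\)\s*;\s*$/
        ? ($1, $2, $3)
        : ();
}

sub write {
    my ($self) = @_;
    my @out;
    for my $type (sort keys %{ $self->{data} }) {
        for my $key (sort keys %{ $self->{data}{$type} }) {
            my $value = $self->{data}{$type}{$key};
            push @out, qq{$type("$key", $value);};
        }
    }
    return @out;
}

package main;

my $munger = ConfigMunger::FirefoxPrefs->new;
$munger->load(
    'user_pref("app.update.auto", true);',    # existing setting
    'user_pref("app.update.auto", false);',   # update, loaded later
    'user_pref("app.update.enabled", false);',
);
my @written = $munger->write;
print "$_\n" for @written;
```

A second file type would get its own subclass with different `parse` and `write` bodies, while `load` and the nested hash stay the same.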
Re: complex string matching by locked_user sundialsvc4 (Abbot) on Nov 04, 2010 at 12:40 UTC
And if you are looking for “a disk file,” don’t overlook SQLite. http://www.sqlite.org It is an extremely fast, public domain, flat-file database system. (The only “gotcha” might be that you probably want to do things within a transaction, because when you don’t, SQLite carefully ensures that everything has been written to disk. Exactly as it should, of course, but it slows down bulk-operations considerably.) So... where you might code your own logic, have a `tie`d hash and so on, you might be able to use a query ... the result being that it is just as fast as it was for the computer, but considerably faster for you.
Re^2: complex string matching by Your Mother (Archbishop) on Nov 04, 2010 at 13:49 UTC
"but it slows down bulk-operations considerably." As I understand it, this isn't quite right. It is slower for multiple transactions, but single bulk transactions are faster than Pg and MySQL, and since most operations are faster, SQLite comes off as the clear winner for speed: SQLite speed (see disclaimer on age of benchmarks and IANADBA).
Re^3: complex string matching by locked_user sundialsvc4 (Abbot) on Nov 04, 2010 at 16:06 UTC
Oh, no question at all about that. And, no question that SQLite is doing exactly what it should. But the difference can be quite dramatic (and of course, they take care to say as much). SQLite is naturally very fast because it is writing directly to a file. There is no IPC-protocol overhead, no matter how slight. But it is very cautious about ensuring that disk-writes really have happened, and this slows the processing down to a speed that is determined by the rotation-time of the disk platter. Whereas, if a transaction is in effect, it knows that it doesn’t have to do that. Client-server databases might handle write-commits in a different way (vis-a-vis what they do and don’t oblige the client to stop and wait for). I meant it just as a little heads-up... SQLite is one of those software projects that makes you sit up and bow down. A real, “holy smokes!!” piece of software. (Y’know, like Perl... and I mean that.)
Re: complex string matching by TomDLux (Vicar) on Nov 04, 2010 at 14:01 UTC
Top left of the Perlmonks page is the search box ... search for 'debugger'. Or go to CPAN and search for perldebug in the Perl core documentation, or type 'perldoc perldebug' into an xterm window, if you have access to a Unix system. Then you'll be able to see for yourself what funny characters are showing up in your strings. You call clean_extension() with an argument, but you never use it, never even unload it from @_. As Occam said: Entia non sunt multiplicanda praeter necessitatem.