RegEx Problem?

OverlordQ has asked for the wisdom of the Perl Monks concerning the following question:

Re: RegEx Problem? by BrowserUk (Patriarch) on Feb 25, 2003 at 06:36 UTC
Unless your definition of "regex" fits the LIKE syntax of whatever version of SQL you're using, you're going to have to read every line from the DB into memory in order to use Perl's regex engine. It is almost inevitable that reading from a DB will be less economical than reading a flat file. Another problem is that, as far as I am aware, no DB will find matches that are split across record boundaries. I personally think that unless your data has some structure to it that would allow you to make use of the DB facilities beyond storage and retrieval, it doesn't make much sense to put it into a DB, but YMMV.
..and remember there are a lot of things monks are supposed to be but lazy is not one of them. Examine what is said, not who speaks. 1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong. 2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible. 3) Any sufficiently advanced technology is indistinguishable from magic. Arthur C. Clarke.
Re: RegEx Problem? by Abigail-II (Bishop) on Feb 25, 2003 at 09:37 UTC
It depends on the reasons you want to put it in a database. If it's just for faster queries, I doubt that MySQL is able to generate indices that speed up the searches you want to do. That means both MySQL and Perl have to scan all records. In that case, it's just a matter of which one reads from disk faster and has the faster regex engine. There could of course be other reasons why keeping your data in a database has advantages over a flat file. But for many of those reasons, MySQL wouldn't give advantages over a flat file. Other databases would, though. Abigail
Re: RegEx Problem? by Popcorn Dave (Abbot) on Feb 25, 2003 at 06:21 UTC
I think it's going to depend on the size of your file, but if the file isn't too large, slurping it in and doing a foreach isn't going to be the worst approach. Since all your data is separated by newlines, why would you bother stuffing all that data into a database? You could do something like: `use strict; use warnings; open my $fh, '<', $ARGV[0] or die "Couldn't open input file: $!"; while (my $line = <$fh>) { print $line if $line =~ /Ilikefoo/; } close $fh;` If you're on a *nix system, and all you want to do is print those lines, wouldn't it be easier to just use the grep utility? There is no emoticon for what I'm feeling now.
Re: Re: RegEx Problem? by bart (Canon) on Feb 25, 2003 at 08:03 UTC
I think the regex should actually look like `/I.*like.*foo/`, which can be constructed out of the search string using `$re = join '.*', map quotemeta, split ' ', $search;`
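A minimal sketch of the construction above, for anyone who wants to try it. The search phrase and the sample lines are made up for illustration; the thread does not show the real input.

```perl
use strict;
use warnings;

# Hypothetical search phrase (an assumption, not taken from the question).
my $search = 'I like foo';

# Split on whitespace, escape regex metacharacters in each word,
# then allow any run of characters between consecutive words.
my $re = join '.*', map quotemeta, split ' ', $search;

# Made-up sample data: only lines with the words in order should match.
my @lines = (
    "I really like foo bars\n",
    "foo like I\n",
    "Ilikefoo\n",
);

my @hits = grep { /$re/ } @lines;
print for @hits;
```

The quotemeta step matters once the search words contain characters such as `.` or `+`, which would otherwise be read as regex syntax.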
Re: Re: Re: RegEx Problem? by Popcorn Dave (Abbot) on Feb 25, 2003 at 16:31 UTC
You're right, it probably should. However, I wasn't sure if the person who posted was looking for that exact string or using it as a pseudo string. There is no emoticon for what I'm feeling now.
Re: Re: RegEx Problem? by OverlordQ (Hermit) on Feb 25, 2003 at 06:41 UTC
Since this is just a piece of a larger program, I'd rather keep it all in Perl if possible. As for file sizes, a similar file of 6328 lines is about 564k, so figuring ~7.3 million lines, that's about 666075395 bytes, or about 635 MB altogether (correct my math if I'm wrong).
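The estimate can be checked with a few lines of Perl. This is only a sketch: it assumes "564k" means 564 * 1024 bytes and rounds the line count to exactly 7.3 million, so the byte total comes out slightly different from the figure quoted above while landing on the same ~635 MB.

```perl
use strict;
use warnings;

# Figures from the post; reading "564k" as 564 * 1024 bytes is an assumption.
my $sample_lines = 6328;
my $sample_bytes = 564 * 1024;
my $total_lines  = 7_300_000;    # "~7.3 million", rounded

# Average line length in the sample, scaled up to the full data set.
my $bytes_per_line = $sample_bytes / $sample_lines;
my $total_bytes    = $bytes_per_line * $total_lines;
my $total_mb       = $total_bytes / (1024 * 1024);

printf "%.1f bytes/line, about %.0f MB total\n", $bytes_per_line, $total_mb;
```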
Re: Re: Re: RegEx Problem? by hardburn (Abbot) on Feb 25, 2003 at 14:53 UTC
As a general rule of thumb, a database that is more than 1 MB in size should be moved out of a flat file and into a real database. Some people suggest an even lower threshold. And in any discussion about flat files vs. real databases, you really ought to mention DBD::SQLite. ---- Reinvent a rounder wheel. Note: All code is untested, unless otherwise stated
DBD::SQLite by waxmop (Beadle) on Feb 25, 2003 at 21:10 UTC
•Re: DBD::SQLite by merlyn (Sage) on Feb 25, 2003 at 22:01 UTC
Re: RegEx Problem? by CountZero (Bishop) on Feb 25, 2003 at 07:37 UTC
Watch out for that regex: it is probably not going to do what you want it to do in this form. For starters, the '*' quantifier at the beginning of the regex is rather odd: you should put something in front of it, because as written it says 'zero or more of nothing'. And 'I*' means 'zero or more of I'. Also watch out for the greediness of the '*' quantifier. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Re: RegEx Problem? by OverlordQ (Hermit) on Feb 25, 2003 at 19:20 UTC
I kinda misled people about what I was looking for with the asterisks; basically I was using the 'Windows' version of the asterisk, if that makes sense.
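For a Windows-style pattern, where the asterisk stands for "any run of characters", one possible translation into a Perl regex is sketched below. The helper name `wildcard_to_regex` and the test strings are hypothetical, and the sketch handles only `*`, not `?` or other wildcard syntax.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a pattern in which '*' means "any run of
# characters" into an anchored Perl regex. Everything else is literal.
sub wildcard_to_regex {
    my ($pattern) = @_;
    # The -1 limit keeps empty leading/trailing fields, so a '*' at either
    # end of the pattern still becomes '.*'.
    my $body = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/\A$body\z/;
}

my $re = wildcard_to_regex('*I*like*foo*');

# Made-up test strings (no trailing newline, since '.' does not match one).
print "match\n"    if "Well, I really like foo a lot" =~ $re;
print "no match\n" if "foo like I" !~ $re;
```

When reading lines from a file, chomp each line before matching, or drop the anchors if a substring match is all that is needed.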
Re: RegEx Problem? by Tomte (Priest) on Feb 25, 2003 at 06:32 UTC
In almost every case it's more economical (read: faster in the long run) to use a database, if you can use the DB's functions and other features to select what you want. However, it depends on the amount of data you expect and how often you intend to run this script. If there are more than a few lines of data and the script will be in heavy use, then use a database and use the string functions or regexp capabilities of your DB of choice in dynamically built prepared statements. OTOH, 'economical' depends, among other factors, on how often the script will be run and how long it takes you to adapt it to use a DB. Update: I assumed you had data that is structured in some way; if you're dealing with unstructured data in reasonable amounts, go with FoxtrotUniform's answer. If you really have a lot of data, MySQL with a two-column table (an auto-increment primary key column and a text column with your data) and a fulltext index will be faster, though. regards, tomte