edge99off has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have started a program that compares two CSV files and creates a new file with a .csv extension. First it checks which file is the oldest and which is the newest by date stamp. Then it parses the data columns into an array. The problem is the comparison: I'm not sure if I should use REGEX or pattern matching to go through 20,000 lines of data in each CSV file in less than 2 minutes without causing any damage to the files. I know how to open the files and maybe assign columns into an array, then create a query, a foreach loop, or maybe a while loop with if/then/else. Basically this file gets updated twice a month, so within the same month I need to check whether a part number's status, quantity, or price has changed, or maybe its description, which I find very unlikely. I do need a quick way to accomplish this work without making this program any longer and hard to trace. Any suggestions?

Replies are listed 'Best First'.
Re: Compare CSVs FILES using REGEX or pattern matching
by GrandFather (Saint) on Dec 02, 2015 at 05:19 UTC
    Any suggestions

    Use a real database. Track changes to parts and stock levels in an audit table and generate reports by dumping the current period's changes from the audit table.

    ... without making this program any longer and hard to trace.

    We can't see "this program", or even the bits relevant to the question. As described you should only need half a dozen lines of code to compare the two files. Maybe there is important stuff you forgot to describe that makes the job more interesting?

    Premature optimization is the root of all job security
Re: Compare CSVs FILES using REGEX or pattern matching
by kcott (Archbishop) on Dec 02, 2015 at 05:46 UTC

    G'day edge99off,

    Welcome to the Monastery.

    Unfortunately, a prosaic description, such as you have here, doesn't provide us with sufficient information to offer much help.

    What we really need to see is: a minimal piece of code that shows what you're currently doing; a small, yet representative, sample of your data; what your current results are and how they differ from what you want; and, possibly, any messages (e.g. warnings or errors) that you're currently receiving. You'll find details of this in "How do I post a question effectively?".

    On the basis of what you've posted, here's some potential help.

    Use the Text::CSV module to parse your CSV files.

    "REGEX or pattern matching" doesn't make a lot of sense (at least, not in the context you've used here). You use a regex for pattern matching; not as an alternative. Maybe, when you post some code, your meaning will become clearer. Perhaps take a look at "perlintro: Regular expressions" and follow the links therein for more information.

    Reading a file will not damage it. Parsing its data will not damage the file. The way you open your files could be an issue. Again, this is another example where seeing your code will clear this up for us.
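
To illustrate the point about open modes, here is a minimal sketch using only core Perl. The helper name read_lines is hypothetical and not taken from the original poster's program. The three-argument open with '<' opens a file strictly for reading, so it cannot create, truncate, or modify the file; modes such as '>' or '+>' are the ones that truncate.

```perl
use strict;
use warnings;

# Hypothetical helper: return all lines of a file, opened read-only.
# The '<' mode never creates, truncates, or writes to the file.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Cannot open '$path' for reading: $!";
    chomp(my @lines = <$fh>);    # slurp every line, stripping newlines
    close $fh;
    return @lines;
}
```

Calling read_lines('NEW.csv') would return the file's lines as a list; the file name is only an example.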

    — Ken

Re: Compare CSVs FILES using REGEX or pattern matching
by Tux (Canon) on Dec 02, 2015 at 07:45 UTC

    As the others already stated, do not parse CSV with regexes. Ever. If your data is extremely simple (and guaranteed to stay simple), a split might be the way to go, but simple CSV usually doesn't stay simple, and embedded newlines, separator characters, or quotation will lead to unmaintainable code. Stick with Text::CSV_XS (or Text::CSV if you don't care about the speed) for parsing CSV.

    The Text::CSV_XS distribution has an examples section that hosts the csvdiff script, which might be exactly what you are looking for. Feel free to get that and amend it to your needs.
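
For readers who want to see the general shape of such a comparison, the following is a rough sketch and not the csvdiff script itself. It assumes each file starts with a header row and has a unique key column; the column names used in the usage note (part_number, status, quantity, price) and the sub names load_parts and diff_parts are hypothetical. Only Text::CSV_XS calls documented in the module (new, getline, column_names, getline_hr) are used.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Load a CSV from an open filehandle into a hash of row hashes,
# keyed by the value in $key_col. Assumes the first row is a header.
sub load_parts {
    my ($fh, $key_col) = @_;
    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    my $header = $csv->getline($fh) or die "CSV has no header row";
    $csv->column_names(@$header);
    my %parts;
    while (my $row = $csv->getline_hr($fh)) {
        $parts{ $row->{$key_col} } = $row;
    }
    return \%parts;
}

# Compare two loaded files and return a list of report lines:
# changed columns and added keys first, then removed keys.
sub diff_parts {
    my ($old, $new, @cols) = @_;
    my @report;
    for my $pn (sort keys %$new) {
        if (!exists $old->{$pn}) {
            push @report, "$pn: added";
            next;
        }
        for my $col (@cols) {
            my $was = $old->{$pn}{$col} // '';
            my $now = $new->{$pn}{$col} // '';
            push @report, "$pn: $col changed from '$was' to '$now'"
                if $was ne $now;
        }
    }
    push @report, map { "$_: removed" }
                  grep { !exists $new->{$_} } sort keys %$old;
    return @report;
}
```

With two read-only filehandles $old_fh and $new_fh, a call such as diff_parts(load_parts($old_fh, 'part_number'), load_parts($new_fh, 'part_number'), qw(status quantity price)) would return one line per difference. Holding 20,000 rows per file in a hash should be well within reach of ordinary hardware.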


    Enjoy, Have FUN! H.Merijn
Re: Compare CSVs FILES using REGEX or pattern matching
by Laurent_R (Canon) on Dec 02, 2015 at 10:51 UTC
    As previously said by other monks, you don't provide enough information, but depending on the approximate ratio of lines that get updated between two runs, you might just start by comparing the full lines and decide to split the lines and compare the individual columns only for those lines which are different.
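
The whole-line-first idea can be sketched roughly as follows, using core Perl only. This assumes the first field is a unique part number and that the data is simple enough for a plain split on commas (no quoted fields or embedded separators); if that does not hold, a CSV module should do the splitting instead. The sub names index_lines and changed_fields are hypothetical.

```perl
use strict;
use warnings;

# Index each line by its first field (assumed to be a unique part number).
sub index_lines {
    my ($fh) = @_;
    my %line_for;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;          # skip blank lines
        my ($key) = split /,/, $line, 2;   # first field only
        $line_for{$key} = $line;
    }
    return \%line_for;
}

# For keys present in both files, return the indexes of the fields that
# differ. Lines are split only when the whole-line comparison fails.
sub changed_fields {
    my ($old, $new) = @_;
    my %changes;
    for my $key (keys %$new) {
        next unless exists $old->{$key};
        next if $old->{$key} eq $new->{$key};   # cheap check, no split
        my @was = split /,/, $old->{$key}, -1;  # -1 keeps trailing empties
        my @now = split /,/, $new->{$key}, -1;
        my $last = $#was > $#now ? $#was : $#now;
        $changes{$key} = [
            grep { ($was[$_] // '') ne ($now[$_] // '') } 0 .. $last
        ];
    }
    return \%changes;
}
```

When most lines are unchanged between two runs, nearly every key is settled by a single string comparison, and the splitting cost is paid only for the lines that actually differ.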

    Otherwise, we don't have enough details about your procedure and your data, but, in general, comparing 20,000 lines in less than 2 minutes seems to be a very realistic aim (if coded reasonably efficiently). I quite often compare 30 million lines in 10 to 15 minutes, or even less if the comparison to be performed is simple or the lines are relatively short, on a platform which is far from being a racehorse.

    Finally, as already pointed out, if you open your files in read mode, there is no danger of altering them. But show your code so that we can confirm this, as well as my previous (quite general) comments.

Re: Compare CSVs FILES using REGEX or pattern matching
by FreeBeerReekingMonk (Deacon) on Dec 02, 2015 at 20:49 UTC

    If you only need to know what changed, why not first use comm -23 NEW.csv OLD.csv (comm expects both files to be sorted)?
    This way you know which identifiers to check, and can skip the ones that are identical.