perlperlperl has asked for the wisdom of the Perl Monks concerning the following question:

I am not a Perl expert, but I have a question about the join operator. Let's say I want to delete all multiline comments in a C/C++ source file, and suppose my record separator is the default newline. This is how I would do it: read all the lines from the file into a list, so that the list's elements are the lines of the source, each ending in \n. I then use join to combine all the lines into one big string, and use a regex substitution to replace every /*.....*/ with an empty string. My main concern is whether it is sensible to use join this way with respect to memory usage. Is join inherently inefficient? I have not seen many C source files larger than 10 MB, but I might have to do something similar for larger XML files. Are there any better ideas for removing multi-line comments from a C source file?

Re: joining lines efficiency?
by CountZero (Bishop) on Jun 10, 2013 at 06:17 UTC
    Matching C-style comments is very tricky; see the perlfaq6 entry "How do I use a regular expression to strip C-style comments from a file?"

    Or go easy on yourself and use File::Comments (but it says about C: "Implemented with regular expressions, only works for easy cases until real parsers are employed").

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: joining lines efficiency? (slurp)
by Anonymous Monk on Jun 10, 2013 at 01:47 UTC
    Well, it's more efficient to slurp the whole file into one string than to split the file into lines and then join the lines again. One way is File::Slurp; the other way is

    my $wholefile = do { local $/; scalar readline $filehandle };
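
A minimal sketch of slurp-then-substitute, for illustration only. The sample text and the in-memory filehandle are assumptions standing in for a real source file, and the simple non-greedy pattern knows nothing about string literals or // comments.

```perl
use strict;
use warnings;

# Hypothetical sample input; an in-memory filehandle stands in for a real file.
my $source = <<'END_C';
int a = 1; /* one-line comment */
/* a comment
   spanning lines */
int b = 2;
END_C

open my $fh, '<', \$source or die "Cannot open in-memory handle: $!";

# With $/ undefined, readline returns the rest of the file as one string,
# so no list of lines is built and no join is needed.
my $wholefile = do { local $/; scalar readline $fh };
close $fh;

# Non-greedy match; the /s modifier lets "." cross newlines.
(my $stripped = $wholefile) =~ s{/\*.*?\*/}{}gs;

print $stripped;
```

The string is copied once here so that the original text stays available; running the substitution directly on $wholefile would avoid that extra copy.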
Re: joining lines efficiency?
by hbm (Hermit) on Jun 10, 2013 at 02:32 UTC

    Another option is the flip-flop operator: Read the file line-by-line, and don't print if between an opening expression and closing expression.

    The general idea is print unless /<expr1>/ .. /<expr2>/, but here the slash and star need to be escaped.

    while(<$fh>){ print unless m{/[*]} .. m{[*]/}; }
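
A runnable sketch of the same flip-flop idea, assuming made-up sample input read through an in-memory filehandle and collecting the surviving lines in a hypothetical @kept array instead of printing them directly:

```perl
use strict;
use warnings;

# Hypothetical sample input; comments sit on lines of their own.
my $source = <<'END_C';
int a = 1;
/* a comment
   spanning lines */
int b = 2;
/* one-line comment */
int c = 3;
END_C

open my $fh, '<', \$source or die "Cannot open in-memory handle: $!";

my @kept;
while (my $line = <$fh>) {
    # The flip-flop turns true on a line containing "/*" and stays true
    # through the first line containing "*/" (which may be the same line).
    next if $line =~ m{/[*]} .. $line =~ m{[*]/};
    push @kept, $line;
}
close $fh;

print @kept;
```

Only one line is held in memory at a time, but every line that touches a comment is dropped whole.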

      While your proposal needs less memory, it can go wrong as well. If a comment starts or ends on a line that also contains code outside the comment, the whole line is removed, code included. The regex also needs to be more sophisticated so that it ignores "/*" when the characters appear inside a quoted string.
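
One common way to cover both points, sketched here under assumptions: work on the whole text as one string, and let the pattern match string and character literals first so they are put back untouched, while block comments are replaced with nothing. The sample input is made up, and the pattern is a simplified illustration that does not handle // comments, line continuations, or other corner cases of C.

```perl
use strict;
use warnings;

# Hypothetical sample input with a comment sharing a line with code
# and a "/*" inside a string literal.
my $source = <<'END_C';
int x = 4; /* trailing comment */
char *s = "not /* a comment */";
/* starts here
   ends here */ int y = 2;
END_C

(my $stripped = $source) =~ s{
    ( "(?: \\. | [^"\\] )*"      # double-quoted string literal: keep
    | '(?: \\. | [^'\\] )*'      # character literal: keep
    )
  | /\* .*? \*/                  # block comment: drop
}{ defined $1 ? $1 : '' }gsex;

print $stripped;
```

Because the regex engine scans left to right, whichever construct starts first wins: a quote inside a comment is swallowed by the comment, and a "/*" inside a string is swallowed by the string.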

        The regex also needs to ... ignore "/*" ie should the characters be quoted.

        It should also handle stuff like this, which compiles and runs just fine:

        #include <stdio.h>
        #include <assert.h>

        void main (int argc, char ** argv) {
            int x = 4;
            int y = 2;
            int * p = &y;
            assert(x/*p == 12345 /* p points to y */);
            printf("everything looks just fine \n");
        }
Re: joining lines efficiency?
by JockoHelios (Scribe) on Jun 10, 2013 at 02:34 UTC
    I've been working with large text data files recently. I do most of my test-bed Perl work on a 10-year-old WinXP PC with 1 GB RAM.
    The largest single file I've processed in one gulp was over 85 MB; the old XP handled it. My scripts pull it all in with code like

    @TextArray = <TEXTFILE>;

    The RegEx substitution you mention should work fine if it can handle a single string that long. I've never tried it that way, so I can't vouch for it.

    From what I've been doing, I'd suggest reading in the whole file, as you mentioned, and as indicated above. Then process each line in a foreach loop, copying lines into a separate array if they aren't the multi-line comments you don't want.

    You would use a variable, perhaps $IsCommentLine, to start and stop the omission of lines. Set the variable to true when the "/*" is found, and set it to false when the "*/" is found. When the variable is false, copy the line into the separate array. When the variable is true, don't copy the line. Everything between the delimiters gets omitted, because the variable stays true until the "*/" is found. This does what your RegEx idea would do, but line by line instead.
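
A minimal sketch of that flag-based loop, under assumptions: the array contents are made-up sample lines, the names @text_array, @kept and $is_comment_line are hypothetical, and comments are assumed to sit on lines of their own (a line that mixes code and a comment is dropped whole).

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines read from the file.
my @text_array = (
    "int a = 1;\n",
    "/* a comment\n",
    "   spanning lines */\n",
    "int b = 2;\n",
);

my @kept;
my $is_comment_line = 0;

foreach my $line (@text_array) {
    # Start omitting when the opening delimiter appears on this line.
    $is_comment_line = 1 if index($line, '/*') >= 0;

    # Copy the line only while we are outside a comment.
    push @kept, $line unless $is_comment_line;

    # Stop omitting after the line that holds the closing delimiter.
    $is_comment_line = 0 if index($line, '*/') >= 0;
}

print @kept;
```

Checking for the closing delimiter after the copy step means the line containing "*/" is still omitted, and a one-line /* ... */ comment sets and clears the flag within the same iteration.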
    Dyslexics Untie !!!