srikrishnan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I want to read very big (more than approximately 200 MB) PostScript files using Perl. The files are this large because TIFF images are embedded in them. Because of this, Perl keeps reading the files for hours, and in the end I have to forcibly kill the script every time.

Is there any way to overcome this problem?

Please help us solve this problem; it would be a great help.

Thanks

Srikrishnan R.

Re: how to read big postscript files
by Corion (Patriarch) on May 25, 2010 at 07:00 UTC

    Perl has no problem reading big files, so it must be something that your program is doing with the input that makes the program slow. As you don't show any code and input data, it is quite hard for us to help you in a more concrete fashion. Consider testing whether the files open in other programs quickly and correctly. The embedded TIFF images are likely directly dumped as compressed binary data and not as ASCII data, so reading the file line-by-line will likely not work.
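One way to check the claim that reading alone is not the bottleneck is to time a plain binary read of the file. The following is a minimal sketch, not part of the original thread; the helper name `count_bytes` and the 1 MB block size are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read $path in 1 MB blocks and return the number
# of bytes read. Nothing is done with the data, so the run time reflects
# the cost of reading only.
sub count_bytes {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);    # the embedded TIFF data may be binary; disable newline translation
    my ($total, $buffer) = (0, '');
    # read() returns the number of bytes read, 0 at end of file, undef on error.
    while (my $got = read($fh, $buffer, 1024 * 1024)) {
        $total += $got;
    }
    close($fh);
    return $total;
}
```

Calling it as `print count_bytes($ARGV[0]), " bytes in ", time() - $^T, " s\n";` should show how long the read itself takes. If that finishes quickly on the 200 MB file, the time is being spent in the processing steps.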

      Hi Corion

      Thanks for your response

      Below I have pasted my code

      use strict;
      use warnings;
      use Cwd;

      my $filename;
      my $filepath;
      if($ARGV[0]=~m/((.*)[\\\/])?(.*?)\.ps$/i) {
          $filename=$3;
          if(defined($1)) {
              $filepath=$1;
          }
          else {
              $filepath=cwd();
              $filepath=~s!/!\\!gi;
              $filepath.="\\";
          }
      }
      else {
          Win32::MsgBox("Incorrect argument, Please check", 0, "");
          exit;
      }

      open(F1, "$ARGV[0]") or Win32::MsgBox("Input File cannot be opened", 16, "Error Message");
      undef $/;
      my $line = <F1>;
      close F1;

      my @imgrem;
      my $imgno = 0;
      while($line =~ s/\n\%\%BeginObject\: image(.*?)\n\%\%EndObject/<img$imgno>/msi) {
          my $tmp = $&;
          push(@imgrem, $tmp);
          $imgno++;
      }

      $line =~ s/\(\\266\)D r\n/\(\)D r\n/msgi;

      while($line =~ m/\[\/Action \<\< \/Subtype \/URI \/URI \((.+?)\) \>\> \/Rect \[(\d+) (\d+) (\d+) (\d+)\] \/Border \[0 0 0\] \/LNK pdfmark\n/gi) {
          my $temp = "$&";
          my $contents = $1;
          my $originalcontents = $contents;
          my $x1 = $2;
          my $y1 = $3;
          my $x2 = $4;
          my $y2 = $5;
          $y1 = $y1 - 100;
          if($contents !~ /^(http|www|mailto)/i) {
              $contents =~ s/&ndash;/\-/gi;
              $contents =~ s/&equals;/\=/gi;
              $contents =~ s/&percnt;/\\%/gi;
              $contents =~ s/&ast;/\*/gi;
              $contents =~ s/&(l|r)squo;/\'/gi;
              $line =~ s/\[\/Action \<\< \/Subtype \/URI \/URI \((.+?)\) \>\> \/Rect \[(\d+) (\d+) (\d+) (\d+)\] \/Border \[0 0 0\] \/LNK pdfmark\n/\[\/Action \<\< \/Subtype \/Caret \/Contents \($contents\) \/Rect \[$x1 $y1 $x2 $y2\] \/Title \(Original Text\) \/Subj \(Inserted Text\) \/Border \[0 0 0\] \/Color \[0 0 1\] \/ANN pdfmark\n/i;
          }
      }

      #while($line =~ s/<img([0-9]+)>/$imgrem[$1]/si){};

      open(F2, ">$filepath$filename-out.ps");
      print F2 $line;
      close F2;

      print "\n\nEnd time ", time() - $^T;

      The above code runs successfully on files up to a certain size; for example, it works on a 100 MB file.

      Thanks

      srikrishnan R.

        So you don't have a problem with reading large Postscript files, you have a problem with processing them.

        Maybe it would be less hard on your machine if you didn't process the whole file in one go. For example, you could write the images to disk instead of keeping them around in memory. Also, you can do the replacements on the parts of the file instead of doing the replacements on the file at once.
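The idea of working on parts of the file could look roughly like the sketch below. It is an illustration, not the poster's code. It assumes that every embedded image sits between a `%%BeginObject: image` line and a `%%EndObject` line, as the posted regex suggests, and that the text to be edited never spans such a block. The helper name `process_in_records` and its `$fix_text` callback are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path one record at a time,
# applying $fix_text (a code ref) only to the text outside image objects.
sub process_in_records {
    my ($in_path, $out_path, $fix_text) = @_;

    open(my $in,  '<', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot open $out_path: $!";
    binmode($in);     # embedded TIFF data may be binary
    binmode($out);

    # Each read now returns everything up to and including the next
    # "%%EndObject", so at most one image is held in memory at a time.
    local $/ = "\n%%EndObject";

    while (defined(my $record = <$in>)) {
        # Split the record into the text before the image and the image itself.
        my $pos = index($record, "\n%%BeginObject: image");
        my ($text, $image) = $pos >= 0
            ? (substr($record, 0, $pos), substr($record, $pos))
            : ($record, '');
        # Images are written straight back out, untouched.
        print {$out} $fix_text->($text), $image;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}
```

With this shape, the link replacements run on small pieces of text, and the images never need to be stored in an array or put back afterwards.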

        Also, I'm quite unclear what the replacement loop is supposed to be doing, but maybe you can rewrite that code using /ge from perlre. It seems to make heavy use of $&, which tends to be slow.
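A single-pass rewrite with `/ge` might look like the sketch below. It assumes the loop is meant to turn every link whose target does not start with http, www, or mailto into a Caret annotation. The pattern and the replacement text are copied from the posted script; the names `$link_re`, `links_to_annotations`, and `convert_link` are hypothetical. Capture groups are used in place of `$&`.

```perl
use strict;
use warnings;

# Pattern for one URI link pdfmark, taken from the posted script.
my $link_re = qr{\[/Action << /Subtype /URI /URI \((.+?)\) >> /Rect \[(\d+) (\d+) (\d+) (\d+)\] /Border \[0 0 0\] /LNK pdfmark\n}i;

# Hypothetical helper: rewrite every matching link in $text in one pass.
# $1 is the whole link, $2 its contents, $3..$6 the rectangle coordinates.
sub links_to_annotations {
    my ($text) = @_;
    $text =~ s{($link_re)}{ convert_link($1, $2, $3, $4, $5, $6) }ge;
    return $text;
}

# Hypothetical helper: build the replacement for a single link.
sub convert_link {
    my ($whole, $contents, $x1, $y1, $x2, $y2) = @_;

    # Real web and mail links are left exactly as they were.
    return $whole if $contents =~ /^(?:http|www|mailto)/i;

    $y1 -= 100;
    $contents =~ s/&ndash;/-/gi;
    $contents =~ s/&equals;/=/gi;
    $contents =~ s/&percnt;/\\%/gi;
    $contents =~ s/&ast;/*/gi;
    $contents =~ s/&[lr]squo;/'/gi;

    # Replacement text as it appears in the posted script.
    return "[/Action << /Subtype /Caret /Contents ($contents) /Rect [$x1 $y1 $x2 $y2] "
         . "/Title (Original Text) /Subj (Inserted Text) /Border [0 0 0] /Color [0 0 1] /ANN pdfmark\n";
}
```

Because the substitution visits each link once and moves on, the regex engine does not rescan the whole string from the start for every link, which is what the nested `s///` inside the `while` loop ends up doing.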

Re: how to read big postscript files
by dineed (Scribe) on May 25, 2010 at 07:53 UTC

    I don't think I've worked with a postscript file - apologies if the below doesn't make sense.

    Is there some marker that indicates the beginning and/or end of the TIFF image? If so, then perhaps you can test for the marker(s) and skip the record containing the TIFF image. It sounds to me like you need to either strip out the TIFF images prior to entering your read loop (using a regex) or account for the binary data within your read loop. You might even be able to respond to the binary data itself - I did something similar a year or two ago with hex data on a *nix system.
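Testing for markers inside the read loop could be sketched as follows. The marker strings come from the regex in the posted script; the helper name `strip_images` is hypothetical, and the sketch assumes each marker starts at the beginning of a line. Note that it drops the images from the output, so it only suits a step that does not need them.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from filehandle $in to filehandle $out,
# skipping everything from a "%%BeginObject: image" line through the
# next "%%EndObject" line.
sub strip_images {
    my ($in, $out) = @_;
    my $in_image = 0;
    while (defined(my $line = <$in>)) {
        if ($in_image) {
            # Still inside an image: drop the line, and leave image mode
            # once the end marker has been seen.
            $in_image = 0 if $line =~ /^%%EndObject/;
            next;
        }
        if ($line =~ /^%%BeginObject: image/i) {
            $in_image = 1;
            next;
        }
        print {$out} $line;
    }
    return;
}
```

Both filehandles should be put into `binmode` before calling it, since the skipped image data may be binary.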

      Hi Corion/Dineed,

      Thanks for your immediate responses. Actually, our requirement is to change the URI links to annotations throughout the PostScript file, because the software used to create the PostScript does not support annotated PostScript files natively. So we are trying to achieve that by writing a Perl script to read and modify the original PostScript file.

      Thanks

      Srikrishnan R.